Friday, April 30, 2010

RESTful page caching using request parametric signature.

This article discusses how a web application could avoid regenerating a dynamic web page in the situation where a page is requested more than once or by more than one user.

I implemented this strategy in 2004 while employed with a semiconductor company in Massachusetts - i.e., it's tried and tested (using plain SQL rather than JPA and JSP rather than GWT).


The problem
There are two issues causing web page regeneration.
  1. A user needing to go back to previous pages for some information and then go forward again.
  2. Two or more independent users requesting the same dynamically generated web page.

These issues are especially aggravating for pages that take a long time to generate, e.g., a car-hotel-airline booking site scanning its repositories for the best deal, or an enterprise report generated by heavy mathematical calculations on data extracted from a complex combination of data sources.

There are three strategies that can help to solve this issue.
  1. Cache the web page.

  2. Cache the data extract used for generating the web page.
    A web page may need to include attributes individualised for each user and request even though the same information is being displayed, for example the respective user name and attributes.

    This strategy is worthwhile when extracting the data for the web page is very resource-consuming and there are too many individualised user attributes on the page to cache the page whole.


  3. Cache the components of the web page and regenerate the web page from those components.
    This is a compromise between the first two strategies. All components of the page are generated and cached except for the individualised page attributes.

Identifying web requests
In any of these solutions, there is one common and significant question to be answered first - how to identify the similarity of web requests. The simple answer is, of course, from the request parameters.

However, the answer is not that simple. There are three further issues to consider before using the request parameters as the page identity.
  1. Some parameters do not contribute to the identity of a web page. Such additional parameters make web requests look different when they are actually requesting the same information.

  2. A cached page becomes stale when fresh data arrives that affects the information being requested. Therefore, two requests with the same set of parameters may not be asking for the same web page after the data repository has been updated.

  3. The identity of a web page may depend on a large set of parameters. This is especially true for mathematically generated reports.

The solution is to use a parametric signature to identify the web page.


Page signature from request LCD
First, the application architect has to design the LCD (lowest common denominator) of the request parameters of a web page, i.e., the architect should design web pages to be identifiable by the smallest possible number of request parameters.

Second, every HTTP request to the web application must first go through a signature generator. The signature is the compressed value of the LCD parameters. The signature is compared against a cache signature table in a database. If the signature does not exist, it is stored in the table and a fresh web page is generated from the request parameters. If the signature exists, the cached page is served.
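
As an illustration of how such a signature generator could work, here is a minimal sketch. The article does not say how the LCD parameters are compressed, so the SHA-256 hash, the class name ParamSignatureSketch and the lcdNames argument are all assumptions. The input map has the same shape as the one returned by the servlet API's getParameterMap().

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: the article names ParamSigGenerator but not its internals.
final class ParamSignatureSketch {

  // Builds a signature from the LCD parameters only; all other parameters are ignored.
  static String generate(Map<String, String[]> requestParams, Set<String> lcdNames) {
    // TreeMap sorts by name, so parameter order cannot change the signature.
    TreeMap<String, String[]> lcd = new TreeMap<>();
    for (Map.Entry<String, String[]> e : requestParams.entrySet()) {
      if (lcdNames.contains(e.getKey())) {
        String[] values = e.getValue().clone();
        Arrays.sort(values); // assumption: the order of repeated values is irrelevant
        lcd.put(e.getKey(), values);
      }
    }

    // Canonical text form; length prefixes keep names and values unambiguous.
    StringBuilder canonical = new StringBuilder();
    for (Map.Entry<String, String[]> e : lcd.entrySet()) {
      append(canonical, e.getKey());
      for (String value : e.getValue()) {
        append(canonical, value);
      }
      canonical.append(';');
    }

    try {
      // Assumption: SHA-256 serves as the "compression" step.
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException ex) {
      throw new IllegalStateException("SHA-256 should be available on every Java platform", ex);
    }
  }

  private static void append(StringBuilder sb, String s) {
    sb.append(s.length()).append(':').append(s);
  }
}
```

A fixed-length hash keeps the signature column small even when a page is identified by a large set of parameters, which also makes the signature usable as a cache folder name.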

What if two or more requests for the same uncached page arrive at the same time? There would be a race to create the signature record in the table. Therefore, a unique index has to be defined on the signature field, so that only one of the competing inserts can succeed.
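
The effect of the unique key can be illustrated without a database. The sketch below is only an in-memory stand-in, assuming that ConcurrentHashMap.putIfAbsent plays the role of the unique-key insert; the class and method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the signature table. A real deployment would rely on
// the database's unique key instead (assumption made for illustration only).
final class SignatureTableSketch {
  private final ConcurrentMap<String, String> reservationBySignature =
      new ConcurrentHashMap<>();

  // Returns true only for the single caller that created the record.
  boolean tryInsert(String signature, String reservation) {
    // putIfAbsent is atomic: like a unique key, it lets exactly one insert win.
    return reservationBySignature.putIfAbsent(signature, reservation) == null;
  }

  // Returns the reservation that won the race, or null if no record exists.
  String reservationOf(String signature) {
    return reservationBySignature.get(signature);
  }
}
```

However many requests call tryInsert for the same signature at once, exactly one receives true and goes on to generate the page; the rest are told that the page is being generated.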

The cache signature table comprises the columns (using JPA/JDO pseudocode):
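@Entity class CacheParamSig{
  @Id String signature;       // compressed value of the LCD parameters
  String reservation;         // non-null while a request is generating the page
  Datetime requestDatetime;   // time of the request that last reserved the record
}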

The algorithm
Let's say that a request is received. It is run through the parametric signature generator:
String paramSig = ParamSigGenerator.generate(request);

Each request races against any other to grab the CacheParamSig record and update it with its reservation semaphore:
String reservation = servername + session.getId();
The datetime of that reservation:
Datetime reservationTime = new Datetime();

First, determine if the page is cached, stale or non-existent:
if (cacheParamSig exists){

  // is another request generating the page?
  if (cacheParamSig.reservation != null){
    return PageIsGenerating;
  }

  else { // page is already generated and cached
    TimeDeterminant =
      Use request parameters to determine what database tables will be used
        to generate the page;

    // Get the latest upload time of those tables
    Datetime latestData = getLatest(TimeDeterminant);

    // is cached page stale?
    if (latestData > cacheParamSig.requestDatetime)
      grab the CacheParamSig record to generate a fresh page;

    else
      deliver the cached page to the browser;
  }
}
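
The decision above, together with the non-existent case, can be condensed into one small function. This is a sketch under assumptions: java.time.Instant stands in for the pseudocode Datetime, and the Decision and Record names are hypothetical.

```java
import java.time.Instant;

final class CacheDecisionSketch {
  enum Decision { GENERATE_NEW, PAGE_IS_GENERATING, REGENERATE_STALE, SERVE_CACHED }

  // Mirrors the CacheParamSig fields that matter for the decision.
  static final class Record {
    final String reservation;      // non-null while a request is generating the page
    final Instant requestDatetime; // time of the request that last reserved the record

    Record(String reservation, Instant requestDatetime) {
      this.reservation = reservation;
      this.requestDatetime = requestDatetime;
    }
  }

  // latestData is the latest upload time of the tables the page depends on.
  static Decision decide(Record record, Instant latestData) {
    if (record == null) {
      return Decision.GENERATE_NEW;        // no signature record yet
    }
    if (record.reservation != null) {
      return Decision.PAGE_IS_GENERATING;  // another request holds the reservation
    }
    if (latestData.isAfter(record.requestDatetime)) {
      return Decision.REGENERATE_STALE;    // data changed after the page was cached
    }
    return Decision.SERVE_CACHED;
  }
}
```

Both GENERATE_NEW and REGENERATE_STALE lead to the same next step, which is the race to reserve the CacheParamSig record.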


AJAX polling required
The browser page sending the request should be an AJAX polling loop, which should be easily implemented using GWT RPC. On receiving PageIsGenerating, the requesting page should wait and then attempt to send the request again.
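
The retry behaviour can be sketched in plain Java. This is not GWT code: a browser client would schedule the retry with a timer rather than block a thread, and the names poll, maxAttempts and waitMillis are assumptions. An empty Optional stands for the PageIsGenerating response.

```java
import java.util.Optional;
import java.util.function.Supplier;

final class PollingSketch {
  // Repeats the request until it yields a page or maxAttempts is used up.
  // An empty Optional from `request` stands for the PageIsGenerating response.
  static <T> Optional<T> poll(Supplier<Optional<T>> request, int maxAttempts, long waitMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      Optional<T> page = request.get();
      if (page.isPresent()) {
        return page;
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(waitMillis); // wait before sending the request again
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt and stop polling
          return Optional.empty();
        }
      }
    }
    return Optional.empty(); // gave up: the page was still being generated
  }
}
```

Bounding the number of attempts keeps a browser from polling forever if the generating request dies without clearing its reservation.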

If the page is stale or non-existent, the request proceeds to grab the CacheParamSig record.


Grabbing the reservation
Now, the request starts to race to grab and place a reservation on the CacheParamSig record:
transaction {
  CacheParamSig cacheParamSig =
    get CacheParamSig record using id paramSig;
  if (cacheParamSig exists){
    cacheParamSig.reservation = reservation;
    cacheParamSig.requestDatetime =  reservationTime;
    update cacheParamSig record;
  }
  else {
    cacheParamSig = new CacheParamSig();
    cacheParamSig.signature = paramSig;
    cacheParamSig.reservation = reservation;
    cacheParamSig.requestDatetime =  reservationTime;
    insert cacheParamSig record;
  }
}
at end of transaction {
  if (transaction failed because another transaction on the record is running){
    rollback;
    return PageIsGenerating;
  }
}
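
The grab-and-release cycle can be tried out in memory before involving a database. In the sketch below, ConcurrentHashMap.compute stands in for the transaction, which is an assumption made only for illustration; the class and method names are hypothetical, and Instant replaces the pseudocode Datetime.

```java
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;

final class ReservationSketch {
  static final class Entry {
    final String reservation;      // null when nobody is generating the page
    final Instant requestDatetime;

    Entry(String reservation, Instant requestDatetime) {
      this.reservation = reservation;
      this.requestDatetime = requestDatetime;
    }
  }

  private final ConcurrentHashMap<String, Entry> table = new ConcurrentHashMap<>();

  // Atomically reserves the record unless another request already holds it.
  boolean tryReserve(String paramSig, String reservation, Instant reservationTime) {
    Entry mine = new Entry(reservation, reservationTime);
    Entry result = table.compute(paramSig, (sig, current) ->
        current != null && current.reservation != null
            ? current  // someone else is generating: leave the record untouched
            : mine);   // absent or free: place our reservation
    return result == mine;
  }

  // Ends the reservation so that waiting requests can pick up the cached page.
  void release(String paramSig) {
    table.computeIfPresent(paramSig,
        (sig, current) -> new Entry(null, current.requestDatetime));
  }
}
```

A caller that gets false from tryReserve would return PageIsGenerating; the caller that gets true generates the page and then calls release.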

In JPA pseudocode, the race to grab the reservation looks like this:
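CacheParamSig cacheParamSig;
try
{
  transaction.begin();
  cacheParamSig = entityManager.find(CacheParamSig.class, paramSig);

  // find() returns null when no record exists;
  // hand over to the EntityNotFoundException handler below
  if (cacheParamSig == null)
    throw new EntityNotFoundException();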

  // if another request is already generating the page
  if (cacheParamSig.reservation != null){
    transaction.rollback();
    entityManager.close();
    return pageIsGenerating();
  }

  TimeDeterminant determinant = TimeDeterminant.get(request);
  Datetime latest = TimeDeterminant.latest(determinant);

  // if cached page is not stale
  if (latest < cacheParamSig.requestDatetime){
    transaction.rollback();
    entityManager.close();

    // paramSig value points to the cache where the page is stored
    return generatedPage(paramSig);
  }

  // otherwise, reserve the record to generate the page
  cacheParamSig.reservation = reservation;
  cacheParamSig.requestDatetime =  reservationTime;
  entityManager.merge(cacheParamSig);
  transaction.commit();
}

// page does not exist
// create and reserve the record to generate the page
catch (EntityNotFoundException e){
  cacheParamSig = new CacheParamSig();
  cacheParamSig.signature = paramSig;
  cacheParamSig.reservation = reservation;
  cacheParamSig.requestDatetime =  reservationTime;
  entityManager.persist(cacheParamSig);
  transaction.commit();
}

finally
{
  // if another transaction is already generating the page
  if ( transaction.isActive())
  {
    transaction.rollback();
    entityManager.close();
    return pageIsGenerating();
  }
}

// the page is cached into the folder named by the String value of paramSig
generatePageIntoCache(request, paramSig);


After generating the page,
  • end the reservation to tell others possibly waiting for the page, that the page generation has completed,
  • and send the page to the requesting page.
try
{
  transaction.begin();

  // tell others possibly waiting that the page generation has completed
  cacheParamSig.reservation = null;
  transaction.commit();
}
finally
{
  // this is not possible, otherwise something's wrong
  // but catch it anyway just in case

  if ( transaction.isActive())
  {
    transaction.rollback();
    entityManager.close();
    return BigTimeError;
  }

  return generatedPage(paramSig);
}


Further minutiae
This article presents only a skeleton. There are minutiae that need to be taken care of.
  1. Handle the case where a cached page was found not to be stale but, just as that transaction ended, another request found it stale, reserved the CacheParamSig record and started overwriting the cache while the first request is still sending the cached copy to the requesting page.
  2. On pageIsGenerating, each GWT RPC response should tell the requesting page when the current reservation was started.
  3. A job needs to be scheduled to delete stale CacheParamSig records, to prevent the table from growing without bound. Depending on the distribution of page possibilities, the schedule could be hourly, daily, weekly, etc.
  4. The skeletons above illustrate only the strategy of caching whole pages. Pages need not be HTML. They could be GIF, XLS, CSV, text or a combination of various output formats.
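
The cleanup job in item 3 could be sketched as follows. The article does not define when a record counts as stale for deletion, so this sketch assumes a simple age limit; the map stands in for the table, and the names purge and maxAge are hypothetical. A real job should also skip records that currently hold a reservation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

final class StaleRecordPurgeSketch {
  // Keys are signatures; values are the requestDatetime of each record.
  // Removes every record older than maxAge and returns how many were removed.
  static int purge(Map<String, Instant> requestTimeBySignature, Instant now, Duration maxAge) {
    Instant cutoff = now.minus(maxAge);
    int before = requestTimeBySignature.size();
    requestTimeBySignature.values().removeIf(requestTime -> requestTime.isBefore(cutoff));
    return before - requestTimeBySignature.size();
  }
}
```

Such a method could be run hourly, daily or weekly, for example from a ScheduledExecutorService, and the cached page files named by the removed signatures would be deleted in the same pass.
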

Normally, there are two modes of page generation: scheduled reports and ad-hoc pages. When I implemented this strategy, I merged both modes. I created a ScheduledReports table, in which each record contains a scheduled report name and its request parameters, and a JSP interface to that table for users to schedule the generation of their pages.

I wrote another JSP to read this table and then perform an HTTP request to generate the respective page. A job scheduler was used to invoke that JSP.


Page sequence tracker
Another interesting feature I had was a page sequence tracker. In statistical analysis, the output of one page is frequently fed as input into another page, and an analyst would try all sorts of sequences of analysis until he/she hits the jackpot of the perfect sequence. At the start of an analysis, the user would check the box for Start page sequence tracker. Once the jackpot is hit, the analyst would want to turn those ad-hoc page requests into a scheduled report. The page sequence tracker merely writes the sequence of parameters into the ScheduledReports table under a sequence group, whose reports are generated sequentially at the scheduled time.
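
A tracker of this kind needs very little state. The sketch below is a guess at its shape, not the original implementation: the class, the Row type and its fields are hypothetical, and the request parameters are kept as a plain query string.

```java
import java.util.ArrayList;
import java.util.List;

final class PageSequenceTrackerSketch {
  // One row destined for the ScheduledReports table (hypothetical layout).
  static final class Row {
    final String sequenceGroup;
    final int position;        // 1-based order within the sequence group
    final String requestParams;

    Row(String sequenceGroup, int position, String requestParams) {
      this.sequenceGroup = sequenceGroup;
      this.position = position;
      this.requestParams = requestParams;
    }
  }

  private final String sequenceGroup;
  private final List<String> recorded = new ArrayList<>();
  private boolean tracking;

  PageSequenceTrackerSketch(String sequenceGroup) {
    this.sequenceGroup = sequenceGroup;
  }

  void start() { tracking = true; }  // the user checks "Start page sequence tracker"
  void stop() { tracking = false; }

  // Remembers the parameters of an ad-hoc request, but only while tracking is on.
  void record(String requestParams) {
    if (tracking) {
      recorded.add(requestParams);
    }
  }

  // The rows to write under the sequence group, in the order they were requested.
  List<Row> toScheduledRows() {
    List<Row> rows = new ArrayList<>();
    for (int i = 0; i < recorded.size(); i++) {
      rows.add(new Row(sequenceGroup, i + 1, recorded.get(i)));
    }
    return rows;
  }
}
```

Keeping the position with each row lets the scheduler replay the requests in the order the analyst made them, so that each page's output is ready before the next page needs it.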

With ad-hoc and scheduled page generation performed under the same mechanism, I was even able to analyse the most frequently used parameters and turn them into scheduled reports without users knowing - and occasionally, users congratulated me on how unbelievably quick the response to some of their more complex statistical queries had become. The pages were generated one hour before they normally sat down to start their analysis after their morning cup of coffee.
