Improving ATG Performance With a CDN

Why use a CDN?

A Content Delivery Network, or CDN, is essentially a system of geographically distributed web servers that serve static content: typically images, video, and other bandwidth-intensive files. This serves two purposes: it keeps your servers from having to handle those requests, and it serves those files from a low-latency server that is closer (network-wise) to the end user. Both of these aspects improve the user’s perception of page and site performance. CDNs can also be extremely useful for things like streaming video and other very high-bandwidth uses.

How do CDNs work?

CDNs typically work in one of two ways. With some, you have to deploy the files to the CDN manually via FTP or a similar mechanism; others work as a transparent proxy, automatically loading the files from the source or origin (your servers) into the CDN as users request them. The latter is preferable because you don’t need to take the CDN into consideration when building your application’s pages and referencing media, which also makes handling non-production environments simpler. It also allows the media to be reloaded from the origin based on cache expiration headers, so you don’t need to do anything special during deployments of new media. However, those CDN solutions also tend to be more expensive, so it’s a balance you have to weigh yourself.
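If you run Apache at the origin, mod_expires is one way to set those cache expiration headers. A minimal sketch; the content types and lifetimes below are example values to adjust for your own media, not recommendations:

```apache
# Send expiry headers for static media so the CDN (and browsers)
# can cache them without re-fetching from the origin each time
ExpiresActive On
ExpiresByType image/gif  "access plus 7 days"
ExpiresByType image/jpeg "access plus 7 days"
ExpiresByType image/png  "access plus 7 days"
ExpiresByType text/css   "access plus 1 day"
```

When a cached copy expires, an origin-pull CDN simply re-requests the file from your servers, so rolling out new media needs no CDN-side deployment step.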

Roll Your Own Apache Pseudo CDN

You can also roll a pseudo-CDN yourself using Apache. I call it a pseudo-CDN because, unlike Akamai and other large providers, you don’t get the advantages of hundreds or thousands of geographically distributed servers. You also don’t get lots of fancy math routing users’ requests to the quickest servers based on location, network congestion, and more. What you do get is transparent proxying and off-loading of request handling from your application servers.

This means you don’t have to do anything special or complex when coding your web application and your JSPs to facilitate the CDN, and it means that your application servers are freed up from having to handle the requests for static media, large and small, which means they have more CPU time available for handling the real dynamic processing of your web application.

Apache makes this simple by way of the mod_disk_cache module. I’d recommend avoiding mod_mem_cache: even though it sounds like it would be the preferred caching mechanism, I have had significant problems with it and have abandoned it. If you’re using Linux (and you should be), the kernel aggressively caches recently accessed files, so when you use mod_disk_cache, Apache caches the files you specify on the local hard drive and the kernel uses all available RAM to keep those files in memory for rapid serving. If you plan on using mod_gzip and mod_disk_cache together, please read my post on the issues encountered using them together.
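A minimal mod_disk_cache configuration sketch for Apache 2.2 (the cache root, URL paths, and TTL below are example values you’d adjust for your own site):

```apache
# Load the caching modules (Apache 2.2 module names)
LoadModule cache_module modules/mod_cache.so
LoadModule disk_cache_module modules/mod_disk_cache.so

# Where cached files are written on the local disk
CacheRoot /var/cache/apache2/disk_cache

# Cache static media paths served from the application server
CacheEnable disk /images/
CacheEnable disk /css/
CacheEnable disk /js/

# Fall back to a one-hour TTL when the origin sends no expiry headers
CacheDefaultExpire 3600
```

The kernel’s page cache then keeps the hottest of those files in RAM for you, with no extra configuration.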

Improving JSP Serving Time for an ATG Application

Improving the performance of the JSPs that serve your HTML pages is the first step in improving the overall site performance. The user’s browser cannot start rendering the page or requesting the secondary media until it receives the HTML response. Also, the faster the page request completes, the sooner you have a thread free to handle the next request.

There are two parts to this: first, the time it takes the JSP servlet to generate the HTML response, and secondly the time it takes to transmit that HTML response back to the user’s browser.

Caching content sections

The easiest way to reduce the time it takes for the JSP servlet to generate the response is by reducing the amount of dynamic content on the page. Or more precisely by reducing the amount of real-time or unique individual content on the page.

The Cache droplet is THE most under-utilized ATG droplet.

The Cache droplet caches the rendered output of the contents of its oparam, based on a content key (such as category, user gender, or logged-in/logged-out state), for a configured period of time. This can be very useful for things like navigation menus dynamically built from the catalog: the catalog won’t change very often, so that dynamically generated menu can safely be cached for hours. The same goes for some or all of a category or product page, when you set the key to the category id or product id.
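In a JSP it looks something like this. The `key` and `cacheCheckSeconds` input parameters are the droplet’s standard knobs; the navigation include being cached is a hypothetical example:

```jsp
<dsp:droplet name="/atg/dynamo/droplet/Cache">
  <%-- Cache one rendered copy of the oparam per category --%>
  <dsp:param name="key" param="categoryId"/>
  <%-- Re-render at most once per hour --%>
  <dsp:param name="cacheCheckSeconds" value="3600"/>
  <dsp:oparam name="output">
    <%-- Expensive catalog-driven navigation menu goes here --%>
    <dsp:include page="/includes/categoryNav.jsp">
      <dsp:param name="categoryId" param="categoryId"/>
    </dsp:include>
  </dsp:oparam>
</dsp:droplet>
```

Every request within the hour gets the cached markup for that category instead of re-walking the catalog.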

Look at your pages and evaluate what parts of the page don’t change that frequently. Even if you can only cache the page or block for five minutes, that can be a huge performance win.


Apache mod_deflate and mod_cache issues

The Problem: Using Apache mod_deflate and mod_disk_cache (or other mod_cache) together can create far too many cached files.

The Background: Apache is a web server with many different modules you can load in to enhance it. Two common ones are mod_deflate and mod_cache (or mod_disk_cache).

Mod_deflate compresses content sent from the web server to the browser using gzip. It can take 100KB of HTML, CSS, or JavaScript and compress it down to roughly 10KB before transmitting it to the user’s browser. The browser then uncompresses it and displays the page. Most web servers (depending on how your site/application is structured) are not CPU limited, so you can spend some extra CPU doing the compression and get much faster content delivery times to your users, who are often bandwidth limited. Not only does this make pages load faster for your users, it also allows request-handling threads to complete sooner, letting your web server handle more requests.

Some web browsers are not able to handle gzipped content correctly, so it’s important to add logic that only sends gzipped content to browsers that can handle it. Also, some file types, such as images and video, are already compressed, so trying to gzip them is a waste of time and resources.

A common configuration may look like this:

<Location />
# Insert filter
SetOutputFilter DEFLATE

# Netscape 4.x has some problems...
BrowserMatch ^Mozilla/4 gzip-only-text/html

# Netscape 4.06-4.08 have some more problems
BrowserMatch ^Mozilla/4\.0[678] no-gzip

# MSIE masquerades as Netscape; disable compression for it too
# (see the note below about Flash content within IE 6)
BrowserMatch \bMSIE no-gzip

# IE 7 handles gzip fine, so undo the settings above for it
BrowserMatch \bMSIE\s7  !no-gzip !gzip-only-text/html

# Don't compress images
SetEnvIfNoCase Request_URI \
\.(?:gif|jpe?g|png|swf|flv)$ no-gzip dont-vary

# Make sure proxies don't deliver the wrong content
Header append Vary User-Agent env=!dont-vary
</Location>

This basically says:

“For files under /”
“Compress them”
“Unless it’s Netscape 4.x, then only compress text/html files”
“Or, if it’s Netscape 4.06-4.08, then don’t compress any files”
“But if it’s IE, don’t compress any files” – NOTE: this is different than the common version you see floating around which turns back on compression for IE. If you are loading content from a Flash swf within IE 6, that content can’t be compressed, even though IE 6 handles it fine. Flash doesn’t for some reason. So this setting is safer. If you aren’t using Flash, feel free to change this.
“but if it’s IE7, undo the no compression settings we made before, activating compression”
“but don’t compress already compressed files like images and video”
“Set the response Vary header to User-Agent so that any upstream caching or proxying won’t cache the wrong version and send a compressed version to a browser which can’t handle it, or an uncompressed version to a browser that should have gotten the compressed file”

Confused yet? :)

Mod_disk_cache lets you specify various files to be cached on the web server, set a cache expiration time, and so on. It’s of great value when those files are served out of a web application rather than from the local disk. For instance, if Apache is serving files from an ATG instance, mod_disk_cache lets the web server cache images, CSS, JS, videos, etc… from your WAR. There’s also a memory-based cache, mod_mem_cache, but it’s more trouble than it’s worth, and you can trust the Linux kernel to cache recently accessed files in memory anyhow.

Got it?

So this is where it gets tricky.

If a response has a Vary header set, mod_disk_cache will cache a different version of that file for each value of the header that Vary references.

So for a file compressed as above, there will be a different version cached for each User-Agent. In theory this will mean that browsers which support gzip compressed content, will get the compressed content, and browsers which don’t, will get the uncompressed version.
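For example, these two requests differ only in browser point-release, but because the cached responses carry `Vary: User-Agent`, mod_disk_cache stores a separate variant for each (the User-Agent strings here are illustrative):

```http
GET /js/site.js HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0

GET /js/site.js HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
```

Both browsers accept gzip and would receive identical compressed bytes, yet the cache stores the file twice.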

In practice, due to the amazingly tiny variations in full User-Agent strings, you end up with thousands of copies of the same file in your cache. On a disk cache only a few hours old, there were over 4,400 cached copies of the same JavaScript file, each with a slightly different User-Agent string, even though fewer than 10 actual browser types were represented.

This is a problem for several reasons. Firstly, you end up using far more disk space than you really need. Secondly, you negate the kernel’s in-memory file caching: since those 4,000+ versions of the single file are all being accessed, the kernel can’t simply keep the two meaningful variants (compressed and uncompressed) in memory. Thirdly, you make cleaning out the cache much slower, since you have to delete those thousands of extra files and their containing directories.

The Solution: I’m not sure… Any ideas?

Hiding jsessionid parameter from Google

If you’re running a website on JBoss you may discover that Google has indexed your pages with a jsessionid path parameter in the links.

The Google crawl bot does not support cookies, so JBoss falls back to the jsessionid path parameter to maintain session state without them. These parameters can hurt your Google rank and indexing efficiency, since the same page can be indexed multiple times under different session ids, diluting your ranking. They also lead to ugly links.

If you want to still be able to support non-cookie using users, but would like Google to see cleaner links, you can use Apache’s mod_rewrite to modify the links for the Google bot only, leaving the normal functionality available to the rest of your users.

Assuming you have mod_rewrite enabled in your Apache instance, use this configuration in your apache config:

	# This should strip out jsessionids from google
	RewriteCond %{HTTP_USER_AGENT} (googlebot) [NC]
	# The dot in the character class allows for session ids carrying
	# a jvmRoute suffix (e.g. ;jsessionid=ABC123.node1)
	RewriteRule ^(.*);jsessionid=[A-Za-z0-9.]+(.*)$ $1$2 [L,R=301]

This rule says: for requests where the user agent contains “googlebot” (matched case-insensitively), issue a 301 redirect to the same URL with the jsessionid stripped out. It seems to work nicely.

Apache Proxy Breaks RichFaces

I’ve run into this twice now, so I wanted to document it here to help other folks, and to see if anyone knows the root cause of the issue.

When using RichFaces with Seam, things work just fine on my local development JBoss instance. But when I deploy the same EAR file up to my production JBoss instance, which is sitting behind an Apache proxy, everything works EXCEPT the rich/ajax stuff.

The issue was that the JavaScript located here: ContextRoot/a4j_3_1_4.GAorg.ajax4jsf.javascript.AjaxScript

would not load.

My Apache proxy was configured like this:

	ProxyPass /10MinuteMail balancer://mycluster/10MinuteMail/
	ProxyPass /10MinuteMail/* balancer://mycluster/10MinuteMail/
	ProxyPassReverse /10MinuteMail balancer://mycluster/10MinuteMail/

With mycluster defined like this:

	<Proxy balancer://mycluster>
		# BalancerMember entries for the cluster nodes go here
		AddDefaultCharset off
		Order deny,allow
		Allow from all
		#Allow from
	</Proxy>

Again, this configuration worked fine for everything EXCEPT that RichFaces JavaScript.

Since I am only using one node for 10MinuteMail, there is no real need for a load balancer configuration, so I replaced the configuration with this:

	ProxyPass /10MinuteMail
	ProxyPass /10MinuteMail/
	ProxyPassReverse /10MinuteMail/

Which works, and fixed the RichFaces reference.

So there’s your solution. However I have no idea what the actual root cause is.