The Problem: Using Apache mod_deflate and mod_disk_cache (or other mod_cache) together can create far too many cached files.

The Background: Apache is a web server with many different modules you can load in to enhance it. Two common ones are mod_deflate and mod_cache (or mod_disk_cache).

Mod_deflate uses gzip to compress content that the web server sends out. It can take 100 KB of HTML, CSS, or JavaScript and compress it down to ~10 KB before transmitting it to the user’s browser. The browser then uncompresses it and displays the page. Most web servers (depending on how your site/application is structured, anyway) are not CPU limited. Therefore, you can spend some extra CPU on compression and get much faster content delivery to your users, who are often bandwidth limited. Not only does this make pages load faster for your users, it also lets request-handling threads complete sooner, so your web server can handle more requests.
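To get a feel for the numbers, here is a minimal Python sketch that gzips roughly 100 KB of made-up, very repetitive HTML and reports the sizes. It uses Python's standard gzip module rather than mod_deflate itself, and the payload, the function name compression_report, and the compression level of 6 are all assumptions for illustration. Real pages are less repetitive, so expect a less dramatic ratio on your own content.

```python
import gzip


def compression_report(payload: bytes, level: int = 6):
    """Return (original_size, compressed_size, ratio) for a gzip pass."""
    compressed = gzip.compress(payload, compresslevel=level)
    return len(payload), len(compressed), len(compressed) / len(payload)


# Made-up payload: about 100 KB of repetitive markup (an assumption, not real site data).
row = '<div class="row"><span>item</span></div>\n'
html = ("<html><body>" + row * 2500 + "</body></html>").encode("utf-8")

original, compressed, ratio = compression_report(html)
print(f"{original} bytes -> {compressed} bytes ({ratio:.1%} of original)")
```

Running it against a saved copy of one of your own pages gives a more honest picture than the synthetic payload does.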

Some web browsers can’t handle gzipped content correctly, so it’s important to add some logic that only sends gzipped content to browsers that can handle it. Also, some types of files, such as images and video, are already compressed, so trying to gzip them is a waste of time and resources.

A common configuration may look like this:

<Location />
# Insert filter
SetOutputFilter DEFLATE

# Netscape 4.x has some problems...
BrowserMatch ^Mozilla/4 gzip-only-text/html

# Netscape 4.06-4.08 have some more problems
BrowserMatch ^Mozilla/4\.0[678] no-gzip

# MSIE masquerades as Netscape (its User-Agent also starts with Mozilla/4).
# Unlike the stock example, turn compression off for it entirely
BrowserMatch \bMSIE no-gzip

# ...except IE 7, which gets compression turned back on
BrowserMatch \bMSIE\s7 !no-gzip !gzip-only-text/html

# Don't compress images, Flash, or video
SetEnvIfNoCase Request_URI \
\.(?:gif|jpe?g|png|swf|flv)$ no-gzip dont-vary

# Make sure proxies don't deliver the wrong content
Header append Vary User-Agent env=!dont-vary
</Location>

This basically says:

“For files under /”
“Compress them”
“Unless it’s Netscape 4.x, then only compress text/html files”
“Or, if it’s Netscape 4.06-4.08, then don’t compress any files”
“But if it’s IE, don’t compress any files” (NOTE: this is different from the common version you see floating around, which turns compression back on for IE. Content loaded by a Flash SWF running inside IE 6 can’t be compressed: IE 6 itself handles gzip fine, but Flash, for some reason, doesn’t. So this setting is safer. If you aren’t using Flash, feel free to change this.)
“But if it’s IE7, undo the no-compression settings we made before, re-activating compression”
“But don’t compress already compressed files like images and video”
“Set the response Vary header to User-Agent so that any upstream caching or proxying won’t cache the wrong version and send a compressed version to a browser which can’t handle it, or an uncompressed version to a browser that should have gotten the compressed file”
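If the ordering of those rules is hard to follow, here is a rough Python model of the decision they add up to. It is a sketch of my reading of the config, not Apache's own code: the names RULES, env_flags, and should_compress are made up, the sample User-Agent strings are only representative, and the Accept-Encoding check is deliberately simplistic (it ignores q-values).

```python
import re

# One entry per BrowserMatch / SetEnvIfNoCase line, in config order:
# (input to test, pattern, flags to set, flags to clear)
RULES = [
    ("ua", re.compile(r"^Mozilla/4"), {"gzip-only-text/html"}, set()),
    ("ua", re.compile(r"^Mozilla/4\.0[678]"), {"no-gzip"}, set()),
    ("ua", re.compile(r"\bMSIE"), {"no-gzip"}, set()),
    ("ua", re.compile(r"\bMSIE\s7"), set(), {"no-gzip", "gzip-only-text/html"}),
    ("uri", re.compile(r"\.(?:gif|jpe?g|png|swf|flv)$", re.IGNORECASE),
     {"no-gzip", "dont-vary"}, set()),
]


def env_flags(user_agent, request_uri):
    """Apply the rules top to bottom; later rules can undo earlier ones."""
    flags = set()
    inputs = {"ua": user_agent, "uri": request_uri}
    for source, pattern, to_set, to_clear in RULES:
        if pattern.search(inputs[source]):
            flags |= to_set
            flags -= to_clear
    return flags


def should_compress(user_agent, request_uri, content_type, accept_encoding="gzip"):
    """Would the response be gzipped under the rules modelled above?"""
    flags = env_flags(user_agent, request_uri)
    if "gzip" not in accept_encoding:
        return False  # client never asked for gzip
    if "no-gzip" in flags:
        return False
    if "gzip-only-text/html" in flags and not content_type.startswith("text/html"):
        return False
    return True


IE6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
IE7 = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
print(should_compress(IE6, "/js/app.js", "application/javascript"))
print(should_compress(IE7, "/js/app.js", "application/javascript"))
```

The point to notice is that every IE User-Agent also matches the first Netscape rule, which is why the IE7 line has to clear both flags.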

Confused yet? :)

Mod_disk_cache lets you specify which files should be cached on the web server, set a cache expiration time, and so on. It’s of great value when those files are served out of a web application rather than from the local disk. For instance, if Apache is serving files from an ATG instance, mod_disk_cache lets the web server cache images, CSS, JS, videos, etc. from your WAR. There’s also a memory-based cache, mod_mem_cache, but it’s more trouble than it’s worth, and you can trust the Linux kernel to keep recently accessed files in memory anyhow.

Got it?

So this is where it gets tricky.

If a response has a Vary header set, mod_disk_cache will cache a separate version of that file for each distinct value of the request header that Vary names.

So for a file compressed as above, there will be a different version cached for each User-Agent. In theory, this means that browsers which support gzip-compressed content get the compressed version, and browsers which don’t get the uncompressed version.

In practice, because the full User-Agent string varies in an amazing number of tiny ways, you end up with thousands of copies of the same file in your cache. On a disk cache only a few hours old, there were over 4,400 cached copies of the same JavaScript file, each with a slightly different User-Agent string, even though fewer than 10 actual browser types were represented.
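Here is a toy Python model of why that happens. It is not how mod_disk_cache is implemented, only a sketch that assumes a cache entry is keyed on the URL plus the value of each request header named in Vary. The function names, the URL, and the User-Agent strings are all made up for illustration.

```python
def cache_key(url, vary, request_headers):
    """Key an entry on the URL plus the value of every header listed in Vary."""
    return (url,) + tuple(request_headers.get(name, "") for name in vary)


def fill_cache(url, vary, user_agents, render):
    """Simulate one request per User-Agent and return the resulting cache."""
    cache = {}
    for ua in user_agents:
        headers = {"User-Agent": ua}
        cache.setdefault(cache_key(url, vary, headers), render(ua))
    return cache


def render(ua):
    """Which body the server would send: only two possibilities exist."""
    return "plain" if "MSIE 6" in ua else "gzip"


# Made-up strings that differ only in minor version or add-on details.
user_agents = [
    f"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.{n}) Gecko Firefox/3.0.{n}"
    for n in range(1, 11)
] + [
    f"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.{n})"
    for n in range(10)
]

cache = fill_cache("/js/app.js", ["User-Agent"], user_agents, render)
print(len(cache), "cache entries,", len(set(cache.values())), "distinct bodies")
```

Two browser families are enough to make the entry count track the number of distinct User-Agent strings while the number of distinct bodies stays at two.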

This is a problem for several reasons. Firstly, you end up using far more disk space than you really need. Secondly, you negate the kernel’s in-memory file caching: since those 4,000+ versions of a single file are all being accessed, the kernel can’t simply keep the two distinct files (compressed and uncompressed) in memory. Thirdly, you make cleaning out the cache much slower, since you have to delete these thousands of extra files and their containing directories.
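If you want to see how bad it is on your own server, a small Python sketch like this one counts the files, directories, and bytes under a cache directory. The function name cache_usage and the path in the example are assumptions; point it at whatever your CacheRoot is set to.

```python
import os


def cache_usage(root):
    """Return (files, directories, total_bytes) found under root."""
    files = dirs = total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        for name in filenames:
            files += 1
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file disappeared mid-walk; count it but skip its size


    return files, dirs, total


if __name__ == "__main__":
    # Hypothetical location; substitute your own CacheRoot.
    print(cache_usage("/var/cache/apache2/mod_disk_cache"))
```

Running it before and after a few hours of traffic shows how quickly the entries pile up.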

The Solution: I’m not sure… Any ideas?