Instead of just using the Apache mod_rewrite rules from my post on “Hiding jsessionid parameters from Google”, which rely on redirects, wouldn’t it be better to simply not output the jsessionid parameter into the URLs in the first place?
First, what are those jsessionid params, and why are they there?
For a web application to have state, i.e. to remember things from one page request to the next (such as that you’re logged in, who you are, what is in your shopping cart, etc.), most web applications have something called a session. The session starts when you first hit the website, sticks with you while you are on the site, and expires after you have either logged out or been idle (i.e. not clicked on anything) for a set period of time (perhaps 30 minutes).
In general the actual session data (your shopping cart, your user profile, all of that) is held on the server. However, in order to associate requests from your web browser with the correct session, your browser needs to pass something that lets the web application recognize which session is yours. This is traditionally done in two ways:
Firstly and primarily, using a session-life browser cookie (or two) which holds a session identifier and optionally some additional security token(s). The browser receives this cookie from the web application, and then sends the cookie back to the web application with each page request. The web application looks at the cookie, figures out which session is yours, and handles your page request appropriately.
Secondly, and usually only as a fallback for browsers which do not support cookies or whose cookie support has been turned off, by rewriting every link in the web application that points to another page in the same web application so that a special session id is added to the URI of the link. This is usually done as a path parameter (following a ‘;’), but is sometimes also done as a query parameter (following a ‘?’).
Since on the first request to a web application, the browser is not sending a session cookie, the web application has no way of knowing if the browser actually supports cookies or not. So for the first page, the web application will usually send back the session cookie AND rewrite all of the links on the page with the jsessionid just in case the cookie is not returned.
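To make the rewriting concrete, here is a minimal sketch of the decision described above. It is a simplified model, not the container's actual code: the class name SessionUrlSketch, the method encodeUrl and the session id "ABC123" are all made up for illustration. It only shows where the path parameter lands relative to the query string and the fragment.

```java
final class SessionUrlSketch {

    /**
     * Simplified model of link rewriting (hypothetical helper, not a real API).
     * If the browser already sent the session cookie, the link is left alone;
     * otherwise the session id is added as a path parameter.
     */
    static String encodeUrl(String url, String sessionId, boolean sessionCookieReceived) {
        if (sessionCookieReceived || sessionId == null) {
            return url; // the cookie (or lack of a session) makes rewriting unnecessary
        }
        // The path parameter goes after the path, before any '?' query or '#' fragment.
        int cut = url.length();
        int query = url.indexOf('?');
        int fragment = url.indexOf('#');
        if (query >= 0) {
            cut = Math.min(cut, query);
        }
        if (fragment >= 0) {
            cut = Math.min(cut, fragment);
        }
        return url.substring(0, cut) + ";jsessionid=" + sessionId + url.substring(cut);
    }

    public static void main(String[] args) {
        // First request: no cookie came back yet, so the link is rewritten.
        System.out.println(encodeUrl("/cart/view?item=42", "ABC123", false));
        // Later request: the cookie was returned, so the link stays clean.
        System.out.println(encodeUrl("/cart/view?item=42", "ABC123", true));
    }
}
```

The first call prints /cart/view;jsessionid=ABC123?item=42 and the second prints /cart/view?item=42, which is the difference a cookie-less client such as a spider ends up seeing on every page.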
So what’s the problem?
Search engine spiders, like Google’s GoogleBot, usually do not support cookies. This means that they see the site with the jsessionid parameter in every link and every requested URL, which leads to three related problems. First, the links that show up in a Google search include an ugly ‘jsessionid=xxxxxx’. Second, Google doesn’t recognize that the jsessionid parameter doesn’t change the page content, so each time the GoogleBot hits the site and gets a different jsessionid, it indexes all of the pages again. This leads to multiple result listings for the same page in search results; for instance, you might see the same page listed 7 times in a row. Third, because there are multiple instances of the same page with the same content, the Google PageRank of the actual page is severely diluted, and the page may perhaps even be penalized for the duplicate content.
Because of these problems, we do not want the GoogleBot to see the jsessionid URI parameters.
In my earlier post, linked to above, I used Apache mod_rewrite to look for requests from GoogleBot, and send a redirect back to GoogleBot, redirecting it to the same URI it had initially requested, just stripped of the jsessionid parameter.
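For reference, the effect of that redirect on a URI can be sketched in Java. This is only an illustration of the transformation, not the mod_rewrite rule from the earlier post; the class name JsessionidStripSketch and the sample URI are made up, and the sketch handles only the path parameter form (after a ‘;’), not a jsessionid passed as a query parameter.

```java
import java.util.regex.Pattern;

final class JsessionidStripSketch {

    // Matches ";jsessionid=..." up to the next '?', '#' or ';' (case-insensitive).
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#;]*", Pattern.CASE_INSENSITIVE);

    /** Returns the URI with any jsessionid path parameter removed. */
    static String strip(String uri) {
        return JSESSIONID.matcher(uri).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("/products/list;jsessionid=1A2B3C?page=2"));
    }
}
```

Running it prints /products/list?page=2, i.e. the URI GoogleBot would be redirected to.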
This time I’m going to use a Servlet Filter to prevent the jsessionid parameter from being inserted into the URL links on the page for GoogleBot requests. This is more elegant since there are no redirects.
First, I want to link to the web page which provided the starting point for the solution I used: JSESSIONID considered harmful
I took that approach and modified the filter code to only do this for GoogleBot requests, which allows users whose browsers don’t support or allow cookies to still use the site.
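The core of that modification is a per-request decision: skip the session id for GoogleBot, keep it for everyone else. Here is a rough, standalone sketch of that decision. It assumes GoogleBot is recognized by looking for "googlebot" in the User-Agent header, which is my assumption for this illustration; the names GoogleBotCheckSketch, isGoogleBot and linkFor are hypothetical, and linkFor assumes a URL with no query string to keep things short.

```java
import java.util.Locale;

final class GoogleBotCheckSketch {

    /** True if the User-Agent header looks like GoogleBot (assumed detection rule). */
    static boolean isGoogleBot(String userAgent) {
        return userAgent != null
                && userAgent.toLowerCase(Locale.ROOT).contains("googlebot");
    }

    /**
     * Builds the link to emit for a client that has not returned a session cookie.
     * GoogleBot gets the clean URL; any other client gets the jsessionid fallback.
     * Assumes the URL has no query string or fragment.
     */
    static String linkFor(String url, String sessionId, String userAgent) {
        if (isGoogleBot(userAgent)) {
            return url; // no session id in links shown to the spider
        }
        return url + ";jsessionid=" + sessionId;
    }

    public static void main(String[] args) {
        String bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
        String browser = "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0";
        System.out.println(linkFor("/home", "S1", bot));
        System.out.println(linkFor("/home", "S1", browser));
    }
}
```

In a real Servlet Filter the same check would decide whether to wrap the response so that its URL-encoding methods return the URL unchanged; the sketch leaves the servlet API out so that it runs on its own.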
I have one Java class: DisableUrlSessionFilter.java