Technical Blog

This blog will contain content related to Java, Seam, Security, my sites and projects, as well as other technical subjects I am interested in.

Comments and questions are welcome!

ATG SEO – Search Engine Optimization

Monday, April 27th, 2009

Search is Rocking

No matter how good your website or web application is, if people can’t find it, then it doesn’t do anyone any good.

The Secret to SEO

So how do you game Google? How do you trick search engines to put you at the top?

You don’t.

The “secret” to search engines is that they are built to return the most useful, relevant results. All you have to do is build your site in accordance with standards and common sense, and make it helpful or useful. The upside to taking these elements into consideration is that not only will it help people find your site, but often it will improve the experience on the site as well.

I’m going to cover aspects of your site such as head tags, semantic tags, standards and accessibility, URL structures, limitations to be aware of, how to test what GoogleBot sees, tools to check your site with, and how to boost your Google ranking using means beyond just fixing up your site.

These pointers will apply to any website, but I will make some specific recommendations and provide some sample code for ATG sites, which are often not very SEO friendly, despite best intentions.

Head Meta Tags – Title, Description, and Keywords

The first important things that a search engine sees in your HTML are the tags in the head section. There are three that we care about: Title, Description, and Keywords.

Title

The Title tag should uniquely identify the content of the page. Every page should have a different title, and the title should describe the content of the page accurately and succinctly. My preference is to put the site name after the unique portion of the title, like “Men’s Shoes – MyShoeStore.com” or “Men’s Fluevog Size 12 – MyShoeStore.com”. The content of the title should be relevant to the content on the page, and words from the title should be found prominently within the page content.

When you’re using a template or including a shared header file, as is common in many dynamic sites, especially ATG sites, you need to pass in the unique title as a param.

Within your header page you use code like this:

<title><dsp:valueof param="title">Shoes</dsp:valueof> - MyShoeStore.com</title>

And from within the calling page you pass in a title like this:

<dsp:include page="fragments/page-header.jsp">
     <dsp:param name="title" value="Contact Us" />
</dsp:include>

Or with catalog data, like this:

<dsp:include page="fragments/page-header.jsp">
     <dsp:param name="title" param="category.displayName" />
</dsp:include>

The other thing to keep in mind is that the Title is the default name of the bookmark if the user bookmarks the page. It needs to be clear, concise and helpful, and should make it easy for the user to find among their other bookmarks.

Description

The description meta tag should contain a longer description of the page content. It should be no longer than 160 characters, and should use terms that are highly relevant for the page content, and are repeated within the page.

For dynamic sites, including ATG, I recommend the same approach as above for the title tag.
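As a rough illustration of the 160-character guideline, here is a minimal Java sketch that trims a longer description at a word boundary before it is handed to the header fragment. The class and method names (`MetaDescription`, `trim`) are hypothetical and are not part of ATG; treat it as one possible approach rather than a required one.

```java
/** Hypothetical helper (not an ATG class): trims meta description text. */
final class MetaDescription {

    /** The upper limit recommended above for the description meta tag. */
    static final int MAX_LENGTH = 160;

    private MetaDescription() {
    }

    /**
     * Collapses runs of whitespace and, if the text is still too long, cuts it
     * at the last space at or before MAX_LENGTH so that no word is split.
     */
    static String trim(final String text) {
        if (text == null) {
            return "";
        }
        final String normalized = text.trim().replaceAll("\\s+", " ");
        if (normalized.length() <= MAX_LENGTH) {
            return normalized;
        }
        final int lastSpace = normalized.lastIndexOf(' ', MAX_LENGTH);
        // Fall back to a hard cut when there is no space to break on.
        return lastSpace > 0
                ? normalized.substring(0, lastSpace)
                : normalized.substring(0, MAX_LENGTH);
    }
}
```

The trimmed value could then be passed in as a param in the same way as the title.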

Keywords

The keywords meta tag should contain close to eight keywords which best represent the content of the page, and terms that people might use to search for similar content. Those keywords should feature prominently within the page content itself.

For dynamic sites, including ATG, I recommend the same approach as above for the title tag.

Tuning Your Head Tags

It’s important to tune your description and keywords to get the right users to the right content. This is the great part: by using the most relevant, accurate and helpful terms, and not trying to scam or spam traffic your way, you help searchers find what they are looking for, and you help yourself get more and higher quality traffic.

So in order to be more efficient, you need to figure out what people are searching for when they end up at your site. There are a couple of ways to do this. The most basic is to use an analytics package, such as Google Analytics, and look at the keywords that direct people to your site. Google Analytics makes this easy by also showing you the average time on the site and the number of pages visited, broken down by the keywords users searched for. This makes it easy to tell which keywords bring in users who WANT to be on your site and find it useful (and hence will come back, make purchases, etc…), and which keywords bring in the wrong people, who were looking for something else.

Here is the Google Analytics keyword report for my site:

[Image: Google Analytics Keyword Report]

You can see that people who come to my site when they’re searching for “create web service seam” only look at the first page they see, and have a 100% bounce rate. That means one of two things: A) this group of people was really looking for something else and my site didn’t help (in which case I should probably change my keywords and content so they don’t waste their time on my site) or B) the first page answered all of their questions. In this case the page on my site is probably pretty helpful for people searching for those words, so I’m going to keep the keywords, but you understand the point.

These types of reports should allow you to tune your keywords to get the best quality traffic and the happiest users.
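For reference, the bounce rate mentioned above is simply the share of visits that viewed exactly one page. A minimal sketch of that arithmetic in Java follows; the class name `KeywordStats` and the idea of feeding it a list of pages-per-visit counts are assumptions for illustration, not something Google Analytics exposes in this form.

```java
import java.util.List;

/** Hypothetical helper illustrating how a bounce rate is calculated. */
final class KeywordStats {

    private KeywordStats() {
    }

    /**
     * Returns the percentage of visits that viewed exactly one page.
     *
     * @param pagesPerVisit the number of pages viewed in each visit that
     *                      arrived via a given keyword
     */
    static double bounceRate(final List<Integer> pagesPerVisit) {
        if (pagesPerVisit.isEmpty()) {
            return 0.0;
        }
        int bounces = 0;
        for (final int pages : pagesPerVisit) {
            if (pages == 1) {
                bounces++;
            }
        }
        return 100.0 * bounces / pagesPerVisit.size();
    }
}
```

A keyword whose visits all stop at one page scores 100, which on its own cannot distinguish case A from case B above; time on page and goal completions are what help tell them apart.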

To be more advanced, you can track not just the keywords that led people to your site, but the keywords that brought in people who went on to make purchases or complete other goals. You can do this with Google Analytics by setting up Goals, or with other analytics packages.

There will be three more posts on SEO coming in the next few days!

———
ATG SEO – Accessibility and GoogleBot

ATG SEO – URL Formats and Crawler Limits

ATG SEO – Tools and Traffic

Why Is User Experience Performance So Important?

Wednesday, December 31st, 2008

In my ATG Performance Tuning post I mentioned that how a user perceives the site performance impacts their behavior on the site, and that a fast site leads to more purchases/traffic/etc…

Here are some numbers to back that up:

  • Amazon found that a 100ms increase in page response time led to a 1% DROP in sales, or conversely improving a page response time by 100ms will increase sales 1%. I suspect that this effect continues beyond the 100ms mark, but probably tapers off at some point.
  • Google found that a 500ms increase in page response time led to a 20% drop in traffic and revenue. This is despite the 3X increase in search results delivered (30 results instead of the default 10) to the test group.
  • Google also found that a 30% reduction in page size resulted in 30% more traffic/usage due to faster loading and rendering.

Given the relatively low cost/time in performance tuning your application, the resultant gain of 1%-20%+ in revenue makes it a smart move.

“As Google gets faster, people search more, and as it gets slower, people search less”
– Marissa Mayer, Google vice president of search products and user experience

This is also true for your website, just replace search with “buy”, “read”, etc…

In fact, I’ll lay down a wager: If you improve the page rendering time of the most visited pages of your ATG site by over 10% or 100ms (whichever is greater), and you don’t see any improvement in your goal conversion (purchases, sign-ups, whatever your measured goal is), I will give you an iPhone.

My upcoming posts on ATG Performance Tuning will make it easy to improve the page performance much more than that. I’d expect most ATG sites can cut the page loading and rendering time that users experience by 50% or more, based on the advice I will be posting here.

So check back often!

Protocol Buffers

Wednesday, July 9th, 2008

I just read about the recently released Protocol Buffers from Google.

“Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple “get” and “set” methods, and once you’re ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.”

It’s like XML binding, only supposedly MUCH faster, and it looks very easy as well. The Java Tutorial lays it out pretty simply.

I’m definitely going to try using Protocol Buffers next time I need to transport some data across the wire and see how it goes. Has anyone tried this yet? Any comments?

JBoss jsessionid Query Parameter Removal

Tuesday, May 20th, 2008

Instead of just using the Apache mod_rewrite rules from my post on “Hiding jsessionid parameters from Google”, which uses redirects, wouldn’t it be better to simply not output the jsessionid parameter into the URLs?

First, what are those jsessionid params, and why are they there?

For a web application to have state, i.e. remember things from one page request to the next (such as that you’re logged in, who you are, what is in your shopping cart, etc…), most web applications have something called a session. The session starts when you hit the website at first, sticks with you while you are on the site, and expires after you have either logged out or have been idle (i.e. not clicked on anything) for a set period of time (perhaps 30 minutes).

In general the actual session data is held on the server, things like your shopping cart, your user profile, all of that. However, in order to associate requests from your web browser with the correct session, your browser needs to pass something for the web application to recognize which session is yours. This is traditionally done in two ways:

First, and primarily, by using a session-lifetime browser cookie (or two) which holds a session identifier and optionally some additional security token(s). The browser receives this cookie from the web application, and then sends the cookie back to the web application with each page request. The web application looks at the cookie, figures out which session is yours, and handles your page request appropriately.

Second, and usually only as a fall-back for browsers which do not support cookies or whose cookie support has been turned off, by rewriting every link in the web application which points to another page in the same web application so that a special session id is added to the URI of the link. This is usually done as a path parameter (following a ‘;’), but sometimes is also done as a query parameter (following a ‘?’).

Since on the first request to a web application, the browser is not sending a session cookie, the web application has no way of knowing if the browser actually supports cookies or not. So for the first page, the web application will usually send back the session cookie AND rewrite all of the links on the page with the jsessionid just in case the cookie is not returned.
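To make the link rewriting concrete, here is a simplified Java sketch of what happens to a link when the session id is added as a path parameter. It is an illustration only, not JBoss's actual implementation; the class name, the method name and the sample session id are made up.

```java
/** Simplified illustration of URL-based session tracking; not container code. */
final class SessionUrlEncoder {

    private SessionUrlEncoder() {
    }

    /**
     * Inserts ";jsessionid=<id>" as a path parameter, that is, after the path
     * but before any query string or fragment.
     */
    static String encode(final String url, final String sessionId) {
        int end = url.length();
        final int query = url.indexOf('?');
        final int fragment = url.indexOf('#');
        if (query >= 0) {
            end = query;
        }
        if (fragment >= 0 && fragment < end) {
            end = fragment;
        }
        return url.substring(0, end) + ";jsessionid=" + sessionId + url.substring(end);
    }

    public static void main(final String[] args) {
        // Prints /cart.jsp;jsessionid=ABC123?item=42
        System.out.println(encode("/cart.jsp?item=42", "ABC123"));
    }
}
```

In a real application this is what the servlet API's `HttpServletResponse.encodeURL` does on your behalf whenever the container decides URL-based tracking is needed.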

So what’s the problem?

Search engine spiders, like Google’s GoogleBot, usually do not support cookies. This means that they see the site with the jsessionid parameter in every link and every requested URL, which leads to three related problems. First, the links that show up in a Google search include an ugly ‘jsessionid=xxxxxx’. Second, Google doesn’t recognize that the jsessionid parameter doesn’t change the page content, so each time the GoogleBot hits the site and gets a different jsessionid, it indexes all of the pages again. This leads to multiple result listings for the same page in search results; for instance, you might see the same page listed 7 times in a row. Third, with multiple instances of the same page carrying the same content, the Google PageRank of the actual page is severely diluted, and the page is perhaps even penalized for the duplicates.

Because of these problems, we do not want the GoogleBot to see the jsessionid URI parameters.

In my earlier post, linked to above, I used Apache mod_rewrite to look for requests from GoogleBot, and send a redirect back to GoogleBot, redirecting it to the same URI it had initially requested, just stripped of the jsessionid parameter.

This time I’m going to use a Servlet Filter to prevent the jsessionid parameter from being inserted into the URL links on the page for GoogleBot requests. This is more elegant since there are no redirects.

First, I want to link to the web page which provided the starting point for the solution I used: JSESSIONID considered harmful

I took that approach and modified the filter code to only do this for GoogleBot requests, which will allow users who don’t support or allow cookies to still use the site.

I have one Java class: DisableUrlSessionFilter.java

package com.digitalsanctuary.util;

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

/**
 * Servlet filter which disables URL-encoded session identifiers.
 *
 *
 * Copyright (c) 2006, Craig Condit. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * * Redistributions of source code must retain the above copyright notice,
 * this list of conditions and the following disclaimer.
 * * Redistributions in binary form must reproduce the above copyright notice,
 * this list of conditions and the following disclaimer in the documentation
 * and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 * Modified by Devon Hillard (devon@digitalsanctuary.com) to only filter for GoogleBot,
 * not for users without cookies enabled.
 *
 */
@SuppressWarnings("deprecation")
public class DisableUrlSessionFilter implements Filter {

    /**
     * The string to look for in the User-Agent header to identify the GoogleBot.
     */
    private static final String GOOGLEBOT_AGENT_STRING = "googlebot";

    /**
     * The request header with the User-Agent information in it.
     */
    private static final String USER_AGENT_HEADER_NAME = "User-Agent";

    /**
     * Filters requests to disable URL-based session identifiers.
     *
     * @param pRequest
     *                the request
     * @param pResponse
     *                the response
     * @param pChain
     *                the chain
     *
     * @throws IOException
     *                 Signals that an I/O exception has occurred.
     * @throws ServletException
     *                 the servlet exception
     */
    public void doFilter(final ServletRequest pRequest, final ServletResponse pResponse, final FilterChain pChain)
	    throws IOException, ServletException {
	// skip non-http requests
	if (!(pRequest instanceof HttpServletRequest)) {
	    pChain.doFilter(pRequest, pResponse);
	    return;
	}

	HttpServletRequest httpRequest = (HttpServletRequest) pRequest;
	HttpServletResponse httpResponse = (HttpServletResponse) pResponse;

	boolean isGoogleBot = false;

	if (httpRequest != null) {
	    String userAgent = httpRequest.getHeader(USER_AGENT_HEADER_NAME);
	    // Plain null/blank check, so no StringUtils import is needed.
	    if (userAgent != null && userAgent.trim().length() > 0) {
		if (userAgent.toLowerCase().indexOf(GOOGLEBOT_AGENT_STRING) > -1) {
		    isGoogleBot = true;
		}
	    }
	}

	if (isGoogleBot) {
	    // wrap response to remove URL encoding
	    HttpServletResponseWrapper wrappedResponse = new HttpServletResponseWrapper(httpResponse) {
		@Override
		public String encodeRedirectUrl(final String url) {
		    return url;
		}

		@Override
		public String encodeRedirectURL(final String url) {
		    return url;
		}

		@Override
		public String encodeUrl(final String url) {
		    return url;
		}

		@Override
		public String encodeURL(final String url) {
		    return url;
		}
	    };

	    // process next request in chain
	    pChain.doFilter(pRequest, wrappedResponse);
	} else {
	    pChain.doFilter(pRequest, pResponse);
	}
    }

    /**
     * Unused.
     *
     * @param pConfig
     *                the config
     *
     * @throws ServletException
     *                 the servlet exception
     */
    public void init(final FilterConfig pConfig) throws ServletException {
    }

    /**
     * Unused.
     */
    public void destroy() {
    }
}

and the servlet filter configuration in my web.xml file:

	<filter>
		<filter-name>DisableUrlSessionFilter</filter-name>
		<filter-class>
			com.digitalsanctuary.util.DisableUrlSessionFilter
		</filter-class>
	</filter>

....

	<filter-mapping>
		<filter-name>DisableUrlSessionFilter</filter-name>
		<url-pattern>/*</url-pattern>
	</filter-mapping>

So far, it seems to be working beautifully. It only impacts the GoogleBot, and it successfully strips the jsessionid parameter from the links on the site.
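The only decision the filter makes is whether the User-Agent header mentions GoogleBot, so that check can be pulled out and exercised without a servlet container. This is a minimal sketch of one way to do that; `CrawlerDetector` is a hypothetical class name and is not part of the filter above.

```java
import java.util.Locale;

/** Hypothetical helper isolating the User-Agent check used by the filter. */
final class CrawlerDetector {

    /** The string to look for in the User-Agent header, as in the filter. */
    private static final String GOOGLEBOT_AGENT_STRING = "googlebot";

    private CrawlerDetector() {
    }

    /**
     * Returns true when the header value contains "googlebot", ignoring case.
     * A missing header (null) is treated as a normal visitor.
     */
    static boolean isGoogleBot(final String userAgent) {
        return userAgent != null
                && userAgent.toLowerCase(Locale.ENGLISH).contains(GOOGLEBOT_AGENT_STRING);
    }
}
```

Keep in mind that the User-Agent header is supplied by the client, so this identifies requests that claim to be GoogleBot rather than proving that they are.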

Enjoy!

Hiding jsessionid parameter from Google

Monday, May 19th, 2008

If you’re running a website on JBoss you may discover that Google has indexed your pages with a jsessionid parameter in the links.

The Google crawl bot does not support cookies, so JBoss adds the jsessionid parameter to URLs in order to maintain session state without cookies. These parameters can hurt your Google rank and indexing efficiency, as the same page can be indexed multiple times with different session ids, which dilutes your ranking. They also lead to ugly links.

If you want to still be able to support non-cookie using users, but would like Google to see cleaner links, you can use Apache’s mod_rewrite to modify the links for the Google bot only, leaving the normal functionality available to the rest of your users.

Assuming you have mod_rewrite enabled in your Apache instance, use this configuration in your Apache config:

	# This should strip out jsessionids from google
	RewriteCond %{HTTP_USER_AGENT} (googlebot) [NC]
	RewriteRule ^(.*);jsessionid=[A-Za-z0-9]+(.*)$ $1$2 [L,R=301]

This rule says that for requests where the user agent contains “googlebot” (with case-insensitive matching), the URL is rewritten without the jsessionid and sent back as a 301 redirect. It seems to work nicely.
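If you want to see what the pattern in that rule matches before putting it in front of live traffic, the same regular expression can be tried out in plain Java. This is only a sketch for experimenting with the pattern; the class name `JsessionidRewriteDemo` and the sample paths are made up, and it does not reproduce the rest of mod_rewrite's behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo applying the RewriteRule pattern to sample paths. */
final class JsessionidRewriteDemo {

    /** The same pattern as in the RewriteRule above. */
    private static final Pattern JSESSIONID =
            Pattern.compile("^(.*);jsessionid=[A-Za-z0-9]+(.*)$");

    private JsessionidRewriteDemo() {
    }

    /** Returns the path with the jsessionid removed, mirroring "$1$2". */
    static String strip(final String path) {
        final Matcher matcher = JSESSIONID.matcher(path);
        return matcher.matches() ? matcher.group(1) + matcher.group(2) : path;
    }

    public static void main(final String[] args) {
        // Prints /products/shoes.jsp
        System.out.println(strip("/products/shoes.jsp;jsessionid=0A1B2C3D"));
    }
}
```

One thing this makes visible: the character class only covers letters and digits, so a session id containing anything else (for example a `.` followed by a route suffix) would only be partly removed. If your ids look like that, the character class would need widening.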