Making Search Keywords Easy

I was recently contacted by SortFix, who introduced their offering and thought I might be interested in writing a blog post about it (full disclosure: I have received nothing from SortFix other than their e-mail request).

SortFix is basically a value-added search provider that wraps Google search results. Their approach is to analyze other keywords which appear frequently in your search results. You can then drag these high-frequency keywords to either the “Add to search” box or the “Remove” box. There’s also a “Dictionary” box which defines keywords you may not know.

For example, if you search for “RS6”, you’ll get “Power words” like “v10”, “performance”, “audi”, “carlos”, “2010”, “2003”, “juan”, etc… By adding or removing those keywords, you can tune your search for either the 2003 edition of the RS6 or the new 2010 one, or you can check out the King of Spain Juan Carlos’ ride.

I can see this being useful for people who don’t have super good Google-Fu. I don’t see myself using it, but I can see it being useful for many other folks. Another point against it is that it’s currently a Flash-based interface, and generally I avoid Flash as much as possible. Apparently they are working on a non-Flash version, which would be a nice improvement IMHO.

I really like the idea of offering up high-frequency additional keywords to people who are searching for things, to help them refine their search. I can see this being very useful for on-site eCommerce search, helping narrow down products based on common attributes, etc…

Make Google Ignore JSESSIONID

Search engines like Google will often index content with params like JSESSIONID and other session or conversation scope params. This causes two problems: first, the links returned in the Google search results can have these parameters in them, resulting in “session not found” or other incompatible session state issues. Second, it can cause a single page of content to be indexed multiple times (with differing parameters), thus diluting your page’s rank.

I’ve posted two solutions to this issue in the past: Using Apache to ReWrite URLs to remove JSESSIONID and a more advanced solution of using a Servlet Filter to avoid adding JSESSIONID for GoogleBot Requests.
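For reference, the heart of the filter-based approach is simply to stop the servlet container from rewriting session IDs into URLs when the request comes from a crawler. Here’s a minimal sketch of that idea (the class name and the simplistic User-Agent check are mine, purely for illustration, not the exact code from the older post):

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

// Sketch: suppress ;jsessionid=... URL rewriting for search engine crawlers.
public class BotSessionIdFilter implements Filter {

    public void init(FilterConfig config) {
    }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        if (isSearchBot(request)) {
            // Wrap the response so encodeURL()/encodeRedirectURL() return the
            // URL unchanged instead of appending ;jsessionid=... for crawlers.
            response = new HttpServletResponseWrapper(response) {
                public String encodeURL(String url) {
                    return url;
                }
                public String encodeRedirectURL(String url) {
                    return url;
                }
            };
        }
        chain.doFilter(request, response);
    }

    // Deliberately naive bot detection, for illustration only.
    private boolean isSearchBot(HttpServletRequest request) {
        String userAgent = request.getHeader("User-Agent");
        return userAgent != null && userAgent.toLowerCase().contains("googlebot");
    }

    public void destroy() {
    }
}

You’d register a filter like this in web.xml ahead of your page-serving resources; a fuller version would also avoid creating sessions for bot requests in the first place.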

Now there’s an even better way to handle this. Google has added an amazing new feature to their Webmaster Tools which allows you to specify how the GoogleBot indexer should handle various parameters. You can tell it to ignore certain parameters, such as JSESSIONID, cid, and others, and to explicitly keep other parameters, such as productId, skuId, etc…

Log into your Google Webmaster Tools, and select the site you wish to work with. Under “Site Configuration” -> “Settings” there is a new section at the bottom called “Parameter handling”. Click on “adjust parameter settings” to expand the parameter handling configuration for your site. Sometimes Google will suggest various parameters it has discovered while crawling your site, and other times you just enter the parameters you want Google to ignore or pay attention to.

Google Webmaster Tools Parameter Handling Interface

This is a much more elegant solution to the JSESSIONID problem, and it also lets you correctly handle other parameters your site may use for session state or dynamic content generation. The only downside is that this only affects Google, whereas with the correct configuration my two older solutions can handle any search engine bot. Hopefully other search providers already offer, or will add, a similar feature.

ATG SEO – Tools and Traffic

SEO Tools

Besides using lynx to test your site, there are several other tools I would recommend in your quest for better SEO. One great tool is the SEO Analysis Tool from SEOWorkers.com. This tool quickly analyzes a page and provides excellent reports on the size and relevancy of your meta tags. It also provides data on the frequency of keywords, singly and in groups of two and three. Here is part of a report for sparkred.com:

SEO Analysis Tool report for sparkred.com

I’ve already mentioned utilizing Google Analytics to monitor which keywords are bringing you good traffic versus bad. I’d also recommend another Google tool, Google Webmaster Tools. Google provides a TON of useful data here. You can see things like the crawl frequency, page response times during the crawl, and the page rank of your pages (including a month-by-month breakdown). You can also use quick links to see related pages, pages which link in to your site, all the indexed pages of your site, and more. It also gives you diagnostics into any problems with the crawl or your meta tags.

How to Promote Your Site

In order to drive traffic to your site and to increase your ranking in search engines, such as your Google PageRank, you want high quality links pointing to your site. By high quality links I don’t mean spam links, link farms, or the like. I mean links on relevant sites pointing to your site, driving high quality traffic.

There are two halves to this. The first is to provide really good quality content on your site. This means having a great site, updating it frequently, and providing non-core but helpful related content. An example of that might be a blog with related content. For instance, for MyShoeStore.com you might have a blog with posts about the latest fashions and how to match shoes to each look, the best sales you’re currently running, the 10 most extreme shoes of all time, photos of the shoes worn by celebrities that week, etc… You want something that will have people returning to visit your site more frequently, and something that will make other people link to your site in their blogs, tweets, Facebook pages, IMs to friends, and so forth. If you have a blog post about the shoes on the red carpet at the People’s Choice Awards, people will link to it from tons of forums discussing fashion. Those are great links! Relevant high quality links bringing in high quality traffic. And you just need a blogger. A college fashionista who wants to get into fashion journalism will do it for cheap.

The second is to promote your site on other sites. Find forums and communities that are filled with the people you want to attract to your site, sign up, and contribute to the community conversation. Note I didn’t say spam the community. Post useful stuff, even stuff that links to sites that aren’t yours. Be a real, contributing, helpful member of the site. Have a link to your site in your profile, and when it’s relevant feel free to link to your site, your latest sale, whatever. Obviously if you’re an ATG developer or a VP of IT, it’s probably not you doing all this, but have someone do it.

Those two things are how you get traffic and a high page ranking in the search engines. It’s all about having good content and being an upstanding citizen of the relevant online communities.

ATG SEO – URL Formats and Crawler Limits

URL Formats and Structures

By making your URLs expressive and relevant to the content and structure of the site, you help not only your search engine ranking but also your users, since they can easily tell what a given link will take them to.

This is a bad URL:


http://myshoestore.com/app/cat/browse.jsp?id=245345&cat=234523

This is a good URL:


http://myshoestore.com/shop/mens-shoes/fluevog/size-12

It is chock full of descriptive words. The page allows you to “shop” for “mens shoes”, more specifically “fluevogs” in “size 12”. This makes it much easier for search engines to know the purpose of the page and also for users to know what a link will take them to.

In order to accomplish this, you should name your directories and pages as accurately and descriptively as possible. You should also structure your site’s content and URLs in a logical hierarchical fashion.

Now your site may have a single actual JSP that handles displaying a category, and another one that handles displaying a product, any product. So you need to map the URL http://myshoestore.com/shop/mens-shoes/fluevog/size-12 to actually serve up the content from http://myshoestore.com/app/cat/browse.jsp?id=245345&cat=234523.
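To make that mapping concrete, here’s a rough sketch of the idea as a plain servlet filter; the /shop/ path parsing and the hard-coded ID lookups are hypothetical placeholders for whatever catalog lookup your application really uses:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

// Sketch: forward "pretty" URLs like /shop/mens-shoes/fluevog/size-12
// to the real rendering page, e.g. /app/cat/browse.jsp?id=...&cat=...
public class PrettyUrlFilter implements Filter {

    public void init(FilterConfig config) {
    }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        String path = request.getServletPath();  // e.g. /shop/mens-shoes/fluevog/size-12

        if (path.startsWith("/shop/")) {
            String[] segments = path.substring("/shop/".length()).split("/");
            // Hypothetical: resolve the descriptive segments to real catalog IDs.
            String categoryId = lookupCategoryId(segments);
            String productId = lookupProductId(segments);
            if (categoryId != null && productId != null) {
                request.getRequestDispatcher(
                        "/app/cat/browse.jsp?id=" + productId + "&cat=" + categoryId)
                        .forward(request, res);
                return;
            }
        }
        chain.doFilter(req, res);
    }

    // Placeholder lookups -- a real implementation would query the catalog.
    private String lookupCategoryId(String[] segments) {
        return segments.length > 0 ? "234523" : null;
    }

    private String lookupProductId(String[] segments) {
        return segments.length > 1 ? "245345" : null;
    }

    public void destroy() {
    }
}

A filter like this only covers the inbound mapping; you still need to generate the pretty URLs in your page templates, which is where the framework-specific options below come in.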

Depending on your technology there are different ways to do this.

If you’re using JBoss Seam it’s very easy to use rewrite patterns in the pages.xml mapping file. This not only handles mapping the incoming requests for pretty URLs to the actual resources on the backend, but also handles generating the pretty URLs within the site automatically, which is a huge time saver.

If you’re using Apache you can use mod_rewrite to translate the pretty requested URLs to the ugly actual URLs. Of course in that case you need to ensure you’re generating the correct pretty URLs on the pages of your site.

If you’re using ATG you should read the chapter of the ATG Programmers Guide titled Search Engine Optimization (chapter 10 for ATG 2006.3). This covers the ATG support for URL Templates and the Jump Servlet. A few downsides to be aware of are that it’s not super simple to set up, and that it only displays the pretty URLs to search engines, not to all users. I really prefer solutions that give users the benefits of readable URLs as well. The out of the box ATG system also has too much of a performance impact to use in all situations.

We’ll be releasing a high performance open source solution for URL re-writing in ATG eCommerce applications as part of the Open Source Foundation ATG eCommerce Framework in the near future.

Know Your Limits

Search engine spiders, like the GoogleBot, have limits on what they’ll parse and consider. For instance, the GoogleBot will only read the first ~101kb of your page’s HTML. Anything after that is ignored, so you need to ensure that your pages are smaller than 101kb. This is also a best practice with regard to performance: keep your HTML as small as possible.

Search engines will often display a small chunk of text with the search results; usually this is taken from the page’s description meta tag. Most will only show the first 160 characters of the description, so you want to be sure that your description content is less than 160 characters and makes sense for a human to read.

Many search engines will ignore links beyond a certain point, or even penalize you, if you have more than 100 links on a given page. Keep the number of links on a single page to a reasonable level. If your primary navigation must have more than 100 links, you can load the second, third, etc… level navigation via AJAX/JavaScript. This lets your users have access to the full navigation structure from any page, but keeps things more reasonable for the search engine crawler. You’ll want to be sure that the crawler can still traverse the complete site structure using the more limited navigation it can see, i.e. the non-AJAX navigation.

ATG SEO – Accessibility and GoogleBot

Semantic Tags

Use the right tags for the job. For high importance headings use <h1>, for lists use <li>, and so on. Don’t JUST use CSS classes for styling; make sure you’re also using the appropriate HTML markup for your content. By using semantic markup you identify the importance and structure of the content and data on the page in a way that not only helps search engine crawlers, like GoogleBot, understand your content better, but also supports alternate browsers/web clients such as screen readers, other accessibility tools, and future applications.

Accessibility

By designing your site to be easily accessible to search engines, you also end up making it accessible to people with disabilities, such as blind or vision-impaired users. Or you can look at it the other way: by designing your site to be accessible to people with disabilities and to be WAI or 508 compliant, you end up with an excellent site for search ranking.

It can often be hard to justify the cost and effort to make your site 508 compliant, even if it’s a legal requirement. However, if you view it as an SEO effort, it’s much easier to assert a strong ROI on the project, and you kill two birds with one stone.

There is plenty of good material online about WAI and Section 508 compliance if you want to read more about accessibility.

Standards

Be sure you’re following standards and best practices in your markup. This includes having nicely structured, valid HTML/XHTML markup, and also being sure you have accurate, helpful alt and title attributes on your links, images, and other DOM elements.

See what GoogleBot sees

First you have to be able to see the page the way GoogleBot sees it. The easiest way is to use a text-based web browser like links or lynx (e.g. lynx -dump http://www.example.com). GoogleBot sees pages in much the same way lynx does, and supposedly it has been/is being improved to cull some data out of Flash, and possibly even some Javascript-driven content; however, it’s best to assume that what lynx shows you is what GoogleBot sees.

You’ll want to design your page structure and DOM tree to be logical in order, structure, and semantics. Here is a view of CNN.com in lynx:

The CNN.com homepage viewed using the lynx browser

It’s extremely usable. You can see the main navigation first thing, and if you scroll down you see the main headline stories, latest news, etc… This is what you want your site to look like to GoogleBot.

In contrast, here is what a large theatre’s website looks like in lynx:

A large theatre’s homepage viewed using the lynx browser

The primary navigation on the real site doesn’t even appear here. The images don’t have titles or alt tags, and the structure is lacking and confusing. This is NOT what you want your site to look like to GoogleBot.