Tuesday, 13 January 2015

SEO query string universal solution

Universal solution for SEO troubles caused by URLs with query strings
URLs with query strings can be real poison for SEO. The main harm done by untreated URLs with query strings is an unpredictable growth in the number of URLs that serve the same content, return HTTP status code 200 and have no indexing management in place. This is also called duplicate content. A second issue is that crawl budget gets wasted on URLs with query strings, which are better excluded from crawling and indexing.

As a result, a site with untreated query string URLs on the one hand gets URLs into the index that don't belong there, and on the other hand may lack crawl budget for its good URLs, because that budget has already been overspent.

There are some passive techniques for dealing with query strings in URLs. I originally planned to publish the existing techniques and my own solution for the SEO problems caused by query strings in my ultimate htaccess SEO tutorial, but the topic grew in detail, so I decided to write a separate article about query strings in URLs and SEO.

Existing SEO techniques for dealing with query strings in URLs

  • Google says it can handle query strings in URLs, but it recommends adjusting the crawler's settings in Webmaster Tools for each existing query string parameter.
  • URLs with query strings can be disallowed in robots.txt with rules like
    Disallow: /?*
    Disallow: /*?
    
  • If the head section of the HTML or PHP files served under URLs with query strings can be edited, it is possible to add rules for indexing management and URL canonicalisation, like
    <meta name="robots" content="noindex, nofollow">
    <link href="Current URL, but without query string" rel="canonical">
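
The two robots.txt rules in the list above do not catch the same URLs. The following Python sketch illustrates the difference. It assumes the wildcard semantics Google documents for robots.txt (* matches any sequence of characters, a trailing $ anchors the end, and rules match from the start of the path). The helper names robots_rule_to_regex and is_disallowed are made up for this illustration and are not part of any crawler.

```python
import re

def robots_rule_to_regex(rule):
    """Translate a robots.txt path rule with * wildcards into a regex.

    Assumption: Google-style matching (prefix match, * = any characters,
    optional trailing $ = end of URL).
    """
    anchored_end = rule.endswith("$")
    if anchored_end:
        rule = rule[:-1]
    # Escape every literal piece and join the pieces with ".*" for each *
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))

def is_disallowed(rule, path_and_query):
    """Return True if the Disallow rule matches the given path plus query."""
    return robots_rule_to_regex(rule).match(path_and_query) is not None

for url in ["/?a=1", "/page/?a=1", "/page/"]:
    print(url, is_disallowed("/?*", url), is_disallowed("/*?", url))
```

Under these assumptions, /?* only catches query strings directly on the start page, while /*? catches a question mark anywhere in the URL.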
    
These methods are mostly manual, require an unpredictable amount of work and solve the problems only partly. But the good news is: I have a universal solution that works for all URLs with query strings and gets rid of the SEO troubles they cause.

Universal solution for SEO of URLs with query strings

1. URL architecture

The solution works on the server side (Apache), in the .htaccess (or httpd.conf) file. Before we begin, let's look in detail at the URL architecture as htaccess syntax sees it. Any URL has three parts, which can be addressed separately through different server variables:
http://example.com/page/?query-string
In this URL:

  • example.com is addressed by HTTP_HOST
  • /page/ is addressed by REQUEST_URI
  • query-string (the part after the ?) is addressed by QUERY_STRING
I will use this addressing in the solution approach.
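
To make the three parts tangible, here is a small Python sketch that splits the example URL the same way. It is only an illustration with the standard library, not something Apache runs. The function name url_parts is made up for this example. Note that the path comes back with its leading slash and the query string without the question mark.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the three parts named after the Apache variables."""
    parts = urlsplit(url)
    return {
        "HTTP_HOST": parts.netloc,      # host name
        "REQUEST_URI": parts.path,      # path, without the query string
        "QUERY_STRING": parts.query,    # everything after the "?"
    }

print(url_parts("http://example.com/page/?query-string"))
# {'HTTP_HOST': 'example.com', 'REQUEST_URI': '/page/', 'QUERY_STRING': 'query-string'}
```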

2. Solution approach

The approach in general is:
  • to find all URLs with query strings,
  • flag each of them with a custom environment variable,
  • add the X-Robots-Tag noindex rule to the HTTP response headers of the flagged URLs,
  • add a link rel="canonical" rule to the same HTTP response headers, populated with the current URL, but without the query string.
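
Before assembling the Apache rules, the approach can be sketched in a few lines of Python. This is only a model of the intended result, not the server configuration itself. The function name seo_headers is made up for this illustration, and the header values follow the ones used later in this article.

```python
from urllib.parse import urlsplit, urlunsplit

def seo_headers(url):
    """Return the extra response headers a URL should get.

    URLs without a query string get no extra headers. URLs with a query
    string get noindex plus a canonical link to the same URL without it.
    """
    parts = urlsplit(url)
    if not parts.query:
        return {}
    # Rebuild the URL from scheme, host and path only
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return {
        "X-Robots-Tag": "noindex, nofollow",
        "Link": f'<{canonical}>; rel="canonical"',
    }

print(seo_headers("http://example.com/page/?sort=asc"))
print(seo_headers("http://example.com/page/"))
```

The first call returns both headers with http://example.com/page/ as canonical URL, the second call returns an empty dictionary.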

3. Assembling the rules set

First we make sure that mod_rewrite is switched on and define the rewrite base, which could be something other than the root:
<ifModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
Then we catch URLs with any query string:
RewriteCond %{QUERY_STRING} .
# A character class pattern is also possible, like
# RewriteCond %{QUERY_STRING} ^[a-zA-Z0-9]*$
# but it also matches an empty query string and misses characters like = and &,
# so for the sake of simplicity we use the single dot
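
The difference between the single dot and the character class pattern ^[a-zA-Z0-9]*$ is easy to check outside of Apache. The following Python sketch uses the re module as a stand-in for the regex engine behind RewriteCond. For these two simple patterns the behaviour should be the same, but treat it as an approximation. The helper name matches is made up for this example.

```python
import re

ANY_QUERY = re.compile(r".")                # the pattern used in the rule set
ALNUM_ONLY = re.compile(r"^[a-zA-Z0-9]*$")  # the commented-out alternative

def matches(pattern, query_string):
    """Unanchored search, the way a RewriteCond pattern is applied."""
    return pattern.search(query_string) is not None

for qs in ["", "page", "id=5&sort=asc"]:
    print(repr(qs), matches(ANY_QUERY, qs), matches(ALNUM_ONLY, qs))
# '' False True
# 'page' True True
# 'id=5&sort=asc' True False
```

The single dot fires for every non-empty query string and stays quiet for URLs without one, which is exactly what is needed here.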
Then we do the trick: we create a rewrite rule for all caught URLs with any query string. It doesn't rewrite any URL (the dash means "no substitution"). It only sets two environment variables for every caught URL: a flag, and the "pure" URL composed of HTTP_HOST and REQUEST_URI:
# CANONICAL_URL is a variable name chosen for this example.
# The http:// scheme is an assumption, use https:// if your site runs on HTTPS.
RewriteRule .* - [E=DO_SEO_HEADER:1,E=CANONICAL_URL:http://%{HTTP_HOST}%{REQUEST_URI}]
At this point the main work is done: all URLs with any query string are caught and flagged. What remains is to populate the HTTP response headers with the rules we need. First we add the X-Robots-Tag to all responses for which the environment variable we set before is present:
# Close the mod_rewrite check
</ifModule>
# Do mod_headers check
<ifModule mod_headers.c>
Header set X-Robots-Tag "noindex, nofollow" env=DO_SEO_HEADER
Then we add the canonical rule with the "pure" URL, composed of HTTP_HOST and REQUEST_URI. This rule is also triggered by the environment variable we set before.
Header set Link '<%{CANONICAL_URL}e>; rel="canonical"' env=DO_SEO_HEADER
# Close mod_headers check
</ifModule>

The SEO query string solution is ready for action! With this rule set, your site's ranking is protected against duplicate content caused by URLs with query strings. Another benefit of this rule set is precise indexing management: only the canonical URL version is offered for indexing. Use and enjoy!

TL;DR
Universal solution for SEO troubles caused by URLs with query strings: add the following rule set to your .htaccess file:

PS: full rule set:
# Ensure Apache's mod_rewrite works
<IfModule mod_rewrite.c>
# Set rewrite engine on 
RewriteEngine On
# Set rewrite base
RewriteBase /
# Catch any query
RewriteCond %{QUERY_STRING} .
# Don't rewrite anything, only flag URLs with query strings and store the
# canonical URL (CANONICAL_URL is a name chosen for this example; the http://
# scheme is an assumption, use https:// if your site runs on HTTPS)
RewriteRule .* - [E=DO_SEO_HEADER:1,E=CANONICAL_URL:http://%{HTTP_HOST}%{REQUEST_URI}]
</IfModule>
# Ensure Apache's mod_headers works
<IfModule mod_headers.c>
# Specify X-Robots tag with noindex option for URLs with query strings
Header set X-Robots-Tag "noindex, nofollow" env=DO_SEO_HEADER
# Specify canonical URL version
Header set Link '<%{CANONICAL_URL}e>; rel="canonical"' env=DO_SEO_HEADER
</IfModule>