Tuesday, October 27, 2015

Solution: how to stop Google from crawling and indexing non-existent pages

Many webmasters are affected by a strange issue: Google crawls, and sometimes indexes, URLs that do not exist. The issue does not depend on whether the site runs WordPress or another CMS. The question of why Google crawls and/or indexes non-existent URLs comes up again and again in webmaster forums, Google Groups and similar places, but usually without a clear solution.

The fact that Googlebot builds and crawls a bunch of non-existent URLs raises some questions:

  • Where do non-existent URLs come from?
  • Why is it a problem if non-existent URLs are crawled or indexed?
  • How can the risks related to non-existent URLs be minimized?

The crawler reads the documents of a site one by one, using the sitemap (assuming there are no sitemap errors that would make the crawler stop and leave). Once the crawler is done with the existing URLs, it starts something like a brute-force attack, looking for spare parts to build URLs from. Googlebot in particular searches URLs and source code for anything

  • that is part of any URL slug,
  • that it assumes could be part of a URL slug,
  • that could be used in a URL slug, such as IDs, parameters, labels, values, anchors, variables, relative paths, folder names and so on.

Google itself describes this crawler behavior, for example:
"Googlebot trying to follow links found in JavaScript, Flash files, or other embedded content."
In short, the crawler takes everything that could be used to build a URL, builds URLs from it and tries to fetch each of them.

This behavior would not be a problem at all if a properly configured server answered every request for a non-existent URL with the 404 status code. Google would collect a bunch of 404 errors and that would be all: such errors have no negative impact on the site that returns them.

In many cases, however, servers are misconfigured and answer HTTP requests for non-existent pages with status code 200, as if these pages existed.

Why it is a problem if Google gets code 200 from non-existent URLs

In general there are two problems:
  1. Wasted crawl budget: the crawler has a limited crawl budget per website. If it crawls non-existent pages, the budget may be used up before important new pages are crawled.
  2. Indexing of non-existent URLs: suppose there is an existing page example.com/page/ and a page example.com/index.php/page/ that does not exist, but answers with code 200 and shows the content of the first page. Then both pages may appear in the index, or even only the second one.

Solution: answer all requests for non-existent URLs with 404

This solution forces a 404 error for every request for a non-existent URL that the crawler might come up with. Add the following rules to your .htaccess:
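# Serve /404.php as the body of every 404 response
ErrorDocument 404 /404.php
# Reject trailing path info such as /index.php/page/ instead of passing it to the script
AcceptPathInfo Off

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# Requests that match neither a directory nor a file go to the 404 page.
# Note: this is an internal rewrite, so 404.php itself has to send the 404 status.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)$ /404.php [L]
</IfModule>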

The really neat trick is the AcceptPathInfo Off directive. Without it, only URLs like example.com/page/123 are handled. With it, URLs like example.com/page/index.php/123 are handled too, because Apache no longer passes the part after an existing script name to that script as path info.
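
Because the rewrite rule above hands the request to /404.php internally, the response only carries the 404 status if 404.php sends it. One possible variant, sketched here under the assumption of Apache 2.x with mod_rewrite enabled, lets Apache set the status itself through the R=404 flag and leaves the page body to ErrorDocument. It is an alternative to the rule above, not part of the original recipe:

```apache
# Variant sketch (assumption: Apache 2.x, mod_rewrite available)
ErrorDocument 404 /404.php
AcceptPathInfo Off

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
# "-" means no substitution; R=404 makes Apache answer with status 404
# and serve the ErrorDocument configured above
RewriteRule ^ - [R=404,L]
</IfModule>
```

Whichever form is used, it is worth requesting a made-up URL afterwards and checking that the response status really is 404 and not 200.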

Limitation: if you use this technique and want readable (search-engine-friendly) URLs, you have to implement them with mod_rewrite.
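
A minimal sketch of what that can look like: each friendly URL gets its own rule that maps it to a real script, placed before the catch-all. The path /article/<number>, the script article.php and its id parameter are hypothetical names chosen only for illustration, and the sketch assumes that MultiViews is not enabled.

```apache
ErrorDocument 404 /404.php
AcceptPathInfo Off

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

# Hypothetical friendly URL: /article/123 is served by article.php?id=123
RewriteRule ^article/([0-9]+)/?$ article.php?id=$1 [L,QSA]

# Everything else that is neither a directory nor a file goes to the 404 page
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)$ /404.php [L]
</IfModule>
```

With rules like these, a request for example.com/article/123 should be handled by article.php, while example.com/article/abc matches no friendly rule and falls through to the 404 page.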