The fact that Googlebot creates and crawls a bunch of non-existing URLs raises some questions:
- Where do non-existing URLs come from?
- Why is it not optimal if non-existing URLs are crawled or even indexed?
- How can the risks related to non-existing URLs be minimized?
Googlebot builds URLs out of everything it comes across:
- anything that is part of a URL slug,
- anything it assumes to be part of a URL slug,
- anything that could be utilized in a URL slug, like IDs, parameters, labels, values, anchors, variables, relative paths, folder names and so on.
Google explains the crawler's behavior like this: Googlebot tries to follow links found in JavaScript, Flash files, or other embedded content.
In short, the crawler takes everything that could be utilized for URL building, builds URLs out of it and tries to get every byte from them.
This behavior of Google wouldn't be a problem at all if every request for a non-existing URL were answered by a properly configured server with the error code 404. Google would get a bunch of 404 errors and that's all - such errors have no negative impact on the site where Google gets them.
But in many cases servers are misconfigured and answer HTTP requests for non-existing pages with the code 200, as if those pages existed.
Why it isn't optimal if Google gets code 200 for non-existing URLs
In general there are two problems:
- Overspending of crawl budget: the crawler has a limited amount of crawl budget per website. If it crawls non-existing pages, it can happen that the crawl budget is used up before important new pages are crawled.
- Indexing of non-existing URLs: if there is an existing page example.com/page/ and a page example.com/index.php/page/, which doesn't exist but answers with code 200 and delivers the content of the first page, it can happen that both pages appear in the index, or even that only the second one does. Whether a server behaves this way is easy to check, as the sketch below shows.
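The following is a minimal check sketch in PHP, using the hypothetical example URLs from above: it requests both pages and prints the status line the server returns.

    <?php
    // Quick check of the status line the server returns.
    // example.com/page/ is assumed to exist, example.com/index.php/page/ is not.
    $urls = [
        'http://example.com/page/',
        'http://example.com/index.php/page/',
    ];
    foreach ($urls as $url) {
        $headers = get_headers($url);  // first element is the status line
        echo $url, ' => ', $headers ? $headers[0] : 'request failed', "\n";
    }
    // On a misconfigured server both URLs answer with "HTTP/1.1 200 OK",
    // although only the first page really exists.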
Solution: answer all requests for non-existing URLs with 404
This solution forces a 404 error for all requests for non-existing URLs that could be constructed by the crawler. Add the following rules to your .htaccess:

    ErrorDocument 404 /404.php
    AcceptPathInfo Off
    <IfModule mod_rewrite.c>
        RewriteEngine on
        RewriteBase /
        RewriteCond %{REQUEST_FILENAME} !-d
        RewriteCond %{REQUEST_FILENAME} !-f
        RewriteRule ^(.*)$ /404.php [NC,L]
    </IfModule>
The really nice trick is done by the directive AcceptPathInfo Off. Without it, only URLs like example.com/page/123 are handled; with it, URLs like example.com/page/index.php/123 are handled too.
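One detail is easy to overlook: the RewriteRule above only rewrites the request internally to /404.php, and the ErrorDocument directive only covers errors Apache raises itself, so /404.php should send the 404 status on its own - otherwise the error page would still be delivered with code 200. A minimal sketch of such a /404.php, assuming PHP 5.4 or newer:

    <?php
    // /404.php - send the real 404 status before any output,
    // then render a simple error page.
    http_response_code(404);
    header('Content-Type: text/html; charset=utf-8');
    ?>
    <!DOCTYPE html>
    <html>
      <head><title>404 - Page not found</title></head>
      <body>
        <h1>404 - Page not found</h1>
        <p>The requested URL doesn't exist on this server.</p>
      </body>
    </html>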
Limitation: if you use this technique and still want readable (search engine friendly) URLs, you must implement them with mod_rewrite.
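After the rules are in place, the same kind of check as in the sketch above should show constructed URLs like example.com/page/index.php/123 answering with "HTTP/1.1 404 Not Found", while existing pages keep answering with 200.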