Google Bot Attempts to Crawl Shortest Urls First – Russell Ballestrini

Joe-rb 14y, 11d ago

It's interesting - and I'm going through the data.

BUT - does it have any relevance regarding SEO?

hojboj-rb 14y, 11d ago

You gonna crawl them randomly? You gonna crawl them by longest url first? Alphabetically? What difference does it make?

Guy-Gervais-rb 14y, 11d ago

I'd speculate that it's because most sites will try to make their most interesting content accessible via short, easy to remember URLs. So by crawling those first, Google gets the best parts rapidly.

Tenticle-rb 14y, 11d ago

MongoDB is Web Scale.

Dominik-rb 14y, 11d ago

I think Google tries to fetch pages starting from the homepage and working down the tree of pages. But how should Google know how deep a URL is in your page tree?

It seems pretty pragmatic to me to use the length of the url as an indicator. The logic could possibly be that with every added directory /a/b/c in the url you hit a page deeper down the tree.

This is only a rule of thumb, but it makes perfect sense: Google fetches only a limited number of URLs per domain, and the deeper a page lies on your domain, the less important its content should be.
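
The depth heuristic described above can be sketched in a few lines of Python. This is only an illustration of counting path segments, not anything Google is known to do; the `example.com` URLs and the `url_depth` helper are made up for the example.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count non-empty path segments, so /a/b/c has depth 3 and / has depth 0."""
    path = urlparse(url).path
    return sum(1 for segment in path.split("/") if segment)


# Hypothetical URLs modeled on a /state/city/school layout.
urls = [
    "http://example.com/connecticut/hartford/some-school",
    "http://example.com/",
    "http://example.com/connecticut/hartford",
    "http://example.com/connecticut",
]

# Sort by guessed depth first, then by raw string length as a tie-breaker.
by_depth = sorted(urls, key=lambda u: (url_depth(u), len(u)))
for u in by_depth:
    print(url_depth(u), u)
```

On a site with a strict directory layout, sorting by depth and sorting by length give nearly the same order, which is why the two explanations are hard to tell apart from a log alone.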

What do you think?

JK-rb 14y, 11d ago

Seems pretty obvious, because this way lets the bot find the entire structure of a site. For example, if it crawled domain.com/longestdirectory/dir/dir2 first, it could miss some "folder" before dir2.

Anonymous-rb 14y, 11d ago

Graphed the googlebot requests in the log: http://i.imgur.com/uMoUT.png
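
The commenter does not say how the graph was produced. One plausible way to get the underlying numbers is sketched below; it assumes the access log uses the Apache/nginx "combined" format, which the post does not state, and the sample lines are invented.

```python
import re

# Assumption: "combined" log format (request, status, size, referer, user agent).
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_path_lengths(lines):
    """Return the length of each path requested by Googlebot, in log order."""
    lengths = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("agent"):
            lengths.append(len(match.group("path")))
    return lengths


# Invented sample lines; 192.0.2.x is a documentation-only address range.
sample = [
    '192.0.2.1 - - [01/Jan/2011:00:00:01 +0000] "GET /ct HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.9 - - [01/Jan/2011:00:00:02 +0000] "GET /about HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (X11; Linux)"',
    '192.0.2.1 - - [01/Jan/2011:00:00:03 +0000] "GET /ct/hartford HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

lengths = googlebot_path_lengths(sample)
print(lengths)
```

Plotting `lengths` against its index gives the kind of chart linked above: request number on the x axis, URL length on the y axis.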

russell 14y, 11d ago

Wow! thanks, I'm going to add that to the blog up top!

Ben-Krull-rb 14y, 11d ago

In my experience, Google follows this crawling pattern when all links are of equal value and there are too many links on a page to crawl them all (e.g. most of your state pages). I have several local directory style sites so I've seen this pattern many times.

Just curious, where did you get the public school information for this site? I'm always looking for good quality local data sources.

Anthony-Brunetti-rb 14y, 11d ago

Hi Russell,

Interesting.

I believe it's because of your site's internal links & hierarchy:

home - links to: home/state - links to: home/state/city - links to: home/state/city/school-name

The spider will follow all of the links on each level before going deeper, following your hierarchy. I imagine your chart looks flat because all states have about the same URL length. To test this theory, find a city/school URL that is shorter than a /city URL and see if that gets crawled first. You could also test the internal link theory by linking to a long URL on your homepage, and watching whether the spider hits that first.
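
The test proposed above can be run against an existing crawl log rather than by waiting for a new crawl. The sketch below is one hypothetical way to do it: it looks for pairs where the deeper URL is the shorter string and reports which of the two was fetched first. The paths and the helper names are invented for illustration.

```python
def depth(path: str) -> int:
    """Number of non-empty path segments."""
    return sum(1 for segment in path.split("/") if segment)


def discriminating_pairs(crawl_order):
    """Find (shallow, deep) pairs where the deeper path is the shorter string.

    Returns tuples (shallow, deep, shallow_fetched_first). True supports
    "crawl by hierarchy"; False supports "crawl by string length".
    """
    position = {path: i for i, path in enumerate(crawl_order)}
    pairs = []
    for shallow in crawl_order:
        for deep in crawl_order:
            if depth(shallow) < depth(deep) and len(shallow) > len(deep):
                pairs.append((shallow, deep, position[shallow] < position[deep]))
    return pairs


# Invented crawl order: a long city URL fetched before a short school URL.
crawl_order = ["/ct", "/ct/saint-bartholomew-township", "/ct/x/1/ps9"]
for shallow, deep, shallow_first in discriminating_pairs(crawl_order):
    print(shallow, deep, "hierarchy" if shallow_first else "length")
```

Only these pairs can separate the two theories; everywhere else depth and length agree, so the rest of the log carries no evidence either way.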

P.S.

I have a client who does lead generation for education, I run their SEO program, and I could make use of this set-up. I'd also be happy to share some SEO advice. If you're interested, get in touch with me: anthony@42metrics.com.

Have a great day.

YTwLhMHV 14y, 11d ago

Given the graph, it is clear to see it is not getting the entire set of urls first and then sorting them, so it is not actually attempting to crawl shortest urls first. It is just crawling based on the structure of the website, which happens to be that further depth links have a longer url, and so it is a breadth-first search as expected. This type of search is completely usual for a crawler, and how I would write one too. The observation that it somehow is crawling the shortest urls first is mistaken, it is just following the structure of the website as you can see in the graph.
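
To make the breadth-first argument concrete, here is a minimal sketch of a breadth-first crawl over a toy link graph. The graph is hypothetical and only mimics a site where each page links to its children; it is not the site discussed in the post.

```python
from collections import deque

# Hypothetical link graph: each page links only to its child pages.
links = {
    "/": ["/ct", "/ny"],
    "/ct": ["/ct/hartford", "/ct/new-haven"],
    "/ny": ["/ny/albany"],
    "/ct/hartford": ["/ct/hartford/1/some-school"],
    "/ct/new-haven": [],
    "/ny/albany": [],
    "/ct/hartford/1/some-school": [],
}


def bfs_order(start):
    """Visit pages level by level using a FIFO queue."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in links.get(page, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


order = bfs_order("/")
depths = [0 if page == "/" else page.count("/") for page in order]
lengths = [len(page) for page in order]
print(order)
print(depths)
print(lengths)
```

In this toy run the depths never decrease, while the string lengths only roughly increase (`/ct/new-haven` is fetched before the shorter `/ny/albany`). A plain breadth-first crawl therefore produces a graph that trends upward in URL length without ever sorting by length.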

Carlos-Solís-rb 14y, 11d ago

If I built a crawler myself, of course I'd start with the shortest URLs first, just to build a simpler iterator (e.g. from "z.com" I'd pass to "aa.com" and so on).

Dave-Stopher-rb 14y, 11d ago

This is some really good info about the google bot.

Google has always said that it prefers shorter URLS!!

Dave

Ian-rb 14y, 11d ago

I've suspected for a long time that Google uses URL length as one 'guess' at site structure. So it might assume shorter URLs are at the 'top' of the hierarchy, and therefore more important.

This is by far the best evidence of this I've seen though. Nice work Russell!!!

Keith-Brown-rb 14y, 11d ago

Site structure could be one reason Googlebot crawls shorter URLs; another would likely be a measure of quality. Often dynamic URLs from search strings and deep navigation are not as important as cleaner, SEO-friendly URLs. Google's click tracking studies have shown time and time again that CTR drops as URL length increases. Google obviously wants to index the most friendly URLs first.

HackerNews-rb 14y, 10d ago

JonnieCache 1 day ago | link

There is probably some highly non-obvious reason that sorting your queue of URLs by length is optimal, which was arrived at after a lot of modelling and testing. We're unlikely to ever know the answer unless someone from google explains it to us. reply

JonnieCache 1 day ago | link

Thinking about it more, it's probably just a breadth-first search. Duh. reply

frisco 1 day ago | link

Breadth-first search means crawling all of the links on the page and adding all of the links on the child pages to the queue at once rather than drilling down on one link-path first before moving to the other links on the first page. There's probably some highly non-obvious reason for crawling by url length. reply

xyzzyz 16 hours ago | link

The thing is, the deeper you are, the longer are the urls, so if you do a breadth first search, you are more likely to visit short urls first. reply

AshleysBrain 1 day ago | link

It's probably just that a short URL is a heuristic for an important site. www.site.com/section is probably more important than www.site.com/section/subsection/detail/page/5/comments. A good move for the crawler - don't get distracted by "deep" pages - try and stick to high level ones first. Edit: this would also encourage webmasters to use short URLs, which benefits users by being easier to remember, too. reply

orijing 20 hours ago | link

How well-known is this to webmasters? If it's not well-known I cannot see how the second part could be. But the first could very much be true, and is what I would've surmised. reply

cma 1 day ago | link

I don't agree with the last point; seems like it would encourage things like foo.com/1hu83FG2 or lead to excessive abbreviation. reply

DrJokepu 1 day ago | link

If I remember correctly, your page ranks better in Google if the search terms are in the URL (even more so if they are in the hostname). reply

AshleysBrain 1 day ago | link

Well, it's just a heuristic. Also, would you consider that level of abbreviation for your own site instead of descriptive names? You do want users to be able to remember URLs if at all possible... reply

arn 1 day ago | link

I noticed this behavior also when I was following Googlebot's crawl of my old pages after I had done a redesign. it's not because of sitemap or because of url structure or because of dynamic content. Mine were blog articles in the same format. This is how it was crawled:
sitename.com/year/mo/day/stub
sitename.com/year/mo/day/stub-one
sitename.com/year/mo/day/stub-one-two
sitename.com/year/mo/day/stub-one-two-three
sitename.com/year/mo/day/stub-one-two-three-four
reply

orijing 20 hours ago | link

The question that I have is whether this is a relative behavior (i.e. whether, for a given domain 'domain.com' Google prioritizes domain.com/short-url over domain.com/longer/url.html) or a global one (i.e. prioritizing short.com/url over very-long-domain.com/nested/pages/hierarchy.html, all else equal). I can definitely see the local/relative effects being a natural consequence of prioritizing by pagerank, but the global part sounds more like a separate signal. Does anyone have insights? reply

esryl 1 day ago | link

The site adheres to a strict url structure. /state/city/id/schoolname - entering from the homepage, the only way to crawl the site 1 level at a time would be crawling the shortest urls first. this structure is also emphasised in the breadcrumbs on every page, the shortest urls are also the ones with the most internal links. why would you crawl the site in any other way? reply

underdown 1 day ago | link

"why would you crawl the site any other way?" I could see crawling pages most likely to have changed first as those pages would most likely lead to fresh content. reply

3 points by foxhop 1 day ago | link

If you look at a particular city page you will notice that the cities are in alphabetical order, however google bot still crawls by length of url... reply

personalcompute 1 day ago | link

I did a more scientific analysis of the googlebot requests in the provided log (graph! http://i.imgur.com/uMoUT.png) and it definitely looks like it is taking shortest urls first. Anyone else with a large site want to check as well for further data? reply

1 point by foxhop 1 day ago | link

Thanks for the graph, I've added it to the blog page reply

meow 1 day ago | link

That's probably because short urls usually tend to be static pages while long ones tend to be dynamically generated content. reply

personalcompute 1 day ago | link

His entire site that he observed this behavior on is static. reply

jules 1 day ago | link

But Google doesn't know that. reply

_grrr 1 day ago | link

A poor man's PageRank algorithm, assuming nothing else, would assign a higher PageRank to shorter site links on a page. Presumably the crawler visits pages with a higher PR first. reply

TuxPirate 1 day ago | link

PageRank determines crawl rate, amongst other things. I find it is also likely that short URLs (especially in the case of a directory-type site) are seen first by the spider and that this order is respected by the crawler (FIFO). You can also ask in #seo on irc.freenode.net; I know there are knowledgeable SEO people in there who might be able to provide you with a decent answer. reply

ignifero 1 day ago | link

Maybe because that's how they are sorted in the hashtable/database they use to queue urls? Or maybe because they want to index the shortest pages first, so that they are processed before any duplicates with longer urls (i.e. get /articles/ before /articles/index.php) reply

georgemcbay 1 day ago | link

I suspect you're on the right track with your first guess. Most people posting here are looking for some sort of deep meaning in this when IMO it is more likely just due to a localized side-effect of doing something such as storing the urls in a trie-like structure and then iterating over it breadth-first. reply

christianwilde 1 day ago | link

I imagine that in this specific case it is because longer URLs represent deeper pages on the site that are less "important" (in terms of internal incoming links and PageRank) than the shorter ones. It doesn't seem logical that Google orders the URLs by length and then crawls them in that order; probably the URL length can be a factor that the bot takes into account, but not the only one in the manner this article suggests :) Anyway, good point, that deserves more testing to extract some conclusions reply

TuxPirate 1 day ago | link

PageRank determines crawl rate, amongst other things (http://techpatio.com/2009/search-engines/google/matt-cutts-g...). I find it is also likely that short URLs (especially in the case of a directory-type site) are seen first by the spider and that this order is respected by the crawler (FIFO). You can also ask in #seo on irc.freenode.net; I know there are knowledgeable SEO people in there who might be able to provide you with a decent answer. reply

abrudtkuhl 1 day ago | link

It starts at the top of your sitemap - which are likely shorter URLs reply

bauchidgw 1 day ago | link

can't confirm this - sitemaps are used for discovery -> the urls listed in the sitemap get pushed into the 'discovered urls queue', then this queue is prioritized for crawling, and, if there are no other factors, the shorter urls get prioritized higher (as there is a bigger chance that a shorter url is a canonical version of a longer url; well, the chance is bigger than the other way round) reply
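
The "discovered urls queue" described in the last comment can be pictured as a priority queue keyed on URL length. The sketch below is a guess at the shape of such a queue, not a description of Google's crawler; the `DiscoveredQueue` class and its methods are hypothetical.

```python
import heapq
from itertools import count


class DiscoveredQueue:
    """Queue of discovered URLs that hands out the shortest URL first.

    Ties are broken by discovery order, so with equal lengths it behaves FIFO.
    """

    def __init__(self):
        self._heap = []
        self._tie_breaker = count()
        self._seen = set()

    def push(self, url: str) -> None:
        if url not in self._seen:  # ignore URLs already discovered
            self._seen.add(url)
            heapq.heappush(self._heap, (len(url), next(self._tie_breaker), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = DiscoveredQueue()
for url in ["/articles/index.php", "/articles/", "/articles/index.php", "/a"]:
    queue.push(url)

popped = [queue.pop() for _ in range(len(queue))]
print(popped)
```

With this ordering `/articles/` is fetched before `/articles/index.php`, which matches the canonical-URL reasoning in the thread: the shorter form is seen first, and the longer duplicate can be recognized afterwards.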

Michael-Martinez-rb 14y, 5d ago

I downloaded your data and found many examples where long URLs were crawled before short URLs. I don't know why you would think there was a pattern. The graphic is cute but doesn't show how the crawlers skipped around your link list rather than follow the pattern on your root URL (which was fetched along with the favicon.ico file many times before anything else was fetched).

You don't indicate when links started appearing to the site, where they appeared, or provide any helpful information that might pinpoint when and how Google learned about the site.

So the only explanation that makes sense for what you have perceived is your relative inexperience in analyzing crawl logs and patterns.

russell 14y, 5d ago

All the pages were created at the same time.

The 'cute' graph (not graphic) displays the length of the resource "get" urls in the order that google bot fetched them. There is obviously a pattern in the graph related to length.

Of course some urls were fetched before shorter ones, but those are just a few chaotic abnormalities in the data. The pattern persists regardless.

Attacking my experience does not help your argument.
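
The disagreement in these two comments (many out-of-order fetches versus an overall pattern) can be put on a numeric footing with a rank correlation between fetch position and URL length. The sketch below computes Kendall's tau-a over a list of lengths in fetch order; the sample list is invented, and nothing here was run against the actual log.

```python
from itertools import combinations


def kendall_tau(lengths):
    """Kendall tau-a between fetch position and URL length.

    +1.0 means strictly shortest-first, -1.0 strictly longest-first,
    values near 0 mean no relationship. Tied lengths contribute zero.
    """
    concordant = discordant = 0
    # combinations() keeps list order, so `earlier` was fetched before `later`.
    for earlier, later in combinations(lengths, 2):
        if earlier < later:
            concordant += 1
        elif earlier > later:
            discordant += 1
    pairs = len(lengths) * (len(lengths) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Invented lengths in fetch order, with a couple of out-of-order fetches.
sample_lengths = [1, 3, 3, 12, 13, 10, 26]
print(round(kendall_tau(sample_lengths), 3))
```

A value close to 1 would mean the exceptions are noise around a real trend; a value close to 0 would support the view that there is no pattern.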

rIsn02pP 157d, 6h ago [edited]

Oh - I typed a longish reply, and then clicked "Watch", causing my reply to get lost.

Anyway - one other possible reason could be that Google wants to report the shortest (often the most relevant) link for a web page, and this way pages automatically get linked to their shortest address.
