Google Exec Says It's A Good Idea: Open The Index And Speed Up The Internet (siliconvalleywatcher.com)
235 points by helwr on May 25, 2011 | 100 comments


It strikes me that the entire article/proposal is based on a faulty premise:

"After all, the value is not in the index it is in the analysis of that index."

The ability for a given search engine to innovate is based on having control of the index. The line between indexing and analysis isn't quite as clean as what is implied by the article, if only for the simple fact that you can only analyze what is in the index.

For example, at its simplest an index is a list of which words are in which documents on the web. But what if I want to give greater weight to words that appear in document titles or headings? Then I need to somehow put that into the index.

What if I want to use the proximity between words to determine relevance of a result for a particular phrase? Need to get that info into the index, too.

In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots. In order for someone to do that, they'd need to charge the bot owners, but the bot owners could just index your content for free, so why would they pay?


Three easy reasons search engine owners /might/ pay for a full copy of the web crawl:

-Faster. You don't have the latency of millions of HTTP connections, but instead a single download. (Or a few dozen. Or a van full of hard drives.)

-Easier. The problem of crawling quickly but politely has been handled for you. The reading of sitemaps has been handled for you. The problem of deciding how deep to crawl, and when to write off a subsite as an endless black hole, has been handled for you. Etc.

-Predictable. Figuring out, in advance, how much it is going to cost you to crawl some/all of the web is, to say the least, tricky. Buying a copy with a known price tag provides a measure of certainty.

Of course, I am leaving out the potential pitfalls, but the point is there /are/ arguments in favor of buying a copy of the web (and then building your own index).


All good points. Maintaining a current and high quality index is clearly not free in terms of development time, bandwidth, storage or a host of other factors.


You are setting up a straw man by saying "at its simplest an index is a list of which words are in which documents". That kind of index (an inverted word index) could be generated from the ideal data format, but it would be stupid to store such a rich data set in such a dumb, information-losing format.

"The index" would just be a list of key/values. The key would be a URL, and the value would be the content located at that URL. There would also be some kind of metadata attached to the keys to indicate HTTP status code, HTTP header information, last-crawled date, and any other interesting data. From this data set, other, more appropriate indexes could be generated(for example via hadoop)

> In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots.

Yes, but not for bots. It would be for algorithms.

> In order for someone to do that, they'd need to charge the bot owners

Probably. The funding could be structured like ICANN's, whose long-term funding comes from groups that benefit from its services.

> but the bot owners could just index your content for free, so why would they pay?

How would you create a copy of the internet for free? Are you just going to run your crawler on your home modem while you're at work? Where are you going to store all that data? How are you going to process it? How long is that going to take? Wouldn't it just be easier to (for example) mount a shared EBS volume in EC2 that has the latest, most up-to-date crawled internet available for your processing?


> "The index" would just be a list of key/values. The key would be a URL, and the value would be the content located at that URL.

They have that index already. It's called "the internet". If you're storing the entire page content, that's not really much of an "index".

> it would be stupid to store such a rich data set in such a dumb, information-losing format.

Discarding information is the entire purpose of indexing and doing things like map-reduce (emphasis on the reduce!): you discard worthless information so that the stuff you care about is of a manageable size.

The reason Google serves up search results quickly is because it carefully controls the data it has to wade through. Adding more data doesn't make it smarter, it makes it slower.


> They have that index already. It's called "the internet". If you're storing the entire page content, that's not really much of an "index".

Fine. Let's call it an archive, or cache, or snapshot then. I only used that phrase because it was thrown around in the post I was responding to, and I specifically put that in quotes because I wasn't sure it was the right phrase to use.

> Discarding information is the entire purpose of indexing and doing things like map-reduce (emphasis on the reduce!)

No. Just no. An index is a data structure that is designed to quickly look up its elements based on some key. It implies nothing about what your data (neither keys nor elements) looks like.

Are you serious about "emphasis on reduce!"? "Reduce" is not named reduce because it inherently reduces the amount of data you are working with. It is called "reduce" because 1) an engineer at Google liked the sound of it, and 2) it takes a list of intermediate values that share the same key. It is quite easy, and common, to have reduce functions which end up spitting out MORE data than you started with.
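As a toy illustration (mine, in Python, in the style of a MapReduce job), a reducer keyed by document that expands a word list into co-occurrence pairs emits more records than it receives:

    from itertools import combinations

    def reduce_cooccurrences(doc_id, words):
        # Reducer: one input key (doc_id) with its list of intermediate values
        # (words), but the output grows roughly quadratically with the input.
        for a, b in combinations(sorted(set(words)), 2):
            yield ((a, b), doc_id)

    # 4 distinct words in -> 6 pairs out:
    # list(reduce_cooccurrences("doc1", ["open", "the", "index", "now"]))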

> you discard worthless information so that the stuff you care about is of a manageable size.

Sometimes, but that doesn't work in this situation. How do you know what data is worthless before you know what algorithms will be applied to the data? The only acceptable solution is to keep a plaintext copy of the data retrieved from a particular URL.


No one seems to remember that Amazon did this 5 years ago: http://arstechnica.com/old/content/2005/12/5756.ars


"The Alexa Web Search web service has been deprecated and is no longer available for new subscriptions." http://aws.amazon.com/alexawebsearch/


Which shows what the demand for it was.


What "index" is this article asking Google to open? The index against which they run actual queries has to be tied to Google's core search algorithms, which I doubt they'd want to make public.

So would they open an "index" of web page contents? In this case, why would another search engine access Google's "index" rather than the original server? The original server is guaranteed to be up to date, and there's no single point of failure.


From my understanding, the premise is that the 'index' generated from each crawled site will be some set of metadata smaller than the site's actual content. So instead of many robots, each crawling through all the data on your site, there could be one bot, which updates a single (smaller) index that all search engines can access.

I agree that Google's index is probably optimized to work with their search algorithm. From what the author claims, though, this doesn't mean that Google would be losing anything by allowing other engines to use the index, as "all the value is in the analysis" of the index.


There's also significant value in knowing when to re-index sites due to changing conditions. For example, if some new iSomething is announced, re-indexing apple.com as well as a number of popular Apple-related news sites would be very helpful in keeping the index fresh. There's also feedback from the ranking algorithm in determining how often to re-index, how deep to index, etc etc.


There are two good reasons to access an index instead of the original server:

1) You won't DOS the host site.

2) You don't have to respect robots.txt. If you need to crawl a 1 million page site, and its robots.txt restricts you to 1 page/second, then you'll have to wait for a long time (a million seconds is roughly eleven and a half days). Downloading a crawl dump from a central repository would be much easier.


> 2) You don't have to respect robots.txt. If you need to crawl a 1 million page site, and its robots.txt restricts you to 1 page/second, then you'll have to wait for a long time.

Yeah, consider a site like Hacker News, where the crawl delay is not 1 second but 30 seconds [1].

If you're trying to grab historical data like iHackerNews did [2], you might be better off hitting Google's webcache instead... except that, to enumerate URLs, you have to scrape the search results pages, AFAIK.

This is why Google opening its webcache via an API is a GREAT idea.

[1] http://news.ycombinator.com/robots.txt

[2] http://api.ihackernews.com/


Tsk, tsk. On a side note, HN's robots.txt is returned with a content-type of text/html.


I think the idea would be to have a global index that sites use to mirror their data for searching.

This database would normally only be accessible to search engines, and the sites themselves could then disallow direct bot crawling in their robots.txt.

It occurs to me that this might have to be "invite only" - Google invites the sites they trust to put their data there but if they catch someone "cheating" in one way or another, they stop indexing. Plus they wouldn't have to invite really small sites.


I guess I don't understand: if someone provides me with a storage cluster of the_whole_internet for free, won't my proprietary_search_algorithm significantly degrade the IOPS and network bandwidth of the storage? Where would it all be? In some Google data center that now anyone can demand colocation in? What happens when I accidentally or maliciously slow down Bing's updates and degrade their quality? And, as others mentioned, what happens when people push data into the index that doesn't represent what they're hosting?

It seems like this would be quite a complex project with a for-the-public-good approach. Maybe it could work as an AWS project to sell Amazon compute cycles.


I suspect the only way to make this work would be for the "index" to actually be some sort of stream. It wouldn't be a "file" or "database". My guess is it would require on the order of hundreds of thousands of dollars' worth of hardware just to receive the stream and run a hello-world map-reduce(-esque) calculation on it. It's your job to turn that stream into a queryable database.

As for your last two questions, there's nothing new whatsoever about them. Search engine pollution is ancient news.


yeah, sure... let's make a system and store all kinds of information there, so people can browse it... it would be great to distribute it around the world, maybe across different companies, and sync the data every day so it stays fresh... I don't know, maybe we can even have every person store their own data on their own private server... but of course, in an open index... </sarcasm>


I can't quite see what you are trying to imply sarcastically.

A rough distributed model could be implemented similar to the way we (hackers/coders) use github as a central repository for a distributed system. People contributing to the index on a private server could do whatever they want, but since that instance of the index is not public, no one else will care about what the owner has done to it. Forks can be pushed to a public staging area where others can view it and verify its accuracy, and then the major players can merge those changes into their forks.

The complaint (with github) that it is hard to figure out the canonical repo is also invalid in this model, as one can start with a fork of Google's or Yahoo's public repo, and then build their own through merging or hacking directly on it, just like one can fork Linus' Linux kernel and then merge in others' forks to incorporate other changes.

Remember, the index itself, as in the raw data taken in by GoogleBot or Yahoo! Slurp bot, would be the shared information. The analysis of the data, as in pagerank and other factors that Google decides makes one page more relevant to a keyword than the other, would not be shared as that is the bread and butter of each engine.


The sarcasm was because the idea precisely describes what we have today: it is called the web.


So the website wikipedia.org is the same as making the files available at http://dumps.wikimedia.org/enwiki/latest/ ?


His is a variant of

"I have a map of the United States... Actual size. It says, 'Scale: 1 mile = 1 mile.' I spent last summer folding it. I hardly ever unroll it. People ask me where I live, and I say, 'E6." - Steven Wright


I'm not sure I understand your point/perspective.


Basically, his point is that the internet is the open index. At least that's what I think I understand.

Centralizing it all so that one company has to run the index would mean that company would have to foot the entire bandwidth bill which they won't do, so it would end up distributed as the internet already is. It's just a shifting of the traffic to a centralized db which would need to be kept up to date.


I don't understand why a decentralized, open, free index is a bad thing. It's not the same as it is today and there's no reason one person would have to carry all the load.


> I don't understand why a decentralized, open, free index is a bad thing.

It's not, it's called the web.


Every town has a library. Let's just close all the bookstores then.


Is anyone here capable of having a conversation instead of jabs, quips or snark? Or at least a good reason why a distributed or copied, shared index is a bad thing for anyone?


Sometimes there's a point to the snark, which is missed by the downvoters.

There's more to an index than "hey, let's share an index". That's WHY we have multiple search engines: there will be differing views of what should be in that shared index, and how it should be constructed to facilitate varying needs. There are multiple search engines because each one plays off "hey, our index is better than theirs".

My snark makes the point: why not just have a single distributed/copied/shared library system, and do away with pricy competition? Answer: because that single system does not provide everything everyone wants, and people are willing to pay for a different selection & service.

The search engine and index are tightly coupled. They're huge, they're complex, they're expensive - and people think they can make a good buck by somehow doing it different. Create a "universal index", and someone will realize they can make money by making their own, leading right back to what we have today.

TFA's key issue isn't really that multiple indexes are crawling his website, it's that they're gathering his data through the most inefficient means possible - polling every page as often as practical. You don't want a universal index, because short of banning competing indexes there will always be competing indexes. You want an agreed-on search interface: a means to serve the indexes what they want at a cost lower than what it costs them to poll-mine your site.

Yeah, a universal index is a bad idea. It ignores the fact (you'd think visitors to ycombinator of all sites would get this) that there is money in competition, ergo there will never be a single index. That, and it still doesn't solve the problem: it may reduce the number of crawlers sucking your bandwidth, but it's still polling every page as fast and often as it can.


I'm not sure we're talking about the same thing. From what I understand "index" is just a copy of the database of sites. There is little to no value in simply having that index. The index is near useless.

Yeah, the more I read your post, the more I don't think you understand the point of the idea. The idea would be that many people could contribute to indexing the web, faster, and sharing that in a decentralized fashion. With open access to that index, anyone is able to innovate on that data and build their own search engine.

Having fewer crawlers would be a good thing and would indeed decrease the number of hits to your website. More importantly, the work of those crawlers could be distributed and thus page updates could happen even faster.

Even if you're right that somehow the indices themselves can be tuned and can be search-engine specific, who cares? That doesn't preclude or prevent an open index from existing.

BTW, I really didn't downvote anyone, nor did I have a problem with your comment. I just didn't understand it, and it was hot on the heels of another indecipherable comment. I think I understand your point now, but I feel we're either talking about two different things or thinking about them in vastly different ways.


He just told you what the web is today. As it was since 1991.


Too bad the upvote number for your comment is invisible. It would be interesting to know how many people are able to see it the same way.

Edit: looking at the other comments, it doesn't seem that many were able to get it. What a disappointing state of mind.


The people who "get it" are just going to have a chortle and move on. Commenting to the effect of "I got it, that was funny" doesn't add anything to the conversation.


I think this is a good idea. The whole idea of people syncing their own data doesn't work, though; it gives too much room for people to fudge their data into the system so it favours them more.

I think the idea is good, though there would be a fight over who gets to be the aggregator of the information. This would also mean whoever distributes it has a stranglehold on the industry in terms of how and when it supplies this information.

I can see its uses, but I can equally see a lot of cons for the system not working, or some serious amount of antitrust.

If you could get an unbiased 3rd party involved, though, and they built the database, then I think that would work.


For the record, the Google exec (Berthier Ribeiro-Neto) is the co-author of "Modern Information Retrieval" [1], an excellent book and close to a standard text on IR.

[1] http://www.amazon.com/Modern-Information-Retrieval-Ricardo-B...


I can second the recommendation of that book: I've heard a lot of good things, though I haven't read it myself. It's recently been updated in a 2nd edition [1], though I have no idea if there are substantive changes; presumably there are, given that more than a decade has elapsed. If anyone's read the updated version, I'd appreciate knowing if and/or how the book's changed, as I've been thinking about picking it up.

I have read pretty big sections of Manning's Introduction to IR, and it served me fairly well as an introduction to the field. It's available online.[2]

[1] http://www.amazon.com/Modern-Information-Retrieval-Concepts-...

[2] http://nlp.stanford.edu/IR-book/information-retrieval-book.h...


We're talking about the cache, right? The index, or more likely indices, are optimized data structures used to search the cache. I doubt Google could share those without revealing too much about their ranking algorithm.

Letting sites inject into the cache is an interesting idea, but Google will still have to spider periodically to ensure accuracy. Inevitably, a large number of sites will just screw it up, because the internet is mostly made of fail. This would leave Google with only bad options: If they delist all the sites to punish them, they leave a significant hole in their dataset. But if they don't punish them and just silently fix it by spidering, there is no longer any threat to keep the black hat SEOs in check. Either way, it would cause an explosion in support requirements and Google is apparently already terrible at that.


I think the idea was that only Google will crawl your site and update the index, then the rest of the search engines will use the index instead of hitting your site.


"Each of these robots takes up a considerable amount of my resources. For June, the Googlebot ate up 4.9 gigabytes of bandwidth, Yahoo used 4.8 gigabytes, while an unknown robot used 11.27 gigabytes of bandwidth. Together, they used up 45% of my bandwidth just to create an index of my site."

I don't suppose anyone has considered making an entry in robots.txt that says either:

last change was : <parsable date>

Or a URL list of the form

<relative_url> : <last change date>

There are a relatively small number of robots (a few tens, perhaps) which crawl your web site, and all of the legit ones provide contact information either in the user-agent header or on their web site. If you let them know you had adopted this approach then they could very efficiently not crawl your site.

That solves two problems:

@ web sites that sit at the back end of ADSL lines but don't change often wouldn't have their bandwidth chewed up by robots,

@ The search index would be up to date, so if someone who needed to find you hit that search engine they would still find you.


You just described site maps: http://www.sitemaps.org/protocol.php
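For reference, a minimal sitemap in the sitemaps.org format carries exactly that per-URL last-change information, and robots.txt can point crawlers at it with a "Sitemap:" line (the URL and date below are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/</loc>
        <lastmod>2011-05-20</lastmod>
        <changefreq>weekly</changefreq>
      </url>
    </urlset>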


Nice, I knew it made too much sense not to have already been done in some form. I wonder if Google and Bing respect it.


Google _invented_ it.


snarktastic response: But that doesn't answer the question :-)

Google invents lots of things, and I suspect you are correct that they honor sitemaps, but they also abandon things they once thought were the answer (QR codes, Wave, etc.). Since I have my own CMS for web pages, I'll try adding this to its output and see what GoogleBot does with it.


I'm curious where you got the idea that Google has "abandoned" QR codes.


http://lmgtfy.com/?q=google+abandons+qr+codes

Not saying it's an accurate summary, but I don't think you were actually very curious!


There's already a place for doing this in the HTTP protocol. I would assume that crawlers respect it, if provided, although I haven't tested to verify my expectation.


Years ago, Googlebot would send If-Modified-Since headers, and Apache would honor them.

I ran into this by chance when writing a wrapper to obfuscate e-mail addresses in mailing list archives. I didn't change the URL but had it served by a script instead of being a flat file. When it first went online, all of the robots kept crawling the files over and over. I finally made it supply the right mtime to Apache, which then did the right thing with the incoming IMS header, generating a 304 and not sending out new content.

It's possible this has regressed, but I would hope it hasn't.
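A quick way to spot-check whether a server still honors it, as a minimal sketch with Python's standard library (host, path, and date are placeholders):

    import http.client

    conn = http.client.HTTPConnection("example.com")
    conn.request("GET", "/some-page.html",
                 headers={"If-Modified-Since": "Wed, 25 May 2011 00:00:00 GMT"})
    resp = conn.getresponse()
    # 304 Not Modified: the server honored the conditional request and sent no body.
    # 200 OK: it re-sent the whole page regardless.
    print(resp.status, resp.reason)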


I have a newly registered domain with only a sparse page up as the index so far. It's been getting crawled fairly regularly by Google, Baidu and Yahoo. Google and Baidu are sending If-Modified-Since (Baidu is also sending If-None-Match) and are receiving 304 Not Modified responses each time they crawl. Yahoo sends neither header and is requesting the full page every single time. This is without any explicit cache headers set on my end.


That is to be expected. `If-None-Match` and `ETag` are a relatively late caching strategy, one that is handled on the server (or edge) side.

Have you tried serving your pages with `Expires` and `Cache-Control` headers? If you give them - say - a timeout of a week, then a well-behaving client shouldn't retry before that time has gone by.
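For example, response headers along these lines tell a well-behaved client not to re-fetch for a week (the date is purely illustrative):

    Cache-Control: public, max-age=604800
    Expires: Wed, 01 Jun 2011 12:00:00 GMT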


A couple of thoughts come to mind:

1. If I were Microsoft, I wouldn't trust Google's index. How do I know they aren't doing subtle things to the index to give them an advantage?

2. Having the resources to keep a live snapshot of the web is one of the big players' advantages. Opening the index, while good for the web, would not necessarily be good for the company. Google could mitigate that by licensing the data: for data more than X hours old, you get free access; for data newer than that, you pay a license fee to Google. Furthermore, integrate the data with Google's cloud hosting to provide a way to trivially create map/reduce implementations that use the data.

3. On the other side, what a great opportunity the index could provide for startups. Maintaining a live index of the web is costly and getting more and more difficult as people lock down their robots.txt. Being able to immediately test your algorithms against the whole web would be a godsend for ensuring your algorithms work with the huge dataset and that your performance is sufficient.

Here's to hoping Google goes forward with it!


The first step would be for some top companies (Google, Yahoo...) to share the index. That way, there would be some speed up of the internet, and the index would not be open to abuse by arbitrary people/companies.


The author should use something like "crawl data" instead of "index". An index is the end result of analyzing crawled web pages.

It's a cool idea though, because Yahoo sucks up a ton of my bandwidth and delivers very little SEO traffic. On most of my sites now I have a Yahoo-bot-specific Crawl-delay in robots.txt of 60 seconds, which pretty much bans them.
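For anyone curious, the robots.txt stanza for that looks roughly like this (Slurp is Yahoo's crawler; Crawl-delay is a non-standard directive that Yahoo and Bing honor but Google ignores):

    User-agent: Slurp
    Crawl-delay: 60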


Maybe each site should be able to designate who indexes it, and robots can get that index from that indexer. Let the indexers compete. Let each site decide how frequently it can be indexed. Allow the indexer that gets the business to use the index immediately, with others getting access just once a day. Perhaps a standardized raw sharable index format could be created, with each search company processing it further for their own needs after pulling it.

And let the site notify the indexer when things change, so all the bandwidth isn't used looking for what's changed. Actual changes could make it in to the index more quickly if the site could draw attention to it immediately rather than an army of robots having to invade as frequently as inhumanly possible. The selected indexer could still visit once a day or week to make sure nothing gets missed.


Google would never do this.

Their attitude is to take everything in but not to let you automate searches to get data out.

This is the biggest problem I have with search engines - you want to deep index all my sites? Fine, but you better let me search in return - deeper than 1000 results (and ten pages). Give us RSS, etc.


The whole article is about information in its rawest form and nothing to do with searchable content.

You would write something that takes the information they are referring to in this article; it's how you digest and index that information yourself that makes the difference.


> This is the biggest problem I have with search engines - you want to deep index all my sites? Fine, but you better ...

For _most_ websites, it's in _their_ interest to have a good SERP ranking, not the other way around.


"Index" is the wrong word. He's not calling for Google to open up their index, but rather open their webcache.


This article is about a year old. [July 2010]


It strikes me that both in the article and in most comments people have no idea of what they are talking about, and yet they boldly carry on.

"The index"? Feature extraction is the most complex part of almost any machine learning algorithm, and search is no different. Indexing full text documents is a really difficult task, especially if you take inflected languages into account (English is particularly easy).

I don't see a way to "open the index" without disclosing and publishing a huge amount of highly complex code, that also makes use of either large dictionaries, or huge amounts of statistical information. It's not like you can just write a quick spec of "the index" and put it up on github.

FWIW, I run a startup that wrote a search engine for e-commerce (search as a service).


I don't think it's quite that simple. The index that Google serves search query results from is a direct result of the algorithms they've applied to the data the Googlebot has gathered. If by 'index' the author means the data the Googlebot (for example) has downloaded from the internet, that's quite a bit different, but still probably serves the purpose the author is looking for. The index is a highly specialized representation of all the data they've collected.


> If by 'index' the author means the data the Googlebot (for example) has downloaded from the internet.

That's what the author means by 'index'.


Does it seem naive to anyone else to allow site owners to update the index and stop spidering? First, lots of people, for various reasons (ignorance, security through obscurity), would just not update it and stuff would fall out of search. Second, this seems incredibly ripe for abuse. As if we don't have enough search result spam problems already, letting spammers have more direct access to the content going into their rankings seems like a truly bad idea.


When spiders use more bandwidth than customers, your website must not be very popular. It implies that each page is viewed only a handful of times / month on average.


It's also possible they have a huge amount of content that is sparsely accessed by a large number of users. But in general I agree.

Edit: SmugMug seems to fall into this category: http://don.blogs.smugmug.com/2010/07/15/great-idea-google-sh...

Also interesting:

And if you think about it, the robots are much harder to optimize for – they’re crawling the long tail, which totally annihilates your caching layers. Humans are much easier to predict and optimize for.


Good article. The fact is, the index itself isn't worth nearly as much as the algorithms. Heck, open the index, and let anyone add to it. MSN, Yahoo, Bing, anyone... let them add to that single index and make the index awesome, and then anyone can try their hand at making a great search algorithm. If each company really thinks their search algorithm is better than everyone else's, this is competition at its best.


It might not be as valuable as ranking algorithms, but that doesn't mean it has no value at all. A deeper and more timely index is a clear competitive advantage, the former for long tail queries and the latter for topical ones. What would be the benefit for Google to let Microsoft leech off that effort? And if it for some reason was done, what would be the point of continuing further development of indexing quality?

I think a more correct title for this post would have been "Google exec tries to politely dismiss a silly idea", but of course that's not as punchy.


TL;DR - A lot of traffic on the Internet comes from search engine bots like Google's and Yahoo's indexing pages. If Google's index were open, search engines could share each other's resources and not have to repeatedly spider pages. This would significantly boost traffic speed, and the idea was even supported by Larry Page, one of Google's co-founders. Page initially resisted Google going commercial.


The title is a bit misleading. The author suggested it and the Google Brazil head supported it and said 'You should write a position paper on it'.


What format would the indexes be made available in? Raw lists of URLs and caches of the HTML pages, or pre-built inverted indexes, PageRank data, etc?

If it's the former all this really does is move the burden from sites to Google, and introduces a single point of failure.

If it's the latter, which seems unlikely, what incentive does Google have to share that data? It's part of their competitive advantage.


Google has a ton of private data in their index that should not have been indexed. It's just that no one has thought to search for it yet. (See: http://en.wikipedia.org/wiki/Johnny_Long)

A single public index would expose this data to stronger analysis (or even plain reading), not just Google search queries.


I don't know if this is naive, but wouldn't the data model/storage strategy of the index be influenced by the ranking algorithms that use it? If that's the case, then I would presume Google's index tries to store the data in a form that's efficient for their ranking algorithm to work off of, and it might not be in the best format for, say, Bing/Yahoo to use.


I think they are referring to the data in its rawest format before they have indexed and ranked the information themselves. They will all crawl the information in exactly the same way. They will just take the plain text and store it. I don't think any bot would actually do anything else with the data on the fly.

If you think about it, it does make sense in a lot of respects. I have dealt with a lot of companies that sell data, the only difference is this data is freely available to everyone so everyone thinks they should crawl the information themselves.

The only people who lose out are the people paying the bandwidth bills. The internet would actually be slower due to the amount of information passing around when it is not needed.

This idea makes more sense the more we discuss it


Yes, it would be fair to assume that index optimization is also part of the Secret Sauce, unless you store the raw data. Storing the raw data also requires Secret Sauce like the Google File System, and you'll end up with the sarcastic comment above that the Internet is the raw data, and we're back at square one.


Centralization [of search index] has significant overhead.

Bandwidth is not nearly as expensive as the overhead of such search index centralization.


Even if it would be beneficial to the whole internet, if Google did that it would be like giving an advantage to all of Google's competitors: they wouldn't need to solve the crawling problem. It may not be algorithmically gorgeous, but it's still one problem fewer. It would be fun though; we could buy a tarball of the whole internet ;)


Also, why not use a single base station at each location for all mobile service providers, rather than having multiple 3G base stations for each provider, polluting our radio space? I think when there is competition, there are always multiples of something; it's just a fact of an open market and we may have to live with it.


In the end, one single company would control the internet. I hope this is not something you want. Like Twitter controls Twitter and only opens up their data to Gnip, etc...

This won't be accepted. And even legally, this is not possible due to copyright laws in different countries.


I have a potentially stupid question. When the author says "45% of my bandwidth", does he mean 45% of a QUOTA? Or actually 45% of the pipe is being used?

If it's the former, this seems like it wouldn't help speed at all.


He's saying if he could optimize away all the requests by crawlers and robots, he'd have half as much work to do. Doing twice as much work costs more -- on the server side to serve it up, and on the network side to transfer it.

It's not literally a speed comparison, but you can imagine how if you could optimize away all the robot requests, you could do something else with the time and resources you'd formerly spent serving them.

The smugmug addendum to the story explains this pretty clearly.


45% of his total monthly bandwidth used. So say he's getting 50 gigs of traffic a month, the bots make up 22.5 gigs of that


OK. So how does this speed anything up?

EDIT: To clarify, how does reducing the amount of bandwidth used speed up anything? Why am I being downvoted for this?


Google doesn't know when you will update content on your site so it has to almost constantly hit it with crawler bots. The idea would be to move towards a push-based model wherein sites could push updates to an open index instead of waiting for bots to crawl the site and use up extra bandwidth.

Of course, if this is the primary issue, I don't see why Google couldn't just implement a closed push-based index. When your site updates, you push the changes to Google. The index is still closed but it solves the bandwidth problem without opening Google resources.
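A push along those lines partially exists already via sitemap "pings"; a rough sketch, assuming an engine exposes a ping endpoint that accepts a sitemap URL (the endpoint and URLs here are illustrative, not a documented guarantee):

    import urllib.parse
    import urllib.request

    def ping_search_engine(ping_endpoint, sitemap_url):
        # Tell the engine "my sitemap (and therefore my site) changed; re-crawl
        # when convenient" instead of waiting for the next scheduled bot visit.
        query = urllib.parse.urlencode({"sitemap": sitemap_url})
        with urllib.request.urlopen(ping_endpoint + "?" + query) as resp:
            return resp.status  # 200 generally means the ping was accepted

    # ping_search_engine("http://www.google.com/ping",
    #                    "http://example.com/sitemap.xml")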


Google wouldn't trust the sites to do this correctly. So even if they did provide some push mechanism I think they'd still send bots to your server to be sure.


That's quite possible, but in that case an open index would be even more useless since Google would just have to duplicate it internally anyway.


It seems like the web server should be able to report what has changed since a timestamp. The first step when crawling a site would be to do something like a lightweight HTTP status request; if that succeeds, the bot could merely crawl the pages that have changed. Or if the bot did not trust the site, it could always do the full crawl and verify the result.


That sounds rather like the DNS "NOTIFY" signal, which a master sends to its slaves (secondary nameservers) to trigger a zone transfer when DNS data has been updated. Without that, secondary servers are forced to poll for the zone data, e.g. every hour.


It speeds up the web because we can't keep our entire data set in cache -- "we" being anyone with a large data set and "can't" being defined by available hardware (costs) and time (lost working on "real" problems). Google will crawl several million pages a day, which can be quite costly to keep hot in a cache -- not to mention, again, the resources, time, and money not being spent on improving general speed.

Considering there are dozens of bots that will crawl one's site in "random" order, they all begin to wreak havoc on caches, as the robots and the humans don't browse in a uniform way.


You are right. I guess the idea is that it would speed up the internet because less traffic would yield less congestion, or faster server responses.

People who downvoted you either didn't understand, or think your comment is not pertinent.


His hope is that sites would only need to be indexed once. It would only be the 4.9GB for Google and he would save the 4.8GB used by Yahoo and the 11.27GB used by the other search engines.


The Google guy does not work in the head office, and isn't in charge of policy. He doesn't understand that search -> ads is what earns Google its riches.


If Google and a few other companies can charge some multiple of what it costs to index a site, then it could even be a money making prospect for them.


Well, on one level, it's a great idea. On another, it gives Google the keys to the entire freaking web.


Part of the secret that makes any search engine unique is the knowledge that a site at x.com exists, and that there is a forum at x.com/forums which is not visible by simply crawling from the root of x.com. On the other hand, I would love an open web cache for my work.


While I think this is a really cool idea, for some reason the word hiybbprqag comes to mind. :)


The reason Google (and Bing and Yahoo and Yandex and and and) is in the position they are in is because they have the bandwidth and computational power to crawl and index the web with the speed and reach necessary for it to be useful. They aren't going to just start giving that away any time soon...


There are protocols for bots. Not all of them follow those protocols... so block requests from the ones that don't.

Problem solved... like a million internet years ago.


How do you get indexed, then? Because I can't see how this solves anything.

Isn't the proposal clear enough?

1. Optimize the indexing process so that we avoid each search engine crawling every site independently.

2. Devise a method to refresh the index when the content changes (hash, date...)

Seems reasonable enough to me.


I assume he means to block all search engines except the big guys: Google, Bing, Yahoo. Anyone else has too little impact. Not sure how one could do that, it's not like a request tells me "hey there, I'm a robot! Let me in?"


Most actually do, via the user-agent string in the request. You can't stop a malicious bot this way, but you could kill most bot traffic with a rewrite rule, presuming robots.txt isn't good enough.
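As a sketch of that kind of rule in Apache mod_rewrite (the bot names are placeholders for whichever crawlers you decide to refuse):

    RewriteEngine On
    # Match the claimed crawler name in the User-Agent header, case-insensitively.
    RewriteCond %{HTTP_USER_AGENT} (ExampleBot|SomeOtherCrawler) [NC]
    # Return 403 Forbidden and stop processing further rules.
    RewriteRule .* - [F,L]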




