
Wednesday, May 20, 2015

A Relevant Tale: How Google Killed Inktomi

On March 20th, 2000, Inktomi had a market capitalization of $25 billion. As a relatively early employee, I was a multimillionaire on paper. Life was good. Over the next year and a half the stock fell by 99.9%. In the end, Inktomi was acquired by Yahoo for $250 million. What happened? Among other things, Google. Grab some popcorn and enjoy this story.

Inktomi was the #1 search engine in the world for a while. When I joined we had just won the Yahoo contract and were serving search results for HotBot (there is still a search page there!). At first I worked on developing crawling and indexing tools written in C++. Our main goal at the time was to grow our index size, and at the same time to improve relevance. It became clear that as our document base grew, relevance would play a more important role. For ten million documents you may be able to filter out all but a handful of documents with a few well-chosen keywords. In that case any relevance algorithm would do; your desired result would be present in the one and only result page. You wouldn't miss it. For a billion documents, however, the handful would become hundreds or thousands. Without a good relevance algorithm, your desired result might be on page 17. You'd give up before getting to it.
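To put rough numbers on that (the match rate below is an illustrative assumption, not anything we measured):

```cpp
#include <cstdio>

int main() {
    // Assume a query's keywords match roughly 1 in a million pages.
    // The selectivity is an illustrative assumption, not an Inktomi figure.
    const double selectivity = 1e-6;
    std::printf("10M-page index: ~%.0f matching docs\n", 1e7 * selectivity); // ~10
    std::printf("1B-page index:  ~%.0f matching docs\n", 1e9 * selectivity); // ~1000
    // At ten results per page, 1,000 matches spread over 100 result pages;
    // without good ranking, the one you want may well sit on page 17.
    return 0;
}
```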

At first we were using a classic tf-idf-based model, enhanced by emphasizing certain features of pages or URLs that correlated with "goodness." For example, yahoo.com is probably more relevant to the query "yahoo" than yahoo.com/some/deep/page.html; we thought shorter URLs were better. Of course this query was very popular, so spammers started creating pages stuffed with the word Yahoo. This was the beginning of an arms race that continues today. Back then we were the main target because we processed more searches than anyone else.
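A minimal sketch of a score along those lines might look like the following; the exact formula and the URL-length heuristic are my reconstruction of the idea, not Inktomi's actual ranking code:

```cpp
#include <cmath>
#include <string>

// Illustrative tf-idf score with a URL-length heuristic bolted on:
// shorter URLs get a larger boost, so yahoo.com would outrank
// yahoo.com/some/deep/page.html for the query "yahoo".
// Assumes termFreq >= 1 for documents that contain the term at all.
double score(int termFreq, int docFreq, int totalDocs, const std::string& url) {
    double tf  = 1.0 + std::log(static_cast<double>(termFreq));
    double idf = std::log(static_cast<double>(totalDocs) / (1.0 + docFreq));
    double urlBoost = 1.0 / std::log(2.0 + static_cast<double>(url.size()));
    return tf * idf * urlBoost;
}
```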

[Photo: an Inktomi mug]

Enter The Google

Yahoo had been complaining to us for a while about not being result #1 for the query "yahoo". We fixed that special case, but we couldn't do the same for many other sites or pages. In 1999 Google was gaining popularity because they were solving exactly this problem. We didn't perceive them as a threat yet, but we did realize that we had to do our own version of PageRank. I was assigned to that task.

My small contribution to improving our relevance was coming up with a simple formula to take into account the occurrences of words in links pointing to pages. The insight was realizing that this followed a power law: at the time yahoo.com had about 1M instances of the word "yahoo" in links pointing to it. Nobody else came close. Other Yahoo properties had an order of magnitude fewer, and then came a long tail of other sites. I decided to use the logarithm of the count as a boost for the word in the document (sketched after the list below). This wasn't as sophisticated as PageRank (we'd get to that later), but it was a huge improvement. Our relevance got much better over time as others spent countless hours implementing our own link analysis algorithms. We had a clear mandate from the execs; our priorities at search were:

1) relevance

2) relevance

3) relevance
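
Here is roughly what that log boost looks like in code (a sketch of the idea, not the original implementation):

```cpp
#include <cmath>
#include <cstdio>

// Log-scaled anchor-text boost: because anchor counts follow a power
// law, taking the log keeps yahoo.com's ~1M "yahoo" anchors from
// drowning out everything else in the score.
double anchorBoost(long anchorCount) {
    return anchorCount > 0 ? std::log(static_cast<double>(anchorCount)) : 0.0;
}

int main() {
    std::printf("yahoo.com (1,000,000 anchors): %.1f\n", anchorBoost(1000000)); // ~13.8
    std::printf("other property (100,000):      %.1f\n", anchorBoost(100000));  // ~11.5
    std::printf("long-tail site (100):          %.1f\n", anchorBoost(100));     // ~4.6
    return 0;
}
```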

Doug Cook built a tool to quickly measure the relevance effects of algorithmic changes, based on precomputed human judgments. For example: it was clear that yahoo.com was the definitive result for the query "yahoo," so it would score a 10. Other Yahoo pages would be ok (perhaps a 5 or 6). Irrelevant pages stuffed with Yahoo-related keywords would be spam, and humans would give them a negative score if they showed up for that query. Given ten results and a query, we could instantly evaluate the goodness of the results based on the human rankings.
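A toy version of that evaluation might look like this; the 1/rank position discount is my assumption, since the actual weighting scheme isn't described here:

```cpp
#include <cstdio>
#include <vector>

// Toy relevance metric: sum human judgments for the top ten results,
// discounted by rank so a good hit at position 1 counts more than one
// at position 10. The 1/rank discount is an assumption for illustration.
double evaluate(const std::vector<double>& judgments) {
    double total = 0.0;
    for (std::size_t i = 0; i < judgments.size(); ++i) {
        total += judgments[i] / static_cast<double>(i + 1);
    }
    return total;
}

int main() {
    // Query "yahoo": yahoo.com scores 10, other Yahoo pages 5-6,
    // a keyword-stuffed spam page gets a negative judgment.
    std::vector<double> results = {10, 6, 5, -5, 0, 0, 0, 0, 0, 0};
    std::printf("score: %.2f\n", evaluate(results)); // higher is better
    return 0;
}
```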

We had a sample corpus of links and queries for which we could run this test as often as we wanted, and compare ourselves against Google. We did this for months until it became clear that we were “as good as Google.” Our executives were happy.

Relevance Is Only So Relevant

I thought about why I was using Google myself, and I'm sure it's obvious to everyone now: the experience was superior.

  • Inktomi didn’t control the front-end. We provided results to our customers via our API, which added latency. In contrast, Google controlled the rendering speed of their results.
  • Inktomi didn’t have snippets or caching. Our execs claimed that we didn’t need caching because our crawling cycle was much shorter than Google’s. Instead of snippets, we had algorithmically generated abstracts. Those abstracts were useless when you were looking for something like new iPad screen resolution: an abstract wouldn’t show you that it’s 2048×1536; you’d have to click a result.
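
The difference is easy to sketch in code: a query-biased snippet is cut from the cached page text around the query terms, so the answer is visible on the results page itself. This is an illustration of the technique, not how either engine implemented it:

```cpp
#include <cstdio>
#include <string>

// Cut a window of cached page text around the first query-term hit.
// When the term isn't found, fall back to the start of the page, which
// is roughly what a static, query-independent abstract gives you.
std::string snippet(const std::string& page, const std::string& term,
                    std::size_t window = 120) {
    std::size_t pos = page.find(term);
    if (pos == std::string::npos) return page.substr(0, window);
    std::size_t start = pos > window / 2 ? pos - window / 2 : 0;
    return page.substr(start, window);
}

int main() {
    std::string page = "... the new iPad ships with a screen resolution of "
                       "2048x1536 pixels at 264 ppi ...";
    // The answer ("2048x1536") lands inside the snippet, no click needed.
    std::printf("%s\n", snippet(page, "resolution").c_str());
    return 0;
}
```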

In short, Google had realized that a search engine wasn’t about finding ten links for you to click on; it was about satisfying a need for information. For us engineers who spent our days thinking about search, this was obvious. Unfortunately, we were unable to sell this to our executives. Doug built a clutter-free UI for internal use, but our execs didn’t want to build a destination search engine to compete with our customers. I still have an email in which I outlined a proposal to build a snippets and caching cluster, which was nixed because of costs.

Are there any lessons to be learned from this? For one, if you work at a company where everyone prefers a competitor’s product over your own, be very worried. If I were an executive at such a company I would follow Yoda’s advice: “Do or do not. There is no try.” If you’re not willing to put in the effort to compete, you might as well cut your losses (like Google did with Buzz, for example).

