Sun, Mar. 7th, 2010, 12:51 am
Wired Exclusive: How Google’s Algorithm Rules the Web

http://www.wired.com/magazine/2010/02/ff_google_algorithm/


Team Bing has been focusing on unique instances where Google’s algorithms don’t always satisfy. For example, while Google does a great job of searching the public Web, it doesn’t have real-time access to the byzantine and constantly changing array of flight schedules and fares. So Microsoft purchased Farecast — a Web site that tracks airline fares over time and uses the data to predict when ticket prices will rise or fall — and incorporated its findings into Bing’s results. Microsoft made similar acquisitions in the health, reference, and shopping sectors, areas where it felt Google’s algorithm fell short.

Even the Bingers confess that, when it comes to the simple task of taking a search term and returning relevant results, Google is still miles ahead.

...

The story of Google’s algorithm begins with PageRank, the system invented in 1997 by cofounder Larry Page while he was a grad student at Stanford. Page’s now legendary insight was to rate pages based on the number and importance of links that pointed to them — to use the collective intelligence of the Web itself to determine which sites were most relevant. It was a simple and powerful concept, and — as Google quickly became the most successful search engine on the Web — Page and cofounder Sergey Brin credited PageRank as their company’s fundamental innovation.
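As a rough illustration of rating pages by "the number and importance of links that pointed to them," here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production code. The damping factor of 0.85, the iteration count, and the four-page toy_web graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not values
    taken from the article.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank everywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy Web: three pages link to "c", and "c" links to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
```

In this toy graph "c" ends up with the highest rank because three pages link to it, and "a" outranks "b" and "d" because its single incoming link comes from the important page "c."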

But that wasn’t the whole story. “People hold on to PageRank because it’s recognizable,” Manber says. “But there were many other things that improved the relevancy.” These involve the exploitation of certain signals, contextual clues that help the search engine rank the millions of possible results to any query, ensuring that the most useful ones float to the top.

Web search is a multipart process. First, Google crawls the Web to collect the contents of every accessible site. This data is broken down into an index (organized by word, just like the index of a textbook), a way of finding any page based on its content. Every time a user types a query, the index is combed for relevant pages, returning a list that commonly numbers in the hundreds of thousands, or millions. The trickiest part, though, is the ranking process — determining which of those pages belong at the top of the list.
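The "index organized by word" described above is usually called an inverted index. The following is a minimal sketch of the idea, assuming whitespace tokenization and a three-document toy collection. The names build_index and search are made up for the example, and a real search index is far more elaborate.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical crawled pages.
docs = {
    "page1": "cheap flights to boston",
    "page2": "boston hot dog stands",
    "page3": "hot water boiling tips",
}
index = build_index(docs)
matches = search(index, "hot dog")
```

This step only finds the candidate pages. Deciding which of them belongs at the top is the separate ranking problem the article turns to next.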

That’s where the contextual signals come in. All search engines incorporate them, but none has added as many or made use of them as skillfully as Google has. PageRank itself is a signal, an attribute of a Web page (in this case, its importance relative to the rest of the Web) that can be used to help determine relevance. Some of the signals now seem obvious. Early on, Google’s algorithm gave special consideration to the title on a Web page — clearly an important signal for determining relevance. Another key technique exploited anchor text, the words that make up the actual hyperlink connecting one page to another. As a result, “when you did a search, the right page would come up, even if the page didn’t include the actual words you were searching for,” says Scott Hassan, an early Google architect who worked with Page and Brin at Stanford. “That was pretty cool.” Later signals included attributes like freshness (for certain queries, pages created more recently may be more valuable than older ones) and location (Google knows the rough geographic coordinates of searchers and favors local results). The search engine currently uses more than 200 signals to help rank its results.
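One simple way to picture how several signals could be combined is a weighted sum. The sketch below is only that: the article does not say how Google combines its signals, and the weights, the page records, and the score function are all assumptions made for the example. It does reproduce the effect Hassan describes, where a page ranks well because of the anchor text of links pointing at it even though its own title lacks the query words.

```python
def score(page, query_terms, weights):
    """Combine three illustrative signals into one number.

    page is a dict with 'pagerank', 'title', and 'anchor_texts' keys.
    weights maps signal names to hypothetical importance values.
    """
    terms = [t.lower() for t in query_terms]
    if not terms:
        return 0.0

    title_words = set(page["title"].lower().split())
    anchor_words = set(
        word
        for text in page["anchor_texts"]
        for word in text.lower().split()
    )

    # Fraction of query terms found in the title and in incoming anchor text.
    title_match = sum(t in title_words for t in terms) / len(terms)
    anchor_match = sum(t in anchor_words for t in terms) / len(terms)

    signals = {
        "pagerank": page["pagerank"],
        "title": title_match,
        "anchor": anchor_match,
    }
    return sum(weights[name] * value for name, value in signals.items())


# Hypothetical weights and pages, chosen only to make the example concrete.
weights = {"pagerank": 1.0, "title": 0.5, "anchor": 1.0}
page_a = {
    "url": "acme.example",
    "title": "Acme Vehicle Hire",
    "anchor_texts": ["car rental", "cheap car rental deals"],
    "pagerank": 0.3,
}
page_b = {
    "url": "blog.example",
    "title": "My car blog",
    "anchor_texts": [],
    "pagerank": 0.4,
}
query = ["car", "rental"]
ranked = sorted(
    [page_a, page_b],
    key=lambda p: score(p, query, weights),
    reverse=True,
)
```

Here page_a comes first for "car rental" even though neither word appears in its title, because other sites link to it with those words.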

Google’s engineers have discovered that some of the most important signals can come from Google itself. PageRank has been celebrated as instituting a measure of populism into search engines: the democracy of millions of people deciding what to link to on the Web. But Singhal notes that the engineers in Building 43 are exploiting another democracy: the hundreds of millions who search on Google. The data people generate when they search (what results they click on, what words they replace in the query when they’re unsatisfied, how their queries match with their physical locations) turns out to be an invaluable resource in discovering new signals and improving the relevance of results. The most direct example of this process is what Google calls personalized search, a feature that uses someone’s search history and location as signals to determine what kind of results they’ll find useful. But more generally, Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.

Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
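The observation that "people change words in their queries" can be sketched as counting single-word substitutions between consecutive queries in a session. This is a toy under stated assumptions, not Google's method: the session data is invented, the function names are made up, and the threshold of two occurrences is arbitrary.

```python
from collections import Counter


def rewrite_pairs(sessions):
    """Count word pairs swapped between consecutive queries in a session.

    Only rewrites that keep the same number of words and change exactly
    one of them are counted.
    """
    counts = Counter()
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            a = before.lower().split()
            b = after.lower().split()
            if len(a) != len(b):
                continue
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                # Order does not matter: dogs->puppies and puppies->dogs agree.
                counts[frozenset(diffs[0])] += 1
    return counts


def synonym_candidates(counts, min_count=2):
    """Keep pairs seen at least min_count times (an arbitrary cutoff)."""
    return sorted(
        tuple(sorted(pair)) for pair, n in counts.items() if n >= min_count
    )


# Invented search sessions; each inner list is one user's successive queries.
sessions = [
    ["pictures of dogs", "pictures of puppies"],
    ["funny dogs", "funny puppies"],
    ["pictures of puppies", "pictures of dogs"],
    ["boil water", "hot water"],
    ["cheap flights", "cheap flights boston"],
]
counts = rewrite_pairs(sessions)
candidates = synonym_candidates(counts)
```

Counting substitutions this way treats every swap as evidence of interchangeability, which is the kind of naive pairing that could lead to the "hot dog" problem described next.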

But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
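The idea of analyzing "what words were close to each other" can be illustrated with simple co-occurrence counts. The sketch below chooses between two expansions of "bio" by checking which one appears near the other query word more often in a small corpus. The corpus, the window of three words, and the function names are assumptions for the example, and the article gives no detail on how Google's actual system works.

```python
from collections import Counter


def context_counts(documents, term, window=3):
    """Count the words appearing within `window` words of `term`.

    term may be a single word or a multi-word phrase such as "hot dog".
    """
    term_words = term.lower().split()
    k = len(term_words)
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for i in range(len(words) - k + 1):
            if words[i:i + k] == term_words:
                left = words[max(0, i - window):i]
                right = words[i + k:i + k + window]
                counts.update(left + right)
    return counts


def pick_expansion(documents, context_word, expansions):
    """Pick the expansion seen most often near context_word.

    On a tie, the first expansion in the list wins.
    """
    word = context_word.lower()
    return max(
        expansions,
        key=lambda e: context_counts(documents, e)[word],
    )


# Invented mini-corpus standing in for crawled documents.
corpus = [
    "a biography of gandhi covers his early life",
    "gandhi biography and letters",
    "biological warfare treaties banned these weapons",
    "the history of biological warfare",
    "hot dog with mustard and bread at baseball games",
]
gandhi_bio = pick_expansion(corpus, "gandhi", ["biography", "biological"])
bio_warfare = pick_expansion(corpus, "warfare", ["biography", "biological"])
hot_dog_context = context_counts(corpus, "hot dog")
```

With this corpus, "bio" next to "gandhi" resolves to "biography" and next to "warfare" to "biological," while the context of the phrase "hot dog" contains "mustard" and nothing about puppies.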

Throughout its history, Google has devised ways of adding more signals, all without disrupting its users’ core experience. Every couple of years there’s a major change in the system — sort of equivalent to a new version of Windows — that’s a big deal in Mountain View but not discussed publicly. “Our job is to basically change the engines on a plane that is flying at 1,000 kilometers an hour, 30,000 feet above Earth,” Singhal says. In 2001, to accommodate the rapid growth of the Web, Singhal essentially revised Page and Brin’s original algorithm completely, enabling the system to incorporate new signals quickly. (One of the first signals on the new system distinguished between commercial and noncommercial pages, providing better results for searchers who want to shop.) That same year, an engineer named Krishna Bharat, figuring that links from recognized authorities should carry more weight, devised a powerful signal that confers extra credibility to references from experts’ sites. (It would become Google’s first patent.) The most recent major change, codenamed Caffeine, revamped the entire indexing system to make it even easier for engineers to add signals.

Google is famously creative at encouraging these breakthroughs; every year, it holds an internal demo fair called CSI — Crazy Search Ideas — in an attempt to spark offbeat but productive approaches.

...

The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”

This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around. “The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”

...

This flexibility — the ability to add signals, tweak the underlying code, and instantly test the results — is why Googlers say they can withstand any competition from Bing or Twitter or Facebook. Indeed, in the last six months Google has made more than 200 improvements, some of which seem to mimic — even outdo — the offerings of its competitors. (Google says this is just a coincidence and points out that it has been adding features routinely for years.) One is real-time search, eagerly awaited since Page opined some months ago that Google should be scanning the entire Web every second. When someone queries a subject of current interest, among the 10 blue links Google now puts a “latest results” box: a scrolling set of just-produced posts from news sources, blogs, or tweets. Once again, Google uses signals to ensure that only the most relevant tweets find their way into the real-time stream. “We look at what’s retweeted, how many people follow the person, and whether the tweet is organic or a bot,” Singhal says. “We know how to do this, because we’ve been doing it for a decade.”