Elise Ackerman | November 11, 2008
BEHIND Yahoo's push to open up web search and advertising is software powerful enough to sort through the entire US Library of Congress in less than half a minute.
The software, called Hadoop, is part of Yahoo's massive computing grid and is transforming the way Yahoo and corporate giants such as IBM extract meaning from enormous streams of data.
Universities are also using the code - an open-source version of software Google relies on for daily operation - to train a new generation of computer scientists and engineers.
"It makes it possible to actually take advantage of all the computers we have hooked together," Yahoo search and advertising sciences vice-president Larry Heck says.
Hadoop improves the relevance of ads Yahoo shows on the internet by analysing the company's endless flow of data - now more than 10TB daily - on the fly. As users click from Yahoo Mail to Yahoo Search to Yahoo Finance and back again, Hadoop helps figure out what ad, if any, is likely to catch someone's attention.
The key lies in mining insights from mind-boggling amounts of data. If a woman repeatedly reads reviews of sports utility vehicles, then clicks on automotive classifieds and then orders a book on helping a child adjust to kindergarten, she might be in the market for a family-size car, according to a Yahoo sales presentation.
As part of the push for more openness, Yahoo will be using the technology to boost ad sales not only on its own websites but on sites owned by the 796 members of a newspaper consortium working with the search giant to sell more ads at better prices.
"In some ways, perhaps it is even more targeted than search advertising," says Leon Levitt, digital media vice-president for Cox Newspapers, a consortium member.
For Yahoo, an innovative approach to internet advertising is a major accomplishment. When Yahoo launched its Hadoop project in January 2006, it was selling search advertising for half of what Google charged and watching its share of internet searches dwindle.
Hadoop was first put to work building Yahoo's web index - the biggest computing problem inside Yahoo. Since then, a team of engineers has tuned the software, and researchers inside and outside of Yahoo have begun using it to experiment with giant data sets.
"All of a sudden, instead of waiting overnight, people could get the results of their experiments in a minute," says Doug Cutting, a work-at-home dad who hacked out the first version of Hadoop in his home in Sonoma County, California, as part of an open-source search project.
Cutting, a programmer who helped build search engines at Apple and Excite, had started the search project in 2000 because he wanted his code to live on. He knew closed-source projects, where software is treated as a corporate secret, had a way of dying. With open source, the code is published and other programmers can contribute and help fix bugs.
"It was a pretty ambitious goal, destined for failure in the short term but still worth pursuing in the long term," Cutting says.
Plugging away with a core group of volunteers and with support from the Apache Foundation, Cutting created a library of code he called Lucene and a web crawler he called Nutch.
Cutting made some progress, but was stymied by the sheer size of the web. He was able to index only several hundred million web pages, a fraction of the web, which was already billions of pages and expanding quickly.
It was Google that inadvertently supplied the solution. In 2004, Google fellows Jeffrey Dean and Sanjay Ghemawat published a paper about MapReduce, the secret software that Google uses to process raw data using thousands of computers. "It pretty much directly addressed the scaling issue we were having," says Cutting.
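The idea described in the Google paper can be illustrated with a toy word count. What follows is only a minimal sketch in Python that runs on a single machine; it is not Google's code or Hadoop's, and the function names (map_phase, shuffle, reduce_phase, word_count) are invented for illustration. In a real cluster, the map and reduce steps would be spread across many computers and the grouping step would be handled by the framework.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each document is mapped independently and each word is reduced independently, the work can in principle be divided among as many machines as are available, which is the scaling property Cutting was after.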
Using the clues provided by the Google paper, Cutting wrote Hadoop, which was named after his son's toy elephant. Yahoo saw the code and offered Cutting a job.
While a team of engineers adapted Hadoop to run reliably on tens of thousands of computers, researchers embraced the software as a new data mining tool. Early this year, developers at Amazon, Facebook and Intel were using Hadoop for everything from log analysis to modelling earthquakes.
"Hadoop gave me, an ordinary developer, the ability to do something extraordinary," says Jinesh Varia, a web services evangelist at Amazon.
Google quickly got on board, launching an initiative with IBM to provide universities such as Stanford, University of California, MIT and Carnegie Mellon with clusters of several hundred computers, so students could learn new techniques for parallel programming. Since Google's MapReduce was a trade secret, Google and IBM announced the students would be taught on Hadoop.
"We are using not only the contribution we are giving to the software, but the contributions from the larger community as well, and everybody wins from it," Yahoo's Heck says.
MCT