Searching for Perfection
A New Generation of Search Technology Gets Results
By Al Riske
23.Apr.08
You don't have to be a genius to see that search technology is far from perfect. Google does a decent job on the Web, but ...
"Search is more than just the Web," says Steve Green, the principal investigator in Sun's research into advanced search technologies. While searching for documents across a worldwide network is a difficult challenge, the Web is full of links and, as Google discovered, those links are important indicators of what people find valuable.
"But the fine fellows at IBM Research showed a couple of years ago that corporate intranets don't have the sort of link structure that's necessary for things like Google's PageRank system to work," Green points out. "Mostly because there are not enough links."
So Green and his Sun Labs colleagues have been looking into other techniques: a sophisticated passage retrieval algorithm, an automatic document classification system, and a tag-based recommendation engine that recommends things people actually end up liking.
Getting good results is as hard as (or harder than) you would imagine. If a corporation, for example, has a huge collection of documents (without enough links to indicate what people are pointing to as useful), one of the most effective ways to improve the quality of search results is to get people involved. A search within Sun's internal network, for example, actually generates two simultaneous searches: one of the entire multi-million document archive and one of a select set of maybe a thousand documents that have been identified as especially useful. The results page then displays useful Quick Links at the top and additional results below that. Ironically, employees often ignore the Quick Links. "Google has trained people to be blind to the results at the top of the page that are a little separate, because they figure that's an ad," Green says. He notes that some Sun employees have actually complained about poor search results when the link they're looking for is right at the top.
Sun has been doing research in this field almost as long as it's had a research laboratory, and Green, who joined Sun Labs in 1999, was the driving force behind the Java-based search engine that ships as part of Sun's Web server and portal server. At first the project centered around a passage retrieval algorithm that was mostly done by the time Green arrived on the scene. "The thing the team surmised was: If somebody types a query into a search engine and we can find a close match for the question they ask, then it seems likely that the answer to the question they asked is in that passage," he says. "So they developed this passage retrieval methodology where they would try to find exactly the words that the user asked for in the order they asked for them with no intervening text."
In essence, the algorithm would start out looking for an exact match, just as if you'd put quotation marks around your query. Often, however, that criterion proves too restrictive. If you put "used books" as your query and the text says "used mystery books," you're out of luck. "This algorithm on the other hand would start out with the most restrictive case, which is an exact phrase match, but if it couldn't find enough stuff that met that criteria it would start to back off. It would allow the words to get a little farther apart. It would allow the words to be out of order in the document. In some cases it would even allow words to drop out," Green says. "The idea was, for each deviation you would encounter during this relaxation process, you impose a penalty on the passage. You start out with a penalty of zero for a perfect phrase match and you impose a penalty for each deviation from that ideal."
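The penalty idea Green describes can be sketched in a few lines of Python. This is a minimal illustration, not Sun's implementation: the function name `passage_penalty`, the three penalty weights, and the brute-force search over term positions are all assumptions made for the example, since the article gives no actual values or internals. Instead of relaxing the match step by step, the sketch simply scores every candidate placement of the query terms and keeps the lowest penalty.

```python
from itertools import product

# Hypothetical weights; the article does not give the real values.
GAP_PENALTY = 1.0      # per intervening word between two matched terms
ORDER_PENALTY = 2.0    # per pair of matched terms found out of order
MISSING_PENALTY = 5.0  # per query term that does not occur at all


def passage_penalty(query, text):
    """Return the lowest penalty of any passage in `text` for `query`.

    A penalty of 0.0 corresponds to an exact phrase match.
    """
    # Drop repeated query terms while keeping their order.
    terms = list(dict.fromkeys(query.lower().split()))
    tokens = text.lower().split()

    # For each query term, the token positions where it occurs.
    positions = [[i for i, tok in enumerate(tokens) if tok == term]
                 for term in terms]
    present = [pos for pos in positions if pos]
    missing = len(terms) - len(present)
    if not present:
        return MISSING_PENALTY * len(terms)

    best = float("inf")
    # Try every choice of one position per matched term
    # (brute force, acceptable for short queries).
    for combo in product(*present):
        penalty = MISSING_PENALTY * missing
        for a, b in zip(combo, combo[1:]):
            if b < a:
                penalty += ORDER_PENALTY
            penalty += GAP_PENALTY * (abs(b - a) - 1)
        best = min(best, penalty)
    return best
```

Under these assumed weights, the query "used books" scores 0.0 against "we sell used books here" and 1.0 against "used mystery books", so the second passage is still found but ranks below an exact phrase match.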
Out of those beginnings, Green and his colleagues split off to form the Advanced Search Technology project. "We ended up writing this product-quality search engine that can handle reasonable size collections of a few million documents [not just a few thousand, as when he started]. Then I wanted to start working on the kinds of things you can do once you have that," he says. Things like automatic document classification. "If you have a set of documents that are categorized -- these are about football and these are about car racing -- and the categories have been assigned by hand, what you'd really like to be able to do is figure out how to assign those categories to other documents automatically," Green says. You'd like to be able to determine if documents are similar or completely dissimilar so you could group them accordingly. "Turns out the things we did in order to do that are useful in domains you wouldn't expect," Green says, "like recommending musical artists."
Take, for example, Last.fm, a website that "connects you with your favorite music, and uses your unique taste to find new music, people, and concerts you'll like." The site contains hundreds of thousands of musical artists and lets visitors apply tags. Sun Labs was given access to seven million tags to run some experiments. "One of the nice things about our engine is we have a very flexible notion of what a document is, so we can say things like, 'Gee, artists have had all these tags applied to them. Let's say that the artist is the document and the words of the document are the tags people have applied to that artist,'" Green says. "Because the same tag gets applied to an artist many times, the words even have frequencies associated with them, which means we can employ the kind of techniques search engines use to decide what are important terms for a document."
He is referring to "term weighting" methods developed in the 1970s, when, incidentally, his father was building information retrieval systems. "Given a particular document, you need to figure out, 'What are the words that best represent that document?' The important words are those that occur frequently in that document and infrequently in the rest of the collection," Green explains. "In a case like Last.fm, you can use a term-weighting function to decide which are the most important tags for a given artist. Then you can compute the similarity between artists by computing the similarity between their documents."
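To make the artist-as-document idea concrete, here is a small Python sketch. It assumes one common term-weighting form (tag count times the log of inverse document frequency) and cosine similarity between the weighted tag vectors; the article does not say which weighting or similarity function the Sun Labs engine actually uses, and the function names and data layout are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tag_counts):
    """Weight each artist's tags.

    tag_counts maps artist -> Counter of tag -> times the tag was applied.
    Weight = count * log(N / df), where df is the number of artists
    carrying the tag (an assumed, textbook-style weighting).
    """
    n = len(tag_counts)
    df = Counter()
    for tags in tag_counts.values():
        df.update(tags.keys())
    return {
        artist: {tag: count * math.log(n / df[tag])
                 for tag, count in tags.items()}
        for artist, tags in tag_counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse tag vectors (dicts)."""
    dot = sum(w * v.get(tag, 0.0) for tag, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(artist, vectors):
    """Return the other artist whose tag vector is closest to `artist`."""
    return max((other for other in vectors if other != artist),
               key=lambda other: cosine(vectors[artist], vectors[other]))


# Made-up toy data, purely for illustration.
toy_tags = {
    "artist_a": Counter({"rock": 10, "guitar": 5}),
    "artist_b": Counter({"rock": 8, "guitar": 4, "indie": 1}),
    "artist_c": Counter({"jazz": 9, "saxophone": 6, "rock": 1}),
}
toy_vectors = tfidf_vectors(toy_tags)
print(most_similar("artist_a", toy_vectors))
```

In this toy data the tag "rock" is applied to every artist, so its weight drops to zero and the rarer tags decide the match. That is the behavior Green describes: words that occur everywhere in the collection say little about any one document.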
Such was the genesis of Project Aura, in which Green has teamed up with fellow researcher Paul Lamere, who is best known for a related project called Search Inside the Music. The new project is named for the text "aura" around a given object, which can then be used to provide similarity-based, content-based recommendations. "We started seeing more and more of this stuff ... if we just have a text search engine and we have a good notion of what the document is and what the words in the document are, then we can start to do some interesting things," Green says. "With our quarter million music artists, we can find the most similar artist to any given artist in about 200 milliseconds. Fast enough to do real live recommendations on the Web all the time. That's not necessarily the case with standard recommendation techniques being used today."
Green sees Google as the first social Web application, the first to take advantage of the wisdom of crowds, particularly people's propensity to link to useful information. "Essentially we can start to take advantage of exactly the same thing. If you look at these social tagging sites, most of them treat tagging as a database problem -- applying tags to resources in database tables. But they don't really use those tags to do anything interesting. To them a tag is just a key in a database; it's not really a word. Aura takes into account that these are words people have provided for a reason," he says.
"Paul ran an evaluation last year where he collected some artist recommendations from professional music critics and from several music recommender sites on the Web. Then he took our recommendations from this artist-similarity computation, and our recommendations were much better received. We kind of thought we'd finish in the middle of the pack, but we ended up being the best."
Green notes that most recommendations on the Web today are based on collaborative filtering -- people who liked X also liked Y. "But collaborative filtering has a number of problems. The main one being the cold-start problem. When something is new, nobody has listened to it so you can't figure out a way to recommend it," he says.
"We get around the cold-start problem by being able to extrapolate from social tags. Think about movies, for example. There are movie reviews on the Web, plot summaries, all these different resources that reside in different places. We can take all those into account when computing similarities between movies. We don't have to rely on the fact that you watched it and I watched it and we have similar tastes." In other words, Project Aura provides a hybrid solution. "We can do collaborative filtering sorts of things, and we can do text aura- or content-based recommendations. What we really want to do is provide a combination of the two on a sliding scale where, when you have the big database, you can do straight collaborative filtering, which people really like, but when you don't have the big data -- the purchase data, the listening data -- you can start using these content-based recommendation techniques and avoid the weird feedback loops you get when popular things get recommended and recommended things get popular and it's hard to break out of that cycle." |