Searching for Perfection

A New Generation of Search Technology Gets Results

By Al Riske

23.Apr.08 - You don't have to be a genius to see that search technology is far from perfect.

Google does a decent job on the Web, but ...

"Search is more than just the Web," says Steve Green, the principal investigator in Sun's research into advanced search technologies.

While searching for documents across a worldwide network is a difficult challenge, the Web is full of links and, as Google discovered, those links are important indicators of what people find valuable.

"But the fine fellows at IBM research showed a couple of years ago that corporate intranets don't have the sort of link structure that's necessary for things like Google's PageRank system to work," Green points out. "Mostly because there are not enough links."

So Green and his Sun Labs colleagues have been looking into other techniques – a sophisticated passage retrieval algorithm, an automatic document classification system, and a tag-based recommendation engine that recommends things people end up actually liking.

Getting good results is as hard as (or harder than) you would imagine.

If a corporation, for example, has a huge collection of documents (without enough links to indicate what people are pointing to as useful), one of the most effective ways to improve the quality of search results is to get people involved.

A search within Sun's internal network, for example, actually generates two simultaneous searches: one of the entire multi-million document archive and one of a select set of maybe a thousand documents that have been identified as especially useful.

The results page then displays useful Quick Links at the top and additional results below that.
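The two-search layout can be pictured with a small sketch. It assumes a toy in-memory index and plain substring matching; the function names and sample documents are invented for illustration and say nothing about how Sun's internal search is actually built.

```python
def search(index, query):
    """Return titles of documents whose text contains every query word."""
    words = query.lower().split()
    return [title for title, text in index.items()
            if all(w in text.lower() for w in words)]


def search_with_quick_links(curated, archive, query, max_quick=3):
    """Query both collections and keep the two result lists separate."""
    quick = search(curated, query)[:max_quick]
    # Drop archive hits that already appear as Quick Links.
    rest = [t for t in search(archive, query) if t not in quick]
    return {"quick_links": quick, "results": rest}


# Hypothetical data: the curated set is a small hand-picked subset.
archive = {
    "Expense policy": "How to file an expense report for travel",
    "Travel FAQ": "Answers about travel booking and expense limits",
    "Old memo": "A 2003 memo that mentions travel expense forms",
}
curated = {"Expense policy": archive["Expense policy"]}

page = search_with_quick_links(curated, archive, "travel expense")
print(page["quick_links"])  # shown at the top of the page
print(page["results"])      # shown below
```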

Ironically, employees often ignore the Quick Links.

"Google has trained people to be blind to the results at the top of the page that are a little separate, because they figure that's an ad," Green says.

He notes that some Sun employees have actually complained about poor search results when the link they're looking for is right at the top.

Sun has been doing research in this field almost as long as it's had a research laboratory, and Green, who joined Sun Labs in 1999, was the driving force behind the Java-based search engine that ships as part of Sun's Web server and portal server.

At first the project centered on a passage retrieval algorithm that was mostly done by the time Green arrived on the scene.

"The thing the team surmised was: If somebody types a query into a search engine and we can find a close match for the question they ask, then it seems likely that the answer to the question they asked is in that passage," he says. "So they developed this passage retrieval methodology where they would try to find exactly the words that the user asked for in the order they asked for them with no intervening text."

In essence, the algorithm would start out looking for an exact match, just as if you'd put quotation marks around your query. Often, however, that criterion proves too restrictive. If you put "used books" as your query and the text says "used mystery books," you're out of luck.

"This algorithm on the other hand would start out with the most restrictive case, which is an exact phrase match, but if it couldn't find enough stuff that met that criteria it would start to back off. It would allow the words to get a little farther apart. It would allow the words to be out of order in the document. In some cases it would even allow words to drop out," Green says.

"The idea was, for each deviation you would encounter during this relaxation process, you impose a penalty on the passage. You start out with a penalty of zero for a perfect phrase match and you impose a penalty for each deviation from that ideal."
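The penalty idea Green describes can be sketched in a few lines. This is only an illustration under stated assumptions: the three penalty weights are made up, text is split on whitespace with no punctuation handling, and instead of relaxing step by step the sketch scores every placement of the query words and keeps the lowest penalty. It is not the Sun Labs algorithm.

```python
from itertools import product

# Assumed weights, chosen only for illustration.
GAP_PENALTY = 1.0      # per intervening word inside the passage
ORDER_PENALTY = 2.0    # per adjacent pair of query words found out of order
MISSING_PENALTY = 5.0  # per query word that does not occur at all


def passage_penalty(query, text):
    """Lowest penalty of any passage in text; 0.0 is an exact phrase match.

    Returns None when no query word occurs in the text.
    """
    q = list(dict.fromkeys(query.lower().split()))  # distinct words, in order
    tokens = text.lower().split()
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in q}
    present = [w for w in q if positions[w]]
    if not present:
        return None
    missing = len(q) - len(present)

    best = None
    # Try every way of choosing one position per present query word.
    for combo in product(*(positions[w] for w in present)):
        gaps = (max(combo) - min(combo) + 1) - len(combo)
        inversions = sum(1 for a, b in zip(combo, combo[1:]) if b < a)
        penalty = (GAP_PENALTY * gaps
                   + ORDER_PENALTY * inversions
                   + MISSING_PENALTY * missing)
        if best is None or penalty < best:
            best = penalty
    return best


for text in ["we sell used books here", "used mystery books",
             "books used", "rare books"]:
    print(text, "->", passage_penalty("used books", text))
```

With these weights, "used mystery books" is no longer out of luck: it scores one gap penalty rather than being rejected outright. The exhaustive search over placements is fine for short queries but would need pruning on long documents.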

Out of those beginnings, Green and his colleagues split off to form the Advanced Search Technology project.

"We ended up writing this product-quality search engine that can handle reasonable size collections of a few million documents [not just a few thousand, as when he started]. Then I wanted to start working on the kinds of things you can do once you have that," he says.

Things like automatic document classification.

"If you have a set of documents that are categorized -- these are about football and these are about car racing -- and the categories have been assigned by hand, what you'd really like to be able to do is figure out how to assign those categories to other documents automatically," Green says.

You'd like to be able to determine if documents are similar or completely dissimilar so you could group them accordingly.
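One simple way to picture this is a nearest-centroid classifier over word counts. The sketch below is an assumption made for illustration, not the classifier Sun Labs built: it pools the words of the hand-labeled documents into one vector per category and assigns a new document to the category whose vector it most resembles by cosine similarity. The sample sentences are invented.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(labeled):
    """Sum the word counts of the hand-labeled documents in each category."""
    centroids = {}
    for text, label in labeled:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(centroids, text):
    """Assign the category whose pooled vector is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Hypothetical hand-labeled examples.
labeled = [
    ("the quarterback threw a touchdown pass", "football"),
    ("the team punted on fourth down", "football"),
    ("the driver took pole position in qualifying", "car racing"),
    ("a pit stop cost the driver the race", "car racing"),
]
centroids = train(labeled)
print(classify(centroids, "the driver won the race after a late pit stop"))
```

The same `cosine` function answers the grouping question as well: a value near 1 means two documents are similar, and a value near 0 means they share almost no vocabulary.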

"Turns out the things we did in order to do that are useful in domains you wouldn't expect," Green says, "like recommending musical artists."

Take, for example, Last.fm, a website that "connects you with your favorite music, and uses your unique taste to find new music, people, and concerts you'll like."

The site contains hundreds of thousands of musical artists and lets visitors apply tags. Sun Labs was given access to seven million tags to run some experiments.

"One of the nice things about our engine is we have a very flexible notion of what a document is, so we can say things like, 'Gee, artists have had all these tags applied to them. Let's say that the artist is the document and the words of the document are the tags people have applied to that artist,'" Green says. "Because the same tag gets applied to an artist many times, the words even have frequencies associated with them, which means we can employ the kind of techniques search engines use to decide what are important terms for a document."

He is referring to "term weighting" methods developed in the 1970s, when, incidentally, his father was building information retrieval systems.

"Given a particular document, you need to figure out, 'What are the words that best represent that document?' The important words are those that occur frequently in that document and infrequently in the rest of the collection," Green explains.

"In a case like Last.fm, you can use a term-weighting function to decide which are the most important tags for a given artist. Then you can compute the similarity between artists by computing the similarity between their documents."
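A small sketch makes the artist-as-document idea concrete. It assumes the classic TF-IDF weighting (tag count times the log of inverse document frequency) and cosine similarity; the article does not say which term-weighting function Sun Labs used, and the artists and tag counts below are invented.

```python
import math
from collections import Counter

# Hypothetical data: each artist is a "document" whose words are its tags,
# and each count is how many times listeners applied that tag.
artist_tags = {
    "Artist A": Counter({"rock": 50, "indie": 30, "seen live": 20}),
    "Artist B": Counter({"rock": 40, "indie": 25, "british": 10}),
    "Artist C": Counter({"jazz": 60, "piano": 30, "seen live": 15}),
}


def tfidf(tag_counts):
    """Weight each tag by its count for the artist and its rarity overall."""
    n = len(tag_counts)
    # Document frequency: the number of artists that carry each tag.
    df = Counter(tag for tags in tag_counts.values() for tag in tags)
    return {
        artist: {tag: count * math.log(n / df[tag])
                 for tag, count in tags.items()}
        for artist, tags in tag_counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse tag-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def most_similar(weights, artist):
    """Return the other artist whose weighted tags are closest."""
    others = (name for name in weights if name != artist)
    return max(others, key=lambda name: cosine(weights[artist], weights[name]))


weights = tfidf(artist_tags)
top_tag = max(weights["Artist C"], key=weights["Artist C"].get)
print(top_tag)                            # most important tag for Artist C
print(most_similar(weights, "Artist A"))  # nearest artist by tag similarity
```

A tag that every artist carries gets a weight of zero under this formula, which is the point: it says nothing that distinguishes one artist from another.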

Such was the genesis of Project Aura, in which Green has teamed up with fellow researcher Paul Lamere, who is best known for a related project called Search Inside the Music.

The new project is named for the text "aura" around a given object, which can then be used to provide similarity-based, content-based recommendations.

"We started seeing more and more of this stuff ... if we just have a text search engine and we have a good notion of what the document is and what the words in the document are, then we can start to do some interesting things," Green says.

"With our quarter million music artists, we can find the most similar artist to any given artist in about 200 milliseconds. Fast enough to do real live recommendations on the Web all the time. That's not necessarily the case with standard recommendation techniques being used today."

Green sees Google as the first social Web application, the first to take advantage of the wisdom of crowds, particularly people's propensity to link to useful information.

"Essentially we can start to take advantage of exactly the same thing. If you look at these social tagging sites, most of them treat tagging as a database problem -- applying tags to resources in database tables. But they don't really use those tags to do anything interesting. To them a tag is just a key in a database; it's not really a word. Aura takes into account that these are words people have provided for a reason," he says.

"Paul ran an evaluation last year where he collected some artist recommendations from professional music critics and from several music recommender sites on the Web. Then he took our recommendations from this artist-similarity computation, and our recommendations were much better received. We kind of thought we'd finish in the middle of the pack, but we ended up being the best."

Green notes that most recommendations on the Web today are based on collaborative filtering -- people who liked X also liked Y.

"But collaborative filtering has a number of problems. The main one being the cold-start problem. When something is new, nobody has listened to it so you can't figure out a way to recommend it," he says.
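The cold-start problem is easy to see in a toy version of "people who liked X also liked Y." The sketch below simply counts how often two items share a listener; the listening histories are invented, and real collaborative filtering systems are considerably more elaborate.

```python
from collections import Counter
from itertools import permutations


def also_liked(histories):
    """For every item, count how often each other item shares a listener."""
    co = {}
    for items in histories.values():
        for x, y in permutations(set(items), 2):
            co.setdefault(x, Counter())[y] += 1
    return co


def recommend(co, item, k=2):
    """Items most often liked by the people who liked the given item."""
    return [other for other, _ in co.get(item, Counter()).most_common(k)]


# Hypothetical listening histories.
histories = {
    "ann": ["X", "Y"],
    "bob": ["X", "Y", "Z"],
    "cat": ["X", "Z"],
    "dan": ["X", "Y"],
}
co = also_liked(histories)
print(recommend(co, "X"))            # ranked by shared listeners
print(recommend(co, "New Release"))  # nobody has listened yet: nothing to say
```

An item with no listeners has no co-occurrence counts, so it can neither be recommended nor serve as a starting point for recommendations.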

"We get around the cold-start problem by being able to extrapolate from social tags. Think about movies, for example. There are movie reviews on the Web, plot summaries, all these different resources that reside in different places. We can take all those into account when computing similarities between movies. We don't have to rely on the fact that you watched it and I watched it and we have similar tastes."

In other words, Project Aura provides a hybrid solution.

"We can do collaborative filtering sorts of things, and we can do text aura- or content-based recommendations. What we really want to do is provide a combination of the two on a sliding scale where, when you have the big database, you can do straight collaborative filtering, which people really like, but when you don't have the big data -- the purchase data, the listening data -- you can start using these content-based recommendation techniques and avoid the weird feedback loops you get when popular things get recommended and recommended things get popular and it's hard to break out of that cycle."
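The sliding scale can be sketched as a weighted blend of two scores. Everything specific here is an assumption made for illustration: the linear ramp, the `saturation` threshold of 50 interactions, and the function names are not taken from Project Aura.

```python
def blend_weight(n_interactions, saturation=50):
    """Share of the final score taken from collaborative filtering (0 to 1).

    Assumed rule: trust collaborative filtering in proportion to how much
    listening or purchase data exists, up to a saturation point.
    """
    return min(n_interactions / saturation, 1.0)


def hybrid_score(cf_score, content_score, n_interactions, saturation=50):
    """Blend a collaborative-filtering score with a content-based score."""
    w = blend_weight(n_interactions, saturation)
    return w * cf_score + (1.0 - w) * content_score


# A brand-new item leans entirely on its text aura; a well-known item
# leans entirely on collaborative filtering.
for n in (0, 25, 200):
    print(n, hybrid_score(cf_score=0.9, content_score=0.4, n_interactions=n))
```

Because a new item is scored on its content alone, it can surface without first becoming popular, which is one way to step outside the feedback loop Green describes.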

Steve Green

Title: Principal Investigator, Project Aura, Sun Microsystems Laboratories.

Interests: Building scalable information retrieval systems, automatic document classification, and lexical semantics.

Education: Bachelor's and master's degrees in math and computer science from the University of Waterloo. Doctorate in computer science from the University of Toronto.

Background: Spent two years as a research fellow at Macquarie University in Sydney, Australia, before joining Sun in 1999.

Quote: "My Dad can build a better indexer than your Dad."

Blog: The Search Guy

Little-Known Fact: Green, whose father wrote information retrieval systems for the Canadian government back in the 1970s, has never met another second-generation "search guy." For all he knows, he may be the only one.

Patents: Several pending.

Hobbies: "I have a two-and-a-half-year-old. I don't have time for hobbies."

Last Book Read: Night Watch, by Terry Pratchett.

Favorite Meal: Steak and french fries.

Favorite Movie: Superman II. ("I was 13 when it came out and I was a Superman geek.")

Pet Peeve: Tools that don't learn. ("About 40 percent of email messages that claim to have an attachment don't have one because the person forgot. If my email program can underline misspelled words as I type, why can't it check to see if I've promised to attach something and remind me to do it?")

Childhood Ambition: "I always wanted to do this, which is why it's such an awesome job."

First Job: Washing dishes for the Bay Buffet restaurant in Ottawa, Ontario, Canada.

Retreat: The family cottage outside Ladysmith, Quebec, Canada.

What Brought Him to Sun: "I wanted the opportunity to work on things that people might actually use instead of just writing papers about it."

 