Key Ideas behind the Technology
|
Making a difference
|
We have found that techniques from knowledge representation and
natural language processing can make a useful contribution to solving
the paraphrase problem. By searching a structured conceptual taxonomy
of the words and phrases extracted from a collection of documents,
our algorithms can effectively connect terms in a query with appropriate
related terms in document passages.
|
The problem with synonyms
|
A common approach to the paraphrase problem is to use tables
of synonyms to expand queries automatically by adding terms
that are recorded as "synonymous." Because there are few
real synonyms in English, the common practice is to include
related words as if they were synonyms. Treating terms this way
when they are not really synonyms imposes a fixed level of
granularity that trades precision for recall. There is no a priori
correct level for this tradeoff, since different information needs
require different levels of generality, so this technique often
degrades retrieval rather than improving it.
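The effect of a synonym table can be sketched in a few lines of Python. This is only an illustration: the table contents and the function name expand_with_synonyms are assumptions made for the example, not part of any particular retrieval system.

```python
# Hypothetical synonym table: related words lumped together as if synonymous.
SYNONYM_CLASSES = [
    {"automobile", "car", "truck", "bus", "motor vehicle"},
]

def expand_with_synonyms(term):
    """Return the term plus every word that shares a synonym class with it."""
    expanded = {term}
    for synonym_class in SYNONYM_CLASSES:
        if term in synonym_class:
            expanded |= synonym_class
    return expanded

# A narrow query and a broad query expand to exactly the same set of terms,
# so the person asking has no control over the level of generality.
print(sorted(expand_with_synonyms("automobile")))
print(expand_with_synonyms("automobile") == expand_with_synonyms("motor vehicle"))
```

In this sketch a query for "automobile" also matches passages about trucks and buses, which is the loss of precision described above.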
As an alternative to synonym classes, we use taxonomic
subsumption algorithms that exploit generality (subsumption) rather than
synonymy to connect terms in queries with passages that contain
more specific terms as well as the requested terms. These
algorithms do not automatically explore more general terms, so
the level of generality is controlled by your choice of query
terms. For example, a query for "motor vehicles" would retrieve
trucks, buses, cars, etc., whereas a query for "automobiles"
would retrieve cars and taxicabs, but not trucks and buses.
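The motor vehicle example can be sketched as a downward-only walk over a small taxonomy. The dictionary NARROWER and the function expand_downward are hypothetical names for this illustration, and placing "automobile" directly under "motor vehicle" is an assumption consistent with the example rather than a description of the actual knowledge base.

```python
# Hypothetical taxonomy: each concept maps to its direct, more specific concepts.
NARROWER = {
    "motor vehicle": ["automobile", "truck", "bus"],
    "automobile": ["car", "taxicab"],
}

def expand_downward(term):
    """Return the term and all more specific terms; never more general ones."""
    found = set()
    pending = [term]
    while pending:
        current = pending.pop()
        if current not in found:
            found.add(current)
            pending.extend(NARROWER.get(current, []))
    return found

print(sorted(expand_downward("motor vehicle")))
print(sorted(expand_downward("automobile")))
```

Because the walk only follows links to narrower concepts, the query term itself sets the level of generality: "automobile" reaches cars and taxicabs, while trucks and buses are reached only from "motor vehicle".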
|
Taxonomies
|
Structured conceptual taxonomies (a type of semantic network) can be
constructed from words and phrases using knowledge bases of general
semantic facts. These words and phrases can be extracted automatically
from text and parsed into conceptual structures. The taxonomy can be
organized by the most-specific-subsumer (MSS) relationship, in which each
concept is linked to the most specific concepts that subsume it,
i.e., that are more general than it is. Each term in a query is
matched with the corresponding concept in the taxonomy,
together with that concept's subconcepts.
For example, given the general semantic facts that "washing" is a
kind of "cleaning" and "car" is a kind of "automobile", an algorithmic
classification system can automatically classify "car washing" as a
kind of "automobile cleaning". A query for "automobile cleaning" or
"automobile washing" will immediately retrieve hits for "car
washing".
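The classification step in this example can be sketched as follows. This is a minimal illustration, not the group's actual algorithm: it assumes phrases are reduced to (modifier, head) pairs, that each word has at most one more general word, and that a phrase subsumes another when each of its parts subsumes the corresponding part. All names (KIND_OF, word_subsumes, phrase_subsumes, most_specific_subsumers) are hypothetical.

```python
# Hypothetical general semantic facts: word -> the more general word it is a kind of.
KIND_OF = {"washing": "cleaning", "car": "automobile"}

def word_subsumes(general, specific):
    """True if `general` equals `specific` or is reachable from it by kind-of links."""
    while specific is not None:
        if specific == general:
            return True
        specific = KIND_OF.get(specific)
    return False

def phrase_subsumes(general, specific):
    """Phrases are (modifier, head) pairs; compare them part by part."""
    return (word_subsumes(general[0], specific[0])
            and word_subsumes(general[1], specific[1]))

def most_specific_subsumers(concept, taxonomy):
    """Concepts in `taxonomy` that subsume `concept` and have no narrower subsumer."""
    subsumers = [c for c in taxonomy if c != concept and phrase_subsumes(c, concept)]
    return [s for s in subsumers
            if not any(o != s and phrase_subsumes(s, o) for o in subsumers)]

taxonomy = [("automobile", "cleaning"), ("automobile", "washing"), ("car", "cleaning")]

# "car washing" is classified under "automobile cleaning" ...
print(phrase_subsumes(("automobile", "cleaning"), ("car", "washing")))
# ... but it is linked only to its most specific subsumers.
print(most_specific_subsumers(("car", "washing"), taxonomy))
```

In this sketch "automobile cleaning" subsumes "car washing" but is not one of its most specific subsumers, because "automobile washing" and "car cleaning" lie between them. A query for "automobile cleaning" still reaches "car washing" by following the links downward.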
|
Knowledge Technology Group, Sun Microsystems Laboratories
|