http://www.jimdavies.org/summaries/

Resnik, P. (1995) Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-95), 448-453.

@InProceedings{Resnik1995,
  author =    {Resnik, Philip},
  title =     {Using Information Content to Evaluate Semantic Similarity in a Taxonomy},
  booktitle = {Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-95)},
  pages =     {448--453},
  year =      {1995}
}
Download paper from http://citeseer.ist.psu.edu/resnik95using.html

Author of the summary: Mike Niu, 2mln@qlink.queensu.ca

Cite this paper for:

Measurement of semantic similarity in an IS-A taxonomy:

One of the most crucial steps toward an intelligent AI is for it to “understand” the natural language we speak every day, so that it can process documents and text in whatever general way is specified, such as sorting documents by urgency or even giving intelligent advice to a human being. To achieve this, the AI must be able to extract important words, such as nouns, from a sentence. Before any processing can be done, the AI needs to “understand” the relationships between different nouns in a taxonomy, since nouns are the basis for building up our knowledge. One of the most important relations between nouns is how similar they are. The difficult question is how to measure such a relation: even humans sometimes disagree on whether two words are similar (e.g., green and inexperienced). Fortunately, past work (Miller and Charles 1991) has shown that human similarity judgments correlate at 0.9015; that is, human ratings of how similar two words are agree with one another very strongly. Note that this should be taken as an optimistic upper bound, as the sample was small and of similar age -- 38 undergraduates.

Obviously we need a taxonomy database of the relations between nouns, and researchers at Princeton University have constructed such a database, WordNet: a large English semantic network that groups synonymous words into sets called synsets. Below is an IS-A taxonomy tree extracted from WordNet [p449]:

Figure#1:
     |
     |
Medium of Exchange
     |          \
     |           \
    Money         \
     |             \
    Cash           Credit
     |                |
     |                |
    COIN          Credit Card
    /   \
 Nickel  Dime

Note that the concept COIN subsumes both Nickel and Dime, while the concept Medium of Exchange subsumes the two subtrees Money and Credit. Therefore, Nickel and Dime are more similar than, say, Nickel and Credit Card. This subsumption structure is the basis for the comparisons below.

Traditional method of edge counting[p448]:

One obvious way to measure the similarity between two words is simply to count the edges from one word to the other. To measure the similarity between Nickel and Credit Card, we start at one leaf, Nickel, go up the taxonomy tree until we find a concept that subsumes them both, Medium of Exchange, and then go down to the other leaf, Credit Card, for a total of 6 edges. The similarity is then measured with the following formula:

sim_edge(w1,w2) = (2 x MAX) - min over c1,c2 of [len(c1,c2)]     (1)

where MAX is the maximum possible path length and c1, c2 range over the concepts (senses) of w1 and w2. We need to distinguish similarity between distinct concepts from similarity between words because a word such as orange corresponds to two distinct concepts, a color and a fruit (see also Nickel and Gold and their four immediate parent concepts in Figure#2). Therefore, we consider all such concepts and take the shortest path connecting some concept of w1 to some concept of w2; subtracting that length from 2 x MAX turns the distance into a similarity. For the sake of illustration, assume each edge in Figure#2 has a cost of one. For Nickel and Gold we go through Chemical element or Metal, giving 2 x MAX - 2, rather than through Asset, giving 2 x MAX - 7; the former has the larger sim_edge value, so that is the value we keep. (A small code sketch of this computation appears after Figure#2 below.)

Figure#2:

         |                                |
       Asset                          Substance
       /    \                         /        \
   Money    Wealth                   .         Solid
     |         .                     .           |
    Cash       .                     .        Crystal
     |         |                     |           .
    COIN    Treasure         Chemical element   Metal
    /  \       |                  /    \         /  \
Nickel  Dime  Gold           Nickel'  Gold'  Nickel''  Gold''

Note that Nickel, Nickel', and Nickel'' are distinct concepts, since their immediate parents are distinct; but the word Nickel covers all three concepts, i.e. word(Nickel) = {Nickel, Nickel', Nickel''}.
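To make the edge-counting procedure concrete, here is a minimal Python sketch over the toy taxonomy of Figure#2. The parent table, the sense lists, and the value of MAX are illustrative assumptions for this small example, not data taken from WordNet:

# Minimal sketch of edge-counting similarity (equation 1) over the toy
# taxonomy of Figure#2. Concept names, the parent table, and MAX are
# illustrative assumptions, not taken from WordNet itself.

# Each concept maps to its parent (None = root of this fragment).
# Primes mark the distinct senses of the words "nickel" and "gold".
PARENT = {
    "Asset": None,                 "Substance": None,
    "Money": "Asset",              "Wealth": "Asset",
    "Cash": "Money",               "Treasure": "Wealth",
    "Coin": "Cash",                "Gold": "Treasure",
    "Nickel": "Coin",              "Dime": "Coin",
    "Chemical element": "Substance",
    "Nickel'": "Chemical element", "Gold'": "Chemical element",
    "Solid": "Substance",          "Crystal": "Solid",
    "Metal": "Crystal",
    "Nickel''": "Metal",           "Gold''": "Metal",
}

# Word -> its possible concepts (senses), as in word(Nickel) above.
SENSES = {
    "nickel": ["Nickel", "Nickel'", "Nickel''"],
    "gold":   ["Gold", "Gold'", "Gold''"],
    "dime":   ["Dime"],
}

MAX = 15  # assumed upper bound on path length, for illustration only


def ancestors(c):
    """Return {concept: distance-from-c} for c and all of its ancestors."""
    out, d = {}, 0
    while c is not None:
        out[c] = d
        c = PARENT[c]
        d += 1
    return out


def path_len(c1, c2):
    """Edges on the shortest path through a common ancestor (inf if none)."""
    a1, a2 = ancestors(c1), ancestors(c2)
    shared = set(a1) & set(a2)
    if not shared:
        return float("inf")  # no shared ancestor in this toy fragment
    return min(a1[c] + a2[c] for c in shared)


def sim_edge(w1, w2):
    """Equation (1): 2*MAX minus the shortest sense-to-sense path."""
    shortest = min(path_len(c1, c2)
                   for c1 in SENSES[w1] for c2 in SENSES[w2])
    return 2 * MAX - shortest


print(sim_edge("nickel", "dime"))  # 2 edges via Coin -> 28
print(sim_edge("nickel", "gold"))  # 2 edges via Chemical element / Metal -> 28

Note that the 7-edge path from Nickel to Gold through Asset is never chosen, because the Chemical element and Metal senses give a shorter path, exactly as described above.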

Edge counting is problematic as it assumes uniform edge distance[p448]:

To see why edge counting is problematic, consider SAFETY VALVE IS-A VALVE and KNITTING MACHINE IS-A MACHINE. Even though each has only one edge to its parent node, the former pair seems much closer -- more similar -- than the latter. This becomes even more problematic when specific parts of the taxonomy are compared: biological concepts, for example, are much more densely packed than concepts in other areas such as philosophy. As a result, edge counting achieves a correlation of only 0.6645 with human judgments. See Table#1.

Table#1
------------------------------------------
Similarity Method          Correlation
------------------------------------------
Human judgements             0.9015
Information content          0.7911
Probability                  0.6671
Edge counting                0.6645
------------------------------------------

Resnik's new approach by using information content [p449-450]:

Following the standard argumentation of information theory [Ross, 1976], the information content of a concept is defined as its negative log likelihood, -log p(c), where p(c) is the probability of encountering concept c. For example, in Figure#1, Money has less information content than Nickel because the probability of encountering the concept Money, p(Money), is much greater than p(Nickel). Consequently, Nickel carries more information -- it is more specific than Money, being a quantifiable, exact form of money. Also, whenever the concept Nickel is used, the concept Money is implicitly used as well, since Money is an ancestor of Nickel. Resnik therefore observes that the higher a term sits in the taxonomy tree, the less information it carries but the more frequently it is encountered; e.g., the concept 'matter' would have p(matter) near 1 and information content near 0. Resnik defines the similarity between two distinct concepts with the following equation [p450]:

sim(c1,c2) = max over c in S(c1,c2) of [-log p(c)]     (2)

where S(c1,c2) is the set of concepts that subsume both c1 and c2, and -log p(c) is the information content of c. The equation simply says that we take the subsuming concept with the maximum information content. Therefore, in Figure#2, COIN (rather than Cash) determines the similarity between the concepts Nickel and Dime, since it subsumes both and carries more information content. The equation for comparing two words (remember that a word can have multiple distinct concepts, e.g. orange) must, as before, consider all concepts ranging over w1 and w2:

sim_word(w1,w2) = max over c1,c2 of [sim(c1,c2)]     (3)

where c1 ranges over the concepts of w1 and c2 over the concepts of w2. The formula says to compare all concept pairs of w1 and w2 and take the maximum similarity value sim(c1,c2). For example, in Figure#2, the word Nickel has the parents COIN, Chemical element, and Metal, while the word Gold has the parents Treasure, Chemical element, and Metal; the candidate subsumers for Nickel and Gold are therefore Asset, Chemical element, and Metal. If, as in the real world, p(Asset) > p(Metal) > p(Chemical element) -- i.e. Asset is the most general and carries the least information -- then Chemical element is chosen because its information content is the largest, and its value, sim(Nickel', Gold'), computed with equation (2), is assigned to sim_word(Gold, Nickel).
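The following Python sketch implements equations (2) and (3) over the same toy taxonomy; it assumes the PARENT and SENSES tables and the ancestors() helper from the edge-counting sketch above are in scope. The p(c) values are invented for illustration only; the next section describes how Resnik actually estimates them from a corpus:

# Minimal sketch of equations (2) and (3): concept similarity is the
# information content of the most informative common subsumer, and word
# similarity is the maximum over all sense pairs. The probabilities
# below are made-up illustrative values, not real corpus estimates.
import math

P = {
    "Asset": 0.20, "Money": 0.10, "Cash": 0.06, "Coin": 0.02,
    "Wealth": 0.09, "Treasure": 0.01,
    "Substance": 0.25, "Chemical element": 0.005,
    "Solid": 0.12, "Crystal": 0.05, "Metal": 0.04,
    "Nickel": 0.001, "Dime": 0.001, "Gold": 0.002,
    "Nickel'": 0.001, "Gold'": 0.001,
    "Nickel''": 0.001, "Gold''": 0.001,
}


def info_content(c):
    """-log p(c): rarer (more specific) concepts carry more information."""
    return -math.log(P[c])


def sim_concept(c1, c2):
    """Equation (2): max information content over common subsumers.
    If two senses share no subsumer, similarity is 0 (a virtual root
    with p = 1 carries no information)."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max((info_content(c) for c in common), default=0.0)


def sim_word(w1, w2):
    """Equation (3): max concept similarity over all sense pairs."""
    return max(sim_concept(c1, c2)
               for c1 in SENSES[w1] for c2 in SENSES[w2])


# Nickel/Dime are joined by Coin (rare, hence informative); Nickel/Gold
# by Chemical element, which here has lower p(c) than Metal or Asset.
print(sim_word("nickel", "dime"))
print(sim_word("nickel", "gold"))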

Probability of encountering a concept[p449]:

One final piece is missing from the two equations above: how exactly do we calculate p(c), the probability of encountering a concept? It is not as complicated as it seems. Resnik first establishes concept frequencies from the Brown Corpus of American English,

“a large (1 million word) collection of text across genres ranging from news articles to science fiction. Each noun that occurred in the corpus was counted as an occurrence of each taxonomic class containing it.”

That is, whenever any sub-concept of a node occurs in these articles, the node's own count is incremented as well. For example, every occurrence of Nickel or Dime also adds one to the frequencies of COIN, Cash, Money, and all concepts above them. Formally,

freq(c) = sum over n in words(c) of count(n)     (4)

where words(c) is the set of words subsumed by concept c. The concept probability is then given by

p(c) = freq(c) / N, (5)

where N is the total number of nouns observed (excluding, of course, nouns that are not in WordNet).

Resnik computed p(c) for every concept in WordNet for the following study and for further in-depth verification.
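A minimal Python sketch of this counting scheme follows, again reusing the SENSES table and ancestors() helper from the earlier sketches. The tiny corpus here is invented; Resnik used noun counts from the Brown Corpus:

# Minimal sketch of equations (4) and (5): every corpus occurrence of a
# noun counts as an occurrence of each concept that subsumes one of its
# senses, and p(c) = freq(c) / N.
from collections import Counter

corpus = ["nickel", "dime", "gold", "nickel", "dime", "gold", "gold"]

freq = Counter()
N = 0
for noun in corpus:
    if noun not in SENSES:      # nouns outside the taxonomy are skipped
        continue
    N += 1
    counted = set()             # count each concept once per occurrence
    for sense in SENSES[noun]:
        counted.update(ancestors(sense))  # the sense and all its subsumers
    for concept in counted:
        freq[concept] += 1

p = {c: freq[c] / N for c in freq}   # equation (5)
print(p["Coin"], p["Asset"], p["Substance"])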

Study[p450]:

Resnik used 10 subjects and 28 pairs of nouns, computed the p(c) values described above, and confirmed that his method is significantly better than traditional edge counting: a correlation of 0.7911 (roughly 80%) with human judgments, compared with 0.6645 (roughly 66%) for edge counting. See Table#1 and Table#3. In other words, given a pair of words, the AI's similarity judgments track human judgments quite closely. This means the goal of having an AI summarize millions of documents -- perhaps by substituting upper concepts for finer-grained lower concepts in the text -- is no longer so distant, and a semi-intelligent adviser is no longer just a dream.
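As a rough illustration of the evaluation itself, the sketch below computes the Pearson correlation between human similarity ratings and model scores for a list of word pairs (using sim_word from the earlier sketch). The pairs and ratings are invented placeholders, not the actual Miller and Charles data:

# Minimal sketch of the evaluation: correlate human ratings with the
# model's similarity scores over a set of word pairs.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

pairs = [("nickel", "dime"), ("nickel", "gold"), ("dime", "gold")]
human = [3.6, 2.3, 1.5]                          # placeholder ratings
model = [sim_word(w1, w2) for w1, w2 in pairs]   # from the sketch above
print(pearson(human, model))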

Discussion:

So what makes Resnik's method significantly better? In my opinion, it is the fact that edge distance is replaced by the probability of encountering a parent concept of which both child concepts can be instances; the link between two concepts is thereby quantified by how often we actually encounter their common subsumer (remember that p(c) is estimated from a million-word collection of text). Humans and the AI can therefore relate to and judge the similarity of two words in the same way. After all, the similarity value is measured against common human judgment rather than against how structurally related the two concepts are; as a result, the sources used to estimate p(c) must be chosen appropriately if the AI is to capture the technical relatedness of two words. Resnik illustrates this problem with a more specific example [p451]; see Table#2. Because of the makeup of the corpus, the similarity between tobacco and horse comes out much higher than that between tobacco and alcohol, since horse is also a slang term for heroin and is frequently used that way in the news. Resnik then suggests considering all concepts of the two words instead of taking only the maximal one, and notes that further research is required. (Interested readers could refer to later research based on Resnik's work, such as Rose and Roni, "Using WordNet to Supplement Corpus Statistics".)

Table #2
------------------------------------------
n1         n2         sim(n1,n2)   class
------------------------------------------
tobacco    alcohol       7.63      DRUG
tobacco    sugar         3.56      SUBSTANCE
tobacco    horse         8.26      NARCOTIC
------------------------------------------

Table#3 (for a complete table, please see the original article)
--------------------------------------------------------------------------------
Word pair            Miller & Charles   Human judgement    sim    sim_edge  sim_p(c)
                                        (replication)
--------------------------------------------------------------------------------
car      automobile        3.92              3.90          8.04      30      0.9962
gem      jewel             3.84              3.84         14.93      30      1.0000
..
..
noon     string            0.08              0.00          0.00      0.00    0.00
rooster  voyage            0.08              0.00          0.00      0.00    0.00
--------------------------------------------------------------------------------

Cognitive Science Summaries Webmaster:
JimDavies (jim@jimdavies.org)