Ponzetto & Strube 2007 Knowledge Derived From Wikipedia For Computing Semantic Relatedness

http://www.cc.gatech.edu/~jimmyd/summaries/

Ponzetto, S.P., Strube, M. (2007). Knowledge Derived From Wikipedia For Computing Semantic Relatedness. Journal of Artificial Intelligence Research, 30, 181-212.

@Article{PonzettoStrube2007,
  author = 	 {Ponzetto, Simone Paolo and Strube, Michael},
  title = 	 {Knowledge Derived From Wikipedia For Computing Semantic Relatedness},
  journal = 	 {Journal of Artificial Intelligence Research},
  year = 	 {2007},
  volume = 	 {30},
  pages = 	 {181--212},
}

Author of the summary: Craig J. Greenberg, 2007, cgreenbe@connect.carleton.ca

Cite this paper for:

Wikipedia is a source of a large amount of structured real-world knowledge.

Wikipedia page links model semantic relatedness. [p182]

Semantic relatedness refers to distance on a semantic network, taking different relationship types into account (hypernymy, antonymy, etc.), while semantic similarity refers only to overlap of meaning (Budanitsky & Hirst, 2006).

Articles are organized on a category network, assigned by editors.

Relative position on this network can be used to measure the semantic relatedness of words. [p184]

There are many methods of measuring relatedness of two words, some of which correlate better than others to human judgements. [p185]

These methods include: text overlap of definition (first "gloss" paragraph), shortest path distance (number of edges) between two words on the network, and information content of the closest category subsuming both. [p185]

Relatedness on Wikipedia can be computed using combinations of these measures.

When computing category paths between words, limiting the search to a maximum depth of 4 and to categories deeper than level 2 (where level 0 is the top of the category network, i.e. the root above all subcategories) returns judgements closest to those of humans. [p186]

Performance on relatedness judgements can be assessed by the Pearson correlation between human relatedness judgements and the database judgements.[p188]

Wikipedia generally outperforms WordNet (another semantic knowledge database) on tests of semantic relatedness, but not similarity. [p190]

Although one appealing aspect of Wikipedia is its constant growth, no significant difference was found in relatedness-judgement correlation to human judgements between Wikipedia of Sept. 2006 and May 2007. [p192]

Wikipedia is competitive with WordNet for determining if expressions are coreferent. [p199-200]

Wikipedia can be used for relatedness judgements in other languages. [p201-202]

The actual paper can be found at http://www.jair.org/media/2308/live-2308-3485-jair.pdf

Background

Natural language processing has usually involved statistical techniques, but real-world general knowledge is needed to progress further. Wikipedia may be a good source of such knowledge.

The quality of Wikipedia for this purpose is assessed in two ways: semantic relatedness judgements and coreference resolution. Performance is compared to human judgements on the same tasks.

Wikipedia articles are structured by a network of category relationships. These relationships model linguistic relationships, such as: [p182]

Redirect pages: point alternate expressions to the same article; model synonymy
Disambiguation pages: list the possible articles a polysemic expression could refer to; model homonymy
Internal links: link one article to another; model cross-reference

One problem with using Wikipedia is its vast category depth, branching, and multiple inheritance relationships.

Calculating relatedness can be done in a number of ways. Some systems perform better or worse depending on which measure of relatedness is used. For two given words, the measures used here are the following [p185]:

Path-Based Measures: Relatedness is inversely proportional to the number of edges in the shortest path between two words in the category network.
Information Content Measures: Relatedness is based on how much information is carried by the closest category in the network that is superordinate to both words.
Text Overlap Measures: Relatedness is based on how much text is shared between the "gloss" or brief definition of each of the two words.
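
The three families of measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the toy category graph is invented, the path score 1 / (1 + edges) is one simple way to make relatedness fall as path length grows, the information content formula is an "intrinsic" formulation based on descendant counts (whether it matches the paper's exact formula is an assumption), and the overlap score is a plain shared-token count.

```python
import math
from collections import deque

# Toy undirected category graph (hypothetical data, not taken from Wikipedia).
GRAPH = {
    "Mammals": {"Animals", "Dogs", "Cats"},
    "Animals": {"Mammals", "Birds"},
    "Dogs": {"Mammals"},
    "Cats": {"Mammals"},
    "Birds": {"Animals"},
}

def shortest_path_edges(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def path_relatedness(graph, a, b):
    """Path-based score: shrinks as the shortest path gets longer (assumed form)."""
    edges = shortest_path_edges(graph, a, b)
    return 0.0 if edges is None else 1.0 / (1 + edges)

def intrinsic_ic(num_descendants, total_categories):
    """Information content from descendant counts: 1.0 for a leaf, 0.0 for the root."""
    return 1.0 - math.log(num_descendants + 1) / math.log(total_categories)

def gloss_overlap(gloss_a, gloss_b):
    """Text-overlap score: number of distinct lowercase tokens shared by two glosses."""
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

print(shortest_path_edges(GRAPH, "Dogs", "Cats"))   # 2
print(path_relatedness(GRAPH, "Dogs", "Birds"))     # 0.25
print(intrinsic_ic(0, 5))                           # 1.0
print(gloss_overlap("a domesticated canine animal",
                    "a small domesticated feline animal"))  # 3
```

In this sketch a very general category (many descendants) carries little information, so two words whose closest shared category is near the top of the network score low on the information content measure.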

Assessing Wikipedia's Performance

Because many expressions will yield disambiguation pages when searched in Wikipedia, a method to decide which entry to use is required. The method used in these experiments was (roughly) to compare the entries listed in each word's disambiguation pages, and if any word appeared in both pages, to use that definition. Otherwise, use the first listed. [p186]
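
A rough reading of this disambiguation step can be sketched as follows. The function name `pick_sense`, the page titles, and the idea of comparing tokens in candidate titles are all assumptions made for illustration; the paper's actual procedure works on Wikipedia's disambiguation pages and may differ in detail.

```python
import re

def tokens(text):
    """Lowercase alphabetic tokens of a page title."""
    return set(re.findall(r"[a-z]+", text.lower()))

def pick_sense(word, candidates, other_word, other_candidates):
    """Pick the candidate for `word` that shares a token with the other word's
    candidates; fall back to the first listed candidate if none does."""
    ignore = {word.lower(), other_word.lower()}
    other_tokens = set()
    for title in other_candidates:
        other_tokens |= tokens(title)
    other_tokens -= ignore
    for title in candidates:
        if (tokens(title) - ignore) & other_tokens:
            return title
    return candidates[0]

# Hypothetical disambiguation entries, invented for this example.
king_pages = ["King (monarch)", "King (chess)", "King (playing card)"]
rook_pages = ["Rook (bird)", "Rook (chess)"]

print(pick_sense("king", king_pages, "rook", rook_pages))  # King (chess)
print(pick_sense("rook", rook_pages, "king", king_pages))  # Rook (chess)
print(pick_sense("king", king_pages, "tree", ["Tree (graph theory)"]))  # King (monarch)
```

The last call shows the fallback: with no shared term, the first listed entry is used.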

Once the disambiguated pages are found, the relatedness measures can be used. A depth-limited search of depth 4 is performed to find the closest category superordinate to both words. Performance was improved by limiting the depth and searching only categories deeper (more specific) than the 2nd level in the network. Otherwise, all words are marginally related by the fact that they all belong to the category CATEGORY [p186].
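
The depth-limited search with a minimum category level might look like the sketch below. The `parents` and `level` tables are hypothetical toy data, the function names are invented, and choosing the shared category with the smallest combined distance is an assumption about what "closest" means here.

```python
from collections import deque

def ancestors_within(parents, start_categories, max_depth=4):
    """Map each category reachable in at most max_depth upward steps to its distance.
    A page's own categories count as distance 1."""
    dist = {c: 1 for c in start_categories}
    queue = deque(start_categories)
    while queue:
        cat = queue.popleft()
        if dist[cat] >= max_depth:
            continue  # depth limit reached, do not climb further
        for parent in parents.get(cat, ()):
            if parent not in dist:
                dist[parent] = dist[cat] + 1
                queue.append(parent)
    return dist

def closest_common_category(parents, level, cats_a, cats_b, max_depth=4, min_level=2):
    """Shared category with the smallest combined distance, ignoring categories
    at or above min_level (level 0 is the root of the category network)."""
    up_a = ancestors_within(parents, cats_a, max_depth)
    up_b = ancestors_within(parents, cats_b, max_depth)
    shared = [c for c in up_a if c in up_b and level[c] > min_level]
    if not shared:
        return None
    return min(shared, key=lambda c: (up_a[c] + up_b[c], c))

# Hypothetical category chain: Categories > Nature > Life > Animals > Mammals > Dogs/Cats.
parents = {
    "Dogs": ["Mammals"], "Cats": ["Mammals"], "Mammals": ["Animals"],
    "Animals": ["Life"], "Life": ["Nature"], "Nature": ["Categories"],
}
level = {"Categories": 0, "Nature": 1, "Life": 2, "Animals": 3,
         "Mammals": 4, "Dogs": 5, "Cats": 5}

print(closest_common_category(parents, level, ["Dogs"], ["Cats"]))     # Mammals
print(closest_common_category(parents, level, ["Animals"], ["Life"]))  # None
```

The second call returns None because the only shared categories sit at level 2 or above, which is the filter that stops every pair of words from being trivially related through the top of the network.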

The number of overlapping hits on the Google search engine was used as a baseline against which to compare WordNet and Wikipedia. Performance was evaluated as the Pearson correlation between human relatedness judgements and the scores produced by the relatedness measures applied to WordNet and to Wikipedia. Both WordNet and Wikipedia outperformed Google, but the difference between the two was not significant.
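
For readers unfamiliar with the evaluation metric, here is a minimal sketch of the Pearson correlation between two lists of scores. The score lists are invented for illustration and are not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical human ratings and system scores for four word pairs.
human = [3.9, 3.1, 1.2, 0.4]
system = [0.50, 0.33, 0.20, 0.10]
print(round(pearson(human, system), 3))
```

A value near 1 means the system ranks and spaces the word pairs much as the human raters did; a value near 0 means no linear relationship.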

The specific failings of WordNet came from "sense proliferation"; that is, it considers all possible meaning combinations of two polysemic words and uses the pair with the shortest path, even if that pair is not semantically appropriate. By contrast, the Wikipedia search algorithm disambiguates first, then finds the shortest path between the two disambiguated pages. [p190]
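
The contrast between the two strategies can be illustrated with a toy example. The sense labels and path lengths below are invented, and the function names are hypothetical; the sketch only shows that taking the minimum over all sense pairs can return a different (shorter) path than measuring between senses chosen beforehand.

```python
from itertools import product

# Hypothetical path lengths between word senses (invented numbers).
TOY_DISTANCE = {
    ("a1", "b1"): 4,
    ("a1", "b2"): 6,
    ("a2", "b1"): 5,
    ("a2", "b2"): 1,
}

def distance(sense_a, sense_b):
    """Look up the toy path length between two senses."""
    return TOY_DISTANCE[(sense_a, sense_b)]

def all_sense_pairs_distance(senses_a, senses_b):
    """Shortest path over every sense combination, appropriate or not."""
    return min(distance(a, b) for a, b in product(senses_a, senses_b))

def disambiguate_first_distance(chosen_a, chosen_b):
    """Path length between the two senses selected before measuring."""
    return distance(chosen_a, chosen_b)

print(all_sense_pairs_distance(["a1", "a2"], ["b1", "b2"]))  # 1
print(disambiguate_first_distance("a1", "b1"))               # 4
```

If a1 and b1 are the intended senses, the all-pairs strategy overstates relatedness by reporting the path between a2 and b2.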

The particular word databases where Wikipedia outperformed WordNet were ones designed with semantic relatedness in mind, rather than just semantic similarity, which makes sense when the above algorithm techniques are considered. Because of the small word sets used, the authors are not convinced that the comparison is fair. [p191]

Co-reference tasks

A more realistic natural language task is judging whether terms are co-referring.

Using a wide array of semantic features, as well as the relatedness measures from the previous task, the accuracy of WordNet's and Wikipedia's co-reference judgements was analyzed. WordNet uses mainly surface lexical features, while Wikipedia emphasizes semantic features. They performed approximately equally, "which indicates the usefulness of using an encyclopedic knowledge base as a replacement for a lexical taxonomy." [p200]

Additionally, semantic features are more useful for judging common nouns, while surface features "such as string matching and alias suffice" for proper nouns [p200]. However, Wikipedia was still better for proper nouns, because "Wikipedia contains a larger amount of information about named entities than WordNet" [p200].

Other Languages

There is a lack of well-structured word databases for non-English languages. Because Wikipedia's article translations preserve semantic category structure of the links, it can be readily used for relatedness judgements in other languages also. It performed as well as an existing German word database for relatedness judgements.

Summary author's notes:
Page numbers are from the original periodical publication.

Cognitive Science Summaries Webmaster:

JimDavies (jim@jimdavies.org)
