Text simplification mainly involves converting complicated text into a simpler form. This is done by replacing complicated words with simpler, more frequently used synonyms. The catch is that a complicated word will usually have at least two senses (meanings), each with its own synonyms, so this ambiguity needs to be resolved. How to do that is what this paper is about.
As I mentioned in my previous post, Word Sense Disambiguation (WSD) is the task of identifying the intended meaning (sense) of a word in context. In WSD, choosing the most frequent sense for an ambiguous word is a powerful heuristic. In this paper, an information-retrieval-based method ranks senses by issuing queries to an IR (Information Retrieval) engine, which estimates the degree of association between a word and each of its senses.
WSD can be achieved with the help of WordNet and a corpus. So first, what are WordNet and a corpus?
WordNet: a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. Its basic unit is the synset, a synonym set, that is, a collection of synonymous words that together represent one sense.
Corpus: a large body of text.
Method used: Central to the approach is the assumption that context provides important cues regarding a word's meaning. Documents are typically written with certain topics in mind, and these topics are often indicated by the distributional patterns of words.
For example, documents talking about "congressional tenure" are likely to contain words such as "term of office" or "incumbency", whereas documents talking about "legal tenure" (i.e., the right to hold property) are likely to include words such as "right" or "land". So, provided we knew that both of these terms are semantically related to tenure, we could estimate which sense of tenure is most prevalent simply by checking whether tenure co-occurs more often with "term of office" than with "land".
Fortunately, senses in WordNet are represented by sets of synonymous terms. So all we need to do to estimate a word's sense frequencies is to count how often the word co-occurs with each of its synonyms.
The definition of co-occurrence is that two words co-occur if they are attested in the same document. After finding the synonym set, the next step is to find the dominant sense (synonym). This is explained as follows.
Dominant Sense Acquisition:
Throughout the paper we use the term frequency as shorthand for document frequency, that is, the number of documents that contain a word or a set of words (which may or may not be adjacent). For this purpose we use the synsets of WordNet (which I explained in an earlier portion of this paper).
As an example consider the noun "tenure", which has the following senses in WordNet:
(1) Sense 1
tenure, term of office, incumbency (synonym set of tenure)
=> term (hypernym of the above senses)
(2) Sense 2
tenure, land tenure (synonym set of tenure)
=> legal right (hypernym of the above senses)
The senses are represented by the two synsets {tenure, term of office, incumbency} and {tenure, land tenure}. (The hypernyms for each sense are also listed, indicated by the arrows.) We can now approximate the frequency with which a word w1 occurs with sense s by computing its synonym frequencies: for each word S1 in syns(s), the set of synonyms of s, we issue a query of the form w1 AND S1. These synonym frequencies can then be used to determine the most frequent sense of w1 in a variety of ways (detailed below).
So the queries for the above example of tenure will be as follows:
(1) a. "tenure" AND "term of office"
b. "tenure" AND "incumbency"
(2) "tenure" AND "land tenure"
For example, query (1-a) will return the number of documents in which tenure and term of office co-occur. Presumably, tenure is mainly used in its dominant sense in these documents. In the same way, query (2) will return documents in which tenure is used in the sense of land tenure.
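With this definition, a co-occurrence count is just the document frequency of an AND query. Here is a minimal sketch of that counting; the toy documents and the `doc_freq` helper are illustrative stand-ins for a real IR engine, not part of the paper:

```python
def doc_freq(terms, documents):
    """Number of documents that contain every term in `terms`.

    Terms may be multi-word phrases; a simple substring test on the
    lower-cased document stands in for a real IR engine's AND query.
    """
    return sum(
        all(term.lower() in doc.lower() for term in terms)
        for doc in documents
    )

# A toy document collection (illustrative only).
docs = [
    "Her tenure as chair was a long term of office.",
    "The professor's tenure and incumbency were never questioned.",
    "Land tenure disputes concern the right to hold property.",
]

print(doc_freq(["tenure", "term of office"], docs))  # 1
print(doc_freq(["tenure", "land tenure"], docs))     # 1
print(doc_freq(["tenure"], docs))                    # 3
```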
Hypernym frequencies:
Apart from synonym frequencies, we also generate hypernym frequencies by submitting queries of the form w1 AND S1, for each S1 in hype(s), the set of immediate hypernyms of the sense s. The hypernym queries for the two senses of tenure are:
(3) "tenure" AND "term"
(4) "tenure" AND "legal right"
Hypernym queries are particularly useful for synsets of size one, i.e., where a word in a given sense has no synonyms, and is only differentiated from other senses by its hypernyms.
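Both kinds of queries can be generated mechanically from a sense's syns(s) and hype(s) sets. A sketch, with the two senses of tenure hardcoded from the example above rather than read from WordNet:

```python
# The two senses of "tenure" as given in the text; "syns" and "hype"
# mirror the paper's syns(s) and hype(s). Hardcoded for illustration.
SENSES = {
    "sense1": {"syns": ["term of office", "incumbency"], "hype": ["term"]},
    "sense2": {"syns": ["land tenure"], "hype": ["legal right"]},
}

def queries(word, sense, with_hypernyms=False):
    """Build the 'w1 AND S1' queries for one sense of `word`."""
    terms = list(sense["syns"])
    if with_hypernyms:
        terms += sense["hype"]
    return [f'"{word}" AND "{t}"' for t in terms]

print(queries("tenure", SENSES["sense1"]))
# ['"tenure" AND "term of office"', '"tenure" AND "incumbency"']
print(queries("tenure", SENSES["sense2"], with_hypernyms=True))
# ['"tenure" AND "land tenure"', '"tenure" AND "legal right"']
```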
Once the synonym frequencies and hypernym frequencies are in place, we can compute a word's predominant sense in a number of ways.
First way: we can vary how the frequency of a given sense is estimated from the synonym frequencies:
• Sum: The frequency of a given synset (set of synonyms) is computed as the sum of the synonym frequencies.
For example, the frequency of the dominant sense of tenure would be computed by adding up the document frequencies returned by the queries "tenure AND term of office" (1a) and "tenure AND incumbency" (1b).
• Average (Avg): The frequency of a synset is computed by taking the average of synonym
frequencies.
• Highest (High): The frequency of a synset is determined by the synonym with the highest
frequency.
Second way: we can vary whether or not hypernyms are taken into account:
• No hypernyms (−Hyp): Only the synonym frequencies are included when computing the
frequency of a synset.
For example, only the queries "tenure AND term of office" (1a) and "tenure AND incumbency" (1b) are relevant for estimating the dominant sense of tenure.
• Hypernyms (+Hyp): Both synonym and hypernym frequencies are taken into account
when computing sense frequency.
For example, the frequency for the senses of tenure would be computed from the document frequencies returned by the queries "tenure AND term of office" (1a), "tenure AND incumbency" (1b), and "tenure AND term" (3), by summing, averaging, or taking the highest value, as before.
Third way: this option relates to whether the sense frequencies are used in raw or in normalized form:
• Non-normalized (−Norm): The raw synonym frequencies are used as estimates of sense frequencies.
• Normalized (+Norm): Sense frequencies are computed by dividing the word-synonym frequency by the frequency of the synonym in isolation.
For example, the normalized frequency for "tenure AND term of office" (1-a) is computed by dividing the document frequency for "tenure" AND "term of office" by the document frequency
for "term of office". Normalizing takes into account the fact that the members of the synset of a sense may differ in frequency.
Combining these options yields a family of models for acquiring a word's dominant sense. Model selection can be done as follows.
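These choices combine into a small space of scoring functions. Here is a sketch of how the configurations might be computed; the document frequencies below are made up for illustration, standing in for real IR-engine counts:

```python
# Hypothetical document frequencies: co-occurrence counts of "tenure"
# with each synonym, and each synonym's frequency in isolation
# (the latter is only needed for +Norm).
cooc = {"term of office": 120, "incumbency": 80, "land tenure": 30}
alone = {"term of office": 400, "incumbency": 200, "land tenure": 120}

def sense_score(synonyms, how="high", normalize=True):
    """Combine per-synonym frequencies into one sense frequency.

    With +Hyp, the hypernyms would simply be appended to `synonyms`.
    """
    freqs = [
        cooc[s] / alone[s] if normalize else cooc[s]
        for s in synonyms
    ]
    if how == "sum":
        return sum(freqs)
    if how == "avg":
        return sum(freqs) / len(freqs)
    return max(freqs)  # "high"

s1 = sense_score(["term of office", "incumbency"])  # 0.4
s2 = sense_score(["land tenure"])                   # 0.25
print("dominant sense:", "sense 1" if s1 > s2 else "sense 2")
```

With these invented counts the High, +Norm configuration picks sense 1, the "term of office" sense, as dominant.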
The goal is to establish which model configuration is best suited for the WSD task. We thus varied how the overall frequency is computed (Sum, High, Avg), whether hypernyms are included (±Hyp), and whether the frequencies are normalized (±Norm).
For example, the following table shows precision (P) and recall (R) for the Sum, High, and Avg models under each configuration:
| Model | −Norm, +Hyp (P / R) | −Norm, −Hyp (P / R) | +Norm, +Hyp (P / R) | +Norm, −Hyp (P / R) |
|-------|---------------------|---------------------|---------------------|---------------------|
| Sum   | 42.3 / 40.8         | 46.3 / 44.6         | 45.9 / 44.3         | 48.6 / 46.8         |
| High  | 51.6 / 49.8         | 51.1 / 49.3         | 57.2 / 55.1         | 59.7 / 57.9         |
| Avg   | 44.1 / 42.6         | 48.5 / 46.8         | 49.6 / 47.8         | 51.5 / 49.6         |
In sum, the best performing model is High, +Norm, −Hyp, achieving a precision of 59.7% and a recall of 57.9%.
Once the model has been selected, each complicated word is replaced with the dominant sense found by that model. This is how the word sense ranking is obtained: based on the rank, the most dominant sense is chosen to replace the complicated word, and this is done for every sentence of the text. That's it, we get the simplified version of the text :)
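The final substitution step can be sketched as a simple replacement pass, assuming the sense-ranking step has already produced an ordered synonym list per complicated word (the `RANKED` table below is hypothetical):

```python
# Hypothetical output of the sense-ranking step: for each complicated
# word, its synonyms ordered by estimated sense frequency.
RANKED = {
    "incumbency": ["term of office", "tenure"],
    "commence": ["begin", "start"],
}

def simplify(sentence):
    """Replace each known complicated word with its top-ranked synonym.

    Tokenisation and punctuation handling are deliberately naive:
    punctuation is stripped only for lookup, and replacements keep
    the surrounding tokens untouched.
    """
    out = []
    for token in sentence.split():
        word = token.strip(".,").lower()
        if word in RANKED:
            out.append(RANKED[word][0])
        else:
            out.append(token)
    return " ".join(out)

print(simplify("Her incumbency will commence soon."))
# Her term of office will begin soon.
```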