Monday, January 31, 2011

Synonymy in Collocation Extraction

This paper was published by Darren Pearce et al., School of Cognitive and Computing Sciences (COGS), University of Sussex.
It describes collocation extraction using WordNet and concentrates mainly on the methods employed for extracting collocations. Before going through the details, let us take a quick look at what collocations are.

A collocation is two or more words that often go together. These combinations just sound "right" to native English speakers, who use them all the time.
Ex: fast food, quick meal, strong coffee, etc. It's just a coincidence that all my examples are about food! :P

Complicated paper definition: A pair of words is considered a collocation if one of the words significantly prefers a particular lexical realisation of the concept the other represents.

The exact definition of a collocation varies from one author to another. It is often described simply as a habitual word combination.

Techniques:
Collocations may have other words in between. For example: "I break down the door"; "I broke down the door"; "I broke down the battered, old door". All of these contain the collocation (break-down, door) with a varying number of words in between.

1. Church and Hanks (1990): This technique uses mutual information to measure the strength of association between words. Potentially, this could be used directly for collocation extraction.
Problems: This leads to some strange collocations such as (doctor, hospital). Just because two words frequently occur together does not mean they form a collocation.
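To make the mutual information idea concrete, here is a minimal sketch in Python. It computes the usual pointwise mutual information, log2(P(x,y) / (P(x)P(y))), from raw counts. The function name `pmi_scores`, the `window` parameter and the toy sentence are my own assumptions for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def pmi_scores(tokens, window=1):
    """Pointwise mutual information for ordered word pairs.

    A pair (x, y) is counted when y follows x within `window` tokens.
    Probabilities are simple relative frequencies over the token count.
    """
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pair_counts[(x, y)] += 1

    scores = {}
    for (x, y), c_xy in pair_counts.items():
        p_xy = c_xy / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy text, invented for this example only.
tokens = "strong coffee and strong tea but powerful computer and strong coffee".split()
scores = pmi_scores(tokens, window=1)
for pair in [("strong", "coffee"), ("powerful", "computer")]:
    print(pair, round(scores[pair], 3))
```

Note that a high score only says the two words occur together more often than chance would predict, which is exactly why a pair like (doctor, hospital) can score well without being a collocation.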

2. Smadja (1993): To overcome the above problem, the relationship between the two words is judged from the graph of the distribution of counts over the distances between the two collocates. If the distribution is narrow and peaked, this is an indication that there is a syntactic relation between the two words.
Problems: This is only an implicit strategy for extracting syntactic information.
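The "narrow, peaked" idea can be sketched as follows. The code builds a histogram of the signed distances at which one word appears relative to the other, and then uses the variance of the counts across distances as a rough peakedness score. This is my own simplification for illustration; the function names, the `max_dist` parameter, the toy sentences and the exact statistic are assumptions, not Smadja's actual implementation.

```python
from collections import Counter

def offset_histogram(sentences, w1, w2, max_dist=5):
    """Count how often w2 appears at each signed distance from w1."""
    hist = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok != w1:
                continue
            for j, other in enumerate(toks):
                d = j - i
                if other == w2 and d != 0 and abs(d) <= max_dist:
                    hist[d] += 1
    return hist

def spread(hist, max_dist=5):
    """Variance of the counts over all distances -max_dist..max_dist.

    With the same total count, a histogram concentrated on one
    distance scores higher than one smeared over many distances.
    """
    offsets = [d for d in range(-max_dist, max_dist + 1) if d != 0]
    counts = [hist.get(d, 0) for d in offsets]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

# Toy sentences, invented for this example only.
peaked = offset_histogram(
    ["she drinks strong coffee", "strong coffee keeps me awake", "i like strong coffee"],
    "strong", "coffee")
flat = offset_histogram(
    ["the doctor left the hospital", "the hospital hired a new doctor",
     "a doctor works in this big hospital"],
    "doctor", "hospital")
print(spread(peaked) > spread(flat))
```

In the toy data, "coffee" always sits right after "strong", while "hospital" turns up at a different distance from "doctor" each time, so the first pair gets the higher score.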

3. New Approach: This approach makes use of the WordNet database. With respect to a particular target word, it is possible to partition a synonym set (synset) into three disjoint subsets:
  • Those words which are collocations of the target word;
  • Those words which tend not to be used with the target word although, if used, do not lead to unnatural readings;
  • Those words which must not be used with the target word since they will lead to unnatural readings.
The words in this last subset are called anti-collocations.
The paper discusses examples that lead to a classification of potential collocations into 4 categories: collocation, potential, unknown and wrong.
It then derives several formulas to measure collocation strength.
Formulation:
  • Occurrence count, c(w1,w2): This returns the number of occurrences of word w1 in combination with word w2.
  • Co-occurrence set, CSw: This is defined for a given word w; the condition used is that at least two elements in the synset have non-zero co-occurrence counts with w.
  • Candidate Collocation Synsets, CCSw: Synsets are filtered with respect to w to obtain CCSw.
  • For a synset S belonging to CCSw, the word w1 that co-occurs most frequently with the word w is selected. Its corresponding frequency is f1.
  • The highest co-occurrence frequency among the remaining words, f11, is then calculated.
  • Collocation strength: The difference between the occurrence counts of these two top-ranked elements, f1 - f11, is used to rate 'collocation strength', s, in the following way: s = (f1 - f11) / f1.
Please refer to the paper for the exact formulas behind the terms mentioned above.
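The steps listed above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names `c` and `collocation_strength`, the toy synset and the toy counts are invented here, and a real run would take its counts from a corpus and its synsets from WordNet.

```python
def collocation_strength(target, synset, count):
    """Return (best_word, s) with s = (f1 - f11) / f1, or None.

    `count(w1, w2)` plays the role of the occurrence count c(w1, w2).
    A synset is treated as a candidate only when at least two of its
    words have non-zero counts with the target word.
    """
    freqs = sorted(((count(w, target), w) for w in synset), reverse=True)
    nonzero = [f for f, _ in freqs if f > 0]
    if len(nonzero) < 2:
        return None  # filtered out: not a candidate collocation synset
    (f1, w1), (f11, _) = freqs[0], freqs[1]
    return w1, (f1 - f11) / f1

# Toy counts, invented for this example only.
toy_counts = {("strong", "coffee"): 20, ("powerful", "coffee"): 2}

def c(w1, w2):
    """Hypothetical occurrence count lookup over the toy data."""
    return toy_counts.get((w1, w2), 0)

result = collocation_strength("coffee", ["strong", "powerful", "potent"], c)
print(result)
```

With these toy numbers f1 = 20 and f11 = 2, so s = 18/20 = 0.9. A value close to 1 means the target word strongly prefers one member of the synset, and a value close to 0 means the top two members are used about equally often.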

Conclusions and Future Work
This paper discusses several techniques for collocation extraction, among which the new technique makes extensive use of WordNet, a resource that also includes lexical relations such as hypernymy and meronymy.
Drawbacks:
The formulation above assumes that for any synset there is one and only one element that forms a collocation with a particular target word. This is not always the case: as the three-way partition above shows, a synset can contain several collocations of the target word as well as anti-collocations. Such situations could also be accounted for by a probabilistic approach.

1 comment:

  1. Bhuvan great blog...


    Include the paper name, publishers and authors either at the end or at the beginning.

    Justify the text
    Please replace "i" by "I" :-)
