Sunday, March 6, 2011

Text Summarization Using Lexical Chains

Authors: Meru Brunn, Yllias Chali, Christopher J. Pinchak

So far the posts here have been about "Text Simplification", so you must be wondering what relevance "Text Summarization" has to our project. I suggest you read on; by the end you will be convinced of the relevance. So let's begin with the details.

What is Text Summarization
Summarization is the process of condensing a source text into a shorter version while preserving its information content. It is very useful for a reader who has no time to read a whole paper but wants to know whether it is relevant to him. Legal documents, for example, are usually lengthy and full of jargon, which makes them difficult to understand; a summarization tool can be of great help in such situations.

Introduction
Summarization is usually done by extracting important sentences from the source text and compiling them into a coherent summary. In this paper, the authors provide an algorithm that identifies important sentences by forming lexical chains.

The overall architecture of the system is shown in Figure 1. It consists of several modules organized as a pipeline.






Preprocessing

1. Segmentation: To start the summarization process, the original text is first sent to the text segmenter, whose role is to divide the given text into segments that each address a single topic. This segmentation allows later modules to better analyze the source text and generate a summary.

2. Tagging: This module performs part-of-speech tagging. Words are considered individually; the semantic structure is not taken into account.

3. Parsing: In this module, tagged words are collected and organized into their syntactic structure. We can then select various components (or phrases) depending on their syntactic position within a sentence. For example, we could decide to select all noun phrases within a given text; finding them is a trivial task using a parsed representation of the text. Since the parser and tagger are not entirely compatible with respect to input/output, the tagger output is refined so that it becomes compatible with the parser. For example, the parser expects tagged words of the form 'word TAG', while the tagger outputs the form 'word_TAG', so the underscore is simply removed.
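The underscore-stripping conversion described above is simple enough to sketch in a few lines of Python (the tag names below are just illustrative, not the actual tagsets the authors used):

```python
def tagger_to_parser_format(tagged_tokens):
    """Convert 'word_TAG' tokens into the 'word TAG' form the parser expects."""
    converted = []
    for token in tagged_tokens:
        # Split on the LAST underscore so words containing '_' survive intact.
        word, tag = token.rsplit("_", 1)
        converted.append(f"{word} {tag}")
    return converted

print(tagger_to_parser_format(["The_DT", "explosion_NN", "occurred_VBD"]))
# -> ['The DT', 'explosion NN', 'occurred VBD']
```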

4. Noun filtering: Noun filtering improves the accuracy of text summarization by selectively removing nouns from the parsed text. These nouns come from the source text and are identified by the tagger. However, some nouns contribute to the subject of the text while others detract from it.
       Consider an analogy to analogue data transmission. During data transmission, there is both a signal component and a noise component. Data transmission conditions are ideal when there is a strong signal and low noise. It is when the signal is overcome by noise that it becomes difficult to detect. This is similar to the presence of nouns within the source text. Those nouns that contribute to the subject of the text are part of the ’signal’, and those that do not are part of the ’noise’. The noun filter’s job is to reduce the ’noise’ nouns while still retaining as many ’signal’ nouns as possible. 
There are a number of heuristics that could be used to filter out the 'noise' nouns. The authors designed one around the idea that nouns contained within subordinate clauses are less useful for topic detection than those contained within main clauses. However, main and subordinate clauses are not easily identified, so for their system they chose a relatively simple heuristic: in the first sub-sentence of each sentence, the first noun phrase and the noun phrase included in the first verb phrase are treated as the main clause, with all other phrases treated as subordinate.
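Here is a small Python sketch of that heuristic. I represent parses as nested (label, children...) tuples for illustration; the real system works over full parser output, and the tag names are my own assumptions:

```python
def find_first(tree, label):
    """Depth-first search for the first subtree with the given label."""
    node_label, *children = tree
    if node_label == label:
        return tree
    for child in children:
        if isinstance(child, tuple):
            found = find_first(child, label)
            if found is not None:
                return found
    return None

def _leaves(tree):
    """Yield (tag, word) pairs for the leaves of a subtree."""
    node_label, *children = tree
    for child in children:
        if isinstance(child, tuple):
            yield from _leaves(child)
        else:
            yield node_label, child

def main_clause_nouns(sentence_tree):
    """Keep nouns from the first NP and from the NP inside the first VP."""
    kept = []
    first_np = find_first(sentence_tree, "NP")
    first_vp = find_first(sentence_tree, "VP")
    np_in_vp = find_first(first_vp, "NP") if first_vp else None
    for np in (first_np, np_in_vp):
        if np is not None:
            kept.extend(w for tag, w in _leaves(np) if tag.startswith("NN"))
    return kept

tree = ("S",
        ("NP", ("DT", "The"), ("NN", "explosion")),
        ("VP", ("VBD", "destroyed"),
         ("NP", ("DT", "the"), ("NN", "building"))))
print(main_clause_nouns(tree))
# -> ['explosion', 'building']
```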

5. Lexical chainer:
The steps of the algorithm for lexical chain computation are as follows:
  • We select the set of candidate words. A candidate word is an open-class word that functions as a noun phrase or proper name, as produced by the noun filtering process.

  • All senses of each candidate word are considered, as obtained from a thesaurus; in this experiment the authors used WordNet. Each word sense is represented by distinct sets considered as levels: the first level is the set of synonyms and antonyms, the second is the set of first hypernyms/hyponyms and their variations (i.e., meronyms/holonyms, etc.), and so on.
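To make the levelled sense representation concrete, here is a toy sketch that uses a tiny hand-made thesaurus in place of WordNet; the words, senses and relation names are all my own invented examples:

```python
# word -> list of senses; each sense lists its related words per relation.
MINI_THESAURUS = {
    "bank": [
        {"synonyms": {"depository"}, "antonyms": set(),
         "hypernyms": {"institution"}, "hyponyms": {"credit_union"}},
        {"synonyms": {"riverside"}, "antonyms": set(),
         "hypernyms": {"slope"}, "hyponyms": set()},
    ],
}

def sense_levels(word):
    """Return, per sense, the level sets: level 1 = synonyms/antonyms,
    level 2 = first hypernyms/hyponyms (and similar relations)."""
    levels = []
    for sense in MINI_THESAURUS.get(word, []):
        level1 = sense["synonyms"] | sense["antonyms"]
        level2 = sense["hypernyms"] | sense["hyponyms"]
        levels.append([level1, level2])
    return levels

print(sense_levels("bank"))
```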

  • They find the semantic relatedness among the sets of senses according to these representations. If the sense representations of two distinct words match, the words are said to be semantically related. Each semantic relationship is associated with a measure indicating the length of the path taken in the matching, with respect to the levels of the two compared sets.

  • They build up chains, which are sets of the form

                  C = {w1, w2, . . . , wn}

    in which each wi is semantically related to each wj, for i ≠ j (1 ≤ i, j ≤ n).

  • We retain the longest chains by relying on the following preference criterion:
                       word repetition >> synonym/antonym . . .


         In this implementation, the preference is handled by assigning a score to each pairwise semantic relation in the chain and then summing those pairwise scores. Hence, the score of a chain depends on both its length and the types of relationships among its members.

In the lexical chaining method, each word-sense has to be semantically related to every other word-sense in the chain. The order of the open class words in the document does not play a role in the building of chains. However, it turned out that the number of lexical chains could be extremely large, and thus problematic, for larger segments of text. To cope with this, they reduced the word-sense representation to synonyms only when they had long text segments. Lexical chains are computed for each text segment.
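The chain-building and scoring steps can be sketched in Python. The relation test and the weights below are my own toy choices (the paper only states the preference ordering, with word repetition outranking synonymy/antonymy and weaker relations); a sense here is a (word, [level1_set, level2_set]) pair as in the levelled representation above:

```python
REL_SCORES = {"repetition": 3.0, "synonym": 2.0, "hypernym": 1.0}

def relation(sense_a, sense_b):
    """Toy relatedness test between two (word, level-sets) senses."""
    word_a, levels_a = sense_a
    word_b, levels_b = sense_b
    if word_a == word_b:
        return "repetition"
    if levels_a[0] & ({word_b} | levels_b[0]):   # shared or direct synonym
        return "synonym"
    if levels_a[1] & levels_b[1]:                # shared hypernym/hyponym
        return "hypernym"
    return None

def build_chains(senses):
    """Greedy chaining: a sense joins a chain only if it is related to
    EVERY existing member, as the lexical chaining method requires."""
    chains = []
    for sense in senses:
        placed = False
        for chain in chains:
            if all(relation(sense, member) for member in chain):
                chain.append(sense)
                placed = True
        if not placed:
            chains.append([sense])
    return chains

def chain_score(chain):
    """Sum the pairwise relation scores over all members of the chain."""
    total = 0.0
    for i, a in enumerate(chain):
        for b in chain[i + 1:]:
            rel = relation(a, b)
            if rel:
                total += REL_SCORES[rel]
    return total
```

With senses for "car", "auto", "car" this yields a single chain whose score combines one repetition and two synonym links, so longer, more tightly related chains score higher.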

6. Sentence extraction: Each sentence is ranked with reference to the lexical cohesion scores it collects. The objective of the ranking process is to assess the importance of each score and to combine all scores into a rank for each sentence. A threshold specifies the minimal number of links required for a sentence to be considered lexically cohesive; ranking a sentence then involves summing the lexical cohesion scores associated with it that are above the threshold.
Each sentence is ranked by summing the number of shared chain members over the sentence. More precisely, the score for sentence i is the number of words that belong to sentence i and also to the chains considered in the segment selection phase. The summary consists of the top-scoring sentences, selected according to the desired compression ratio and ordered according to their appearance in the source text.
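A minimal sketch of this ranking step, assuming sentences are token lists and the chain members are a simple word set (the threshold value and the compression handling are my own simplifications):

```python
def rank_sentences(sentences, chain_words, threshold=1):
    """Score each sentence by how many of its words belong to the retained
    chains, keeping only sentences that meet the minimum-links threshold."""
    scored = []
    for i, sentence in enumerate(sentences):
        score = sum(1 for word in sentence if word in chain_words)
        if score >= threshold:
            scored.append((score, i))
    # Highest-scoring first; ties broken by position in the source text.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored

def summarize(sentences, chain_words, ratio=0.5):
    """Keep the top-scoring sentences per the compression ratio,
    re-ordered by their appearance in the source text."""
    ranked = rank_sentences(sentences, chain_words)
    keep = max(1, int(len(sentences) * ratio))
    chosen = sorted(i for _, i in ranked[:keep])  # restore source order
    return [sentences[i] for i in chosen]
```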


Evaluation
They participated in the single-document DUC evaluation. The task consisted of creating, for a given document, a generic summary with a length of approximately 100 words. Thirty sets of approximately 10 documents each were provided as system input. According to their analysis, the results seem promising: the grammaticality of their summaries scored an average of 3.73/4, while the cohesion and organization scores averaged 2.55/4 and 2.66/4, respectively.


Conclusions and Future Work
This paper presents an efficient implementation of the lexical cohesion approach as the driving engine of the summarization system. The ranking procedure, which handles the text 'aboutness' measure, is used to select the most salient and best-connected sentences in a text, corresponding to the summary ratio requested by the user. In the future, the authors plan to investigate the following problems:

  • Their methods extract whole sentences as single units. The use of compression techniques will increase the condensation of the summary and improve its quality.
  • Their summarization method uses only lexical chains as representations of the source text. Other clues could be gathered from the text and considered when generating the summary.
  • In the noun filtering process, their hypothesis eliminates the terms in subordinate clauses. Rather than eliminating them, it may also prove fruitful to investigate weighting terms according to the kind of clause in which they occur.

The concept of a lexical chain is similar to our idea of constructing the Markov chain matrix. We can treat this matrix as a source of sense disambiguation and as a tool that tells us the "about"ness of the text, helping us simplify it in the right manner.



P.S.: I have used an equation in my post, and I feel proud to say that I learnt it from one of our Sir's blog posts. You can also refer to it by visiting the blog Academic Me!    :-)




Wednesday, March 2, 2011

Complex Lexico-Syntactic Reformulation of Sentences using Typed Dependency Representations





Author: Advaith Siddharthan, Department of Computing Science, University of Aberdeen



Most authors choose one formulation of a sentence over another in order to avoid shifts in focus, to handle issues of salience and end weight, and to account for differences in readers' skills and domain knowledge. This paper is all about an approach to automating complex reformulation. Reformulating complex sentences allows them to be better understood by readers with low literacy levels.


Let us consider the following four discourse markers for causation studied by the author, which differ in their lexico-syntactic properties: "cause", "because of", "because" and "cause of".

Example (1):
a. An incendiary device caused the explosion. [A-CAUSE-B] (here A is "an incendiary device" and B is "the explosion")
b. The explosion occurred because of an incendiary device. [B-BECAUSEOF-A]
c. The explosion occurred because there was an incendiary device. [B-BECAUSE-A]
d. The cause of the explosion was an incendiary device. [CAUSEOF-B-A]

The discourse markers can be verbs, prepositions, conjunctions and nouns. Additionally, the order of presentation can also be varied, giving the following four more forms:
e. The explosion was caused by an incendiary device. [B-CAUSEBY-A]
f. Because of an incendiary device, the explosion occurred. [BECAUSEOF-A-B]
g. Because there was an incendiary device, the explosion occurred. [BECAUSE-A-B]
h. An incendiary device was the cause of the explosion. [A-CAUSEOF-B]

From the above example it is clear that some formulations of given content are more felicitous than others; e.g., "The explosion was caused by an incendiary device" (1e) is preferable to "Because there was an incendiary device, the explosion occurred" (1g).

Related work on text reformulation:
1. Discourse Connectives and Comprehension
This work involved manual reformulation of complex sentences: the sentences were rewritten by hand to make the language more accessible or the content more transparent.
Drawback: The manual reformulation depended on how each person interpreted the text.
For example (2):
a. Because Mexico allowed slavery, many Americans and their slaves moved to Mexico during that time.
b. Many Americans and their slaves moved to Mexico during that time, because Mexico allowed slavery.
The (b) version of this example would be preferred for children who can grasp causation, but who have not yet become comfortable with alternative clause orders.

2. Connectives and Text (Re)Generation
Much of the work on (re)generation of text based on discourse connectives aims to simplify text in certain ways, to make it more accessible to particular classes of readers. The PSET technique, about which I have already blogged, considered simplifying news reports for aphasic readers. That work focused mainly on lexical simplification, replacing difficult words with simpler ones; its syntactic simplification was restricted to string substitution and sentence splitting based on pattern matching over chunked text. The technique in this paper aims to extend these strands of research by allowing more sophisticated insertion, deletion, substitution, reorganization and modification of content within a sentence.
Drawback: To date, these systems do not consider syntactic reformulations of the type we are interested in.
3. Sentence Compression
Sentence compression is a research area that aims to shorten sentences in order to summarize their main content. Approaches to sentence compression focus on deletion operations, mostly performed low down in the parse tree to remove modifiers.
Drawback: Given their focus on compression, these systems restrict themselves to local transformations near the bottom of the parse tree.

Regeneration using Transfer Rules
In this section, let us first describe the data, and then report the authors' experience with performing text reformulation using these representations.
DATA:
They use a corpus containing examples of complex lexico-syntactic reformulations such as those in example (1) above. The corpus contains 144 such examples.

1. Reformulation using Phrasal Parse Trees:
The following parse trees show the passive and active voice with "cause" as the verb. A transfer rule is derived by aligning nodes between the two parse trees, so that the rule contains only the differences in structure between them.

Passive voice: The explosion was caused by an incendiary device.
(S
(NP (AT The) (NN1 explosion))
(VP (VBDZ be+ed)
(VP (VVN cause+ed)
(PP (II by)
(NP (AT1 an) (JJ incendiary) (NN1 device))))))
Active voice: An incendiary device caused the explosion.
(S
(NP (AT1 An) (JJ incendiary) (NN1 device))
(VP (VVD cause+ed)
(NP (AT the) (NN1 explosion))))


Derived Rule:
(S
(??X0[NP])
(VP (VBZ be+s)
(VP(VVN cause+ed) (PP(II by+) (??X1[NP])))))
(S
(??X1[NP])
(VP (VVZ cause+s) (??X0[NP])))


In the derived rule, a variable such as ??X0[NP] maps onto any node (subtree) with the label NP; in this example, ??X0 matches the NP "the explosion" and ??X1 matches the NP "an incendiary device".
Drawback: In practice, however, the parse-tree representation is too dependent on the grammar rules employed by the parser.
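The variable-matching idea behind such transfer rules can be sketched in Python. Trees are nested (label, children...) tuples, "??X"-style variables bind whole subtrees, and the source/target patterns below are my own simplified version of the derived rule, not the authors' exact grammar:

```python
def match(pattern, tree, bindings):
    """Match a pattern tree against a parse tree, binding ??X variables."""
    if isinstance(pattern, str) and pattern.startswith("??"):
        bindings[pattern] = tree
        return True
    if isinstance(pattern, str) or isinstance(tree, str):
        return pattern == tree
    if pattern[0] != tree[0] or len(pattern) != len(tree):
        return False
    return all(match(p, t, bindings) for p, t in zip(pattern[1:], tree[1:]))

def instantiate(template, bindings):
    """Build the output tree by substituting bound variables."""
    if isinstance(template, str):
        return bindings.get(template, template)
    return tuple(instantiate(part, bindings) for part in template)

# Simplified source/target patterns in the spirit of the derived rule above.
SOURCE = ("S", "??X0",
          ("VP", ("VBD", "was"),
           ("VP", ("VBN", "caused"), ("PP", ("IN", "by"), "??X1"))))
TARGET = ("S", "??X1", ("VP", ("VBD", "caused"), "??X0"))

passive = ("S", ("NP", ("DT", "The"), ("NN", "explosion")),
           ("VP", ("VBD", "was"),
            ("VP", ("VBN", "caused"),
             ("PP", ("IN", "by"),
              ("NP", ("DT", "an"), ("JJ", "incendiary"), ("NN", "device"))))))

bindings = {}
assert match(SOURCE, passive, bindings)
active = instantiate(TARGET, bindings)
# 'active' is now the S-tree for "An incendiary device caused the explosion."
```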

2. Reformulation using MRS (Minimal Recursion Semantics):
This representation provides another option: use a bi-directional grammar and perform the transforms at a semantic level.
Consider a very short example for ease of illustration:
Tom ate because of his hunger.
The MRS representation of the above sentence is shown below
named(x5,Tom), _eat_v_1(e2,x5),_because_of(e2,x11), poss(x11,x16),pron(x16), _hunger_n(x11)
This technique treats "because of" as a multi-word expression and assigns it a semantic representation comparable to that of a preposition. The applicable rule is as follows:
_because_of(e,x), P(e,y) <-> _cause_v_1(e10,x,y,l1), l1:P(e,y)
Here P is to be understood as a general predicate. After applying the rule, the example becomes:
His hunger caused Tom to eat.
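A toy sketch of applying such a rewrite over a flat list of MRS predicates. The variable names and the placeholder handle/label ("e10", "l1") are illustrative; a real MRS transfer system tracks handles and scoping properly:

```python
def apply_because_of_rule(predicates):
    """_because_of(e, x) combined with P(e, y) -> _cause_v_1(e10, x, y, l1)."""
    preds = list(predicates)
    for pred in list(preds):
        if pred[0] != "_because_of":
            continue
        _, e, x = pred
        # Find the predicate P that shares the event variable e.
        for other in preds:
            if other is not pred and len(other) >= 3 and other[1] == e:
                y = other[2]
                preds.remove(pred)
                # New cause predicate; "e10" and label "l1" are placeholders.
                preds.append(("_cause_v_1", "e10", x, y, "l1"))
                break
    return preds

mrs = [("named", "x5", "Tom"), ("_eat_v_1", "e2", "x5"),
       ("_because_of", "e2", "x11"), ("poss", "x11", "x16"),
       ("pron", "x16"), ("_hunger_n", "x11")]
print(apply_because_of_rule(mrs))
```

Generating from the rewritten predicate list would yield "His hunger caused Tom to eat."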
Drawback: The problem encountered, however, is that bidirectional grammars fail to parse ill-formed input, and will also fail to analyse some well-formed input because of limitations in their coverage of unusual constructions.


Reformulation using Typed Dependencies
Let us consider the following example
The explosion was caused by an incendiary device.
The set of dependencies represents a tree. While phrase-structure trees represent the nesting of constituents with the actual words at the leaf nodes, dependency trees have words at every node.
To generate from a dependency tree, we need to know the order in which to process nodes. In general, tree traversal will be "inorder"; i.e., left subtrees are processed before the root and right subtrees after. These generation decisions would usually be guided by the type of dependency and by statistical preferences for word and phrase order; here, however, we can simply use the word positions (1–8) from the original sentence.
The first transformation is that one list of predicates is replaced by another. Applying this transformation creates a new dependency tree:

Thus the transformation rules, in addition to deletion and insertion operations, also need to specify the tree traversal order. These specifications only need to be provided for nodes where the transform has reordered subtrees ("??X0", which instantiates to "cause+ed:4" in the trees). The rule would thus include:

3. Traversal Order Specifications:
(a) Node ??X0: [??X2, ??X0, ??X3]
This states that for node ??X0, the traversal order should be subtree ??X2, followed by the current node ??X0, followed by subtree ??X3. Using this specification allows us to traverse the tree using the original word order for nodes with no order specification, and the specified order where a specification exists. In the above instance, this would lead us to generate:

An incendiary device caused the explosion.
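The position-based traversal with per-node order overrides can be sketched as follows. The tree below is my own hand-made version of the transformed dependency tree (with "was" and "by" already deleted), and keying overrides by word is a simplification:

```python
class Node:
    def __init__(self, word, pos, children=()):
        self.word, self.pos, self.children = word, pos, list(children)

def generate(node, order_overrides):
    """Traverse by original word position, unless an override lists an
    explicit order for the node itself ('self') and its subtrees."""
    if node.word in order_overrides:
        parts = []
        for item in order_overrides[node.word]:
            parts.append(node.word if item == "self"
                         else generate(item, order_overrides))
        return " ".join(parts)
    # Default: the node and its subtrees sorted by source position.
    items = sorted(node.children + [node], key=lambda n: n.pos)
    return " ".join(n.word if n is node else generate(n, order_overrides)
                    for n in items)

# "The explosion was caused by an incendiary device." (positions 1-8),
# after deleting "was" and "by" in the transform.
the = Node("The", 1)
explosion = Node("explosion", 2, [the])
an = Node("an", 6)
incendiary = Node("incendiary", 7)
device = Node("device", 8, [an, incendiary])
caused = Node("caused", 4, [explosion, device])

# Traversal-order specification for the reordered node (??X0 in the rule):
overrides = {"caused": [device, "self", explosion]}
print(generate(caused, overrides))
# -> "an incendiary device caused The explosion"  (modulo capitalization)
```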

Tuesday, March 1, 2011

Motivations and Methods for Text Simplification

Helloo..

The authors of the above paper are R. Chandrasekar, Christine Doran and B. Srinivas.

As the title suggests the paper talks about the methods and reasons for Text Simplification.

They say that to simplify a sentence we need an idea of its structure, in order to identify the components to be separated out. A parser could be used to obtain the complete structure of the sentence, but since parsers are prone to errors on long and complex sentences, they use two alternatives to a full parser for simplification.

The first approach uses a Finite State Grammar (FSG) to produce noun and verb groups, while the second uses a supertagging model to produce dependency linkages.

Now let us discuss the reasons for text simplification:
1) Simple sentences are easy for both programs and users to process.
2) Simple sentences are easy to parse because they involve less ambiguity.
3) Simple sentences improve the quality of machine translation.
4) Information retrieval becomes easier, i.e., only the specific relevant sentences are retrieved in response to queries.
5) Simplification can be used to weed out irrelevant text with greater precision, and thus aid summarization.
6) It improves the clarity of the text.

The simplification process is a two-step procedure: first obtain the structure of the sentence, then apply simplification rules to that structure to identify the components that can be simplified.

In order to simplify, one needs to identify the articulation points, i.e. the points where the sentence can be logically split. Possible articulation points include the beginnings and ends of phrases, punctuation marks, subordinating and coordinating conjunctions, and relative pronouns.

These articulation points define a set of rules that map an original sentence pattern to simpler sentence patterns; the rules are applied repeatedly until none is applicable.
ex:
Talwinder Singh, who masterminded the Kanishka crash in 1984, was killed in a fierce two-hour encounter...
Talwinder Singh was killed in a fierce two-hour encounter... Talwinder Singh masterminded the Kanishka crash in 1984.


FSG based Simplification:

Here we consider sentences as sequences of word groups, or chunks, and treat the chunk boundaries as articulation points. Chunking lets us determine the syntax of the sentence and state simplification rules at a coarser granularity, since we need no longer be concerned with the internal structure of the chunks.

Each chunk is a word group consisting of a verb phrase or a noun phrase, with some attached
modifiers. The noun phrase recognizer also marks the number (singular/plural) of the phrase. The verb phrase recognizer provides some information on tense, voice and aspect.

The chunked sentences are then simplified using a set of ordered simplification rules.


An example rule that simplifies sentences with a relative pronoun:

X:NP, RelPron Y, Z -> X Z . X:NP Y

The rule is interpreted as follows: if a sentence starts with a noun phrase (X:NP), followed by a phrase containing a relative pronoun, of the form (RelPron Y,), followed by some (Z), where Y and Z are arbitrary sequences of words, then the sentence may be simplified into two sentences, namely the sequence (X) followed by (Z), and (X) followed by (Y). The resulting sentences are then recursively simplified, to the extent possible.
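A crude Python sketch of this rule over chunked text. Chunks are (label, text) pairs, the relative-pronoun list is my own, and the comma-based clause-boundary detection is far simpler than the authors' FSG:

```python
RELPRONS = {"who", "which", "that"}

def simplify_relative(chunks):
    """X:NP, RelPron Y, Z -> X Z. X Y  (applied once, at the first match)."""
    if not chunks or chunks[0][0] != "NP":
        return None
    x = chunks[0][1]
    for i, (label, text) in enumerate(chunks[1:], start=1):
        if text.split()[0].lower() in RELPRONS:
            # Y runs up to the comma-delimited end of the relative clause.
            for j in range(i, len(chunks)):
                if chunks[j][1].endswith(","):
                    y = " ".join(t for _, t in chunks[i:j + 1]).rstrip(",")
                    z = " ".join(t for _, t in chunks[j + 1:])
                    y = " ".join(y.split()[1:])  # drop the relative pronoun
                    return f"{x} {z}. {x} {y}."
    return None

chunks = [("NP", "Talwinder Singh"),
          ("RC", "who masterminded the Kanishka crash in 1984,"),
          ("VP", "was killed in a fierce two-hour encounter")]
print(simplify_relative(chunks))
```

Run on the chunked Talwinder Singh example, this reproduces the two simplified sentences shown earlier.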

A Dependency-Based Model:
This model is based on the simple dependency representation provided by LTAG (Lexicalized Tree Adjoining Grammar).

LTAG: An LTAG contains elementary trees of two kinds: initial trees and auxiliary trees.
Initial trees include nouns, PPs, simple sentences, etc.
Auxiliary trees include relative clauses, adverbials, etc.

Supertagging: LTAG localizes dependencies, so that dependent elements are present in the same elementary tree. As a result of this localization, a lexical item may be associated with more than one elementary tree; we call these elementary trees supertags. A trigram model is used to disambiguate the supertags, so as to assign one supertag to each word, in a process called supertagging.
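A toy sketch of trigram-based supertag disambiguation. The supertag names, candidate sets and probabilities below are all invented for illustration; a real supertagger trains these on a treebank and uses dynamic programming rather than brute-force enumeration:

```python
import itertools

CANDIDATES = {
    "the": ["determiner"],
    "man": ["noun"],
    "saw": ["transitive_verb", "noun"],  # ambiguous lexical item
}

# P(tag | previous two tags), backed off to a small default probability.
TRIGRAM_P = {
    ("determiner", "noun", "transitive_verb"): 0.6,
    ("determiner", "noun", "noun"): 0.1,
}

def supertag(words):
    """Pick the supertag sequence with the highest trigram probability."""
    best_seq, best_p = None, -1.0
    options = [CANDIDATES[w] for w in words]
    for seq in itertools.product(*options):
        padded = ("<s>", "<s>") + seq
        p = 1.0
        for i in range(len(seq)):
            p *= TRIGRAM_P.get(padded[i:i + 3], 0.01)
        if p > best_p:
            best_seq, best_p = seq, p
    return list(best_seq)

print(supertag(["the", "man", "saw"]))
# -> ['determiner', 'noun', 'transitive_verb']
```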


To establish the dependency links among the words of a sentence, they exploit the dependency information present in the supertags. Each supertag associated with a word allocates slots for the arguments of that word; these slots have a polarity value reflecting their orientation with respect to the anchor of the supertag. Also associated with a supertag is a list of the internal nodes that appear in it. Using this information, a simple algorithm can annotate the sentence with dependency links.


EVALUATION:
The objective of the evaluation is to examine the advantages of the DSM over the FSG-based model for simplification. In the FSG approach, since the input to the simplifier is a set of noun and verb groups, the rules have to identify basic predicate-argument relations to ensure that the right chunks remain together in the output. The simplifier in the DSM, by contrast, has access to information about argument structure, which makes it much easier to specify simplification patterns involving complete constituents.