Project Text Simplification: 2011

Sunday, May 22, 2011

Our penultimate show...

It was a bright, sunny Monday morning and we had our final project presentation at college. (There will be another one.) But this one would carry 100 of our total 200 marks. I got up very early so that I could get ready on time. I ironed my dress till there was not a single wrinkle on it. I wanted this to be a perfect day. But somehow I was convinced that it would not be a cakewalk.

We had decided to meet at 9 AM and be prepared for the presentation. But when I reached college, only Bhuvan was there. I was instantly reminded of the proverb: "In life, everything does not happen according to the way you want it". I was disappointed but there was nothing I could do. I stood numb in the corridor, trying very hard to find a way out of this mess. At first, my pessimistic thoughts told me to go and cancel our presentation and tell them we were not prepared. But the optimist in me said we could manage it somehow.

Bhuvan was at the shop getting a printout of the report, which was pathetically formatted (all of us are to blame for that). I was surprised that the reviewers did not throw it in our faces. At that moment I could see all of us failing. Since we were the 3rd batch, there was some time before our turn came. Ambika had come directly to college from her hometown, so she did not know about the developments in the project or her part in the presentation. I felt so helpless. We had two options: either sit and sulk about our irresponsible work or sit and work out something before our turn came. Of course we chose the second option. Ambika and I worked on the ppt while Bhuvan and Anusha got everything ready to show the demo of our project. Our turn would arrive in 15 minutes, so we hurried through it. We did not have time to decide who would present what. We finally managed to consolidate everything in time.

Our turn arrived and we entered the presentation room. Since our laptops have practically become desktops by now, we had to plug one into the power supply to get it working. Meanwhile the reviewers were taking a look at our report. Before we could set up the laptop and boot it, they started shooting questions about the project. Now, this part is very easy as long as we know how to make them understand our point. One question led to another, and they understood the whole idea behind the project and how it works without any slides or paper (and also without Bhuvan's turrrrrr turrrrr lines on the paper ;) ). But this post would go to waste if I didn't mention Bhuvan's presentation that day. He had such confidence in his tone and body language, and he was so articulate, that everybody understood the concept very well. They did not have any more questions for us. Everybody seemed to like the idea and they were very happy with the effort we had put in over the past 5 months. At that moment, all of us forgot the roller coaster ride of the morning and were so joyous. What we couldn't achieve by sitting at home for a week was achieved in less than 60 minutes as a TEAM, which stands for "Together Everyone Achieves More".

Lastly, I would say that none of this would have been possible without Sudarshan Sir's continuous faith in us and his motivation all this while; he has been our inspiration now and will be for years to come. It would be unfair to sum up his contribution in just 2 lines. (A separate post on Sudarshan Sir titled "Superman Sudarshan" will be up shortly) ;)

All in all a memorable day! :)

Thursday, May 19, 2011

Kickass Ending! Part-1

Apu: "Don't laugh, I'm already irritated." Wow, what a start to the day! The thought in my head: "If the opening itself is like this, how will the real movie be!? 0_o"

I had never seen Apu like that before; dressed all in pink, she was all red with anger!! :D If I had cracked a single PJ, I think that would have been "The End" for me!! :P :P Lucky me, I didn't do anything stupid.

There was one big drama about the report alignment. Ambika was late, Anu was late, and so our team was late!!! Wow, what a beginning to our final project review. It was for 100 marks and, as always, we were about to fall back into our usual circle of "bad presentation"!!

We didn't have a proper ppt ready, we didn't have a proper report ready, we were doomed!! With these thoughts in my head I entered the department. Apu and Ambika were busy making the ppt and Anusha was busy helping them out, while I wondered what would happen to our presentation. On top of that, everyone was in formals and I was in my old-style loose blue T-shirt and jeans, which added to my tension.

"Okay, it's done. Let's go and stand in front of the boardroom, Akshatha's group has gone in," said Anu. We packed up and sat outside the boardroom. I looked at the report once: wow, such awesome crap!! We had spent 110 rupees on this, and it would be thrown in our faces no matter what! :P My thoughts: "We should have made it a few pages shorter; if they throw it in our faces, at least it would hit a little softer." :D

I peeped in: there were some serious faces and serious questioning going on. Man, we were gone!! We didn't have anything in order: no good report, no good ppt and, most of all, NO PREPARATION!! It would be bad for sure. The other team came out and they weren't looking good; they said the panel was grilling everyone in there. We were going to get it for sure. Time to enter!

We saw a panel of 4, with Vinay sir playing the lead role (villain) :P He took our report and asked me about MySQL. Piece of cake! I gave a super answer, and that one question gave me a hell of a lot of confidence. He asked us to give a brief description of the project and, as usual, Apu explained it super-duperly. Then it was Vinay sir's turn to summarize what he had understood: "It's a very simple project, just replacing the words with their synonyms, that's all." {my heartbeat rising} Apu started answering. I don't know what struck me; I didn't care about anything, I just wanted to speak and thought I should cut loose. I raised my voice, cut in, and spoke for about 3 minutes about how complex it is just to identify the complicated words in a text. After my explanation I could see nodding heads all around. Wow, they were convinced!! :) :) I was damn happy, we had made our mark!

After that question I realized everything else would be easy. They kept asking about different modules, and I could see everyone in my team standing as if they wanted to answer more and more questions. Ambika, Anusha and Apu were all raising their voices while answering; I could see the confidence popping out of everyone. By the way, the HOD was so supportive that day that he actually helped us answer a question! We had never gone to him to report our progress and had hardly interacted with him, so we thought he would be angry, but this was a day beyond our imagination! Everything happened exactly the opposite of what we had expected.

"You are finished, you can leave. Wait, have you presented any paper?" "Yes sir, at BNMIT." "You are the first team to get those 10 marks :D" said Vinay sir with a smile :) My thoughts: "Was what just happened some kind of dream!? Did we really do that!?" Haha :D At last, one good presentation, and that too with zero preparation and zero confidence :) :)

We did it! Something that was next to impossible and least expected from all 4 of us. By the way, I had told Anusha that I would definitely speak and would get back with a bang! Kept my promise ;) ;)

Happiesss Endinggss :) :) :)

Wednesday, May 4, 2011

Points To Be Taken Care While Preparing Project Report

Any project is incomplete without a proper report, and now it's time to prepare the report for ours. Today we got the format for writing it. Before we start writing, we should keep the following points in mind so that we come up with a good report:

No sentence in the report may be copy-pasted.
Sentences in the report must be as short as possible.
A paragraph should not go beyond 10 lines.
Include as many diagrams as possible (the report must have at least 20 diagrams).
Maintain the same tense throughout the report.
The literature survey should include all the papers covered on the blog and also the papers on the http://www.citeulike.org site.
The introduction should include:
1. Why this problem is important.
2. A summary of the entire report.
3. Definitions of necessary terms such as co-occurrence.
4. A brief introduction to packages such as NLTK, WordNet, PyGTK, etc.

The conclusion should include the work we have done (in brief) and its future enhancements.

Keeping all these points in mind, friends, let's put in our best effort and come up with a great report :):)

Tuesday, April 12, 2011

20% progress!

"Sir, sir, the co-occurrence matrix is the frequency of two words. Consider a big text, sir; it's like we have a huge text compiled and we, we took out the co-occurrence of the two words!!" That was me, unable to explain one of the main parts of the work we had done. This is the first thing that pops into my head when I think about this Monday, 11th April!

It was just another Monday, and it had to be a good one for us. We were ready with our working code; even though it doesn't always give the proper result, you can expect a good answer 3 out of 5 times. We were confident, and Sudarshan sir had called us for an early-morning pre-presentation session.

Points to be noted:

Never use techie terms while you are presenting.
Never bring up the term Markov matrix, or any matrix for that matter!
Be confident and go for it.

It sounds funny when I think about what I did in our boardroom; it was just the opposite of sir's advice.

We were late, but it wasn't our mistake. We were confident but didn't have slides :P We entered, but our timing was bad. We wanted to start but weren't allowed to. The HOD wanted my parents' contact number but I didn't remember it, and that was just the beginning!!

We stepped inside the boardroom. HOD: "What do you 8th sem students think of yourselves? etc. etc." My thoughts at that moment: "Did we really need all this!?" Anyway, he asked us to explain our contributions and, guess what, we were back in the same old circle: bad presentation, and he wasn't impressed! Then he gave one input, and it didn't get simplified. Then a 2nd input, and even that didn't get simplified. Damn, we were screwed. I knew it gives a proper answer 3 out of 5 times, but I never thought the first 2 sentences would be the wrong ones!! :P He rated our progress at 20%. Wow!! :P

It was done and dusted. I didn't say anything about it; we came out and I realized how upset our guide was. That was bad. We needed to show the simplified version of this input:

"We require the students to exhibit impeccable attitude"

All of us had only one thing in mind: we had to produce the simplified result! Sir should be impressed, that's all we had in mind! After a few tweaks, by afternoon we got the expected output:

"We need the students to show perfect attitude"

Wow!! Done :) But I was still nervous. Sir gave 2 more sentences and, yippee, it worked: we got simplified versions of them :D Happy endings with cane juice from our sir :) :) :)

All Izzz wellLL :) :)

Review story.....

The story starts with a call from our team-mate Bhuvan zero Raj zero R :) on Sunday evening. He told us that there would be a project review the next day. We were confident that we would do well, as we had already given a review on Friday. He said that the first 10 batches had to give the review, which would start at 10 on Monday. The next day, it was 10:30 when we went to the board room in our department. As our batch number was three, we had decided to go at 10:30, thinking that by then the other batches would have finished their reviews. But the story was different there: none of the other batches had come for the review. As we stepped into the board room we got a proper scolding for everything from bunking class to coming late for the review. Then we were asked to give the review. By that time everything had vanished from our minds. At 7:30 that morning we had noted down the important points to say, but after the scolding nothing was left in our minds.

The review started. I was asked, "What is your contribution to the project??" It was an unexpected question for me. I gathered my thoughts and talked about my part in the project. As soon as I finished, the next question was "What is your contribution to the project??" Yes, he asked the same question again. I was bewildered; I still don't know why he asked it again.

I told him about my share of the work for the second time. He went on asking questions like this and I tried my best to answer them. I literally struggled to convince him. He did not let me explain in a continuous flow. It was not his fault; maybe I was too nervous to answer all his questions.

Next it was Bhuvan's turn. I think Bhuvan will never forget the word "co-occurrence". He put all his physical strength into explaining it. He tried drawing many rows and columns, but he could not convince the audience. Still, he is confident that he will not let the same thing happen in the next review.

Anusha continued Bhuvan's explanation, but even she could not convince the audience. "One should understand the audience first and then explain according to its level" was the advice given by our reviewer. At last Apoorva talked about her contribution to the project, and she also answered some questions about the time efficiency of our project.

Somehow the review got over, but we were very disappointed. We didn't talk to each other for some time. Then Sudarshan sir gave us some tips on how to present there. The best part of the review was that it inspired us to improve our code to give better and more accurate results (given proper input). Amidst all this, we enjoyed the explanation skills of Bhuvan zero Raj zero R :) :)

SAD MONDAY!..:(:(

Hi All,

Well, Monday, being the first day of the week, is supposed to be the best day: energetic, fully pumped, etc. But sadly, for our text simplification group it turns out to be the most disappointing day every time :(
Every time, we plan how to showcase our project, i.e. prepare for the presentation before entering the board room. We are all pumped up and fully confident about how to speak, but once we start presenting, the entire storyline changes. Disappointing! :(

The same thing happened this Monday too. We had our project review and were supposed to explain the work done so far and the status of the project, so we reached college at 7.30 to discuss how to present our work in a nice filmy way (I mean, to put on a great show :P). We listed a few key points to speak about, like:
1) What is text simplification?
2) Why is text simplification needed?
3) Our idea for implementing text simplification, etc.

We were all set to face the HOD and put on a grand show as decided. Messages started flooding our inboxes saying that we were supposed to be in the board room at 10.00 for the review. We reached the HOD's cabin only at 10.30, thinking that our turn would take time since ours was the 3rd batch. Anyway, after getting a few scoldings from Ramakrishna sir, we got our chance to enter the board room.

Now starts the main play...........

As soon as the HOD saw us, he started scolding us for mass bunking, indiscipline, etc.
Then all of a sudden he called out Ambika's name and told her to explain what role she played in the project and what her contribution was. I was quite shocked when he started off like that, but Ambika somehow completed her explanation; the HOD was not convinced, though. Then it was Bhuvan's turn. He has so much in his mind that he goes blank when asked to explain even a single line of what he has done. He was angry, and we could see his frustration in the way he was scribbling on the paper when the HOD told him to explain the co-occurrence concept :P Then it was Apoorva's turn, followed by mine. I had only a few contributions to explain, and the HOD was not convinced even by those :(

We were all looking at each other's faces, perplexed, while all this happened, and the most interesting part is that our guide was sitting right in front of us watching it all. He had explained very well beforehand how to face project reviews and how to speak confidently, but sadly all this drama happened in front of him :(

After we came out, we discussed a few points with our guide about what work is left and how to face reviews in future.
We assigned tasks among ourselves and got started. We are waiting for this Saturday, as it is our deadline for completing the project and showing the HOD that our project indeed helps others with TEXT SIMPLIFICATION.
There's a party waiting this Saturday for two reasons: one is project completion and the other is Bhuvan's treat :P

hoping to post very soon ..

Cya GN!.........

Monday, April 11, 2011

Text Simplification --------Part 2

I was out shopping with mom when my phone started ringing. It was Bhuvan, who frantically called to say there was going to be another round of project review by our HOD, as he was not happy with the progress of the 8th sem students. So we began preparing for Monday's review by running our script and debugging some errors. We got stuck since each module was on a different system, and we had to integrate them on one system. We thought we had finally managed it, until we realized our ignorance in the morning, and so we ended up not showing the proper output to Sudarshan sir :P

We somehow fall into our own trap because of our ignorance, and this was no doubt one such day!! We had to somehow impress the reviewer with our presentation (something at which we have failed miserably). That was one of the reasons we had a 7.30 AM meeting this morning. We wanted to prepare well for the presentation. We noted down what to say and what NOT to say, so we were all set to enter the room and give a great show until..........................

We entered the room to find our HOD sitting alone. He took a sheet of paper which had our names with a phone number next to each. He asked each of us for our father's number. We all told him, until it was Bhuvan's turn. He started saying, "Sir, my dad's number starts with 9180..... and my mom's number starts with 9008....." and we all burst out laughing :D

The review didn't go as expected and all of us were disappointed. But amidst all this, there were some humorous moments which I can never forget. The reviewer asked us about our individual contributions. We found it difficult to explain to him what we had actually done in the past 3 months, working 6 hours a day on average. In the process most of us got restless and annoyed, so when it came to Bhuvan (again! ;) ) he took a book and started drawing lines like he was cutting a tree with a hacksaw blade. :D If I may quote Sir, it looked like he was drawing dark lines on the reviewer's face. The reviewer gave an input text which read "Bhuvan abused Ambika", and the replacement was nil (if you ask me, there was nothing to replace in that sentence).

We were all the more determined now to finish the tasks and show the output, so we went to the lab and started working. None of us spoke a word, we were concentrating that much! We did achieve something. Thanks to the review which happened, we became more efficient.

"We require the students to exhibit impeccable attitude"

was the input text which was not getting simplified in the review room. But now it shows-

"We need the students to show perfect attitude"

Wow!! What an achievement.. It feels great to see the proper results :P

We were convinced that our package was working fine. If we update our database (the Markov matrix) and create a showy user interface, we will get better results. So it's quite obvious what our next task should be. And here we are, working on the same at 12 AM.

We have a big party lined up for Saturday, so we are scheduled to complete our work by Friday.

Three cheers to Simplifiers!! :)

Hip hip........ (ok I can hear you guys shouting "hurray" :P )

Sunday, March 6, 2011

Text Summarization Using Lexical Chains

Authors: Meru Brunn, Yllias Chali, Christopher J. Pinchak

Until now the posts here have always been about "Text Simplification", so you must be wondering what relevance "Text Summarization" has to our project. I suggest you read on; by the end you will be convinced of the relevance. So let's begin with the details.

What is Text Summarization?

Summarization is the process of condensing a source text into a shorter version while preserving its information content. It is very useful when a reader has no time to read a whole paper just to find out whether it is important to them. Legal documents, for instance, are usually very lengthy and full of jargon, which makes them difficult for a reader to understand. A summarization tool can be of great help in such situations.

Introduction

Summarization is usually done by extracting important sentences from the source text and compiling them to generate coherent summaries. In this paper they provide an algorithm to identify important sentences by forming lexical chains.

The overall architecture of the system is shown in Figure 1. It consists of several modules organized as a pipeline.

Preprocessing

1. Segmentation: To start the summarization process, the original text is first sent to the text segmenter. The role of the text segmenter is to divide the given text into segments that address the same topic. This segmentation allows later modules to better analyze and generate a summary for a given source text.

2. Tagging: This module performs Part-of-speech tagging. The words are considered individually and the semantic structure is not considered.

3. Parsing: In this module, tagged words are collected and organized into their syntactic structure. We can select various components (or phrases) depending on their syntactic position within a sentence. For example, we could decide to select all noun phrases within a given text. Finding these noun phrases is a trivial task using a parsed representation of the text. Since the parser and tagger are not entirely compatible with respect to input/output, the tagger output is refined so that it becomes compatible with the parser. For example, the parser expects tagged words of the form 'word TAG', while the tagger outputs tagged words of the form 'word_TAG', and so the underscore is simply replaced by a space.
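To make the tagger-to-parser clean-up in step 3 concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the function name `tagger_to_parser_format` and the sample tags are made up for this post and are not taken from the paper's system.

```python
def tagger_to_parser_format(tagged_text: str) -> str:
    """Rewrite 'word_TAG' tokens as 'word TAG' tokens.

    Hypothetical helper: the paper only says the underscore is removed,
    it does not show the actual code.
    """
    converted = []
    for token in tagged_text.split():
        # Split at the LAST underscore so words that themselves contain
        # an underscore keep it.
        word, separator, tag = token.rpartition("_")
        if separator:
            converted.append(f"{word} {tag}")
        else:
            # No underscore at all: leave the token untouched.
            converted.append(token)
    return " ".join(converted)


print(tagger_to_parser_format("The_DT explosion_NN occurred_VBD"))
```

Splitting on the last underscore is a design choice of this sketch, so that a token such as `New_York_NNP` still ends up with a single trailing tag.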

4. Noun filtering: Noun filtering improves the accuracy of text summarization by selectively removing nouns from the parsed text. These nouns come from the source text and are identified by the tagger. However, there are nouns that both contribute to and detract from the subject of the text.

Consider an analogy to analogue data transmission. During data transmission, there is both a signal component and a noise component. Data transmission conditions are ideal when there is a strong signal and low noise. It is when the signal is overcome by noise that it becomes difficult to detect. This is similar to the presence of nouns within the source text. Those nouns that contribute to the subject of the text are part of the 'signal', and those that do not are part of the 'noise'. The noun filter's job is to reduce the 'noise' nouns while still retaining as many 'signal' nouns as possible.

There are a number of different heuristics that could be used to filter out the 'noise' nouns. They have designed a heuristic using the idea that nouns contained within subordinate clauses are less useful for topic detection than those contained within main clauses. However, these main and subordinate clauses are not easily defined. Hence for their system, they have selected a relatively simple heuristic. They chose to identify the first noun phrase and the noun phrase included in the first verb phrase from the first sub-sentence of each sentence as the main clause, with other phrases being subordinate.

5. Lexical chainer:

The steps of the algorithm for lexical chain computation are as follows:

We select the set of candidate words. A candidate word comes from an open class of words that function as a noun phrase or proper name as results of the noun filtering process.

The senses of all the candidate words are considered, which are obtained from the thesaurus. In this experiment, they used the WordNet thesaurus. At this step all senses of the word are considered, and each word sense is represented by distinct sets considered as levels. The first one constitutes the set of synonyms and antonyms, the second one constitutes the set of first hypernyms/hyponyms and their variations (i.e., meronyms/holonyms, etc.), and so on.

They find the semantic relatedness among the set of senses according to their representations. If two sense representations of two distinct words match, then they are said to be semantically related. Each semantic relationship is associated with a measure that indicates the length of the path taken in the matching with respect to the levels of the two compared sets.

They build up chains that are sets such as $\{w_1, w_2, \ldots, w_n\}$, in which each word sense $w_i$ is semantically related to $w_j$ for $i \neq j$.

We retain the longest chains by relying on the following preference criterion:

word repetition >> synonym/antonym . . .

In this implementation, this preference is handled by assigning scores to each pairwise semantical relation in the chain, and then summing those pairwise scores. Hence, the score of a chain is based on its length and on the type of relationships among its members.

In the lexical chaining method, each word-sense has to be semantically related to every other word-sense in the chain. The order of the open class words in the document does not play a role in the building of chains. However, it turned out that the number of lexical chains could be extremely large, and thus problematic, for larger segments of text. To cope with this, they reduced the word-sense representation to synonyms only when they had long text segments. Lexical chains are computed for each text segment.
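The chain scoring described in step 5 (a score for each pairwise relation, summed over the chain) can be sketched in a few lines of Python. Please treat this as a rough illustration only: the numeric weights, the `hypernym` entry, the toy relation table and the function names are all assumptions of mine. The paper only gives the preference ordering (word repetition above synonym/antonym, and so on), not the actual numbers.

```python
from itertools import combinations

# Hypothetical weights: only the ordering (repetition first, then
# synonym/antonym, then weaker relations) comes from the description above.
RELATION_SCORES = {"repetition": 10, "synonym": 7, "antonym": 7, "hypernym": 4}


def chain_score(members, relation_of):
    """Sum the scores of all pairwise relations inside one chain."""
    total = 0
    for first, second in combinations(members, 2):
        relation = relation_of(first, second)
        # Unrelated or unknown pairs contribute nothing.
        total += RELATION_SCORES.get(relation, 0)
    return total


# Toy stand-in for a thesaurus lookup (a real system would ask WordNet).
TOY_RELATIONS = {
    frozenset(("car", "automobile")): "synonym",
    frozenset(("car", "vehicle")): "hypernym",
    frozenset(("automobile", "vehicle")): "hypernym",
}


def toy_relation(first, second):
    if first == second:
        return "repetition"
    return TOY_RELATIONS.get(frozenset((first, second)))


chain = ["car", "car", "automobile", "vehicle"]
print(chain_score(chain, toy_relation))
```

Because every pair is scored, a longer chain with strong relations naturally gets a higher total, which matches the idea that the score depends on both the length of the chain and the type of relationships among its members.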

6. Sentence Extraction: Each sentence is ranked with reference to the total number of lexical cohesion scores collected. The objective of such a ranking process is to assess the importance of each score and to combine all scores into a rank for each sentence. In performing this assessment, provisions are made for a threshold which specifies the minimal number of links required for sentences to be lexically cohesive. Ranking a sentence according to this procedure involves summing the lexical cohesion scores associated with the sentence which are above the threshold.

Each sentence is ranked by summing the number of shared chain members over the sentence. More precisely, the score for sentence(i) is the number of words that belong to sentence(i) and also to those chains that have been considered in the segment selection phase. The summary consists of the ranked list of top-scoring sentences, according to the desired compression ratio, and ordered in accordance with their appearance in the source text.
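Here is a small Python sketch of the sentence extraction step: score each sentence by how many of its words belong to the selected chains, keep the top-scoring ones according to a compression ratio, and print them in source order. The function name `summarize`, the naive whitespace tokenizer and the sample sentences are my own assumptions, and the threshold on cohesion scores mentioned above is left out to keep the example short.

```python
def summarize(sentences, chain_words, ratio=0.5):
    """Pick the top-scoring sentences and return them in source order."""
    scores = []
    for sentence in sentences:
        # Naive tokenizer (an assumption): lowercase, split on spaces,
        # strip common punctuation from the ends of each word.
        words = [w.strip(".,;:!?") for w in sentence.lower().split()]
        scores.append(sum(1 for w in words if w in chain_words))

    keep = max(1, round(len(sentences) * ratio))
    # Rank sentence indices by score, highest first (ties keep source order).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:keep])  # back to order of appearance
    return [sentences[i] for i in selected]


sentences = [
    "The car broke down.",
    "It rained all day.",
    "A mechanic fixed the automobile and the car.",
    "We went home.",
]
chain_words = {"car", "automobile", "mechanic"}
for line in summarize(sentences, chain_words, ratio=0.5):
    print(line)
```

With a ratio of 0.5, two of the four sentences survive: the ones that mention members of the chain.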

DUC evaluation

They participated in the single document DUC evaluation. The task consisted of, given a document, creating a generic summary of the document with a length of approximately 100 words. Thirty sets of approximately 10 documents each were provided as system input for this task. According to their analysis, the results seem promising. The grammaticality of their summaries scored an average of 3.73/4. Similarly, the cohesion and organization scores averaged 2.55/4 and 2.66/4, respectively.

Conclusions and Future Work

This paper presents an efficient implementation of the lexical cohesion approach as the driving engine of the summarization system. The ranking procedure, which handles the text 'aboutness' measure, is used to select the most salient and best connected sentences in a text corresponding to the summary ratio requested by the user. In the future, they plan to investigate the following problems:

Their methods extract whole sentences as single units. The use of compression techniques will increase the condensation of the summary and improve its quality.
Their summarization method uses only lexical chains as representations of the source text. Other clues could be gathered from the text and considered when generating the summary.
In the noun filtering process, their hypothesis eliminates the terms in subordinate clauses. Rather than eliminating them, it may also prove fruitful to investigate weighting terms according to the kind of clause in which they occur.

The concept of a lexical chain is similar to our idea of constructing the Markov chain matrix. We can consider this matrix to be a source of sense disambiguation and a tool which will tell us the "about"ness of the text and help us simplify the text in the right manner.
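For readers who have not seen such a matrix, here is one possible way to build a tiny Markov-style matrix from word co-occurrences in Python. This is only a sketch under assumptions: it counts which word directly follows which (a window of one word) and normalizes each row into probabilities. The function name `transition_matrix` and the sample text are made up, and our project's actual matrix may be built differently.

```python
from collections import Counter, defaultdict


def transition_matrix(text):
    """Estimate P(next word | current word) from adjacent word pairs."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    # Count how often each word is directly followed by another word.
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1

    matrix = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        # Normalize the row so the probabilities for each word sum to 1.
        matrix[word] = {nxt: n / total for nxt, n in followers.items()}
    return matrix


matrix = transition_matrix("the cat sat the cat ran the dog sat")
print(matrix["the"])
print(matrix["cat"])
```

Each row is stored as a dictionary, so only the word pairs that actually occur take up space, which matters once the text gets large.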

P.S.: I have used an equation in my post and I feel proud to say that I learnt how to do it from one of our Sir's blog posts. You can also refer to it by visiting the blog Academic Me! :-)

Wednesday, March 2, 2011

Complex Lexico-Syntactic Reformulation of Sentences using Typed Dependency Representations

Author: Advaith Siddharthan, Department of Computing Science, University of Aberdeen

Authors choose one formulation over another for several reasons: to avoid shifts in focus, to handle issues of salience and end weight, and to account for differences in reading skills and domain knowledge. This paper presents an approach to automating complex reformulation. Reformulating complex sentences helps readers with a low literacy level to understand the text better.

Consider the following four discourse markers for causation studied by the author: cause, because of, because, and cause of. They differ in their lexico-syntactic properties.

Example (1)
a. An incendiary device caused the explosion. [A-CAUSE-B] (here A stands for "an incendiary device" and B for "the explosion")
b. The explosion occurred because of an incendiary device. [B-BECAUSEOF-A]
c. The explosion occurred because there was an incendiary device. [B-BECAUSE-A]
d. The cause of the explosion was an incendiary device. [CAUSEOF-B-A]

The discourse markers can be verbs, prepositions, conjunctions and nouns. Additionally, the order of presentation can be varied, giving four more forms:

(1) e. The explosion was caused by an incendiary device. [B-CAUSEBY-A]

f. Because of an incendiary device, the explosion occurred. [BECAUSEOF-A-B]

g. Because there was an incendiary device, the explosion occurred. [BECAUSE-A-B]

h. An incendiary device was the cause of the explosion. [A-CAUSEOF-B]
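To see how the bracketed labels relate to the sentences, here is a toy Python sketch that fills fixed string templates with A and B. Note that this is NOT how the paper works (it uses transfer rules over parses, not templates); the `TEMPLATES` table, the function `realize` and the hard-coded verb "occurred" are my own simplifications for illustration.

```python
# Hypothetical templates keyed by the labels used in example (1).
TEMPLATES = {
    "A-CAUSE-B": "{A} caused {B}.",
    "B-BECAUSEOF-A": "{B} occurred because of {A}.",
    "B-BECAUSE-A": "{B} occurred because there was {A}.",
    "CAUSEOF-B-A": "the cause of {B} was {A}.",
    "B-CAUSEBY-A": "{B} was caused by {A}.",
    "BECAUSEOF-A-B": "because of {A}, {B} occurred.",
    "BECAUSE-A-B": "because there was {A}, {B} occurred.",
    "A-CAUSEOF-B": "{A} was the cause of {B}.",
}


def realize(label, a, b):
    """Fill the template for one label and capitalize the first letter."""
    sentence = TEMPLATES[label].format(A=a, B=b)
    return sentence[0].upper() + sentence[1:]


for label in TEMPLATES:
    print(label, "->", realize(label, "an incendiary device", "the explosion"))
```

The same A and B give eight different surface sentences, which is exactly the kind of choice a reformulation system has to make.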

From the above examples it is clear that some formulations of a given content can be more felicitous than others. For example, "The explosion was caused by an incendiary device" (1e) is preferable to "Because there was an incendiary device, the explosion occurred" (1g).

Related work on text reformulation:

1. Discourse Connectives and Comprehension

This work involved the manual reformulation of the complex sentences. The sentences were manually rewritten to make language more accessible or to make the content more transparent.

Drawback: The manual reformulation was dependent on the way a person sees the text.

For example (2)

a. Because Mexico allowed slavery, many Americans and their slaves moved to Mexico during that time.

b. Many Americans and their slaves moved to Mexico during that time, because Mexico allowed slavery.

Thus the (b) version of the above example would be preferred for children who can grasp causation but who have not yet become comfortable with alternative clause orders.

2. Connectives and Text (Re)Generation

Much of the work regarding (re)generation of text based on discourse connectives aims to simplify text in certain ways, to make it more accessible to particular classes of readers. The PSET technique, about which I have already blogged, considered simplifying news reports for aphasic readers. That work mainly focused on lexical simplification, replacing difficult words with simpler ones. Syntactic simplification in PSET was restricted to string substitution and sentence splitting based on pattern matching over chunked text. The technique in this paper aims to extend these strands of research by allowing more sophisticated insertion, deletion and substitution operations that reorganize and modify content within a sentence.

Drawback: However, to date, these systems do not consider syntactic reformulations of the type we are interested in.

3. Sentence Compression:

Sentence compression is a research area that aims to shorten sentences for the purpose of summarising the main content. Approaches to sentence compression focus on deletion operations, mostly performed low down in the parse tree to remove modifiers.

Drawback: However, given their focus on sentence compression, they restrict themselves to local transformations near the bottom of the parse tree.

Regeneration using Transfer Rules

In this section, let us first describe the data, and then report the experience of performing text reformulation using these representations.

DATA:

The corpus contains examples of complex lexico-syntactic reformulations such as those in Example (1) above. It contains 144 such examples.

1. Reformulation using Phrasal Parse Trees:

The following parse trees show the passive and active voice with "cause" as the verb. A transfer rule is derived by aligning nodes between the two parse trees so that the rule only contains the differences in structure between the trees.

Passive voice: The explosion was caused by an incendiary device.

(S (NP (AT The) (NN1 explosion))
   (VP (VBDZ be+ed)
       (VP (VVN cause+ed)
           (PP (II by)
               (NP (AT1 an) (JJ incendiary) (NN1 device))))))

Active voice: An incendiary device caused the explosion.

(S (NP (AT1 An) (JJ incendiary) (NN1 device))
   (VP (VVD cause+ed)
       (NP (AT the) (NN1 explosion))))

Derived Rule:

(S (??X0[NP])
   (VP (VBZ be+s)
       (VP (VVN cause+ed) (PP (II by+) (??X1[NP])))))

↓

(S (??X1[NP])
   (VP (VVZ cause+s) (??X0[NP])))

In the derived rule, the variable ??X0[NP] maps onto any node (subtree) with the label NP. In this example it matches "The explosion", which is labelled NP.
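To make the rule a little more concrete, here is a minimal Python sketch of applying it to a parse tree. The nested-tuple representation and the names `passive_to_active` and `leaves` are assumptions made for illustration and do not come from the paper. Carrying the tense of the auxiliary over to the active verb is also an assumption, made so that the past-tense example above works with the rule.

```python
# Trees are nested tuples: (label, child, ...); a preterminal is (tag, word).

def passive_to_active(tree):
    """Return the active-voice tree if `tree` matches the passive pattern, else None."""
    try:
        # The nested unpacking mirrors the left-hand side of the derived rule.
        s, x0, (vp, aux, (inner_vp, verb, (pp, prep, x1))) = tree
    except (TypeError, ValueError):
        return None
    labels_ok = (s, vp, inner_vp, pp) == ("S", "VP", "VP", "PP")
    nps_ok = x0[:1] == ("NP",) and x1[:1] == ("NP",)
    words_ok = (
        aux in (("VBDZ", "be+ed"), ("VBZ", "be+s"))
        and verb == ("VVN", "cause+ed")
        and prep in (("II", "by"), ("II", "by+"))
    )
    if not (labels_ok and nps_ok and words_ok):
        return None
    # Assumption: the tense of the auxiliary is carried over to the active verb.
    active_verb = ("VVD", "cause+ed") if aux[1] == "be+ed" else ("VVZ", "cause+s")
    # The two NP variables swap places, as in the right-hand side of the rule.
    return ("S", x1, ("VP", active_verb, x0))


def leaves(tree):
    """Collect the words at the leaves, left to right."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [tree[1]]
    return [word for child in tree[1:] for word in leaves(child)]
```

Calling `leaves(passive_to_active(tree))` on the passive tree above gives the words in active order. The sketch does no morphological generation or recapitalisation, so the tokens stay exactly as they were in the input tree.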

Drawback: In practice, however, the parse tree representation is too dependent on the grammar rules employed by the parser.

2. Reformulation using MRS (Minimal Recursion Semantics):

This representation provides another option: use a bi-directional grammar and perform the transforms at a semantic level.

Consider a very short example for ease of illustration:

Tom ate because of his hunger.

The MRS representation of the above sentence is shown below

named(x5,Tom), _eat_v_1(e2,x5),_because_of(e2,x11), poss(x11,x16),pron(x16), _hunger_n(x11)

This technique treats "because of" as a multi-word expression and handles it in a way comparable to a preposition. A possible rule is as follows:

_because_of(e,x), P(e,y) <-> _cause_v_1(e10,x,y,l1), l1:P(e,y)

Here 'P' is to be understood as a general predicate. After applying the rule, the example becomes:

His hunger caused Tom to eat.

Drawback: The problem encountered, however, is that bidirectional grammars fail to parse ill-formed input and will also fail to analyse some well-formed input because of limitations in coverage of unusual constructions.

Reformulation using Typed Dependencies

Let us consider the following example

The explosion was caused by an incendiary device.

The set of dependencies represents a tree. While phrase structure trees represent the nesting of constituents with the actual words at the leaf nodes, dependency trees have words at every node.

To generate from a dependency tree, we need to know the order in which to process nodes. In general, tree traversal will be “inorder”, i.e., left subtrees will be processed before the root and right subtrees after. These are generation decisions that would usually be guided by the type of dependency and statistical preferences for word and phrase order. However, we can simply use the word positions (1–8) from the original sentence.

The first transformation is that one list of predicates is replaced by another. Applying this transformation creates a new dependency tree:

Thus our transformation rules, in addition to Deletion and Insertion operations, also need to provide rules for tree traversal order. These only need to be provided for nodes where the transform has reordered subtrees (“??X0”, which instantiates to “cause+ed:4” in the trees). Our rule would thus include:

3. Traversal Order Specifications:

(a) Node ??X0: [??X2, ??X0, ??X3]

This states that for node ??X0, the traversal order should be subtree ??X2 followed by current node ??X0 followed by subtree ??X3. Using this specification would allow us to traverse the tree using the original word order for nodes with no order specification, and the specified order where a specification exists. In the above instance, this would lead us to generate:

An incendiary device caused the explosion.
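The traversal idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: nodes are stored as hypothetical `(word, original_position)` pairs, the tree is a plain dict from a node to its children, and the function name `generate` is made up here. Nodes without an order specification fall back to the original word positions, as described above.

```python
def generate(node, children, order=None):
    """Return the words of the subtree rooted at `node`, in generation order.

    node:     a (word, original_position) pair
    children: dict mapping a node to the list of its child nodes
    order:    dict mapping a node to an explicit traversal order, i.e. a list
              containing the node itself and its children (optional)
    """
    order = order or {}
    # Default: the node and its children sorted by original word position.
    default = sorted(children.get(node, []) + [node], key=lambda n: n[1])
    words = []
    for item in order.get(node, default):
        if item == node:
            words.append(node[0])
        else:
            words.extend(generate(item, children, order))
    return words


# Transformed tree for the running example (positions from the passive sentence).
the, explosion, caused = ("the", 1), ("explosion", 2), ("caused", 4)
an, incendiary, device = ("an", 6), ("incendiary", 7), ("device", 8)
children = {caused: [device, explosion], explosion: [the], device: [an, incendiary]}

# Traversal order specification for the reordered node only.
order = {caused: [device, caused, explosion]}
print(" ".join(generate(caused, children, order)))
```

Without the `order` entry for the verb, the same tree would come out in the original positions ("the explosion caused an incendiary device"), which is why the rule has to carry the specification.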

Tuesday, March 1, 2011

Motivations and Methods for Text Simplification

Helloo..

The authors of the above paper are R. Chandrasekar, Christine Doran and B. Srinivas.

As the title suggests, the paper talks about the methods and reasons for text simplification.

They say that to simplify a sentence we need an idea of its structure, so that we can identify the components to be separated out. A parser could be used to get the complete structure of the sentence. Since parsers are prone to errors on long and complex sentences, they use two alternatives to a full parser for simplification.

The first approach uses a Finite State Grammar (FSG) to produce noun and verb groups, while the second uses a supertagging model to produce dependency linkages.

Now let us discuss the reasons for text simplification:
1) Simple sentences are easier for both programs and users to process.
2) Simple sentences are easier to parse because they involve less ambiguity.
3) Simple sentences improve the quality of machine translation.
4) Information retrieval becomes easier, i.e. only the specific relevant sentences can be retrieved in response to queries.
5) Simplification can be used to weed out irrelevant text with greater precision, and thus aid in summarization.
6) Clarity of text.

The simplification process is a two-step procedure: first obtain the structure of the sentence, and then apply simplification rules to that structure to identify the components that can be simplified.

In order to simplify, one needs to identify the articulation points, i.e. the points where the sentence can be logically split. Possible articulation points include the beginnings and ends of phrases, punctuation marks, subordinating and coordinating conjunctions, and relative pronouns.

These articulation points define a set of rules which map an original sentence pattern to a simpler sentence pattern. The rules are applied again and again until no rule is applicable.
ex:
Talwinder Singh, who masterminded the Kanishka crash in 1984, was killed in a fierce two-hour encounter...
Talwinder Singh was killed in a fierce two-hour encounter ... Talwinder Singh masterminded the Kanishka crash in 1984.

FSG based Simplification:

Here we consider sentences as sequences of word groups, or chunks, and consider the chunk boundaries as articulation points.
Chunking allows us to define the syntax of the sentence and the structure of simplification rules at a coarser granularity, since we need no longer be concerned with the internal structure of the chunks.

Each chunk is a word group consisting of a verb phrase or a noun phrase, with some attached modifiers. The noun phrase recognizer also marks the number (singular/plural) of the phrase. The verb phrase recognizer provides some information on tense, voice and aspect.

The chunked sentences are then simplified using a set of ordered simplification rules.

An example rule that simplifies sentences with a relative pronoun:

X:NP, RelPron Y, Z -> X:NP Z. X:NP Y.
The rule is interpreted as follows. If a sentence starts with a noun phrase (X:NP), and is followed by a phrase with a relative pronoun, of the form (, RelPron Y ,), followed by some (Z), where Y and Z are arbitrary sequences of words, then the sentence may be simplified into two sentences, namely the sequence (X) followed by (Z), and (X) followed by (Y). The resulting sentences are then recursively simplified, to the extent possible.
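As a rough illustration of how such a rule behaves, here is a small Python sketch. It works on raw strings with a regular expression instead of real noun and verb chunks, so it simply assumes that everything before the first comma is the noun phrase X. The names `RULE` and `simplify` and the short list of relative pronouns are made up for this sketch and are not the paper's implementation.

```python
import re

# X:NP, RelPron Y, Z  ->  X:NP Z. X:NP Y.
# Assumption: the text before the first comma is the noun phrase X.
RULE = re.compile(
    r"^(?P<X>[^,]+),\s*(?:who|which|that)\s+(?P<Y>[^,]+),\s*(?P<Z>.+?)\.?\s*$"
)


def simplify(sentence):
    """Split a sentence on a relative clause, then recursively simplify the parts."""
    sentence = sentence.strip()
    match = RULE.match(sentence)
    if not match:
        return [sentence]  # the rule is not applicable
    x, y, z = match.group("X", "Y", "Z")
    result = []
    for part in (f"{x} {z}.", f"{x} {y}."):
        result.extend(simplify(part))
    return result


for s in simplify("Talwinder Singh, who masterminded the Kanishka crash in 1984, "
                  "was killed in a fierce two-hour encounter."):
    print(s)
```

A chunk-based version would match on chunk labels rather than commas, which is exactly what lets the real rule ignore the internal structure of X, Y and Z.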

A Dependency-based model:
This model is based on the simple dependency representation provided by LTAG (Lexicalized Tree Adjoining Grammar).

LTAG: These contain elementary trees, called initial trees and auxiliary trees.
Initial trees include nouns, PPs, simple sentences etc.
Auxiliary trees include relative clauses, adverbials etc.

Supertagging: LTAG localizes dependencies, i.e. it requires that elements which depend on each other be present in the same elementary tree.
As a result of this localization, a lexical item may be associated with more than one elementary tree. We call these elementary trees supertags.
We use trigrams to disambiguate the supertags so as to assign one supertag to each word, in a process called supertagging.

EVALUATION:
To establish the dependency links among the words of the sentence, we exploit the dependency information present in the supertags. Each supertag associated with a word allocates slots for the arguments to the word. These slots have a polarity value reflecting their orientation with respect to the anchor of the supertag. Also associated with a supertag is a list of internal nodes that appear in the supertag. Using this information, a simple algorithm may be used to annotate the sentence with dependency links.

The objective of the evaluation is to examine the advantages of the dependency-based model (DSM) over the FSG-based model for simplification. In the FSG approach, since the input to the simplifier is a set of noun and verb groups, the rules for the simplifier have to identify basic predicate argument relations to ensure that the right chunks remain together in the output. The simplifier in the DSM has access to information about argument structure, which makes it much easier to specify simplification patterns involving complete constituents.

Sunday, February 27, 2011

Our working Sunday..

It was nearing midnight on Saturday and all of us were lethargic and drowsy. We hadn't progressed much all week since we had started to think that there was lots of time left and that the others had not even started. So we all went to bed, deciding to meet online on Sunday (a little later than 9 am). I didn't set my alarm either.

I got up to the tinkling sound of the "ghante" (bell) at 9.30 am (Anirudh does sandhyavandane every day without fail). I was sipping my milk when Anusha called me online. We had all received a mail from Sir which truly was a wake-up call for us. It was time we understood the simple fact that there is nothing wrong in him having certain expectations of us when he is giving us his time and guiding us in return for nothing. The mail hit us hard and we all worked for more than 8 hours, of course with some progress. (Ms Ambika excluded. She has been absconding for 10 days ;) )

We decided to be sincere students and work harder.

We started the day off by learning SVN. It became less complicated since Anusha had been doing a PhD on it for the past month. She knew almost all the required commands and the links useful for us. We were successful in committing version 1 of a test file. We created a repository to store our files and got stuck while importing the files into the repository. I am guessing there is a problem with the permissions, which we will resolve soon enough. Bhuvan was watching the tutorials on how to create a plug-in (though I doubt if he was live-streaming the cricket match in between ;) ).

It has been 2 weeks since we started implementation and we could not develop more than 2 modules. We are lacking in this aspect because we don't know the simple tricks in programming. We got stuck on an error for 2 full days without knowing what it was, when our Sir could debug it in less than 15 minutes. We failed because we didn't use pdb. Had we learnt this tool a month back, it would have saved us those 2 days. That was the day when we first realized the importance of that tool. It is so easy to debug the errors now!

We have snippets of code written and stored everywhere on the disk under different names. As the number of modules keeps increasing, I am finding it difficult to maintain all of them. When I want to include some module in a program, I go and search for it on the disk, which clearly shows that I am wasting my time. So Sir taught us how to modularize all our code and create a library, which will help us stay organized. I can create my own library, and when Bhuvan or Anusha want to use one of my modules, they can directly include it. This saves a lot of time and effort!

I had forgotten for a while that we are not doing an ordinary project but an extraordinary one! :-) I feel glad to be a part of such a project with constant motivation by our Sir and my fellow simplifiers. I am sure all "text-simplify" mates are motivated and charged up to work harder. In the end, we all want to see the project being a huge success.

Cheers!

P.s:- I really want to appreciate Bhuvan who sat in front of the computer doing the TS project (hopefully) despite the over hyped India vs England WC match going on since afternoon :)

Friday, February 25, 2011

Ideas,Coding and High spirits!!............

Helloooooo people..

Well, it was Thursday morning and we had our project discussion at 7.30, and this time the meeting was only for the Simplifiers...

I was the first one to reach, and again, since I was jobless, I started thinking about Apoo's super Kannada, Bhuvan's healthy diet and Ambika's hidden secret... lol, these are really interesting topics when you get to know the details, which I will be sharing very soon...

Since all 3 of us (Bhuvan, me, Apoo) slept a little late we were quite sleepy, but our sir's evergreen voice and his energy made us active.

Just before sir came and joined us, all three of us were in serious discussion on how to proceed further in our modules.

We had finished our module on sub-sentence matching with the corpus and it was working fine...

It takes the sub-sentences formed from adjacent words and compares them with the corpus. So we were all thinking about how to simplify such a sub-sentence, and we had many questions in mind:

1) Do we need to simplify all the words in the sub-sentence and replace them, find the frequency, and then compare it with the original frequency??

2) Or replace a single word with its synonym, take the adjacent words, and then find the frequency??

3) We thought, let us consider a graph for each sub-sentence, and the one with the peak value in the graph will be considered for simplification. But we didn't know which words and how many words to consider for replacement................

and many more......

Finally our Sir came and raised a fantastic question: 'How do we check the appropriateness of the sentence after simplification?' He started the discussion with the Markov matrix and told us that it might help in solving this problem.

I will explain what it is actually about.....

Consider a set of words arranged column-wise in one matrix and the same words arranged row-wise in another matrix, and multiply the two matrices. The frequency count, i.e. the probability of two words occurring together, is obtained and stored in a matrix. This probability is used to judge whether the already simplified sentence is SIMPLE AND MOST FREQUENTLY USED.

He says the Markov principle talks about the events which immediately follow a given event.

For ex: soon after eating papad in a restaurant it is most likely that one will order roti ..:P.. not a perfect example though
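Just to make the idea concrete, here is a tiny Python sketch of one common way to use it: count which word follows which in a corpus, turn the counts into probabilities, and score a sentence by multiplying the probabilities of its adjacent word pairs. This is only an assumed reading of the discussion, not the exact matrix method Sir described, and the function names and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sentences):
    """Estimate P(next word | previous word) from adjacent word pairs."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(followers.values()) for word, c in followers.items()}
        for prev, followers in counts.items()
    }


def sentence_score(sentence, probs):
    """Multiply the transition probabilities of all adjacent word pairs."""
    words = sentence.lower().split()
    score = 1.0
    for prev, nxt in zip(words, words[1:]):
        score *= probs.get(prev, {}).get(nxt, 0.0)  # unseen pair -> 0
    return score


# Toy corpus, purely hypothetical.
corpus = [
    "the boy ate the food",
    "the boy ate the meal",
    "the boy ate the food quickly",
    "the man consumed the food",
]
probs = transition_probabilities(corpus)
print(sentence_score("the boy ate the food", probs))
print(sentence_score("the boy consumed the food", probs))
```

A sentence whose word pairs are all common in the corpus gets a higher score than one containing a pair that was never seen, which is the kind of check we want after replacing a word. With a real corpus, unseen pairs would need some smoothing instead of a flat zero.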

At last we thought of replacing the complicated word with its synonym, assuming that there is only one such word, and then continuing the process of sub-sentence searching/matching.

hmmmmm full technical stuff till now .ufffffff

In the midst of all this, Ambika is nicely enjoying her vacation in her home town. When I call her she says the weather is good etc etc. Don't know why Ambika is feeling strange with known things happening around her.. :P

Then we had a discussion on Apoo's diet and FITNESS mantra.. lol, that is quite interesting, you know..:P. She is up to something because she says 'WAIT FOR 2 MONTHS' for everything we ask..:P No idea what she is up to..

We Had good breakfast and headed to our class rooms ...

Classes are very much interesting with all interesting lectures;)

Bhuvan was eagerly waiting to show his code to the team, and the afternoon session was a kind of project review, oh no, it was in fact a code review..

We all showed our code to sir and he was quite happy seeing the code working...:)

We then discussed the next few tasks for the week and wound up.

Still searching for a good fine day to share what I have learnt about SVN with friends. Hope that day comes soon.......

Lastly I feel like mentioning this quote which is in my diary, and it is quite inspiring too!...

Your dreams must come from your heart's deepest desires. Only then will the barriers come down before you.

feeling sleepy people......cya good night!..

Tuesday, February 22, 2011

It's just the Beginning(V1.0)!..

It was just another Saturday: I should work on the project... no no, it's a weekend, I should enjoy, or else I should sleep! Last Saturday was different. I was active, I was charged up, I was on fire, I was all pumped up, and I burnt the midnight oil!!

By the way, the reason behind it!!???

Cold war with our guide. He sent across a mail saying you people aren't progressing.. :P :P And the taunting through his videos :P :P We all got fired up! My only aim was proving sir wrong.

Coding is never difficult as long as you can imagine how the flow of the program should go. If you know the output and can imagine the flow of your program, then half your program is ready. The rest is your programming skills, Google, and your ideas for using the language in the best way.

I knew just what to do and had an idea of all the functions I should work on. After all, coding is the part I enjoy a lot, but seriously, I had turned lazy and was postponing things until sir's mail and his taunting status! :P

Here I go simplifying the text, with a simple, straightforward approach to simplify!! :P :P

Simple is never simple. My sweet computer had to face all my emotions: my anger, my joy at getting outputs, my frustration with untraceable errors, and my breaking my head over how to code a particular thing. Google is god if it shows solutions for the problems; if not, I start cursing it for wasting my time hehe :P

It was Sunday evening around 6 when I first checked my program's output after combining all the small blocks of code into a single file. It was showing the output and I was jumping and dancing all around the place hehe :D

Learnt a lot of coding through this program, many thanks to our sir's taunts and mails ;) Yeah, http://stackoverflow.com is one of the best websites I always look to for help. For the past one and a half years it has been my savior. If you have doubts regarding any programming language you can seek help there.

By the way, Version 1.0 is ready! It's just a matter of a few days before I come up with a newer, better version. I have rated the present one as just 35% accurate and "my expectations are more" ;) ;) hehe :D

Signing off Bhuvan :)

Busy Monday!...:)

Hello.....

It's Monday!!.. The first day and also a new start for the week. I had to get up at 5 o'clock (AM) since our Sir had called us for a project discussion at 7.30 am.
I boarded my bus at 6.30. It was amazing weather, with a cool breeze and a window seat, and what else could I ask for... I took my cell out to read all the forwards which I had received last night, and I must tell you the nice road near Pesit reminds me of a Jab We Met song, though it is funny.. :P..

I was the first one to reach coll at 7.10 and I was jobless, so I started taking a few good pics of my coll with my camera. Surprisingly Bhuvan was very late, and 4 of them including Sir had to treat 3 of us (the celebrities are ANUSHA, AKSHATHA and JAWERIYA ;) ) for coming late. This is our SFPE rule!!! (stomach full, pocket empty), which we assessors and simplifiers follow.

Jaweriya was the star of the day as she gave us a wonderful explanation of grading a text.
The talk was really interesting, as our sir filled in examples to make it a lively session.
She wrote a few formulas for readability assessment and content information, which included a lot of mathematical equations.

Her main explanation was on grading the text. She stressed two points for this: one is readability assessment and the other is content information, and how much of each has to be used to get the peak, most efficient value.

For ex: Bhuvan likes coding, which is 20% of the work of our project, and Apoo likes reading, which is 80% of the work of our project, but both are important for the completion of our project.

Work                                            Grade
100% Bhuvan's coding                            0
90% Bhuvan's coding and 10% Apoo's reading      10
50% Bhuvan's work and 50% Apoo's work           50
20% Bhuvan's work and 80% Apoo's work           85
0% Bhuvan's work and 100% Apoo's work           5

Here in the example we can note that at one point there is maximum efficiency, and that combination of work from both of them gives the maximum result. The same concept is applied to grading a text, and here the two contenders are readability assessment and content information....:).

After a good discussion on this topic, we were supposed to continue our work, and we did that till 12 o'clock with continuous debugging and coding.
Exactly at 12 we were told to solve Prasad sir's problem, and that went on for 1 hour. We were supposed to trace an algorithm for Chain Matrix Multiplication (given a set of matrices like a1, a2, a3, a4, ..., which order of matrix multiplications results in the least number of multiplication steps).
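For anyone who wants to try it, here is a short Python sketch of the standard dynamic programming solution to the matrix chain problem. It is the textbook method and not necessarily the exact trace we did on paper. The function names and the list `dims` are assumptions for this sketch: matrix number i (counting from 1) has `dims[i-1]` rows and `dims[i]` columns, and at least one matrix is expected.

```python
def matrix_chain_order(dims):
    """Return (minimum scalar multiplications, best parenthesization)."""
    n = len(dims) - 1  # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):           # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):            # try every split point
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k
    return cost[0][n - 1], parenthesize(split, 0, n - 1)


def parenthesize(split, i, j):
    """Rebuild the multiplication order from the table of split points."""
    if i == j:
        return f"a{i + 1}"
    k = split[i][j]
    return f"({parenthesize(split, i, k)} x {parenthesize(split, k + 1, j)})"


# a1 is 10x30, a2 is 30x5, a3 is 5x60
print(matrix_chain_order([10, 30, 5, 60]))
```

For this small example, multiplying (a1 x a2) first costs 10*30*5 + 10*5*60 = 4500 steps, while doing (a2 x a3) first costs 30*5*60 + 10*30*60 = 27000, so the order really matters.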

After this long time of tracing, we went to have lunch in the NRI canteen, and I was waiting for it because Bhuvan was supposed to treat.. yahooooooooo !!!..:)

Then we all had a small birthday party for Madhura, which we enjoyed a lot, and not to forget the pastry cake, it was yummmmm...:P.

The end of the day is the most important part, for which Apoo and I were waiting eagerly: getting our code free of errors, which we managed at around 1.30am thanks to our sir's help..
But one thing, people.. this "Index out of range" error, I tell you, is so damn irritating if you don't overcome it.
I recommend all of you to please use the Python debugger (pdb), which saves a lot of time in your coding when you encounter errors.
The Python debugger is a really wonderful tool....

We then started on the next module of our code, which searches for sub-sentences in the corpus by taking adjacent words.
Our search module was a grand success...

Lastly, credits to Bhuvan because he wrote an amazing piece of code which simplifies a text by replacing complicated words with simpler ones, and the text did make sense once the word was replaced... Kudos Bhuvan.....

This is the story of my BUSY Monday........ more posts to come.........
Cya for now....

Monday, February 21, 2011

Lunch at IISc :-)

Delicious lunch at the mess

Bhuvan and Ambika relishing the food!

Sir: What is this, you are all eating as if you have never seen food before :P

L-R: Heroine's sister, Heroine, Don

Sir: Haaaa haaaa, enjoy :)

Bhuvan trying to show off his photography skills ;-)

Hmmm.....If any of u want autographs, please stand in the queue ;)

Aah! A nice day spent at IISc... Yummy lunch and a fun game played under the shade of the trees. More details to follow this post.

Monday, February 14, 2011

Progress for the day :)

Uh! It's coding, so it will be an easy day for me.. That is never the case when you are planning to work at DSCE with its facilities! :P We keep learning this the hard way, and today was one such example.. I went to the college library at five past nine and found Ambika texting {msgs cost ;) }. Come on, it's Valentine's Day after all ;) Shouldn't ask whom.. :P

"Private rooms at the library are the best place to do a project"?? If you think yes, you must be the biggest fool :P Or maybe they are just not meant for it hehe :P :P I went to see if there was a plug point in the private rooms. First shock of the day: "No plug points in those rooms". Second: the internet would not connect.

What a start for the week!! :P

We went to the study section, where we thought we could find plug points. We got one, but still got no internet. All the while everyone was chanting one sentence: "Did we really need all this!!??" We should have just sat and coded from home with no problems whatsoever!! Anyway, the sad story continues. We went to our dept and saw juniors sitting at the comps. So much for the final year project: students are supposed to come to college and there is no comp to work on!! Anyway, we went inside, saw very few final-year students, got 2 comps and started coding. We had internet, but not on Ubuntu, and on top of that we would have to install the packages all over again!! Once again, "Did we really need all this!??"

After all the fuss we decided to go home, the first good decision we took!!..

The sad story doesn't end here. I came home and saw my internet was still not working :( Anyway, we had the comp with all the necessary packages.. :) We started coding and the good times started, and we got the internet connected too :) Yeah, not to forget all the fun when Ambika started blushing as she was getting msgs, and she wanted to leave early as she was "feeling sleepy". I think we all know what is special about today, and we could make out the urgency too ;) ;)

Getting serious...

We wrote the codes for:

Removing the stop words.
Looking up the list of 1500 simple English words and filtering out the simple words.
Identifying the complex words(key words).
Finding the synonyms for the keywords.
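The four steps above can be sketched roughly like this in Python. This is not our actual project code: the tiny word sets `STOP_WORDS`, `SIMPLE_WORDS` and `SYNONYMS` are hypothetical stand-ins for a real stop-word list, the 1500 simple English words and a real thesaurus, and the function names are made up for the sketch.

```python
import re

# Hypothetical stand-ins for the real resources.
STOP_WORDS = {"the", "a", "an", "of", "to", "was", "is", "in", "by"}
SIMPLE_WORDS = {"doctor", "gave", "medicine", "help", "start", "enough"}
SYNONYMS = {
    "administered": ["gave", "dispensed"],
    "physician": ["doctor", "medic"],
}


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Step 1: drop the stop words."""
    return [t for t in tokens if t not in STOP_WORDS]


def complex_words(tokens):
    """Steps 2 and 3: whatever is not in the simple-word list is a key word."""
    return [t for t in tokens if t not in SIMPLE_WORDS]


def find_candidates(text):
    """Step 4: map every complex word to its list of candidate synonyms."""
    keywords = complex_words(remove_stop_words(tokenize(text)))
    return {word: SYNONYMS.get(word, []) for word in keywords}


print(find_candidates("The physician administered the medicine."))
```

The output is only a list of candidates per key word. Choosing which candidate actually fits the sentence is the harder part and is left for the next step.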

To do list for tomorrow:

Finding out the co-occurrence rate for the key words.
Based on that rate, selecting the proper synonym that fits as a replacement.
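One possible way to do this, sketched in Python under loud assumptions: the "co-occurrence rate" is taken here to be the share of corpus sentences containing a candidate synonym that also contain at least one of the neighbouring words from our sentence. That definition, the function names and the toy corpus are all hypothetical, since the exact measure is not fixed yet.

```python
def cooccurrence_rate(candidate, context_words, corpus):
    """Share of sentences with `candidate` that also contain a context word."""
    context = set(context_words)
    with_candidate = [
        set(sentence.lower().split())
        for sentence in corpus
        if candidate in sentence.lower().split()
    ]
    if not with_candidate:
        return 0.0  # the candidate never occurs in the corpus
    hits = sum(1 for words in with_candidate if words & context)
    return hits / len(with_candidate)


def pick_synonym(candidates, context_words, corpus):
    """Choose the candidate with the highest rate (first one wins on ties)."""
    return max(candidates, key=lambda c: cooccurrence_rate(c, context_words, corpus))


# Toy corpus, purely hypothetical.
corpus = [
    "the doctor gave the medicine to the child",
    "the nurse gave the medicine at night",
    "the shop dispensed cash",
    "the machine dispensed tickets",
]
print(pick_synonym(["gave", "dispensed"], ["medicine"], corpus))
```

Here "gave" always appears together with "medicine" in the toy corpus while "dispensed" never does, so "gave" would be picked as the replacement for "administered".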

Thats my progress for the day.. :)

My progress......

I must say that today was not my day :( I got up late and somehow managed to get to the mess. It was exactly 9 o'clock by then. When I reached the mess, there was a very long queue which resembled the queue for first day, first show tickets of a Rajnikant movie. By the time I finished my breakfast it was 9:10. Then I rushed to the library, where I thought my team mates would be waiting for me, but that was not the case :). They came a few minutes later. We were very energetic and very excited to code together. We planned that the four of us would work together, definitely finish off our first module today and show it to sir. But our plan didn't work well :( First the WiFi did not connect. I don't know what happens to the WiFi sometimes (you can say most of the time). We tried and tried but it didn't. Then we stepped into the digital library, but there also the net was very slow. Only the system in front of which Bhuvan was sitting was working fine. By that time it was almost 10 o'clock and we were getting tense that sir would scold us properly, and we were preparing our minds for that. Then we left the library and went to the lab, and there were only two systems free. We started off with the coding, but we could not code because of the disturbance there. Then we decided to go home. As the WiFi was not connecting, I went to Bhuvan's home, and Anusha and Apoorva went to their homes.

Then we started with the actual coding. We were able to remove the nouns, proper nouns and prepositions. We extracted a few key words. Then we searched for these keywords in the file of most frequently used English words. If a word is present in that file, there is no need to replace it and we move on to the next word. The code worked fine. Next we found the set of synonyms for the tokenized words and listed them in one variable. As of now, the main and most important step of our module is to replace the word with the proper synonym which retains the context of the sentence. We don't know how to do this yet, but we are trying to find a way. I tried different techniques but those were not so fruitful. In short, my progress in coding today is that I am now able to extract the keywords, that is, the difficult words, sentence by sentence. The next thing is to replace the word with the appropriate synonym which matches the context. This is the most difficult and most important step of our coding. Hope we will come up with the solution by tomorrow :)
