Sunday, February 27, 2011
Our working Sunday..
Friday, February 25, 2011
Ideas, Coding and High Spirits!!
Tuesday, February 22, 2011
It's just the Beginning(V1.0)!..
Busy Monday!...:)
It's Monday!! The first day and a new start to the week. I had to get up at 5 o'clock (AM) since our Sir had called us for a project discussion at 7.30 am.
I boarded my bus at 6.30. It was amazing weather: a cool breeze, a window seat, and what else could I ask for... I took my cell out to read all the forwards I had received last night, and I must tell you, the nice road near Pesit reminds me of a Jab We Met song, funny as that is.. :P
I was the first one to reach college at 7.10, and since I was jobless, I started taking a few good pics of my college on my camera. Surprisingly, Bhuvan was very late, and the four of them, including Sir, had to treat the three of us (the celebrities being ANUSHA, AKSHATHA and JAWERIYA ;) ) for coming late. This is our SFPE rule (Stomach Full, Pocket Empty), which we assessors and simplifiers follow!
Jaweriya was the star of the day, as she gave us a wonderful explanation on grading a text.
The talk was really interesting as our sir filled in examples to make it a lively session.
She wrote down a few formulas for readability assessment and content information, which involved a lot of mathematical equations.
Her main explanation was on grading a text. She stressed two points for this: one is readability assessment and the other is content information, and how much of each has to be used to reach the peak, most efficient value.
For example: Bhuvan likes coding, which is 20% of the work of our project, and Apoo likes reading, which is 80% of the work, but both are important for the completion of our project.
Work                                              Grade
100% of Bhuvan's coding                             0
90% of Bhuvan's coding & 10% of Apoo's reading     10
50% of Bhuvan's work & 50% of Apoo's work          50
20% of Bhuvan's work & 80% of Apoo's work          85
0% of Bhuvan's work & 100% of Apoo's work           5
Here in the example, we can note that at one point there is a maximum efficiency: that particular combination of work from both of them gives the maximum result. The same concept is applied to grading a text, where the two contenders are readability assessment and content information. :)
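One way to picture that peak is to score every mix and pick the one with the highest grade. A minimal sketch, using the toy grades from the table above (not a real formula):

```python
# Toy illustration of the "peak efficiency" idea: each mix of Bhuvan's
# coding and Apoo's reading gets a grade, and one combination scores highest.
mixes = {
    (100, 0): 0,
    (90, 10): 10,
    (50, 50): 50,
    (20, 80): 85,
    (0, 100): 5,
}

# Pick the (coding%, reading%) split with the highest grade.
best_mix = max(mixes, key=mixes.get)
print(best_mix)  # (20, 80) -- the 20/80 split wins with grade 85
```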
After a good discussion on this topic, we were supposed to continue our work, and we did so till 12 o'clock with continuous debugging and coding.
Exactly at 12 we were told to solve Prasad Sir's problem, and that went on for an hour. We were supposed to trace an algorithm for Chain Matrix Multiplication: given a set of matrices like A1, A2, A3, A4..., which order of multiplying them results in the least number of scalar multiplications?
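The tracing we did corresponds to the standard dynamic-programming solution to this problem. A minimal sketch, assuming the dimensions are given as a list where matrix Ai has shape dims[i] x dims[i+1]:

```python
def matrix_chain_order(dims):
    """Minimum number of scalar multiplications needed to multiply a
    chain of matrices, where matrix A_i has shape dims[i] x dims[i+1]."""
    n = len(dims) - 1  # number of matrices
    # cost[i][j] = cheapest way to multiply the chain A_i..A_j (inclusive)
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            # Try every split point k between A_i..A_k and A_{k+1}..A_j.
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]

# A1: 10x30, A2: 30x5, A3: 5x60 -> best order is (A1 A2) A3 = 4500 steps
print(matrix_chain_order([10, 30, 5, 60]))  # 4500
```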
After this long stretch of tracing, we went to have lunch in the NRI canteen, and I was looking forward to it because Bhuvan was supposed to treat.. yahooooooooo!! :)
Then we all had a small birthday party for Madhura, which we enjoyed a lot, and not to forget the pastry cake, it was yummmm... :P
The end of the day was the part Apoo and I were waiting for most eagerly: our code becoming free of errors, which we managed at around 1.30 am thanks to our Sir's help.
But one thing, people.. this "index out of range" error, I tell you, is so damn irritating if you don't overcome it.
I recommend that all of you use the Python debugger (pdb), which saves a lot of coding time when you encounter errors.
The Python debugger is a really wonderful tool of Python...
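Since that "index out of range" error bit us so often, here is a tiny illustration of the pattern behind it (the token list is a made-up example, not our project code), along with the pdb invocations we mean:

```python
# Hypothetical example of the IndexError we kept hitting: reading one
# position past the end of a token list.
tokens = ["text", "simplification", "rocks"]

def safe_get(seq, i):
    """Return seq[i], or None instead of raising IndexError."""
    return seq[i] if 0 <= i < len(seq) else None

print(safe_get(tokens, 2))   # rocks
print(safe_get(tokens, 3))   # None -- no crash

# To debug the crashing version interactively, run the script under pdb:
#   python -m pdb myscript.py
# or drop a breakpoint just before the suspect line:
#   import pdb; pdb.set_trace()
```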
We then started on the next module of our coding, which searches for sub-sentences in the corpus by taking adjacent words.
Our search module was a grand success...
Lastly, credits to Bhuvan because he wrote an amazing piece of code which simplifies a text with complicated words into a simpler one, and the sentence still made sense after the word was replaced... Kudos Bhuvan!
This is the story of my BUSY Monday... more posts to come...
Cya. for now....
Monday, February 21, 2011
Lunch at IISc :-)
Aah! A nice day spent at IISc... Yummy lunch and a fun game played under the shade of the trees. More details to follow this post.
Monday, February 14, 2011
Progress for the day :)
- Removing the stop words.
- Looking up the 1500 simple English words and filtering out the simple words.
- Identifying the complex words (keywords).
- Finding the synonyms for the keywords.
- Finding the co-occurrence rate for the keywords.
- Based on the rate, selecting the proper synonym and fitting it in as the replacement.
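The first three steps above can be sketched like this; the word sets here are tiny stand-ins for the real lists (NLTK's stop-word list and the 1500-word simple-English list):

```python
# Toy stand-ins for the real lists: the stop-word list and the
# 1500-word simple-English list mentioned above.
STOP_WORDS = {"the", "is", "a", "of", "to"}
SIMPLE_WORDS = {"book", "read", "good", "word"}

def find_keywords(sentence):
    """Drop stop words, then drop known-simple words; what remains are
    the complex words (keywords) that need simpler synonyms."""
    tokens = [w.lower().strip(".,!?") for w in sentence.split()]
    content = [w for w in tokens if w not in STOP_WORDS]
    return [w for w in content if w not in SIMPLE_WORDS]

print(find_keywords("The book is a conundrum to read."))  # ['conundrum']
```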
My progress......
I must say that today was not my day :( I got up late, and somehow I managed to get to the mess; it was exactly 9 o'clock by then. When I reached the mess, there was a very long queue, resembling the queue for tickets to the first-day-first-show of a Rajnikant movie. By the time I finished my breakfast it was 9:10. Then I rushed to the library, where I thought my teammates would be waiting for me, but that was not the case :) They came a few minutes later.
We were very energetic and excited to code together. We planned that the four of us would work together, definitely finish off our first module today, and show it to Sir. But our plan didn't work out :( First, the WiFi did not connect. I don't know what happens to the WiFi sometimes (you could say most of the time). We tried and tried, but it didn't connect. Then we stepped into the digital library, but there too the net was very slow; only the system in front of which Bhuvan was sitting was working fine. By that time it was almost 10 o'clock, and we were getting tense that Sir would scold us properly, so we were preparing our minds for that. We left the library and went to the lab, where only two systems were free. We started off with the coding, but we could not code because of the disturbance there. Then we decided to go home. As the WiFi was not connecting, I went to Bhuvan's home, and Anusha and Apoorva went to their homes.
Then we started with the actual coding. We were able to remove the pronouns, proper nouns and prepositions, and we extracted a few keywords. Then we searched for these keywords in a list of the most frequently used English words: if a keyword is present in that list, there is no need to replace it, and we move on to the next word. The code worked fine. Next, we found the set of synonyms for the tokenized words and collected them in one variable. The main and most important step of our module is to replace each word with the proper synonym, one which retains the context of the sentence. As of now we don't know how to do this, but we are trying to find a way. I tried different techniques, but those were not very fruitful. In short, my progress in coding today is that I am now able to extract the keywords, that is, the difficult words, sentence by sentence. The next step is to replace each word with the appropriate synonym that matches the context. This is the most difficult and important step of our coding. Hope we will come up with a solution by tomorrow :)
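The synonym-collection step can be sketched as follows. In the real code the synonyms come from WordNet via NLTK; here a tiny hand-made dictionary stands in so the sketch runs anywhere:

```python
# In the actual code the synonyms come from WordNet, e.g.:
#   from nltk.corpus import wordnet
#   synonyms = {l.name() for syn in wordnet.synsets(word) for l in syn.lemmas()}
# Here a small hand-made dictionary stands in for WordNet.
SYNONYMS = {
    "conundrum": ["problem", "puzzle", "riddle"],
    "commence": ["begin", "start"],
}

def synonym_candidates(keywords):
    """Collect the candidate replacements for each extracted keyword."""
    return {w: SYNONYMS.get(w, []) for w in keywords}

print(synonym_candidates(["conundrum", "commence"]))
```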
My Progress today...
Here we are at the DSCE library at sharp 9 o'clock, trying to find a plug point for our laptops and trying to connect to the Wi-Fi.. Uh-huh! Nothing works our way. Finally we decide to log in through the systems at the digital library, but the connectivity is so bad that we could not even log in to Gmail. We keep running around trying to figure out what to do, and at 10 o'clock we decide to go back to our department and work in the labs. We enter the lab to find two systems unoccupied. What could have possibly gone wrong now? Take a wild guess! Yeah.. the internet does not work with Ubuntu.. Ugh! We tried to code the module with the help of the materials we already had, but it was not working.
Bhuvan said- "I am never going to come to college and work, you people do whatever you want!"
Ambika said- "ahan! Bhuvan.. Today you came to college just because internet was not working at your place. So keep quiet" ;)
Anusha said- "le... it's like we dug a big pit ourselves and ended up falling into it!" :P
Ha ha ha ha! The situation was so funny. I was laughing in spite of zero progress. Thereafter we decided that we would work from home efficiently!
We worked for the rest of the day at home. (of course! I took a number of breaks in between). We wrote a piece of module on our own.
The present module we are aiming at will scan for the presence of a particular sentence in the corpus and hence return the frequency count of that sentence.
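A minimal sketch of that module, with a toy corpus standing in for the real one:

```python
# Toy corpus; in the project this would be the real corpus text.
corpus = (
    "the cat sat on the mat. the cat slept. "
    "the dog sat on the mat."
)

def sentence_frequency(sentence, text):
    """Return how many times `sentence` occurs as a substring of `text`."""
    return text.count(sentence)

print(sentence_frequency("the cat", corpus))         # 2
print(sentence_frequency("sat on the mat", corpus))  # 2
print(sentence_frequency("zebra", corpus))           # 0
```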
I am reading two papers presently. The simplified version of it will follow soon.
- Integrating selectional preferences in WordNet by Eneko Agirre and David Martinez
- Text Simplification for Language Learners: A Corpus Analysis by Sarah E. Petersen, Mari Ostendorf
Experience programming the first module.....
Early in the morning I got up with great enthusiasm, thinking of the fact that our team would collectively start coding a module today and that we were going to see some output by the end of the day.
Personally, we thought teamwork would add more ideas to the project and the work would finish at a faster pace...
As usual it took me an hour to reach college, and as soon as I entered the library, my teammates gave me the sad news about our so-called college Internet connection.
Due to the bad internet connection in the library, we shifted to our department lab and started the actual planning for the module we were supposed to code.
We divided our work into points and started off with coding.........
We took the Brown Corpus as the text to search and Shakuntala's blog as our sample test, and scanned each and every sentence.
After scanning the text, we break each sentence at the spaces.
We take each sentence and match it against the Brown Corpus; if the string is present, the program prints "yes".
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.
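The steps above can be sketched like this; a one-line string stands in for the Brown Corpus text (which our code loads through NLTK):

```python
# Stand-in for the Brown Corpus text; the real code loads it with
# nltk.corpus.brown and joins the words into a single string.
corpus_text = "the jury said the election was conducted fairly"

def check_sentences(sample_text, corpus):
    """Split the sample into sentences and report 'yes' for each one
    that occurs in the corpus, 'no' otherwise."""
    results = []
    for sentence in sample_text.split("."):
        sentence = sentence.strip()
        if sentence:
            results.append("yes" if sentence in corpus else "no")
    return results

print(check_sentences("the election was conducted fairly. purple monkeys.",
                      corpus_text))  # ['yes', 'no']
```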
So this was the progress done today and it continues.........
Cya......
Wednesday, February 9, 2011
Implementation Details of the First Two Modules...
1. After the input text is scanned and broken into sentences, we remove the stop words such as pronouns, prepositions, etc.
2. Now we are left with a few words which may be nouns, verbs or adjectives. We need to choose the keywords for replacement, and hence this is the most important step in the module.
3. Assuming a long sentence will not consist of more than 5 keywords, we limit our count to 5 and process those words.
4. Consider a sentence with 3 keywords, namely $S=\{w_1, w_2, w_3\}$.
5. If $w_1$ has the synonyms $[s_1, s_2, s_3]$, then replace $w_1$ with $s_1$ and find the frequency count of $\{s_1, w_2, w_3\}$. Similarly, replace $w_1$ with $s_2$ and find the frequency count of $\{s_2, w_2, w_3\}$, and so on for all the synonyms of $w_1$. Finally, $w_1$ is replaced with the synonym $s$ which has the highest frequency count. For example, if $s_1$ has frequency $(f=100)$ in $\{s_1, w_2, w_3\}$, $s_2$ $(f=500)$ and $s_3$ $(f=300)$, then we replace $w_1$ with $s_2$.
6. The same process is continued for the remaining words, i.e. $w_2$, $w_3$ (repeat step 5).
7. An existing corpus will be used to check the presence of the keywords in context and their frequency.
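The replacement step above can be sketched like this, with the worked example's toy frequency counts standing in for the corpus lookup:

```python
# Toy document frequencies for keyword combinations; in the module these
# counts would come from the corpus lookup described above.
FREQ = {
    ("problem", "hard", "solve"): 100,
    ("puzzle", "hard", "solve"): 500,
    ("riddle", "hard", "solve"): 300,
}

def best_synonym(synonyms, rest):
    """Substitute each synonym for w1, score the combination {s, w2, w3}
    by its frequency, and keep the most frequent synonym."""
    return max(synonyms, key=lambda s: FREQ.get((s, *rest), 0))

# Matches the worked example above: s2 ('puzzle', f=500) wins.
print(best_synonym(["problem", "puzzle", "riddle"], ("hard", "solve")))  # puzzle
```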
1. After the input text is scanned and broken into sentences, we remove the stop words such as pronouns, prepositions, etc.
2. Now we are left with a few words which may be nouns, verbs or adjectives. We need to choose the keywords for replacement, and hence this is the most important step in the module.
3. Assuming a long sentence will not consist of more than 5 keywords, we limit our count to 5 and process those words.
4. Let us consider the keywords to be $\{w_1, w_2, w_3, w_4, w_5\}$.
5. Find the word in the mean position ($w_3$ in this case), and let $w_3$ have the synonyms $\{s_1, s_2, s_3\}$.
6. Substitute $w_3$ with $s_1$ and find the frequency count for $\{w_2, s_1, w_4\}$. Now consider $\{w_1, w_2, s_1, w_4, w_5\}$ and plot the frequency curve.
7. Repeat step 6 by replacing $w_3$ with the remaining synonyms $\{s_2, s_3\}$.
8. Compare the graphs and finally choose the best synonym to replace $w_3$.
9. Repeat steps 5-8 for all the keywords in the list.
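The mean-position substitution above can be sketched like this (the window frequencies are toy numbers standing in for the corpus counts):

```python
# Toy window frequencies; in the module these come from the corpus.
WINDOW_FREQ = {
    ("w2", "s1", "w4"): 40,
    ("w2", "s2", "w4"): 90,
    ("w2", "s3", "w4"): 10,
}

def best_middle_synonym(keywords, synonyms, freq):
    """Substitute each synonym for the mean-position keyword and score
    the window (previous word, synonym, next word)."""
    mid = len(keywords) // 2          # mean position: w3 for 5 keywords
    window = lambda s: (keywords[mid - 1], s, keywords[mid + 1])
    return max(synonyms, key=lambda s: freq.get(window(s), 0))

keywords = ["w1", "w2", "w3", "w4", "w5"]
print(best_middle_synonym(keywords, ["s1", "s2", "s3"], WINDOW_FREQ))  # s2
```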
Tuesday, February 8, 2011
An Information Retrieval Approach to Sense Ranking
Text simplification mainly involves converting complicated text into simpler text. This is done by replacing complicated words with simpler, frequently used synonyms. Each complicated word will usually have at least two ambiguous synonyms, or senses; one needs to resolve this ambiguity, and how we do that is what this paper is about.
As I had mentioned in my previous post, Word Sense Disambiguation (WSD) is the ability to identify the intended meaning (sense) of a word in context. In WSD, choosing the most frequent sense for an ambiguous word is a powerful heuristic. This paper presents an information-retrieval-based method for sense ranking, which issues queries to an IR (Information Retrieval) engine to estimate the degree of association between a word and its senses.
WSD can be achieved with the help of WordNet and a corpus. Now let us understand what WordNet and a corpus are.
WordNet: a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. WordNet organizes words into synsets; a synset is simply a synonym set, that is, a collection of synonymous words.
Corpus: a large body of text.
Method used: central to the approach is the assumption that context provides important cues regarding a word's meaning. Documents are typically written with certain topics in mind, and these topics are often indicated by word distribution patterns.
For example, documents talking about "congressional tenure" are likely to contain words such as "term of office" or "incumbency", whereas documents talking about "legal tenure" (i.e., the right to hold property) are likely to include words such as "right" or "land". Now, we could estimate which sense of tenure is most prevalent simply by checking whether tenure co-occurs more often with "term of office" than with "land", provided we knew that both of these terms are semantically related to tenure.
Fortunately, senses in WordNet are represented by synonym terms. So, to estimate a word's sense frequencies, all we need to do is count how often the word co-occurs with its synonyms.
The definition of co-occurrence is that two words co-occur if they appear in the same document. After finding the synonym set, the next step is to find the dominant sense (synonym). This is explained as follows.
Dominant Sense Acquisition:
Throughout the paper, the term frequency is shorthand for document frequency, that is, the number of documents that contain a word or a set of words, which may or may not be adjacent. For this we use the synsets of WordNet (explained earlier).
As an example consider the noun "tenure", which has the following senses in WordNet:
(1) Sense 1
tenure, term of office, incumbency (synonym set of tenure)
=> term (hypernym of the above senses)
(2) Sense 2
tenure, land tenure (synonym set of tenure)
=> legal right (hypernym of the above senses)
The senses are represented by the two synsets {tenure, term of office, incumbency} and {tenure, land tenure}. (The hypernyms for each sense are also listed, indicated by the arrows.) We can now approximate the frequency with which a word w1 occurs with sense s by computing its synonym frequencies.
Synonym frequencies: for each word S1 in syns(s), the set of synonyms of s, we issue a query of the form w1 AND S1. These synonym frequencies can then be used to determine the most frequent sense of w1 in a variety of ways (detailed below).
So the queries for the above example of tenure will be as follows:
(1) a. "tenure" AND "term of office"
b. "tenure" AND "incumbency"
(2) "tenure" AND "land tenure"
Hypernym frequencies:
Apart from synonym frequencies, we also generate hypernym frequencies by submitting queries of the form w1 AND S1, for each S1 in hype(s), the set of immediate hypernyms of the sense s. The hypernym queries for the two senses of tenure are:
(3) "tenure" AND "term"
(4) "tenure" AND "legal right"
Hypernym queries are particularly useful for synsets of size one, i.e., where a word in a given sense has no synonyms, and is only differentiated from other senses by its hypernyms.
Once the synonym and hypernym frequencies are in place, we can compute a word's predominant sense in a number of ways.
First way: we can vary how the frequency of a given sense is estimated from the synonym frequencies:
• Sum: The frequency of a synset is computed by summing the synonym frequencies. For example, the frequency of the dominant sense of tenure would be computed by adding up the document frequencies returned by the queries "tenure AND term of office" (1a) and "tenure AND incumbency" (1b).
• Average (Avg): The frequency of a synset is computed by taking the average of the synonym frequencies.
• Highest (High): The frequency of a synset is determined by the synonym with the highest frequency.
Second way: we can vary whether or not hypernyms are taken into account:
• No hypernyms (−Hyp): Only the synonym frequencies are included when computing the frequency of a synset. For example, only the queries "tenure AND term of office" (1a) and "tenure AND incumbency" (1b) are relevant for estimating the dominant sense of tenure.
• Hypernyms (+Hyp): Both synonym and hypernym frequencies are taken into account when computing sense frequency. For example, the frequency for the first sense of tenure would be computed from the document frequencies returned by the queries "tenure AND term of office" (1a), "tenure AND incumbency" (1b) and "tenure AND term" (3) (by summing, averaging, or taking the highest value, as before).
Third way: this option relates to whether the sense frequencies are used in raw or normalized form:
• Non-normalized (−Norm): The raw synonym frequencies are used as estimates of sense frequencies.
• Normalized (+Norm): Sense frequencies are computed by dividing the word-synonym frequency by the frequency of the synonym in isolation. For example, the normalized frequency for "tenure AND term of office" (1a) is computed by dividing the document frequency of "tenure" AND "term of office" by the document frequency of "term of office". Normalizing takes into account the fact that the members of a sense's synset may differ in frequency.
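Putting the three choices together, one configuration such as High, +Norm, −Hyp can be sketched as follows; the document frequencies here are toy numbers loosely based on the tenure example, and the sense labels are my own:

```python
# Toy document frequencies for the "tenure" example: df[("tenure", syn)]
# is the co-occurrence count, df[(syn,)] the synonym's own frequency.
df = {
    ("tenure", "term of office"): 120, ("term of office",): 400,
    ("tenure", "incumbency"): 80,      ("incumbency",): 100,
    ("tenure", "land tenure"): 60,     ("land tenure",): 90,
}

senses = {
    "office-holding": ["term of office", "incumbency"],
    "legal-right": ["land tenure"],
}

def sense_score(word, synonyms, normalize=True, combine=max):
    """Score a sense from its synonyms' co-occurrence frequencies.
    combine = sum / max / statistics.mean gives Sum / High / Avg;
    normalize=True is +Norm (divide by the synonym's own frequency)."""
    scores = []
    for syn in synonyms:
        f = df[(word, syn)]
        if normalize:
            f /= df[(syn,)]
        scores.append(f)
    return combine(scores)

# High, +Norm, -Hyp: the dominant sense is the one with the best score.
ranked = {s: sense_score("tenure", syns) for s, syns in senses.items()}
print(max(ranked, key=ranked.get))  # office-holding
```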
A combination of these options gives the model used for sense acquisition. Model selection is done as follows: the goal is to establish which model configuration is best suited for the WSD task, so we vary how the overall frequency is computed (Sum, High, Avg), whether hypernyms are included (±Hyp), and whether the frequencies are normalized (±Norm).
For example, the following table shows precision (P) and recall (R) for the Sum, High and Avg models under each configuration:

              -Norm                     +Norm
          +Hyp        -Hyp         +Hyp        -Hyp
          P     R     P     R      P     R     P     R
Sum       42.3  40.8  46.3  44.6   45.9  44.3  48.6  46.8
High      51.6  49.8  51.1  49.3   57.2  55.1  59.7  57.9
Avg       44.1  42.6  48.5  46.8   49.6  47.8  51.5  49.6
In sum, the best-performing model is High, +Norm, −Hyp, achieving a precision of 59.7% and a recall of 57.9%.
Once the model has been selected, each complicated word is replaced with the dominant sense found by the selected model. This is how the word-sense ranking is obtained: depending on the rank, the most dominant sense is chosen as the replacement for the complicated word, and this is done for each and every sentence of the text. That's it, we get the simplified version of the text :)