Wednesday, February 2, 2011

Modularizing the Project: Text Simplification

Ufff.... Just done with reading papers :P I still have to blog the second half of Automatic Induction of Rules for Text Simplification; the second half is so complicated I couldn't understand it, but I will try to break it down and complete it soon.

Now, diving into implementing our idea in code. Before that, I should explain our idea and what we are going to do. Wait, wait, you all know what the problem is, right?? Hope you all remember!!

The very basic idea of what we are trying to do is:
  1. Scan the text, parse it into sentences, and further parse each sentence into its constituent words along with their parts of speech.
  2. Find the complicated words in the sentence.
  3. Find synonyms for those words and replace each word with the synonym that matches the context, making sure that the replacement reduces the complexity of the sentence rather than adding to it.
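Roughly, the three steps compose into one pipeline. Here is a toy sketch of that flow; all the helper functions are my own hypothetical stand-ins (a length check for "complex", a one-entry thesaurus), not the real components we will build:

```python
# Toy end-to-end pipeline; each helper is a hypothetical stand-in
# for the stages described below.
def split_into_sentences(text):
    # Step 1: naive splitter on full stops.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def find_complex_words(sentence):
    # Step 2: toy rule -- treat long words as "complex".
    return [w for w in sentence.rstrip(".").split() if len(w) > 7]

def pick_simpler_synonym(word, sentence):
    # Step 3: toy one-entry thesaurus lookup.
    return {"enormous": "big"}.get(word.lower())

def simplify(text):
    out = []
    for sentence in split_into_sentences(text):
        for word in find_complex_words(sentence):
            synonym = pick_simpler_synonym(word, sentence)
            if synonym:
                sentence = sentence.replace(word, synonym)
        out.append(sentence)
    return " ".join(out)

print(simplify("The enormous dog ran"))  # The big dog ran.
```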
Now I will consider each of the above points:
* Scan the text and divide the input into sentences. The frequency of each sentence is checked and stored for future reference. The sentence is then parsed into words; we can eliminate the prepositions, adverbs, and pronouns, and treat only the nouns, adjectives, and verbs as candidates for replacement.
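A minimal sketch of this step, using a regex sentence splitter and a tiny hand-made POS lookup table (a real system would use a proper tagger, e.g. NLTK's):

```python
import re

# Hypothetical POS lookup for the demo; a real system would run a
# part-of-speech tagger instead of a fixed dictionary.
POS = {
    "the": "DET", "boy": "NOUN", "ran": "VERB", "swiftly": "ADV",
    "towards": "PREP", "enormous": "ADJ", "house": "NOUN",
}

CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}  # keep only nouns, verbs, adjectives

def split_sentences(text):
    """Split text into sentences on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Return only the words eligible for replacement."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if POS.get(w) in CONTENT_TAGS]

text = "The boy ran swiftly. The boy ran towards the enormous house."
sents = split_sentences(text)
print(content_words(sents[1]))  # ['boy', 'ran', 'enormous', 'house']
```

The adverb "swiftly", the preposition "towards", and the determiner "the" are filtered out, leaving only the replacement candidates.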

* Use a Google API to find matches for the sentence. If we find related sentences with more hits than the sentence under observation, we compare them: subtracting one sentence from the other (and leaving out prepositions, pronouns, and so on) leaves a small set of differing words from each sentence. If the semantics of the corresponding words match, we can replace our word with the word from the most frequently used sentence, and we get a simplified sentence. But this is very immature and has loads of drawbacks; I'm not mentioning them because it would extend to pages hehe :P :D
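The "subtracting" idea can be sketched as aligning the leftover words of two near-identical sentences after dropping stopwords; shared words cancel out and whatever differs is a candidate replacement pair. This is a toy sketch with a hand-made stopword set, not the real comparison:

```python
# Hypothetical stopword set for the demo.
STOPWORDS = {"the", "a", "an", "to", "of", "he", "she", "it", "and"}

def residue(sentence):
    """Content words left after removing stopwords and the final period."""
    return [w for w in sentence.lower().rstrip(".").split()
            if w not in STOPWORDS]

def candidate_pairs(rare_sentence, frequent_sentence):
    """Align the leftover words of two near-identical sentences.

    Words shared by both sentences cancel out; the positions where
    they differ are the candidate replacement pairs.
    """
    a, b = residue(rare_sentence), residue(frequent_sentence)
    return [(x, y) for x, y in zip(a, b) if x != y]

# "enormous" in the rarer sentence lines up with "big" in the common one.
print(candidate_pairs("He bought an enormous house.",
                      "He bought a big house."))  # [('enormous', 'big')]
```

One obvious drawback visible even here: the positional `zip` alignment only works when the two sentences have the same structure.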

* We can employ either of two methods to find out the complexity of a word:
  • Find the frequency of all the words and consider the two least frequently used words for replacement. The disadvantage here may be that we end up complicating those two words: we could have a sentence which is already very simple and frequently used, but since I'm going after the least frequently used words, I would still go and check for replacements, which is really an abundant loss of time and power.
  • Use a ready-made database that lists the words which are very frequently used; any word that doesn't appear in the database is considered complicated and is taken up for simplification. I feel this is the more robust and simpler approach.
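The second, database-driven approach can be sketched in a few lines. The common-word set here is a tiny hypothetical stand-in; a real system might load something like the Dale-Chall easy-word list or a corpus frequency table:

```python
# Hypothetical "easy word" database; a real system would load a
# ready-made frequent-word list instead of this toy set.
COMMON_WORDS = {"the", "boy", "ran", "fast", "house", "big", "towards", "a"}

def complex_words(sentence):
    """Flag any word not found in the common-word database as complex."""
    words = sentence.lower().rstrip(".").split()
    return [w for w in words if w not in COMMON_WORDS]

print(complex_words("The boy ran swiftly towards the enormous house."))
# ['swiftly', 'enormous']
```

Unlike the per-text frequency counting, this check is a constant-time set lookup per word and never flags words that the database already knows are simple.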
* Now that we have the complicated words in hand, let's look them up in the thesaurus for synonyms. Once we have the synonyms, which one is the best for replacement? Here comes the actual problem. Here I would go for a Google API "that returns the frequency count for the word in that particular sentence", so we are thinking of finding out the appropriateness of the sentence using people's usage of the sentence on the web.
So, considering a single word at a time starting from the lowest frequency, we start replacing as described above.
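Picking the synonym by web usage can be sketched like this. Both the thesaurus and the phrase-hit table are hypothetical stand-ins; the real project would query an actual thesaurus and a search API for hit counts:

```python
# Hypothetical thesaurus and phrase-frequency table standing in for
# a real thesaurus lookup and a web search hit count.
THESAURUS = {"enormous": ["big", "huge", "vast"]}
PHRASE_HITS = {"a big house": 52000, "a huge house": 14000,
               "a vast house": 300}

def best_synonym(word, template):
    """Pick the synonym whose substituted phrase is used most often.

    `template` is the surrounding phrase with a {} slot for the word,
    e.g. "a {} house". Falls back to the original word when no
    candidate phrase has any hits.
    """
    candidates = THESAURUS.get(word, [])
    scored = [(PHRASE_HITS.get(template.format(syn), 0), syn)
              for syn in candidates]
    hits, syn = max(scored) if scored else (0, word)
    return syn if hits else word

# "big" wins because "a big house" has the most (made-up) hits.
print(best_synonym("enormous", "a {} house"))  # big
```

The fallback matters: if none of the synonyms produce a phrase anyone actually uses, we keep the original word rather than risk a worse sentence.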

* Again, after replacement, check the frequency of occurrence of the newly constructed sentence and compare it with the frequency of the old sentence that we stored earlier, just to make sure that our replacements haven't complicated the sentence. :D
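This final sanity check is just a comparison of two stored frequencies. The sentence-hit table here is a made-up placeholder for whatever lookup (e.g. web hit counts) we end up using:

```python
# Hypothetical sentence-frequency lookup standing in for web hit counts.
SENTENCE_HITS = {"He bought an enormous house.": 120,
                 "He bought a big house.": 8400}

def accept_replacement(original, simplified):
    """Keep the rewrite only if it occurs at least as often as the original."""
    return SENTENCE_HITS.get(simplified, 0) >= SENTENCE_HITS.get(original, 0)

print(accept_replacement("He bought an enormous house.",
                         "He bought a big house."))  # True
```

If the check fails, we simply keep the original sentence, so the pipeline can never make a sentence less common than it started.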

Implementing this idea is a superbly interesting task; looking forward to exploring and hacking!..
