After a week-long discussion, we managed to come up with a clear plan to develop a package that simplifies text (at least one paragraph, if not more) ;-)
We intend to build two modules, each based on a different idea. So let us see what those two modules are:
MODULE 1
The module to be developed by Bhuvan and Ambika will, of course, simplify the text. Here is how they are going to do it:
- After the input text is scanned and broken into sentences, we remove the stop words such as pronouns, prepositions, etc.
- Now we are left with a few words, each of which may be a noun, verb or adjective. We need to choose the keywords for replacement, which makes this the most important step in the module.
- Assuming that even a long sentence will not contain more than 5 keywords, we limit our count to 5 and process those words.
- Consider a sentence with 3 keywords, namely $S=\{w_1, w_2, w_3\}$.
- If $w_1$ has the synonyms $\{s_1, s_2, s_3\}$, then replace $w_1$ with $s_1$ and find the frequency count of $\{s_1, w_2, w_3\}$. Similarly, replace $w_1$ with $s_2$ and find the frequency count of $\{s_2, w_2, w_3\}$, and so on for all the synonyms of $w_1$. Finally, $w_1$ is replaced with the synonym $s$ that has the highest frequency count. For example, if $\{s_1, w_2, w_3\}$ has a frequency of $f=100$, $\{s_2, w_2, w_3\}$ has $f=500$ and $\{s_3, w_2, w_3\}$ has $f=300$, then we replace $w_1$ with $s_2$.
- The same process is continued for the remaining words, i.e. $w_2$ and $w_3$ (repeat step 5).
- An existing corpus will be used to check the presence of the keywords in context and their frequency.
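The steps above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the module itself: the stop-word list, the synonym table and the corpus are tiny made-up stand-ins, the function names are hypothetical, and the "frequency count" of a keyword set is assumed to mean the number of corpus sentences that contain all of its words. The `keep_replacements` flag is also an assumption; it switches between scoring later keywords against the already replaced words or against the original ones.

```python
import re

# Toy stand-ins (assumptions): a real run would use a proper stop-word list,
# a thesaurus and a large existing corpus.
STOP_WORDS = {"the", "a", "an", "he", "she", "it", "in", "on", "of", "to", "was", "is"}
SYNONYMS = {"purchased": ["bought", "procured"], "automobile": ["car", "motorcar"]}
CORPUS = [
    "she bought a new automobile last year",
    "he bought a new automobile",
    "the firm procured a new automobile",
    "she bought a new car yesterday",
    "he bought a new car from the dealer",
    "they bought a new car",
    "he bought a new motorcar",
]


def split_sentences(text):
    """Step 1a: break the input text into sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keywords(sentence, limit=5):
    """Steps 1b-3: drop stop words and keep at most `limit` keywords."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS][:limit]


def cooccurrence_count(words, corpus):
    """Assumed frequency count: corpus sentences containing every word."""
    wanted = set(words)
    return sum(1 for line in corpus if wanted <= set(line.split()))


def simplify_keywords(kws, synonyms, corpus, keep_replacements=True):
    """Steps 4-6: replace each keyword with its best-scoring synonym."""
    original = list(kws)
    result = list(kws)
    for i, word in enumerate(original):
        candidates = synonyms.get(word, [])
        if not candidates:
            continue  # no known synonym, keep the word as it is
        # Score against the replaced words so far, or against the originals.
        base = result if keep_replacements else original
        result[i] = max(
            candidates,
            key=lambda s: cooccurrence_count(base[:i] + [s] + base[i + 1:], corpus),
        )
    return result


kws = keywords("She purchased a new automobile.")
print(kws)                                       # ['purchased', 'new', 'automobile']
print(simplify_keywords(kws, SYNONYMS, CORPUS))  # ['bought', 'new', 'car']
```

On ties, `max` keeps the first synonym in the list, which is an arbitrary choice made only for this sketch.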
MODULE 2
The module to be developed by Anusha and me will also simplify the text. This is a colorful one which will involve graphs as well! Here are the details:
- After the input text is scanned and broken into sentences, we remove the stop words such as pronouns, prepositions, etc.
- Now we are left with a few words, each of which may be a noun, verb or adjective. We need to choose the keywords for replacement, which makes this the most important step in the module.
- Assuming that even a long sentence will not contain more than 5 keywords, we limit our count to 5 and process those words.
- Let us consider the keywords to be $\{w_1, w_2, w_3, w_4, w_5\}$.
- Find the word in the middle position ($w_3$ in this case). Let $w_3$ have the synonyms $\{s_1, s_2, s_3\}$.
- Substitute $w_3$ with $s_1$ and find the frequency count for $\{w_2, s_1, w_4\}$. Now consider $\{w_1, w_2, s_1, w_4, w_5\}$ and plot its frequency curve.
- Repeat step 6 by replacing $w_3$ with the remaining synonyms $\{s_2, s_3\}$.
- Compare the graphs and finally choose the best synonym to replace $w_3$.
- Repeat steps 5-8 for all the keywords in the list.
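Here is one rough way the graph-based steps could be sketched for a single keyword. The plan above does not pin down what the frequency curve is or how two graphs are compared, so everything specific here is an assumption: the curve is taken to have one point per keyword, each point being the number of corpus sentences that contain the 3-word window centred on that keyword, and the "best" curve is taken to be the one with the largest total. The corpus, the synonyms and the function names are made up for illustration.

```python
# Toy corpus (assumption): stands in for the existing corpus.
CORPUS = [
    "the doctor prescribed medicine for the chronic ailment",
    "a doctor prescribed medicine to the patient",
    "the nurse prescribed medicine for a chronic cough",
    "the doctor prescribed a drug for the chronic ailment",
    "an old remedy for a chronic ailment",
]


def cooccurrence_count(words, corpus):
    """Assumed frequency count: corpus sentences containing every word."""
    wanted = set(words)
    return sum(1 for line in corpus if wanted <= set(line.split()))


def frequency_curve(kws, corpus, width=3):
    """One point per keyword: count for the window centred on that keyword."""
    half = width // 2
    return [
        cooccurrence_count(kws[max(0, i - half): i + half + 1], corpus)
        for i in range(len(kws))
    ]


def candidate_curves(kws, position, synonyms, corpus):
    """Steps 6-7: substitute each synonym at `position` and build its curve."""
    curves = {}
    for s in synonyms:
        trial = kws[:position] + [s] + kws[position + 1:]
        curves[s] = frequency_curve(trial, corpus)
    return curves


def best_synonym(curves):
    """Step 8 (assumed rule): pick the curve with the largest total."""
    return max(curves, key=lambda s: sum(curves[s]))


kws = ["doctor", "prescribed", "medication", "chronic", "ailment"]
middle = len(kws) // 2  # index of w_3 in a 5-keyword list
curves = candidate_curves(kws, middle, ["medicine", "drug", "remedy"], CORPUS)
for synonym, curve in curves.items():
    print(synonym, curve)
print(best_synonym(curves))  # medicine
```

The middle point of each curve is the count for $\{w_2, s, w_4\}$ from step 6. Each list in `curves` can be handed to any plotting library to draw the graphs for a visual comparison.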
Through these modules, we intend to simplify the text at least partially, if not completely.
So best of luck, "Text-Simplify" mates!
Hey Apoo, a doubt about step 6 in Module 1:
When the second word is processed for replacement, is the synonym of w1 (i.e. s) kept, or is the original word w1 used?
In simpler words, is it [s, s2, w3] or [w1, s2, w3]?
@Anusha.. We have had this discussion before. I feel we should take [s, s2, w3]. But we don't know which one will yield the best result. Maybe we could try both.
@apoorva and @anusha
Check both the methods and try to see which results in better simplification.