Extracting remarkable words
From Impure Wiki
Sometimes we have to deal with long bodies of text, and we need a practical way of finding out what they are about and which concepts matter most in them.
Impure offers a module for distilling the essence of texts: RemarkableWords. You can see a description of how it works in featured controls. In general terms, it extracts the most relevant words from any English or Spanish text, together with a weight for each, which is what we need to build a tag cloud: a convenient representation for getting a quick insight into textual data.
We will use it here to build a cloud of the most relevant words in "Alice in Wonderland". The process is very simple: first, we use FileLoader to load the contents of a text file into Impure. The full text of "Alice" is available at http://bestiario.org/research/dataRepository/books/alice_in_wonderland.txt
Then we place RemarkableWords on the stage and pass it the loaded data. Since our text is in English, we don't need to change the 'language' setting. Optionally, the second inlet sets the maximum number of words returned.
RemarkableWords returns a Table where the first column is a StringList of words and the second is a NumberList with the weight of each word. We use the operator getElementFromList twice, once with index 0 and once with index 1, to get the two columns as separate Lists.
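For readers more used to text-based code, the shape of this data can be sketched in Python. This is only an illustration: it does not reproduce how RemarkableWords computes its weights. It assumes plain word frequency with a tiny, hypothetical stopword list, and the names remarkable_words and get_element_from_list are made up here to mirror the Impure modules.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword list (an assumption of this sketch).
STOPWORDS = {"the", "and", "was", "she", "her", "said", "that", "with", "for"}

def remarkable_words(text, max_words=10):
    """Return a two-column table [words, weights], heaviest word first.

    The weight used here is the raw frequency, which is an assumption;
    the real module may weight words differently.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    # Drop stopwords and very short tokens before counting.
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    top = counts.most_common(max_words)
    words = [word for word, _ in top]
    weights = [count for _, count in top]
    return [words, weights]

def get_element_from_list(table, index):
    """Pick one column of the table by its index."""
    return table[index]

sample = (
    "Alice was beginning to get very tired of sitting by her sister. "
    "Alice looked at the rabbit, and the rabbit looked at Alice."
)
table = remarkable_words(sample, max_words=3)
words = get_element_from_list(table, 0)    # first column: the words
weights = get_element_from_list(table, 1)  # second column: their weights
```

The max_words argument plays the role of the second inlet, and the two get_element_from_list calls correspond to the two operators fed with the indices 0 and 1.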
All that is left now is using this data to build a visualization. A natural match for this case is CirclesTagCloud. It will take as input a list of words and a list of weights, which is precisely what we have ready from the previous process. More important words will appear as larger circles.
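How CirclesTagCloud turns a weight into a circle size is not described here. One common choice, shown below purely as an assumption, is to make the area of each circle proportional to its weight, so the radius grows with the square root of the weight. The function name circle_radii and the max_radius parameter are hypothetical.

```python
import math

def circle_radii(weights, max_radius=50.0):
    """Scale weights to radii so that circle area is proportional to weight.

    The heaviest word gets max_radius; negative weights are treated as zero.
    """
    if not weights:
        return []
    top = max(weights)
    if top <= 0:
        return [0.0 for _ in weights]
    return [max_radius * math.sqrt(max(w, 0) / top) for w in weights]

# Three example weights, heaviest first.
radii = circle_radii([100, 25, 4])
```

Scaling by area rather than by radius keeps a word that is four times heavier from looking sixteen times bigger.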
Here's a capture of the finished space, and the Impure code is below it.
0 String http://bestiario.org/research/dataRepository/books/alice_in_wonderland.txt 475 128
3 Number 0 539 402
4 Number 1 542 496
7 String Alice in Wonderland, from Lewis Carrol 285 93
8 getElementFromList 617 375
9 getElementFromList 620 469
16 TableAtAGlance 251 397 185 297
15 RemarkableWords 253 292
13 FileLoader 332 187
12 CirclesTagCloud 748 54 670 713
0 13 15 16 15 8 3 8 1 8 12 4 9 1 15 9 9 12 1 13 15
