Texts
From Impure Wiki
basic string structures: String, StringList

basic operators: splitString, occurrences
Handling text is very useful, because many queries are made of text. Text analysis is also the basis of social media analysis. What are people saying about something? How often? Which assets are being associated with a product in the social media ecology? After reading content from blogs, news, Twitter and so on, you might want to analyze the gathered information. Some options are: count the occurrences of a word, count how often two words occur in the same sentence or paragraph, compare the entire lexicon used by two different sources, or visualize the common words used in different resources. Text analysis is also key to data mining, parsing and filtering. When it comes to parsing data, these operators become very useful:
- stringTransformations
- firstTextBetweenStrings
- getAllStringsBetweenTwoStrings
- replaceSubstring
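To make the idea behind these parsing operators concrete, here is a minimal Python sketch of three of them. The function names only mirror the Impure operator names; their exact behavior in Impure (for example, what is returned when a delimiter is missing) is an assumption here, not a description of the actual operators.

```python
def first_text_between_strings(text, start, end):
    """Return the first substring found between start and end, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None
    return text[i:j]


def get_all_strings_between_two_strings(text, start, end):
    """Return every non-overlapping substring enclosed by start and end."""
    if not start or not end:
        # Assumption: empty delimiters are rejected to avoid endless scanning.
        raise ValueError("start and end must be non-empty")
    results = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            break
        results.append(text[i:j])
        pos = j + len(end)  # continue scanning after the closing delimiter
    return results


def replace_substring(text, old, new):
    """Replace every occurrence of old with new."""
    return text.replace(old, new)


page = "<li>apple</li><li>pear</li>"
fruits = get_all_strings_between_two_strings(page, "<li>", "</li>")
```

With the sample `page` above, `fruits` holds the two list items, which is the typical use when pulling fields out of loaded HTML or XML.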
One powerful control (it works like an operator, but needs to download data in order to run) is RemarkableWords. You just feed it a text, and it creates a Table with a StringList containing all the words of the text (without repetitions) and a NumberList with a value associated with each word. This value compares the frequency of the word in the text with the frequency of the same word in the language it belongs to. The result can be seen as a weighted set of tags for the text, calculated in a single step.
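The comparison of frequencies can be illustrated with a small Python sketch. It is not the RemarkableWords implementation: the score used here (frequency in the text divided by frequency in the language) is an assumed formula, and `REFERENCE_FREQ` and `DEFAULT_FREQ` are hypothetical stand-ins for the language data that the control downloads.

```python
import re
from collections import Counter

# Hypothetical relative frequencies of some words in the language.
REFERENCE_FREQ = {
    "the": 0.05,
    "of": 0.03,
    "is": 0.01,
    "data": 0.0002,
    "network": 0.0001,
}
# Assumed frequency for words missing from the reference data.
DEFAULT_FREQ = 0.00001


def remarkable_words(text, reference=REFERENCE_FREQ, default=DEFAULT_FREQ):
    """Return (words, scores): unique words and their remarkableness.

    Assumed score: frequency in the text / frequency in the language.
    Both lists are sorted from most to least remarkable.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    table = [
        (word, (n / total) / reference.get(word, default))
        for word, n in counts.items()
    ]
    table.sort(key=lambda row: row[1], reverse=True)
    words = [word for word, _ in table]
    scores = [score for _, score in table]
    return words, scores


words, scores = remarkable_words("The network of data is the network")
```

Common words such as "the" get a low score even when they are frequent in the text, while rarer words such as "network" rise to the top, which is what makes the result usable as a weighted tag set.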
But there are many other things you can do with texts. For instance:
- create a network of words based on their co-occurrence in sentences or on their consecutiveness.
- calculate, analyze and visualize the occurrences of several words along a text.
- compare two or more texts by the words they contain.
- perform complex searches by combining different criteria.
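As an illustration of the first item, the following Python sketch builds both kinds of word network as weighted edge lists. The sentence splitting and tokenizing rules are simplifying assumptions, and the function names are hypothetical, not Impure operators.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_words(text):
    """Yield the lowercase word list of each sentence (naive splitting)."""
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if words:
            yield words


def cooccurrence_network(text):
    """Undirected edges: how many sentences contain both words."""
    edges = Counter()
    for words in sentence_words(text):
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(words)), 2):
            edges[(a, b)] += 1
    return edges


def consecutive_network(text):
    """Directed edges: how often the second word directly follows the first."""
    edges = Counter()
    for words in sentence_words(text):
        for a, b in zip(words, words[1:]):
            edges[(a, b)] += 1
    return edges


sample = "Cats chase mice. Mice fear cats. Dogs chase cats."
together = cooccurrence_network(sample)
follows = consecutive_network(sample)
```

Each key is a pair of words (an edge) and each value is its weight, so the result can be passed directly to a network layout or filtered by a minimum weight.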
Many of the modules in the apis library return a String. Our culture has a history of more than 500 years of storing texts; the Internet is full of textual information, and this phenomenon continues to grow exponentially. Even numbers are hidden in texts, and you have to perform textual processes to isolate the quantities.
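Isolating quantities hidden in a text can be sketched in Python with a regular expression. The pattern below is an assumption about the number format (unsigned values, "," as thousands separator, "." as decimal point); other locales or signed values would need a different pattern.

```python
import re

# Assumed format: digits, optional ",ddd" groups, optional ".ddd" decimals.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text):
    """Return every quantity found in text as a float."""
    return [float(match.replace(",", "")) for match in NUMBER.findall(text)]


quantities = extract_numbers(
    "The city has 1,250,000 inhabitants and grew 3.5 percent in 2010."
)
```

The sentence-ending period after "2010" is not consumed, because the decimal part of the pattern requires at least one digit after the point.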
The next image shows the result of the following process:
- searching for a word on Twitter using the TwitterSearchLoader api
- joining all 100 Twitter messages loaded
- calculating the remarkable words of the resulting text
- visualizing the remarkable words with CirclesTagCloud
