Data mining

Algorithms for text segmentation

Hello folks!

A task of handling hashtags has arisen in the context of data analysis from Twitter. It was needed to take hashtag and split it into separate words. The task seemed primitive, but it turned out, I underestimated it. I had to try several algorithms until I found that.

This article could be considered as a kind of chronology of completing the task with the analysis of the advantages and disadvantages of each used algorithms. So if you are interested in this topic, please make yourself comfortable here.

It should be noted that the task of breaking large text without spaces is very common in NLP. Neuro-linguistic programming (NLP) is an approach to communication, personal development, and psychotherapy created in the 1970s. The title refers to a stated connection between the neurological processes "neuro", language "linguistic" and behavioral patterns that have been learned through experience "programming" and can be organized to achieve specific goals in life.

Algorithm 1. Minimum Matching

We read a line and find the first word that matches. We save this word and repeat the procedure for the remainder of the line. If the last line does not match any single word, we assume that the segmentation is not found.
The algorithm is very fast (just what we need), but it is very stupid.

Example: niceday => nice day.
But, the segmentation will not be found for niceweather, because after the word nice, the algorithm determines following things: the pronoun we, articles a & the, and a letter r is not in our dictionary.

Algorithm 2. Maximum Matching and Greedy

Next, everything is the same as in the first case, but we always choose the word with maximum length.
The algorithm is slower than the previous one, because we need to start from the end of the line to determine a word with the maximum length of the first one. The speed will drop significantly if it will be needed to handle very long lines, but since we have data from Twitter, there is no need to worry about that.

An example: niceweather => nice weather
But, again the segmentation will not be found for workingrass. The first word “working” that matches will be our algorithm, and not “work” that will absorb the first letter in the word “grass”.
Can both algorithms be combined in some way? But, then what do we do with a line niceweatherwhenworkingrass? In general, we came to bruteforce.

Algorithm 3. Bruteforce

We generate all possible options and split a string into words using the usual recursive function. There will be some options 2^(N-1), where N is the size of the line. Next, we make screening of those options where got in a substring from the dictionary. The resulting option will be correct one. The main problem of the algorithm is the speed.
Stop! Why do we generate all that and then filter it? We just need to generate the right things.

Algorithm 4. Clever Bruteforce

We modify the recursive function so that the recursive call occurs when we already matched the word from the dictionary. In this case, the desired segmentation is generated right away. The algorithm is very fast, it gives the desired result, and in general, I thought that the taks was completed.
Unfortunately, I lost sight of the ambiguity. The fact that the segmentation of the string is not unique and there are cases when there are dozens of equivalent segmentations.

An example: expertsexchange => (expert sex change, experts exchange)

There is a new sub-task: how to choose the correct segmentation?
I have tried many different options and the results were not very good. I needed a smart algorithm for the machine in order to understand this one “dwarfstealorcore”. So there come to help the machine learning algorithms.

Algorithm 5. Clever Bruteforce with ambiguity resolving (unigram model)

In order to teach our program to solve the ambiguity, we insert a big text file (Train-set) that helps to build a model. In our case, the unigram model is the frequencies that are used for each word in the text. Then, for each of the candidates for segmentation, we consider probability as a product of the probabilities of each word in a candidate. Whichever candidate got more probabilities that candidate won. It's simple.

An example: input => in put
Is it unexpectedly? Just the words “in” and “put” are met very often in the text, while the word “input” is met only 1 time. The unigram model does not know anything about the most even primitive connection between the words.

Algorithm 6. Clever Bruteforce with ambiguity resolving (bigram model)

It is all the same, only now we are building a bigram model of language. This means that we consider the frequencies of all word pairs that go in a row. For example, "Kiev is the capital of Ukraine" will be divided into five bigrams: Kiev is, is the, the capital, capital of, of Ukraine. Using this approach, the model understands at least what words can stand together, and what cannot. Now the frequency of bigram “in put” is zero in our model.


The algorithm shows good results. The weak point is the dictionary. Since the data on Twitter is mostly informal, such as people's names, geographical names, etc., the dictionary separates out a lot of appropriate candidates. Therefore, one of the developmental directions of the algorithm is a rejection of the dictionary. Instead, you can use the words from the Train-set. The second weak point is the Train-set. As everything depends on it in the ML algorithms, you should have as much relevant data as possible. Here, as an option, you can use the Train-set from the data obtained from the same Tweeter.


The dictionary that has over 58,000 words was taken from here.
A file has been selected as the Train-set with more than a million words that was found on the website of Peter Norvig.
All this was done in the language of Clojure. So if you are interested go to github.
xially 26 april 2012, 11:36
Vote for this post
Bring it to the Main Page


Leave a Reply

Avaible tags
  • <b>...</b>highlighting important text on the page in bold
  • <i>..</i>highlighting important text on the page in italic
  • <u>...</u>allocated with tag <u> text shownas underlined
  • <s>...</s>allocated with tag <s> text shown as strikethrough
  • <sup>...</sup>, <sub>...</sub>text in the tag <sup> appears as a superscript, <sub> - subscript
  • <blockquote>...</blockquote>For  highlight citation, use the tag <blockquote>
  • <code lang="lang">...</code>highlighting the program code (supported by bash, cpp, cs, css, xml, html, java, javascript, lisp, lua, php, perl, python, ruby, sql, scala, text)
  • <a href="http://...">...</a>link, specify the desired Internet address in the href attribute
  • <img src="http://..." alt="text" />specify the full path of image in the src attribute