The task of handling hashtags came up while I was analyzing data from Twitter. I needed to take a hashtag and split it into separate words. The task seemed trivial, but it turned out that I had underestimated it. I had to try several algorithms before I found one that did the job.
This article can be read as a chronology of how I solved the task, with an analysis of the advantages and disadvantages of each algorithm I used. So if you are interested in this topic, please make yourself comfortable.
It should be noted that splitting a long text without spaces into separate words is a very common task in NLP. Here NLP stands for natural language processing, the field concerned with the computer analysis of human language, and not for neuro-linguistic programming, the approach to communication, personal development, and psychotherapy created in the 1970s that shares the same abbreviation.
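To make the task concrete, here is a minimal sketch of one common way to split a hashtag: dictionary-based segmentation with dynamic programming. It is not necessarily one of the algorithms tried in this article. The vocabulary `VOCAB` and the function name `split_hashtag` are hypothetical, chosen only for illustration, and preferring the split with the fewest words is an assumption of this sketch.

```python
from typing import List, Optional, Set

# Hypothetical toy vocabulary; a real one would come from a word list or corpus.
VOCAB: Set[str] = {"i", "love", "new", "york", "data", "analysis",
                   "hash", "tag", "hashtag"}


def split_hashtag(tag: str, vocab: Set[str] = VOCAB) -> Optional[List[str]]:
    """Split a hashtag into vocabulary words, preferring the fewest words.

    Returns None when the text cannot be covered by vocabulary words.
    """
    text = tag.lstrip("#").lower()
    n = len(text)
    # best[i] is the shortest known list of words covering text[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocab:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


print(split_hashtag("#ILoveNewYork"))  # ['i', 'love', 'new', 'york']
print(split_hashtag("#hashtag"))       # ['hashtag']
print(split_hashtag("#xyz"))           # None
```

A sketch like this depends entirely on the vocabulary: a word missing from `VOCAB` makes the whole hashtag unsplittable, which hints at why the task is harder than it first looks.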
A few days ago, the USA announced the withdrawal of its troops from Iraq. They did not leave empty-handed: they took with them the biometric data of three million Iraqi civilians (approximately 10% of the population). For several years, U.S. Marines carried handheld optical scanners that let them quickly collect iris scans and fingerprints from any passer-by in the field.
The University of Milan and Facebook have completed a study of the six degrees of separation theory. It was based on Facebook's social graph, and all Facebook accounts were examined. It turned out that any two people on Earth are separated by a chain of acquaintances that is, on average, 4.74 steps long (not 6).
By the way, within the U.S. the chain is shorter: 4.37 steps (statistically, more than half of Americans aged 13 and older use Facebook).
The researchers said, "Even if we take the most geographically remote Facebook user in the Siberian tundra or the Peruvian jungle, a friend of your friend probably knows a friend of their friend."
Three years ago, Microsoft conducted a similar study of 242 million MSN users who had sent at least one message per month. Their result was 6.6.
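The "steps" in these studies are shortest-path lengths in a friendship graph, averaged over pairs of users. The sketch below illustrates the measure on a tiny made-up graph using breadth-first search. It is not the researchers' method: an exhaustive search like this would not scale to a graph the size of Facebook. The graph `FRIENDS` and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical toy friendship graph: a simple chain a - b - c - d.
FRIENDS: Dict[str, List[str]] = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}


def distances_from(graph: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Breadth-first search: number of steps from source to each reachable user."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist


def average_separation(graph: Dict[str, List[str]]) -> float:
    """Average shortest-path length over all connected pairs of distinct users."""
    total = 0
    pairs = 0
    for person in graph:
        for other, steps in distances_from(graph, person).items():
            if other != person:
                total += steps
                pairs += 1
    return total / pairs if pairs else 0.0


print(round(average_separation(FRIENDS), 2))  # 1.67
```

In the chain above, the six pairs of users are 1, 2, 3, 1, 2, and 1 steps apart, so the average is 10 / 6.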