Learning Phrases with PyLucene, Part 1
Lucene is an Apache library for building indexes optimised for text search, and it is the search library that sits underneath Elasticsearch. The following code uses PyLucene, a Python wrapper around the Java API that calls into the JVM through JNI.
The algorithm derives from the idea that the documents returned by a search will show increased frequencies, relative to the index as a whole, for the search terms and for concepts associated with them. Phrases should show increased frequencies in the same way. Lucene makes the counting fast, although the text has to be indexed first. The code also demonstrates an integration between Lucene and Pandas for analytics. The technique could be used to summarise, in aggregate, text entered by users or players in surveys, reviews and so on, which might otherwise be ignored by analytics.
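To make the idea concrete before bringing in Lucene, here is a minimal sketch in plain Python. It is not the PyLucene code from this series: an in-memory list of strings stands in for the index, a simple token match stands in for the search, and the names `bigrams`, `phrase_lift` and `min_count` are made up for illustration. It counts two-word phrases in the matching documents and in the whole collection, then ranks each phrase by the ratio of the two rates.

```python
from collections import Counter


def bigrams(text):
    """Split text on whitespace and return adjacent word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def phrase_lift(docs, query, min_count=2):
    """Rank bigrams by how much more frequent they are in the documents
    that contain `query` than in the collection as a whole.

    A lift above 1 means the phrase is over-represented in the results.
    `min_count` (a hypothetical knob) drops phrases seen too rarely in
    the results to be worth ranking.
    """
    background = Counter()  # phrase counts over every document
    foreground = Counter()  # phrase counts over matching documents only
    for doc in docs:
        grams = bigrams(doc)
        background.update(grams)
        if query.lower() in doc.lower().split():
            foreground.update(grams)

    fg_total = sum(foreground.values())
    bg_total = sum(background.values())
    scores = {}
    for gram, count in foreground.items():
        if count < min_count:
            continue
        fg_rate = count / fg_total
        bg_rate = background[gram] / bg_total
        scores[" ".join(gram)] = fg_rate / bg_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy review text, invented for this example.
docs = [
    "the boss fight in the game was too hard",
    "boss fight needs better checkpoints in the game",
    "the game has great music",
    "the game looks great",
    "loved the boss fight",
]
ranked = phrase_lift(docs, "boss")
for phrase, lift in ranked:
    print(f"{phrase}: {lift:.2f}")
```

In this toy data "boss fight" scores above 1 because it only occurs in the matching reviews, while "the game" scores below 1 because it is common everywhere. With a real index the two sets of counts would come from Lucene rather than from a loop over strings, which is where the speed comes from.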