Learning Phrases with PyLucene, part 1.

Author

Peter Tillotson

Published

March 2, 2026

Abstract

Lucene is a library for building search-optimised text indexes. It is an Apache project and provides the core index format underneath Elasticsearch. The following code uses PyLucene, a JNI wrapper around the Java API.

The algorithm derives from the idea that search results will show increased frequencies for the search terms and their associated concepts; phrases should similarly show increased frequencies. Using Lucene makes this fast, though we have to index first. The code also demonstrates an integration between Lucene and pandas for analytics. The technique here could be used to summarise, in aggregate, user- or player-entered text in surveys, reviews etc. that might otherwise be ignored by analytics.

In part 2 we reuse our tokenised index and use PyTorch to build a model for significant term extraction.

Introduction

Apache Lucene [1] is a Java library providing powerful indexing and search features, including vectorised on-disk formats for retrieving term positions and associated statistics such as term frequency and document frequency. In the following I demonstrate the use of Lucene alongside pandas in the development of a significant-terms algorithm.

Lucene

A Lucene index is a set of files on disk that enables fast retrieval of documents based on search terms. Figure 1 provides a very quick primer on the process; it is beyond the scope of this paper to develop a full understanding of it. Note the rich information stored, in particular in Term and Posting.

---
config:
  themeVariables:
    fontSize: 10
---
erDiagram
    direction LR
    Index
    IndexWriter
    Document
    Fields {
        bool stored_raw_text 
        bool indexed
    }
    Analyzer
    Term {
        int doc_freq
        int term_freq
    }
    Posting {
        int position_in_token_stream
        int start_offset
        int end_offset
        bytes payload
    }
    IndexReader
    IndexSearcher
    Query

    Index ||--|| IndexWriter : "writes with" 
    IndexWriter ||--|{ Document : has
    Document ||--|{ Fields : has
    Fields ||--|| Analyzer: uses
    Analyzer ||--|{ Term : tokenizes
    Term ||--|| Posting: has

    Index ||--|| IndexReader : "reads with"
    IndexReader ||--|| IndexSearcher: "searches"
    IndexSearcher ||--|| Query: ""
    Query ||--|{ Term: clause

    Analyzer ||--|| EnglishAnalyzer : impl
    Analyzer ||--|| StemmingAnalyzer : impl
%%    Analyzer ||--|| "..." : "many impl"
Figure 1: A quick primer to the Lucene concepts used in this paper.

Token Streams, Analyzers, Stop Words and Postings

When indexing, text is turned into a token stream. An EnglishAnalyzer:

  • breaks on whitespace, removing punctuation
  • lower-cases, and
  • removes common stop words

Alongside each Term it also stores the position in the token stream and the start / end byte offsets into the original text.

---
config:
  themeVariables:
    fontSize: 10
  flowchart:
    rankSpacing: 15
    nodeSpacing: 25
    padding: 10
---
flowchart LR
    %% This graph shows the Lucene analysis process term by term.
    %% It is laid out top-to-bottom (TD).

    %% Define a class for stop words to style them red.
    classDef stopword fill:#ffdddd,stroke:#ff0000

    %% Subgraph for the bottom row: the postings.
    %% It also flows left-to-right to align with the token stream.
    subgraph "<span style='white-space:nowrap'>The Quick Brown Fox Jumps Over the Lazy Dog.</span>"
        direction LR 
        subgraph n1 ["The"]
            direction LR
            T1("the"):::stopword
            P1["(stop word)<br/>_"];
        end
        subgraph n2 ["Quick"]
            direction LR
            T2("quick")
            P2["Pos: 1<br/>Off: 4-9"];
        end
        subgraph n3 ["Brown"]
            direction LR
            T3("brown")
            P3["Pos: 2<br/>Off: 10-15"];
        end
        subgraph n4 ["Fox"]
            direction LR
            T4("fox")
            P4["Pos: 3<br/>Off: 16-19"];
        end
        subgraph n5 ["Jumps"]
            direction LR
            T5("jumps")
            P5["Pos: 4<br/>Off: 20-25"];
        end
        subgraph n6 ["Over"]
            direction LR
            T6("over"):::stopword
            P6["(stop word)<br/>_"];
        end
        subgraph n7 ["the"]
            direction LR
            T7("the"):::stopword
            P7["(stop word)<br/>_"];
        end
        subgraph n8 ["Lazy"]
            direction LR
            T8("lazy")
            P8["Pos: 7<br/>Off: 35-39"];
        end
        subgraph n9 ["Dog."]
            direction LR
            T9("dog")
            P9["Pos: 8<br/>Off: 40-43"];
        end
        %% Link postings together
        n1 --> n2 --> n3 --> n4 --> n5 --> n6 --> n7 --> n8 --> n9;
    end;
Figure 2: An example token stream from an EnglishAnalyzer.
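
The analysis chain in Figure 2 can be emulated in a few lines of plain Python. This is only a rough sketch: the stop list is a tiny illustrative subset, and the real EnglishAnalyzer also stems tokens (e.g. "jumps" becomes "jump"). Note that stop words still consume a position, which is why "lazy" lands at position 7.

```python
import re

# Tiny stop list for illustration only; Lucene's English stop set is larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "over"}

def sketch_analyze(text):
    """Rough emulation of the chain: tokenize, lower-case, drop stop words,
    keeping the token-stream position and start/end offsets."""
    tokens = []
    position = -1
    for match in re.finditer(r"[A-Za-z]+", text):
        position += 1                 # stop words still consume a position
        term = match.group().lower()
        if term in STOP_WORDS:
            continue                  # dropped from the stream, not stored
        tokens.append({
            "term": term,
            "position": position,
            "offset_start": match.start(),
            "offset_end": match.end(),
        })
    return tokens

stream = sketch_analyze("The Quick Brown Fox Jumps Over the Lazy Dog.")
print([(t["term"], t["position"]) for t in stream])
# → [('quick', 1), ('brown', 2), ('fox', 3), ('jumps', 4), ('lazy', 7), ('dog', 8)]
```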

When I look at [2] I cannot help thinking there are clear parallels, which inspired the choice to do a follow-up part 2. I plan to take the phrases learned here to bootstrap the build of a more general model in PyTorch.

The algorithm

Enough of future-me problems for the time being. The algorithm for extracting phrases used here is as follows.

  1. We build an index of all docs, storing terms and, alongside each term, its position and byte offsets within the text.
  2. We search and retrieve a subset of docs.
  3. We compare the term statistics for the search results with those of the index and find some terms are now more frequent.
  4. We use the adjacencies from the more frequent term positions to build phrases.

Note: It is really fast.
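
The core of step 3 can be illustrated with toy counts (hypothetical numbers, not the IMDB data). A simple frequency ratio ("lift") shows the idea; the notebook below uses a tf/df difference instead.

```python
import pandas as pd

# Hypothetical whole-index term counts vs counts within a result subset
index_counts = pd.Series({"movie": 1000, "plot": 400, "thriller": 60, "pacino": 10})
result_counts = pd.Series({"movie": 50, "plot": 20, "thriller": 40, "pacino": 8})

# Terms whose share of the results exceeds their share of the index
# are candidates for significance
lift = (result_counts / result_counts.sum()) / (index_counts / index_counts.sum())
print(lift.sort_values(ascending=False))
```

Here "pacino" and "thriller" move up sharply, while background terms like "movie" score below 1.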

Setting up the environment and loading IMDB

Code
import textwrap
import plotly

import pandas as pd
import numpy as np

plotly.offline.init_notebook_mode()
plotly.io.renderers.default = 'svg'
pd.options.plotting.backend = "plotly"

Loading the IMDB movie review document set

Code
reviews_df = pd.read_json('data/reviews.json')
reviews_df.columns = ['text', 'sentiment']
print(reviews_df.sentiment.unique())
reviews_df
[0 1]
text sentiment
0 Once again Mr. Costner has dragged out a movie... 0
1 This is an example of why the majority of acti... 0
2 First of all I hate those moronic rappers, who... 0
3 Not even the Beatles could write songs everyon... 0
4 Brass pictures (movies is not a fitting word f... 0
... ... ...
49995 Seeing as the vote average was pretty low, and... 1
49996 The plot had some wretched, unbelievable twist... 1
49997 I am amazed at how this movie(and most others ... 1
49998 A Christmas Together actually came before my t... 1
49999 Working-class romantic drama from director Mar... 1

50000 rows × 2 columns

Building the Lucene index

This is the code that indexes all the documents; it takes around 7s to index 50,000 documents and is potentially a one-off operation. We configure the fields we would like to store and how each is indexed / stored. For this paper we:

  • Only index the sentiment value
  • Store and index the text, using an EnglishAnalyzer and also storing position in the token stream and byte offsets

We haven’t here, but we could have indexed the same text with a range of Analyzers and added each as its own field.

Code
# This requires pylucene, which is a thin wrap over Lucene running 
# in a JVM
import os
import lucene

from pathlib import Path

from java.nio.file import Paths
from org.apache.lucene.analysis.en import EnglishAnalyzer
from org.apache.lucene.document import (
    Document, 
    Field,
    FieldType
)
from org.apache.lucene.index import (
    DirectoryReader,
    IndexOptions, 
    IndexWriter,
    IndexWriterConfig,
    Term
)

from org.apache.lucene.store import NIOFSDirectory
from org.apache.lucene.util import BytesRefIterator


if not os.environ.get('jvm_started', False):
    # Here is the JVM being spun up
    env = lucene.initVM(vmargs=['-Djava.awt.headless=true', '-Xmx256M'])
    os.environ['jvm_started'] = "True"
Code
%%time
index_dir = 'index'

# Define field type
t1 = FieldType()                       # for sentiment 
t1.setStored(True)                     # store the raw value 
t1.setIndexOptions(IndexOptions.DOCS)

t2 = FieldType()                       # for text
t2.setStored(True)                     
t2.setStoreTermVectors(True)           # Needed for quick extract of stats
t2.setStoreTermVectorPositions(True)   # To help with co location 
t2.setStoreTermVectorOffsets(True)     # So I can reference back to stored text
t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)


if not Path(index_dir).exists():
    # build the index if not exist 
    fsDir = NIOFSDirectory(Paths.get(index_dir))
    writerConfig = IndexWriterConfig(EnglishAnalyzer())
    writer = IndexWriter(fsDir, writerConfig)

    def index_review(tpl):
        # Add a document, assign type 
        doc = Document()
        doc.add(Field('sentiment', str(tpl.sentiment), t1))
        doc.add(Field('text', tpl.text, t2))
        writer.addDocument(doc)
        return tpl
    
    try: 
        reviews_df.apply(index_review, axis=1)
    finally:
        writer.commit()
        writer.forceMerge(1, True)
        writer.close()

# open an index reader 
fsDir = NIOFSDirectory(Paths.get(index_dir))
reader = DirectoryReader.open(fsDir)
print(f"{reader.numDocs()} docs found in index")
50000 docs found in index
CPU times: user 6.22 s, sys: 549 ms, total: 6.77 s
Wall time: 6.26 s

If you were building towards production it would be straightforward to abstract the field configuration, and establishing conventions here would be beneficial; Lucene does not enforce that all documents in a given index have the same fields. Maybe always have a raw field that stores the source record, in case you want to reindex differently over time.

Extracting index stats

This works against the index and gets a TermsEnum for the text field we used when indexing documents.

A TermsEnum enables iterating over every Term and provides access to term frequency and document frequency. It is really quick, and these are the only fields needed for the index stats.

Code
%%time
leaves = reader.leaves()

term_stats = []
for leaf_reader in leaves:
    # Building the index term and document stats 
    te = leaf_reader.reader().terms('text').iterator()
    for term in BytesRefIterator.cast_(te):
        # Iterate through all terms
        te.seekExact(term)
        term_stats.append({
            'term': term.utf8ToString(), 
            'doc_freq': te.docFreq(), 
            'term_freq':te.totalTermFreq()
        })

index_term_stats_df = (
    pd.DataFrame(term_stats)
    .set_index('term')
)
index_term_stats_df['tfidf'] = (
    index_term_stats_df.term_freq / index_term_stats_df.doc_freq
)
    
index_term_stats_df = (
    index_term_stats_df
    .sort_values(by=['tfidf'], ascending=[False])
)
index_term_stats_df
CPU times: user 247 ms, sys: 4.11 ms, total: 251 ms
Wall time: 202 ms
doc_freq term_freq tfidf
term
trivialbor 1 26 26.0
stop.oz 1 23 23.0
montero 1 20 20.0
narvo 1 13 13.0
tucso 1 13 13.0
... ... ... ...
ibus 2 2 1.0
ibánez 1 1 1.0
ibéria 1 1 1.0
ica 1 1 1.0
lyrics.i 1 1 1.0

79664 rows × 3 columns

Figure 3 presents the document frequencies for terms in the index. The curve is pretty typical: there is a small set of words we use all the time, even after stop words are removed.

Something to think on is how much we can learn from anything that is too frequent or too rare. For index_term_stats_df above we could probably have skipped loading any Term with {"doc_freq": 1, "term_freq": 1} with very little impact on our ability to extract phrases.
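
That pruning is a one-line filter on the stats frame; a sketch against a toy frame with the same columns (hypothetical terms):

```python
import pandas as pd

# Toy stats frame shaped like index_term_stats_df
stats = pd.DataFrame(
    {"doc_freq": [1, 1, 3, 40], "term_freq": [1, 2, 5, 90]},
    index=pd.Index(["hapax", "rare", "keitel", "movie"], name="term"),
)

# Drop terms seen exactly once in exactly one document: they add little
# to phrase extraction but inflate the stats frame
pruned = stats.loc[~((stats.doc_freq == 1) & (stats.term_freq == 1))]
print(pruned.index.tolist())  # → ['rare', 'keitel', 'movie']
```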

Code
min_doc_frequency = 4

fig = index_term_stats_df.loc[
    index_term_stats_df.doc_freq>min_doc_frequency
].doc_freq.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()
Figure 3
Code
# Shorter list of most common terms
#| label: fig-df-term-freq
#| caption: Document frequency for more common terms in the index.

min_doc_frequency = 3000

fig = index_term_stats_df.loc[
    index_term_stats_df.doc_freq>min_doc_frequency
].doc_freq.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()

Searching the index and getting search stats

When we search we use a query; I’ve used a BooleanQuery here, which lets me add clauses.

We search for the top 500 documents that match the query, then use the TermVector aligned on each search hit’s doc_id and extract a DataFrame in a similar way to the above. This time we also extract the positions and byte offsets, which are then used in phrase building.

The postings data can occur more than once per doc_id, e.g. a Term can appear multiple times in the text.

Code
from org.apache.lucene.search import (
    IndexSearcher,
    TermQuery, 
    BooleanQuery,
    BooleanClause
)
Code
%%time
searcher = IndexSearcher(reader)

query = (
    BooleanQuery.Builder()
    .add(TermQuery(Term('text','thriller')), BooleanClause.Occur.MUST)
    #
    #.add(TermQuery(Term('sentiment',"0")), BooleanClause.Occur.MUST)
    #.add(TermQuery(Term('sentiment',"1")), BooleanClause.Occur.MUST)
).build()
hits = searcher.search(query, 500)

term_stats = []
for hit in hits.scoreDocs:
    # This is really quick, we are pulling search term doc and term frequencies 
    # along with their positions directly from the index. 
    vector = reader.termVectors().get(hit.doc)
    te = vector.terms('text').iterator()
    for term in BytesRefIterator.cast_(te):
        te.seekExact(term)
        postings = te.postings(None)
        postings.nextDoc()
        freq = postings.freq()          
            
        term_stats.append({
            'term': term.utf8ToString(), 
            'doc_freq': te.docFreq(), 
            'term_freq':te.totalTermFreq(),
            'doc_id':hit.doc,
            'postings': [
                {
                    'position': postings.nextPosition(),
                    'offset_start': postings.startOffset(),
                    'offset_end': postings.endOffset()
                } 
                for idx in range(0,freq)
            ], 
            # 'positions': [postings.nextPosition() for idx in range(0,freq)]
            
        })

doc_term_stats_df = (
    pd.DataFrame(term_stats)
    .set_index('term')
    .sort_index()
)

doc_term_stats_df['positions'] = doc_term_stats_df.postings.apply(lambda x: [p['position'] for p in x])
doc_term_stats_df
CPU times: user 789 ms, sys: 19.2 ms, total: 808 ms
Wall time: 280 ms
doc_freq term_freq doc_id postings positions
term
0 1 1 18663 [{'position': 86, 'offset_start': 511, 'offset... [86]
00.01 1 1 48714 [{'position': 496, 'offset_start': 2754, 'offs... [496]
02 1 1 38283 [{'position': 115, 'offset_start': 613, 'offse... [115]
06 1 1 38606 [{'position': 95, 'offset_start': 526, 'offset... [95]
1 1 1 47961 [{'position': 77, 'offset_start': 441, 'offset... [77]
... ... ... ... ... ...
zone 1 1 33334 [{'position': 66, 'offset_start': 365, 'offset... [66]
zudina 1 1 33498 [{'position': 333, 'offset_start': 1857, 'offs... [333]
zuniga 1 1 42945 [{'position': 124, 'offset_start': 682, 'offse... [124]
zuniga 1 1 42941 [{'position': 80, 'offset_start': 458, 'offset... [80]
über 1 1 36477 [{'position': 125, 'offset_start': 760, 'offse... [125]

42955 rows × 5 columns

Now we have to aggregate the search stats in a similar way to what we did for index stats.

Code
%%time

search_term_stats_df = (
    doc_term_stats_df[['doc_freq', 'term_freq']]
    .groupby('term')
    .sum()
)
search_term_stats_df['tfidf'] = search_term_stats_df.term_freq/search_term_stats_df.doc_freq

search_term_stats_df = (
    search_term_stats_df
    .sort_values(by=['tfidf'], ascending=[False])
)
search_term_stats_df
CPU times: user 4.93 ms, sys: 827 µs, total: 5.76 ms
Wall time: 5.8 ms
doc_freq term_freq tfidf
term
anand 2 15 7.5
fulci 1 7 7.0
zandale 1 7 7.0
anatomi 1 6 6.0
winfield 1 6 6.0
... ... ... ...
gene 2 2 1.0
gender 1 1 1.0
gem 6 6 1.0
gellar 1 1 1.0
über 1 1 1.0

6828 rows × 3 columns

Identifying significant terms in the search results

The search yields result documents in which some terms are now more likely to occur than they were in the index as a whole. In finding the significant terms we’re interested in those that move the furthest.

I put some constraints here: I am probably only interested in Terms that appear in more than one doc, and in the most significant ones.

Code
%%time
full_df = (
    index_term_stats_df
    .join(
        search_term_stats_df, 
        lsuffix='_idx', 
        rsuffix='_src',
        how='inner'
    )
)
full_df = full_df.loc[full_df.doc_freq_src>1]
full_df['tfidf_diff'] = (
    full_df.tfidf_src      # Search tf/df 
    -                      # minus, terms that are no different = 0
    full_df.tfidf_idx      # Index tf/df    
)
full_df.sort_values(by=['tfidf_diff'], ascending=[False], inplace=True)

full_df.loc[full_df.doc_freq_src>1].head(10)
CPU times: user 3.07 ms, sys: 567 µs, total: 3.64 ms
Wall time: 3.16 ms
doc_freq_idx term_freq_idx tfidf_idx doc_freq_src term_freq_src tfidf_src tfidf_diff
term
anand 22 59 2.681818 2 15 7.500000 4.818182
depalma 25 42 1.680000 3 15 5.000000 3.320000
pierc 169 198 1.171598 2 8 4.000000 2.828402
altman 114 229 2.008772 4 19 4.750000 2.741228
pacino 179 313 1.748603 4 17 4.250000 2.501397
dev 17 66 3.882353 2 12 6.000000 2.117647
mute 212 245 1.155660 5 15 3.000000 1.844340
keitel 50 64 1.280000 2 6 3.000000 1.720000
cage 316 503 1.591772 7 23 3.285714 1.693942
dahl 26 43 1.653846 3 10 3.333333 1.679487
Code
fig = full_df.loc[full_df.doc_freq_src>1].tfidf_diff.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()
Figure 4

Figure 4 presents the difference in frequencies of search terms when compared to the index. Terms that see little difference, i.e. are equally common in both, will score close to zero.

Building phrases using co-location

The intuition here is that in search results the terms of a phrase, like a name, are likely to move in significance together. Search for “Western” movies and we might expect “John” and “Wayne” to individually move in significance, but by looking at their frequent positions we can link them.

The following algorithm does this; it adds the concept of slop, allowing for terms that are near but not directly adjacent.
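
The grouping idea behind slop can be sketched in plain Python (the notebook below uses a vectorised pandas version; this toy helper is hypothetical):

```python
def group_positions(positions, max_slop=2):
    """Group sorted term positions within one document into phrase runs:
    consecutive positions at most max_slop apart belong to the same run."""
    runs, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] <= max_slop:
            current.append(pos)        # close enough: same phrase
        else:
            runs.append(current)       # gap too big: start a new run
            current = [pos]
    runs.append(current)
    return [r for r in runs if len(r) > 1]  # single terms are not phrases

# "john"@10, "wayne"@11 are adjacent; 90 and 93 are too far apart for slop 2
print(group_positions([10, 11, 58, 60, 90, 93]))
# → [[10, 11], [58, 60]]
```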

Code
%%time
max_slop=2  # Distance between words in phrase

phrase_df  =(
    full_df.loc[
        ( full_df.doc_freq_src>1 )  # In more than 1 doc
        & ( full_df.tfidf_diff>0 )  # Has moved in significance
    ]
    .head(200)
    .join(
        doc_term_stats_df[['doc_id', 'positions']],
        how='left'
    )
    .sort_values(by=['tfidf_diff'], ascending=[False])
    [['doc_id', 'positions']]
    .explode('positions')
    .reset_index()
    .set_index('doc_id')
    .sort_values(by=['doc_id', 'positions'])
)
count_df = phrase_df.groupby('doc_id')['positions'].count()
count_df.name = "nos_row"
count_df = count_df.loc[lambda x: x>1] 
phrase_df = count_df.to_frame().join(phrase_df).drop('nos_row', axis=1)

phrase_df['pos_diff'] = phrase_df.positions.groupby('doc_id').diff()
phrase_df['pos_diff'] = np.where(
    phrase_df.pos_diff<=max_slop, phrase_df.pos_diff,
    np.where(
        (phrase_df.pos_diff>max_slop) & (phrase_df.pos_diff.shift(-1)<=max_slop), 0,
    np.where(
        np.abs(phrase_df.positions.shift(-1)-phrase_df.positions)<=max_slop, 0,-1))
)

phrase_df.head(8)
CPU times: user 7.08 ms, sys: 1.5 ms, total: 8.58 ms
Wall time: 8.12 ms
term positions pos_diff
doc_id
11 thriller 58 -1
11 thriller 64 -1
11 novel 76 -1
11 thriller 189 -1
114 thriller 116 -1
114 thriller 283 -1
114 member 319 -1
185 t 10 -1
Code

%%time
def build_phrase(df):
    out = []
    phrase = []
    positions = []
    last=0
    for doc_id,row in df.iterrows():
        idx = row.pos_diff
        if idx == 0:
            if len(phrase)>0:
                out.append(
                    pd.Series(
                        {'phrase': ' '.join(phrase), 'positions': positions},
                    )
                )
                phrase = []
                positions = []
                last=0
        diff = idx-last
        if diff > 1:
            for i in range(diff-1):
                phrase.append("???") # if slop >1 I might miss a term, so use placeholder
                positions.append(positions[-1]+1)
        phrase.append(row.term)
        positions.append(row.positions)
        last=idx
    
    if len(phrase)>0:
        out.append(
            pd.Series(
                {'phrase': ' '.join(phrase), 'positions': positions},
            )
        )
    return  pd.DataFrame(out).set_index('phrase')

all_phrase_df = (
    # inline test
    phrase_df.loc[
        (phrase_df.pos_diff>=0) 
    ]
    .groupby('doc_id')
    .apply(build_phrase)
    .sort_values(by='phrase')
    .reset_index()
)

(
    all_phrase_df
   .loc[lambda x: x.phrase.str.len()<64] # To fix pdf table render
   .set_index('phrase')
).head(8)
CPU times: user 39 ms, sys: 1.4 ms, total: 40.4 ms
Wall time: 40 ms
doc_id positions
phrase
al pacino 40503 [90, 91]
al pacino 40503 [71, 72]
al pacino 40491 [172, 173]
al pacino 40491 [715, 716]
al pacino 40491 [260, 261]
al pacino 40491 [599, 600]
al pacino cusack 40504 [27, 28, 30]
associ ??? rock 25176 [109, 110, 111]
Code
%%time
significant_phrases_df = (
    all_phrase_df
    .drop(["positions"], axis=1)
    .groupby('phrase')
    .agg({
        "doc_id": ['nunique', lambda x: np.unique(x).tolist()]
    })
    .droplevel(0, axis=1)
    .rename({"nunique":"nos_docs"}, axis=1)
    .rename({"<lambda_0>":"doc_ids"}, axis=1)
    .loc[lambda x: x.nos_docs>1]
    .sort_values(by=['nos_docs'], ascending=[False])
).head(50)

significant_phrases_df
CPU times: user 2.88 ms, sys: 435 µs, total: 3.32 ms
Wall time: 3.15 ms
nos_docs doc_ids
phrase
music video 16 [4172, 10713, 35326, 36748, 36751, 36753, 3675...
michael dougla 8 [16814, 16815, 16819, 35759, 35762, 35764, 484...
polit thriller 8 [6796, 16815, 16819, 26213, 32420, 40500, 4050...
michael ??? thriller 5 [36750, 36756, 36766, 36768, 36769]
robert altman 4 [6322, 43012, 43013, 46909]
hitchcockian thriller 4 [5463, 7445, 43174, 48620]
red rock 3 [49316, 49318, 49325]
pacino ??? cusack 3 [40491, 40500, 40504]
mute wit 3 [33299, 33498, 33521]
lara ??? boyl 3 [49316, 49318, 49325]
denzel washington 3 [19086, 47750, 47758]
conspiraci thriller 3 [6796, 9855, 26211]
song thriller 3 [4172, 36756, 36761]
servic ??? pete 2 [48428, 48429]
thriller ??? thriller 2 [7445, 44486]
southern gothic 2 [43012, 44469]
vijai anand 2 [36164, 36176]
al pacino 2 [40491, 40503]
robert mitchum 2 [5463, 43012]
citi hall 2 [40491, 40503]
music video thriller 2 [36761, 36773]
isn t 2 [22713, 23335]
zane ??? brook attend 2 [5984, 8161]

Fixing placeholder, unknown values

To do this we can use the offsets and the original document. We have all the byte offsets and the stored original text, so we can simply look a term up, even if it had been stripped out as a stop word.
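
As a minimal sketch (hypothetical text and offsets; the real code below pulls the text from the stored field and the offsets from the postings):

```python
# Slicing the stored original text between the phrase's minimum start offset
# and maximum end offset recovers the surface form, stop words included.
text = "Al Pacino and John Cusack star in City Hall."
offset_start, offset_end = 3, 25  # min start / max end across the phrase terms
print(text[offset_start:offset_end])  # → Pacino and John Cusack
```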

Code
%%time
place_df = (
    # significant_phrases_df
    significant_phrases_df.loc[
        significant_phrases_df.index.str.contains("???", regex=False)
    ]
    .drop('doc_ids', axis=1)
    .join(
        all_phrase_df.set_index('phrase')
        , how='inner'
    )
    
)
place_df['terms']= place_df.index.str.split()

def do_zip(tpl):
    tpl['zip'] = list(zip(tpl['terms'], tpl['positions']))
    return tpl

def de_tuple(tpl):
    term, position = tpl['zip']
    tpl['term'] = term
    tpl['position'] = position
    return tpl

def extract(tpl):
    d = tpl['postings']
    flag = pd.notna(d)
    tpl['position_r']=  d['position'] if flag else -1
    tpl['offset_start']= d['offset_start'] if flag else -1
    tpl['offset_end']= d['offset_end'] if flag else -1
    return tpl

place_df = (
    place_df
    .apply(do_zip, axis=1)
    .drop(['positions', 'terms'], axis=1)
    .reset_index()
    .reset_index()
    .rename({'index':'pid'}, axis=1) # to handle repeated phrase in a doc_id
    .explode('zip')
    .apply(de_tuple, axis=1)
    .drop('zip', axis=1)
    .reset_index()
    .set_index(['doc_id', 'term'])
    .join(
        (
            doc_term_stats_df
            .reset_index()
            .set_index(['doc_id', 'term'])
            .drop('positions', axis=1)
        )
        , how='left'
    )
    .explode('postings')
    .apply(extract, axis=1)
    .drop('postings', axis=1)
    .loc[lambda x: x.position==x.position_r]
    .reset_index()
    .groupby(['pid', 'phrase', 'doc_id'])
    .agg({
        'offset_start': min,
        'offset_end': max
    })
    .reset_index('doc_id')
)

place_df
CPU times: user 181 ms, sys: 6.76 ms, total: 188 ms
Wall time: 187 ms
doc_id offset_start offset_end
pid phrase
0 michael ??? thriller 36768 0 26
1 michael ??? thriller 36769 50 76
2 michael ??? thriller 36766 855 882
3 michael ??? thriller 36750 404 430
4 michael ??? thriller 36756 458 485
5 michael ??? thriller 36766 65 92
6 pacino ??? cusack 40500 132 149
7 pacino ??? cusack 40504 918 935
8 pacino ??? cusack 40491 2033 2052
9 lara ??? boyl 49316 1739 1755
10 lara ??? boyl 49318 505 523
11 lara ??? boyl 49325 1058 1074
12 lara ??? boyl 49318 984 1000
13 servic ??? pete 48429 440 458
14 servic ??? pete 48428 845 863
15 thriller ??? thriller 7445 26 50
16 thriller ??? thriller 44486 578 609
17 zane ??? brook attend 5984 130 153
18 zane ??? brook attend 8161 130 153
Code
%%time
for d in place_df.itertuples():
    index, doc_id, offset_start, offset_end = d
    stored = reader.storedFields()
    doc = stored.document(doc_id)
    text = doc.get('text')
    
    print(f"Doc {doc_id}: {text[offset_start:offset_end]}")
Doc 36768: Michael Jackson's Thriller
Doc 36769: Michael Jackson's THRILLER
Doc 36766: Michael Jackson - "Thriller
Doc 36750: Michael Jackson's Thriller
Doc 36756: Michael Jackson's 'Thriller
Doc 36766: Michael Jackson's "Thriller
Doc 40500: Pacino and Cusack
Doc 40504: Pacino and Cusack
Doc 40491: Pacino. John Cusack
Doc 49316: Lara Flynn Boyle
Doc 49318: Lara Flynn Boyle's
Doc 49325: Lara Flynn Boyle
Doc 49318: Lara Flynn Boyle
Doc 48429: Service Agent Pete
Doc 48428: Service agent Pete
Doc 7445: Thriller". But thrillers
Doc 44486: thrillers, especially thrillers
Doc 5984: Zane and Brook attended
Doc 8161: Zane and Brook attended
CPU times: user 2.99 ms, sys: 1.03 ms, total: 4.02 ms
Wall time: 6.19 ms

Conclusion

As analysts and data people we have user- or player-entered text in lots of places: surveys, reviews, social profiles etc. We often ignore it, but we don’t have to. The technique here demonstrates the use of Lucene directly, but Elasticsearch has significant-terms tools [3,4]. I go a step further and build phrases from the observation that the terms in a phrase are co-located. I’ve demonstrated that this is possible even when stop words are dropped on the floor, and provided a method for recovering the original sense and context.

It is fast; approximate timings:

  • Indexing 50,000 short documents, 7s
  • Building index stats, 200ms
  • Building search stats, 400ms
  • Building phrases, 50ms
  • Resolve placeholder text, 200ms

More than that, we’ve seen this tooling is fast. I think there may be a place for it alongside AI model training, which I plan to explore in the next part when I take what this produces and try to bootstrap the training of a PyTorch model.

References

[1]
Lucene: A Java library providing powerful indexing and search features. n.d. https://lucene.apache.org/.
[2]
[3]
Unveiling unique patterns: A guide to significant terms aggregation in Elasticsearch. n.d. https://www.elastic.co/search-labs/blog/significant-terms-aggregation-elasticsearch.
[4]