Learning Phrases with PyLucene, part 1.

Author

Peter Tillotson

Published

March 2, 2026

Abstract

Lucene is a library for building search-optimised text indexes. It is an Apache project and provides the core index format underneath Elasticsearch. The following code uses PyLucene, a JNI wrapper around the Java API.

The algorithm derives from the idea that search results will show increased frequencies for the search terms and their associated concepts; phrases should similarly show increased frequencies. Using Lucene makes this fast, though we have to index first. The code also demonstrates an integration between Lucene and pandas for analytics. The technique here could be used to summarise, in aggregate, user- or player-entered text in surveys, reviews etc. that might otherwise be ignored by analytics.

In part 2 we reuse our tokenised index and use PyTorch to build a model for significant term extraction.

Introduction

Apache Lucene [1] is a Java library providing powerful indexing and search features, including vectorised on-disk formats for retrieving term positions and associated statistics such as term frequency and document frequency. In the following I demonstrate the use of Lucene alongside pandas in the development of a significant-terms algorithm.

Lucene

A Lucene index is a set of files on disk that enables fast retrieval of documents based on search terms. Figure 1 provides a very quick primer on the process; it is beyond the scope of this paper to develop a full understanding of it. Note the rich information stored, in particular in Term and Posting.

---
config:
  themeVariables:
    fontSize: 10
---
erDiagram
    direction LR
    Index
    IndexWriter
    Document
    Fields {
        bool stored_raw_text 
        bool indexed
    }
    Analyzer
    Term {
        int doc_freq
        int term_freq
    }
    Posting {
        int position_in_token_stream
        int start_offset
        int end_offset
        bytes payload
    }
    IndexReader
    IndexSearcher
    Query

    Index ||--|| IndexWriter : "writes with" 
    IndexWriter ||--|{ Document : has
    Document ||--|{ Fields : has
    Fields ||--|| Analyzer: uses
    Analyzer ||--|{ Term : tokenizes
    Term ||--|| Posting: has

    Index ||--|| IndexReader : "reads with"
    IndexReader ||--|| IndexSearcher: "searches"
    IndexSearcher ||--|| Query: ""
    Query ||--|{ Term: clause

    Analyzer ||--|| EnglishAnalyzer : impl
    Analyzer ||--|| StemmingAnalyzer : impl
%%    Analyzer ||--|| "..." : "many impl"
Figure 1: A quick primer to the Lucene concepts used in this paper.

Token Streams, Analyzers, Stop Words and Postings

When indexing, text is turned into a token stream. An EnglishAnalyzer:

  • breaks on whitespace, removing punctuation
  • lower-cases, and
  • removes common stop words

Alongside each Term it also stores the position in the token stream and the start / end byte offsets into the original text.

---
config:
  themeVariables:
    fontSize: 10
  flowchart:
    rankSpacing: 15
    nodeSpacing: 25
    padding: 10
---
flowchart LR
    %% This graph shows the Lucene analysis process term by term.
    %% It is laid out top-to-bottom (TD).

    %% Define a class for stop words to style them red.
    classDef stopword fill:#ffdddd,stroke:#ff0000

    %% Subgraph for the bottom row: the postings.
    %% It also flows left-to-right to align with the token stream.
    subgraph "<span style='white-space:nowrap'>The Quick Brown Fox Jumps Over the Lazy Dog.</span>"
        direction LR 
        subgraph n1 ["The"]
            direction LR
            T1("the"):::stopword
            P1["(stop word)<br/>_"];
        end
        subgraph n2 ["Quick"]
            direction LR
            T2("quick")
            P2["Pos: 1<br/>Off: 4-9"];
        end
        subgraph n3 ["Brown"]
            direction LR
            T3("brown")
            P3["Pos: 2<br/>Off: 10-15"];
        end
        subgraph n4 ["Fox"]
            direction LR
            T4("fox")
            P4["Pos: 3<br/>Off: 16-19"];
        end
        subgraph n5 ["Jumps"]
            direction LR
            T5("jumps")
            P5["Pos: 4<br/>Off: 20-25"];
        end
        subgraph n6 ["Over"]
            direction LR
            T6("over"):::stopword
            P6["(stop word)<br/>_"];
        end
        subgraph n7 ["the"]
            direction LR
            T7("the"):::stopword
            P7["(stop word)<br/>_"];
        end
        subgraph n8 ["Lazy"]
            direction LR
            T8("lazy")
            P8["Pos: 7<br/>Off: 35-39"];
        end
        subgraph n9 ["Dog."]
            direction LR
            T9("dog")
            P9["Pos: 8<br/>Off: 40-43"];
        end
        %% Link postings together
        n1 --> n2 --> n3 --> n4 --> n5 --> n6 --> n7 --> n8 --> n9;
    end;
Figure 2: An example token stream from an EnglishAnalyzer.
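
The analysis chain in Figure 2 can be emulated in a few lines of plain Python. This is only a rough sketch: the stop list is a tiny illustrative subset, and the real EnglishAnalyzer also stems tokens (e.g. "jumps" becomes "jump"). Note that stop words still consume a position, which is why "lazy" lands at position 7.

```python
import re

# Tiny stop list for illustration only; Lucene's English stop set is larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "over"}

def sketch_analyze(text):
    """Rough emulation of the chain: tokenize, lower-case, drop stop words,
    keeping the token-stream position and start/end offsets."""
    tokens = []
    position = -1
    for match in re.finditer(r"[A-Za-z]+", text):
        position += 1                 # stop words still consume a position
        term = match.group().lower()
        if term in STOP_WORDS:
            continue                  # dropped from the stream, not stored
        tokens.append({
            "term": term,
            "position": position,
            "offset_start": match.start(),
            "offset_end": match.end(),
        })
    return tokens

stream = sketch_analyze("The Quick Brown Fox Jumps Over the Lazy Dog.")
print([(t["term"], t["position"]) for t in stream])
# → [('quick', 1), ('brown', 2), ('fox', 3), ('jumps', 4), ('lazy', 7), ('dog', 8)]
```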

When I look at [2] I cannot help thinking there are clear parallels, which inspired the choice to do a follow-up part 2. I plan to take the phrases learned here to bootstrap the build of a more general model in PyTorch.

The algorithm

Enough of future-me problems for the time being. The algorithm for extracting phrases used here is as follows.

  1. We build an index of all docs, storing terms and, alongside each term, its position and byte offsets within the text.
  2. We search and retrieve a subset of docs.
  3. We compare the term statistics for the search results with those of the index and find some terms are now more frequent.
  4. We use the adjacencies from the more frequent term positions to build phrases.

Note: It is really fast.
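
The core of step 3 can be illustrated with toy counts (hypothetical numbers, not the IMDB data). A simple frequency ratio ("lift") shows the idea; the notebook below uses a tf/df difference instead.

```python
import pandas as pd

# Hypothetical whole-index term counts vs counts within a result subset
index_counts = pd.Series({"movie": 1000, "plot": 400, "thriller": 60, "pacino": 10})
result_counts = pd.Series({"movie": 50, "plot": 20, "thriller": 40, "pacino": 8})

# Terms whose share of the results exceeds their share of the index
# are candidates for significance
lift = (result_counts / result_counts.sum()) / (index_counts / index_counts.sum())
print(lift.sort_values(ascending=False))
```

Here "pacino" and "thriller" move up sharply, while background terms like "movie" score below 1.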

Setting up the environment and loading IMDB

Code
import textwrap
import plotly

import pandas as pd
import numpy as np

plotly.offline.init_notebook_mode()
plotly.io.renderers.default = 'svg'
pd.options.plotting.backend = "plotly"

Loading the IMDB movie review document set

Code
reviews_df = pd.read_json('data/reviews.json')
reviews_df.columns = ['text', 'sentiment']
print(reviews_df.sentiment.unique())
reviews_df
[0 1]
text sentiment
0 Once again Mr. Costner has dragged out a movie... 0
1 This is an example of why the majority of acti... 0
2 First of all I hate those moronic rappers, who... 0
3 Not even the Beatles could write songs everyon... 0
4 Brass pictures (movies is not a fitting word f... 0
... ... ...
49995 Seeing as the vote average was pretty low, and... 1
49996 The plot had some wretched, unbelievable twist... 1
49997 I am amazed at how this movie(and most others ... 1
49998 A Christmas Together actually came before my t... 1
49999 Working-class romantic drama from director Mar... 1

50000 rows × 2 columns

Building the Lucene index

This is the code that indexes all the documents; it takes around 7s to index 50,000 documents and is potentially a one-off operation. We configure the fields we would like to store and how each is indexed / stored. For this paper we:

  • Only index the sentiment value
  • Store and index the text, using an EnglishAnalyzer and also storing position in the token stream and byte offsets

We haven’t here, but we could have indexed the same text with a range of Analyzers and added each as its own field.

Code
# This requires pylucene, which is a thin wrap over Lucene running 
# in a JVM
import os
import lucene

from pathlib import Path

from java.nio.file import Paths
from org.apache.lucene.analysis.en import EnglishAnalyzer
from org.apache.lucene.document import (
    Document, 
    Field,
    FieldType
)
from org.apache.lucene.index import (
    DirectoryReader,
    IndexOptions, 
    IndexWriter,
    IndexWriterConfig,
    Term
)

from org.apache.lucene.store import NIOFSDirectory
from org.apache.lucene.util import BytesRefIterator


if not os.environ.get('jvm_started', False):
    # Here is the JVM being spun up
    env = lucene.initVM(vmargs=['-Djava.awt.headless=true', '-Xmx256M'])
    os.environ['jvm_started'] = "True"
Code
%%time
index_dir = 'index'

# Define field type
t1 = FieldType()                       # for sentiment 
t1.setStored(True)                     # store the raw value 
t1.setIndexOptions(IndexOptions.DOCS)

t2 = FieldType()                       # for text
t2.setStored(True)                     
t2.setStoreTermVectors(True)           # Needed for quick extract of stats
t2.setStoreTermVectorPositions(True)   # To help with co location 
t2.setStoreTermVectorOffsets(True)     # So I can reference back to stored text
t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)


if not Path(index_dir).exists():
    # build the index if not exist 
    fsDir = NIOFSDirectory(Paths.get(index_dir))
    writerConfig = IndexWriterConfig(EnglishAnalyzer())
    writer = IndexWriter(fsDir, writerConfig)

    def index_review(tpl):
        # Add a document, assign type 
        doc = Document()
        doc.add(Field('sentiment', str(tpl.sentiment), t1))
        doc.add(Field('text', tpl.text, t2))
        writer.addDocument(doc)
        return tpl
    
    try: 
        reviews_df.apply(index_review, axis=1)
    finally:
        writer.commit()
        writer.forceMerge(1, True)
        writer.close()

# open an index reader 
fsDir = NIOFSDirectory(Paths.get(index_dir))
reader = DirectoryReader.open(fsDir)
print(f"{reader.numDocs()} docs found in index")
50000 docs found in index
CPU times: user 6.22 s, sys: 549 ms, total: 6.77 s
Wall time: 6.26 s

If you were building towards production it would be straightforward to abstract the field configuration, and establishing conventions here would be beneficial; Lucene does not enforce that all documents in a given index have the same fields. Maybe always have a raw field that stores the source record, in case you want to reindex differently over time.

Extracting index stats

This works against the index and gets a TermsEnum for the text field we used when indexing documents.

A TermsEnum enables iterating over every Term and provides access to term frequency and document frequency. It is really quick, and these are the only fields needed for the index stats.

Code
%%time
leaves = reader.leaves()

term_stats = []
for leaf_reader in leaves:
    # Building the index term and document stats 
    te = leaf_reader.reader().terms('text').iterator()
    for term in BytesRefIterator.cast_(te):
        # Iterate through all terms
        te.seekExact(term)
        term_stats.append({
            'term': term.utf8ToString(), 
            'doc_freq': te.docFreq(), 
            'term_freq':te.totalTermFreq()
        })

index_term_stats_df = (
    pd.DataFrame(term_stats)
    .set_index('term')
)
index_term_stats_df['tfidf'] = (
    index_term_stats_df.term_freq / index_term_stats_df.doc_freq
)
    
index_term_stats_df = (
    index_term_stats_df
    .sort_values(by=['tfidf'], ascending=[False])
)
index_term_stats_df
CPU times: user 247 ms, sys: 4.11 ms, total: 251 ms
Wall time: 202 ms
doc_freq term_freq tfidf
term
trivialbor 1 26 26.0
stop.oz 1 23 23.0
montero 1 20 20.0
narvo 1 13 13.0
tucso 1 13 13.0
... ... ... ...
ibus 2 2 1.0
ibánez 1 1 1.0
ibéria 1 1 1.0
ica 1 1 1.0
lyrics.i 1 1 1.0

79664 rows × 3 columns

Figure 3 presents the document frequencies for terms in the index. The curve is pretty typical: there is a small set of words we use all the time, even after stop words are removed.

Something to think on is how much we can learn from anything that is too frequent or too rare. For index_term_stats_df above we could probably have skipped loading any Term with {"doc_freq": 1, "term_freq": 1} with very little impact on our ability to extract phrases.
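
That pruning is a one-line filter on the stats frame; a sketch against a toy frame with the same columns (hypothetical terms):

```python
import pandas as pd

# Toy stats frame shaped like index_term_stats_df
stats = pd.DataFrame(
    {"doc_freq": [1, 1, 3, 40], "term_freq": [1, 2, 5, 90]},
    index=pd.Index(["hapax", "rare", "keitel", "movie"], name="term"),
)

# Drop terms seen exactly once in exactly one document: they add little
# to phrase extraction but inflate the stats frame
pruned = stats.loc[~((stats.doc_freq == 1) & (stats.term_freq == 1))]
print(pruned.index.tolist())  # → ['rare', 'keitel', 'movie']
```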

Code
min_doc_frequency = 4

fig = index_term_stats_df.loc[
    index_term_stats_df.doc_freq>min_doc_frequency
].doc_freq.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()
Figure 3
Code
# Shorter list of most common terms
#| label: fig-df-term-freq
#| caption: Document frequency for more common terms in the index.

min_doc_frequency = 3000

fig = index_term_stats_df.loc[
    index_term_stats_df.doc_freq>min_doc_frequency
].doc_freq.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()

Searching the index and getting search stats

When we search we use a query; I’ve used a BooleanQuery here, which lets me add clauses.

We search for the top 500 documents that match the query, then use the TermVector aligned on each search hit’s doc_id and extract a DataFrame in a similar way to the above. This time we also extract the positions and byte offsets, which are then used in phrase building.

The postings data can occur more than once per doc_id, e.g. a Term can appear multiple times in the text.

Code
from org.apache.lucene.search import (
    IndexSearcher,
    TermQuery, 
    BooleanQuery,
    BooleanClause
)
Code
%%time
searcher = IndexSearcher(reader)

query = (
    BooleanQuery.Builder()
    .add(TermQuery(Term('text','thriller')), BooleanClause.Occur.MUST)
    #
    #.add(TermQuery(Term('sentiment',"0")), BooleanClause.Occur.MUST)
    #.add(TermQuery(Term('sentiment',"1")), BooleanClause.Occur.MUST)
).build()
hits = searcher.search(query, 500)

term_stats = []
for hit in hits.scoreDocs:
    # This is really quick, we are pulling search term doc and term frequencies 
    # along with their positions directly from the index. 
    vector = reader.termVectors().get(hit.doc)
    te = vector.terms('text').iterator()
    for term in BytesRefIterator.cast_(te):
        te.seekExact(term)
        postings = te.postings(None)
        postings.nextDoc()
        freq = postings.freq()          
            
        term_stats.append({
            'term': term.utf8ToString(), 
            'doc_freq': te.docFreq(), 
            'term_freq':te.totalTermFreq(),
            'doc_id':hit.doc,
            'postings': [
                {
                    'position': postings.nextPosition(),
                    'offset_start': postings.startOffset(),
                    'offset_end': postings.endOffset()
                } 
                for idx in range(0,freq)
            ], 
            # 'positions': [postings.nextPosition() for idx in range(0,freq)]
            
        })

doc_term_stats_df = (
    pd.DataFrame(term_stats)
    .set_index('term')
    .sort_index()
)

doc_term_stats_df['positions'] = doc_term_stats_df.postings.apply(lambda x: [p['position'] for p in x])
doc_term_stats_df
CPU times: user 789 ms, sys: 19.2 ms, total: 808 ms
Wall time: 280 ms
doc_freq term_freq doc_id postings positions
term
0 1 1 18663 [{'position': 86, 'offset_start': 511, 'offset... [86]
00.01 1 1 48714 [{'position': 496, 'offset_start': 2754, 'offs... [496]
02 1 1 38283 [{'position': 115, 'offset_start': 613, 'offse... [115]
06 1 1 38606 [{'position': 95, 'offset_start': 526, 'offset... [95]
1 1 1 47961 [{'position': 77, 'offset_start': 441, 'offset... [77]
... ... ... ... ... ...
zone 1 1 33334 [{'position': 66, 'offset_start': 365, 'offset... [66]
zudina 1 1 33498 [{'position': 333, 'offset_start': 1857, 'offs... [333]
zuniga 1 1 42945 [{'position': 124, 'offset_start': 682, 'offse... [124]
zuniga 1 1 42941 [{'position': 80, 'offset_start': 458, 'offset... [80]
über 1 1 36477 [{'position': 125, 'offset_start': 760, 'offse... [125]

42955 rows × 5 columns

Now we have to aggregate the search stats in a similar way to what we did for index stats.

Code
%%time

search_term_stats_df = (
    doc_term_stats_df[['doc_freq', 'term_freq']]
    .groupby('term')
    .sum()
)
search_term_stats_df['tfidf'] = search_term_stats_df.term_freq/search_term_stats_df.doc_freq

search_term_stats_df = (
    search_term_stats_df
    .sort_values(by=['tfidf'], ascending=[False])
)
search_term_stats_df
CPU times: user 4.93 ms, sys: 827 µs, total: 5.76 ms
Wall time: 5.8 ms
doc_freq term_freq tfidf
term
anand 2 15 7.5
fulci 1 7 7.0
zandale 1 7 7.0
anatomi 1 6 6.0
winfield 1 6 6.0
... ... ... ...
gene 2 2 1.0
gender 1 1 1.0
gem 6 6 1.0
gellar 1 1 1.0
über 1 1 1.0

6828 rows × 3 columns

Identifying significant terms in the search results

The search yields result documents in which some terms are now more likely to occur than they were in the index as a whole. In finding the significant terms we’re interested in those that move the furthest.

I put some constraints here: I am probably only interested in Terms that appear in more than one doc, and in the most significant ones.

Code
%%time
full_df = (
    index_term_stats_df
    .join(
        search_term_stats_df, 
        lsuffix='_idx', 
        rsuffix='_src',
        how='inner'
    )
)
full_df = full_df.loc[full_df.doc_freq_src>1]
full_df['tfidf_diff'] = (
    full_df.tfidf_src      # Search tf/df 
    -                      # minus, terms that are no different = 0
    full_df.tfidf_idx      # Index tf/df    
)
full_df.sort_values(by=['tfidf_diff'], ascending=[False], inplace=True)

full_df.loc[full_df.doc_freq_src>1].head(10)
CPU times: user 3.07 ms, sys: 567 µs, total: 3.64 ms
Wall time: 3.16 ms
doc_freq_idx term_freq_idx tfidf_idx doc_freq_src term_freq_src tfidf_src tfidf_diff
term
anand 22 59 2.681818 2 15 7.500000 4.818182
depalma 25 42 1.680000 3 15 5.000000 3.320000
pierc 169 198 1.171598 2 8 4.000000 2.828402
altman 114 229 2.008772 4 19 4.750000 2.741228
pacino 179 313 1.748603 4 17 4.250000 2.501397
dev 17 66 3.882353 2 12 6.000000 2.117647
mute 212 245 1.155660 5 15 3.000000 1.844340
keitel 50 64 1.280000 2 6 3.000000 1.720000
cage 316 503 1.591772 7 23 3.285714 1.693942
dahl 26 43 1.653846 3 10 3.333333 1.679487
Code
fig = full_df.loc[full_df.doc_freq_src>1].tfidf_diff.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()
Figure 4

Figure 4 presents the difference in frequencies of search terms when compared to the index. Terms that see little difference, i.e. are equally common in both, will score close to zero.

Building phrases using co-location

The intuition here is that in search results the terms of a phrase, like a name, are likely to move in significance together. Search for “Western” movies and we might expect “John” and “Wayne” to individually move in significance, but by looking at their frequent positions we can link them.

The following algorithm does this; it adds the concept of slop, allowing for terms that are near but not directly adjacent.
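
The grouping idea behind slop can be sketched in plain Python (the notebook below uses a vectorised pandas version; this toy helper is hypothetical):

```python
def group_positions(positions, max_slop=2):
    """Group sorted term positions within one document into phrase runs:
    consecutive positions at most max_slop apart belong to the same run."""
    runs, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] <= max_slop:
            current.append(pos)        # close enough: same phrase
        else:
            runs.append(current)       # gap too big: start a new run
            current = [pos]
    runs.append(current)
    return [r for r in runs if len(r) > 1]  # single terms are not phrases

# "john"@10, "wayne"@11 are adjacent; 90 and 93 are too far apart for slop 2
print(group_positions([10, 11, 58, 60, 90, 93]))
# → [[10, 11], [58, 60]]
```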

Code
%%time
max_slop=2  # Distance between words in phrase

phrase_df  =(
    full_df.loc[
        ( full_df.doc_freq_src>1 )  # In more than 1 doc
        & ( full_df.tfidf_diff>0 )  # Has moved in significance
    ]
    .head(200)
    .join(
        doc_term_stats_df[['doc_id', 'positions']],
        how='left'
    )
    .sort_values(by=['tfidf_diff'], ascending=[False])
    [['doc_id', 'positions']]
    .explode('positions')
    .reset_index()
    .set_index('doc_id')
    .sort_values(by=['doc_id', 'positions'])
)
count_df = phrase_df.groupby('doc_id')['positions'].count()
count_df.name = "nos_row"
count_df = count_df.loc[lambda x: x>1] 
phrase_df = count_df.to_frame().join(phrase_df).drop('nos_row', axis=1)

phrase_df['pos_diff'] = phrase_df.positions.groupby('doc_id').diff()
phrase_df['pos_diff'] = np.where(
    phrase_df.pos_diff<=max_slop, phrase_df.pos_diff,
    np.where(
        (phrase_df.pos_diff>max_slop) & (phrase_df.pos_diff.shift(-1)<=max_slop), 0,
    np.where(
        np.abs(phrase_df.positions.shift(-1)-phrase_df.positions)<=max_slop, 0,-1))
)

phrase_df.head(8)
CPU times: user 7.08 ms, sys: 1.5 ms, total: 8.58 ms
Wall time: 8.12 ms
term positions pos_diff
doc_id
11 thriller 58 -1
11 thriller 64 -1
11 novel 76 -1
11 thriller 189 -1
114 thriller 116 -1
114 thriller 283 -1
114 member 319 -1
185 t 10 -1
Code

%%time
def build_phrase(df):
    out = []
    phrase = []
    positions = []
    last=0
    for doc_id,row in df.iterrows():
        idx = row.pos_diff
        if idx == 0:
            if len(phrase)>0:
                out.append(
                    pd.Series(
                        {'phrase': ' '.join(phrase), 'positions': positions},
                    )
                )
                phrase = []
                positions = []
                last=0
        diff = idx-last
        if diff > 1:
            for i in range(diff-1):
                phrase.append("???") # if slop >1 I might miss a term, so use placeholder
                positions.append(positions[-1]+1)
        phrase.append(row.term)
        positions.append(row.positions)
        last=idx
    
    if len(phrase)>0:
        out.append(
            pd.Series(
                {'phrase': ' '.join(phrase), 'positions': positions},
            )
        )
    return  pd.DataFrame(out).set_index('phrase')

all_phrase_df = (
    # inline test
    phrase_df.loc[
        (phrase_df.pos_diff>=0) 
    ]
    .groupby('doc_id')
    .apply(build_phrase)
    .sort_values(by='phrase')
    .reset_index()
)

(
    all_phrase_df
   .loc[lambda x: x.phrase.str.len()<64] # To fix pdf table render
   .set_index('phrase')
).head(8)
CPU times: user 39 ms, sys: 1.4 ms, total: 40.4 ms
Wall time: 40 ms
doc_id positions
phrase
al pacino 40503 [90, 91]
al pacino 40503 [71, 72]
al pacino 40491 [172, 173]
al pacino 40491 [715, 716]
al pacino 40491 [260, 261]
al pacino 40491 [599, 600]
al pacino cusack 40504 [27, 28, 30]
associ ??? rock 25176 [109, 110, 111]
Code
%%time
significant_phrases_df = (
    all_phrase_df
    .drop(["positions"], axis=1)
    .groupby('phrase')
    .agg({
        "doc_id": ['nunique', lambda x: np.unique(x).tolist()]
    })
    .droplevel(0, axis=1)
    .rename({"nunique":"nos_docs"}, axis=1)
    .rename({"<lambda_0>":"doc_ids"}, axis=1)
    .loc[lambda x: x.nos_docs>1]
    .sort_values(by=['nos_docs'], ascending=[False])
).head(50)

significant_phrases_df
CPU times: user 2.88 ms, sys: 435 µs, total: 3.32 ms
Wall time: 3.15 ms
nos_docs doc_ids
phrase
music video 16 [4172, 10713, 35326, 36748, 36751, 36753, 3675...
michael dougla 8 [16814, 16815, 16819, 35759, 35762, 35764, 484...
polit thriller 8 [6796, 16815, 16819, 26213, 32420, 40500, 4050...
michael ??? thriller 5 [36750, 36756, 36766, 36768, 36769]
robert altman 4 [6322, 43012, 43013, 46909]
hitchcockian thriller 4 [5463, 7445, 43174, 48620]
red rock 3 [49316, 49318, 49325]
pacino ??? cusack 3 [40491, 40500, 40504]
mute wit 3 [33299, 33498, 33521]
lara ??? boyl 3 [49316, 49318, 49325]
denzel washington 3 [19086, 47750, 47758]
conspiraci thriller 3 [6796, 9855, 26211]
song thriller 3 [4172, 36756, 36761]
servic ??? pete 2 [48428, 48429]
thriller ??? thriller 2 [7445, 44486]
southern gothic 2 [43012, 44469]
vijai anand 2 [36164, 36176]
al pacino 2 [40491, 40503]
robert mitchum 2 [5463, 43012]
citi hall 2 [40491, 40503]
music video thriller 2 [36761, 36773]
isn t 2 [22713, 23335]
zane ??? brook attend 2 [5984, 8161]

Fixing placeholder, unknown values

To do this we can use the offsets and the original document. We have all the byte offsets and the stored original text, so we can simply look a term up, even if it had been stripped out as a stop word.
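
As a minimal sketch (hypothetical text and offsets; the real code below pulls the text from the stored field and the offsets from the postings):

```python
# Slicing the stored original text between the phrase's minimum start offset
# and maximum end offset recovers the surface form, stop words included.
text = "Al Pacino and John Cusack star in City Hall."
offset_start, offset_end = 3, 25  # min start / max end across the phrase terms
print(text[offset_start:offset_end])  # → Pacino and John Cusack
```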

Code
%%time
place_df = (
    # significant_phrases_df
    significant_phrases_df.loc[
        significant_phrases_df.index.str.contains("???", regex=False)
    ]
    .drop('doc_ids', axis=1)
    .join(
        all_phrase_df.set_index('phrase')
        , how='inner'
    )
    
)
place_df['terms']= place_df.index.str.split()

def do_zip(tpl):
    tpl['zip'] = list(zip(tpl['terms'], tpl['positions']))
    return tpl

def de_tuple(tpl):
    term, position = tpl['zip']
    tpl['term'] = term
    tpl['position'] = position
    return tpl

def extract(tpl):
    d = tpl['postings']
    flag = pd.notna(d)
    tpl['position_r']=  d['position'] if flag else -1
    tpl['offset_start']= d['offset_start'] if flag else -1
    tpl['offset_end']= d['offset_end'] if flag else -1
    return tpl

place_df = (
    place_df
    .apply(do_zip, axis=1)
    .drop(['positions', 'terms'], axis=1)
    .reset_index()
    .reset_index()
    .rename({'index':'pid'}, axis=1) # to handle repeated phrase in a doc_id
    .explode('zip')
    .apply(de_tuple, axis=1)
    .drop('zip', axis=1)
    .reset_index()
    .set_index(['doc_id', 'term'])
    .join(
        (
            doc_term_stats_df
            .reset_index()
            .set_index(['doc_id', 'term'])
            .drop('positions', axis=1)
        )
        , how='left'
    )
    .explode('postings')
    .apply(extract, axis=1)
    .drop('postings', axis=1)
    .loc[lambda x: x.position==x.position_r]
    .reset_index()
    .groupby(['pid', 'phrase', 'doc_id'])
    .agg({
        'offset_start': min,
        'offset_end': max
    })
    .reset_index('doc_id')
)

place_df
CPU times: user 181 ms, sys: 6.76 ms, total: 188 ms
Wall time: 187 ms
doc_id offset_start offset_end
pid phrase
0 michael ??? thriller 36768 0 26
1 michael ??? thriller 36769 50 76
2 michael ??? thriller 36766 855 882
3 michael ??? thriller 36750 404 430
4 michael ??? thriller 36756 458 485
5 michael ??? thriller 36766 65 92
6 pacino ??? cusack 40500 132 149
7 pacino ??? cusack 40504 918 935
8 pacino ??? cusack 40491 2033 2052
9 lara ??? boyl 49316 1739 1755
10 lara ??? boyl 49318 505 523
11 lara ??? boyl 49325 1058 1074
12 lara ??? boyl 49318 984 1000
13 servic ??? pete 48429 440 458
14 servic ??? pete 48428 845 863
15 thriller ??? thriller 7445 26 50
16 thriller ??? thriller 44486 578 609
17 zane ??? brook attend 5984 130 153
18 zane ??? brook attend 8161 130 153
Code
%%time
for d in place_df.itertuples():
    index, doc_id, offset_start, offset_end = d
    stored = reader.storedFields()
    doc = stored.document(doc_id)
    text = doc.get('text')
    
    print(f"Doc {doc_id}: {text[offset_start:offset_end]}")
Doc 36768: Michael Jackson's Thriller
Doc 36769: Michael Jackson's THRILLER
Doc 36766: Michael Jackson - "Thriller
Doc 36750: Michael Jackson's Thriller
Doc 36756: Michael Jackson's 'Thriller
Doc 36766: Michael Jackson's "Thriller
Doc 40500: Pacino and Cusack
Doc 40504: Pacino and Cusack
Doc 40491: Pacino. John Cusack
Doc 49316: Lara Flynn Boyle
Doc 49318: Lara Flynn Boyle's
Doc 49325: Lara Flynn Boyle
Doc 49318: Lara Flynn Boyle
Doc 48429: Service Agent Pete
Doc 48428: Service agent Pete
Doc 7445: Thriller". But thrillers
Doc 44486: thrillers, especially thrillers
Doc 5984: Zane and Brook attended
Doc 8161: Zane and Brook attended
CPU times: user 2.99 ms, sys: 1.03 ms, total: 4.02 ms
Wall time: 6.19 ms

Conclusion

As analysts and data people we have user- or player-entered text in lots of places: surveys, reviews, social profiles etc. We often ignore it, but we don’t have to. The technique here demonstrates the use of Lucene directly, but Elasticsearch has significant-terms tools [3,4]. I go a step further and build phrases from the observation that the terms in a phrase are co-located. I’ve demonstrated that this is possible even when stop words are dropped on the floor, and provided a method for recovering the original sense and context.

It is fast; approximate timings:

  • Indexing 50,000 short documents, 7s
  • Building index stats, 200ms
  • Building search stats, 400ms
  • Building phrases, 50ms
  • Resolve placeholder text, 200ms

More than that, we’ve seen this tooling is fast. I think there may be a place for it alongside AI model training, which I plan to explore in the next part when I take what this produces and try to bootstrap the training of a PyTorch model.

References

[1]
Lucene: A Java library providing powerful indexing and search features. n.d. https://lucene.apache.org/.
[2]
[3]
Unveiling unique patterns: A guide to significant terms aggregation in Elasticsearch. n.d. https://www.elastic.co/search-labs/blog/significant-terms-aggregation-elasticsearch.
[4]