---
config:
themeVariables:
fontSize: 10
---
erDiagram
direction LR
Index
IndexWriter
Document
Fields {
bool stored_raw_text
bool indexed
}
Analyzer
Term {
int doc_freq
int term_freq
}
Posting {
int position_in_token_stream
int start_offset
int end_offset
bytes payload
}
IndexReader
IndexSearcher
Query
Index ||--|| IndexWriter : "writes with"
IndexWriter ||--|{ Document : has
Document ||--|{ Fields : has
Fields ||--|| Analyzer: uses
Analyzer ||--|{ Term : tokenizes
Term ||--|| Posting: has
Index ||--|| IndexReader : "reads with"
IndexReader ||--|| IndexSearcher: "searches"
IndexSearcher ||--|| Query: ""
Query ||--|{ Term: clause
Analyzer ||--|| EnglishAnalyzer : impl
Analyzer ||--|| StemmingAnalyzer : impl
%% Analyzer ||--|| "..." : "many impl"
Learning Phrases with PyLucene, part 1.
Lucene is a library used to build search-optimised text indexes. It is an Apache project and is the core index format sitting under Elasticsearch. The following code uses PyLucene, which is a JNI wrapper around the Java API.
The algorithm derives from the idea that, within a set of search results, the search terms and their associated concepts will occur with increased frequency. Phrases should similarly show increased frequencies. By using Lucene it is fast, though we have to index first. The code also demonstrates an integration between Lucene and Pandas for analytics. The technique here could be used to summarise, in aggregate, user / player entered text in surveys, reviews etc. that might otherwise get ignored by analytics.
In part 2 we reuse our tokenised index and use PyTorch to build a model for significant term extraction.
Introduction
Apache Lucene [1] is a Java library providing powerful indexing and search features. This includes vectorised on-disk formats for retrieving term positions and associated statistics such as term frequency and document frequency. In the following I demonstrate the use of Lucene alongside Python Pandas in the development of a significant terms algorithm.
Lucene
A Lucene index is a set of files on disk that enables the fast retrieval of documents based on search terms. Figure 1 provides a very quick primer on the process; it is beyond the scope of this paper to develop a full understanding of it. Note the rich information stored in particular in Term and Posting.
Token Streams, Analyzers, Stop Words and Postings
When indexing, text is turned into a token stream using an EnglishAnalyzer:
- it breaks on whitespace, removing punctuation
- it lower cases
- it removes common stop words, and
- it stems each term (EnglishAnalyzer includes a Porter stemmer, which is why stemmed terms like pierc and conspiraci appear in the tables below)
Alongside each Term it also stores the position in the token stream and the start / end byte offsets into the original text.
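The analysis chain can be sketched in plain Python. This is a toy stand-in, not Lucene's implementation: the stop word set below is illustrative and stemming is omitted. It does, however, show the one behaviour the rest of this paper relies on: a dropped stop word still consumes a position, leaving a gap in the token stream.

```python
import re

# Illustrative stop set; Lucene's EnglishAnalyzer ships its own list.
STOP_WORDS = {"the", "a", "an", "and", "over", "of", "to", "in"}

def analyze(text):
    """Toy analyzer: split on runs of letters, lowercase, drop stop
    words. The position counter advances for every token, so removed
    stop words leave gaps, as Lucene's StopFilter does."""
    postings = []
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        term = match.group().lower()
        if term in STOP_WORDS:
            continue  # dropped, but its position was already consumed
        postings.append((term, position, match.start(), match.end()))
    return postings

for term, pos, start, end in analyze("The Quick Brown Fox Jumps Over the Lazy Dog."):
    print(f"{term:6} pos={pos} off={start}-{end}")
```

Note how lazy and dog land at positions 7 and 8, not 5 and 6: the removed "over the" still occupies positions 5 and 6, and that gap is exactly what the slop handling later exploits.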
---
config:
themeVariables:
fontSize: 10
flowchart:
rankSpacing: 15
nodeSpacing: 25
padding: 10
---
flowchart LR
%% This graph shows the Lucene analysis process term by term.
%% It is laid out left-to-right (LR).
%% Define a class for stop words to style them red.
classDef stopword fill:#ffdddd,stroke:#ff0000
%% Subgraph for the bottom row: the postings.
%% It also flows left-to-right to align with the token stream.
subgraph "<span style='white-space:nowrap'>The Quick Brown Fox Jumps Over the Lazy Dog.</span>"
direction LR
subgraph n1 ["The"]
direction LR
T1("the"):::stopword
P1["(stop word)<br/>_"];
end
subgraph n2 ["Quick"]
direction LR
T2("quick")
P2["Pos: 1<br/>Off: 4-9"];
end
subgraph n3 ["Brown"]
direction LR
T3("brown")
P3["Pos: 2<br/>Off: 10-15"];
end
subgraph n4 ["Fox"]
direction LR
T4("fox")
P4["Pos: 3<br/>Off: 16-19"];
end
subgraph n5 ["Jumps"]
direction LR
T5("jumps")
P5["Pos: 4<br/>Off: 20-25"];
end
subgraph n6 ["Over"]
direction LR
T6("over"):::stopword
P6["(stop word)<br/>_"];
end
subgraph n7 ["the"]
direction LR
T7("the"):::stopword
P7["(stop word)<br/>_"];
end
subgraph n8 ["Lazy"]
direction LR
T8("lazy")
P8["Pos: 7<br/>Off: 35-39"];
end
subgraph n9 ["Dog."]
direction LR
T9("dog")
P9["Pos: 8<br/>Off: 40-43"];
end
%% Link postings together
n1 --> n2 --> n3 --> n4 --> n5 --> n6 --> n7 --> n8 --> n9;
end;
When I look at [2] I cannot help thinking there are clear parallels, and that inspired the choice to do a follow-up part 2. I plan to take the phrases I learn here to bootstrap the build of a more general model in PyTorch.
The algorithm
Enough of future-me problems for the time being. The algorithm for extracting phrases that I am using here is as follows.
- We build an index of all docs, storing each term alongside its position and byte offsets within the text.
- We search and retrieve a subset of docs.
- We compare the term statistics for the search results with those of the whole index and find that some terms are now more frequent.
- We use the adjacencies in the positions of these more frequent terms to build phrases.
Note: It is really fast.
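The comparison step can be sketched in a few lines of plain Python. Note that the ratio this paper calls tfidf is really term_freq / doc_freq, the average occurrences per matching document. The anand counts below are taken from the result tables later in the paper; the thriller counts are made up for illustration.

```python
def significance(index_stats, search_stats):
    """Score each term seen in the search results: its
    term_freq/doc_freq ratio within the results minus the same
    ratio over the whole index. Terms equally common in both
    score near zero; terms concentrated in the results score high."""
    scores = {
        term: tf_s / df_s - index_stats[term][0] / index_stats[term][1]
        for term, (tf_s, df_s) in search_stats.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# (term_freq, doc_freq) pairs; anand mirrors the tables below,
# thriller is a hypothetical common term for contrast.
index_stats = {"anand": (59, 22), "thriller": (5000, 3500)}
search_stats = {"anand": (15, 2), "thriller": (600, 500)}
print(significance(index_stats, search_stats))
```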
Setting up the environment and loading IMDB
Code
import textwrap
import plotly
import pandas as pd
import numpy as np
plotly.offline.init_notebook_mode()
plotly.io.renderers.default = 'svg'
pd.options.plotting.backend = "plotly"
Loading the IMDB movie review document set
Code
reviews_df = pd.read_json('data/reviews.json')
reviews_df.columns = ['text', 'sentiment']
print(reviews_df.sentiment.unique())
reviews_df
[0 1]
| | text | sentiment |
|---|---|---|
| 0 | Once again Mr. Costner has dragged out a movie... | 0 |
| 1 | This is an example of why the majority of acti... | 0 |
| 2 | First of all I hate those moronic rappers, who... | 0 |
| 3 | Not even the Beatles could write songs everyon... | 0 |
| 4 | Brass pictures (movies is not a fitting word f... | 0 |
| ... | ... | ... |
| 49995 | Seeing as the vote average was pretty low, and... | 1 |
| 49996 | The plot had some wretched, unbelievable twist... | 1 |
| 49997 | I am amazed at how this movie(and most others ... | 1 |
| 49998 | A Christmas Together actually came before my t... | 1 |
| 49999 | Working-class romantic drama from director Mar... | 1 |
50000 rows × 2 columns
Building the Lucene index
This is the code that indexes all the documents. It takes around 7s to index 50,000 documents and is potentially a one-off operation. We configure the fields we would like to store and how each is indexed / stored respectively. For this paper we:
- only index the sentiment value
- store and index the text, using an EnglishAnalyzer and also storing each term's position in the token stream and its byte offsets
We haven't, but could have, indexed the same text with a range of Analyzers, adding each as its own field.
Code
# This requires pylucene, which is a thin wrap over Lucene running
# in a JVM
import os
import lucene
from pathlib import Path
from java.nio.file import Paths
from org.apache.lucene.analysis.en import EnglishAnalyzer
from org.apache.lucene.document import (
Document,
Field,
FieldType
)
from org.apache.lucene.index import (
DirectoryReader,
IndexOptions,
IndexWriter,
IndexWriterConfig,
Term
)
from org.apache.lucene.store import NIOFSDirectory
from org.apache.lucene.util import BytesRefIterator
if not os.environ.get('jvm_started', False):
# Here is the JVM being spun up
env = lucene.initVM(vmargs=['-Djava.awt.headless=true', '-Xmx256M'])
os.environ['jvm_started'] = "True"
Code
%%time
index_dir = 'index'
# Define field type
t1 = FieldType() # for sentiment
t1.setStored(True) # store full text
t1.setIndexOptions(IndexOptions.DOCS)
t2 = FieldType() # for text
t2.setStored(True)
t2.setStoreTermVectors(True) # Needed for quick extract of stats
t2.setStoreTermVectorPositions(True) # To help with co location
t2.setStoreTermVectorOffsets(True) # So I can reference back to stored text
t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)
if not Path(index_dir).exists():
# build the index if not exist
fsDir = NIOFSDirectory(Paths.get(index_dir))
writerConfig = IndexWriterConfig(EnglishAnalyzer())
writer = IndexWriter(fsDir, writerConfig)
def index_review(tpl):
# Add a document, assign type
doc = Document()
doc.add(Field('sentiment', str(tpl.sentiment), t1))
doc.add(Field('text', tpl.text, t2))
writer.addDocument(doc)
return tpl
try:
reviews_df.apply(index_review, axis=1)
finally:
writer.commit()
writer.forceMerge(1, True)
writer.close()
# open an index reader
fsDir = NIOFSDirectory(Paths.get(index_dir))
reader = DirectoryReader.open(fsDir)
print(f"{reader.numDocs()} docs found in index")
50000 docs found in index
CPU times: user 6.22 s, sys: 549 ms, total: 6.77 s
Wall time: 6.26 s
If you were building towards production it would be straightforward to abstract the configuration for fields. Lucene does not enforce that all documents in a given index have the same fields, so building conventions in here would be beneficial. Maybe always have a raw field that stores the source record in case you reindex differently over time.
Extracting index stats
This works with the index and gets a TermsEnum for the text field we used when indexing documents.
A TermsEnum enables iterating over every Term and provides access to term frequency and document frequency. It is really quick, and these are the only fields needed for the index stats.
Code
%%time
leaves = reader.leaves()
term_stats = []
for leaf_reader in leaves:
# Building the index term and document stats
te = leaf_reader.reader().terms('text').iterator()
for term in BytesRefIterator.cast_(te):
# Iterate through all terms
te.seekExact(term)
term_stats.append({
'term': term.utf8ToString(),
'doc_freq': te.docFreq(),
'term_freq':te.totalTermFreq()
})
index_term_stats_df = (
pd.DataFrame(term_stats)
.set_index('term')
)
index_term_stats_df['tfidf'] = (
index_term_stats_df.term_freq / index_term_stats_df.doc_freq
)
index_term_stats_df = (
index_term_stats_df
.sort_values(by=['tfidf'], ascending=[False])
)
index_term_stats_df
CPU times: user 247 ms, sys: 4.11 ms, total: 251 ms
Wall time: 202 ms
| | doc_freq | term_freq | tfidf |
|---|---|---|---|
| term | |||
| trivialbor | 1 | 26 | 26.0 |
| stop.oz | 1 | 23 | 23.0 |
| montero | 1 | 20 | 20.0 |
| narvo | 1 | 13 | 13.0 |
| tucso | 1 | 13 | 13.0 |
| ... | ... | ... | ... |
| ibus | 2 | 2 | 1.0 |
| ibánez | 1 | 1 | 1.0 |
| ibéria | 1 | 1 | 1.0 |
| ica | 1 | 1 | 1.0 |
| lyrics.i | 1 | 1 | 1.0 |
79664 rows × 3 columns
Figure 3 presents the document frequencies for terms in the index. The curve is pretty typical: there is a small set of words we use all the time, even after stop words are removed.
Something to think on is how much we can learn from anything that is too frequent or too rare. For index_term_stats_df above we could probably have skipped loading any Term with {"doc_freq": 1, "term_freq": 1} with very little impact on our ability to extract phrases.
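That filter could sit directly in the stats-collection loop. A sketch using a few rows lifted from the table above:

```python
# Rows as produced by the stats loop; counts match the table above.
term_stats = [
    {"term": "trivialbor", "doc_freq": 1, "term_freq": 26},  # rare but repeated
    {"term": "ibánez", "doc_freq": 1, "term_freq": 1},       # hapax, skippable
    {"term": "lyrics.i", "doc_freq": 1, "term_freq": 1},     # hapax, skippable
]

# Drop terms that occur exactly once in exactly one document.
kept = [s for s in term_stats
        if not (s["doc_freq"] == 1 and s["term_freq"] == 1)]
print([s["term"] for s in kept])
```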
Code
min_doc_frequency = 4
fig = index_term_stats_df.loc[
index_term_stats_df.doc_freq>min_doc_frequency
].doc_freq.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()
Code
# Shorter list of most common terms
#| label: fig-df-term-freq
#| caption: Document frequency for more common terms in the index.
min_doc_frequency = 3000
fig = index_term_stats_df.loc[
index_term_stats_df.doc_freq>min_doc_frequency
].doc_freq.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()
Searching the index and getting search stats
When we search we use a query; I've used a BooleanQuery here, which lets me add clauses.
We search for the top 500 documents that match the query, then use the TermVector aligned on each search hit's doc_id and extract a DataFrame in a similar way to the index stats above. This time we have also extracted the positions and byte offsets, which are then used in phrase building.
The postings data can occur more than once per doc_id, e.g. the Term "you" can appear multiple times in the text.
Code
from org.apache.lucene.search import (
IndexSearcher,
TermQuery,
BooleanQuery,
BooleanClause
)
Code
%%time
searcher = IndexSearcher(reader)
query = (
BooleanQuery.Builder()
.add(TermQuery(Term('text','thriller')), BooleanClause.Occur.MUST)
#
#.add(TermQuery(Term('sentiment',"0")), BooleanClause.Occur.MUST)
#.add(TermQuery(Term('sentiment',"1")), BooleanClause.Occur.MUST)
).build()
hits = searcher.search(query, 500)
term_stats = []
for hit in hits.scoreDocs:
# This is really quick, we are pulling search term doc and term frequencies
# along with their positions directly from the index.
vector = reader.termVectors().get(hit.doc)
te = vector.terms('text').iterator()
for term in BytesRefIterator.cast_(te):
te.seekExact(term)
postings = te.postings(None)
postings.nextDoc()
freq = postings.freq()
term_stats.append({
'term': term.utf8ToString(),
'doc_freq': te.docFreq(),
'term_freq':te.totalTermFreq(),
'doc_id':hit.doc,
'postings': [
{
'position': postings.nextPosition(),
'offset_start': postings.startOffset(),
'offset_end': postings.endOffset()
}
for idx in range(0,freq)
],
# 'positions': [postings.nextPosition() for idx in range(0,freq)]
})
doc_term_stats_df = (
pd.DataFrame(term_stats)
.set_index('term')
.sort_index()
)
doc_term_stats_df['positions'] = doc_term_stats_df.postings.apply(lambda x: [p['position'] for p in x])
doc_term_stats_df
CPU times: user 789 ms, sys: 19.2 ms, total: 808 ms
Wall time: 280 ms
| | doc_freq | term_freq | doc_id | postings | positions |
|---|---|---|---|---|---|
| term | |||||
| 0 | 1 | 1 | 18663 | [{'position': 86, 'offset_start': 511, 'offset... | [86] |
| 00.01 | 1 | 1 | 48714 | [{'position': 496, 'offset_start': 2754, 'offs... | [496] |
| 02 | 1 | 1 | 38283 | [{'position': 115, 'offset_start': 613, 'offse... | [115] |
| 06 | 1 | 1 | 38606 | [{'position': 95, 'offset_start': 526, 'offset... | [95] |
| 1 | 1 | 1 | 47961 | [{'position': 77, 'offset_start': 441, 'offset... | [77] |
| ... | ... | ... | ... | ... | ... |
| zone | 1 | 1 | 33334 | [{'position': 66, 'offset_start': 365, 'offset... | [66] |
| zudina | 1 | 1 | 33498 | [{'position': 333, 'offset_start': 1857, 'offs... | [333] |
| zuniga | 1 | 1 | 42945 | [{'position': 124, 'offset_start': 682, 'offse... | [124] |
| zuniga | 1 | 1 | 42941 | [{'position': 80, 'offset_start': 458, 'offset... | [80] |
| über | 1 | 1 | 36477 | [{'position': 125, 'offset_start': 760, 'offse... | [125] |
42955 rows × 5 columns
Now we have to aggregate the search stats in a similar way to what we did for index stats.
Code
%%time
search_term_stats_df = (
doc_term_stats_df[['doc_freq', 'term_freq']]
.groupby('term')
.sum()
)
search_term_stats_df['tfidf'] = search_term_stats_df.term_freq/search_term_stats_df.doc_freq
search_term_stats_df = (
search_term_stats_df
.sort_values(by=['tfidf'], ascending=[False])
)
search_term_stats_df
CPU times: user 4.93 ms, sys: 827 µs, total: 5.76 ms
Wall time: 5.8 ms
| | doc_freq | term_freq | tfidf |
|---|---|---|---|
| term | |||
| anand | 2 | 15 | 7.5 |
| fulci | 1 | 7 | 7.0 |
| zandale | 1 | 7 | 7.0 |
| anatomi | 1 | 6 | 6.0 |
| winfield | 1 | 6 | 6.0 |
| ... | ... | ... | ... |
| gene | 2 | 2 | 1.0 |
| gender | 1 | 1 | 1.0 |
| gem | 6 | 6 | 1.0 |
| gellar | 1 | 1 | 1.0 |
| über | 1 | 1 | 1.0 |
6828 rows × 3 columns
Identifying search results terms that are significant
The search yields result documents, and some terms in those results are now more likely to occur than they were in the index as a whole. In finding the significant terms we're interested in those that move the furthest.
I put some constraints here: I am probably only interested in Terms that appear in more than one doc, and in the most significant movers.
Code
%%time
full_df = (
index_term_stats_df
.join(
search_term_stats_df,
lsuffix='_idx',
rsuffix='_src',
how='inner'
)
)
full_df = full_df.loc[full_df.doc_freq_src>1]
full_df['tfidf_diff'] = (
full_df.tfidf_src # Search tf/df
- # minus, terms that are no different = 0
full_df.tfidf_idx # Index tf/df
)
full_df.sort_values(by=['tfidf_diff'], ascending=[False], inplace=True)
full_df.loc[full_df.doc_freq_src>1].head(10)
CPU times: user 3.07 ms, sys: 567 µs, total: 3.64 ms
Wall time: 3.16 ms
| | doc_freq_idx | term_freq_idx | tfidf_idx | doc_freq_src | term_freq_src | tfidf_src | tfidf_diff |
|---|---|---|---|---|---|---|---|
| term | |||||||
| anand | 22 | 59 | 2.681818 | 2 | 15 | 7.500000 | 4.818182 |
| depalma | 25 | 42 | 1.680000 | 3 | 15 | 5.000000 | 3.320000 |
| pierc | 169 | 198 | 1.171598 | 2 | 8 | 4.000000 | 2.828402 |
| altman | 114 | 229 | 2.008772 | 4 | 19 | 4.750000 | 2.741228 |
| pacino | 179 | 313 | 1.748603 | 4 | 17 | 4.250000 | 2.501397 |
| dev | 17 | 66 | 3.882353 | 2 | 12 | 6.000000 | 2.117647 |
| mute | 212 | 245 | 1.155660 | 5 | 15 | 3.000000 | 1.844340 |
| keitel | 50 | 64 | 1.280000 | 2 | 6 | 3.000000 | 1.720000 |
| cage | 316 | 503 | 1.591772 | 7 | 23 | 3.285714 | 1.693942 |
| dahl | 26 | 43 | 1.653846 | 3 | 10 | 3.333333 | 1.679487 |
Code
fig = full_df.loc[full_df.doc_freq_src>1].tfidf_diff.sort_values(ascending=False).plot()
fig.update_layout(width=900, height=400)
fig.show()
Figure 4 presents the difference in frequencies of search terms when compared to the index. Terms that see little difference, i.e. are common to both, will score close to zero.
Build phrases using collocation
The intuition here is that, in search results, phrases like names are likely to move in significance together. Search for "Western" movies and we might expect "John" and "Wayne" to individually move in significance, but if we looked at their frequent positions we would be able to link them.
The following algorithm does this; it adds the concept of slop, allowing for terms that are near but not directly adjacent.
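The vectorised Pandas version below is terse, so here is the same idea as a plain-Python sketch over a single document: chain terms whose positions are within max_slop of each other, padding a single-token gap (often a dropped stop word) with a placeholder. The terms and positions are made up for illustration.

```python
def link_terms(term_positions, max_slop=2):
    """term_positions: (term, position) pairs for one document,
    sorted by position. Terms within max_slop are chained into a
    phrase; a one-token gap is padded with a '???' placeholder.
    Runs of length 1 are not phrases and are discarded."""
    phrases, current = [], []
    last_pos = None
    for term, pos in term_positions:
        if last_pos is not None and pos - last_pos <= max_slop:
            current.extend(["???"] * (pos - last_pos - 1))  # pad the gap
            current.append(term)
        else:
            if len(current) > 1:
                phrases.append(" ".join(current))  # flush finished phrase
            current = [term]
        last_pos = pos
    if len(current) > 1:
        phrases.append(" ".join(current))
    return phrases

print(link_terms([("john", 10), ("wayne", 11),
                  ("western", 40),
                  ("pacino", 50), ("cusack", 52)]))
```

The last pair reproduces the pacino ??? cusack shape seen in the results: the terms sit two positions apart because a stop word between them still consumed a position at analysis time.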
Code
%%time
max_slop=2 # Distance between words in phrase
phrase_df =(
full_df.loc[
( full_df.doc_freq_src>1 ) # In more than 1 doc
& ( full_df.tfidf_diff>0 ) # Has moved in significance
]
.head(200)
.join(
doc_term_stats_df[['doc_id', 'positions']],
how='left'
)
.sort_values(by=['tfidf_diff'], ascending=[False])
[['doc_id', 'positions']]
.explode('positions')
.reset_index()
.set_index('doc_id')
.sort_values(by=['doc_id', 'positions'])
)
count_df = phrase_df.groupby('doc_id')['positions'].count()
count_df.name = "nos_row"
count_df = count_df.loc[lambda x: x>1]
phrase_df = count_df.to_frame().join(phrase_df).drop('nos_row', axis=1)
phrase_df['pos_diff'] = phrase_df.positions.groupby('doc_id').diff()
phrase_df['pos_diff'] = np.where(
phrase_df.pos_diff<=max_slop, phrase_df.pos_diff,
np.where(
(phrase_df.pos_diff>max_slop) & (phrase_df.pos_diff.shift(-1)<=max_slop), 0,
np.where(
np.abs(phrase_df.positions.shift(-1)-phrase_df.positions)<=max_slop, 0,-1))
)
phrase_df.head(8)
CPU times: user 7.08 ms, sys: 1.5 ms, total: 8.58 ms
Wall time: 8.12 ms
| | term | positions | pos_diff |
|---|---|---|---|
| doc_id | |||
| 11 | thriller | 58 | -1 |
| 11 | thriller | 64 | -1 |
| 11 | novel | 76 | -1 |
| 11 | thriller | 189 | -1 |
| 114 | thriller | 116 | -1 |
| 114 | thriller | 283 | -1 |
| 114 | member | 319 | -1 |
| 185 | t | 10 | -1 |
Code
%%time
def build_phrase(df):
out = []
phrase = []
positions = []
last=0
for doc_id,row in df.iterrows():
idx = row.pos_diff
if idx == 0:
if len(phrase)>0:
out.append(
pd.Series(
{'phrase': ' '.join(phrase), 'positions': positions},
)
)
phrase = []
positions = []
last=0
diff = idx-last
if diff > 1:
for i in range(diff-1):
phrase.append("???") # if slop >1 I might miss a term, so use placeholder
positions.append(positions[-1]+1)
phrase.append(row.term)
positions.append(row.positions)
last=idx
if len(phrase)>0:
out.append(
pd.Series(
{'phrase': ' '.join(phrase), 'positions': positions},
)
)
return pd.DataFrame(out).set_index('phrase')
all_phrase_df = (
# inline test
phrase_df.loc[
(phrase_df.pos_diff>=0)
]
.groupby('doc_id')
.apply(build_phrase)
.sort_values(by='phrase')
.reset_index()
)
(
all_phrase_df
.loc[lambda x: x.phrase.str.len()<64] # To fix pdf table render
.set_index('phrase')
).head(8)
CPU times: user 39 ms, sys: 1.4 ms, total: 40.4 ms
Wall time: 40 ms
| | doc_id | positions |
|---|---|---|
| phrase | ||
| al pacino | 40503 | [90, 91] |
| al pacino | 40503 | [71, 72] |
| al pacino | 40491 | [172, 173] |
| al pacino | 40491 | [715, 716] |
| al pacino | 40491 | [260, 261] |
| al pacino | 40491 | [599, 600] |
| al pacino cusack | 40504 | [27, 28, 30] |
| associ ??? rock | 25176 | [109, 110, 111] |
Code
%%time
significant_phrases_df = (
all_phrase_df
.drop(["positions"], axis=1)
.groupby('phrase')
.agg({
"doc_id": ['nunique', lambda x: np.unique(x).tolist()]
})
.droplevel(0, axis=1)
.rename({"nunique":"nos_docs"}, axis=1)
.rename({"<lambda_0>":"doc_ids"}, axis=1)
.loc[lambda x: x.nos_docs>1]
.sort_values(by=['nos_docs'], ascending=[False])
).head(50)
significant_phrases_df
CPU times: user 2.88 ms, sys: 435 µs, total: 3.32 ms
Wall time: 3.15 ms
| | nos_docs | doc_ids |
|---|---|---|
| phrase | ||
| music video | 16 | [4172, 10713, 35326, 36748, 36751, 36753, 3675... |
| michael dougla | 8 | [16814, 16815, 16819, 35759, 35762, 35764, 484... |
| polit thriller | 8 | [6796, 16815, 16819, 26213, 32420, 40500, 4050... |
| michael ??? thriller | 5 | [36750, 36756, 36766, 36768, 36769] |
| robert altman | 4 | [6322, 43012, 43013, 46909] |
| hitchcockian thriller | 4 | [5463, 7445, 43174, 48620] |
| red rock | 3 | [49316, 49318, 49325] |
| pacino ??? cusack | 3 | [40491, 40500, 40504] |
| mute wit | 3 | [33299, 33498, 33521] |
| lara ??? boyl | 3 | [49316, 49318, 49325] |
| denzel washington | 3 | [19086, 47750, 47758] |
| conspiraci thriller | 3 | [6796, 9855, 26211] |
| song thriller | 3 | [4172, 36756, 36761] |
| servic ??? pete | 2 | [48428, 48429] |
| thriller ??? thriller | 2 | [7445, 44486] |
| southern gothic | 2 | [43012, 44469] |
| vijai anand | 2 | [36164, 36176] |
| al pacino | 2 | [40491, 40503] |
| robert mitchum | 2 | [5463, 43012] |
| citi hall | 2 | [40491, 40503] |
| music video thriller | 2 | [36761, 36773] |
| isn t | 2 | [22713, 23335] |
| zane ??? brook attend | 2 | [5984, 8161] |
Fixing placeholder, unknown values
To do this we can use the offsets and the original document. We have all the byte offsets and the original stored text, so we can simply look a missing word up, even if it had been stripped out as a stop word.
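In miniature, the recovery is just a slice of the stored text between the phrase's smallest start offset and largest end offset. The review text and offsets here are hypothetical:

```python
text = "Al Pacino and John Cusack star."
# Offsets as they would have been recorded at index time for the
# significant terms: "pacino" -> (3, 9), "cusack" -> (19, 25).
# Whatever sits in the gap ("and John") comes along for free.
offset_start = min(3, 19)  # min start across the phrase's terms
offset_end = max(9, 25)    # max end across the phrase's terms
print(text[offset_start:offset_end])
```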
Code
%%time
place_df = (
# significant_phrases_df
significant_phrases_df.loc[
significant_phrases_df.index.str.contains("???", regex=False)
]
.drop('doc_ids', axis=1)
.join(
all_phrase_df.set_index('phrase')
, how='inner'
)
)
place_df['terms']= place_df.index.str.split()
def do_zip(tpl):
tpl['zip'] = list(zip(tpl['terms'], tpl['positions']))
return tpl
def de_tuple(tpl):
term, position = tpl['zip']
tpl['term'] = term
tpl['position'] = position
return tpl
def extract(tpl):
d = tpl['postings']
flag = pd.notna(d)
tpl['position_r']= d['position'] if flag else -1
tpl['offset_start']= d['offset_start'] if flag else -1
tpl['offset_end']= d['offset_end'] if flag else -1
return tpl
place_df = (
place_df
.apply(do_zip, axis=1)
.drop(['positions', 'terms'], axis=1)
.reset_index()
.reset_index()
.rename({'index':'pid'}, axis=1) # to handle repeated phrase in a doc_id
.explode('zip')
.apply(de_tuple, axis=1)
.drop('zip', axis=1)
.reset_index()
.set_index(['doc_id', 'term'])
.join(
(
doc_term_stats_df
.reset_index()
.set_index(['doc_id', 'term'])
.drop('positions', axis=1)
)
, how='left'
)
.explode('postings')
.apply(extract, axis=1)
.drop('postings', axis=1)
.loc[lambda x: x.position==x.position_r]
.reset_index()
.groupby(['pid', 'phrase', 'doc_id'])
.agg({
'offset_start': min,
'offset_end': max
})
.reset_index('doc_id')
)
place_df
CPU times: user 181 ms, sys: 6.76 ms, total: 188 ms
Wall time: 187 ms
| | | doc_id | offset_start | offset_end |
|---|---|---|---|---|
| pid | phrase | |||
| 0 | michael ??? thriller | 36768 | 0 | 26 |
| 1 | michael ??? thriller | 36769 | 50 | 76 |
| 2 | michael ??? thriller | 36766 | 855 | 882 |
| 3 | michael ??? thriller | 36750 | 404 | 430 |
| 4 | michael ??? thriller | 36756 | 458 | 485 |
| 5 | michael ??? thriller | 36766 | 65 | 92 |
| 6 | pacino ??? cusack | 40500 | 132 | 149 |
| 7 | pacino ??? cusack | 40504 | 918 | 935 |
| 8 | pacino ??? cusack | 40491 | 2033 | 2052 |
| 9 | lara ??? boyl | 49316 | 1739 | 1755 |
| 10 | lara ??? boyl | 49318 | 505 | 523 |
| 11 | lara ??? boyl | 49325 | 1058 | 1074 |
| 12 | lara ??? boyl | 49318 | 984 | 1000 |
| 13 | servic ??? pete | 48429 | 440 | 458 |
| 14 | servic ??? pete | 48428 | 845 | 863 |
| 15 | thriller ??? thriller | 7445 | 26 | 50 |
| 16 | thriller ??? thriller | 44486 | 578 | 609 |
| 17 | zane ??? brook attend | 5984 | 130 | 153 |
| 18 | zane ??? brook attend | 8161 | 130 | 153 |
Code
%%time
for d in place_df.itertuples():
index, doc_id, offset_start, offset_end = d
stored = reader.storedFields()
doc = stored.document(doc_id)
text = doc.get('text')
print(f"Doc {doc_id}: {text[offset_start:offset_end]}")
Doc 36768: Michael Jackson's Thriller
Doc 36769: Michael Jackson's THRILLER
Doc 36766: Michael Jackson - "Thriller
Doc 36750: Michael Jackson's Thriller
Doc 36756: Michael Jackson's 'Thriller
Doc 36766: Michael Jackson's "Thriller
Doc 40500: Pacino and Cusack
Doc 40504: Pacino and Cusack
Doc 40491: Pacino. John Cusack
Doc 49316: Lara Flynn Boyle
Doc 49318: Lara Flynn Boyle's
Doc 49325: Lara Flynn Boyle
Doc 49318: Lara Flynn Boyle
Doc 48429: Service Agent Pete
Doc 48428: Service agent Pete
Doc 7445: Thriller". But thrillers
Doc 44486: thrillers, especially thrillers
Doc 5984: Zane and Brook attended
Doc 8161: Zane and Brook attended
CPU times: user 2.99 ms, sys: 1.03 ms, total: 4.02 ms
Wall time: 6.19 ms
Conclusion
As analysts and data people we have user / player entered text in lots of places: surveys, reviews, social profiles etc. We often ignore it, but we don't have to. The technique here demonstrates the use of Lucene directly, but Elastic has significant term tools [3,4]. I go a step further and build phrases from the observation that the terms in a phrase are collocated. I've demonstrated that this is possible even when I drop stop words on the floor, and provided a method for recovering the original sense and context.
It is fast. Approximate timings:
- Indexing 50,000 short documents, 7s
- Building index stats, 200ms
- Building search stats, 400ms
- Building phrases, 50ms
- Resolving placeholder text, 200ms
More though we’ve see this tooling is fast, I think there may be a place alongside AI model training which I plan to explore in the next part when I take what this produces to try and bootstrap the training of a PyTorch model.