flowchart TB
A["CSV Reviews"] --> B["Lucene Index<br/>(TMDB)"]
B --> C["IndexStatsExtractor<br/>(global TF-IDF)"]
B --> D["SearchStatsExtractor<br/>(per-query stats)"]
C --> E["PhraseBuilder"]
D --> E
E --> F["all_phrases_df<br/>(phrase, doc_ids)"]
F --> G["Sequence Labelling<br/>(TermVectors)"]
G --> H["BiLSTM PhraseTagger<br/>(REINFORCE + BCE)"]
H --> I["Evaluation<br/>(Precision / Recall / Reward)"]
Learning Phrases with PyLucene and Pytorch, part 2.
In part 2 we reuse our tokenised index and use PyTorch to build a model for significant phrase extraction. It worked surprisingly well, and being able to switch Analyzers proved useful: the English Analyzer, with stopword removal and stemming, worked best.
The results are indicative; neither the dataset size nor the length of the training cycles is sufficient to develop a generalised phrase extractor, but the success and overlap found between PyLucene and PyTorch is very encouraging. We just need to scale it up.
This time, we have abstracted away the Lucene code from part 1 [1] into a set of classes that extract significant phrases from search results and the index. We will use those phrases and see how we get on building a PyTorch model to extract phrases directly.
The pipeline has two distinct phases, see Figure 1. First, PyLucene extracts statistically significant phrases from a search result set. Second, those phrases are used as training labels to teach a PyTorch model to recognise similar phrases directly from raw token sequences.
Extracting Significant Phrases with Lucene
This time we have switched the dataset to TMDB, another movie-reviews dataset with 10,000 movies, 150k cast, 63k crew, and 80k user reviews. Importantly for this test it has the movie title as a separate field, which makes it a little easier to generate lots of phrases: in the Lucene approach we can use the movie title as a query term and generate the significant phrases for each movie.
A limitation of this dataset is that it is quite small per movie; many movies have only a few reviews.
Classes
The abstracted classes [1] greatly simplify the PyLucene phrase extraction pipeline.
classDiagram
class ConfigurableIndexer {
+dict index_config
+add_document(dict)
}
class IndexStatsExtractor {
+dict index_config
+str field_name
+extract() DataFrame
}
class SearchStatsExtractor {
+dict index_config
+str field_name
+IndexReader reader
+extract(TopDocs hits) DataFrame
}
class PhraseBuilder {
+DataFrame index_stats_df
+DataFrame search_stats_df
+IndexReader reader
+build_phrases(max_slop, num_significant_terms) DataFrame
}
ConfigurableIndexer ..> IndexStatsExtractor : writes index
IndexStatsExtractor ..> PhraseBuilder : index_stats_df
SearchStatsExtractor ..> PhraseBuilder : search_stats_df
Phrase Extraction Algorithm
The phrase extraction algorithm is presented in Figure 3.
flowchart TD
A["IndexStatsExtractor<br/>extract()"] --> C["Join on term<br/>compute tfidf delta"]
B["SearchStatsExtractor<br/>extract(hits)"] --> C
C --> D{"tfidf higher<br/>in search than index?"}
D -- No --> E["Drop term"]
D -- Yes --> F["Keep as significant term"]
F --> G["Group consecutive terms<br/>within max_slop positions"]
G --> H{"Gap ≤ max_slop?"}
H -- Yes --> I["Fill gap with '???'"]
H -- No --> J["End phrase"]
I --> G
J --> K["Resolve '???' via<br/>storedFields byte offsets"]
K --> L["resolved_text<br/>phrases DataFrame"]
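The core significance test in the flow above is just a join and a comparison. Here is a minimal pandas sketch with toy data; the `term` and `tfidf` column names are illustrative, not the extractors' actual schema:

```python
import pandas as pd

# Toy stand-ins for the two extractors' outputs
index_stats_df = pd.DataFrame({"term": ["film", "trench"], "tfidf": [0.9, 0.1]})
search_stats_df = pd.DataFrame({"term": ["film", "trench"], "tfidf": [0.8, 0.7]})

significant = (
    search_stats_df.merge(index_stats_df, on="term", suffixes=("_search", "_index"))
    .assign(delta=lambda d: d.tfidf_search - d.tfidf_index)
    .query("delta > 0")  # tf-idf higher in the result set than globally
)
print(significant.term.tolist())  # ['trench']
```

A term like "film" is frequent everywhere, so its delta is negative and it is dropped; "trench" is rare globally but common in this result set, so it survives.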
Whereas in part 1 we programmatically defined Lucene documents and fields, all of this is now configurable. It would be a simple matter to extend this to daily indexing by templating the config file. For now, though, the index is not that big at 33MB. The index configuration is as follows.
index_config = {
"directory": "./index_tmdb",
"default_analyzer": "keyword", # Whole doc as a single token
"fields": [
{
"name": "movie_title",
"options": "docs"
},
{
"name": "content",
"analyzer": "english", # Standard English analyzer
"options": "docs_freqs_positions_offsets"
}
]
}
The phrase extraction configuration is also presented; later we will do similar for the PyTorch model. It will be interesting to compare the complexities of the two approaches.
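As a sketch of the daily-indexing idea mentioned above, the directory could be templated per date. Note that `daily_index_config` is a hypothetical helper, not part of the library:

```python
from datetime import date

def daily_index_config(base_config: dict, day: date) -> dict:
    """Return a copy of the index config pointing at a per-day directory."""
    cfg = dict(base_config)
    cfg["directory"] = f"./index_tmdb_{day:%Y%m%d}"
    return cfg

cfg = daily_index_config({"directory": "./index_tmdb"}, date(2026, 3, 30))
print(cfg["directory"])  # ./index_tmdb_20260330
```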
phrases_config = {
"min_doc_freq": 5, # minimum document frequency for significant phrase
"max_slop": 2, # maximum slop for significant phrase
"num_significant_terms": 200, # number of significant terms to extract
"max_hits": 500 # maximum number search results per movie
}
Implementation
First we import the required classes.
Code
import pandas as pd
import numpy as np
from pathlib import Path
from acumed.search.index import (
ConfigurableIndexer,
IndexStatsExtractor
)
from acumed.search.searcher import SearchStatsExtractor
from acumed.search.phrases import PhraseBuilder
from org.apache.lucene.index import Term
from org.apache.lucene.search import (
IndexSearcher,
TermQuery,
BooleanQuery,
BooleanClause
)
WARNING: Using incubator modules: jdk.incubator.vector
Mar 30, 2026 4:12:37 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
We index the TMDB reviews. It is small, but still large enough to be interesting.
Code
%%time
index_config = {
"directory": "./index_tmdb",
"default_analyzer": "keyword",
"fields": [
{
"name": "movie_title",
"options": "docs"
},
{
"name": "content",
"analyzer": "english",
"options": "docs_freqs_positions_offsets"
}
]
}
if not Path(index_config["directory"]).exists():
# Index each review in the dataframe
tmdb_df = pd.read_csv("data/tmdb/reviews.csv")
with ConfigurableIndexer(index_config) as indexer:
def index_review(tpl):
indexer.add_document({
"movie_title": tpl.movie_title,
"content": tpl.content
})
return tpl
tmdb_df.apply(index_review, axis=1)
CPU times: user 2.87 s, sys: 130 ms, total: 3 s
Wall time: 1.82 s
As before we can extract the significant phrases from the index.
Code
%%time
phrases_config = {
"min_doc_freq": 5, # minimum document frequency for significant phrase
"max_slop": 2, # maximum slop for significant phrase
"num_significant_terms": 200, # number of significant terms to extract
"max_hits": 500 # maximum number search results per movie
}
with IndexStatsExtractor(index_config, "content") as ex:
index_stats_df = ex.extract()
with SearchStatsExtractor(index_config, "content") as ex:
titles = (
pd.read_csv("data/tmdb/reviews.csv")[['movie_title', 'movie_id']]
.groupby("movie_title")
.count()
.loc[lambda x: x.movie_id >= phrases_config["min_doc_freq"]] # at least min_doc_freq reviews
.reset_index()[['movie_title']]
.drop_duplicates()
)
searcher = IndexSearcher(ex.reader)
def f(tpl):
movie_title = tpl.movie_title
query = (
BooleanQuery.Builder()
.add(TermQuery(Term('movie_title',movie_title)), BooleanClause.Occur.MUST)
).build()
hits = searcher.search(query, phrases_config["max_hits"])
search_stats_df = ex.extract(hits)
pb = PhraseBuilder(index_stats_df, search_stats_df, ex.reader)
phrases_df = pb.build_phrases(max_slop=phrases_config["max_slop"], num_significant_terms=phrases_config["num_significant_terms"])
phrases_df['movie_title'] = movie_title
return phrases_df
# Extract significant phrases for every qualifying movie title
all_phrases_df = titles.apply(f, axis=1)
all_phrases_df = pd.concat(all_phrases_df.tolist())
all_phrases_df.head(8)
CPU times: user 21.6 s, sys: 97.1 ms, total: 21.7 s
Wall time: 19 s
| phrase | resolved_text | doc_ids | nos_docs | movie_title | |
|---|---|---|---|---|---|
| 0 | 13 go 30 | 13 go 30 | [3076, 3077, 3079] | 3 | 13 Going on 30 |
| 0 | first world war | first world war | [9584, 9596] | 2 | 1917 |
| 1 | man land | man land | [9582, 9584] | 2 | 1917 |
| 2 | other film | other film | [9585, 9595] | 2 | 1917 |
| 0 | 28 dai later | 28 dai later | [337, 340] | 2 | 28 Days Later |
| 1 | cillian murphi | cillian murphi | [340, 343] | 2 | 28 Days Later |
| 0 | 28 year later | 28 year later | [12505, 12508, 12509] | 3 | 28 Years Later |
| 1 | all too | all too | [12507, 12508] | 2 | 28 Years Later |
Learning Phrases with PyTorch via the Lucene Index
We now use the significant phrases extracted from the Lucene index as training data for a PyTorch model.
The task is token classification: given a sequence of words from a document, predict which tokens belong to a significant phrase. The natural candidates, in rough order of complexity, are:
| Model | Memory | Hardware | Notes |
|---|---|---|---|
| Logistic Regression | \(O(N)\) | CPU | No sequential context |
| Bidirectional LSTM [2] | \(O(N)\) | CPU / MPS | Good context, efficient |
| BERT / RoBERTa [3] | \(O(N^2)\) | GPU (≥ 8 GB VRAM) | State-of-the-art, memory intensive |
| Longformer [4] | \(O(N \sqrt{N})\) | High-end GPU | Designed for long documents |
A Transformer such as BERT would likely give the best recall, but its self-attention mechanism scales quadratically with sequence length. On a MacBook Pro with Apple Silicon, batches of 1024-token sequences caused kernel out-of-memory crashes in initial experiments. The Bidirectional LSTM [2,5] scales linearly and runs stably on Apple MPS — a practical choice when GPU memory is limited.
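A back-of-the-envelope estimate shows why: each attention head materialises a seq_len × seq_len score matrix, so at our batch size the scores alone run to gigabytes per layer. The 12 heads are an assumption (BERT-base style), not something measured here:

```python
# Rough fp32 memory for self-attention score matrices at our settings
seq_len, batch, heads = 1024, 32, 12  # heads assumed, BERT-base style

# Each head stores a seq_len x seq_len score matrix, 4 bytes per float
attn_bytes = seq_len * seq_len * 4 * heads * batch
print(f"{attn_bytes / 2**30:.1f} GiB per attention layer")  # 1.5 GiB per attention layer
```

Multiply by a dozen layers, plus activations and gradients, and a laptop GPU runs out of memory quickly; the BiLSTM's per-step state avoids this entirely.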
The training objective combines two signals:
- First, a Binary Cross-Entropy loss [6] with a high positive class weight (pos_weight = 25) to compensate for label imbalance (phrase tokens are typically less than 5% of a document).
- Second, a REINFORCE [7] policy gradient that rewards the model at the document level using the 0 / 1 / 2 scoring scheme, encouraging complete phrase recovery rather than isolated token hits.
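The 0 / 1 / 2 scoring scheme can be sketched as a plain function over sets of token indices; this is a simplified, unbatched version of the logic used in training and evaluation:

```python
def document_reward(true_indices: set, pred_indices: set) -> float:
    """0 = no overlap, 1 = partial recovery, 2 = all phrase tokens recovered."""
    if not true_indices or not true_indices & pred_indices:
        return 0.0
    return 2.0 if true_indices <= pred_indices else 1.0

print(document_reward({3, 4, 5}, {4, 5}))        # 1.0 (partial)
print(document_reward({3, 4, 5}, {3, 4, 5, 9}))  # 2.0 (full recall, extras allowed)
```

Note the reward only checks recall of the target tokens; over-prediction is not penalised here, which the BCE term is left to handle.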
For each document that has at least one target phrase, we read its term vector directly from the Lucene index, reconstruct the original token sequence ordered by position, then build a binary label vector marking which token positions belong to a target phrase, see Figure 4.
---
config:
themeVariables:
fontSize: 10
---
sequenceDiagram
participant PD as all_phrases_df
participant TV as Lucene TermVectors
participant SP as Sequence Prep
participant DS as SequenceDataset
PD->>SP: doc_phrases dict {doc_id: [phrases]}
loop For each doc_id
SP->>TV: reader.termVectors().get(doc_id)
TV-->>SP: terms + byte positions
SP->>SP: Sort by position, truncate to max_seq_len
SP->>SP: Slide phrase tokens over sequence → seq_labels [0/1]
SP->>SP: Map tokens → vocab integer IDs
SP->>DS: append (tensor, labels, doc_id)
end
DS->>DS: 75/25 train/test split
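The phrase-to-label matching, including the '???' wildcard left by slop gaps, is a simple sliding window. A standalone sketch of that step, simplified from the full listing below:

```python
def label_tokens(seq_tokens: list, phrase: str) -> list:
    """Mark 1 for every token covered by a phrase match; '???' matches any token."""
    pt = phrase.split()
    labels = [0] * len(seq_tokens)
    for i in range(len(seq_tokens) - len(pt) + 1):
        if all(p == "???" or p == t for p, t in zip(pt, seq_tokens[i:i + len(pt)])):
            for j in range(len(pt)):
                labels[i + j] = 1
    return labels

print(label_tokens(["first", "world", "war", "poem"], "first ??? war"))  # [1, 1, 1, 0]
```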
As before, all tuneable hyperparameters are consolidated into a single dict at the top of the PyTorch code section. The task involves highly imbalanced data (only a few tokens are part of a phrase) and long documents, so a few hyperparameters are particularly critical:
- pos_weight (25.0): the most important parameter. It tells the Binary Cross-Entropy loss to treat a “missed” phrase token as 25 times more costly than a “false alarm” non-phrase token. Without this, the model would simply predict “0” for everything and still achieve 95%+ accuracy.
- pg_weight (0.1): balances the two training signals. A value of 0.1 lets the REINFORCE reward signal guide the model toward complete phrases without overwhelming the token-level BCE supervision.
- max_seq_len (1024): limits how much of a document the model “sees” at once. While LSTMs scale linearly, processing very long sequences still consumes significant memory.
- hidden_dim (128): the “capacity” of the LSTM. 128 units lets the model remember enough contextual interaction between words to distinguish significant phrases from common collocations.
# All tuneable hyperparameters in one place
model_config = {
"max_seq_len": 1024, # Token truncation limit per document (avoids O(N²) memory in long docs)
"embed_dim": 128, # Embedding vector size
"hidden_dim": 128, # LSTM hidden units per direction
"num_layers": 2, # Stacked BiLSTM depth
"dropout": 0.2, # Dropout between LSTM layers
"pos_weight": 25.0, # BCE weight for phrase tokens (boosts Recall on sparse labels)
"lr": 0.001, # Adam learning rate
"batch_size": 32, # Mini-batch size
"epochs": 6, # Training epochs
"pg_weight": 0.1, # Policy-gradient loss coefficient relative to BCE
"test_size": 0.25, # Held-out fraction for evaluation
"random_state": 42, # Reproducible train/test split
"n_comparison_docs": 10, # Max documents displayed in visual comparison
}
In the following code we use an IndexReader directly to translate from TermVectors to tensors.
Code
%%time
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence
import numpy as np
# All tuneable hyperparameters in one place
model_config = {
"max_seq_len": 1024, # Token truncation limit per document (avoids O(N²) memory in long docs)
"embed_dim": 128, # Embedding vector size
"hidden_dim": 128, # LSTM hidden units per direction
"num_layers": 2, # Stacked BiLSTM depth
"dropout": 0.2, # Dropout between LSTM layers
"pos_weight": 25.0, # BCE weight for phrase tokens (boosts Recall on sparse labels)
"lr": 0.001, # Adam learning rate
"batch_size": 32, # Mini-batch size
"epochs": 6, # Training epochs
"pg_weight": 0.1, # Policy-gradient loss coefficient relative to BCE
"test_size": 0.25, # Held-out fraction for evaluation
"random_state": 42, # Reproducible train/test split
"n_comparison_docs": 10, # Max documents displayed in visual comparison
}
# 1. Prepare Target Phrases per Document
doc_phrases = (
all_phrases_df[['phrase', 'doc_ids']]
.explode('doc_ids')
.drop_duplicates()
.groupby('doc_ids')['phrase']
.apply(list)
.to_dict()
)
# 2. Extract Sequential Data from Lucene TermVectors
from org.apache.lucene.util import BytesRefIterator
from acumed.search.searcher import SearchStatsExtractor
vocab = {"<PAD>": 0, "<UNK>": 1}
sequences = []
labels = []
doc_ids_list = []
# Open the reader to loop the exact sequential positions per document
with SearchStatsExtractor(index_config, "content") as ex:
reader = ex.reader
for doc_id, target_phrases in doc_phrases.items():
vector = reader.termVectors().get(doc_id)
if vector is None:
continue
termsEnum = vector.terms("content")
if termsEnum is None:
continue
te = termsEnum.iterator()
term_positions = []
# Rebuild the document's token order from the stored term positions
for term in BytesRefIterator.cast_(te):
term_str = term.utf8ToString()
postings = te.postings(None)
postings.nextDoc()
freq = postings.freq()
for _ in range(freq):
pos = postings.nextPosition()
term_positions.append((pos, term_str))
term_positions.sort(key=lambda x: x[0])
# Truncate long documents: Transformers scale O(N²), BiLSTM is O(N) but still memory-bounded
seq_tokens = [t for p, t in term_positions][:model_config["max_seq_len"]]
if not seq_tokens:
continue
# Build binary labels by matching target phrases onto token positions
seq_labels = [0] * len(seq_tokens)
for phrase in target_phrases:
phrase_tokens = phrase.split()
p_len = len(phrase_tokens)
# Slide each phrase over the sequence; '???' wildcards match any token
for i in range(len(seq_tokens) - p_len + 1):
match = True
for j in range(p_len):
if phrase_tokens[j] != '???' and phrase_tokens[j] != seq_tokens[i+j]:
match = False
break
if match:
for j in range(p_len):
seq_labels[i+j] = 1
# Map tokens to vocabulary integer IDs
enc_seq = []
for t in seq_tokens:
if t not in vocab:
vocab[t] = len(vocab)
enc_seq.append(vocab[t])
sequences.append(torch.tensor(enc_seq, dtype=torch.long))
labels.append(torch.tensor(seq_labels, dtype=torch.float))
doc_ids_list.append(doc_id)
CPU times: user 1.35 s, sys: 165 ms, total: 1.52 s
Wall time: 2.2 s
PyTorch Datasets
Next we create our standard DataLoaders with padded sequences and randomly split into train and test sets in the ratio 3:1.
Code
class SequenceDataset(Dataset):
def __init__(self, sequences, labels, doc_ids):
self.sequences = sequences
self.labels = labels
self.doc_ids = doc_ids
def __len__(self):
return len(self.sequences)
def __getitem__(self, idx):
return self.sequences[idx], self.labels[idx], self.doc_ids[idx]
def collate_fn(batch):
seqs, lbls, dids = zip(*batch)
seqs_padded = pad_sequence(seqs, batch_first=True, padding_value=vocab["<PAD>"])
lbls_padded = pad_sequence(lbls, batch_first=True, padding_value=0.0)
return seqs_padded, lbls_padded, torch.tensor(dids)
# Split train/test at the document level
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
sequences, labels, doc_ids_list,
test_size=model_config["test_size"],
random_state=model_config["random_state"]
)
train_dataset = SequenceDataset(X_train, y_train, id_train)
test_dataset = SequenceDataset(X_test, y_test, id_test)
train_loader = DataLoader(train_dataset, batch_size=model_config["batch_size"], shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=model_config["batch_size"], shuffle=False, collate_fn=collate_fn)
Bidirectional LSTM Evaluation
Each token embedding is passed through a stack of bidirectional LSTM layers so that every position can see both preceding and following context before a linear head produces a per-token phrase probability.
flowchart TD
A["Input token IDs<br/>(batch × seq_len)"] --> B["nn.Embedding<br/>(vocab_size → embed_dim)"]
B --> C["nn.LSTM<br/>(bidirectional, num_layers)<br/>embed_dim → hidden_dim × 2"]
C --> D["nn.Linear<br/>(hidden_dim × 2 → 1)"]
D --> E["Sigmoid → probability per token"]
E --> F{"threshold 0.5"}
F -->|"≥ 0.5"| G["Tag = 1 (phrase token)"]
F -->|"< 0.5"| H["Tag = 0 (non-phrase)"]
Code
%%time
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
class PhraseTagger(nn.Module):
def __init__(self, vocab_size,
embed_dim=model_config["embed_dim"],
hidden_dim=model_config["hidden_dim"],
num_layers=model_config["num_layers"]):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
bidirectional=True, batch_first=True,
dropout=model_config["dropout"])
self.linear = nn.Linear(hidden_dim * 2, 1)
def forward(self, x):
embedded = self.embedding(x)
lstm_out, _ = self.lstm(embedded)
logits = self.linear(lstm_out).squeeze(-1)
return logits
model = PhraseTagger(vocab_size=len(vocab)).to(device)
pos_weight = torch.tensor([model_config["pos_weight"]], device=device)
criterion = nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_weight)
optimizer = torch.optim.Adam(model.parameters(), lr=model_config["lr"])
# Training loop: supervised BCE plus a REINFORCE policy gradient
epochs = model_config["epochs"]
model.train()
for epoch in range(epochs):
total_loss = 0
total_bce = 0
total_pg = 0
for seqs, lbls, _ in train_loader:
optimizer.zero_grad()
seqs, lbls = seqs.to(device), lbls.to(device)
logits = model(seqs)
probs = torch.sigmoid(logits)
mask = (seqs != vocab["<PAD>"]).float()
# 1. Supervised BCE loss (stabilises early training)
bce_loss_raw = criterion(logits, lbls)
bce_loss = (bce_loss_raw * mask).sum() / max(mask.sum(), 1)
# 2. Reinforcement Learning Feedback (REINFORCE Algorithm)
# Sample a 0/1 tag per token from the predicted Bernoulli distribution
m = torch.distributions.Bernoulli(probs)
actions = m.sample()
log_probs = m.log_prob(actions)
batch_rewards = []
for i in range(len(lbls)):
doc_len = mask[i].sum().int().item()
pred_seq = actions[i][:doc_len]
true_seq = lbls[i][:doc_len]
true_indices = torch.where(true_seq == 1)[0].cpu().numpy()
pred_indices = torch.where(pred_seq == 1)[0].cpu().numpy()
if len(true_indices) > 0 and len(pred_indices) == 0:
r = 0.0
elif len(np.intersect1d(true_indices, pred_indices)) > 0:
doc_recall = len(np.intersect1d(true_indices, pred_indices)) / max(len(true_indices), 1)
r = 2.0 if doc_recall == 1.0 else 1.0
else:
r = 0.0
batch_rewards.append(r)
reward_tensor = torch.tensor(batch_rewards, device=device, dtype=torch.float)
# Subtract the batch-mean reward as a baseline to reduce gradient variance
advantages = reward_tensor - reward_tensor.mean()
# Sum log-probs over real (unpadded) tokens, normalised by the longest sequence in the batch
masked_log_probs = (
(log_probs * mask).sum(dim=1) / max(mask.sum(dim=1).max(), 1)
)
pg_loss = - (masked_log_probs * advantages).mean()
# Combined loss: BCE handles token-level imbalance, PG rewards whole-phrase recovery
loss = bce_loss + model_config["pg_weight"] * pg_loss
loss.backward()
optimizer.step()
total_loss += loss.item()
total_bce += bce_loss.item()
total_pg += pg_loss.item()
avg_loss = total_loss/max(len(train_loader), 1)
avg_bce = total_bce/max(len(train_loader), 1)
avg_pg = total_pg/max(len(train_loader), 1)
print(f"Epoch {epoch+1}/{epochs} | Total Loss: {avg_loss:.4f} (BCE: {avg_bce:.4f}, PG: {avg_pg:.4f})")
torch.save(model.state_dict(), "local/models/phrases_pytorch.bin")
Epoch 1/6 | Total Loss: 0.9121 (BCE: 0.9110, PG: 0.0112)
Epoch 2/6 | Total Loss: 0.7458 (BCE: 0.7455, PG: 0.0031)
Epoch 3/6 | Total Loss: 0.5821 (BCE: 0.5821, PG: -0.0004)
Epoch 4/6 | Total Loss: 0.4133 (BCE: 0.4133, PG: -0.0004)
Epoch 5/6 | Total Loss: 0.2790 (BCE: 0.2791, PG: -0.0012)
Epoch 6/6 | Total Loss: 0.1977 (BCE: 0.1978, PG: -0.0001)
CPU times: user 11 s, sys: 2.4 s, total: 13.4 s
Wall time: 27.8 s
That is it: training complete, 6 epochs in a little under 30 seconds. The total loss is decreasing at each epoch, which is a good sign. This isn't a big training set or exhaustive training, and the results are unlikely to generalise to other document sets. The question we are trying to answer here is: can it work at all?
Final Evaluation Metrics
We compute the standard token-level metrics (accuracy, precision and recall, with the emphasis on recall given the sparse labels), alongside the document-level 0 / 1 / 2 reward that explicitly scores how completely each document's phrases were recovered.
flowchart TD
A["Model logits<br/>(test set)"] --> B["Sigmoid threshold 0.5<br/>binary predictions"]
B --> C["Flatten all predictions<br/>& true labels"]
C --> D["Standard Metrics<br/>Accuracy / Precision / Recall"]
B --> E["Per-document phrase<br/>boundary check"]
E --> F{"Any target<br/>phrase tokens?"}
F -- No --> G["Reward = 0<br/>(hard fail)"]
F -- Yes --> H{"Predicted tokens<br/>overlap targets?"}
H -- No --> G
H -- Yes --> I{"100% recall<br/>for this doc?"}
I -- Yes --> J["Reward = 2<br/>(perfect extraction)"]
I -- No --> K["Reward = 1<br/>(partial extraction)"]
J --> L["Average Reward"]
K --> L
G --> L
Code
from sklearn.metrics import accuracy_score, precision_score, recall_score
model.eval()
idx2word = {v: k for k, v in vocab.items()}
comparison_results = []
plot_data = []
all_preds = []
all_trues = []
rewards = []
with torch.no_grad():
for seqs, lbls, dids in test_loader:
seqs, lbls = seqs.to(device), lbls.to(device)
logits = model(seqs)
preds = (torch.sigmoid(logits) > 0.5).float()
# Exclude padding positions from the metrics
mask = (seqs != vocab["<PAD>"])
all_preds.extend(preds[mask].cpu().numpy())
all_trues.extend(lbls[mask].cpu().numpy())
# Compute the document-level 0/1/2 recovery reward
for i in range(len(dids)):
doc_id = dids[i].item()
doc_len = mask[i].sum().int().item()
true_seq = lbls[i][:doc_len]
pred_seq = preds[i][:doc_len]
true_indices = torch.where(true_seq == 1)[0].cpu().numpy()
pred_indices = torch.where(pred_seq == 1)[0].cpu().numpy()
# Assign the 0 / 1 / 2 document score
if len(true_indices) > 0 and len(pred_indices) == 0:
rewards.append(0)
elif len(np.intersect1d(true_indices, pred_indices)) > 0:
# Fraction of target phrase tokens the model recovered
doc_recall = len(np.intersect1d(true_indices, pred_indices)) / max(len(true_indices), 1)
if doc_recall == 1.0:
rewards.append(2) # Score 2: every target token recovered
else:
rewards.append(1) # Score 1: partial recovery
else:
rewards.append(0) # Score 0: no overlap with targets
# Collect examples comparing Lucene target phrases with PyTorch predictions
if len(comparison_results) < model_config["n_comparison_docs"] and doc_id in doc_phrases:
pred_tokens = []
current_phrase = []
for j in range(doc_len):
if pred_seq[j] == 1:
current_phrase.append(idx2word.get(seqs[i][j].item(), "<UNK>"))
else:
if current_phrase:
pred_tokens.append(" ".join(current_phrase))
current_phrase = []
if current_phrase:
pred_tokens.append(" ".join(current_phrase))
raw_text_snippet = " ".join([idx2word.get(x.item(), "<UNK>") for x in seqs[i][:min(doc_len, 50)]]) + "..."
raw_text_snippet_esc = raw_text_snippet.replace('"', '&quot;')  # escape for use in an HTML title attribute
# Drop duplicate PyTorch predictions while preserving chronological extraction order
unique_pred_tokens = list(dict.fromkeys(pred_tokens)) if pred_tokens else ["(None)"]
# Capture full token sequence for boundary visualization
token_details = []
for j in range(min(doc_len, 50)):
token_details.append({
"word": idx2word.get(seqs[i][j].item(), "<UNK>"),
"is_target": true_seq[j].item() == 1,
"is_pred": pred_seq[j].item() == 1
})
comparison_results.append({
"doc_id": doc_id,
"pylucene_phrases": doc_phrases[doc_id],
"pytorch_phrases": unique_pred_tokens,
"raw_doc": raw_text_snippet_esc,
"tokens": token_details
})
In the following code we present the results.
Code
rewards = np.array(rewards)
metrics_df = pd.DataFrame([
{"Metric": "Native Model Accuracy", "Value": f"{accuracy_score(all_trues, all_preds):.4f}"},
{"Metric": "Extraction Precision", "Value": f"{precision_score(all_trues, all_preds, zero_division=0):.4f}"},
{"Metric": "Extraction Recall", "Value": f"{recall_score(all_trues, all_preds, zero_division=0):.4f}"},
{"Metric": "Reward: Perfect Extractions [2]", "Value": str((rewards == 2).sum())},
{"Metric": "Reward: Partial Extractions [1]", "Value": str((rewards == 1).sum())},
{"Metric": "Reward: Failed Misses [0]", "Value": str((rewards == 0).sum())},
{"Metric": "Average Document Reward", "Value": f"{rewards.mean():.2f}"}
])
metrics_df
| Metric | Value | |
|---|---|---|
| 0 | Native Model Accuracy | 0.9477 |
| 1 | Extraction Precision | 0.1841 |
| 2 | Extraction Recall | 0.4811 |
| 3 | Reward: Perfect Extractions [2] | 48 |
| 4 | Reward: Partial Extractions [1] | 79 |
| 5 | Reward: Failed Misses [0] | 100 |
| 6 | Average Document Reward | 0.77 |
On this data, accuracy is not a very good metric because most tokens are not part of a phrase; recall and precision are better measures here. In those terms it is not a great model, but look at the rewards, particularly the partial extractions: the model is learning to extract phrases, though quite often it tags a little more than the target phrase.
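One cheap follow-up would be to trade precision against recall by sweeping the 0.5 decision threshold. A sketch on synthetic stand-ins for the flattened labels and sigmoid outputs (the evaluation above keeps only thresholded predictions, so the raw probabilities would need to be retained for this):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-ins: ~5% positive labels, like our phrase tokens
rng = np.random.default_rng(42)
trues = rng.random(10_000) < 0.05
probs = np.clip(0.4 * trues + 0.6 * rng.random(10_000), 0.0, 1.0)

# Lowering the threshold raises recall at the cost of precision
for thr in (0.3, 0.5, 0.7):
    preds = probs > thr
    print(thr,
          round(precision_score(trues, preds, zero_division=0), 2),
          round(recall_score(trues, preds, zero_division=0), 2))
```

Given that the model tends to over-tag, a higher threshold might claw back some of the 0.18 precision without losing too much recall.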
Phrases Extracted
Figure 7 presents the overlaps and extractions in more detail. I am quite happy with this outcome. Is it perfect? No, not by a long way, but it has had 30 seconds of training on a MacBook M5 with quite a small dataset. What it manages to do is very encouraging.
Code
from IPython.display import HTML
# 1. Build the Boundary Visualization
viz_html = """
<style>
.viz-container {
font-family: 'Segoe UI', sans-serif;
margin: 30px 0;
padding: 20px;
background: #fff;
border-radius: 12px;
box-shadow: 0 2px 15px rgba(0,0,0,0.05);
}
.legend {
display: flex;
gap: 20px;
margin-bottom: 20px;
font-size: 0.85rem;
font-weight: 600;
}
.legend-item { display: flex; align-items: center; gap: 6px; }
.box { width: 14px; height: 14px; border-radius: 3px; }
.doc-viz { margin-bottom: 25px; }
.doc-viz-title { font-weight: 700; font-size: 0.9rem; margin-bottom: 8px; color: #4b5563; }
.token-row { display: flex; flex-wrap: wrap; gap: 4px; }
.token {
padding: 2px 6px;
border-radius: 4px;
font-size: 0.85rem;
background: #f3f4f6;
color: #374151;
transition: transform 0.1s;
}
.token:hover { transform: scale(1.1); z-index: 10; }
/* Highlight Classes */
.t-lucene { background: #e0e7ff; color: #4338ca; border: 1px solid #c7d2fe; }
.t-pytorch { background: #fef3c7; color: #92400e; border: 1px solid #fde68a; }
.t-overlap { background: #d1fae5; color: #065f46; border: 1px solid #a7f3d0; font-weight: 700; }
</style>
<div class="viz-container">
<h3 style="margin-top:0">Phrase Extraction Boundaries (First 5 Docs)</h3>
<div class="legend">
<div class="legend-item"><div class="box t-lucene"></div> Lucene Target</div>
<div class="legend-item"><div class="box t-pytorch"></div> PyTorch Predicted</div>
<div class="legend-item"><div class="box t-overlap"></div> Organic Overlap</div>
</div>
"""
for row in comparison_results[:5]:
viz_html += f'<div class="doc-viz"><div class="doc-viz-title">Document: {row["doc_id"]}</div><div class="token-row">'
for t in row["tokens"]:
cls = ""
if t["is_target"] and t["is_pred"]: cls = "t-overlap"
elif t["is_target"]: cls = "t-lucene"
elif t["is_pred"]: cls = "t-pytorch"
viz_html += f'<span class="token {cls}">{t["word"]}</span>'
viz_html += '</div></div>'
viz_html += "</div>"
display(HTML(viz_html))
# 2. Build the Global Results Table
table_html = """
<style>
.extraction-table {
width: 100%;
border-collapse: separate;
border-spacing: 0;
font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
margin: 20px 0;
background: rgba(255, 255, 255, 0.8);
backdrop-filter: blur(8px);
border-radius: 12px;
overflow: hidden;
box-shadow: 0 4px 20px rgba(0, 0, 0, 0.08);
border: 1px solid #e1e4e8;
}
.extraction-table th {
background: linear-gradient(135deg, #6366f1, #4f46e5);
color: white;
padding: 16px;
text-align: left;
font-weight: 600;
font-size: 0.95rem;
letter-spacing: 0.02em;
border-bottom: 2px solid rgba(0,0,0,0.1);
}
.extraction-table td {
padding: 14px 16px;
border-bottom: 1px solid #f0f1f4;
vertical-align: top;
color: #374151;
font-size: 0.9rem;
line-height: 1.5;
}
.extraction-table tr:last-child td {
border-bottom: none;
}
.extraction-table tr:hover {
background-color: rgba(99, 102, 241, 0.03);
transition: background-color 0.2s ease;
}
.doc-id-cell {
font-weight: 700;
color: #4f46e5;
width: 15%;
position: relative;
cursor: help;
}
.phrase-list {
margin: 0;
padding: 0;
list-style: none;
}
.phrase-item {
margin-bottom: 6px;
display: flex;
align-items: center;
}
.phrase-item::before {
content: "•";
color: #6366f1;
font-weight: bold;
display: inline-block;
width: 1em;
margin-left: -1em;
}
.phrase-col {
padding-left: 20px !important;
}
</style>
<table class="extraction-table">
<thead>
<tr>
<th>Doc ID (Hover for context)</th>
<th>PyLucene Phrases (Target)</th>
<th>PyTorch Tags (Predicted)</th>
</tr>
</thead>
<tbody>
"""
for row in comparison_results:
lu_phrases = "".join([f'<div class="phrase-item">{p}</div>' for p in row['pylucene_phrases']])
pt_phrases = "".join([f'<div class="phrase-item">{p}</div>' for p in row['pytorch_phrases']])
table_html += f"""
<tr title="{row['raw_doc']}">
<td class="doc-id-cell">{row['doc_id']}</td>
<td class="phrase-col"><div class="phrase-list">{lu_phrases}</div></td>
<td class="phrase-col"><div class="phrase-list">{pt_phrases}</div></td>
</tr>
"""
table_html += "</tbody></table>"
display(HTML(table_html))
Phrase Extraction Boundaries (First 5 Docs)
| Doc ID (Hover for context) | PyLucene Phrases (Target) | PyTorch Tags (Predicted) |
|---|---|---|
| 12247 |
liam neeson
|
liam neeson
sleep night
|
| 7218 |
star war
|
star war movi
star war fan
william origin music
star war film
|
| 6260 |
x men
|
entri x men saga
prof x
x men
x men apocalyps
mutant have been
dai
where x men
x
littl too
|
| 10503 |
spider man wai home
|
life peter
life spider man
wors spider man
spider man
deal hi secret
peter seek out doctor strang benedict cumberbatch
peter parker spider man strang
him odd doctor strang
peter
hi
save dai film
deliv spider man
|
| 11127 |
quiet place dai on
|
quiet place dai
lupita nyong'o
alien invas
quiet place dai on film
|
| 11828 |
tom hank deliv
|
man
juxtaposit
insid
him
tom hank
i like
i could have
societi selfish
widow
|
| 8448 |
black adam
|
endors black adam
super
fare special effect
said
|
| 5480 |
i did
|
much anticip
have
i
i had hard time
|
| 12919 |
28 dai later
|
28 dai later franchis
|
| 8625 |
galaxi vol 3 farewel
|
guardian
guardian rocket bittersweet
sinc spider man
|
Conclusions
We should note that we are, to some extent, comparing apples and pears here. The PyLucene approach is rule based: it has an on-disk index size of 33MB and extracts phrases by looking across multiple documents, since to be significant a phrase must appear in more than one document. The PyTorch approach tries to extract phrases from a single document. It is an unfair comparison, but I was more interested in the idea of bootstrapping learning, and from that perspective it worked. The saved model is 10MB, which compares favourably with the 33MB index, but training used 10GB of RAM.
In general the model has converged and found phrases per document; it is a good start. The configuration has lots of parameters, and I suspect moving the model to a different style of written text would require retraining. In contrast, PyLucene just looks at the collocation of terms in the text; it is more resilient, deterministic and faster, but it needs a background index for comparison.
The PyTorch model has done surprisingly well and we should be encouraged by the results; doing this well with just 30s of training and quite a small dataset demonstrates the power of learning, and this could be transferred. I did experiment briefly with both the Whitespace and Standard Lucene Analyzers. The results with the English Analyzer were the best: the inclusion of stemming and the removal of stop words are clearly beneficial for convergence, given the training time and the memory limitations of my laptop.