flowchart TB
A["CSV Reviews"] --> B["Lucene Index<br/>(TMDB)"]
B --> C["IndexStatsExtractor<br/>(global TF-IDF)"]
B --> D["SearchStatsExtractor<br/>(per-query stats)"]
C --> E["PhraseBuilder"]
D --> E
E --> F["all_phrases_df<br/>(phrase, doc_ids)"]
F --> G["Sequence Labelling<br/>(TermVectors)"]
G --> H["BiLSTM PhraseTagger<br/>(REINFORCE + BCE)"]
H --> I["Evaluation<br/>(Precision / Recall / Reward)"]
Learning Phrases with PyLucene and Pytorch, part 2.
In part 2 we reuse our tokenised index and use PyTorch to build a model for significant phrase extraction. It worked surprisingly well, and being able to switch Analyzers proved useful: the English Analyzer, with stopword removal and stemming, worked best.
The results are indicative; neither the dataset size nor the length of the training cycles is sufficient to develop a generalised phrase extractor, but the success and overlap found between PyLucene and PyTorch is very encouraging. We just need to scale it up.
This time, we have abstracted away the Lucene code from part 1 [1] into a set of classes that extract significant phrases from search results and the index. We will use those phrases and see how we get on building a PyTorch model to extract phrases directly.
The pipeline has two distinct phases, see Figure 1. First, PyLucene extracts statistically significant phrases from a search result set. Second, those phrases are used as training labels to teach a PyTorch model to recognise similar phrases directly from raw token sequences.
Extracting Significant Phrases with Lucene
This time we have switched the dataset to TMDB, another movie-reviews dataset with 10,000 movies, 150k cast, 63k crew, and 80k user reviews. Importantly for this test it has the movie title as a separate field, which makes it a little easier to generate lots of phrases: in the Lucene approach we can use the movie title as a query term and generate the significant phrases for each movie.
A limitation of this dataset is that it is quite small per movie; many movies have only a few reviews.
Classes
The abstracted classes [1] greatly simplify the PyLucene phrase extraction pipeline.
classDiagram
class ConfigurableIndexer {
+dict index_config
+add_document(dict)
}
class IndexStatsExtractor {
+dict index_config
+str field_name
+extract() DataFrame
}
class SearchStatsExtractor {
+dict index_config
+str field_name
+IndexReader reader
+extract(TopDocs hits) DataFrame
}
class PhraseBuilder {
+DataFrame index_stats_df
+DataFrame search_stats_df
+IndexReader reader
+build_phrases(max_slop, num_significant_terms) DataFrame
}
ConfigurableIndexer ..> IndexStatsExtractor : writes index
IndexStatsExtractor ..> PhraseBuilder : index_stats_df
SearchStatsExtractor ..> PhraseBuilder : search_stats_df
Phrase Extraction Algorithm
The phrase extraction algorithm is presented in Figure 3.
flowchart TD
A["IndexStatsExtractor<br/>extract()"] --> C["Join on term<br/>compute tfidf delta"]
B["SearchStatsExtractor<br/>extract(hits)"] --> C
C --> D{"tfidf higher<br/>in search than index?"}
D -- No --> E["Drop term"]
D -- Yes --> F["Keep as significant term"]
F --> G["Group consecutive terms<br/>within max_slop positions"]
G --> H{"Gap ≤ max_slop?"}
H -- Yes --> I["Fill gap with '???'"]
H -- No --> J["End phrase"]
I --> G
J --> K["Resolve '???' via<br/>storedFields byte offsets"]
K --> L["resolved_text<br/>phrases DataFrame"]
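The core significance test in the flow above is just a join and a comparison. Here is a minimal pandas sketch with toy data; the `term` and `tfidf` column names are illustrative, not the extractors' actual schema:

```python
import pandas as pd

# Toy stand-ins for the two extractors' outputs
index_stats_df = pd.DataFrame({"term": ["film", "trench"], "tfidf": [0.9, 0.1]})
search_stats_df = pd.DataFrame({"term": ["film", "trench"], "tfidf": [0.8, 0.7]})

significant = (
    search_stats_df.merge(index_stats_df, on="term", suffixes=("_search", "_index"))
    .assign(delta=lambda d: d.tfidf_search - d.tfidf_index)
    .query("delta > 0")  # tf-idf higher in the result set than globally
)
print(significant.term.tolist())  # ['trench']
```

A term like "film" is frequent everywhere, so its delta is negative and it is dropped; "trench" is rare globally but common in this result set, so it survives.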
Whereas in part 1 we programmatically defined Lucene documents and fields, all of this is now configurable. It would be a simple matter to extend this to daily indexing by templating the config file. For now, though, the index is not that big at 33MB. The index configuration is as follows.
index_config = {
"directory": "./index_tmdb",
"default_analyzer": "keyword", # Whole doc as a single token
"fields": [
{
"name": "movie_title",
"options": "docs"
},
{
"name": "content",
"analyzer": "english", # Standard English analyzer
"options": "docs_freqs_positions_offsets"
}
]
}
The phrase extraction configuration is also presented; later we will do similar for the PyTorch model. It will be interesting to compare the complexities of the two approaches.
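As a sketch of the daily-indexing idea mentioned above, the directory could be templated per date. Note that `daily_index_config` is a hypothetical helper, not part of the library:

```python
from datetime import date

def daily_index_config(base_config: dict, day: date) -> dict:
    """Return a copy of the index config pointing at a per-day directory."""
    cfg = dict(base_config)
    cfg["directory"] = f"./index_tmdb_{day:%Y%m%d}"
    return cfg

cfg = daily_index_config({"directory": "./index_tmdb"}, date(2026, 3, 30))
print(cfg["directory"])  # ./index_tmdb_20260330
```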
phrases_config = {
"min_doc_freq": 5, # minimum document frequency for significant phrase
"max_slop": 2, # maximum slop for significant phrase
"num_significant_terms": 200, # number of significant terms to extract
"max_hits": 500 # maximum number search results per movie
}
Implementation
First we import the required classes.
Code
import pandas as pd
import numpy as np
from pathlib import Path
from acumed.search.index import (
ConfigurableIndexer,
IndexStatsExtractor
)
from acumed.search.searcher import SearchStatsExtractor
from acumed.search.phrases import PhraseBuilder
from org.apache.lucene.index import Term
from org.apache.lucene.search import (
IndexSearcher,
TermQuery,
BooleanQuery,
BooleanClause
)
WARNING: Using incubator modules: jdk.incubator.vector
Mar 30, 2026 4:12:37 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
We index the TMDB reviews. It is small, but still large enough to be interesting.
Code
%%time
index_config = {
"directory": "./index_tmdb",
"default_analyzer": "keyword",
"fields": [
{
"name": "movie_title",
"options": "docs"
},
{
"name": "content",
"analyzer": "english",
"options": "docs_freqs_positions_offsets"
}
]
}
if not Path(index_config["directory"]).exists():
# Index each review in the dataframe
tmdb_df = pd.read_csv("data/tmdb/reviews.csv")
with ConfigurableIndexer(index_config) as indexer:
def index_review(tpl):
indexer.add_document({
"movie_title": tpl.movie_title,
"content": tpl.content
})
return tpl
tmdb_df.apply(index_review, axis=1)
CPU times: user 2.87 s, sys: 130 ms, total: 3 s
Wall time: 1.82 s
As before we can extract the significant phrases from the index.
Code
%%time
phrases_config = {
"min_doc_freq": 5, # minimum document frequency for significant phrase
"max_slop": 2, # maximum slop for significant phrase
"num_significant_terms": 200, # number of significant terms to extract
"max_hits": 500 # maximum number search results per movie
}
with IndexStatsExtractor(index_config, "content") as ex:
index_stats_df = ex.extract()
with SearchStatsExtractor(index_config, "content") as ex:
titles = (
pd.read_csv("data/tmdb/reviews.csv")[['movie_title', 'movie_id']]
.groupby("movie_title")
.count()
.loc[lambda x: x.movie_id >= phrases_config["min_doc_freq"]] # at least min_doc_freq reviews
.reset_index()[['movie_title']]
.drop_duplicates()
)
searcher = IndexSearcher(ex.reader)
def f(tpl):
movie_title = tpl.movie_title
query = (
BooleanQuery.Builder()
.add(TermQuery(Term('movie_title',movie_title)), BooleanClause.Occur.MUST)
).build()
hits = searcher.search(query, phrases_config["max_hits"])
search_stats_df = ex.extract(hits)
pb = PhraseBuilder(index_stats_df, search_stats_df, ex.reader)
phrases_df = pb.build_phrases(max_slop=phrases_config["max_slop"], num_significant_terms=phrases_config["num_significant_terms"])
phrases_df['movie_title'] = movie_title
return phrases_df
# Extract significant phrases for every qualifying movie title
all_phrases_df = titles.apply(f, axis=1)
all_phrases_df = pd.concat(all_phrases_df.tolist())
all_phrases_df.head(8)
CPU times: user 21.6 s, sys: 97.1 ms, total: 21.7 s
Wall time: 19 s
| phrase | resolved_text | doc_ids | nos_docs | movie_title | |
|---|---|---|---|---|---|
| 0 | 13 go 30 | 13 go 30 | [3076, 3077, 3079] | 3 | 13 Going on 30 |
| 0 | first world war | first world war | [9584, 9596] | 2 | 1917 |
| 1 | man land | man land | [9582, 9584] | 2 | 1917 |
| 2 | other film | other film | [9585, 9595] | 2 | 1917 |
| 0 | 28 dai later | 28 dai later | [337, 340] | 2 | 28 Days Later |
| 1 | cillian murphi | cillian murphi | [340, 343] | 2 | 28 Days Later |
| 0 | 28 year later | 28 year later | [12505, 12508, 12509] | 3 | 28 Years Later |
| 1 | all too | all too | [12507, 12508] | 2 | 28 Years Later |
Learning Phrases with PyTorch via the Lucene Index
We now use the significant phrases extracted from the Lucene index as training data for a PyTorch model.
The task is token classification: given a sequence of words from a document, predict which tokens belong to a significant phrase. The natural candidates, in rough order of complexity, are:
| Model | Memory | Hardware | Notes |
|---|---|---|---|
| Logistic Regression | \(O(N)\) | CPU | No sequential context |
| Bidirectional LSTM [2] | \(O(N)\) | CPU / MPS | Good context, efficient |
| BERT / RoBERTa [3] | \(O(N^2)\) | GPU (≥ 8 GB VRAM) | State-of-the-art, memory intensive |
| Longformer [4] | \(O(N \sqrt{N})\) | High-end GPU | Designed for long documents |
A Transformer such as BERT would likely give the best recall, but its self-attention mechanism scales quadratically with sequence length. On a MacBook Pro with Apple Silicon, batches of 1024-token sequences caused kernel out-of-memory crashes in initial experiments. The Bidirectional LSTM [2,5] scales linearly and runs stably on Apple MPS — a practical choice when GPU memory is limited.
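A back-of-the-envelope estimate shows why: each attention head materialises a seq_len × seq_len score matrix, so at our batch size the scores alone run to gigabytes per layer. The 12 heads are an assumption (BERT-base style), not something measured here:

```python
# Rough fp32 memory for self-attention score matrices at our settings
seq_len, batch, heads = 1024, 32, 12  # heads assumed, BERT-base style

# Each head stores a seq_len x seq_len score matrix, 4 bytes per float
attn_bytes = seq_len * seq_len * 4 * heads * batch
print(f"{attn_bytes / 2**30:.1f} GiB per attention layer")  # 1.5 GiB per attention layer
```

Multiply by a dozen layers, plus activations and gradients, and a laptop GPU runs out of memory quickly; the BiLSTM's per-step state avoids this entirely.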
The training objective combines two signals:
- First, a Binary Cross-Entropy loss [6] with a high positive class weight (pos_weight = 25) to compensate for label imbalance (phrase tokens are typically less than 5% of a document).
- Second, a REINFORCE [7] policy gradient that rewards the model at the document level using the 0 / 1 / 2 scoring scheme, encouraging complete phrase recovery rather than isolated token hits.
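The 0 / 1 / 2 scoring scheme can be sketched as a plain function over sets of token indices; this is a simplified, unbatched version of the logic used in training and evaluation:

```python
def document_reward(true_indices: set, pred_indices: set) -> float:
    """0 = no overlap, 1 = partial recovery, 2 = all phrase tokens recovered."""
    if not true_indices or not true_indices & pred_indices:
        return 0.0
    return 2.0 if true_indices <= pred_indices else 1.0

print(document_reward({3, 4, 5}, {4, 5}))        # 1.0 (partial)
print(document_reward({3, 4, 5}, {3, 4, 5, 9}))  # 2.0 (full recall, extras allowed)
```

Note the reward only checks recall of the target tokens; over-prediction is not penalised here, which the BCE term is left to handle.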
For each document that has at least one target phrase, we read its term vector directly from the Lucene index, reconstruct the original token sequence ordered by position, then build a binary label vector marking which token positions belong to a target phrase, see Figure 4.
---
config:
themeVariables:
fontSize: 10
---
sequenceDiagram
participant PD as all_phrases_df
participant TV as Lucene TermVectors
participant SP as Sequence Prep
participant DS as SequenceDataset
PD->>SP: doc_phrases dict {doc_id: [phrases]}
loop For each doc_id
SP->>TV: reader.termVectors().get(doc_id)
TV-->>SP: terms + byte positions
SP->>SP: Sort by position, truncate to max_seq_len
SP->>SP: Slide phrase tokens over sequence → seq_labels [0/1]
SP->>SP: Map tokens → vocab integer IDs
SP->>DS: append (tensor, labels, doc_id)
end
DS->>DS: 75/25 train/test split
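The phrase-to-label matching, including the '???' wildcard left by slop gaps, is a simple sliding window. A standalone sketch of that step, simplified from the full listing below:

```python
def label_tokens(seq_tokens: list, phrase: str) -> list:
    """Mark 1 for every token covered by a phrase match; '???' matches any token."""
    pt = phrase.split()
    labels = [0] * len(seq_tokens)
    for i in range(len(seq_tokens) - len(pt) + 1):
        if all(p == "???" or p == t for p, t in zip(pt, seq_tokens[i:i + len(pt)])):
            for j in range(len(pt)):
                labels[i + j] = 1
    return labels

print(label_tokens(["first", "world", "war", "poem"], "first ??? war"))  # [1, 1, 1, 0]
```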
As before, all tuneable hyperparameters are consolidated into a single dict at the top of the PyTorch code section. The task involves highly imbalanced data (only a few tokens are part of a phrase) and long documents, so a few hyperparameters are particularly critical:
- pos_weight (25.0): the most important parameter. It tells the Binary Cross-Entropy loss to treat a “missed” phrase token as 25 times more costly than a “false alarm” non-phrase token. Without this, the model would simply predict “0” for everything and still achieve 95%+ accuracy.
- pg_weight (0.1): balances the two training signals. A value of 0.1 lets the REINFORCE reward signal guide the model toward complete phrases without overwhelming the token-level BCE supervision.
- max_seq_len (1024): limits how much of a document the model “sees” at once. While LSTMs scale linearly, processing very long sequences still consumes significant memory.
- hidden_dim (128): the “capacity” of the LSTM. 128 units lets the model remember enough contextual interaction between words to distinguish significant phrases from common collocations.
# All tuneable hyperparameters in one place
model_config = {
"max_seq_len": 1024, # Token truncation limit per document (avoids O(N²) memory in long docs)
"embed_dim": 128, # Embedding vector size
"hidden_dim": 128, # LSTM hidden units per direction
"num_layers": 2, # Stacked BiLSTM depth
"dropout": 0.2, # Dropout between LSTM layers
"pos_weight": 25.0, # BCE weight for phrase tokens (boosts Recall on sparse labels)
"lr": 0.001, # Adam learning rate
"batch_size": 32, # Mini-batch size
"epochs": 6, # Training epochs
"pg_weight": 0.1, # Policy-gradient loss coefficient relative to BCE
"test_size": 0.25, # Held-out fraction for evaluation
"random_state": 42, # Reproducible train/test split
"n_comparison_docs": 10, # Max documents displayed in visual comparison
}
In the following code we use an IndexReader directly to translate from TermVectors to tensors.
Code
%%time
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence
import numpy as np
# All tuneable hyperparameters in one place
model_config = {
"max_seq_len": 1024, # Token truncation limit per document (avoids O(N²) memory in long docs)
"embed_dim": 128, # Embedding vector size
"hidden_dim": 128, # LSTM hidden units per direction
"num_layers": 2, # Stacked BiLSTM depth
"dropout": 0.2, # Dropout between LSTM layers
"pos_weight": 25.0, # BCE weight for phrase tokens (boosts Recall on sparse labels)
"lr": 0.001, # Adam learning rate
"batch_size": 32, # Mini-batch size
"epochs": 6, # Training epochs
"pg_weight": 0.1, # Policy-gradient loss coefficient relative to BCE
"test_size": 0.25, # Held-out fraction for evaluation
"random_state": 42, # Reproducible train/test split
"n_comparison_docs": 10, # Max documents displayed in visual comparison
}
# 1. Prepare Target Phrases per Document
doc_phrases = (
all_phrases_df[['phrase', 'doc_ids']]
.explode('doc_ids')
.drop_duplicates()
.groupby('doc_ids')['phrase']
.apply(list)
.to_dict()
)
# 2. Extract Sequential Data from Lucene TermVectors
from org.apache.lucene.util import BytesRefIterator
from acumed.search.searcher import SearchStatsExtractor
vocab = {"<PAD>": 0, "<UNK>": 1}
sequences = []
labels = []
doc_ids_list = []
# Open the reader to loop the exact sequential positions per document
with SearchStatsExtractor(index_config, "content") as ex:
reader = ex.reader
for doc_id, target_phrases in doc_phrases.items():
vector = reader.termVectors().get(doc_id)
if vector is None:
continue
termsEnum = vector.terms("content")
if termsEnum is None:
continue
te = termsEnum.iterator()
term_positions = []
# Rebuild the document's token order from the stored term positions
for term in BytesRefIterator.cast_(te):
term_str = term.utf8ToString()
postings = te.postings(None)
postings.nextDoc()
freq = postings.freq()
for _ in range(freq):
pos = postings.nextPosition()
term_positions.append((pos, term_str))
term_positions.sort(key=lambda x: x[0])
# Truncate long documents: Transformers scale O(N²), BiLSTM is O(N) but still memory-bounded
seq_tokens = [t for p, t in term_positions][:model_config["max_seq_len"]]
if not seq_tokens:
continue
# Build binary labels by matching target phrases onto token positions
seq_labels = [0] * len(seq_tokens)
for phrase in target_phrases:
phrase_tokens = phrase.split()
p_len = len(phrase_tokens)
# Slide each phrase over the sequence; '???' wildcards match any token
for i in range(len(seq_tokens) - p_len + 1):
match = True
for j in range(p_len):
if phrase_tokens[j] != '???' and phrase_tokens[j] != seq_tokens[i+j]:
match = False
break
if match:
for j in range(p_len):
seq_labels[i+j] = 1
# Map tokens to vocabulary integer IDs
enc_seq = []
for t in seq_tokens:
if t not in vocab:
vocab[t] = len(vocab)
enc_seq.append(vocab[t])
sequences.append(torch.tensor(enc_seq, dtype=torch.long))
labels.append(torch.tensor(seq_labels, dtype=torch.float))
doc_ids_list.append(doc_id)
CPU times: user 1.35 s, sys: 165 ms, total: 1.52 s
Wall time: 2.2 s
PyTorch Datasets
Next we create our standard DataLoaders with padded sequences and randomly split into train and test sets in the ratio 3:1.
Code
class SequenceDataset(Dataset):
def __init__(self, sequences, labels, doc_ids):
self.sequences = sequences
self.labels = labels
self.doc_ids = doc_ids
def __len__(self):
return len(self.sequences)
def __getitem__(self, idx):
return self.sequences[idx], self.labels[idx], self.doc_ids[idx]
def collate_fn(batch):
seqs, lbls, dids = zip(*batch)
seqs_padded = pad_sequence(seqs, batch_first=True, padding_value=vocab["<PAD>"])
lbls_padded = pad_sequence(lbls, batch_first=True, padding_value=0.0)
return seqs_padded, lbls_padded, torch.tensor(dids)
# Split train/test at the document level
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
sequences, labels, doc_ids_list,
test_size=model_config["test_size"],
random_state=model_config["random_state"]
)
train_dataset = SequenceDataset(X_train, y_train, id_train)
test_dataset = SequenceDataset(X_test, y_test, id_test)
train_loader = DataLoader(train_dataset, batch_size=model_config["batch_size"], shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=model_config["batch_size"], shuffle=False, collate_fn=collate_fn)
Bidirectional LSTM Evaluation
Each token embedding is passed through a stack of bidirectional LSTM layers so that every position can see both preceding and following context before a linear head produces a per-token phrase probability.
flowchart TD
A["Input token IDs<br/>(batch × seq_len)"] --> B["nn.Embedding<br/>(vocab_size → embed_dim)"]
B --> C["nn.LSTM<br/>(bidirectional, num_layers)<br/>embed_dim → hidden_dim × 2"]
C --> D["nn.Linear<br/>(hidden_dim × 2 → 1)"]
D --> E["Sigmoid → probability per token"]
E --> F{"threshold 0.5"}
F -->|"≥ 0.5"| G["Tag = 1 (phrase token)"]
F -->|"< 0.5"| H["Tag = 0 (non-phrase)"]
Code
%%time
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
class PhraseTagger(nn.Module):
def __init__(self, vocab_size,
embed_dim=model_config["embed_dim"],
hidden_dim=model_config["hidden_dim"],
num_layers=model_config["num_layers"]):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
bidirectional=True, batch_first=True,
dropout=model_config["dropout"])
self.linear = nn.Linear(hidden_dim * 2, 1)
def forward(self, x):
embedded = self.embedding(x)
lstm_out, _ = self.lstm(embedded)
logits = self.linear(lstm_out).squeeze(-1)
return logits
model = PhraseTagger(vocab_size=len(vocab)).to(device)
pos_weight = torch.tensor([model_config["pos_weight"]], device=device)
criterion = nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_weight)
optimizer = torch.optim.Adam(model.parameters(), lr=model_config["lr"])
# Training loop: supervised BCE plus a REINFORCE policy gradient
epochs = model_config["epochs"]
model.train()
for epoch in range(epochs):
total_loss = 0
total_bce = 0
total_pg = 0
for seqs, lbls, _ in train_loader:
optimizer.zero_grad()
seqs, lbls = seqs.to(device), lbls.to(device)
logits = model(seqs)
probs = torch.sigmoid(logits)
mask = (seqs != vocab["<PAD>"]).float()
# 1. Supervised BCE loss (stabilises early training)
bce_loss_raw = criterion(logits, lbls)
bce_loss = (bce_loss_raw * mask).sum() / max(mask.sum(), 1)
# 2. Reinforcement Learning Feedback (REINFORCE Algorithm)
# Sample a 0/1 tag per token from the predicted Bernoulli distribution
m = torch.distributions.Bernoulli(probs)
actions = m.sample()
log_probs = m.log_prob(actions)
batch_rewards = []
for i in range(len(lbls)):
doc_len = mask[i].sum().int().item()
pred_seq = actions[i][:doc_len]
true_seq = lbls[i][:doc_len]
true_indices = torch.where(true_seq == 1)[0].cpu().numpy()
pred_indices = torch.where(pred_seq == 1)[0].cpu().numpy()
if len(true_indices) > 0 and len(pred_indices) == 0:
r = 0.0
elif len(np.intersect1d(true_indices, pred_indices)) > 0:
doc_recall = len(np.intersect1d(true_indices, pred_indices)) / max(len(true_indices), 1)
r = 2.0 if doc_recall == 1.0 else 1.0
else:
r = 0.0
batch_rewards.append(r)
reward_tensor = torch.tensor(batch_rewards, device=device, dtype=torch.float)
# Subtract the batch-mean reward as a baseline to reduce gradient variance
advantages = reward_tensor - reward_tensor.mean()
# Sum log-probs over real (unpadded) tokens, normalised by the longest sequence in the batch
masked_log_probs = (
(log_probs * mask).sum(dim=1) / max(mask.sum(dim=1).max(), 1)
)
pg_loss = - (masked_log_probs * advantages).mean()
# Combined loss: BCE handles token-level imbalance, PG rewards whole-phrase recovery
loss = bce_loss + model_config["pg_weight"] * pg_loss
loss.backward()
optimizer.step()
total_loss += loss.item()
total_bce += bce_loss.item()
total_pg += pg_loss.item()
avg_loss = total_loss/max(len(train_loader), 1)
avg_bce = total_bce/max(len(train_loader), 1)
avg_pg = total_pg/max(len(train_loader), 1)
print(f"Epoch {epoch+1}/{epochs} | Total Loss: {avg_loss:.4f} (BCE: {avg_bce:.4f}, PG: {avg_pg:.4f})")
torch.save(model.state_dict(), "local/models/phrases_pytorch.bin")
Epoch 1/6 | Total Loss: 0.9121 (BCE: 0.9110, PG: 0.0112)
Epoch 2/6 | Total Loss: 0.7458 (BCE: 0.7455, PG: 0.0031)
Epoch 3/6 | Total Loss: 0.5821 (BCE: 0.5821, PG: -0.0004)
Epoch 4/6 | Total Loss: 0.4133 (BCE: 0.4133, PG: -0.0004)
Epoch 5/6 | Total Loss: 0.2790 (BCE: 0.2791, PG: -0.0012)
Epoch 6/6 | Total Loss: 0.1977 (BCE: 0.1978, PG: -0.0001)
CPU times: user 11 s, sys: 2.4 s, total: 13.4 s
Wall time: 27.8 s
That is it: training complete, 6 epochs in a little under 30 seconds. The total loss is decreasing at each epoch, which is a good sign. This isn't a big training set or exhaustive training, and the results are unlikely to generalise to other document sets. The question we are trying to answer here is: can it work at all?
Final Evaluation Metrics
We compute the standard token-level metrics (accuracy, precision and recall, with the emphasis on recall given the sparse labels), alongside the document-level 0 / 1 / 2 reward that explicitly scores how completely each document's phrases were recovered.
flowchart TD
A["Model logits<br/>(test set)"] --> B["Sigmoid threshold 0.5<br/>binary predictions"]
B --> C["Flatten all predictions<br/>& true labels"]
C --> D["Standard Metrics<br/>Accuracy / Precision / Recall"]
B --> E["Per-document phrase<br/>boundary check"]
E --> F{"Any target<br/>phrase tokens?"}
F -- No --> G["Reward = 0<br/>(hard fail)"]
F -- Yes --> H{"Predicted tokens<br/>overlap targets?"}
H -- No --> G
H -- Yes --> I{"100% recall<br/>for this doc?"}
I -- Yes --> J["Reward = 2<br/>(perfect extraction)"]
I -- No --> K["Reward = 1<br/>(partial extraction)"]
J --> L["Average Reward"]
K --> L
G --> L
Code
from sklearn.metrics import accuracy_score, precision_score, recall_score
model.eval()
idx2word = {v: k for k, v in vocab.items()}
comparison_results = []
plot_data = []
all_preds = []
all_trues = []
rewards = []
with torch.no_grad():
for seqs, lbls, dids in test_loader:
seqs, lbls = seqs.to(device), lbls.to(device)
logits = model(seqs)
preds = (torch.sigmoid(logits) > 0.5).float()
# Exclude padding positions from the metrics
mask = (seqs != vocab["<PAD>"])
all_preds.extend(preds[mask].cpu().numpy())
all_trues.extend(lbls[mask].cpu().numpy())
# Compute the document-level 0/1/2 recovery reward
for i in range(len(dids)):
doc_id = dids[i].item()
doc_len = mask[i].sum().int().item()
true_seq = lbls[i][:doc_len]
pred_seq = preds[i][:doc_len]
true_indices = torch.where(true_seq == 1)[0].cpu().numpy()
pred_indices = torch.where(pred_seq == 1)[0].cpu().numpy()
# Assign the 0 / 1 / 2 document score
if len(true_indices) > 0 and len(pred_indices) == 0:
rewards.append(0)
elif len(np.intersect1d(true_indices, pred_indices)) > 0:
# Fraction of target phrase tokens the model recovered
doc_recall = len(np.intersect1d(true_indices, pred_indices)) / max(len(true_indices), 1)
if doc_recall == 1.0:
rewards.append(2) # Score 2: every target token recovered
else:
rewards.append(1) # Score 1: partial recovery
else:
rewards.append(0) # Score 0: no overlap with targets
# Collect examples comparing Lucene target phrases with PyTorch predictions
if len(comparison_results) < model_config["n_comparison_docs"] and doc_id in doc_phrases:
pred_tokens = []
current_phrase = []
for j in range(doc_len):
if pred_seq[j] == 1:
current_phrase.append(idx2word.get(seqs[i][j].item(), "<UNK>"))
else:
if current_phrase:
pred_tokens.append(" ".join(current_phrase))
current_phrase = []
if current_phrase:
pred_tokens.append(" ".join(current_phrase))
raw_text_snippet = " ".join([idx2word.get(x.item(), "<UNK>") for x in seqs[i][:min(doc_len, 50)]]) + "..."
raw_text_snippet_esc = raw_text_snippet.replace('"', '&quot;')  # escape for use in an HTML title attribute
# Drop duplicate PyTorch predictions while preserving chronological extraction order
unique_pred_tokens = list(dict.fromkeys(pred_tokens)) if pred_tokens else ["(None)"]
# Capture full token sequence for boundary visualization
token_details = []
for j in range(min(doc_len, 50)):
token_details.append({
"word": idx2word.get(seqs[i][j].item(), "<UNK>"),
"is_target": true_seq[j].item() == 1,
"is_pred": pred_seq[j].item() == 1
})
comparison_results.append({
"doc_id": doc_id,
"pylucene_phrases": doc_phrases[doc_id],
"pytorch_phrases": unique_pred_tokens,
"raw_doc": raw_text_snippet_esc,
"tokens": token_details
})
In the following code we present the results.
Code
rewards = np.array(rewards)
metrics_df = pd.DataFrame([
{"Metric": "Native Model Accuracy", "Value": f"{accuracy_score(all_trues, all_preds):.4f}"},
{"Metric": "Extraction Precision", "Value": f"{precision_score(all_trues, all_preds, zero_division=0):.4f}"},
{"Metric": "Extraction Recall", "Value": f"{recall_score(all_trues, all_preds, zero_division=0):.4f}"},
{"Metric": "Reward: Perfect Extractions [2]", "Value": str((rewards == 2).sum())},
{"Metric": "Reward: Partial Extractions [1]", "Value": str((rewards == 1).sum())},
{"Metric": "Reward: Failed Misses [0]", "Value": str((rewards == 0).sum())},
{"Metric": "Average Document Reward", "Value": f"{rewards.mean():.2f}"}
])
metrics_df
| Metric | Value | |
|---|---|---|
| 0 | Native Model Accuracy | 0.9477 |
| 1 | Extraction Precision | 0.1841 |
| 2 | Extraction Recall | 0.4811 |
| 3 | Reward: Perfect Extractions [2] | 48 |
| 4 | Reward: Partial Extractions [1] | 79 |
| 5 | Reward: Failed Misses [0] | 100 |
| 6 | Average Document Reward | 0.77 |
On this data, accuracy is not a very good metric because most tokens are not part of a phrase; recall and precision are better measures here. In those terms it is not a great model, but look at the rewards, particularly the partial extractions: the model is learning to extract phrases, though quite often it tags a little more than the target phrase.
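One cheap follow-up would be to trade precision against recall by sweeping the 0.5 decision threshold. A sketch on synthetic stand-ins for the flattened labels and sigmoid outputs (the evaluation above keeps only thresholded predictions, so the raw probabilities would need to be retained for this):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-ins: ~5% positive labels, like our phrase tokens
rng = np.random.default_rng(42)
trues = rng.random(10_000) < 0.05
probs = np.clip(0.4 * trues + 0.6 * rng.random(10_000), 0.0, 1.0)

# Lowering the threshold raises recall at the cost of precision
for thr in (0.3, 0.5, 0.7):
    preds = probs > thr
    print(thr,
          round(precision_score(trues, preds, zero_division=0), 2),
          round(recall_score(trues, preds, zero_division=0), 2))
```

Given that the model tends to over-tag, a higher threshold might claw back some of the 0.18 precision without losing too much recall.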
Phrases Extracted
Figure 7 presents the overlaps and extractions in more detail. I am quite happy with this outcome. Is it perfect? No, not by a long way, but it has had 30 seconds of training on a MacBook M5 with quite a small dataset. What it manages to do is very encouraging.
Code
from IPython.display import HTML
# 1. Build the Boundary Visualization
viz_html = """
<style>
.viz-container {
font-family: 'Segoe UI', sans-serif;
margin: 30px 0;
padding: 20px;
background: #fff;
border-radius: 12px;
box-shadow: 0 2px 15px rgba(0,0,0,0.05);
}
.legend {
display: flex;
gap: 20px;
margin-bottom: 20px;
font-size: 0.85rem;
font-weight: 600;
}
.legend-item { display: flex; align-items: center; gap: 6px; }
.box { width: 14px; height: 14px; border-radius: 3px; }
.doc-viz { margin-bottom: 25px; }
.doc-viz-title { font-weight: 700; font-size: 0.9rem; margin-bottom: 8px; color: #4b5563; }
.token-row { display: flex; flex-wrap: wrap; gap: 4px; }
.token {
padding: 2px 6px;
border-radius: 4px;
font-size: 0.85rem;
background: #f3f4f6;
color: #374151;
transition: transform 0.1s;
}
.token:hover { transform: scale(1.1); z-index: 10; }
/* Highlight Classes */
.t-lucene { background: #e0e7ff; color: #4338ca; border: 1px solid #c7d2fe; }
.t-pytorch { background: #fef3c7; color: #92400e; border: 1px solid #fde68a; }
.t-overlap { background: #d1fae5; color: #065f46; border: 1px solid #a7f3d0; font-weight: 700; }
</style>
<div class="viz-container">
<h3 style="margin-top:0">Phrase Extraction Boundaries (First 5 Docs)</h3>
<div class="legend">
<div class="legend-item"><div class="box t-lucene"></div> Lucene Target</div>
<div class="legend-item"><div class="box t-pytorch"></div> PyTorch Predicted</div>
<div class="legend-item"><div class="box t-overlap"></div> Organic Overlap</div>
</div>
"""
for row in comparison_results[:5]:
viz_html += f'<div class="doc-viz"><div class="doc-viz-title">Document: {row["doc_id"]}</div><div class="token-row">'
for t in row["tokens"]:
cls = ""
if t["is_target"] and t["is_pred"]: cls = "t-overlap"
elif t["is_target"]: cls = "t-lucene"
elif t["is_pred"]: cls = "t-pytorch"
viz_html += f'<span class="token {cls}">{t["word"]}</span>'
viz_html += '</div></div>'
viz_html += "</div>"
display(HTML(viz_html))
# 2. Build the Global Results Table
table_html = """
<style>
.extraction-table {
width: 100%;
border-collapse: separate;
border-spacing: 0;
font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
margin: 20px 0;
background: rgba(255, 255, 255, 0.8);
backdrop-filter: blur(8px);
border-radius: 12px;
overflow: hidden;
box-shadow: 0 4px 20px rgba(0, 0, 0, 0.08);
border: 1px solid #e1e4e8;
}
.extraction-table th {
background: linear-gradient(135deg, #6366f1, #4f46e5);
color: white;
padding: 16px;
text-align: left;
font-weight: 600;
font-size: 0.95rem;
letter-spacing: 0.02em;
border-bottom: 2px solid rgba(0,0,0,0.1);
}
.extraction-table td {
padding: 14px 16px;
border-bottom: 1px solid #f0f1f4;
vertical-align: top;
color: #374151;
font-size: 0.9rem;
line-height: 1.5;
}
.extraction-table tr:last-child td {
border-bottom: none;
}
.extraction-table tr:hover {
background-color: rgba(99, 102, 241, 0.03);
transition: background-color 0.2s ease;
}
.doc-id-cell {
font-weight: 700;
color: #4f46e5;
width: 15%;
position: relative;
cursor: help;
}
.phrase-list {
margin: 0;
padding: 0;
list-style: none;
}
.phrase-item {
margin-bottom: 6px;
display: flex;
align-items: center;
}
.phrase-item::before {
content: "•";
color: #6366f1;
font-weight: bold;
display: inline-block;
width: 1em;
margin-left: -1em;
}
.phrase-col {
padding-left: 20px !important;
}
</style>
<table class="extraction-table">
<thead>
<tr>
<th>Doc ID (Hover for context)</th>
<th>PyLucene Phrases (Target)</th>
<th>PyTorch Tags (Predicted)</th>
</tr>
</thead>
<tbody>
"""
for row in comparison_results:
lu_phrases = "".join([f'<div class="phrase-item">{p}</div>' for p in row['pylucene_phrases']])
pt_phrases = "".join([f'<div class="phrase-item">{p}</div>' for p in row['pytorch_phrases']])
table_html += f"""
<tr title="{row['raw_doc']}">
<td class="doc-id-cell">{row['doc_id']}</td>
<td class="phrase-col"><div class="phrase-list">{lu_phrases}</div></td>
<td class="phrase-col"><div class="phrase-list">{pt_phrases}</div></td>
</tr>
"""
table_html += "</tbody></table>"
display(HTML(table_html))
Phrase Extraction Boundaries (First 5 Docs)
| Doc ID (Hover for context) | PyLucene Phrases (Target) | PyTorch Tags (Predicted) |
|---|---|---|
| 12247 |
liam neeson
|
liam neeson
sleep night
|
| 7218 |
star war
|
star war movi
star war fan
william origin music
star war film
|
| 6260 |
x men
|
entri x men saga
prof x
x men
x men apocalyps
mutant have been
dai
where x men
x
littl too
|
| 10503 |
spider man wai home
|
life peter
life spider man
wors spider man
spider man
deal hi secret
peter seek out doctor strang benedict cumberbatch
peter parker spider man strang
him odd doctor strang
peter
hi
save dai film
deliv spider man
|
| 11127 |
quiet place dai on
|
quiet place dai
lupita nyong'o
alien invas
quiet place dai on film
|
| 11828 |
tom hank deliv
|
man
juxtaposit
insid
him
tom hank
i like
i could have
societi selfish
widow
|
| 8448 |
black adam
|
endors black adam
super
fare special effect
said
|
| 5480 |
i did
|
much anticip
have
i
i had hard time
|
| 12919 |
28 dai later
|
28 dai later franchis
|
| 8625 |
galaxi vol 3 farewel
|
guardian
guardian rocket bittersweet
sinc spider man
|
Conclusions
We should note that we are, to some extent, comparing apples and pears here. The PyLucene approach is rule based: it has an on-disk index size of 33MB and extracts phrases by looking across multiple documents, since to be significant a phrase must appear in more than one document. The PyTorch approach tries to extract phrases from a single document. It is an unfair comparison, but I was more interested in the idea of bootstrapping learning, and from that perspective it worked. The saved model is 10MB, which compares favourably with the 33MB index, but training used 10GB of RAM.
In general the model has converged and found phrases per document; it is a good start. The configuration has lots of parameters, and I suspect moving the model to a different style of written text would require retraining. In contrast, PyLucene just looks at the collocation of terms in the text; it is more resilient, deterministic and faster, but it needs a background index for comparison.
The PyTorch model has done surprisingly well and we should be encouraged by the results; doing this well with just 30s of training and quite a small dataset demonstrates the power of learning, and this could be transferred. I did experiment briefly with both the Whitespace and Standard Lucene Analyzers. The results with the English Analyzer were the best: the inclusion of stemming and the removal of stop words are clearly beneficial for convergence, given the training time and the memory limitations of my laptop.