Learning Phrases with PyLucene and PyTorch, part 2.

Author

Peter Tillotson

Published

March 30, 2026

Abstract

In part 2 we reuse our tokenised index and use PyTorch to build a model for significant phrase extraction. It worked surprisingly well, and being able to switch Analyzers proved useful: the English Analyzer, with stopword removal and stemming, worked best.

The results are indicative: neither the dataset size nor the length of the training cycles is sufficient to develop a generalised phrase extractor, but the success, and the overlap found between the PyLucene and PyTorch phrases, is very encouraging. We just need to scale it up.

This time, we have abstracted away the Lucene code from part 1 [1] into a set of classes that we can use to extract significant phrases from search results and the index. We will use these results and see how we get on building a PyTorch model to extract phrases directly.

The pipeline has two distinct phases, see Figure 1. First, PyLucene extracts statistically significant phrases from a search result set. Second, those phrases are used as training labels to teach a PyTorch model to recognise similar phrases directly from raw token sequences.

flowchart TB
    A["CSV Reviews"] --> B["Lucene Index<br/>(TMDB)"]
    B --> C["IndexStatsExtractor<br/>(global TF-IDF)"]
    B --> D["SearchStatsExtractor<br/>(per-query stats)"]
    C --> E["PhraseBuilder"]
    D --> E
    E --> F["all_phrases_df<br/>(phrase, doc_ids)"] 
    F --> G["Sequence Labelling<br/>(TermVectors)"] 
    G --> H["BiLSTM PhraseTagger<br/>(REINFORCE + BCE)"]
    H --> I["Evaluation<br/>(Precision / Recall / Reward)"]
Figure 1: An overview of what we will build.

Extracting Significant Phrases with Lucene

This time we have switched to the TMDB dataset, another movie-reviews dataset with 10,000 movies, 150k cast, 63k crew, and 80k user reviews. Importantly for this test, it has the movie title as a separate field, which makes it a little easier to generate lots of phrases: in the Lucene approach we can use the movie title as a query term and generate the significant phrases for each movie.

A limitation of this dataset is that it is quite small per movie; many movies have only a few reviews.

Classes

The abstracted classes [1] greatly simplify the PyLucene phrase extraction pipeline.

classDiagram
    class ConfigurableIndexer {
        +dict index_config
        +add_document(dict)
    }
    class IndexStatsExtractor {
        +dict index_config
        +str field_name
        +extract() DataFrame
    }
    class SearchStatsExtractor {
        +dict index_config
        +str field_name
        +IndexReader reader
        +extract(TopDocs hits) DataFrame
    }
    class PhraseBuilder {
        +DataFrame index_stats_df
        +DataFrame search_stats_df
        +IndexReader reader
        +build_phrases(max_slop, num_significant_terms) DataFrame
    }

    ConfigurableIndexer ..> IndexStatsExtractor : writes index
    IndexStatsExtractor ..> PhraseBuilder : index_stats_df
    SearchStatsExtractor ..> PhraseBuilder : search_stats_df
Figure 2: We abstracted the Lucene code into a set of classes.

Phrase Extraction Algorithm

The phrase extraction algorithm is presented in Figure 3.

flowchart TD
    A["IndexStatsExtractor<br/>extract()"] --> C["Join on term<br/>compute tfidf delta"]
    B["SearchStatsExtractor<br/>extract(hits)"] --> C
    C --> D{"tfidf higher<br/>in search than index?"}
    D -- No --> E["Drop term"]
    D -- Yes --> F["Keep as significant term"]
    F --> G["Group consecutive terms<br/>within max_slop positions"]
    G --> H{"Gap ≤ max_slop?"}
    H -- Yes --> I["Fill gap with '???'"]
    H -- No --> J["End phrase"]
    I --> G
    J --> K["Resolve '???' via<br/>storedFields byte offsets"]
    K --> L["resolved_text<br/>phrases DataFrame"]
Figure 3: The phrase extraction algorithm.
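As a rough sketch of the gap-filling step in Figure 3 (a hypothetical standalone helper, not the actual PhraseBuilder code), grouping significant-term positions into phrases might look like this:

```python
def build_phrase_tokens(term_positions, max_slop=2):
    """Group sorted (position, term) pairs of significant terms into phrases.

    Consecutive terms whose positional gap is at most max_slop stay in one
    phrase, with skipped positions filled by the '???' placeholder (later
    resolved from stored-field byte offsets); larger gaps end the phrase.
    """
    phrases, current = [], [term_positions[0]]
    for pos, term in term_positions[1:]:
        gap = pos - current[-1][0] - 1
        if gap <= max_slop:
            # fill skipped positions with the placeholder token
            current.extend((current[-1][0] + k + 1, "???") for k in range(gap))
            current.append((pos, term))
        else:
            phrases.append(" ".join(t for _, t in current))
            current = [(pos, term)]
    phrases.append(" ".join(t for _, t in current))
    return phrases

build_phrase_tokens([(4, "man"), (6, "land")])    # ['man ??? land']
build_phrase_tokens([(0, "first"), (10, "war")])  # ['first', 'war']
```

The '???' placeholders are what the later resolution step replaces with the real stored text.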

Whereas in part 1 we programmatically defined Lucene documents and fields, all of this is now configurable. It would be a simple matter to extend this to daily indexing by templating the config file; for now, though, the index is not that big at 33MB. The index configuration is as follows.

index_config = {
    "directory": "./index_tmdb",
    "default_analyzer": "keyword",   # Whole doc as a single token
    "fields": [
        {
            "name": "movie_title",
            "options": "docs"
        },
        {
            "name": "content",
            "analyzer": "english",   # Standard English analyzer
            "options": "docs_freqs_positions_offsets"
        }
    ]
}
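The daily-indexing idea could be sketched by templating the directory name (a hypothetical helper, not part of the library):

```python
from datetime import date

# Hypothetical sketch: render a dated index directory from a template so
# that each daily run writes to its own index.
config_template = {
    "directory": "./index_tmdb_{day}",
    "default_analyzer": "keyword",
    "fields": [{"name": "content", "analyzer": "english",
                "options": "docs_freqs_positions_offsets"}],
}

def render_config(template, day):
    cfg = dict(template)
    cfg["directory"] = cfg["directory"].format(day=day)
    return cfg

cfg = render_config(config_template, date(2026, 3, 30).isoformat())
# cfg["directory"] == "./index_tmdb_2026-03-30"
```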

The phrase extraction configuration is also presented below; later we will do the same for the PyTorch model. It will be interesting to compare the complexities of the two approaches.

phrases_config = {
    "min_doc_freq": 5,             # minimum document frequency for significant phrase
    "max_slop": 2,                 # maximum slop for significant phrase
    "num_significant_terms": 200,  # number of significant terms to extract
    "max_hits": 500                # maximum number search results per movie
}

Implementation

First we import the required classes.

Code
import pandas as pd
import numpy as np

from pathlib import Path

from acumed.search.index import (
    ConfigurableIndexer,
    IndexStatsExtractor
)

from acumed.search.searcher import SearchStatsExtractor
from acumed.search.phrases import PhraseBuilder

from org.apache.lucene.index import Term

from org.apache.lucene.search import (
    IndexSearcher,
    TermQuery, 
    BooleanQuery,
    BooleanClause
)
WARNING: Using incubator modules: jdk.incubator.vector
Mar 30, 2026 4:12:37 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128

We index the TMDB reviews. It is small, but still large enough to be interesting.

Code
%%time
index_config = {
    "directory": "./index_tmdb",
    "default_analyzer": "keyword",
    "fields": [
        {
            "name": "movie_title",
            "options": "docs"
        },
        {
            "name": "content",
            "analyzer": "english",
            "options": "docs_freqs_positions_offsets"
        }
    ]
}

if not Path(index_config["directory"]).exists():
    # Index each review in the dataframe
    tmdb_df = pd.read_csv("data/tmdb/reviews.csv")
    with ConfigurableIndexer(index_config) as indexer:
        def index_review(tpl):
            indexer.add_document({
                "movie_title": tpl.movie_title,
                "content": tpl.content
            })
            return tpl
        
        tmdb_df.apply(index_review, axis=1)
CPU times: user 2.87 s, sys: 130 ms, total: 3 s
Wall time: 1.82 s

As before we can extract the significant phrases from the index.

Code
%%time

phrases_config = {
    "min_doc_freq": 5,             # minimum document frequency for significant phrase
    "max_slop": 2,                 # maximum slop for significant phrase
    "num_significant_terms": 200,  # number of significant terms to extract
    "max_hits": 500                # maximum number search results per movie
}

with IndexStatsExtractor(index_config, "content") as ex:
    index_stats_df = ex.extract()

with SearchStatsExtractor(index_config, "content") as ex:
    titles = (
       pd.read_csv("data/tmdb/reviews.csv")[['movie_title', 'movie_id']]
        .groupby("movie_title")
        .count()
        .loc[lambda x: x.movie_id >= phrases_config["min_doc_freq"]] # at least 5 reviews
        .reset_index()[['movie_title']]
        .drop_duplicates()
    )

    searcher = IndexSearcher(ex.reader)

    def f(tpl):
        movie_title = tpl.movie_title
        query = (
            BooleanQuery.Builder()
            .add(TermQuery(Term('movie_title',movie_title)), BooleanClause.Occur.MUST)
        ).build()
        
        hits = searcher.search(query, phrases_config["max_hits"])
        search_stats_df = ex.extract(hits)
        pb = PhraseBuilder(index_stats_df, search_stats_df, ex.reader)
        phrases_df = pb.build_phrases(max_slop=phrases_config["max_slop"], num_significant_terms=phrases_config["num_significant_terms"])
        phrases_df['movie_title'] = movie_title
        return phrases_df
    
    # Build the significant-phrase DataFrame for every qualifying title
    all_phrases_df = titles.apply(f, axis=1)
    
all_phrases_df = pd.concat(all_phrases_df.tolist())
all_phrases_df.head(8)
CPU times: user 21.6 s, sys: 97.1 ms, total: 21.7 s
Wall time: 19 s
phrase resolved_text doc_ids nos_docs movie_title
0 13 go 30 13 go 30 [3076, 3077, 3079] 3 13 Going on 30
0 first world war first world war [9584, 9596] 2 1917
1 man land man land [9582, 9584] 2 1917
2 other film other film [9585, 9595] 2 1917
0 28 dai later 28 dai later [337, 340] 2 28 Days Later
1 cillian murphi cillian murphi [340, 343] 2 28 Days Later
0 28 year later 28 year later [12505, 12508, 12509] 3 28 Years Later
1 all too all too [12507, 12508] 2 28 Years Later

Learning Phrases with PyTorch via the Lucene Index

We now use the significant phrases extracted from the Lucene index as training data for a PyTorch model.

The task is token classification: given a sequence of words from a document, predict which tokens belong to a significant phrase. The natural candidates, in rough order of complexity, are:

Model Memory Hardware Notes
Logistic Regression \(O(N)\) CPU No sequential context
Bidirectional LSTM [2] \(O(N)\) CPU / MPS Good context, efficient
BERT / RoBERTa [3] \(O(N^2)\) GPU (≥ 8 GB VRAM) State-of-the-art, memory intensive
Longformer [4] \(O(N \sqrt{N})\) High-end GPU Designed for long documents

A Transformer such as BERT would likely give the best recall, but its self-attention mechanism scales quadratically with sequence length. On a MacBook Pro with Apple Silicon, batches of 1024-token sequences caused kernel out-of-memory crashes in initial experiments. The Bidirectional LSTM [2,5] scales linearly and runs stably on Apple MPS — a practical choice when GPU memory is limited.

The training objective combines two signals:

  1. First, a Binary Cross-Entropy loss [6] with a high positive class weight (pos_weight = 25) to compensate for label imbalance (phrase tokens are typically less than 5% of a document).
  2. Second, a REINFORCE [7] policy gradient that rewards the model at the document level using the 0 / 1 / 2 scoring scheme, encouraging complete phrase recovery rather than isolated token hits.
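The 0 / 1 / 2 scheme is simple enough to state as a standalone function (a sketch mirroring the reward logic used inside the training loop):

```python
def document_reward(true_indices, pred_indices):
    """Document-level reward for the REINFORCE signal.

    2.0: every target phrase token was recovered (100% document recall),
    1.0: the predictions overlap the targets only partially,
    0.0: targets exist but nothing relevant was predicted.
    """
    true_set, pred_set = set(true_indices), set(pred_indices)
    overlap = true_set & pred_set
    if not overlap:
        return 0.0
    return 2.0 if overlap == true_set else 1.0

document_reward([3, 4, 5], [3, 4, 5])  # 2.0: perfect extraction
document_reward([3, 4, 5], [4, 9])     # 1.0: partial overlap
document_reward([3, 4, 5], [9])        # 0.0: complete miss
```

Rewarding at the document level pushes the model toward recovering whole phrases rather than scattered individual tokens.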

For each document that has at least one target phrase, we read its term vector directly from the Lucene index, reconstruct the original token sequence ordered by position, then build a binary label vector marking which token positions belong to a target phrase, see Figure 4.

---
config:
  themeVariables:
    fontSize: 10
---
sequenceDiagram
    participant PD as all_phrases_df
    participant TV as Lucene TermVectors
    participant SP as Sequence Prep
    participant DS as SequenceDataset

    PD->>SP: doc_phrases dict {doc_id: [phrases]}
    loop For each doc_id
        SP->>TV: reader.termVectors().get(doc_id)
        TV-->>SP: terms + byte positions
        SP->>SP: Sort by position, truncate to max_seq_len
        SP->>SP: Slide phrase tokens over sequence → seq_labels [0/1]
        SP->>SP: Map tokens → vocab integer IDs
        SP->>DS: append (tensor, labels, doc_id)
    end
    DS->>DS: 75/25 train/test split
Figure 4: The PyTorch sequence preparation.
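The label-building step of Figure 4 can be illustrated with a small sketch (simplified from the indexing loop that follows; '???' placeholders act as single-token wildcards):

```python
def label_sequence(tokens, target_phrases):
    """Binary labels per token: 1 where any target phrase matches."""
    labels = [0] * len(tokens)
    for phrase in target_phrases:
        ptoks = phrase.split()
        for i in range(len(tokens) - len(ptoks) + 1):
            # '???' matches any token at that position (a slop gap)
            if all(p == "???" or p == tokens[i + j]
                   for j, p in enumerate(ptoks)):
                for j in range(len(ptoks)):
                    labels[i + j] = 1
    return labels

label_sequence(["the", "first", "world", "war", "end"],
               ["first world war"])   # [0, 1, 1, 1, 0]
label_sequence(["no", "man", "s", "land"],
               ["man ??? land"])      # [0, 1, 1, 1]
```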

As before, all tuneable hyperparameters are consolidated into a single dict at the top of the PyTorch code section. Because the task involves highly imbalanced data (only a few tokens are part of a phrase) and long documents, a few hyperparameters are particularly critical:

  • pos_weight (25.0): This is the most important parameter. It tells the Binary Cross-Entropy loss to treat a “missed” phrase token as 25 times more costly than a “false alarm” non-phrase token. Without this, the model would simply predict “0” for everything to achieve 95%+ accuracy.
  • pg_weight (0.1): This balances the two training signals. A value of 0.1 ensures the Reinforcement Learning (REINFORCE) reward signal guides the model toward complete phrases without overwhelming the token-level BCE supervision.
  • max_seq_len (1024): This limits how much of a document the model “sees” at once. While LSTMs scale linearly, processing very long sequences still consumes significant memory.
  • hidden_dim (128): This defines the “capacity” of the LSTM. 128 units allows the model to remember enough contextual interaction between words to distinguish significant phrases from common collocations.
# All tuneable hyperparameters in one place
model_config = {
    "max_seq_len":       1024,   # Token truncation limit per document (avoids O(N²) memory in long docs)
    "embed_dim":         128,    # Embedding vector size
    "hidden_dim":        128,    # LSTM hidden units per direction
    "num_layers":        2,      # Stacked BiLSTM depth
    "dropout":           0.2,    # Dropout between LSTM layers
    "pos_weight":        25.0,   # BCE weight for phrase tokens (boosts Recall on sparse labels)
    "lr":                0.001,  # Adam learning rate
    "batch_size":        32,     # Mini-batch size
    "epochs":            6,      # Training epochs
    "pg_weight":         0.1,    # Policy-gradient loss coefficient relative to BCE
    "test_size":         0.25,   # Held-out fraction for evaluation
    "random_state":      42,     # Reproducible train/test split
    "n_comparison_docs": 10,     # Max documents displayed in visual comparison
}
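To see why pos_weight matters, here is the weighted binary cross-entropy written out in plain Python (the same per-token quantity nn.BCEWithLogitsLoss(pos_weight=...) computes, after the sigmoid):

```python
import math

def weighted_bce(p, y, pos_weight=25.0):
    """Weighted per-token binary cross-entropy.

    p: predicted probability that the token is part of a phrase,
    y: true label (0 or 1). Positive-class errors are scaled by pos_weight.
    """
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))

# Missing a phrase token at p=0.1 costs ~57.6; a false alarm of the same
# confidence (y=0 at p=0.9) costs only ~2.3 -- a 25x asymmetry.
miss = weighted_bce(0.1, 1)
false_alarm = weighted_bce(0.9, 0)
```

This asymmetry is what stops the model collapsing to the trivial all-zero prediction.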

In the following code we use an IndexReader directly to translate TermVectors into tensors.

Code
%%time

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence
import numpy as np

# All tuneable hyperparameters in one place
model_config = {
    "max_seq_len":       1024,   # Token truncation limit per document (avoids O(N²) memory in long docs)
    "embed_dim":         128,    # Embedding vector size
    "hidden_dim":        128,    # LSTM hidden units per direction
    "num_layers":        2,      # Stacked BiLSTM depth
    "dropout":           0.2,    # Dropout between LSTM layers
    "pos_weight":        25.0,   # BCE weight for phrase tokens (boosts Recall on sparse labels)
    "lr":                0.001,  # Adam learning rate
    "batch_size":        32,     # Mini-batch size
    "epochs":            6,      # Training epochs
    "pg_weight":         0.1,    # Policy-gradient loss coefficient relative to BCE
    "test_size":         0.25,   # Held-out fraction for evaluation
    "random_state":      42,     # Reproducible train/test split
    "n_comparison_docs": 10,     # Max documents displayed in visual comparison
}

# 1. Prepare Target Phrases per Document
doc_phrases = (
    all_phrases_df[['phrase', 'doc_ids']]
    .explode('doc_ids')
    .drop_duplicates()
    .groupby('doc_ids')['phrase'] 
    .apply(list)
    .to_dict()
)
# 2. Extract Sequential Data from Lucene TermVectors
from org.apache.lucene.util import BytesRefIterator
from acumed.search.searcher import SearchStatsExtractor

vocab = {"<PAD>": 0, "<UNK>": 1}
sequences = []
labels = []
doc_ids_list = []

# Open the reader to loop the exact sequential positions per document
with SearchStatsExtractor(index_config, "content") as ex:
    reader = ex.reader
    for doc_id, target_phrases in doc_phrases.items():
        vector = reader.termVectors().get(doc_id)
        if vector is None:
            continue
            
        termsEnum = vector.terms("content")
        if termsEnum is None:
            continue
            
        te = termsEnum.iterator()
        term_positions = []
        
        # Reconstruct the document's token order from the term vector positions
        for term in BytesRefIterator.cast_(te):
            term_str = term.utf8ToString()
            postings = te.postings(None)
            postings.nextDoc()
            freq = postings.freq()
            for _ in range(freq):
                pos = postings.nextPosition()
                term_positions.append((pos, term_str))
                
        term_positions.sort(key=lambda x: x[0])
        # Truncate long documents: Transformers scale O(N²), BiLSTM is O(N) but still memory-bounded
        seq_tokens = [t for p, t in term_positions][:model_config["max_seq_len"]]
        
        if not seq_tokens:
            continue
            
        # Build binary labels marking which token positions belong to a target phrase
        seq_labels = [0] * len(seq_tokens)
        for phrase in target_phrases:
            phrase_tokens = phrase.split()
            p_len = len(phrase_tokens)
            
            # Slide the phrase over the sequence; '???' matches any token (a slop gap)
            for i in range(len(seq_tokens) - p_len + 1):
                match = True
                for j in range(p_len):
                    if phrase_tokens[j] != '???' and phrase_tokens[j] != seq_tokens[i+j]:
                        match = False
                        break
                if match:
                    for j in range(p_len):
                        seq_labels[i+j] = 1
                        
        # Encode tokens as vocabulary integer IDs
        enc_seq = []
        for t in seq_tokens:
            if t not in vocab:
                vocab[t] = len(vocab)
            enc_seq.append(vocab[t])
            
        sequences.append(torch.tensor(enc_seq, dtype=torch.long))
        labels.append(torch.tensor(seq_labels, dtype=torch.float))
        doc_ids_list.append(doc_id)
CPU times: user 1.35 s, sys: 165 ms, total: 1.52 s
Wall time: 2.2 s

PyTorch Datasets

Next we create standard DataLoaders with padded sequences and randomly split into train and test sets in a 3:1 ratio.

Code
class SequenceDataset(Dataset):
    def __init__(self, sequences, labels, doc_ids):
        self.sequences = sequences
        self.labels = labels
        self.doc_ids = doc_ids
        
    def __len__(self):
        return len(self.sequences)
        
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx], self.doc_ids[idx]

def collate_fn(batch):
    seqs, lbls, dids = zip(*batch)
    seqs_padded = pad_sequence(seqs, batch_first=True, padding_value=vocab["<PAD>"])
    lbls_padded = pad_sequence(lbls, batch_first=True, padding_value=0.0)
    return seqs_padded, lbls_padded, torch.tensor(dids)

# Split sequences, labels, and document IDs together
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    sequences, labels, doc_ids_list,
    test_size=model_config["test_size"],
    random_state=model_config["random_state"]
)

train_dataset = SequenceDataset(X_train, y_train, id_train)
test_dataset = SequenceDataset(X_test, y_test, id_test)

train_loader = DataLoader(train_dataset, batch_size=model_config["batch_size"], shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=model_config["batch_size"], shuffle=False, collate_fn=collate_fn)

Bidirectional LSTM Evaluation

Each token embedding is passed through a stack of bidirectional LSTM layers so that every position can see both preceding and following context before a linear head produces a per-token phrase probability.

flowchart TD
    A["Input token IDs<br/>(batch × seq_len)"] --> B["nn.Embedding<br/>(vocab_size → embed_dim)"]
    B --> C["nn.LSTM<br/>(bidirectional, num_layers)<br/>embed_dim → hidden_dim × 2"]
    C --> D["nn.Linear<br/>(hidden_dim × 2 → 1)"]
    D --> E["Sigmoid → probability per token"]
    E --> F{"threshold 0.5"}
    F -->|"≥ 0.5"| G["Tag = 1 (phrase token)"]
    F -->|"< 0.5"| H["Tag = 0 (non-phrase)"]
Figure 5: The LSTM evaluation pipeline.
Code
%%time

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

class PhraseTagger(nn.Module):
    def __init__(self, vocab_size,
                 embed_dim=model_config["embed_dim"],
                 hidden_dim=model_config["hidden_dim"],
                 num_layers=model_config["num_layers"]):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True,
                            dropout=model_config["dropout"])
        self.linear = nn.Linear(hidden_dim * 2, 1)
        
    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        logits = self.linear(lstm_out).squeeze(-1)
        return logits

model = PhraseTagger(vocab_size=len(vocab)).to(device)
pos_weight = torch.tensor([model_config["pos_weight"]], device=device)
criterion = nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_weight)
optimizer = torch.optim.Adam(model.parameters(), lr=model_config["lr"])

# Training loop: weighted BCE supervision plus a REINFORCE policy-gradient term
epochs = model_config["epochs"]
model.train()
for epoch in range(epochs):
    total_loss = 0
    total_bce = 0
    total_pg = 0
    
    for seqs, lbls, _ in train_loader:
        optimizer.zero_grad()
        seqs, lbls = seqs.to(device), lbls.to(device)
        
        logits = model(seqs)
        probs = torch.sigmoid(logits)
        
        mask = (seqs != vocab["<PAD>"]).float()
        
        # 1. Supervised BCE loss, masked over padding, stabilises early training
        bce_loss_raw = criterion(logits, lbls)
        bce_loss = (bce_loss_raw * mask).sum() / max(mask.sum(), 1)
        
        # 2. REINFORCE policy gradient: sample tagging decisions
        # from the per-token Bernoulli distributions
        m = torch.distributions.Bernoulli(probs)
        actions = m.sample()
        log_probs = m.log_prob(actions)
        
        batch_rewards = []
        for i in range(len(lbls)):
            doc_len = mask[i].sum().int().item()
            pred_seq = actions[i][:doc_len]
            true_seq = lbls[i][:doc_len]
            
            true_indices = torch.where(true_seq == 1)[0].cpu().numpy()
            pred_indices = torch.where(pred_seq == 1)[0].cpu().numpy()
            
            if len(true_indices) > 0 and len(pred_indices) == 0:
                r = 0.0
            elif len(np.intersect1d(true_indices, pred_indices)) > 0:
                doc_recall = len(np.intersect1d(true_indices, pred_indices)) / max(len(true_indices), 1)
                r = 2.0 if doc_recall == 1.0 else 1.0
            else:
                r = 0.0
            batch_rewards.append(r)
            
        reward_tensor = torch.tensor(batch_rewards, device=device, dtype=torch.float)
        
        # Subtract the batch-mean reward as a baseline to reduce policy-gradient variance
        advantages = reward_tensor - reward_tensor.mean()
        
        # Sum log-probabilities over non-padding tokens, normalised by the longest sequence in the batch
        masked_log_probs = (
            (log_probs * mask).sum(dim=1) / max(mask.sum(dim=1).max(), 1)
        )
        pg_loss = - (masked_log_probs * advantages).mean()
        
        # Combined objective: BCE handles the token-level class imbalance, the policy gradient rewards complete phrases
        loss = bce_loss + model_config["pg_weight"] * pg_loss
        
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        total_bce += bce_loss.item()
        total_pg += pg_loss.item()
        
    avg_loss = total_loss/max(len(train_loader), 1)
    avg_bce = total_bce/max(len(train_loader), 1)
    avg_pg = total_pg/max(len(train_loader), 1)
    print(f"Epoch {epoch+1}/{epochs} | Total Loss: {avg_loss:.4f} (BCE: {avg_bce:.4f}, PG: {avg_pg:.4f})")

torch.save(model.state_dict(), "local/models/phrases_pytorch.bin")
Epoch 1/6 | Total Loss: 0.9121 (BCE: 0.9110, PG: 0.0112)
Epoch 2/6 | Total Loss: 0.7458 (BCE: 0.7455, PG: 0.0031)
Epoch 3/6 | Total Loss: 0.5821 (BCE: 0.5821, PG: -0.0004)
Epoch 4/6 | Total Loss: 0.4133 (BCE: 0.4133, PG: -0.0004)
Epoch 5/6 | Total Loss: 0.2790 (BCE: 0.2791, PG: -0.0012)
Epoch 6/6 | Total Loss: 0.1977 (BCE: 0.1978, PG: -0.0001)
CPU times: user 11 s, sys: 2.4 s, total: 13.4 s
Wall time: 27.8 s

That is it: training completed 6 epochs in a little under 30 seconds. The total loss decreases at each epoch, which is a good sign. This isn’t a big training set or exhaustive training, and the results are unlikely to generalise to other document sets; the question we are trying to answer here is whether it can work at all.

Final Evaluation Metrics

We compute the standard token-level metrics (accuracy, precision and recall) and, alongside them, the 0 / 1 / 2 document-level reward, which explicitly scores how completely each document's target phrases were recovered.

flowchart TD
    A["Model logits<br/>(test set)"] --> B["Sigmoid threshold 0.5<br/>binary predictions"]
    B --> C["Flatten all predictions<br/>& true labels"]
    C --> D["Standard Metrics<br/>Accuracy / Precision / Recall"]
    B --> E["Per-document phrase<br/>boundary check"]
    E --> F{"Any target<br/>phrase tokens?"}
    F -- No --> G["Reward = 0<br/>(hard fail)"]
    F -- Yes --> H{"Predicted tokens<br/>overlap targets?"}
    H -- No --> G
    H -- Yes --> I{"100% recall<br/>for this doc?"}
    I -- Yes --> J["Reward = 2<br/>(perfect extraction)"]
    I -- No --> K["Reward = 1<br/>(partial extraction)"]
    J --> L["Average Reward"]
    K --> L
    G --> L
Figure 6: Evaluation pipeline: from model outputs to reward scores.
Code
from sklearn.metrics import accuracy_score, precision_score, recall_score

model.eval()
idx2word = {v: k for k, v in vocab.items()}
comparison_results = []
plot_data = []
all_preds = []
all_trues = []
rewards = []

with torch.no_grad():
    for seqs, lbls, dids in test_loader:
        seqs, lbls = seqs.to(device), lbls.to(device)
        logits = model(seqs)
        preds = (torch.sigmoid(logits) > 0.5).float()
        
        # Exclude padding positions from the token-level metrics
        mask = (seqs != vocab["<PAD>"])
        all_preds.extend(preds[mask].cpu().numpy())
        all_trues.extend(lbls[mask].cpu().numpy())
        
        # Compute the per-document 0 / 1 / 2 recovery reward
        for i in range(len(dids)):
            doc_id = dids[i].item()
            doc_len = mask[i].sum().int().item()
            
            true_seq = lbls[i][:doc_len]
            pred_seq = preds[i][:doc_len]
            
            true_indices = torch.where(true_seq == 1)[0].cpu().numpy()
            pred_indices = torch.where(pred_seq == 1)[0].cpu().numpy()
            
            # Assign the 0 / 1 / 2 document score
            if len(true_indices) > 0 and len(pred_indices) == 0:
                rewards.append(0)
            elif len(np.intersect1d(true_indices, pred_indices)) > 0:
                # Document-level recall: fraction of target phrase tokens recovered
                doc_recall = len(np.intersect1d(true_indices, pred_indices)) / max(len(true_indices), 1)
                
                if doc_recall == 1.0:
                    rewards.append(2)  # Score 2: every target token recovered
                else:
                    rewards.append(1)  # Score 1: partial overlap with the targets
            else:
                rewards.append(0)      # Score 0: no overlap with the targets
                
            # Collect a few documents for the visual Lucene vs PyTorch comparison
            if len(comparison_results) < model_config["n_comparison_docs"] and doc_id in doc_phrases:
                pred_tokens = []
                current_phrase = []
                for j in range(doc_len):
                    if pred_seq[j] == 1:
                        current_phrase.append(idx2word.get(seqs[i][j].item(), "<UNK>"))
                    else:
                        if current_phrase:
                            pred_tokens.append(" ".join(current_phrase))
                            current_phrase = []
                if current_phrase:
                    pred_tokens.append(" ".join(current_phrase))
                    
                raw_text_snippet = " ".join([idx2word.get(x.item(), "<UNK>") for x in seqs[i][:min(doc_len, 50)]]) + "..."
                raw_text_snippet_esc = raw_text_snippet.replace('"', '&quot;')
                
                # Drop duplicate PyTorch predictions while preserving chronological extraction order
                unique_pred_tokens = list(dict.fromkeys(pred_tokens)) if pred_tokens else ["(None)"]
                
                # Capture full token sequence for boundary visualization
                token_details = []
                for j in range(min(doc_len, 50)):
                    token_details.append({
                        "word": idx2word.get(seqs[i][j].item(), "<UNK>"),
                        "is_target": true_seq[j].item() == 1,
                        "is_pred": pred_seq[j].item() == 1
                    })

                comparison_results.append({
                    "doc_id": doc_id,
                    "pylucene_phrases": doc_phrases[doc_id],
                    "pytorch_phrases": unique_pred_tokens,
                    "raw_doc": raw_text_snippet_esc,
                    "tokens": token_details
                })

In the following code we present the results.

Code
rewards = np.array(rewards)

metrics_df = pd.DataFrame([
    {"Metric": "Native Model Accuracy", "Value": f"{accuracy_score(all_trues, all_preds):.4f}"},
    {"Metric": "Extraction Precision", "Value": f"{precision_score(all_trues, all_preds, zero_division=0):.4f}"},
    {"Metric": "Extraction Recall", "Value": f"{recall_score(all_trues, all_preds, zero_division=0):.4f}"},
    {"Metric": "Reward: Perfect Extractions [2]", "Value": str((rewards == 2).sum())},
    {"Metric": "Reward: Partial Extractions [1]", "Value": str((rewards == 1).sum())},
    {"Metric": "Reward: Failed Misses [0]", "Value": str((rewards == 0).sum())},
    {"Metric": "Average Document Reward", "Value": f"{rewards.mean():.2f}"}
])
metrics_df
Metric Value
0 Native Model Accuracy 0.9477
1 Extraction Precision 0.1841
2 Extraction Recall 0.4811
3 Reward: Perfect Extractions [2] 48
4 Reward: Partial Extractions [1] 79
5 Reward: Failed Misses [0] 100
6 Average Document Reward 0.77

On this data accuracy is not a very good metric, because most tokens are not part of a phrase; precision and recall are better metrics here. In those terms it is not a great model, but look at the rewards, and particularly the partial extractions: the model is learning to extract phrases, though it quite often captures a little more than the target phrase.
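A toy example shows why accuracy misleads on this kind of imbalanced labelling: with 95% negative tokens, a model that predicts "not a phrase" everywhere looks accurate while recovering nothing.

```python
# 5 phrase tokens among 100; the trivial all-zero predictor.
true_labels = [1] * 5 + [0] * 95
predictions = [0] * 100

accuracy = sum(t == p for t, p in zip(true_labels, predictions)) / len(true_labels)
recall = sum(t == 1 and p == 1 for t, p in zip(true_labels, predictions)) / sum(true_labels)

print(accuracy)  # 0.95
print(recall)    # 0.0
```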

Phrases Extracted

Figure 7 presents the overlaps and extractions in more detail. I am quite happy with this outcome. Is it perfect? No, not by a long way, but it has had 30 seconds of training on a MacBook M5 with quite a small dataset. What it manages to do is very encouraging.

Code
from IPython.display import HTML

# 1. Build the Boundary Visualization
viz_html = """
<style>
    .viz-container {
        font-family: 'Segoe UI', sans-serif;
        margin: 30px 0;
        padding: 20px;
        background: #fff;
        border-radius: 12px;
        box-shadow: 0 2px 15px rgba(0,0,0,0.05);
    }
    .legend {
        display: flex;
        gap: 20px;
        margin-bottom: 20px;
        font-size: 0.85rem;
        font-weight: 600;
    }
    .legend-item { display: flex; align-items: center; gap: 6px; }
    .box { width: 14px; height: 14px; border-radius: 3px; }
    
    .doc-viz { margin-bottom: 25px; }
    .doc-viz-title { font-weight: 700; font-size: 0.9rem; margin-bottom: 8px; color: #4b5563; }
    .token-row { display: flex; flex-wrap: wrap; gap: 4px; }
    .token {
        padding: 2px 6px;
        border-radius: 4px;
        font-size: 0.85rem;
        background: #f3f4f6;
        color: #374151;
        transition: transform 0.1s;
    }
    .token:hover { transform: scale(1.1); z-index: 10; }
    
    /* Highlight Classes */
    .t-lucene { background: #e0e7ff; color: #4338ca; border: 1px solid #c7d2fe; }
    .t-pytorch { background: #fef3c7; color: #92400e; border: 1px solid #fde68a; }
    .t-overlap { background: #d1fae5; color: #065f46; border: 1px solid #a7f3d0; font-weight: 700; }
</style>

<div class="viz-container">
    <h3 style="margin-top:0">Phrase Extraction Boundaries (First 5 Docs)</h3>
    <div class="legend">
        <div class="legend-item"><div class="box t-lucene"></div> Lucene Target</div>
        <div class="legend-item"><div class="box t-pytorch"></div> PyTorch Predicted</div>
        <div class="legend-item"><div class="box t-overlap"></div> Organic Overlap</div>
    </div>
"""

for row in comparison_results[:5]:
    viz_html += f'<div class="doc-viz"><div class="doc-viz-title">Document: {row["doc_id"]}</div><div class="token-row">'
    for t in row["tokens"]:
        cls = ""
        if t["is_target"] and t["is_pred"]: cls = "t-overlap"
        elif t["is_target"]: cls = "t-lucene"
        elif t["is_pred"]: cls = "t-pytorch"
        
        viz_html += f'<span class="token {cls}">{t["word"]}</span>'
    viz_html += '</div></div>'

viz_html += "</div>"
display(HTML(viz_html))


# 2. Build the Global Results Table
table_html = """
<style>
    .extraction-table {
        width: 100%;
        border-collapse: separate;
        border-spacing: 0;
        font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
        margin: 20px 0;
        background: rgba(255, 255, 255, 0.8);
        backdrop-filter: blur(8px);
        border-radius: 12px;
        overflow: hidden;
        box-shadow: 0 4px 20px rgba(0, 0, 0, 0.08);
        border: 1px solid #e1e4e8;
    }
    .extraction-table th {
        background: linear-gradient(135deg, #6366f1, #4f46e5);
        color: white;
        padding: 16px;
        text-align: left;
        font-weight: 600;
        font-size: 0.95rem;
        letter-spacing: 0.02em;
        border-bottom: 2px solid rgba(0,0,0,0.1);
    }
    .extraction-table td {
        padding: 14px 16px;
        border-bottom: 1px solid #f0f1f4;
        vertical-align: top;
        color: #374151;
        font-size: 0.9rem;
        line-height: 1.5;
    }
    .extraction-table tr:last-child td {
        border-bottom: none;
    }
    .extraction-table tr:hover {
        background-color: rgba(99, 102, 241, 0.03);
        transition: background-color 0.2s ease;
    }
    .doc-id-cell {
        font-weight: 700;
        color: #4f46e5;
        width: 15%;
        position: relative;
        cursor: help;
    }
    .phrase-list {
        margin: 0;
        padding: 0;
        list-style: none;
    }
    .phrase-item {
        margin-bottom: 6px;
        display: flex;
        align-items: center;
    }
    .phrase-item::before {
        content: "•";
        color: #6366f1;
        font-weight: bold;
        display: inline-block;
        width: 1em;
        margin-left: -1em;
    }
    .phrase-col {
        padding-left: 20px !important;
    }
</style>
<table class="extraction-table">
    <thead>
        <tr>
            <th>Doc ID (Hover for context)</th>
            <th>PyLucene Phrases (Target)</th>
            <th>PyTorch Tags (Predicted)</th>
        </tr>
    </thead>
    <tbody>
"""

for row in comparison_results:
    lu_phrases = "".join([f'<div class="phrase-item">{p}</div>' for p in row['pylucene_phrases']])
    pt_phrases = "".join([f'<div class="phrase-item">{p}</div>' for p in row['pytorch_phrases']])
    
    table_html += f"""
        <tr title="{row['raw_doc']}">
            <td class="doc-id-cell">{row['doc_id']}</td>
            <td class="phrase-col"><div class="phrase-list">{lu_phrases}</div></td>
            <td class="phrase-col"><div class="phrase-list">{pt_phrases}</div></td>
        </tr>
    """

table_html += "</tbody></table>"

display(HTML(table_html))

Figure 7: (a) Token-level phrase extraction boundaries for the first five documents, with Lucene target tokens, PyTorch predicted tokens, and their overlap highlighted; (b) a per-document table comparing the PyLucene target phrases with the PyTorch predicted tags.

Conclusions

We should note that we are to some extent comparing apples and pears here. The PyLucene approach is rule-based, has an on-disk index of 33MB, and extracts phrases by looking across multiple documents: to count as significant, a phrase must appear in more than one document. The PyTorch approach tries to extract phrases from a single document. It is an unfair comparison, but I was more interested in the idea of bootstrapping learning, and from that perspective it worked. The saved model is 10MB, which compares favourably with the 33MB index, but training used 10GB of RAM.

In general the model has converged and finds phrases per document, which is a good start. The configuration has lots of parameters, and I suspect moving the model to a different style of written text would require retraining. PyLucene, in contrast, just looks at the collocation of terms in the text; it is more resilient, deterministic and faster, but it needs a background index for comparison.
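The collocation idea is simple enough to sketch in a few lines. This is a hypothetical illustration, not the actual PhraseBuilder code: count the document frequency of each bigram and keep those that appear in more than one document.

```python
from collections import Counter
from itertools import tee

# Hypothetical tokenised documents (post stop-word removal and stemming).
docs = [
    ["liam", "neeson", "do", "pretti", "good"],
    ["classic", "liam", "neeson", "joke"],
    ["star", "war", "movi"],
]

def bigrams(tokens):
    a, b = tee(tokens)
    next(b, None)
    return zip(a, b)

# Document frequency per bigram: count each bigram once per document.
df = Counter()
for doc in docs:
    for bg in set(bigrams(doc)):
        df[bg] += 1

# A phrase is "significant" here if it co-occurs in more than one document.
significant = [" ".join(bg) for bg, n in df.items() if n > 1]
# -> ["liam neeson"]
```

The real pipeline layers TF-IDF statistics on top of this, but the core resilience comes from the same place: a phrase must recur across the background index, so a single odd document cannot mint one.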

The PyTorch model has done surprisingly well and we should be encouraged by the results. Doing this well with just 30 seconds of training and quite a small dataset demonstrates the power of learning, and this could be transferred. I did experiment briefly with both the Whitespace and Standard Lucene Analyzers; the results with the English Analyzer were the best. The stemming and stop-word removal it adds are clearly beneficial for convergence, given the training time and the memory limitations of my laptop.
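To see why the Analyzer choice matters so much for convergence, here is a toy contrast between whitespace-style and English-style analysis. The `toy_stem` function below is a crude stand-in for illustration only, not Lucene's Porter stemmer, and the stop-word list is hypothetical:

```python
import re

STOPWORDS = {"the", "a", "of", "and", "is", "to"}

def whitespace_analyze(text):
    # Whitespace-style analysis: split on spaces, keep everything as-is.
    return text.split()

def toy_stem(t):
    # Two crude suffix rules, enough to mimic the flavour of stemming.
    if t.endswith("ies") and len(t) > 4:
        return t[:-3] + "i"   # movies -> movi
    if t.endswith("s") and len(t) > 3:
        return t[:-1]         # wars -> war
    return t

def english_analyze(text):
    # English-style analysis: lowercase, drop stop words, stem.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

text = "The origins of the Star Wars movies"
whitespace_analyze(text)  # ['The', 'origins', 'of', 'the', 'Star', 'Wars', 'movies']
english_analyze(text)     # ['origin', 'star', 'war', 'movi']
```

The English-style pipeline collapses surface variants onto shared stems and removes high-frequency noise tokens, which shrinks the vocabulary the model has to learn from; on a small dataset with a short training budget, that is exactly the help it needs. It also explains the stemmed forms like "movi" and "origin" visible in Figure 7.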

References

[1]
Learning Phrases with PyLucene and Pytorch, part 1. n.d. https://acumedconsulting.com/blog/2026/03/03/pylucene_phrases/.
[2]
Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation 1997;9:1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
[3]
Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, 2019.
[4]
Beltagy I, Peters ME, Cohan A. Longformer: The Long-Document Transformer. 2020.
[5]
[6]
[7]
Williams RJ. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning 1992;8:229–56. https://doi.org/10.1007/BF00992696.