AI

CrewAI: Extracting PDFs, What Worked?

May 6, 2026

Why Docling
Embedding: What is happening?
CrewAI: Setting up and retrieving knowledge
Local LLM and Pydantic validation
Conclusions
References

This project is a great example of “making AI work for you”: it replaced a manual extraction step and produced a framework that is reusable across different problem spaces. In this article, we discuss getting the most out of CrewAI and RAG search/retrieval.

Using CrewAI [1] on a recent project, we discovered a few settings that significantly helped improve accuracy when building local data extraction. We thought they were well worth sharing. The task was simple: extract structured data from historic PDF reports, but the data could not leave the site.

Learning Phrases with PyLucene and PyTorch, part 2

in solutions, programming

March 30, 2026

In part 2 we reuse our tokenised index and use PyTorch to build a model for significant phrase extraction. It worked surprisingly well, and being able to switch Analyzers proved useful. We found that the English Analyzer, with stopword removal and stemming, worked best.

The results are indicative only: neither the dataset size nor the length of the training cycles is sufficient to develop a generalised phrase extractor, but the success and the overlap found between PyLucene and PyTorch are very encouraging. We just need to scale it up.

Learning Phrases with PyLucene, part 1

in solutions, programming

March 3, 2026

Lucene is a library for building indexes optimised for text search. It is an Apache project and is the core indexing library sitting under Elasticsearch. The following code uses PyLucene, which is a JNI wrapper to the Java API.

The algorithm derives from the idea that search results will show increased frequencies for the search terms and their associated concepts. Phrases should similarly have increased frequencies. Using Lucene makes this fast, though we have to index first. The code also demonstrates an integration between Lucene and Pandas for analytics. The technique could be used to summarise, in aggregate, text entered by users or players in surveys, reviews and the like, which might otherwise get ignored by analytics.
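The frequency idea can be illustrated without Lucene at all. The following is a minimal pure-Python sketch, not the code from the post: it counts two-word phrases in the documents that match a query term and compares their rate with the rate across the whole corpus. The function name `phrase_lift`, the `min_count` threshold, and the toy review corpus are all assumptions made for illustration; a real implementation would read the counts from the Lucene index rather than rescanning the text.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigrams(text):
    """Return the adjacent token pairs in the text."""
    toks = tokens(text)
    return list(zip(toks, toks[1:]))


def phrase_lift(corpus, query, min_count=2):
    """Rank two-word phrases by how much more frequent they are in the
    documents matching `query` than in the corpus as a whole.

    Hypothetical helper: the name, signature and scoring are illustrative.
    """
    query = query.lower()
    hits = [doc for doc in corpus if query in tokens(doc)]

    background = Counter(bg for doc in corpus for bg in bigrams(doc))
    foreground = Counter(bg for doc in hits for bg in bigrams(doc))
    bg_total = sum(background.values())
    fg_total = sum(foreground.values())

    scores = {}
    for bg, count in foreground.items():
        if count < min_count:
            continue  # ignore phrases seen too rarely in the results
        fg_rate = count / fg_total
        bg_rate = background[bg] / bg_total
        scores[" ".join(bg)] = fg_rate / bg_rate  # lift over background
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus standing in for player-entered review text (made up).
corpus = [
    "the boss fight was too hard",
    "loved the boss fight music",
    "the menu was too slow",
    "boss fight needs a checkpoint",
    "the shop menu was confusing",
]

ranked = phrase_lift(corpus, "boss")
for phrase, lift in ranked:
    print(f"{phrase}: {lift:.2f}")
```

A lift above 1 means the phrase is over-represented in the search results relative to the corpus, which is the signal the algorithm relies on. On a corpus this small the scores are only illustrative.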
