scripts/generate_synthetic_pdfs.py builds real PDF/1.4 documents with a hand-written xref so we can generate tens of thousands of ~2 KB PDFs locally. Helvetica only covers latin-1, which is fine for a load generator (throughput, not retrieval relevance); the docstring calls this out so no one mistakes the output for a quality corpus.

scripts/load_ingest.py drives POST /ingest/folder, then polls a hypothetical /documents/stats endpoint every poll-interval seconds to track terminal-state progression. Writes a JSON history report so results can be diffed between runs.

scripts/locustfile_search.py defines a SearchUser profile mixing hybrid / lexical / semantic queries against POST /search plus a health-check sampler. Asserts non-empty results so a "200 with zero hits" regression surfaces as a failure rather than a green percentile graph.

RUNBOOK gains a Load testing section with CPU/GPU SLO tables for both axes (sustained docs/min, search latency p50/p95/p99).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives
LegacyHUB is a production-oriented, fully open-source backend for ingesting, OCR-ing, structurally extracting, and hybrid-searching large legacy PDF archives (designed for ~70,000 documents).
It is part of the TeamHUB suite.
PDFs ──▶ Scanner ──▶ MinIO (originals)
            └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                  └▶ Docling ──▶ MD + JSON ──▶ MinIO
                        └▶ blocks/tables/figures
                              ├▶ PostgreSQL
                              ├▶ OpenSearch (BM25)
                              └▶ Qdrant (BGE-M3 dense)
                                                                           │
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘
Stack
| Component | Tech |
|---|---|
| OCR | OCRmyPDF + Tesseract (rus + eng) |
| Extraction | Docling (layout, tables, figures) |
| Object storage | MinIO (S3-compatible) |
| Relational store | PostgreSQL 16 |
| Lexical search | OpenSearch 2.x (BM25 + ru/en analyzers) |
| Vector search | Qdrant 1.x (named dense vector) |
| Embeddings | BAAI/bge-m3 (dense, 1024d) |
| Reranker | BAAI/bge-reranker-v2-m3 |
| API | FastAPI + Uvicorn |
| Workers | Celery + Redis |
| Logging | structlog (JSON) |
Quick start
cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py
Health check:
curl http://localhost:8000/api/v1/health | jq .
Open the interactive Swagger docs at http://localhost:8000/docs.
Ingest documents
Mount a folder into the container at /data/input (the compose file already
mounts ./data/input for you), drop PDFs into it, and call:
curl -X POST http://localhost:8000/api/v1/ingest/folder \
-H "Content-Type: application/json" \
-d '{"path":"/data/input","recursive":true,"force":false}'
Or run inline (no Celery, useful for ad-hoc tests):
docker compose exec api python scripts/ingest_folder.py \
--path /data/input --recursive --mode inline
To re-process a single document by ID:
docker compose exec api python scripts/reindex_document.py \
--document-id <uuid>
Search
curl -X POST http://localhost:8000/api/v1/search \
-H "Content-Type: application/json" \
-d '{
"query": "ГОСТ 21.501-93 рабочие чертежи",
"limit": 10,
"search_mode": "hybrid",
"filters": {"min_ocr_confidence": 0.5}
}' | jq .
search_mode can be lexical, semantic, or hybrid. Hybrid mode does:
- BM25 top-K from OpenSearch
- Dense top-K from Qdrant (BGE-M3)
- Reciprocal Rank Fusion merge
- Top 30-50 candidates re-scored by the BGE reranker (if available)
- Final top-N returned with citation metadata
Each hit includes the document name, page, block id, table/figure id where applicable, and quality flags - so AI consumers can produce verifiable answers with citations.
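The fusion step in the list above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not LegacyHUB's actual code: the function name rrf_merge, the chunk ids, and the constant k=60 (a commonly used value) are assumptions.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, top_n=10):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed constant, not a value
    taken from LegacyHUB.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


bm25_hits = ["c1", "c2", "c3"]   # hypothetical OpenSearch top-K
dense_hits = ["c3", "c1", "c4"]  # hypothetical Qdrant top-K
fused = rrf_merge([bm25_hits, dense_hits])
print([chunk_id for chunk_id, _ in fused])  # ['c1', 'c3', 'c2', 'c4']
```

Because RRF uses only ranks, BM25 scores and cosine similarities never have to be put on a common scale; chunks that rank well in both lists rise to the top, and the reranker then only needs to look at the head of the fused list.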
Inspect the system
| Service | URL | Credentials |
|---|---|---|
| API docs | http://localhost:8000/docs | - |
| MinIO console | http://localhost:9001 | legacyhub / legacyhub-secret |
| OpenSearch | http://localhost:9200 | - |
| Qdrant UI | http://localhost:6333/dashboard | - |
| Postgres | localhost:5432 | legacyhub / legacyhub |
# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'
# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'
# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
-c "SELECT id, original_file_name, status FROM documents LIMIT 20;"
Environment variables
See .env.example for the full list. Key ones:
- OCR_LANGUAGES - Tesseract language packs (default rus+eng).
- OCR_ENABLED - set false to skip OCR completely.
- DOCLING_OCR_ENABLED - prefer OCRmyPDF; only enable if you do not run OCRmyPDF.
- EMBEDDING_DEVICE / RERANKER_DEVICE - cpu, cuda, or mps.
- MAX_DOCUMENT_TIMEOUT_SECONDS - per-document soft timeout for extraction.
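As an illustration of how these variables might be consumed, here is a small settings loader. It is a sketch under assumptions: load_settings and its helpers are hypothetical names, and every default except rus+eng (the cpu device, OCR on, Docling OCR off, a 600-second timeout) is a placeholder rather than the project's real default.

```python
import os

ALLOWED_DEVICES = {"cpu", "cuda", "mps"}


def _as_bool(value):
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _device(env, name):
    value = env.get(name, "cpu")  # "cpu" as default is an assumption
    if value not in ALLOWED_DEVICES:
        raise ValueError(
            f"{name} must be one of {sorted(ALLOWED_DEVICES)}, got {value!r}"
        )
    return value


def load_settings(env=None):
    """Read the key variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return {
        # rus+eng is the documented default; Tesseract joins packs with "+".
        "ocr_languages": env.get("OCR_LANGUAGES", "rus+eng").split("+"),
        "ocr_enabled": _as_bool(env.get("OCR_ENABLED", "true")),
        "docling_ocr_enabled": _as_bool(env.get("DOCLING_OCR_ENABLED", "false")),
        "embedding_device": _device(env, "EMBEDDING_DEVICE"),
        "reranker_device": _device(env, "RERANKER_DEVICE"),
        # 600 seconds is a placeholder, not the project's real default.
        "max_document_timeout_seconds": int(
            env.get("MAX_DOCUMENT_TIMEOUT_SECONDS", "600")
        ),
    }


settings = load_settings({"OCR_ENABLED": "false", "EMBEDDING_DEVICE": "cuda"})
print(settings["ocr_languages"], settings["ocr_enabled"], settings["embedding_device"])
# ['rus', 'eng'] False cuda
```

Validating the device names at load time means a typo fails at startup instead of partway through an ingestion run.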
Handling poor OCR
- The pipeline computes per-chunk quality_flags:
  - low_ocr_confidence, very_short_text, possible_garbled_text
  - table_detected, figure_detected, handwriting_detected
  - needs_manual_review (any of the above except table/figure detection)
- Garbled chunks are still indexed - so they remain searchable - but the flags let you filter them out at query time via filters.min_ocr_confidence.
- Original text is always preserved verbatim (no destructive cleaning); the normalized_text field is a derived form used purely for recall.
- We deliberately preserve technical / legal identifiers (ГОСТ, document numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.
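The flag logic described above could look roughly like the following. The flag names come from the list; the function name compute_quality_flags, the thresholds, and the symbol-ratio heuristic for garbled text are assumptions made for illustration only.

```python
def compute_quality_flags(text, ocr_confidence, *, has_table=False,
                          has_figure=False, has_handwriting=False,
                          min_confidence=0.5, min_chars=20,
                          max_symbol_ratio=0.4):
    """Return a quality_flags dict for one chunk.

    All thresholds are illustrative assumptions, not project values.
    """
    stripped = text.strip()
    # Share of characters that are neither alphanumeric nor whitespace;
    # a crude proxy for OCR garbage.
    symbols = sum(1 for ch in stripped if not (ch.isalnum() or ch.isspace()))
    symbol_ratio = symbols / len(stripped) if stripped else 0.0

    flags = {
        "low_ocr_confidence": ocr_confidence < min_confidence,
        "very_short_text": len(stripped) < min_chars,
        "possible_garbled_text": symbol_ratio > max_symbol_ratio,
        "table_detected": has_table,
        "figure_detected": has_figure,
        "handwriting_detected": has_handwriting,
    }
    # Manual review is triggered by any flag except table/figure detection.
    review_triggers = ("low_ocr_confidence", "very_short_text",
                       "possible_garbled_text", "handwriting_detected")
    flags["needs_manual_review"] = any(flags[name] for name in review_triggers)
    return flags


clean = compute_quality_flags("ГОСТ 21.501-93 рабочие чертежи зданий", 0.92,
                              has_table=True)
noisy = compute_quality_flags("#@!~ %%", 0.31)
print(clean["needs_manual_review"], noisy["needs_manual_review"])  # False True
```

Note that the dots and dash inside "ГОСТ 21.501-93" stay well below the symbol threshold, so identifier-heavy technical text is not mistaken for garbage in this sketch.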
Handling handwriting
- We do not attempt to recognize handwriting reliably. Suspected handwritten fragments are flagged with block_type=handwriting and quality_flags.handwriting_detected=true plus needs_manual_review=true.
- The API does not present handwriting recognition output as authoritative.
Idempotency
- Document identity = SHA-256 of the original PDF. Re-ingesting the same PDF reuses the existing documents row.
- The pipeline deletes existing chunks for the document and re-creates them before re-indexing; OpenSearch and Qdrant entries are deleted-by-document before re-upsert. So re-running ingestion does not duplicate data.
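A minimal sketch of content-hash identity follows. It uses only hashlib from the standard library; the register_document helper and the dict standing in for the documents table are hypothetical, and the hash serves here only as the lookup key (the real table appears to use UUID ids, judging by the reindex script).

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary stream in fixed-size chunks so large PDFs never
    have to fit in memory at once."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def register_document(known_documents, stream, file_name):
    """Return (content_hash, created).

    known_documents is a plain dict standing in for the documents table.
    Re-ingesting identical bytes finds the existing entry instead of
    creating a second one, regardless of the file name.
    """
    doc_hash = sha256_of_stream(stream)
    if doc_hash in known_documents:
        return doc_hash, False
    known_documents[doc_hash] = {"original_file_name": file_name}
    return doc_hash, True


registry = {}
pdf_bytes = b"%PDF-1.4 example bytes"
first = register_document(registry, io.BytesIO(pdf_bytes), "a.pdf")
second = register_document(registry, io.BytesIO(pdf_bytes), "copy-of-a.pdf")
print(first[1], second[1], len(registry))  # True False 1
```

Keying on content rather than path means renamed or moved copies of the same PDF collapse onto one row.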
Failure handling
- Each pipeline stage records a row in processing_events with level and data JSON.
- A document that fails OCR is marked OCR_FAILED and the pipeline moves on.
- A document that fails Docling is marked EXTRACTION_FAILED.
- Indexing failures move the document to FAILED; re-running scripts/reindex_document.py resumes processing.
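One way to picture the behaviour above is a stage wrapper that records an event and sets a failure status instead of raising. The status names OCR_FAILED, EXTRACTION_FAILED and FAILED come from the list; the stage names, the run_stage helper, the event fields beyond level and data, and the in-memory events list are all assumptions.

```python
from datetime import datetime, timezone

# Failure status per stage. The statuses are from the README; the stage
# names used as keys are hypothetical.
FAILURE_STATUS = {
    "ocr": "OCR_FAILED",
    "extraction": "EXTRACTION_FAILED",
    "indexing": "FAILED",
}


def run_stage(document, stage, func, events):
    """Run one stage, append an event row, and set a failure status
    instead of raising, so the caller can carry on with other work."""
    try:
        func(document)
    except Exception as exc:  # broad on purpose: every stage error is recorded
        document["status"] = FAILURE_STATUS[stage]
        level, data, ok = "error", {"stage": stage, "error": str(exc)}, False
    else:
        level, data, ok = "info", {"stage": stage}, True
    events.append({
        "document_id": document["id"],
        "level": level,
        "data": data,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    return ok


def broken_ocr(document):
    raise RuntimeError("tesseract crashed")


doc = {"id": "doc-1", "status": "PROCESSING"}
events = []
ok = run_stage(doc, "ocr", broken_ocr, events)
print(ok, doc["status"], events[0]["level"])  # False OCR_FAILED error
```

Because the exception is converted into a status plus an event row, one bad PDF does not abort a batch, and the stored event explains why the document stopped where it did.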
Scaling notes (~70k PDFs)
- The Celery worker service is horizontally scalable: docker compose up -d --scale worker=8 (or run several Compose stacks pointing at the same Postgres / MinIO / OpenSearch / Qdrant).
- The embedding step is the biggest cost. Set EMBEDDING_DEVICE=cuda and use a GPU-aware worker image if available.
- OpenSearch defaults to 1 shard / 0 replicas - increase both for production. Replicas can be raised on the live index (PUT /legacy_chunks/_settings); the primary shard count is fixed at index creation, so changing it means recreating or splitting the index.
- Qdrant is single-node by default; for very large corpora run Qdrant in distributed (cluster) mode or shard by document hash.
- For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. BGE-M3 dense at 1024d is ~14 GB of raw float32 vectors (3.5M x 1024 x 4 bytes) before index overhead; budget disk and memory accordingly.
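The sizing figure in the last bullet can be reproduced with a short calculation. The sketch assumes float32 storage (4 bytes per value) and counts the raw vectors only; index structures and payloads come on top. The helper name dense_vector_bytes is hypothetical.

```python
def dense_vector_bytes(num_docs, chunks_per_doc, dim, bytes_per_value=4):
    """Raw size of the dense vectors alone, assuming float32 storage.

    Returns (number_of_vectors, total_bytes).
    """
    num_vectors = num_docs * chunks_per_doc
    return num_vectors, num_vectors * dim * bytes_per_value


vectors, raw_bytes = dense_vector_bytes(70_000, 50, 1024)
print(f"{vectors:,} vectors, {raw_bytes / 1e9:.1f} GB raw")
# 3,500,000 vectors, 14.3 GB raw
```

Changing chunks_per_doc or bytes_per_value (for example 2 for float16) gives a quick estimate for other corpus shapes or storage settings.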
Tests
pip install -e ".[dev]"
pytest -q
The unit suite covers hashing, chunking, quality flags, hybrid result merging,
and duplicate detection. Integration tests run against the live Compose stack
via scripts/smoke_test.py.
Repository layout
legacy-knowledge-indexer/
  app/
    api/          # FastAPI routes & schemas
    db/           # SQLAlchemy models + Alembic migrations
    indexing/     # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/    # scanner, OCR, Docling, chunking, quality, pipeline
    storage/      # MinIO client + key conventions
    utils/        # hashing, text cleaning, language detection, PDF helpers
    workers/      # Celery app + tasks
  scripts/        # init / ingest / reindex / smoke
  tests/          # unit tests
  docker/Dockerfile   # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini
Known limitations
- Docling's exact JSON shape varies between versions. The extractor uses defensive lookups and falls back to paragraph when a label is unknown.
- We do not currently ship a sparse vector path (BGE-M3 supports it). Hybrid recall is achieved via OpenSearch BM25 + Qdrant dense, merged with RRF - which has been observed to outperform BM25-only or dense-only setups on noisy OCR.
- Figure description does not invoke a VLM; captions plus a placeholder are used. Plug a VLM into figure_processor.persist_figures if needed.
- No authentication on the API surface - put it behind an authenticating reverse proxy.
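The defensive lookup mentioned in the first limitation might look something like this. It is only a sketch: the candidate keys (label, type, kind) and most of the block type names are hypothetical, since the README names only paragraph and handwriting, and the real extractor may inspect Docling output quite differently.

```python
# Block types this sketch recognizes. Only "paragraph" and "handwriting"
# appear in the README; the others are hypothetical.
KNOWN_BLOCK_TYPES = {"paragraph", "heading", "list_item", "table",
                     "figure", "handwriting"}

# Keys that might carry the label. Hypothetical, because the JSON shape
# is said to vary between Docling versions.
LABEL_KEYS = ("label", "type", "kind")


def resolve_block_type(item):
    """Return a known block type for a Docling-like item, else "paragraph"."""
    if not isinstance(item, dict):
        return "paragraph"
    for key in LABEL_KEYS:
        value = item.get(key)
        if isinstance(value, str):
            # Normalize case and separators before comparing.
            label = value.strip().lower().replace("-", "_")
            if label in KNOWN_BLOCK_TYPES:
                return label
    return "paragraph"


print(resolve_block_type({"label": "Table"}))         # table
print(resolve_block_type({"label": "sidebar_note"}))  # paragraph
```

Falling back to paragraph keeps unknown content indexed and searchable as plain text rather than dropping it when an upstream label changes.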
License
Apache-2.0.