LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives
LegacyHUB is a production-oriented, fully open-source backend for ingesting, OCR-ing, structurally extracting, and hybrid-searching large legacy PDF archives (designed for ~70,000 documents).
It is part of the TeamHUB suite.
PDFs ──▶ Scanner ──▶ MinIO (originals)
              └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                      └▶ Docling ──▶ MD + JSON ──▶ MinIO
                              └▶ blocks/tables/figures
                                      ├▶ PostgreSQL
                                      ├▶ OpenSearch (BM25)
                                      └▶ Qdrant (BGE-M3 dense)
                                                                           │
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘
Stack
| Component | Tech |
|---|---|
| OCR | OCRmyPDF + Tesseract (rus + eng) |
| Extraction | Docling (layout, tables, figures) |
| Object storage | MinIO (S3-compatible) |
| Relational store | PostgreSQL 16 |
| Lexical search | OpenSearch 2.x (BM25 + ru/en analyzers) |
| Vector search | Qdrant 1.x (named dense vector) |
| Embeddings | BAAI/bge-m3 (dense, 1024d) |
| Reranker | BAAI/bge-reranker-v2-m3 |
| API | FastAPI + Uvicorn |
| Workers | Celery + Redis |
| Logging | structlog (JSON) |
Quick start
cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py
Health check:
curl http://localhost:8000/api/v1/health | jq .
Open the interactive Swagger docs at http://localhost:8000/docs.
Ingest documents
Mount a folder into the container at /data/input (the compose file already
mounts ./data/input for you), drop PDFs into it, and call:
curl -X POST http://localhost:8000/api/v1/ingest/folder \
-H "Content-Type: application/json" \
-d '{"path":"/data/input","recursive":true,"force":false}'
Or run inline (no Celery, useful for ad-hoc tests):
docker compose exec api python scripts/ingest_folder.py \
--path /data/input --recursive --mode inline
To re-process a single document by ID:
docker compose exec api python scripts/reindex_document.py \
--document-id <uuid>
Search
curl -X POST http://localhost:8000/api/v1/search \
-H "Content-Type: application/json" \
-d '{
"query": "ГОСТ 21.501-93 рабочие чертежи",
"limit": 10,
"search_mode": "hybrid",
"filters": {"min_ocr_confidence": 0.5}
}' | jq .
search_mode can be lexical, semantic, or hybrid. Hybrid mode does:
- BM25 top-K from OpenSearch
- Dense top-K from Qdrant (BGE-M3)
- Reciprocal Rank Fusion merge
- Top 30-50 candidates re-scored by the BGE reranker (if available)
- Final top-N returned with citation metadata
Each hit includes the document name, page, block id, table/figure id where applicable, and quality flags - so AI consumers can produce verifiable answers with citations.
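The Reciprocal Rank Fusion step in the list above can be sketched as follows. This is a minimal illustration, not the project's actual merge code: the function name rrf_merge, the constant k=60 (a commonly used value), and the sample chunk ids are all assumptions.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, limit=50):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed constant, not
    necessarily the value LegacyHUB uses.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:limit]


bm25_hits = ["c1", "c2", "c3"]   # hypothetical BM25 ranking from OpenSearch
dense_hits = ["c3", "c1", "c4"]  # hypothetical dense ranking from Qdrant
merged = rrf_merge([bm25_hits, dense_hits])
print([chunk_id for chunk_id, _ in merged])
```

Because RRF uses only ranks, BM25 scores and cosine similarities never need to be put on a common scale; chunks found by both retrievers (c1 and c3 here) rise above chunks found by only one.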
Inspect the system
| Service | URL | Credentials |
|---|---|---|
| API docs | http://localhost:8000/docs | - |
| MinIO console | http://localhost:9001 | legacyhub / legacyhub-secret |
| OpenSearch | http://localhost:9200 | - |
| Qdrant UI | http://localhost:6333/dashboard | - |
| Postgres | localhost:5432 | legacyhub / legacyhub |
# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'
# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'
# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
-c "SELECT id, original_file_name, status FROM documents LIMIT 20;"
Environment variables
See .env.example for the full list. Key ones:
- OCR_LANGUAGES - Tesseract language packs (default rus+eng).
- OCR_ENABLED - set false to skip OCR completely.
- DOCLING_OCR_ENABLED - prefer OCRmyPDF; only enable if you do not run OCRmyPDF.
- EMBEDDING_DEVICE / RERANKER_DEVICE - cpu, cuda, or mps.
- MAX_DOCUMENT_TIMEOUT_SECONDS - per-document soft timeout for extraction.
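For orientation, here is one way these variables could be parsed into typed settings. It is a sketch only: PipelineSettings and load_settings are hypothetical names, and every default except rus+eng is an assumption rather than a value taken from .env.example.

```python
import os
from dataclasses import dataclass

ALLOWED_DEVICES = {"cpu", "cuda", "mps"}


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _as_device(raw: str) -> str:
    device = raw.strip().lower()
    if device not in ALLOWED_DEVICES:
        raise ValueError(f"unsupported device: {raw!r}")
    return device


@dataclass(frozen=True)
class PipelineSettings:  # hypothetical container, not the project's class
    ocr_languages: str
    ocr_enabled: bool
    docling_ocr_enabled: bool
    embedding_device: str
    reranker_device: str
    max_document_timeout_seconds: int


def load_settings(env=None) -> PipelineSettings:
    """Read settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return PipelineSettings(
        ocr_languages=env.get("OCR_LANGUAGES", "rus+eng"),  # documented default
        ocr_enabled=_as_bool(env.get("OCR_ENABLED", "true")),  # assumed default
        docling_ocr_enabled=_as_bool(env.get("DOCLING_OCR_ENABLED", "false")),  # assumed
        embedding_device=_as_device(env.get("EMBEDDING_DEVICE", "cpu")),  # assumed
        reranker_device=_as_device(env.get("RERANKER_DEVICE", "cpu")),  # assumed
        max_document_timeout_seconds=int(
            env.get("MAX_DOCUMENT_TIMEOUT_SECONDS", "600")  # assumed
        ),
    )


settings = load_settings({"OCR_ENABLED": "false", "EMBEDDING_DEVICE": "cuda"})
print(settings)
```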
Handling poor OCR
- The pipeline computes per-chunk quality_flags:
  - low_ocr_confidence, very_short_text, possible_garbled_text
  - table_detected, figure_detected, handwriting_detected
  - needs_manual_review (any of the above except table/figure detection)
- Garbled chunks are still indexed - so they remain searchable - but the flags let you filter them out at query time via filters.min_ocr_confidence.
- Original text is always preserved verbatim (no destructive cleaning); the normalized_text field is a derived form used purely for recall.
- We deliberately preserve technical / legal identifiers (ГОСТ, document numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.
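The rule for needs_manual_review described above can be expressed in a few lines. This is a sketch under the assumption that flags are stored as a plain dict of booleans; the helper name with_manual_review is hypothetical.

```python
# Flags that trigger manual review: everything except table/figure detection.
REVIEW_TRIGGERS = (
    "low_ocr_confidence",
    "very_short_text",
    "possible_garbled_text",
    "handwriting_detected",
)


def with_manual_review(flags: dict) -> dict:
    """Return a copy of flags with needs_manual_review derived from the rest."""
    result = dict(flags)
    result["needs_manual_review"] = any(
        flags.get(name, False) for name in REVIEW_TRIGGERS
    )
    return result


clean_table = with_manual_review({"table_detected": True})
noisy_scan = with_manual_review({"low_ocr_confidence": True, "figure_detected": True})
print(clean_table["needs_manual_review"], noisy_scan["needs_manual_review"])
```

A chunk that merely contains a table or figure is not sent to review; a chunk with low OCR confidence is, whatever else it contains.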
Handling handwriting
- We do not attempt to recognize handwriting reliably. Suspected handwritten fragments are flagged with block_type=handwriting and quality_flags.handwriting_detected=true plus needs_manual_review=true.
- The API does not present handwriting recognition output as authoritative.
Idempotency
- Document identity = SHA256 of the original PDF. Re-ingesting the same PDF reuses the existing documents row.
- The pipeline deletes existing chunks for the document and re-creates them before re-indexing; OpenSearch and Qdrant entries are deleted-by-document before re-upsert. So re-running ingestion does not duplicate data.
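Computing the content hash that serves as document identity might look like the sketch below. It streams the file so large scans are not loaded into memory at once; the function name sha256_of_file and the 1 MiB chunk size are assumptions, not the project's hashing utility.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; real input would be an original PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 demo bytes")
    demo_path = tmp.name

doc_hash = sha256_of_file(demo_path)
os.unlink(demo_path)
print(doc_hash)
```

Since the hash depends only on the bytes, renaming or moving a PDF does not change its identity, while any change to its content does.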
Failure handling
- Each pipeline stage records a row in processing_events with level and data JSON.
- A document that fails OCR is marked OCR_FAILED and the pipeline moves on.
- A document that fails Docling is marked EXTRACTION_FAILED.
- Indexing failures bring the document to FAILED; re-running scripts/reindex_document.py resumes processing.
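The mapping from failed stage to document status can be illustrated with a small sketch. The stage names, the run_stages helper, the in-memory events list, and the INDEXED success status are all hypothetical; only the three failure statuses and the level/data fields come from the description above.

```python
STAGE_FAILURE_STATUS = {
    "ocr": "OCR_FAILED",
    "extraction": "EXTRACTION_FAILED",
    "indexing": "FAILED",
}


def run_stages(document_id, stages, events):
    """Run (name, callable) stages in order and return the final status.

    Every stage appends one event; the first failure stops processing
    for this document so the caller can move on to the next one.
    """
    for name, stage in stages:
        try:
            stage(document_id)
        except Exception as exc:
            events.append({"document_id": document_id, "stage": name,
                           "level": "error", "data": {"error": str(exc)}})
            return STAGE_FAILURE_STATUS.get(name, "FAILED")
        events.append({"document_id": document_id, "stage": name,
                       "level": "info", "data": {}})
    return "INDEXED"  # hypothetical success status


def broken_extraction(document_id):
    raise RuntimeError("unexpected layout")


events = []
status = run_stages(
    "doc-1",
    [("ocr", lambda d: None), ("extraction", broken_extraction),
     ("indexing", lambda d: None)],
    events,
)
print(status, len(events))
```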
Scaling notes (~70k PDFs)
- The Celery worker service is horizontally scalable: docker compose up -d --scale worker=8 (or run several Compose stacks pointing at the same Postgres / MinIO / OpenSearch / Qdrant).
- The embedding step is the biggest cost. Set EMBEDDING_DEVICE=cuda and use a GPU-aware worker image if available.
- OpenSearch defaults to 1 shard / 0 replicas - increase the replica count for production (PUT /legacy_chunks/_settings). The shard count is fixed when the index is created, so raising it means recreating or reindexing the index.
- Qdrant is single-node by default; for very large corpora run Qdrant in its distributed (cluster) mode or shard by document hash.
- For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. BGE-M3 dense at 1024d is ~14 GB on disk; budget memory accordingly.
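The ~14 GB figure follows from simple arithmetic, reproduced here as a worked example. It assumes vectors are stored as uncompressed float32 and counts raw vector data only; index structures and payloads add to it.

```python
documents = 70_000
chunks_per_document = 50   # rough average from the estimate above
dimensions = 1024          # BGE-M3 dense vector size
bytes_per_value = 4        # assumes float32 without quantization

vectors = documents * chunks_per_document
raw_bytes = vectors * dimensions * bytes_per_value

print(f"{vectors:,} vectors")
print(f"{raw_bytes / 1e9:.1f} GB raw vector data")
```

Halving the bytes per value (for example with float16 storage or quantization, if the vector store is configured for it) would halve the raw figure.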
Tests
pip install -e ".[dev]"
pytest -q
The unit suite covers hashing, chunking, quality flags, hybrid result merging,
and duplicate detection. Integration tests run against the live Compose stack
via scripts/smoke_test.py.
Repository layout
legacy-knowledge-indexer/
  app/
    api/          # FastAPI routes & schemas
    db/           # SQLAlchemy models + Alembic migrations
    indexing/     # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/    # scanner, OCR, Docling, chunking, quality, pipeline
    storage/      # MinIO client + key conventions
    utils/        # hashing, text cleaning, language detection, PDF helpers
    workers/      # Celery app + tasks
  scripts/        # init / ingest / reindex / smoke
  tests/          # unit tests
  docker/Dockerfile   # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini
Known limitations
- Docling's exact JSON shape varies between versions. The extractor uses defensive lookups and falls back to paragraph when a label is unknown.
- We do not currently ship a sparse vector path (BGE-M3 supports it). Hybrid recall is achieved via OpenSearch BM25 + Qdrant dense, merged with RRF - which has been observed to outperform sparse-only or dense-only setups on noisy OCR.
- Figure description does not invoke a VLM; captions plus a placeholder are used. Plug a VLM into figure_processor.persist_figures if needed.
- No authentication on the API surface - put it behind your reverse proxy.
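The defensive label lookup mentioned in the first limitation could look roughly like this. It is an illustration only: block_type_of is a hypothetical helper, and the label names and the label key are assumptions about the extractor's input rather than a statement of Docling's schema.

```python
# Assumed mapping from extractor labels to LegacyHUB block types.
LABEL_TO_BLOCK_TYPE = {
    "text": "paragraph",
    "section_header": "heading",
    "table": "table",
    "picture": "figure",
}


def block_type_of(item) -> str:
    """Map an extracted item to a block type, defaulting to paragraph.

    Tolerates missing keys, non-dict items, and non-string labels so a
    change in the upstream JSON shape degrades gracefully.
    """
    label = item.get("label") if isinstance(item, dict) else None
    if not isinstance(label, str):
        return "paragraph"
    return LABEL_TO_BLOCK_TYPE.get(label.strip().lower(), "paragraph")


print(block_type_of({"label": "table"}))
print(block_type_of({"label": "some_future_label"}))
print(block_type_of({}))
```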
License
Apache-2.0.