
Vadim Malanov 7f72171572 chore: bootstrap repository with governance docs

Initialize git, add Apache-2.0 LICENSE, .gitattributes (LF line
endings), AGENTS.md (entry points, stack, discovery order, baseline
checks), RUNBOOK.md (dev boot, prod deploy with overlay, ingestion,
failures, rollback, scaling notes), .env.prod.example with rotated
credential placeholders, and dev-only warnings on .env.example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 16:41:50 +03:00

9.0 KiB


LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives

LegacyHUB is a production-oriented, fully open-source backend for ingesting, OCR-ing, structurally extracting, and hybrid-searching large legacy PDF archives (designed for ~70,000 documents).

It is part of the TeamHUB suite.

PDFs ──▶ Scanner ──▶ MinIO (originals)
                  └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                                          └▶ Docling ──▶ MD + JSON ──▶ MinIO
                                                       └▶ blocks/tables/figures
                                                                ├▶ PostgreSQL
                                                                ├▶ OpenSearch (BM25)
                                                                └▶ Qdrant (BGE-M3 dense)
                                                                          │
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘

Stack

Component	Tech
OCR	OCRmyPDF + Tesseract (rus + eng)
Extraction	Docling (layout, tables, figures)
Object storage	MinIO (S3-compatible)
Relational store	PostgreSQL 16
Lexical search	OpenSearch 2.x (BM25 + ru/en analyzers)
Vector search	Qdrant 1.x (named dense vector)
Embeddings	BAAI/bge-m3 (dense, 1024d)
Reranker	BAAI/bge-reranker-v2-m3
API	FastAPI + Uvicorn
Workers	Celery + Redis
Logging	structlog (JSON)

Quick start

cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py

Health check:

curl http://localhost:8000/api/v1/health | jq .

Open the interactive Swagger docs at http://localhost:8000/docs.

Ingest documents

Mount a folder into the container at /data/input (the compose file already mounts ./data/input for you), drop PDFs into it, and call:

curl -X POST http://localhost:8000/api/v1/ingest/folder \
  -H "Content-Type: application/json" \
  -d '{"path":"/data/input","recursive":true,"force":false}'

Or run inline (no Celery, useful for ad-hoc tests):

docker compose exec api python scripts/ingest_folder.py \
  --path /data/input --recursive --mode inline

To re-process a single document by ID:

docker compose exec api python scripts/reindex_document.py \
  --document-id <uuid>

Search

curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
        "query": "ГОСТ 21.501-93 рабочие чертежи",
        "limit": 10,
        "search_mode": "hybrid",
        "filters": {"min_ocr_confidence": 0.5}
      }' | jq .

search_mode can be lexical, semantic, or hybrid. Hybrid mode does:

1. BM25 top-K from OpenSearch
2. Dense top-K from Qdrant (BGE-M3)
3. Reciprocal Rank Fusion merge
4. Top 30-50 candidates re-scored by the BGE reranker (if available)
5. Final top-N returned with citation metadata

Each hit includes the document name, page, block id, table/figure id where applicable, and quality flags - so AI consumers can produce verifiable answers with citations.
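The Reciprocal Rank Fusion step of hybrid mode can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `rrf_merge`, the chunk ids, and the constant `k=60` (a commonly used default for RRF) are assumptions.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, limit=10):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per id, where rank starts at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:limit]]


# Hypothetical top-K lists from the lexical and dense retrievers.
bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c9", "c3"]
fused = rrf_merge([bm25_hits, dense_hits])
```

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be put on a common scale before merging.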

Inspect the system

Service	URL	Credentials
API docs	http://localhost:8000/docs	-
MinIO console	http://localhost:9001	legacyhub / legacyhub-secret
OpenSearch	http://localhost:9200	-
Qdrant UI	http://localhost:6333/dashboard	-
Postgres	localhost:5432	legacyhub / legacyhub

# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'
# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'
# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
  -c "SELECT id, original_file_name, status FROM documents LIMIT 20;"

Environment variables

See .env.example for the full list. Key ones:

OCR_LANGUAGES - Tesseract language packs (default rus+eng).
OCR_ENABLED - set false to skip OCR completely.
DOCLING_OCR_ENABLED - turns on Docling's own OCR. OCRmyPDF is the preferred OCR path; enable this only if you do not run OCRmyPDF.
EMBEDDING_DEVICE / RERANKER_DEVICE - cpu, cuda, or mps.
MAX_DOCUMENT_TIMEOUT_SECONDS - per-document soft timeout for extraction.
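A minimal sketch of how these variables might be read and validated at startup. Only the `rus+eng` default comes from this README; the other defaults, the `PipelineSettings` class, and `load_settings` are hypothetical and not taken from the codebase.

```python
import os
from dataclasses import dataclass

ALLOWED_DEVICES = {"cpu", "cuda", "mps"}


def _as_bool(value):
    """Interpret common truthy strings; anything else counts as false."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PipelineSettings:
    ocr_languages: str
    ocr_enabled: bool
    docling_ocr_enabled: bool
    embedding_device: str
    reranker_device: str
    max_document_timeout_seconds: int


def load_settings(env=None):
    """Build settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    devices = {
        "EMBEDDING_DEVICE": env.get("EMBEDDING_DEVICE", "cpu"),  # assumed default
        "RERANKER_DEVICE": env.get("RERANKER_DEVICE", "cpu"),    # assumed default
    }
    for name, device in devices.items():
        if device not in ALLOWED_DEVICES:
            raise ValueError(f"{name} must be one of {sorted(ALLOWED_DEVICES)}")
    return PipelineSettings(
        ocr_languages=env.get("OCR_LANGUAGES", "rus+eng"),
        ocr_enabled=_as_bool(env.get("OCR_ENABLED", "true")),                   # assumed default
        docling_ocr_enabled=_as_bool(env.get("DOCLING_OCR_ENABLED", "false")),  # assumed default
        embedding_device=devices["EMBEDDING_DEVICE"],
        reranker_device=devices["RERANKER_DEVICE"],
        max_document_timeout_seconds=int(
            env.get("MAX_DOCUMENT_TIMEOUT_SECONDS", "600")  # assumed default
        ),
    )


settings = load_settings({"OCR_ENABLED": "false", "EMBEDDING_DEVICE": "cuda"})
```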

Handling poor OCR

The pipeline computes per-chunk quality_flags:
- low_ocr_confidence, very_short_text, possible_garbled_text
- table_detected, figure_detected, handwriting_detected
- needs_manual_review (any of the above except table/figure detection)
Garbled chunks are still indexed - so they remain searchable - but the flags let you filter them out at query time via filters.min_ocr_confidence.
Original text is always preserved verbatim (no destructive cleaning); the normalized_text field is a derived form used purely for recall.
We deliberately preserve technical / legal identifiers (ГОСТ, document numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.
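The rule for needs_manual_review and the query-time confidence filter can be illustrated like this. It is a sketch under assumptions: the helper names and the `ocr_confidence` field on each chunk are hypothetical, and the real pipeline may compute the flags differently.

```python
# Flags that trigger manual review: everything except table/figure detection.
REVIEW_TRIGGERS = (
    "low_ocr_confidence",
    "very_short_text",
    "possible_garbled_text",
    "handwriting_detected",
)


def with_review_flag(flags):
    """Return a copy of the flags with needs_manual_review derived from them."""
    result = dict(flags)
    result["needs_manual_review"] = any(
        flags.get(name, False) for name in REVIEW_TRIGGERS
    )
    return result


def filter_by_confidence(chunks, min_ocr_confidence):
    """Keep chunks whose (assumed) ocr_confidence meets the threshold."""
    return [c for c in chunks if c["ocr_confidence"] >= min_ocr_confidence]


chunks = [
    {"id": "a", "ocr_confidence": 0.92,
     "quality_flags": with_review_flag({"table_detected": True})},
    {"id": "b", "ocr_confidence": 0.31,
     "quality_flags": with_review_flag({"low_ocr_confidence": True})},
]
kept = filter_by_confidence(chunks, 0.5)
```

Chunk "b" stays in the index; it is only dropped from this particular result set, which mirrors the index-everything, filter-at-query-time approach described above.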

Handling handwriting

We do not attempt to recognize handwriting reliably. Suspected handwritten fragments are flagged with block_type=handwriting and quality_flags.handwriting_detected=true plus needs_manual_review=true.
The API does not present handwriting recognition output as authoritative.

Idempotency

Document identity = SHA256 of the original PDF. Re-ingesting the same PDF reuses the existing documents row.
The pipeline deletes existing chunks for the document and re-creates them before re-indexing; OpenSearch and Qdrant entries are deleted-by-document before re-upsert. So re-running ingestion does not duplicate data.
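Content-hash identity can be sketched as below. The in-memory `registry` dict stands in for the documents table, and `document_hash` / `get_or_create` are hypothetical names; this shows the idea rather than the project's code.

```python
import hashlib
import io


def document_hash(stream, chunk_size=1024 * 1024):
    """Stream a binary file object through SHA-256 without loading it whole."""
    digest = hashlib.sha256()
    for block in iter(lambda: stream.read(chunk_size), b""):
        digest.update(block)
    return digest.hexdigest()


def get_or_create(registry, stream, file_name):
    """Return (row, created); an existing row is reused for identical bytes."""
    key = document_hash(stream)
    created = key not in registry
    if created:
        registry[key] = {"sha256": key, "original_file_name": file_name}
    return registry[key], created


registry = {}
pdf_bytes = b"%PDF-1.7 example bytes"
first, created_first = get_or_create(registry, io.BytesIO(pdf_bytes), "a.pdf")
second, created_second = get_or_create(registry, io.BytesIO(pdf_bytes), "copy-of-a.pdf")
```

The second call finds the same row even though the file name differs, because identity depends only on the bytes.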

Failure handling

Each pipeline stage records a row in processing_events with level and data JSON.
A document that fails OCR is marked OCR_FAILED and the pipeline moves on.
A document that fails Docling is marked EXTRACTION_FAILED.
Indexing failures set the document status to FAILED; re-running scripts/reindex_document.py resumes processing.

Scaling notes (~70k PDFs)

The Celery worker service is horizontally scalable: docker compose up -d --scale worker=8 (or run several Compose stacks pointing at the same Postgres / MinIO / OpenSearch / Qdrant).
The embedding step is the biggest cost. Set EMBEDDING_DEVICE=cuda and use a GPU-aware worker image if available.
OpenSearch defaults to 1 shard / 0 replicas - increase for production (PUT /legacy_chunks/_settings).
Qdrant is single-node by default; for very large corpora use the cluster build of Qdrant or shard by document hash.
For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. At 1024 dimensions stored as float32, the raw BGE-M3 dense vectors alone come to ~14 GB on disk, before index and payload overhead; budget memory accordingly.
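The storage estimate can be reproduced with a few lines of arithmetic. The sketch assumes unquantized float32 vectors (4 bytes per dimension) and counts raw vector data only, not the HNSW graph or payloads.

```python
documents = 70_000
chunks_per_document = 50   # rough average from the estimate above
dimensions = 1024          # BGE-M3 dense vector size
bytes_per_value = 4        # float32, assumed (no quantization)

vectors = documents * chunks_per_document
raw_bytes = vectors * dimensions * bytes_per_value
raw_gb = raw_bytes / 1e9       # decimal gigabytes
raw_gib = raw_bytes / 2**30    # binary gibibytes

print(f"{vectors:,} vectors, {raw_gb:.1f} GB ({raw_gib:.1f} GiB) of raw vector data")
```

Changing `chunks_per_document` or `bytes_per_value` (for example, 2 for float16) shows how sensitive the budget is to chunking and precision choices.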

Tests

pip install -e ".[dev]"
pytest -q

The unit suite covers hashing, chunking, quality flags, hybrid result merging, and duplicate detection. Integration tests run against the live Compose stack via scripts/smoke_test.py.

Repository layout

legacy-knowledge-indexer/
  app/
    api/            # FastAPI routes & schemas
    db/             # SQLAlchemy models + Alembic migrations
    indexing/       # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/      # scanner, OCR, Docling, chunking, quality, pipeline
    storage/        # MinIO client + key conventions
    utils/          # hashing, text cleaning, language detection, PDF helpers
    workers/        # Celery app + tasks
  scripts/          # init / ingest / reindex / smoke
  tests/            # unit tests
  docker/Dockerfile # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini

Known limitations

Docling's exact JSON shape varies between versions. The extractor uses defensive lookups and falls back to paragraph when a label is unknown.
We do not currently ship a sparse vector path (BGE-M3 supports it). Hybrid recall is achieved via OpenSearch BM25 + Qdrant dense, merged with RRF - which has been observed to outperform sparse-only or dense-only setups on noisy OCR.
Figure description does not invoke a VLM; captions plus a placeholder are used. Plug a VLM into figure_processor.persist_figures if needed.
No authentication on the API surface - put it behind your reverse proxy.
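The defensive lookup mentioned in the first limitation can be sketched as follows. The key names (`label`, `type`, `name`) and the set of known block types are hypothetical placeholders, not Docling's actual schema; the point is only the fallback to paragraph.

```python
# Hypothetical set of block types the indexer understands.
KNOWN_BLOCK_TYPES = {
    "paragraph", "title", "section_header", "list_item",
    "table", "figure", "caption",
}


def block_type_of(item):
    """Resolve a block type from a loosely structured extractor item.

    Tries a few possible key names and shapes, then falls back to
    "paragraph" when the label is missing or unknown.
    """
    label = item.get("label") or item.get("type") or ""
    if isinstance(label, dict):
        # Some shapes might nest the label, e.g. {"label": {"name": "table"}}.
        label = label.get("name", "")
    label = str(label).strip().lower()
    return label if label in KNOWN_BLOCK_TYPES else "paragraph"


resolved = [
    block_type_of({"label": "table"}),
    block_type_of({"label": {"name": "Figure"}}),
    block_type_of({"label": "mystery_widget"}),
    block_type_of({}),
]
```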

License

Apache-2.0.
