Vadim Malanov 463622c644 deps: tighten version ranges, pin Docling to <2.15
Docling's DocumentConverter shape (text_items, prov[0].page_no,
export_to_markdown signature) still moves between 2.x minor releases.
Cap docling to >=2.0.0,<2.15 so a wheel bump cannot silently break
the defensive walkers in app/ingestion/docling_extractor.py until a
staging smoke test has run against the new minor.

Every other runtime dep gets the same major/minor upper bound:
- web/api: fastapi <0.117, uvicorn <0.33, pydantic <3
- db: sqlalchemy <2.1, psycopg <3.3, alembic <1.14
- search: opensearch-py <3, qdrant-client <1.13
- ingest: ocrmypdf <17, pikepdf <10, pypdf <6
- ml: FlagEmbedding <2, sentence-transformers <4, transformers <5,
      torch <3, numpy <3
- ops/utils: structlog <26, orjson <4, httpx <0.29, click <9

Lift any specific upper bound only after the corresponding regression
test passes on a staging upgrade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:12:15 +03:00

LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives

LegacyHUB is a production-oriented, fully open-source backend for ingesting, OCR-ing, structurally extracting, and hybrid-searching large legacy PDF archives (designed for ~70,000 documents).

It is part of the TeamHUB suite.

PDFs ──▶ Scanner ──▶ MinIO (originals)
                  └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                                          └▶ Docling ──▶ MD + JSON ──▶ MinIO
                                                       └▶ blocks/tables/figures
                                                                ├▶ PostgreSQL
                                                                ├▶ OpenSearch (BM25)
                                                                └▶ Qdrant (BGE-M3 dense)
                                                                          │
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘

Stack

Component         Tech
OCR               OCRmyPDF + Tesseract (rus + eng)
Extraction        Docling (layout, tables, figures)
Object storage    MinIO (S3-compatible)
Relational store  PostgreSQL 16
Lexical search    OpenSearch 2.x (BM25 + ru/en analyzers)
Vector search     Qdrant 1.x (named dense vector)
Embeddings        BAAI/bge-m3 (dense, 1024d)
Reranker          BAAI/bge-reranker-v2-m3
API               FastAPI + Uvicorn
Workers           Celery + Redis
Logging           structlog (JSON)

Quick start

cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py

Health check:

curl http://localhost:8000/api/v1/health | jq .

Open the interactive Swagger docs at http://localhost:8000/docs.

Ingest documents

Mount a folder into the container at /data/input (the compose file already mounts ./data/input for you), drop PDFs into it, and call:

curl -X POST http://localhost:8000/api/v1/ingest/folder \
  -H "Content-Type: application/json" \
  -d '{"path":"/data/input","recursive":true,"force":false}'

Or run inline (no Celery, useful for ad-hoc tests):

docker compose exec api python scripts/ingest_folder.py \
  --path /data/input --recursive --mode inline

To re-process a single document by ID:

docker compose exec api python scripts/reindex_document.py \
  --document-id <uuid>

Search

curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
        "query": "ГОСТ 21.501-93 рабочие чертежи",
        "limit": 10,
        "search_mode": "hybrid",
        "filters": {"min_ocr_confidence": 0.5}
      }' | jq .

search_mode can be lexical, semantic, or hybrid. Hybrid mode does:

  1. BM25 top-K from OpenSearch
  2. Dense top-K from Qdrant (BGE-M3)
  3. Reciprocal Rank Fusion merge
  4. Top 30-50 candidates re-scored by the BGE reranker (if available)
  5. Final top-N returned with citation metadata
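
The fusion in step 3 can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion over two ranked id lists, not the project's actual merge code. The function name `rrf_merge`, the chunk ids, and the constant `k=60` (the value commonly used for RRF) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_merge(
    rankings: Sequence[Sequence[str]], k: int = 60, limit: int = 10
) -> List[str]:
    """Fuse several ranked id lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per id, with rank starting at 1.
    k=60 is an assumed default; the README does not state the value used.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:limit]]


# Hypothetical chunk ids standing in for OpenSearch and Qdrant results.
bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c9", "c3"]
fused = rrf_merge([bm25_hits, dense_hits])
```

Ids that appear in both lists accumulate two contributions, so `c1` and `c3` rise above ids found by only one retriever. The fused list would then be truncated to the reranker's candidate window.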

Each hit includes the document name, page, block id, table/figure id where applicable, and quality flags - so AI consumers can produce verifiable answers with citations.

Inspect the system

Service        URL                              Credentials
API docs       http://localhost:8000/docs       -
MinIO console  http://localhost:9001            legacyhub / legacyhub-secret
OpenSearch     http://localhost:9200            -
Qdrant UI      http://localhost:6333/dashboard  -
Postgres       localhost:5432                   legacyhub / legacyhub

# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'
# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'
# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
  -c "SELECT id, original_file_name, status FROM documents LIMIT 20;"

Environment variables

See .env.example for the full list. Key ones:

  • OCR_LANGUAGES - Tesseract language packs (default rus+eng).
  • OCR_ENABLED - set false to skip OCR completely.
  • DOCLING_OCR_ENABLED - turns on OCR inside Docling. Prefer OCRmyPDF; enable this only if you do not run OCRmyPDF.
  • EMBEDDING_DEVICE / RERANKER_DEVICE - cpu, cuda, or mps.
  • MAX_DOCUMENT_TIMEOUT_SECONDS - per-document soft timeout for extraction.
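
One way these variables might be read into a typed settings object is sketched below. The `Settings` class and `load_settings` function are hypothetical. Apart from `rus+eng`, every default shown is an assumption, so check `.env.example` for the real values.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUE_VALUES = {"1", "true", "yes", "on"}
_DEVICES = {"cpu", "cuda", "mps"}


@dataclass(frozen=True)
class Settings:
    ocr_languages: str
    ocr_enabled: bool
    docling_ocr_enabled: bool
    embedding_device: str
    reranker_device: str
    max_document_timeout_seconds: int


def _as_bool(value: str) -> bool:
    return value.strip().lower() in _TRUE_VALUES


def _device(env: Mapping[str, str], name: str) -> str:
    # Default of "cpu" is an assumption, not taken from .env.example.
    value = env.get(name, "cpu").strip().lower()
    if value not in _DEVICES:
        raise ValueError(f"{name} must be one of {sorted(_DEVICES)}, got {value!r}")
    return value


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    return Settings(
        ocr_languages=env.get("OCR_LANGUAGES", "rus+eng"),
        # Assumed defaults: OCRmyPDF on, Docling OCR off, 600 s timeout.
        ocr_enabled=_as_bool(env.get("OCR_ENABLED", "true")),
        docling_ocr_enabled=_as_bool(env.get("DOCLING_OCR_ENABLED", "false")),
        embedding_device=_device(env, "EMBEDDING_DEVICE"),
        reranker_device=_device(env, "RERANKER_DEVICE"),
        max_document_timeout_seconds=int(
            env.get("MAX_DOCUMENT_TIMEOUT_SECONDS", "600")
        ),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, and rejecting unknown device names at startup surfaces typos before a worker tries to load a model.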

Handling poor OCR

  • The pipeline computes per-chunk quality_flags:
    • low_ocr_confidence, very_short_text, possible_garbled_text
    • table_detected, figure_detected, handwriting_detected
    • needs_manual_review (any of the above except table/figure detection)
  • Garbled chunks are still indexed - so they remain searchable - but the flags let you filter them out at query time via filters.min_ocr_confidence.
  • Original text is always preserved verbatim (no destructive cleaning); the normalized_text field is a derived form used purely for recall.
  • We deliberately preserve technical / legal identifiers (ГОСТ, document numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.
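
The flag set above can be illustrated with a small sketch. The flag names come from this README. The function name, the thresholds, and the symbol-ratio heuristic for garbled text are assumptions made for illustration and are not the pipeline's actual rules.

```python
from typing import Dict


def compute_quality_flags(
    text: str,
    ocr_confidence: float,
    *,
    has_table: bool = False,
    has_figure: bool = False,
    has_handwriting: bool = False,
    min_confidence: float = 0.5,    # assumed threshold
    min_chars: int = 20,            # assumed threshold
    max_symbol_ratio: float = 0.4,  # assumed threshold
) -> Dict[str, bool]:
    stripped = text.strip()
    # Characters that are neither letters/digits nor whitespace.
    symbols = sum(1 for ch in stripped if not (ch.isalnum() or ch.isspace()))
    symbol_ratio = symbols / len(stripped) if stripped else 0.0

    flags = {
        "low_ocr_confidence": ocr_confidence < min_confidence,
        "very_short_text": len(stripped) < min_chars,
        "possible_garbled_text": symbol_ratio > max_symbol_ratio,
        "table_detected": has_table,
        "figure_detected": has_figure,
        "handwriting_detected": has_handwriting,
    }
    # Table and figure detection alone never trigger manual review.
    review_triggers = (
        "low_ocr_confidence",
        "very_short_text",
        "possible_garbled_text",
        "handwriting_detected",
    )
    flags["needs_manual_review"] = any(flags[name] for name in review_triggers)
    return flags


clean = compute_quality_flags(
    "ГОСТ 21.501-93 рабочие чертежи, лист 4", 0.92, has_table=True
)
noisy = compute_quality_flags("#@!~ ^^ %%", 0.21)
```

In this sketch the identifier-heavy `clean` chunk passes despite its dots and dashes, while the `noisy` chunk trips the confidence, length, and garbling checks at once.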

Handling handwriting

  • We do not attempt to recognize handwriting reliably. Suspected handwritten fragments are flagged with block_type=handwriting and quality_flags.handwriting_detected=true plus needs_manual_review=true.
  • The API does not present handwriting recognition output as authoritative.

Idempotency

  • Document identity = SHA256 of the original PDF. Re-ingesting the same PDF reuses the existing documents row.
  • The pipeline deletes existing chunks for the document and re-creates them before re-indexing; OpenSearch and Qdrant entries are deleted-by-document before re-upsert. So re-running ingestion does not duplicate data.
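
As a rough sketch of the identity rule, the digest can be computed by streaming the file so that large scans are never loaded into memory at once. The helper name `document_sha256` and the chunk size are assumptions; only the choice of SHA256 over the original PDF bytes comes from this README.

```python
import hashlib
import tempfile
from pathlib import Path


def document_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 of a file, read in 1 MiB chunks by default."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two copies of the same bytes under different names share one identity.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"%PDF-1.4 example bytes"
    first = Path(tmp) / "scan_001.pdf"
    second = Path(tmp) / "renamed_copy.pdf"
    first.write_bytes(payload)
    second.write_bytes(payload)
    first_id = document_sha256(first)
    second_id = document_sha256(second)
```

Because the digest depends only on content, a renamed or re-uploaded copy resolves to the same document, which is what lets re-ingestion reuse the existing row.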

Failure handling

  • Each pipeline stage records a row in processing_events with level and data JSON.
  • A document that fails OCR is marked OCR_FAILED and the pipeline moves on.
  • A document that fails Docling is marked EXTRACTION_FAILED.
  • A document that fails indexing is marked FAILED; re-running scripts/reindex_document.py resumes processing.
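
The mapping from a failing stage to a document status might look like the sketch below. The three failure statuses come from this README. The `INDEXED` success status, the stage names, the event field names, and the `run_pipeline` function are hypothetical, and a plain list stands in for the processing_events table.

```python
from enum import Enum
from typing import Any, Callable, Dict, List, Tuple


class DocumentStatus(str, Enum):
    INDEXED = "INDEXED"  # hypothetical success status
    OCR_FAILED = "OCR_FAILED"
    EXTRACTION_FAILED = "EXTRACTION_FAILED"
    FAILED = "FAILED"


# Assumed stage names; each maps to the status recorded when it fails.
STAGE_FAILURE_STATUS: Dict[str, DocumentStatus] = {
    "ocr": DocumentStatus.OCR_FAILED,
    "extraction": DocumentStatus.EXTRACTION_FAILED,
    "indexing": DocumentStatus.FAILED,
}

Stage = Tuple[str, Callable[[str], None]]


def run_pipeline(
    document_id: str, stages: List[Stage], events: List[Dict[str, Any]]
) -> DocumentStatus:
    """Run stages in order, recording one event per stage attempted."""
    for name, stage in stages:
        try:
            stage(document_id)
        except Exception as exc:  # a failed stage stops this document only
            events.append(
                {"document_id": document_id, "stage": name,
                 "level": "error", "data": {"error": str(exc)}}
            )
            return STAGE_FAILURE_STATUS.get(name, DocumentStatus.FAILED)
        events.append(
            {"document_id": document_id, "stage": name,
             "level": "info", "data": {}}
        )
    return DocumentStatus.INDEXED
```

Returning a status instead of re-raising lets the caller continue with the next document, matching the "pipeline moves on" behaviour described above.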

Scaling notes (~70k PDFs)

  • The Celery worker service is horizontally scalable: docker compose up -d --scale worker=8 (or run several Compose stacks pointing at the same Postgres / MinIO / OpenSearch / Qdrant).
  • The embedding step is the biggest cost. Set EMBEDDING_DEVICE=cuda and a GPU-aware worker image if available.
  • OpenSearch defaults to 1 shard / 0 replicas - increase for production (PUT /legacy_chunks/_settings).
  • Qdrant is single-node by default; for very large corpora use the cluster build of Qdrant or shard by document hash.
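  • For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. At 1024 dimensions in float32, the raw BGE-M3 dense vectors alone take ~14 GB (3.5M × 1024 × 4 bytes), before index and payload overhead; budget disk and memory accordingly.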
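
The storage estimate in the last bullet can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte float32 values and counts only raw vector data; index structures and payloads add to this.

```python
def raw_vector_bytes(
    documents: int,
    chunks_per_document: int,
    dimensions: int,
    bytes_per_value: int = 4,  # float32, an assumption about storage format
) -> int:
    """Bytes needed for the raw dense vectors, excluding index overhead."""
    return documents * chunks_per_document * dimensions * bytes_per_value


vector_count = 70_000 * 50
total_bytes = raw_vector_bytes(70_000, 50, 1024)
total_gb = total_bytes / 1e9        # decimal gigabytes
total_gib = total_bytes / 2**30     # binary gibibytes
```

With these inputs `vector_count` is 3,500,000 and `total_bytes` is 14,336,000,000, which is about 14.3 GB or 13.4 GiB.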

Tests

pip install -e ".[dev]"
pytest -q

The unit suite covers hashing, chunking, quality flags, hybrid result merging, and duplicate detection. Integration tests run against the live Compose stack via scripts/smoke_test.py.

Repository layout

legacy-knowledge-indexer/
  app/
    api/            # FastAPI routes & schemas
    db/             # SQLAlchemy models + Alembic migrations
    indexing/       # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/      # scanner, OCR, Docling, chunking, quality, pipeline
    storage/        # MinIO client + key conventions
    utils/          # hashing, text cleaning, language detection, PDF helpers
    workers/        # Celery app + tasks
  scripts/          # init / ingest / reindex / smoke
  tests/            # unit tests
  docker/Dockerfile # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini

Known limitations

  • Docling's exact JSON shape varies between versions. The extractor uses defensive lookups and falls back to paragraph when a label is unknown.
  • We do not currently ship a sparse vector path, although BGE-M3 supports one. Hybrid recall comes from OpenSearch BM25 plus Qdrant dense vectors, merged with RRF - a combination that has been observed to outperform sparse-only or dense-only setups on noisy OCR.
  • Figure description does not invoke a VLM; captions plus a placeholder are used. Plug a VLM into figure_processor.persist_figures if needed.
  • No authentication on the API surface - put it behind your reverse proxy.
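
The defensive-lookup idea from the first limitation can be sketched on plain dictionaries. This does not call Docling itself. The item shape (`label`, `prov[0].page_no`), the helper names, and the set of known labels are assumptions for illustration; only the fallback to paragraph comes from this README.

```python
from typing import Any, Mapping, Optional

# Assumed label set; the real extractor may recognise different labels.
KNOWN_BLOCK_TYPES = {
    "title", "section_header", "paragraph", "list_item",
    "table", "figure", "caption",
}


def block_type_for(item: Mapping[str, Any]) -> str:
    """Return a known block type, falling back to 'paragraph'."""
    label = str(item.get("label") or "").strip().lower()
    return label if label in KNOWN_BLOCK_TYPES else "paragraph"


def page_number_for(item: Mapping[str, Any]) -> Optional[int]:
    """Read prov[0].page_no if present, tolerating missing or odd shapes."""
    prov = item.get("prov") or []
    if not isinstance(prov, (list, tuple)) or not prov:
        return None
    first = prov[0]
    if isinstance(first, Mapping):
        page = first.get("page_no")
    else:
        page = getattr(first, "page_no", None)
    return page if isinstance(page, int) else None


item = {"label": "mystery_box", "text": "…", "prov": [{"page_no": 3}]}
normalized = {"block_type": block_type_for(item), "page": page_number_for(item)}
```

Every lookup returns a usable value or `None` rather than raising, so an unexpected shape degrades one block instead of failing the whole document.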

License

Apache-2.0.
