
Vadim Malanov 463622c644 deps: tighten version ranges, pin Docling to <2.15

Docling's DocumentConverter shape (text_items, prov[0].page_no,
export_to_markdown signature) still moves between 2.x minor releases.
Cap docling to >=2.0.0,<2.15 so a wheel bump cannot silently break
the defensive walkers in app/ingestion/docling_extractor.py until a
staging smoke test has run against the new minor.

Every other runtime dep gets the same major/minor upper bound:
- web/api: fastapi <0.117, uvicorn <0.33, pydantic <3
- db: sqlalchemy <2.1, psycopg <3.3, alembic <1.14
- search: opensearch-py <3, qdrant-client <1.13
- ingest: ocrmypdf <17, pikepdf <10, pypdf <6
- ml: FlagEmbedding <2, sentence-transformers <4, transformers <5,
      torch <3, numpy <3
- ops/utils: structlog <26, orjson <4, httpx <0.29, click <9

Lift any specific upper bound only after the corresponding regression
test passes on a staging upgrade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 17:12:15 +03:00

.github/workflows

ci: add GitHub Actions workflow and ESLint v9 config

2026-05-13 16:44:04 +03:00

app

perf(reranker): add benchmark harness and passage clipping

2026-05-13 17:08:04 +03:00

data

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

docker

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

frontend

chore: drop dead _qid helper and surface ocr_confidence on SearchHit

2026-05-13 16:55:32 +03:00

scripts

perf: add ingest and search load-test harnesses

2026-05-13 17:11:08 +03:00

tests

test: add Alembic migration smoke and /search contract tests

2026-05-13 16:54:15 +03:00

.dockerignore

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

.env.example

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

.env.prod.example

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

.gitattributes

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

.gitignore

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

AGENTS.md

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

alembic.ini

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

docker-compose.prod.yml

ops: add docker-compose.prod.yml overlay

2026-05-13 16:52:57 +03:00

docker-compose.yml

feat(api): add CORS middleware and /health contract test

2026-05-13 16:48:49 +03:00

LICENSE

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

Makefile

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

pyproject.toml

deps: tighten version ranges, pin Docling to <2.15

2026-05-13 17:12:15 +03:00

README.md

chore: bootstrap repository with governance docs

2026-05-13 16:41:50 +03:00

RUNBOOK.md

perf: add ingest and search load-test harnesses

2026-05-13 17:11:08 +03:00

README.md

LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives

LegacyHUB is a production-oriented, fully open-source backend for ingesting, OCR-ing, structurally extracting, and hybrid-searching large legacy PDF archives (designed for ~70,000 documents).

It is part of the TeamHUB suite.

PDFs ──▶ Scanner ──▶ MinIO (originals)
                  └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                                          └▶ Docling ──▶ MD + JSON ──▶ MinIO
                                                       └▶ blocks/tables/figures
                                                                ├▶ PostgreSQL
                                                                ├▶ OpenSearch (BM25)
                                                                └▶ Qdrant (BGE-M3 dense)
                                                                          │
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘

Stack

Component	Tech
OCR	OCRmyPDF + Tesseract (rus + eng)
Extraction	Docling (layout, tables, figures)
Object storage	MinIO (S3-compatible)
Relational store	PostgreSQL 16
Lexical search	OpenSearch 2.x (BM25 + ru/en analyzers)
Vector search	Qdrant 1.x (named dense vector)
Embeddings	BAAI/bge-m3 (dense, 1024d)
Reranker	BAAI/bge-reranker-v2-m3
API	FastAPI + Uvicorn
Workers	Celery + Redis
Logging	structlog (JSON)

Quick start

cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py

Health check:

curl http://localhost:8000/api/v1/health | jq .

Open the interactive Swagger docs at http://localhost:8000/docs.

Ingest documents

Mount a folder into the container at /data/input (the compose file already mounts ./data/input for you), drop PDFs into it, and call:

curl -X POST http://localhost:8000/api/v1/ingest/folder \
  -H "Content-Type: application/json" \
  -d '{"path":"/data/input","recursive":true,"force":false}'

Or run inline (no Celery, useful for ad-hoc tests):

docker compose exec api python scripts/ingest_folder.py \
  --path /data/input --recursive --mode inline

To re-process a single document by ID:

docker compose exec api python scripts/reindex_document.py \
  --document-id <uuid>

Search

curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
        "query": "ГОСТ 21.501-93 рабочие чертежи",
        "limit": 10,
        "search_mode": "hybrid",
        "filters": {"min_ocr_confidence": 0.5}
      }' | jq .
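
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_search_request` and `search` are hypothetical and not part of this repository, and the shape of the JSON response is not assumed here.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "http://localhost:8000/api/v1/search"

# The three modes the README documents for search_mode.
SEARCH_MODES = {"lexical", "semantic", "hybrid"}


def build_search_request(query, limit=10, search_mode="hybrid", min_ocr_confidence=None):
    """Build (but do not send) a POST request mirroring the curl example."""
    if search_mode not in SEARCH_MODES:
        raise ValueError(f"unsupported search_mode: {search_mode!r}")
    payload = {"query": query, "limit": limit, "search_mode": search_mode}
    if min_ocr_confidence is not None:
        payload["filters"] = {"min_ocr_confidence": min_ocr_confidence}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, **options):
    """Send the request and return the decoded JSON body as-is."""
    request = build_search_request(query, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the Compose stack running, `search("ГОСТ 21.501-93 рабочие чертежи", min_ocr_confidence=0.5)` sends the same payload as the curl command.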

search_mode can be lexical, semantic, or hybrid. Hybrid mode runs these steps:

1. BM25 top-K from OpenSearch
2. Dense top-K from Qdrant (BGE-M3)
3. Reciprocal Rank Fusion (RRF) merge
4. Top 30-50 candidates re-scored by the BGE reranker (if available)
5. Final top-N returned with citation metadata
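
The fusion step can be illustrated with a short sketch of Reciprocal Rank Fusion. This is not the project's implementation: the function name `rrf_merge`, the chunk ids, and the constant `k=60` (a commonly used default for RRF) are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, limit=10):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    where rank is 1-based. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:limit]]


# Hypothetical top-K lists from the lexical and dense retrievers.
bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c9", "c3"]
fused = rrf_merge([bm25_hits, dense_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Because RRF uses only ranks, the BM25 and cosine scores never need to be normalized against each other; ids found by both retrievers rise above ids found by only one.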

Each hit includes the document name, page, block id, table/figure id where applicable, and quality flags - so AI consumers can produce verifiable answers with citations.

Inspect the system

Service	URL	Credentials
API docs	http://localhost:8000/docs	-
MinIO console	http://localhost:9001	`legacyhub` / `legacyhub-secret`
OpenSearch	http://localhost:9200	-
Qdrant UI	http://localhost:6333/dashboard	-
Postgres	`localhost:5432`	`legacyhub` / `legacyhub`

# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'
# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'
# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
  -c "SELECT id, original_file_name, status FROM documents LIMIT 20;"

Environment variables

See .env.example for the full list. Key ones:

- OCR_LANGUAGES: Tesseract language packs (default rus+eng).
- OCR_ENABLED: set to false to skip OCR completely.
- DOCLING_OCR_ENABLED: turns on Docling's own OCR. OCRmyPDF is preferred; enable this only if you do not run OCRmyPDF.
- EMBEDDING_DEVICE / RERANKER_DEVICE: cpu, cuda, or mps.
- MAX_DOCUMENT_TIMEOUT_SECONDS: per-document soft timeout for extraction.

Handling poor OCR

The pipeline computes per-chunk quality_flags:
- low_ocr_confidence, very_short_text, possible_garbled_text
- table_detected, figure_detected, handwriting_detected
- needs_manual_review (any of the above except table/figure detection)
Garbled chunks are still indexed - so they remain searchable - but the flags let you filter them out at query time via filters.min_ocr_confidence.
Original text is always preserved verbatim (no destructive cleaning); the normalized_text field is a derived form used purely for recall.
We deliberately preserve technical / legal identifiers (ГОСТ, document numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.
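
How needs_manual_review follows from the other flags can be sketched as below. The flag names come from the list above; the `QualityFlags` dataclass itself is a hypothetical container, not the project's actual model, and how each individual flag gets set (thresholds, heuristics) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class QualityFlags:
    """Hypothetical per-chunk flag container using the README's flag names."""

    low_ocr_confidence: bool = False
    very_short_text: bool = False
    possible_garbled_text: bool = False
    table_detected: bool = False
    figure_detected: bool = False
    handwriting_detected: bool = False

    @property
    def needs_manual_review(self) -> bool:
        # Any flag except table/figure detection triggers manual review.
        return any(
            (
                self.low_ocr_confidence,
                self.very_short_text,
                self.possible_garbled_text,
                self.handwriting_detected,
            )
        )


clean_table = QualityFlags(table_detected=True)
scribbled_note = QualityFlags(handwriting_detected=True)
print(clean_table.needs_manual_review, scribbled_note.needs_manual_review)  # False True
```

Deriving the flag as a property rather than storing it keeps it from drifting out of sync with the flags it depends on.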

Handling handwriting

The pipeline does not attempt reliable handwriting recognition. Suspected handwritten fragments are flagged with block_type=handwriting, quality_flags.handwriting_detected=true, and needs_manual_review=true.
The API does not present handwriting recognition output as authoritative.

Idempotency

A document's identity is the SHA256 hash of the original PDF, so re-ingesting the same PDF reuses the existing documents row.
Before re-indexing, the pipeline deletes the document's existing chunks and re-creates them; OpenSearch and Qdrant entries are likewise deleted by document before the re-upsert. Re-running ingestion therefore does not duplicate data.
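
A content-derived identity of this kind can be computed as in the following sketch. The function names `hash_stream` and `document_hash` and the 1 MiB read size are illustrative assumptions, not the project's hashing utility.

```python
import hashlib
import io
from pathlib import Path


def hash_stream(stream, chunk_size=1024 * 1024):
    """Return the SHA256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # Read fixed-size chunks so large PDFs are never loaded fully into memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def document_hash(path):
    """Hash a file on disk; byte-identical files yield identical ids."""
    with Path(path).open("rb") as handle:
        return hash_stream(handle)


# Two byte-identical inputs produce the same identity, whatever the file name.
first = hash_stream(io.BytesIO(b"%PDF-1.7 demo"))
second = hash_stream(io.BytesIO(b"%PDF-1.7 demo"))
print(first == second)  # True
```

Since the id depends only on the bytes, renaming or moving a PDF does not create a new document, while any change to the file content does.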

Failure handling

Each pipeline stage records a row in processing_events with level and data JSON.
A document that fails OCR is marked OCR_FAILED and the pipeline moves on.
A document that fails Docling is marked EXTRACTION_FAILED.
A document that fails indexing is marked FAILED; re-running scripts/reindex_document.py resumes processing.

Scaling notes (~70k PDFs)

The Celery worker service is horizontally scalable: docker compose up -d --scale worker=8 (or run several Compose stacks pointing at the same Postgres / MinIO / OpenSearch / Qdrant).
The embedding step is the biggest cost. If a GPU is available, set EMBEDDING_DEVICE=cuda and use a GPU-aware worker image.
OpenSearch defaults to 1 shard / 0 replicas - increase for production (PUT /legacy_chunks/_settings).
Qdrant is single-node by default; for very large corpora run Qdrant in distributed (cluster) mode or shard by document hash.
For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. At 1024 dimensions stored as float32, the raw BGE-M3 dense vectors take ~14 GB on disk (3.5M × 1024 × 4 bytes), before any index overhead; budget memory accordingly.
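
The sizing figure can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte float32 values, which matches the ~14 GB estimate, and counts raw vectors only; the helper name `raw_vector_bytes` is hypothetical.

```python
def raw_vector_bytes(documents, chunks_per_document, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors, excluding payloads and index structures."""
    return documents * chunks_per_document * dimensions * bytes_per_value


vectors = 70_000 * 50                      # 3,500,000 chunks
total = raw_vector_bytes(70_000, 50, 1024)
print(f"{vectors:,} vectors, {total / 1e9:.1f} GB")  # 3,500,000 vectors, 14.3 GB
```

Changing `chunks_per_document` or `bytes_per_value` (for example, 2 for float16 storage) shows how sensitive the budget is to chunking and vector precision.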

Tests

pip install -e ".[dev]"
pytest -q

The unit suite covers hashing, chunking, quality flags, hybrid result merging, and duplicate detection. Integration tests run against the live Compose stack via scripts/smoke_test.py.

Repository layout

legacy-knowledge-indexer/
  app/
    api/            # FastAPI routes & schemas
    db/             # SQLAlchemy models + Alembic migrations
    indexing/       # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/      # scanner, OCR, Docling, chunking, quality, pipeline
    storage/        # MinIO client + key conventions
    utils/          # hashing, text cleaning, language detection, PDF helpers
    workers/        # Celery app + tasks
  scripts/          # init / ingest / reindex / smoke
  tests/            # unit tests
  docker/Dockerfile # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini

Known limitations

Docling's exact JSON shape varies between versions. The extractor uses defensive lookups and falls back to the paragraph block type when a label is unknown.
We do not currently ship a sparse vector path, although BGE-M3 supports one. Hybrid recall is achieved via OpenSearch BM25 + Qdrant dense, merged with RRF - which has been observed to outperform sparse-only or dense-only setups on noisy OCR.
Figure description does not invoke a VLM; captions plus a placeholder are used. Plug a VLM into figure_processor.persist_figures if needed.
The API has no authentication of its own; put it behind a reverse proxy that enforces access control.

License

Apache-2.0.

Languages

Python 58.4%

TypeScript 39.3%

CSS 0.9%

Dockerfile 0.5%

JavaScript 0.5%

Other 0.4%