# LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives

LegacyHUB is a production-oriented, fully open-source backend for ingesting, OCR-ing, structurally extracting, and hybrid-searching large legacy PDF archives (designed for ~70,000 documents). It is part of the **TeamHUB** suite.

```
PDFs ──▶ Scanner ──▶ MinIO (originals)
            └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                  └▶ Docling ──▶ MD + JSON ──▶ MinIO
                        └▶ blocks/tables/figures
                              ├▶ PostgreSQL
                              ├▶ OpenSearch (BM25)
                              └▶ Qdrant (BGE-M3 dense)
                                                                           │
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘
```

## Stack

| Component        | Tech                                     |
|------------------|------------------------------------------|
| OCR              | OCRmyPDF + Tesseract (rus + eng)         |
| Extraction       | Docling (layout, tables, figures)        |
| Object storage   | MinIO (S3-compatible)                    |
| Relational store | PostgreSQL 16                            |
| Lexical search   | OpenSearch 2.x (BM25 + ru/en analyzers)  |
| Vector search    | Qdrant 1.x (named dense vector)          |
| Embeddings       | BAAI/bge-m3 (dense, 1024d)               |
| Reranker         | BAAI/bge-reranker-v2-m3                  |
| API              | FastAPI + Uvicorn                        |
| Workers          | Celery + Redis                           |
| Logging          | structlog (JSON)                         |

## Quick start

```bash
cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py
```

Health check:

```bash
curl http://localhost:8000/api/v1/health | jq .
```

Open the interactive Swagger docs at .

## Ingest documents

Mount a folder into the container at `/data/input` (the compose file already mounts `./data/input` for you), drop PDFs into it, and call:

```bash
curl -X POST http://localhost:8000/api/v1/ingest/folder \
  -H "Content-Type: application/json" \
  -d '{"path":"/data/input","recursive":true,"force":false}'
```

Or run inline (no Celery, useful for ad-hoc tests):

```bash
docker compose exec api python scripts/ingest_folder.py \
  --path /data/input --recursive --mode inline
```

To re-process a single document by ID:

```bash
docker compose exec api python scripts/reindex_document.py \
  --document-id
```

## Search

```bash
curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "ГОСТ 21.501-93 рабочие чертежи",
    "limit": 10,
    "search_mode": "hybrid",
    "filters": {"min_ocr_confidence": 0.5}
  }' | jq .
```

`search_mode` can be `lexical`, `semantic`, or `hybrid`. Hybrid mode does:

1. BM25 top-K from OpenSearch
2. Dense top-K from Qdrant (BGE-M3)
3. Reciprocal Rank Fusion merge
4. Top 30-50 candidates re-scored by the BGE reranker (if available)
5. Final top-N returned with citation metadata

Each hit includes the document name, page, block id, table/figure id where applicable, and quality flags - so AI consumers can produce verifiable answers with citations.

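The Reciprocal Rank Fusion step in hybrid mode can be illustrated with a minimal sketch. It assumes each backend returns an ordered list of chunk IDs; the function name `rrf_merge`, the sample IDs, and the constant `k = 60` (a commonly used default) are illustrative assumptions, not necessarily what LegacyHUB's own merge code uses.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def rrf_merge(
    rankings: Sequence[Sequence[str]],
    k: int = 60,  # assumed smoothing constant; the project may use another value
    limit: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to an ID's score, with rank
    starting at 1. IDs missing from a list contribute nothing for it.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    merged = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return merged if limit is None else merged[:limit]


# Hypothetical result lists from the lexical and dense retrievers.
bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c9", "c3"]

fused = rrf_merge([bm25_hits, dense_hits], limit=3)
for chunk_id, score in fused:
    print(f"{chunk_id}\t{score:.5f}")
```

Because RRF uses only ranks, BM25 scores and cosine similarities never need to be normalized onto a common scale before merging. The fused list would then be passed on to the reranker.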
## Inspect the system

| Service       | URL              | Credentials                      |
|---------------|------------------|----------------------------------|
| API docs      |                  | -                                |
| MinIO console |                  | `legacyhub` / `legacyhub-secret` |
| OpenSearch    |                  | -                                |
| Qdrant UI     |                  | -                                |
| Postgres      | `localhost:5432` | `legacyhub` / `legacyhub`        |

```bash
# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'

# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'

# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
  -c "SELECT id, original_file_name, status FROM documents LIMIT 20;"
```

## Environment variables

See [`.env.example`](.env.example) for the full list. Key ones:

- `OCR_LANGUAGES` - Tesseract language packs (default `rus+eng`).
- `OCR_ENABLED` - set `false` to skip OCR completely.
- `DOCLING_OCR_ENABLED` - prefer OCRmyPDF; only enable if you do not run OCRmyPDF.
- `EMBEDDING_DEVICE` / `RERANKER_DEVICE` - `cpu`, `cuda`, or `mps`.
- `MAX_DOCUMENT_TIMEOUT_SECONDS` - per-document soft timeout for extraction.

## Handling poor OCR

- The pipeline computes per-chunk `quality_flags`:
  - `low_ocr_confidence`, `very_short_text`, `possible_garbled_text`
  - `table_detected`, `figure_detected`, `handwriting_detected`
  - `needs_manual_review` (any of the above except table/figure detection)
- Garbled chunks are still indexed - so they remain searchable - but the flags let you filter them out at query time via `filters.min_ocr_confidence`.
- Original text is always preserved verbatim (no destructive cleaning); the `normalized_text` field is a derived form used purely for recall.
- We deliberately preserve technical / legal identifiers (ГОСТ, document numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.

## Handling handwriting

- We do not attempt to recognize handwriting reliably.
  Suspected handwritten fragments are flagged with `block_type=handwriting` and `quality_flags.handwriting_detected=true` plus `needs_manual_review=true`.
- The API does not present handwriting recognition output as authoritative.

## Idempotency

- Document identity = SHA256 of the original PDF. Re-ingesting the same PDF reuses the existing `documents` row.
- The pipeline deletes existing chunks for the document and re-creates them before re-indexing; OpenSearch and Qdrant entries are deleted by document before being upserted again, so re-running ingestion does not duplicate data.

## Failure handling

- Each pipeline stage records a row in `processing_events` with `level` and `data` JSON.
- A document that fails OCR is marked `OCR_FAILED` and the pipeline moves on.
- A document that fails Docling is marked `EXTRACTION_FAILED`.
- Indexing failures set the document status to `FAILED`; re-running `scripts/reindex_document.py` resumes processing.

## Scaling notes (~70k PDFs)

- The Celery `worker` service is horizontally scalable: `docker compose up -d --scale worker=8` (or run several Compose stacks pointing at the same Postgres / MinIO / OpenSearch / Qdrant).
- The embedding step is the biggest cost. Set `EMBEDDING_DEVICE=cuda` and use a GPU-aware worker image if available.
- OpenSearch defaults to 1 shard / 0 replicas - increase for production. Replicas can be raised on the live index with `PUT /legacy_chunks/_settings`; the primary shard count is fixed at index creation, so changing it means recreating the index and reindexing.
- Qdrant is single-node by default; for very large corpora use the cluster build of Qdrant or shard by document hash.
- For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. BGE-M3 dense at 1024d is ~14 GB on disk (3.5M × 1024 dimensions × 4 bytes, assuming float32 storage and excluding index overhead); budget memory accordingly.

## Tests

```bash
pip install -e ".[dev]"
pytest -q
```

The unit suite covers hashing, chunking, quality flags, hybrid result merging, and duplicate detection. Integration tests run against the live Compose stack via `scripts/smoke_test.py`.

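To make the hashing and duplicate-detection behaviour concrete, here is a minimal sketch of content-addressed identity as described under Idempotency (document identity = SHA256 of the original PDF). The helper names `sha256_of_file` and `group_duplicates` are hypothetical and are not taken from the repository's `utils/` package.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 of a file, read in 1 MiB blocks.

    Streaming keeps memory use flat even for very large scanned PDFs.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def group_duplicates(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Group files by content hash; any group longer than 1 is a duplicate set."""
    groups: Dict[str, List[Path]] = {}
    for path in paths:
        groups.setdefault(sha256_of_file(path), []).append(path)
    return groups
```

Keying on the content hash rather than the file name means a renamed or re-uploaded copy of the same PDF resolves to the same identity.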
## Repository layout

```
legacy-knowledge-indexer/
  app/
    api/             # FastAPI routes & schemas
    db/              # SQLAlchemy models + Alembic migrations
    indexing/        # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/       # scanner, OCR, Docling, chunking, quality, pipeline
    storage/         # MinIO client + key conventions
    utils/           # hashing, text cleaning, language detection, PDF helpers
    workers/         # Celery app + tasks
  scripts/           # init / ingest / reindex / smoke
  tests/             # unit tests
  docker/Dockerfile  # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini
```

## Known limitations

- Docling's exact JSON shape varies between versions. The extractor uses defensive lookups and falls back to `paragraph` when a label is unknown.
- We do not currently ship a sparse vector path (BGE-M3 supports it). Hybrid recall is achieved via OpenSearch BM25 + Qdrant dense, merged with RRF - which has been observed to outperform sparse-only or dense-only setups on noisy OCR.
- Figure description does not invoke a VLM; captions plus a placeholder are used. Plug a VLM into `figure_processor.persist_figures` if needed.
- No authentication on the API surface - put it behind your reverse proxy.

## License

Apache-2.0.