# LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives

LegacyHUB is a production-oriented, fully open-source backend for ingesting,
OCR-ing, structurally extracting, and hybrid-searching large legacy PDF
archives (designed for ~70,000 documents).

It is part of the **TeamHUB** suite.

```
PDFs ──▶ Scanner ──▶ MinIO (originals)
                  └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                                          └▶ Docling ──▶ MD + JSON ──▶ MinIO
                                                       └▶ blocks/tables/figures
                                                                ├▶ PostgreSQL
                                                                ├▶ OpenSearch (BM25)
                                                                └▶ Qdrant (BGE-M3 dense)
                                                                          │
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘
```

## Stack

| Component        | Tech                                     |
|------------------|------------------------------------------|
| OCR              | OCRmyPDF + Tesseract (rus + eng)         |
| Extraction       | Docling (layout, tables, figures)        |
| Object storage   | MinIO (S3-compatible)                    |
| Relational store | PostgreSQL 16                            |
| Lexical search   | OpenSearch 2.x (BM25 + ru/en analyzers)  |
| Vector search    | Qdrant 1.x (named dense vector)          |
| Embeddings       | BAAI/bge-m3 (dense, 1024d)               |
| Reranker         | BAAI/bge-reranker-v2-m3                  |
| API              | FastAPI + Uvicorn                        |
| Workers          | Celery + Redis                           |
| Logging          | structlog (JSON)                         |

## Quick start

```bash
cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py
```

Health check:

```bash
curl http://localhost:8000/api/v1/health | jq .
```

Open the interactive Swagger docs at <http://localhost:8000/docs>.

## Ingest documents

Mount a folder into the container at `/data/input` (the compose file already
mounts `./data/input` for you), drop PDFs into it, and call:

```bash
curl -X POST http://localhost:8000/api/v1/ingest/folder \
  -H "Content-Type: application/json" \
  -d '{"path":"/data/input","recursive":true,"force":false}'
```

Or run inline (no Celery, useful for ad-hoc tests):

```bash
docker compose exec api python scripts/ingest_folder.py \
  --path /data/input --recursive --mode inline
```

To re-process a single document by ID:

```bash
docker compose exec api python scripts/reindex_document.py \
  --document-id <uuid>
```

## Search

```bash
curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
        "query": "ГОСТ 21.501-93 рабочие чертежи",
        "limit": 10,
        "search_mode": "hybrid",
        "filters": {"min_ocr_confidence": 0.5}
      }' | jq .
```

`search_mode` can be `lexical`, `semantic`, or `hybrid`. Hybrid mode runs these steps:

1. BM25 top-K from OpenSearch
2. Dense top-K from Qdrant (BGE-M3)
3. Reciprocal Rank Fusion merge
4. Top 30-50 candidates re-scored by the BGE reranker (if available)
5. Final top-N returned with citation metadata
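
The fusion in step 3 can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not necessarily the project's own merge code: the function name `rrf_merge`, the constant `k=60` (a commonly used default), and the sample chunk ids are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_merge(
    rankings: Sequence[Sequence[str]], k: int = 60, limit: int = 50
) -> List[str]:
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. `k` and `limit` are illustrative defaults.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:limit]


bm25_hits = ["c7", "c2", "c9"]   # hypothetical OpenSearch top-K
dense_hits = ["c2", "c4", "c7"]  # hypothetical Qdrant top-K
fused = rrf_merge([bm25_hits, dense_hits])
print(fused)
```

Chunks that appear in both lists accumulate two reciprocal-rank terms, so they rise above chunks found by only one retriever. The fused list is what a reranker would then re-score.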

Each hit includes the document name, page, block id, table/figure id where
applicable, and quality flags - so AI consumers can produce verifiable answers
with citations.

## Inspect the system

| Service       | URL                                  | Credentials                      |
|---------------|--------------------------------------|----------------------------------|
| API docs      | <http://localhost:8000/docs>         | -                                |
| MinIO console | <http://localhost:9001>              | `legacyhub` / `legacyhub-secret` |
| OpenSearch    | <http://localhost:9200>              | -                                |
| Qdrant UI     | <http://localhost:6333/dashboard>    | -                                |
| Postgres      | `localhost:5432`                     | `legacyhub` / `legacyhub`        |

```bash
# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'
# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'
# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
  -c "SELECT id, original_file_name, status FROM documents LIMIT 20;"
```

## Environment variables

See [`.env.example`](.env.example) for the full list. Key ones:

- `OCR_LANGUAGES` - Tesseract language packs (default `rus+eng`).
- `OCR_ENABLED` - set `false` to skip OCR completely.
- `DOCLING_OCR_ENABLED` - turns on Docling's own OCR. OCRmyPDF is preferred; enable this only if you do not run OCRmyPDF.
- `EMBEDDING_DEVICE` / `RERANKER_DEVICE` - `cpu`, `cuda`, or `mps`.
- `MAX_DOCUMENT_TIMEOUT_SECONDS` - per-document soft timeout for extraction.
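
As an illustration of how these variables might be read, here is a minimal sketch. The `PipelineSettings` class and `load_settings` function are hypothetical names, and only the `OCR_LANGUAGES` default (`rus+eng`) is taken from this README; every other default is an assumption, so check `.env.example` for the real values.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings used in .env files."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PipelineSettings:
    ocr_languages: str
    ocr_enabled: bool
    docling_ocr_enabled: bool
    embedding_device: str
    reranker_device: str
    max_document_timeout_seconds: int


def load_settings(env: Mapping[str, str] = os.environ) -> PipelineSettings:
    # Only the OCR_LANGUAGES default comes from the README; the other
    # defaults below are assumptions made for this sketch.
    return PipelineSettings(
        ocr_languages=env.get("OCR_LANGUAGES", "rus+eng"),
        ocr_enabled=_as_bool(env.get("OCR_ENABLED", "true")),
        docling_ocr_enabled=_as_bool(env.get("DOCLING_OCR_ENABLED", "false")),
        embedding_device=env.get("EMBEDDING_DEVICE", "cpu"),
        reranker_device=env.get("RERANKER_DEVICE", "cpu"),
        max_document_timeout_seconds=int(
            env.get("MAX_DOCUMENT_TIMEOUT_SECONDS", "600")
        ),
    )


settings = load_settings({"OCR_ENABLED": "false", "EMBEDDING_DEVICE": "cuda"})
```

Passing a plain mapping instead of `os.environ` keeps the loader easy to unit-test.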

## Handling poor OCR

- The pipeline computes per-chunk `quality_flags`:
  - `low_ocr_confidence`, `very_short_text`, `possible_garbled_text`
  - `table_detected`, `figure_detected`, `handwriting_detected`
  - `needs_manual_review` (any of the above except table/figure detection)
- Garbled chunks are still indexed, so they remain searchable. The flags mark
  them for consumers, and `filters.min_ocr_confidence` lets you exclude
  low-confidence chunks at query time.
- Original text is always preserved verbatim (no destructive cleaning); the
  `normalized_text` field is a derived form used purely for recall.
- We deliberately preserve technical / legal identifiers (ГОСТ, document
  numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.
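
The flag logic described above might look roughly like the sketch below. The flag names come from this README; the thresholds, the garbled-text heuristic, and the function name `compute_quality_flags` are assumptions for illustration and may differ from the real pipeline.

```python
from typing import Dict, Optional

# Illustrative thresholds (assumptions, not the project's actual values).
LOW_CONFIDENCE_THRESHOLD = 0.5
MIN_TEXT_LENGTH = 20
MAX_NOISE_RATIO = 0.4

# Flags that trigger manual review; table/figure detection is excluded.
REVIEW_TRIGGERS = (
    "low_ocr_confidence",
    "very_short_text",
    "possible_garbled_text",
    "handwriting_detected",
)


def _looks_garbled(text: str) -> bool:
    """Crude heuristic: share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return False
    noisy = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return noisy / len(text) > MAX_NOISE_RATIO


def compute_quality_flags(
    text: str,
    ocr_confidence: Optional[float],
    has_table: bool = False,
    has_figure: bool = False,
    has_handwriting: bool = False,
) -> Dict[str, bool]:
    flags = {
        "low_ocr_confidence": ocr_confidence is not None
        and ocr_confidence < LOW_CONFIDENCE_THRESHOLD,
        "very_short_text": len(text.strip()) < MIN_TEXT_LENGTH,
        "possible_garbled_text": _looks_garbled(text),
        "table_detected": has_table,
        "figure_detected": has_figure,
        "handwriting_detected": has_handwriting,
    }
    flags["needs_manual_review"] = any(flags[name] for name in REVIEW_TRIGGERS)
    return flags


clean = compute_quality_flags("ГОСТ 21.501-93 рабочие чертежи, лист 3", 0.92, has_table=True)
noisy = compute_quality_flags("#@!%^&*()_+{}|:<>?~~~~~~", 0.3)
```

In this sketch a detected table does not by itself request review, while a low-confidence or symbol-heavy chunk does.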

## Handling handwriting

- We do not attempt to recognize handwriting reliably. Suspected handwritten
  fragments are flagged with `block_type=handwriting` and
  `quality_flags.handwriting_detected=true` plus `needs_manual_review=true`.
- The API does not present handwriting recognition output as authoritative.

## Idempotency

- Document identity = SHA256 of the original PDF. Re-ingesting the same PDF
  reuses the existing `documents` row.
- The pipeline deletes existing chunks for the document and re-creates them
  before re-indexing; OpenSearch and Qdrant entries are deleted-by-document
  before re-upsert. So re-running ingestion does not duplicate data.
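
A content hash of this kind can be computed with the standard library alone. The sketch below streams the file in chunks so large PDFs do not need to fit in memory; the function names are hypothetical and the project's hashing utility may differ.

```python
import hashlib
import io
from pathlib import Path
from typing import BinaryIO, Union


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a binary stream, read in fixed-size chunks."""
    digest = hashlib.sha256()
    for block in iter(lambda: stream.read(chunk_size), b""):
        digest.update(block)
    return digest.hexdigest()


def document_identity(path: Union[str, Path]) -> str:
    """Hash the original PDF bytes; identical files yield identical ids."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)


# In-memory demonstration with placeholder bytes instead of a real PDF.
demo_bytes = b"%PDF-1.4 example bytes"
first = sha256_of_stream(io.BytesIO(demo_bytes))
second = sha256_of_stream(io.BytesIO(demo_bytes))
```

Because the id depends only on file content, renaming or moving a PDF does not change its identity, while any byte-level change produces a new document.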

## Failure handling

- Each pipeline stage records a row in `processing_events` with `level` and
  `data` JSON.
- A document that fails OCR is marked `OCR_FAILED` and the pipeline moves on.
- A document that fails Docling is marked `EXTRACTION_FAILED`.
- Indexing failures move the document to `FAILED`; re-running
  `scripts/reindex_document.py` resumes processing.
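
The behaviour above can be pictured with a small stage runner. This is a sketch under assumptions: the failure status names come from this README, but the stage names, the `INDEXED` success status, the event layout, and the function `run_pipeline` are hypothetical, and "moves on" is read as "the batch continues with the next document".

```python
from typing import Any, Callable, Dict, List, Tuple

# Failure statuses are taken from the README; the stage keys are assumed.
FAILURE_STATUS = {
    "ocr": "OCR_FAILED",
    "extraction": "EXTRACTION_FAILED",
    "indexing": "FAILED",
}

Stage = Tuple[str, Callable[[str], None]]


def run_pipeline(
    document_id: str, stages: List[Stage], events: List[Dict[str, Any]]
) -> str:
    """Run stages in order, record one event per stage, stop at the first failure."""
    for name, stage in stages:
        try:
            stage(document_id)
        except Exception as exc:  # a real pipeline would catch narrower errors
            events.append(
                {
                    "document_id": document_id,
                    "stage": name,
                    "level": "error",
                    "data": {"error": repr(exc)},
                }
            )
            return FAILURE_STATUS.get(name, "FAILED")
        events.append(
            {"document_id": document_id, "stage": name, "level": "info", "data": {}}
        )
    return "INDEXED"  # hypothetical success status
```

The caller stores the returned status on the document and proceeds to the next one, so a single bad PDF does not stop the batch.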

## Scaling notes (~70k PDFs)

- The Celery `worker` service is horizontally scalable: `docker compose up -d
  --scale worker=8` (or run several Compose stacks pointing at the same
  Postgres / MinIO / OpenSearch / Qdrant).
- The embedding step is the biggest cost. Set `EMBEDDING_DEVICE=cuda` and a
  GPU-aware worker image if available.
- OpenSearch defaults to 1 shard / 0 replicas - increase for production
  (`PUT /legacy_chunks/_settings`).
- Qdrant is single-node by default; for very large corpora use the cluster
  build of Qdrant or shard by document hash.
- For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. BGE-M3 dense vectors
  at 1024d stored as float32 take ~14 GB on disk (3.5M × 1024 × 4 bytes),
  before index and payload overhead; budget memory accordingly.
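
The vector-storage estimate can be reproduced with a few lines of arithmetic. The sketch assumes float32 values (4 bytes each) and counts raw vectors only; index structures and payloads add to this figure.

```python
def dense_vector_bytes(
    documents: int,
    chunks_per_document: int,
    dimensions: int,
    bytes_per_value: int = 4,  # float32, an assumption about the stored dtype
) -> int:
    """Raw size of all dense vectors, excluding index and payload overhead."""
    return documents * chunks_per_document * dimensions * bytes_per_value


vectors = 70_000 * 50
raw = dense_vector_bytes(70_000, 50, 1024)
print(f"{vectors:,} vectors, {raw / 1e9:.1f} GB ({raw / 2**30:.1f} GiB)")
# prints "3,500,000 vectors, 14.3 GB (13.4 GiB)"
```

Changing `chunks_per_document` or `bytes_per_value` (for example, 2 for float16) shows how sensitive the budget is to chunking and storage precision.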

## Tests

```bash
pip install -e ".[dev]"
pytest -q
```

The unit suite covers hashing, chunking, quality flags, hybrid result merging,
and duplicate detection. Integration tests run against the live Compose stack
via `scripts/smoke_test.py`.

## Repository layout

```
legacy-knowledge-indexer/
  app/
    api/            # FastAPI routes & schemas
    db/             # SQLAlchemy models + Alembic migrations
    indexing/       # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/      # scanner, OCR, Docling, chunking, quality, pipeline
    storage/        # MinIO client + key conventions
    utils/          # hashing, text cleaning, language detection, PDF helpers
    workers/        # Celery app + tasks
  scripts/          # init / ingest / reindex / smoke
  tests/            # unit tests
  docker/Dockerfile # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini
```

## Known limitations

- Docling's exact JSON shape varies between versions. The extractor uses
  defensive lookups and falls back to `paragraph` when a label is unknown.
- We do not currently ship a sparse-vector path, although BGE-M3 supports one.
  Hybrid recall comes from OpenSearch BM25 plus Qdrant dense vectors, merged
  with RRF - a combination that has been observed to outperform sparse-only or
  dense-only setups on noisy OCR.
- Figure description does not invoke a VLM; captions plus a placeholder are
  used. Plug a VLM into `figure_processor.persist_figures` if needed.
- No authentication on the API surface - put it behind your reverse proxy.

## License

Apache-2.0.