# LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives

LegacyHUB is a production-oriented, fully open-source backend for ingesting,
OCR-ing, structurally extracting, and hybrid-searching large legacy PDF
archives (designed for ~70,000 documents).

It is part of the **TeamHUB** suite.

```
PDFs ──▶ Scanner ──▶ MinIO (originals)
             └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                     └▶ Docling ──▶ MD + JSON ──▶ MinIO
                            └▶ blocks/tables/figures
                                    ├▶ PostgreSQL
                                    ├▶ OpenSearch (BM25)
                                    └▶ Qdrant (BGE-M3 dense)
                                                                           │
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘
```

## Stack

| Component        | Tech                                     |
|------------------|------------------------------------------|
| OCR              | OCRmyPDF + Tesseract (rus + eng)         |
| Extraction       | Docling (layout, tables, figures)        |
| Object storage   | MinIO (S3-compatible)                    |
| Relational store | PostgreSQL 16                            |
| Lexical search   | OpenSearch 2.x (BM25 + ru/en analyzers)  |
| Vector search    | Qdrant 1.x (named dense vector)          |
| Embeddings       | BAAI/bge-m3 (dense, 1024d)               |
| Reranker         | BAAI/bge-reranker-v2-m3                  |
| API              | FastAPI + Uvicorn                        |
| Workers          | Celery + Redis                           |
| Logging          | structlog (JSON)                         |

## Quick start

```bash
cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py
```

Health check:

```bash
curl http://localhost:8000/api/v1/health | jq .
```

Open the interactive Swagger docs at <http://localhost:8000/docs>.

## Ingest documents

Mount a folder into the container at `/data/input` (the compose file already
mounts `./data/input` for you), drop PDFs into it, and call:

```bash
curl -X POST http://localhost:8000/api/v1/ingest/folder \
  -H "Content-Type: application/json" \
  -d '{"path":"/data/input","recursive":true,"force":false}'
```

Or run ingestion inline (no Celery; useful for ad-hoc tests):

```bash
docker compose exec api python scripts/ingest_folder.py \
  --path /data/input --recursive --mode inline
```

To re-process a single document by ID:

```bash
docker compose exec api python scripts/reindex_document.py \
  --document-id <uuid>
```

## Search

```bash
curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "ГОСТ 21.501-93 рабочие чертежи",
    "limit": 10,
    "search_mode": "hybrid",
    "filters": {"min_ocr_confidence": 0.5}
  }' | jq .
```

`search_mode` can be `lexical`, `semantic`, or `hybrid`. Hybrid mode runs the
following steps:

1. BM25 top-K from OpenSearch
2. Dense top-K from Qdrant (BGE-M3)
3. Reciprocal Rank Fusion (RRF) merge of the two result lists
4. Top 30-50 candidates re-scored by the BGE reranker (if available)
5. Final top-N returned with citation metadata

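Step 3 above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the project's actual merge code: the function name `rrf_merge`, the chunk ids, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, top_n=10):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains,
    where rank starts at 1 for the best hit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical top-K lists, best hit first.
bm25_hits = ["c3", "c1", "c7"]    # lexical ranking (OpenSearch)
dense_hits = ["c1", "c9", "c3"]   # dense ranking (Qdrant)

merged = rrf_merge([bm25_hits, dense_hits])
for chunk_id, score in merged:
    print(chunk_id, round(score, 5))
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is why RRF needs no score normalization between BM25 and cosine similarity.
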
Each hit includes the document name, page, block id, table/figure id where
applicable, and quality flags, so AI consumers can produce verifiable answers
with citations.

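For Python consumers, the curl call above can be mirrored with the standard library. The sketch below only builds the request; the helper name `build_search_request` and its keyword arguments are hypothetical, and the response is not parsed because its exact shape is not specified here.

```python
import json
from urllib import request

# Modes documented for the `search_mode` field.
SEARCH_MODES = {"lexical", "semantic", "hybrid"}


def build_search_request(query, *, limit=10, search_mode="hybrid",
                         min_ocr_confidence=None,
                         base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /api/v1/search."""
    if search_mode not in SEARCH_MODES:
        raise ValueError(f"unsupported search_mode: {search_mode!r}")
    payload = {"query": query, "limit": limit, "search_mode": search_mode}
    if min_ocr_confidence is not None:
        payload["filters"] = {"min_ocr_confidence": min_ocr_confidence}
    return request.Request(
        f"{base_url}/api/v1/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("ГОСТ 21.501-93 рабочие чертежи",
                           min_ocr_confidence=0.5)
# Against a running stack, send it with: request.urlopen(req)
```
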
## Inspect the system

| Service       | URL                                  | Credentials                      |
|---------------|--------------------------------------|----------------------------------|
| API docs      | <http://localhost:8000/docs>         | -                                |
| MinIO console | <http://localhost:9001>              | `legacyhub` / `legacyhub-secret` |
| OpenSearch    | <http://localhost:9200>              | -                                |
| Qdrant UI     | <http://localhost:6333/dashboard>    | -                                |
| Postgres      | `localhost:5432`                     | `legacyhub` / `legacyhub`        |

```bash
# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'
# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'
# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
  -c "SELECT id, original_file_name, status FROM documents LIMIT 20;"
```

## Environment variables

See [`.env.example`](.env.example) for the full list. Key ones:

- `OCR_LANGUAGES` - Tesseract language packs (default `rus+eng`).
- `OCR_ENABLED` - set to `false` to skip OCR completely.
- `DOCLING_OCR_ENABLED` - turns on Docling's own OCR. OCRmyPDF is preferred;
  enable this only if you do not run OCRmyPDF.
- `EMBEDDING_DEVICE` / `RERANKER_DEVICE` - `cpu`, `cuda`, or `mps`.
- `MAX_DOCUMENT_TIMEOUT_SECONDS` - per-document soft timeout for extraction.

## Handling poor OCR

- The pipeline computes per-chunk `quality_flags`:
  - `low_ocr_confidence`, `very_short_text`, `possible_garbled_text`
  - `table_detected`, `figure_detected`, `handwriting_detected`
  - `needs_manual_review` (set when any of the above except table/figure
    detection is set)
- Garbled chunks are still indexed, so they remain searchable, but you can
  filter out low-confidence chunks at query time via
  `filters.min_ocr_confidence`.
- Original text is always preserved verbatim (no destructive cleaning); the
  `normalized_text` field is a derived form used purely for recall.
- We deliberately preserve technical / legal identifiers (ГОСТ, document
  numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.

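The relationship between the flags listed above can be sketched as follows. Only the flag names and the `needs_manual_review` rule come from this README; the function `compute_flags`, its thresholds, and the garbled-text heuristic are hypothetical stand-ins, not the pipeline's actual logic.

```python
from dataclasses import dataclass


@dataclass
class QualityFlags:
    low_ocr_confidence: bool = False
    very_short_text: bool = False
    possible_garbled_text: bool = False
    table_detected: bool = False
    figure_detected: bool = False
    handwriting_detected: bool = False

    @property
    def needs_manual_review(self) -> bool:
        # Any flag except table/figure detection triggers manual review.
        return any((
            self.low_ocr_confidence,
            self.very_short_text,
            self.possible_garbled_text,
            self.handwriting_detected,
        ))


def compute_flags(text, ocr_confidence, *, min_confidence=0.5,
                  min_chars=20, min_alnum_ratio=0.5):
    """Derive text-based flags; every threshold here is an assumed value."""
    visible = [ch for ch in text if not ch.isspace()]
    # Assumed heuristic: mostly non-alphanumeric text is likely OCR noise.
    alnum_ratio = (
        sum(ch.isalnum() for ch in visible) / len(visible) if visible else 1.0
    )
    return QualityFlags(
        low_ocr_confidence=ocr_confidence < min_confidence,
        very_short_text=len(text.strip()) < min_chars,
        possible_garbled_text=alnum_ratio < min_alnum_ratio,
    )


clean = compute_flags("ГОСТ 21.501-93 рабочие чертежи", 0.9)
noisy = compute_flags("@@## ~~ %%", 0.2)
print(clean.needs_manual_review, noisy.needs_manual_review)
```
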
## Handling handwriting

- We do not attempt to recognize handwriting reliably. Suspected handwritten
  fragments are flagged with `block_type=handwriting` and
  `quality_flags.handwriting_detected=true` plus `needs_manual_review=true`.
- The API does not present handwriting recognition output as authoritative.

## Idempotency

- Document identity = SHA-256 of the original PDF. Re-ingesting the same PDF
  reuses the existing `documents` row.
- The pipeline deletes the document's existing chunks and re-creates them
  before re-indexing; OpenSearch and Qdrant entries are deleted by document
  before the re-upsert, so re-running ingestion does not duplicate data.

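Content-based identity can be sketched with `hashlib`. The helper names `sha256_of_stream` and `document_hash` are hypothetical; the point is that two byte-identical PDFs always map to the same digest, regardless of file name or path.

```python
import hashlib
import io
from pathlib import Path


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary stream in fixed-size chunks to keep memory use flat."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def document_hash(path):
    """SHA-256 hex digest of the file at `path`."""
    with Path(path).open("rb") as handle:
        return sha256_of_stream(handle)


# Two in-memory "files" with identical bytes yield the same identity.
first = sha256_of_stream(io.BytesIO(b"%PDF-1.7 demo bytes"))
second = sha256_of_stream(io.BytesIO(b"%PDF-1.7 demo bytes"))
other = sha256_of_stream(io.BytesIO(b"%PDF-1.7 other bytes"))
print(first == second, first == other)
```
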
## Failure handling

- Each pipeline stage records a row in `processing_events` with `level` and
  `data` JSON.
- A document that fails OCR is marked `OCR_FAILED` and the pipeline moves on.
- A document that fails Docling is marked `EXTRACTION_FAILED`.
- Indexing failures mark the document as `FAILED`; re-running
  `scripts/reindex_document.py` resumes processing.

## Scaling notes (~70k PDFs)

- The Celery `worker` service is horizontally scalable:
  `docker compose up -d --scale worker=8` (or run several Compose stacks
  pointing at the same Postgres / MinIO / OpenSearch / Qdrant).
- The embedding step is the biggest cost. Set `EMBEDDING_DEVICE=cuda` and use a
  GPU-aware worker image if a GPU is available.
- OpenSearch defaults to 1 shard / 0 replicas; increase both for production.
  Replicas can be raised on a live index with `PUT /legacy_chunks/_settings`,
  but the primary shard count is fixed at index creation, so changing it means
  recreating (or splitting) the index and reindexing.
- Qdrant is single-node by default; for very large corpora run Qdrant in
  distributed (cluster) mode or shard by document hash.
- For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. BGE-M3 dense at 1024d
  is ~14 GB on disk (assuming float32 values, before index overhead); budget
  memory accordingly.

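The storage estimate in the last bullet can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte float32 values and counts raw vector data only; index structures and payloads are not included.

```python
def dense_vector_bytes(num_documents, chunks_per_document, dimensions,
                       bytes_per_value=4):
    """Raw size of all dense vectors, assuming float32 (4 bytes per value)."""
    return num_documents * chunks_per_document * dimensions * bytes_per_value


num_vectors = 70_000 * 50
total_bytes = dense_vector_bytes(70_000, 50, 1024)
print(f"{num_vectors:,} vectors, {total_bytes / 1e9:.1f} GB of raw vectors")
```

Halving `bytes_per_value` (for example with float16 storage) halves the estimate, so the function makes it easy to compare options before sizing disks and RAM.
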
## Tests

```bash
pip install -e ".[dev]"
pytest -q
```

The unit suite covers hashing, chunking, quality flags, hybrid result merging,
and duplicate detection. Integration tests run against the live Compose stack
via `scripts/smoke_test.py`.

## Repository layout

```
legacy-knowledge-indexer/
  app/
    api/              # FastAPI routes & schemas
    db/               # SQLAlchemy models + Alembic migrations
    indexing/         # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/        # scanner, OCR, Docling, chunking, quality, pipeline
    storage/          # MinIO client + key conventions
    utils/            # hashing, text cleaning, language detection, PDF helpers
    workers/          # Celery app + tasks
  scripts/            # init / ingest / reindex / smoke
  tests/              # unit tests
  docker/Dockerfile   # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini
```

## Known limitations

- Docling's exact JSON shape varies between versions. The extractor uses
  defensive lookups and falls back to `paragraph` when a label is unknown.
- We do not currently ship a sparse vector path (BGE-M3 supports it). Hybrid
  recall is achieved via OpenSearch BM25 + Qdrant dense, merged with RRF,
  which has been observed to outperform sparse-only or dense-only setups on
  noisy OCR.
- Figure description does not invoke a VLM; captions plus a placeholder are
  used. Plug a VLM into `figure_processor.persist_figures` if needed.
- The API has no authentication; put it behind a reverse proxy that handles
  authentication.

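The defensive label lookup mentioned in the first bullet might look like the sketch below. The mapping `LABEL_TO_BLOCK_TYPE`, the helper `block_type_for`, and the input shapes are illustrative assumptions, not the extractor's real code or a guaranteed Docling schema; only the fallback to `paragraph` is taken from this README.

```python
# Hypothetical mapping from extractor labels to internal block types.
LABEL_TO_BLOCK_TYPE = {
    "text": "paragraph",
    "paragraph": "paragraph",
    "section_header": "heading",
    "table": "table",
    "picture": "figure",
}


def block_type_for(item):
    """Resolve a block type, tolerating missing or oddly shaped labels."""
    raw = item.get("label") if isinstance(item, dict) else None
    # Some JSON shapes may nest the label, e.g. {"label": {"value": "table"}}.
    if isinstance(raw, dict):
        raw = raw.get("value") or raw.get("name")
    if not isinstance(raw, str):
        return "paragraph"
    # Unknown labels fall back to "paragraph" instead of raising.
    return LABEL_TO_BLOCK_TYPE.get(raw.strip().lower(), "paragraph")


print(block_type_for({"label": "Table"}))
print(block_type_for({"label": {"value": "picture"}}))
print(block_type_for({"label": "mystery_label"}))
```
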
## License

Apache-2.0.