chore: bootstrap repository with governance docs

Initialize git, add Apache-2.0 LICENSE, .gitattributes (LF line
endings), AGENTS.md (entry points, stack, discovery order, baseline
checks), RUNBOOK.md (dev boot, prod deploy with overlay, ingestion,
failures, rollback, scaling notes), .env.prod.example with rotated
credential placeholders, and dev-only warnings on .env.example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vadim Malanov
2026-05-13 16:41:50 +03:00
commit 7f72171572
157 changed files with 11298 additions and 0 deletions

README.md

@@ -0,0 +1,233 @@
# LegacyHUB - Knowledge Indexing & Hybrid Search for Legacy PDF Archives
LegacyHUB is a production-oriented, fully open-source backend for ingesting,
OCR-ing, structurally extracting, and hybrid-searching large legacy PDF
archives (designed for ~70,000 documents).
It is part of the **TeamHUB** suite.
```
PDFs ──▶ Scanner ──▶ MinIO (originals)
           └▶ OCRmyPDF (Tesseract) ──▶ MinIO (ocr_pdf)
                 └▶ Docling ──▶ MD + JSON ──▶ MinIO
                       └▶ blocks/tables/figures
                             ├▶ PostgreSQL
                             ├▶ OpenSearch (BM25)
                             └▶ Qdrant (BGE-M3 dense)
FastAPI /search ◀── BGE Reranker ◀── RRF merge ◀───────────────────────────┘
```
## Stack
| Component | Tech |
|------------------|------------------------------------------|
| OCR | OCRmyPDF + Tesseract (rus + eng) |
| Extraction | Docling (layout, tables, figures) |
| Object storage | MinIO (S3-compatible) |
| Relational store | PostgreSQL 16 |
| Lexical search | OpenSearch 2.x (BM25 + ru/en analyzers) |
| Vector search | Qdrant 1.x (named dense vector) |
| Embeddings | BAAI/bge-m3 (dense, 1024d) |
| Reranker | BAAI/bge-reranker-v2-m3 |
| API | FastAPI + Uvicorn |
| Workers | Celery + Redis |
| Logging | structlog (JSON) |
## Quick start
```bash
cp .env.example .env
docker compose up -d --build
docker compose exec api python scripts/init_db.py
docker compose exec api python scripts/init_opensearch.py
docker compose exec api python scripts/init_qdrant.py
docker compose exec api python scripts/smoke_test.py
```
Health check:
```bash
curl http://localhost:8000/api/v1/health | jq .
```
Open the interactive Swagger docs at <http://localhost:8000/docs>.
## Ingest documents
Mount a folder into the container at `/data/input` (the compose file already
mounts `./data/input` for you), drop PDFs into it, and call:
```bash
curl -X POST http://localhost:8000/api/v1/ingest/folder \
  -H "Content-Type: application/json" \
  -d '{"path":"/data/input","recursive":true,"force":false}'
```
Or run inline (no Celery, useful for ad-hoc tests):
```bash
docker compose exec api python scripts/ingest_folder.py \
  --path /data/input --recursive --mode inline
```
To re-process a single document by ID:
```bash
docker compose exec api python scripts/reindex_document.py \
  --document-id <uuid>
```
## Search
```bash
curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "ГОСТ 21.501-93 рабочие чертежи",
    "limit": 10,
    "search_mode": "hybrid",
    "filters": {"min_ocr_confidence": 0.5}
  }' | jq .
```
`search_mode` can be `lexical`, `semantic`, or `hybrid`. Hybrid mode runs five steps:
1. BM25 top-K from OpenSearch
2. Dense top-K from Qdrant (BGE-M3)
3. Reciprocal Rank Fusion (RRF) merge of the two ranked lists
4. Top 30-50 fused candidates re-scored by the BGE reranker (if available)
5. Final top-N returned with citation metadata

Each hit includes the document name, page, block id, table/figure id where
applicable, and quality flags, so AI consumers can produce verifiable answers
with citations.
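
The fusion step in hybrid mode can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is not the project's implementation: the function name `rrf_merge`, the constant `k=60` (a commonly used value), and the chunk ids are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, limit=10):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each id receives sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:limit]


bm25_hits = ["c1", "c2", "c3"]   # hypothetical OpenSearch top-K
dense_hits = ["c3", "c1", "c4"]  # hypothetical Qdrant top-K
fused = rrf_merge([bm25_hits, dense_hits])
print([chunk_id for chunk_id, _ in fused])  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is the behavior the reranker then refines.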
## Inspect the system
| Service | URL | Credentials |
|---------------|--------------------------------------|----------------------------|
| API docs | <http://localhost:8000/docs> | - |
| MinIO console | <http://localhost:9001> | `legacyhub` / `legacyhub-secret` |
| OpenSearch | <http://localhost:9200> | - |
| Qdrant UI | <http://localhost:6333/dashboard> | - |
| Postgres | `localhost:5432` | `legacyhub` / `legacyhub` |
```bash
# Count docs in OpenSearch
curl 'http://localhost:9200/legacy_chunks/_count'
# Inspect Qdrant collection
curl 'http://localhost:6333/collections/legacy_chunks'
# Browse Postgres
docker compose exec postgres psql -U legacyhub -d legacyhub \
  -c "SELECT id, original_file_name, status FROM documents LIMIT 20;"
```
## Environment variables
See [`.env.example`](.env.example) for the full list. Key ones:
- `OCR_LANGUAGES` - Tesseract language packs (default `rus+eng`).
- `OCR_ENABLED` - set `false` to skip OCR completely.
- `DOCLING_OCR_ENABLED` - turns on Docling's own OCR. Prefer OCRmyPDF; enable this only if you do not run OCRmyPDF.
- `EMBEDDING_DEVICE` / `RERANKER_DEVICE` - `cpu`, `cuda`, or `mps`.
- `MAX_DOCUMENT_TIMEOUT_SECONDS` - per-document soft timeout for extraction.
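
As a rough illustration of how these variables might be read, here is a small settings loader. It is a sketch, not the project's config module: `PipelineSettings`, `load_settings`, and `as_bool` are hypothetical names, and every default except `rus+eng` is an assumption (check `.env.example` for the real values).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PipelineSettings:
    ocr_languages: str
    ocr_enabled: bool
    docling_ocr_enabled: bool
    embedding_device: str
    reranker_device: str
    max_document_timeout_seconds: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> PipelineSettings:
    """Build settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return PipelineSettings(
        ocr_languages=env.get("OCR_LANGUAGES", "rus+eng"),  # default stated above
        ocr_enabled=as_bool(env.get("OCR_ENABLED", "true")),  # assumed default
        docling_ocr_enabled=as_bool(env.get("DOCLING_OCR_ENABLED", "false")),  # assumed
        embedding_device=env.get("EMBEDDING_DEVICE", "cpu"),  # assumed default
        reranker_device=env.get("RERANKER_DEVICE", "cpu"),  # assumed default
        max_document_timeout_seconds=int(
            env.get("MAX_DOCUMENT_TIMEOUT_SECONDS", "600")  # assumed default
        ),
    )


settings = load_settings({"OCR_ENABLED": "false", "EMBEDDING_DEVICE": "cuda"})
print(settings)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in unit tests.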
## Handling poor OCR
- The pipeline computes per-chunk `quality_flags`:
  - `low_ocr_confidence`, `very_short_text`, `possible_garbled_text`
  - `table_detected`, `figure_detected`, `handwriting_detected`
  - `needs_manual_review` (any of the above except table/figure detection)
- Garbled chunks are still indexed - so they remain searchable - but the flags
let you filter them out at query time via `filters.min_ocr_confidence`.
- Original text is always preserved verbatim (no destructive cleaning); the
`normalized_text` field is a derived form used purely for recall.
- We deliberately preserve technical / legal identifiers (ГОСТ, document
numbers, dates, serials, slashes, dashes, dots, brackets) during normalization.
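
The rule for `needs_manual_review` (any flag set, except table/figure detection) can be sketched as follows. The flag names come from the list above; the helper `with_manual_review` and the plain-dict representation are assumptions, not the pipeline's actual code.

```python
# Flags that describe content type rather than quality problems.
REVIEW_EXEMPT = {"table_detected", "figure_detected"}


def with_manual_review(flags):
    """Return a copy of flags with needs_manual_review derived from the rest."""
    base = {
        name: bool(value)
        for name, value in flags.items()
        if name != "needs_manual_review"  # always recompute, never trust input
    }
    needs_review = any(
        value for name, value in base.items() if name not in REVIEW_EXEMPT
    )
    return {**base, "needs_manual_review": needs_review}


clean_table = with_manual_review({"table_detected": True})
noisy_table = with_manual_review(
    {"table_detected": True, "low_ocr_confidence": True}
)
print(clean_table["needs_manual_review"])  # False
print(noisy_table["needs_manual_review"])  # True
```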
## Handling handwriting
- We do not attempt to recognize handwriting reliably. Suspected handwritten
fragments are flagged with `block_type=handwriting` and
`quality_flags.handwriting_detected=true` plus `needs_manual_review=true`.
- The API does not present handwriting recognition output as authoritative.
## Idempotency
- Document identity = SHA256 of the original PDF. Re-ingesting the same PDF
reuses the existing `documents` row.
- The pipeline deletes existing chunks for the document and re-creates them
before re-indexing; OpenSearch and Qdrant entries are deleted-by-document
before re-upsert. So re-running ingestion does not duplicate data.
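
Content-based identity can be illustrated with a streaming SHA256 over the file bytes. This is a minimal sketch: the function name `pdf_identity` and the 1 MiB read size are assumptions, and the demo files are throwaway placeholders.

```python
import hashlib
import os
import tempfile


def pdf_identity(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two files with identical bytes but different names share one identity.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "scan.pdf")
    second = os.path.join(tmp, "renamed-copy.pdf")
    for path in (first, second):
        with open(path, "wb") as handle:
            handle.write(b"%PDF-1.4 demo bytes")
    id_a = pdf_identity(first)
    id_b = pdf_identity(second)

print(id_a == id_b)  # True
```

Because the identity depends only on content, renaming or moving a PDF does not create a new `documents` row.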
## Failure handling
- Each pipeline stage records a row in `processing_events` with `level` and
`data` JSON.
- A document that fails OCR is marked `OCR_FAILED` and the pipeline moves on.
- A document that fails Docling is marked `EXTRACTION_FAILED`.
- Indexing failures bring the document to `FAILED`; re-running
`scripts/reindex_document.py` resumes processing.
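
The mapping from failed stage to document status can be sketched like this. The status strings are the ones listed above; the `Stage` enum, `failure_event`, and the exact shape of the event record are hypothetical and only loosely mirror a `processing_events` row.

```python
from enum import Enum


class Stage(str, Enum):
    OCR = "ocr"
    EXTRACTION = "extraction"
    INDEXING = "indexing"


# Document status recorded when a given stage fails.
FAILURE_STATUS = {
    Stage.OCR: "OCR_FAILED",
    Stage.EXTRACTION: "EXTRACTION_FAILED",
    Stage.INDEXING: "FAILED",
}


def failure_event(stage, error):
    """Build the new status plus an event record for a failed stage."""
    return {
        "status": FAILURE_STATUS[stage],
        "event": {
            "level": "error",
            "data": {"stage": stage.value, "error": str(error)},
        },
    }


result = failure_event(Stage.EXTRACTION, ValueError("unreadable layout"))
print(result["status"])  # EXTRACTION_FAILED
```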
## Scaling notes (~70k PDFs)
- The Celery `worker` service is horizontally scalable:
  `docker compose up -d --scale worker=8` (or run several Compose stacks
  pointing at the same Postgres / MinIO / OpenSearch / Qdrant).
- The embedding step is the biggest cost. Set `EMBEDDING_DEVICE=cuda` and a
GPU-aware worker image if available.
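- OpenSearch defaults to 1 shard / 0 replicas - increase for production.
  Replicas can be raised on the live index (`PUT /legacy_chunks/_settings`);
  the primary shard count is fixed at index creation, so changing it means
  recreating the index and reindexing.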
- Qdrant is single-node by default; for very large corpora use the cluster
build of Qdrant or shard by document hash.
- For 70k PDFs at ~50 chunks each, expect ~3.5M vectors. BGE-M3 dense at 1024d
  is ~14 GB of raw float32 data (3.5M x 1024 x 4 bytes) before index overhead;
  budget disk and memory accordingly.
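
The vector-storage estimate above can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte float32 components and ignores index structures and payload, which add to the total.

```python
documents = 70_000
chunks_per_document = 50   # rough average from the estimate above
dimensions = 1024          # BGE-M3 dense vector size
bytes_per_component = 4    # assumption: float32 storage

vectors = documents * chunks_per_document
raw_bytes = vectors * dimensions * bytes_per_component

print(f"{vectors:,} vectors")             # 3,500,000 vectors
print(f"{raw_bytes / 1e9:.1f} GB raw")    # 14.3 GB raw
print(f"{raw_bytes / 2**30:.1f} GiB raw")  # 13.4 GiB raw
```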
## Tests
```bash
pip install -e ".[dev]"
pytest -q
```
The unit suite covers hashing, chunking, quality flags, hybrid result merging,
and duplicate detection. Integration tests run against the live Compose stack
via `scripts/smoke_test.py`.
## Repository layout
```
legacy-knowledge-indexer/
  app/
    api/          # FastAPI routes & schemas
    db/           # SQLAlchemy models + Alembic migrations
    indexing/     # OpenSearch, Qdrant, embeddings, reranker, hybrid search
    ingestion/    # scanner, OCR, Docling, chunking, quality, pipeline
    storage/      # MinIO client + key conventions
    utils/        # hashing, text cleaning, language detection, PDF helpers
    workers/      # Celery app + tasks
  scripts/        # init / ingest / reindex / smoke
  tests/          # unit tests
  docker/Dockerfile # API + worker image
  docker-compose.yml
  .env.example
  pyproject.toml
  alembic.ini
```
## Known limitations
- Docling's exact JSON shape varies between versions. The extractor uses
defensive lookups and falls back to `paragraph` when a label is unknown.
- We do not currently ship a sparse vector path (BGE-M3 supports it). Hybrid
recall is achieved via OpenSearch BM25 + Qdrant dense, merged with RRF -
which has been observed to outperform sparse-only or dense-only setups on
noisy OCR.
- Figure description does not invoke a VLM; captions plus a placeholder are
used. Plug a VLM into `figure_processor.persist_figures` if needed.
- No authentication on the API surface - put it behind your reverse proxy.
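
The defensive lookup with a `paragraph` fallback mentioned in the first limitation might look roughly like this. It is a sketch only: the keys `label` and `type`, the set of known block types, and the function name `block_type` are assumptions and do not describe Docling's actual JSON schema.

```python
# Hypothetical set of block types the indexer knows how to handle.
KNOWN_BLOCK_TYPES = {"paragraph", "heading", "table", "figure", "list_item"}


def block_type(item):
    """Map an extracted item to a known block type, defaulting to paragraph."""
    raw = None
    if isinstance(item, dict):
        # Try several possible key names instead of assuming one JSON shape.
        raw = item.get("label") or item.get("type")
    label = raw.strip().lower() if isinstance(raw, str) else ""
    return label if label in KNOWN_BLOCK_TYPES else "paragraph"


print(block_type({"label": "Table"}))      # table
print(block_type({"type": "figure"}))      # figure
print(block_type({"label": "mystery"}))    # paragraph
print(block_type(None))                    # paragraph
```

Falling back to `paragraph` keeps unknown content searchable as plain text instead of failing the whole document.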
## License
Apache-2.0.