# AGENTS: LegacyHUB

Operating instructions for AI agents working inside this repository.

## What this project is

LegacyHUB ingests legacy PDF archives at scale (~70k docs), runs OCR (OCRmyPDF/Tesseract), extracts structured content with Docling, indexes chunks into PostgreSQL + OpenSearch (BM25) + Qdrant (BGE-M3 dense), and serves a hybrid lexical + semantic search API (FastAPI) reranked by BGE. It is one module of the TeamHUB Suite.

## Stack (canonical)

| Layer    | Tech                                          |
|----------|-----------------------------------------------|
| API      | FastAPI, Pydantic v2, SQLAlchemy 2, Alembic   |
| Workers  | Celery + Redis                                |
| OCR      | OCRmyPDF + Tesseract (rus+eng)                |
| Extract  | Docling                                       |
| Store    | PostgreSQL 16, MinIO, OpenSearch 2.x, Qdrant  |
| ML       | BAAI/bge-m3 (dense, 1024), bge-reranker-v2-m3 |
| Frontend | React 18, TS 5, Vite 5, Tailwind, shadcn, TanStack Query, Zustand, Framer Motion, Recharts |
| Tests    | pytest                                        |
| CI       | GitHub Actions                                |

## Entry points

- **Backend API**: `app/main.py` (`uvicorn app.main:app`)
- **Celery worker**: `celery -A app.workers.celery_app worker`
- **CLI scripts**: `scripts/init_db.py`, `scripts/init_opensearch.py`, `scripts/init_qdrant.py`, `scripts/ingest_folder.py`, `scripts/reindex_document.py`, `scripts/smoke_test.py`
- **Frontend dev**: `cd frontend && npm run dev` (port 5273)
- **Docker**: `docker compose up -d --build` (dev), `docker compose -f docker-compose.yml -f docker-compose.prod.yml ...` (prod)

## Inventory

```text
legacy-knowledge-indexer/
  app/
    api/          routers + Pydantic schemas
    db/           SQLAlchemy models + Alembic migrations
    indexing/     OpenSearch + Qdrant clients, embeddings, reranker, hybrid
    ingestion/    scanner, OCR, Docling, chunker, table/figure processors, quality, pipeline
    storage/      MinIO client + key conventions + ensure_artifact helper
    utils/        hashing, text cleaning, language detection, pdf helpers
    workers/      Celery app + tasks
  scripts/        init / ingest / reindex / smoke CLIs
  tests/          pytest suite
  docker/Dockerfile         API + worker image (OCRmyPDF + tesseract-rus+eng)
  docker-compose.yml        dev orchestration
  docker-compose.prod.yml   production overlay
  frontend/                 React app, see frontend/README.md
  .github/workflows         CI gate (ruff + pytest + tsc + vite build + compose config)
```

## Code discovery order

Bounded discovery order for this repo. Use the first available that returns a usable answer; mark the rest "not available" for the task.

1. **Grep / rg**: reliable fallback, always available. First choice for strings, configs, docs, scripts, route paths, hashes.
2. **Glob**: file shape lookups (`app/**/*.py`).
3. **Semantic search** (if Sourcegraph, Zoekt, or Serena MCP is configured at user level): go-to-symbol, references. Document the smoke command before relying on results.
4. **Docling / extracted Markdown in MinIO**: for content questions about ingested documents, not source code.

Smoke command for layer 1:

```bash
rg --version && rg "@router" app/api -n
```

If any indexer times out or returns stale results, capture the error and fall through. Do not retry the same failing indexer.

## Module contracts (high level)

- `app/ingestion/pipeline.py::process_document_id(document_id, run_id)`: single document end-to-end. Idempotent. Returns `{status, chunks, error?}`.
- `app/indexing/hybrid_search.py::run_search(SearchRequest) -> SearchResponse`: the only public search entry. Lexical + semantic + reranker.
- `app/storage/artifacts.py::ensure_artifact(...)`: single source of truth for `document_artifacts` upsert. Used by scanner, pipeline, table_processor, figure_processor.
- `app/storage/minio_client.py::MinioStorage`: bucket bootstrap + retryable put/get. Never bypass for object IO.
- `app/indexing/opensearch_client.py::ensure_index() / index_chunks()`: chunk index lifecycle.
- `app/indexing/qdrant_client.py::ensure_collection() / upsert_chunks()`: vector index lifecycle.

## Runtime vs legacy scope

Everything under `app/` is runtime. `scripts/` are operational tools. `tests/` are non-runtime. There is no archived/legacy code yet.

## Baseline checks

```bash
# Backend
python -m pip check
python -m compileall -q app scripts tests
python -m pytest tests/ -q

# Frontend
cd frontend
npx tsc --noEmit
npm run lint
npm run build

# Docker
docker compose config --quiet
```

## Operating rules for agents

- Inspect before changing. `git status` first.
- Small reviewable commits. One ownership boundary per commit.
- Do not delete files, routes, migrations, or env vars without evidence (see `software-project-delivery-governance` skill).
- Do not invent secret values. Use `.env.example` placeholders.
- Use `ensure_artifact` instead of re-implementing artifact upsert.
- Use existing UI primitives in `frontend/src/components/ui/*` before adding new ones.
- Never commit `node_modules/`, `dist/`, `.env`, `data/input/*`, `data/work/*`.
- Failures must be logged via `processing_events` (backend) or `sonner` toast (frontend), not silenced.

## Ownership

- Backend, ingestion, search: Vadim Malanov.
- Frontend, design system: Vadim Malanov.

## Where to update what

- New behavior: update `README.md`.
- New repeated agent rule: update this file.
- New deployment / recovery step: update `RUNBOOK.md`.
- Cleanup findings: `docs/cleanup-report.md` (create on demand).
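
The fall-through rule from the code discovery order (try each layer once, in order, capture the error, never retry a failing indexer) can be sketched in Python. This is a minimal illustration under stated assumptions, not code from this repository: `Layer`, `DiscoveryResult`, `first_usable`, and the stand-in layer callables are hypothetical names, and real layers would shell out to `rg` or call a configured indexer.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Layer:
    """One discovery layer (hypothetical type): a name plus a query callable."""
    name: str
    query: Callable[[str], list[str]]  # returns matches; may raise on timeout


@dataclass
class DiscoveryResult:
    layer: Optional[str]  # name of the layer that answered, or None
    matches: list[str]
    skipped: dict[str, str] = field(default_factory=dict)  # layer -> reason


def first_usable(layers: list[Layer], needle: str) -> DiscoveryResult:
    """Try each layer exactly once, in order; fall through on error or empty answer."""
    skipped: dict[str, str] = {}
    for layer in layers:
        try:
            matches = layer.query(needle)
        except Exception as exc:
            # Capture the error and move on; the failing layer is not retried.
            skipped[layer.name] = f"not available: {exc}"
            continue
        if matches:
            return DiscoveryResult(layer.name, matches, skipped)
        skipped[layer.name] = "not available: no usable answer"
    return DiscoveryResult(None, [], skipped)


def timed_out(_: str) -> list[str]:
    """Stand-in for an indexer that times out."""
    raise TimeoutError("indexer timed out")


# Stand-in layers for illustration only; the file name below is made up.
layers = [
    Layer("rg", lambda q: []),
    Layer("glob", lambda q: []),
    Layer("semantic", timed_out),
    Layer("extracted-markdown", lambda q: [f"doc-123.md: {q}"]),
]

result = first_usable(layers, "warranty terms")
print(result.layer, result.matches, result.skipped)
```

Recording a reason per skipped layer mirrors the instruction to mark the remaining layers "not available" for the task, and gives the agent something concrete to report when every layer falls through.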