LegacyHUB/AGENTS.md
Vadim Malanov 7f72171572 chore: bootstrap repository with governance docs
Initialize git, add Apache-2.0 LICENSE, .gitattributes (LF line
endings), AGENTS.md (entry points, stack, discovery order, baseline
checks), RUNBOOK.md (dev boot, prod deploy with overlay, ingestion,
failures, rollback, scaling notes), .env.prod.example with rotated
credential placeholders, and dev-only warnings on .env.example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:41:50 +03:00


# AGENTS — LegacyHUB
Operating instructions for AI agents working inside this repository.
## What this project is
LegacyHUB ingests legacy PDF archives at scale (~70k docs), runs OCR
(OCRmyPDF/Tesseract), extracts structured content with Docling, indexes chunks
into PostgreSQL + OpenSearch (BM25) + Qdrant (BGE-M3 dense), and serves a
hybrid lexical + semantic search API (FastAPI) reranked by BGE.
It is one module of the TeamHUB Suite.
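
The description above does not say how the lexical and semantic result lists are merged before reranking. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), one common way to combine two ranked lists; the function name, the constant `k = 60`, and the chunk ids are assumptions, not taken from this repository.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); items found in both lists add up.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


lexical = ["c1", "c2", "c3"]   # hypothetical BM25 order
semantic = ["c3", "c1", "c4"]  # hypothetical dense-vector order
fused = reciprocal_rank_fusion([lexical, semantic])
print([chunk_id for chunk_id, _ in fused])  # ['c1', 'c3', 'c2', 'c4']
```

In a pipeline shaped like the one described, the fused candidates would then be passed to the reranker, which produces the final order.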
## Stack (canonical)
| Layer | Tech |
|----------|-----------------------------------------------|
| API | FastAPI, Pydantic v2, SQLAlchemy 2, Alembic |
| Workers | Celery + Redis |
| OCR | OCRmyPDF + Tesseract (rus+eng) |
| Extract | Docling |
| Store | PostgreSQL 16, MinIO, OpenSearch 2.x, Qdrant |
| ML | BAAI/bge-m3 (dense, 1024), bge-reranker-v2-m3 |
| Frontend | React 18, TS 5, Vite 5, Tailwind, shadcn, TanStack Query, Zustand, Framer Motion, Recharts |
| Tests | pytest |
| CI | GitHub Actions |
## Entry points
- **Backend API** — `app/main.py` (`uvicorn app.main:app`)
- **Celery worker** — `celery -A app.workers.celery_app worker`
- **CLI scripts** — `scripts/init_db.py`, `scripts/init_opensearch.py`,
`scripts/init_qdrant.py`, `scripts/ingest_folder.py`,
`scripts/reindex_document.py`, `scripts/smoke_test.py`
- **Frontend dev** — `cd frontend && npm run dev` (port 5273)
- **Docker** — `docker compose up -d --build` (dev), `docker compose -f
docker-compose.yml -f docker-compose.prod.yml ...` (prod)
## Inventory
```text
legacy-knowledge-indexer/
app/
api/ routers + Pydantic schemas
db/ SQLAlchemy models + Alembic migrations
indexing/ OpenSearch + Qdrant clients, embeddings, reranker, hybrid
ingestion/ scanner, OCR, Docling, chunker, table/figure processors,
quality, pipeline
storage/ MinIO client + key conventions + ensure_artifact helper
utils/ hashing, text cleaning, language detection, pdf helpers
workers/ Celery app + tasks
scripts/ init / ingest / reindex / smoke CLIs
tests/ pytest suite
docker/Dockerfile API + worker image (OCRmyPDF + tesseract-rus+eng)
docker-compose.yml dev orchestration
docker-compose.prod.yml production overlay
frontend/ React app — see frontend/README.md
.github/workflows CI gate (ruff + pytest + tsc + vite build + compose config)
```
## Code discovery order
Discovery in this repo is bounded to the layers below, tried in order. Use the
first available layer that returns a usable answer; mark the rest "not
available" for the task.
1. **Grep / rg** — reliable fallback, always available. First choice for
strings, configs, docs, scripts, route paths, hashes.
2. **Glob** — file shape lookups (`app/**/*.py`).
3. **Semantic search** (if Sourcegraph, Zoekt, or Serena MCP is configured at
user level) — go-to-symbol, references. Document the smoke command before
relying on results.
4. **Docling / extracted Markdown in MinIO** — for content questions about
ingested documents, not source code.
Smoke command for layer 1:
```bash
rg --version && rg "@router" app/api -n
```
If an indexer times out or returns stale results, capture the error and fall
through to the next layer. Do not retry the same failing indexer.
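
The fall-through rule can be sketched as a small loop: try each layer once, record any error, and stop at the first usable answer. This is a minimal sketch; `discover`, the layer names, and the stub search functions are hypothetical, and the stub order is chosen only to exercise the failure path.

```python
from typing import Callable, Optional

# A layer is a name plus a search function that returns hits (or nothing).
Layer = tuple[str, Callable[[str], Optional[list[str]]]]


def discover(query: str, layers: list[Layer]) -> tuple[Optional[str], list[str], dict[str, str]]:
    """Try each layer once, in order; return (winning layer, hits, notes)."""
    notes: dict[str, str] = {}
    for index, (name, search) in enumerate(layers):
        try:
            hits = search(query)
        except Exception as exc:
            # Capture the error and fall through; the failing layer is not retried.
            notes[name] = f"failed: {exc}"
            continue
        if hits:
            # Layers after the winner are marked "not available" for this task.
            for later_name, _ in layers[index + 1:]:
                notes[later_name] = "not available"
            return name, hits, notes
        notes[name] = "no usable answer"
    return None, [], notes


def stale_indexer(query: str) -> list[str]:
    raise TimeoutError("indexer timed out")


def grep_like(query: str) -> list[str]:
    return [f"match for {query}"]


def glob_like(query: str) -> list[str]:
    return []


winner, hits, notes = discover(
    "@router",
    [("stale-indexer", stale_indexer), ("grep", grep_like), ("glob", glob_like)],
)
print(winner, notes)
```

The `notes` mapping doubles as the captured error record, so a task log can show why a layer was skipped.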
## Module contracts (high level)
- `app/ingestion/pipeline.py::process_document_id(document_id, run_id)` — single
document end-to-end. Idempotent. Returns `{status, chunks, error?}`.
- `app/indexing/hybrid_search.py::run_search(SearchRequest) -> SearchResponse` —
the only public search entry. Lexical + semantic + reranker.
- `app/storage/artifacts.py::ensure_artifact(...)` — single source of truth for
`document_artifacts` upsert. Used by scanner, pipeline, table_processor,
figure_processor.
- `app/storage/minio_client.py::MinioStorage` — bucket bootstrap + retryable
put/get. Never bypass for object IO.
- `app/indexing/opensearch_client.py::ensure_index() / index_chunks()` — chunk
index lifecycle.
- `app/indexing/qdrant_client.py::ensure_collection() / upsert_chunks()` —
vector index lifecycle.
## Runtime vs legacy scope
Everything under `app/` is runtime. `scripts/` are operational tools. `tests/`
are non-runtime. There is no archived/legacy code yet.
## Baseline checks
```bash
# Backend
python -m pip check
python -m compileall -q app scripts tests
python -m pytest tests/ -q
# Frontend (subshell, so the Docker check below still runs from the repo root)
(
  cd frontend
  npx tsc --noEmit
  npm run lint
  npm run build
)
# Docker
docker compose config --quiet
```
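
For agents that prefer to drive the baseline from Python, the same commands can be wrapped in a small runner that executes every check and reports the failures together. This is a sketch under assumptions: `BASELINE_CHECKS`, `run_checks`, and the injectable `runner` parameter are hypothetical names, and the commands are expected to run from the repository root.

```python
import subprocess
from typing import Callable, Optional

Check = tuple[list[str], Optional[str]]  # (command, working directory or None)

# Mirrors the shell commands listed above.
BASELINE_CHECKS: list[Check] = [
    (["python", "-m", "pip", "check"], None),
    (["python", "-m", "compileall", "-q", "app", "scripts", "tests"], None),
    (["python", "-m", "pytest", "tests/", "-q"], None),
    (["npx", "tsc", "--noEmit"], "frontend"),
    (["npm", "run", "lint"], "frontend"),
    (["npm", "run", "build"], "frontend"),
    (["docker", "compose", "config", "--quiet"], None),
]


def _run(command: list[str], cwd: Optional[str]) -> int:
    """Run one command and return its exit code (127 if the tool or directory is missing)."""
    try:
        return subprocess.run(command, cwd=cwd, check=False).returncode
    except FileNotFoundError:
        return 127


def run_checks(
    checks: list[Check] = BASELINE_CHECKS,
    runner: Callable[[list[str], Optional[str]], int] = _run,
) -> list[str]:
    """Run every check, even after a failure, and return the failed commands."""
    failed: list[str] = []
    for command, cwd in checks:
        if runner(command, cwd) != 0:
            failed.append(" ".join(command))
    return failed
```

Calling `run_checks()` with no arguments executes the real tools; passing a fake `runner` allows the reporting logic to be exercised without invoking them.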
## Operating rules for agents
- Inspect before changing. `git status` first.
- Small reviewable commits. One ownership boundary per commit.
- Do not delete files, routes, migrations, or env vars without evidence (see
`software-project-delivery-governance` skill).
- Do not invent secret values. Use `.env.example` placeholders.
- Use `ensure_artifact` instead of re-implementing artifact upsert.
- Use existing UI primitives in `frontend/src/components/ui/*` before adding new
ones.
- Never commit `node_modules/`, `dist/`, `.env`, `data/input/*`, `data/work/*`.
- Failures must be logged via `processing_events` (backend) or `sonner` toast
(frontend) — not silenced.
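
The "logged, not silenced" rule for backend failures can be sketched as a context manager that records an event and then re-raises. The event fields, the `record_failure` name, and the `write_event` callback are assumptions for illustration; the real `processing_events` schema is not shown in this file.

```python
import logging
from contextlib import contextmanager
from typing import Callable, Iterator

logger = logging.getLogger("legacyhub.sketch")


@contextmanager
def record_failure(
    document_id: str,
    stage: str,
    write_event: Callable[[dict], None],
) -> Iterator[None]:
    """Record any exception raised inside the block, then re-raise it."""
    try:
        yield
    except Exception as exc:
        # Hypothetical event shape; a real writer would insert into processing_events.
        write_event(
            {"document_id": document_id, "stage": stage, "level": "error", "message": str(exc)}
        )
        logger.exception("stage %s failed for document %s", stage, document_id)
        raise  # never swallow the failure


events: list[dict] = []
try:
    with record_failure("doc-1", "ocr", events.append):
        raise ValueError("unreadable page")
except ValueError:
    pass  # the caller decides how to proceed; the event is already recorded

print(events)
```

Re-raising keeps the decision about retries or task state with the caller, while the event row guarantees the failure leaves a trace.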
## Ownership
- Backend, ingestion, search — Vadim Malanov.
- Frontend, design system — Vadim Malanov.
## Where to update what
- New behavior — update `README.md`.
- New repeated agent rule — update this file.
- New deployment / recovery step — update `RUNBOOK.md`.
- Cleanup findings — `docs/cleanup-report.md` (create on demand).