AGENTS — LegacyHUB
Operating instructions for AI agents working inside this repository.
What this project is
LegacyHUB ingests legacy PDF archives at scale (~70k docs), runs OCR (OCRmyPDF/Tesseract), extracts structured content with Docling, and indexes chunks into PostgreSQL + OpenSearch (BM25) + Qdrant (BGE-M3 dense). It serves a hybrid lexical + semantic search API (FastAPI) whose results are reranked by bge-reranker-v2-m3.
It is one module of the TeamHUB Suite.
Stack (canonical)
| Layer | Tech |
|---|---|
| API | FastAPI, Pydantic v2, SQLAlchemy 2, Alembic |
| Workers | Celery + Redis |
| OCR | OCRmyPDF + Tesseract (rus+eng) |
| Extract | Docling |
| Store | PostgreSQL 16, MinIO, OpenSearch 2.x, Qdrant |
| ML | BAAI/bge-m3 (dense, 1024), bge-reranker-v2-m3 |
| Frontend | React 18, TS 5, Vite 5, Tailwind, shadcn, TanStack Query, Zustand, Framer Motion, Recharts |
| Tests | pytest |
| CI | GitHub Actions |
Entry points
- Backend API: app/main.py (uvicorn app.main:app)
- Celery worker: celery -A app.workers.celery_app worker
- CLI scripts: scripts/init_db.py, scripts/init_opensearch.py, scripts/init_qdrant.py, scripts/ingest_folder.py, scripts/reindex_document.py, scripts/smoke_test.py
- Frontend dev: cd frontend && npm run dev (port 5273)
- Docker: docker compose up -d --build (dev), docker compose -f docker-compose.yml -f docker-compose.prod.yml ... (prod)
Inventory
legacy-knowledge-indexer/
  app/
    api/          routers + Pydantic schemas
    db/           SQLAlchemy models + Alembic migrations
    indexing/     OpenSearch + Qdrant clients, embeddings, reranker, hybrid
    ingestion/    scanner, OCR, Docling, chunker, table/figure processors, quality, pipeline
    storage/      MinIO client + key conventions + ensure_artifact helper
    utils/        hashing, text cleaning, language detection, pdf helpers
    workers/      Celery app + tasks
  scripts/                 init / ingest / reindex / smoke CLIs
  tests/                   pytest suite
  docker/Dockerfile        API + worker image (OCRmyPDF + tesseract-rus+eng)
  docker-compose.yml       dev orchestration
  docker-compose.prod.yml  production overlay
  frontend/                React app (see frontend/README.md)
  .github/workflows        CI gate (ruff + pytest + tsc + vite build + compose config)
Code discovery order
Bounded discovery order for this repo. Use the first available that returns a usable answer; mark the rest "not available" for the task.
- Grep / rg: reliable fallback, always available. First choice for strings, configs, docs, scripts, route paths, hashes.
- Glob: file shape lookups (app/**/*.py).
- Semantic search (if Sourcegraph, Zoekt, or Serena MCP is configured at user level): go-to-symbol, references. Document the smoke command before relying on results.
- Docling / extracted Markdown in MinIO: for content questions about ingested documents, not source code.
Smoke command for layer 1 (Grep / rg):
rg --version && rg "@router" app/api -n
If any indexer times out or returns stale results, capture the error and fall through. Do not retry the same failing indexer.
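The fall-through rule can be sketched in a few lines of Python. This is only an illustration: the `discover` function and the `Layer` alias are hypothetical names, not part of this repository, and each layer is assumed to be a callable that either returns a list of hits or raises.

```python
from typing import Callable, Optional

# Hypothetical layer type: a display name plus a search callable.
Layer = tuple[str, Callable[[str], list[str]]]


def discover(
    query: str, layers: list[Layer]
) -> tuple[Optional[str], list[str], dict[str, str]]:
    """Try each layer once, in order, and return the first usable answer."""
    errors: dict[str, str] = {}
    for name, search in layers:
        try:
            hits = search(query)
        except Exception as exc:  # e.g. timeout, stale index, tool missing
            errors[name] = repr(exc)  # capture the error, never retry
            continue
        if hits:
            return name, hits, errors
        errors[name] = "not available"  # empty answer: mark and fall through
    return None, [], errors
```

The returned `errors` mapping records which layers were skipped and why, so the agent can report them as "not available" for the task.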
Module contracts (high level)
- app/ingestion/pipeline.py::process_document_id(document_id, run_id): single document end-to-end. Idempotent. Returns {status, chunks, error?}.
- app/indexing/hybrid_search.py::run_search(SearchRequest) -> SearchResponse: the only public search entry. Lexical + semantic + reranker.
- app/storage/artifacts.py::ensure_artifact(...): single source of truth for document_artifacts upsert. Used by scanner, pipeline, table_processor, figure_processor.
- app/storage/minio_client.py::MinioStorage: bucket bootstrap + retryable put/get. Never bypass for object IO.
- app/indexing/opensearch_client.py::ensure_index() / index_chunks(): chunk index lifecycle.
- app/indexing/qdrant_client.py::ensure_collection() / upsert_chunks(): vector index lifecycle.
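To make the "lexical + semantic" part of the search contract concrete, here is a minimal sketch of one common way to merge two ranked hit lists: reciprocal rank fusion (RRF). This file does not say how run_search actually combines OpenSearch and Qdrant results, so the fusion method, the `rrf_fuse` name, and the constant `k = 60` are all assumptions for illustration. The reranker step is left out.

```python
from collections import defaultdict


def rrf_fuse(
    lexical: list[str], semantic: list[str], k: int = 60, top_n: int = 10
) -> list[str]:
    """Merge two ranked lists of chunk ids with reciprocal rank fusion.

    Assumption: each input is already sorted best-first by its own backend.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in (lexical, semantic):
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk ranked highly in either list gets a larger contribution.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for a stable order.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


print(rrf_fuse(["a", "b", "c"], ["b", "d", "a"]))  # ['b', 'a', 'd', 'c']
```

Chunks that appear in both lists ("a" and "b") rise above chunks found by only one backend, which is the behaviour a hybrid search typically wants before handing candidates to a reranker.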
Runtime vs legacy scope
Everything under app/ is runtime. scripts/ are operational tools. tests/
are non-runtime. There is no archived/legacy code yet.
Baseline checks
# Backend
python -m pip check
python -m compileall -q app scripts tests
python -m pytest tests/ -q
# Frontend
cd frontend
npx tsc --noEmit
npm run lint
npm run build
# Docker
docker compose config --quiet
Operating rules for agents
- Inspect before changing. git status first.
- Small reviewable commits. One ownership boundary per commit.
- Do not delete files, routes, migrations, or env vars without evidence (see software-project-delivery-governance skill).
- Do not invent secret values. Use .env.example placeholders.
- Use ensure_artifact instead of re-implementing artifact upsert.
- Use existing UI primitives in frontend/src/components/ui/* before adding new ones.
- Never commit node_modules/, dist/, .env, data/input/*, data/work/*.
- Failures must be logged via processing_events (backend) or sonner toast (frontend), not silenced.
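The never-commit rule lends itself to a quick self-check before committing. The sketch below is hypothetical (the `forbidden_paths` helper does not exist in this repository) and assumes the list of staged paths has already been collected, for example from `git diff --cached --name-only`.

```python
from fnmatch import fnmatchcase

# Patterns derived from the never-commit rule; "*" also matches "/" in fnmatch.
FORBIDDEN = ["node_modules/*", "dist/*", ".env", "data/input/*", "data/work/*"]


def forbidden_paths(staged: list[str]) -> list[str]:
    """Return the staged paths that match a never-commit pattern."""
    bad = []
    for path in staged:
        parts = path.split("/")
        # Test every suffix of the path so nested copies such as
        # frontend/node_modules/ are caught as well as top-level ones.
        suffixes = ["/".join(parts[i:]) for i in range(len(parts))]
        if any(fnmatchcase(s, pat) for s in suffixes for pat in FORBIDDEN):
            bad.append(path)
    return bad
```

Note that `.env.example` is deliberately not matched, since the rules above expect placeholders to live there.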
Ownership
- Backend, ingestion, search — Vadim Malanov.
- Frontend, design system — Vadim Malanov.
Where to update what
- New behavior: update README.md.
- New repeated agent rule: update this file.
- New deployment / recovery step: update RUNBOOK.md.
- Cleanup findings: docs/cleanup-report.md (create on demand).