Vadim Malanov 7f72171572 chore: bootstrap repository with governance docs

Initialize git, add Apache-2.0 LICENSE, .gitattributes (LF line
endings), AGENTS.md (entry points, stack, discovery order, baseline
checks), RUNBOOK.md (dev boot, prod deploy with overlay, ingestion,
failures, rollback, scaling notes), .env.prod.example with rotated
credential placeholders, and dev-only warnings on .env.example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 16:41:50 +03:00


AGENTS — LegacyHUB

Operating instructions for AI agents working inside this repository.

What this project is

LegacyHUB ingests legacy PDF archives at scale (~70k docs), runs OCR (OCRmyPDF/Tesseract), extracts structured content with Docling, indexes chunks into PostgreSQL + OpenSearch (BM25) + Qdrant (BGE-M3 dense), and serves a hybrid lexical + semantic search API (FastAPI) reranked by BGE.
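
One common way to merge a lexical (BM25) ranking with a dense-vector ranking before reranking is reciprocal rank fusion. The sketch below is illustrative only: this file does not say which fusion method the search API uses, and the helper name `rrf_fuse`, the constant `k=60`, and the chunk ids are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion.

    Hypothetical helper: the repository's real fusion logic may differ.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


lexical = ["c1", "c2", "c3"]   # made-up BM25 hits, best first
semantic = ["c3", "c1", "c4"]  # made-up dense hits, best first
fused = rrf_fuse([lexical, semantic])
```

Ids that appear near the top of both lists end up first, which is the property a hybrid search needs before the fused candidates are handed to a reranker.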

It is one module of the TeamHUB Suite.

Stack (canonical)

Layer	Tech
API	FastAPI, Pydantic v2, SQLAlchemy 2, Alembic
Workers	Celery + Redis
OCR	OCRmyPDF + Tesseract (rus+eng)
Extract	Docling
Store	PostgreSQL 16, MinIO, OpenSearch 2.x, Qdrant
ML	BAAI/bge-m3 (dense, 1024), bge-reranker-v2-m3
Frontend	React 18, TS 5, Vite 5, Tailwind, shadcn, TanStack Query, Zustand, Framer Motion, Recharts
Tests	pytest
CI	GitHub Actions

Entry points

Backend API — app/main.py (uvicorn app.main:app)
Celery worker — celery -A app.workers.celery_app worker
CLI scripts — scripts/init_db.py, scripts/init_opensearch.py, scripts/init_qdrant.py, scripts/ingest_folder.py, scripts/reindex_document.py, scripts/smoke_test.py
Frontend dev — cd frontend && npm run dev (port 5273)
Docker — docker compose up -d --build (dev), docker compose -f docker-compose.yml -f docker-compose.prod.yml ... (prod)

Inventory

legacy-knowledge-indexer/
  app/
    api/             routers + Pydantic schemas
    db/              SQLAlchemy models + Alembic migrations
    indexing/        OpenSearch + Qdrant clients, embeddings, reranker, hybrid
    ingestion/       scanner, OCR, Docling, chunker, table/figure processors,
                     quality, pipeline
    storage/         MinIO client + key conventions + ensure_artifact helper
    utils/           hashing, text cleaning, language detection, pdf helpers
    workers/         Celery app + tasks
  scripts/           init / ingest / reindex / smoke CLIs
  tests/             pytest suite
  docker/Dockerfile  API + worker image (OCRmyPDF + tesseract-rus+eng)
  docker-compose.yml dev orchestration
  docker-compose.prod.yml  production overlay
  frontend/          React app — see frontend/README.md
  .github/workflows  CI gate (ruff + pytest + tsc + vite build + compose config)

Code discovery order

Bounded discovery order for this repo. Work through the layers below in order and stop at the first one that returns a usable answer; mark the remaining layers "not available" for the task.

Grep / rg — reliable fallback, always available. First choice for strings, configs, docs, scripts, route paths, hashes.
Glob — file shape lookups (app/**/*.py).
Semantic search (if Sourcegraph, Zoekt, or Serena MCP is configured at user level) — go-to-symbol, references. Document the smoke command before relying on results.
Docling / extracted Markdown in MinIO — for content questions about ingested documents, not source code.

Smoke command for the first layer (Grep / rg):

rg --version && rg "@router" app/api -n

If any indexer times out or returns stale results, capture the error and fall through to the next layer. Do not retry the same failing indexer.
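
The fall-through rule can be sketched as a small loop. This is one reading of the rule, not code from the repository: the function `discover`, the layer names, and the status strings are hypothetical, and each lookup is assumed to either return an answer, return nothing, or raise.

```python
from typing import Callable, Optional

# A layer is a (name, lookup) pair; lookup returns an answer, None, or raises.
Layer = tuple[str, Callable[[str], Optional[str]]]


def discover(query: str, layers: list[Layer]) -> tuple[Optional[str], dict[str, str]]:
    """Try each layer once, in order; return the first usable answer and a status log."""
    status: dict[str, str] = {}
    answer: Optional[str] = None
    for name, lookup in layers:
        if answer is not None:
            # An earlier layer already answered; later layers are not consulted.
            status[name] = "not available"
            continue
        try:
            result = lookup(query)
        except Exception as exc:  # e.g. a timeout or a stale-index error
            # Capture the error and fall through; the same layer is never retried.
            status[name] = f"failed: {exc}"
            continue
        if result:
            answer = result
            status[name] = "used"
        else:
            status[name] = "no usable answer"
    return answer, status


def broken_indexer(query: str) -> Optional[str]:
    raise TimeoutError("indexer timed out")


answer, status = discover(
    "@router",
    [
        ("semantic", broken_indexer),
        ("grep", lambda q: f"match for {q}"),
        ("glob", lambda q: None),
    ],
)
```

The status dictionary doubles as the record of which layers failed and which were skipped for the task.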

Module contracts (high level)

app/ingestion/pipeline.py::process_document_id(document_id, run_id) — single document end-to-end. Idempotent. Returns {status, chunks, error?}.
app/indexing/hybrid_search.py::run_search(SearchRequest) -> SearchResponse — the only public search entry. Lexical + semantic + reranker.
app/storage/artifacts.py::ensure_artifact(...) — single source of truth for document_artifacts upsert. Used by scanner, pipeline, table_processor, figure_processor.
app/storage/minio_client.py::MinioStorage — bucket bootstrap + retryable put/get. Never bypass for object IO.
app/indexing/opensearch_client.py::ensure_index() / index_chunks() — chunk index lifecycle.
app/indexing/qdrant_client.py::ensure_collection() / upsert_chunks() — vector index lifecycle.
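
To make the "Idempotent. Returns {status, chunks, error?}" contract concrete, here is a minimal in-memory sketch. It is not the pipeline's implementation: `process_once`, the `work` callback, the status values "ok" and "failed", and the reading of `chunks` as a count are all assumptions made for illustration.

```python
from typing import Callable, TypedDict


class ProcessResult(TypedDict, total=False):
    status: str  # assumed values: "ok" or "failed"
    chunks: int  # assumed to be the number of chunks produced
    error: str   # present only when processing failed


# In-memory stand-in for persisted results, keyed by document id.
_results: dict[str, ProcessResult] = {}


def process_once(document_id: str, run_id: str, work: Callable[[str], int]) -> ProcessResult:
    """Run `work` at most once per successfully processed document.

    run_id only mirrors the documented signature; this sketch does not use it.
    """
    done = _results.get(document_id)
    if done is not None and done["status"] == "ok":
        # Idempotent: a repeat call returns the stored result and does no new work.
        return done
    try:
        result: ProcessResult = {"status": "ok", "chunks": work(document_id)}
    except Exception as exc:
        # Failures are reported in the result instead of being silenced.
        result = {"status": "failed", "chunks": 0, "error": str(exc)}
    _results[document_id] = result
    return result
```

Calling `process_once` twice for the same document runs the work once. A failed document is retried on the next call because only successful results short-circuit.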

Runtime vs legacy scope

Everything under app/ is runtime. scripts/ are operational tools. tests/ are non-runtime. There is no archived/legacy code yet.

Baseline checks

# Backend
python -m pip check
python -m compileall -q app scripts tests
python -m pytest tests/ -q

# Frontend
cd frontend
npx tsc --noEmit
npm run lint
npm run build

# Docker
docker compose config --quiet
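
For agents that prefer a single call, the checks above could be wrapped in a small runner. This is a sketch, not a script that exists in the repository: `run_checks` and `CHECKS` are hypothetical names, and it assumes the backend and Docker commands run from the repository root while the frontend commands run inside frontend/.

```python
import subprocess
from typing import Callable, Optional

# Commands copied from the baseline checks, paired with the directory to run them in.
CHECKS: list[tuple[list[str], Optional[str]]] = [
    (["python", "-m", "pip", "check"], None),
    (["python", "-m", "compileall", "-q", "app", "scripts", "tests"], None),
    (["python", "-m", "pytest", "tests/", "-q"], None),
    (["npx", "tsc", "--noEmit"], "frontend"),
    (["npm", "run", "lint"], "frontend"),
    (["npm", "run", "build"], "frontend"),
    (["docker", "compose", "config", "--quiet"], None),
]


def _run(cmd: list[str], cwd: Optional[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd, cwd=cwd).returncode


def run_checks(runner: Callable[[list[str], Optional[str]], int] = _run) -> list[str]:
    """Run every check and return the commands that failed (empty means all passed)."""
    failed: list[str] = []
    for cmd, cwd in CHECKS:
        # Keep going after a failure so one run reports every broken check.
        if runner(cmd, cwd) != 0:
            failed.append(" ".join(cmd))
    return failed
```

Calling `run_checks()` from the repository root executes the real commands. Passing a custom `runner` lets the loop itself be exercised without pip, npm, or Docker installed.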

Operating rules for agents

Inspect before changing. git status first.
Small reviewable commits. One ownership boundary per commit.
Do not delete files, routes, migrations, or env vars without evidence (see software-project-delivery-governance skill).
Do not invent secret values. Use .env.example placeholders.
Use ensure_artifact instead of re-implementing artifact upsert.
Use existing UI primitives in frontend/src/components/ui/* before adding new ones.
Never commit node_modules/, dist/, .env, data/input/*, data/work/*.
Failures must be logged via processing_events (backend) or sonner toast (frontend) — not silenced.
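
The never-commit rule can be checked mechanically before a commit. The sketch below is an illustration, not a hook that ships with the repository: `forbidden_paths` and the exact glob patterns are assumptions derived from the rule's wording, and the example paths are made up.

```python
from fnmatch import fnmatchcase

# Patterns derived from the never-commit rule. `*` also matches "/" in fnmatch,
# so "node_modules/*" covers nested files as well.
FORBIDDEN = [
    "node_modules/*",
    "*/node_modules/*",
    "dist/*",
    "*/dist/*",
    ".env",
    "data/input/*",
    "data/work/*",
]


def forbidden_paths(paths: list[str]) -> list[str]:
    """Return the paths that match a never-commit pattern, in input order."""
    return [p for p in paths if any(fnmatchcase(p, pat) for pat in FORBIDDEN)]


staged = [
    "app/main.py",
    ".env",
    ".env.example",
    "frontend/node_modules/x/index.js",
    "data/input/a.pdf",
    "frontend/dist/app.js",
]
blocked = forbidden_paths(staged)
```

The list of staged paths could come from `git diff --cached --name-only`. Note that `.env.example` passes, since only the exact name `.env` is blocked.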

Ownership

Backend, ingestion, search — Vadim Malanov.
Frontend, design system — Vadim Malanov.

Where to update what

New behavior — update README.md.
New repeated agent rule — update this file.
New deployment / recovery step — update RUNBOOK.md.
Cleanup findings — docs/cleanup-report.md (create on demand).