chore: bootstrap repository with governance docs

Initialize git, add Apache-2.0 LICENSE, .gitattributes (LF line endings), AGENTS.md (entry points, stack, discovery order, baseline checks), RUNBOOK.md (dev boot, prod deploy with overlay, ingestion, failures, rollback, scaling notes), .env.prod.example with rotated credential placeholders, and dev-only warnings on .env.example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
147
AGENTS.md
Normal file
@@ -0,0 +1,147 @@
# AGENTS — LegacyHUB

Operating instructions for AI agents working inside this repository.

## What this project is

LegacyHUB ingests legacy PDF archives at scale (~70k docs), runs OCR
(OCRmyPDF/Tesseract), extracts structured content with Docling, indexes chunks
into PostgreSQL + OpenSearch (BM25) + Qdrant (BGE-M3 dense), and serves a
hybrid lexical + semantic search API (FastAPI) reranked by BGE.

It is one module of the TeamHUB Suite.
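
As a rough illustration of what "hybrid lexical + semantic" retrieval can look like, the sketch below merges a BM25 ranking and a dense ranking with reciprocal rank fusion (RRF). RRF is an assumption chosen for illustration only: this file does not say which fusion method the search API uses, and `rrf_merge`, the constant `k=60`, and the chunk ids are hypothetical.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists with reciprocal rank fusion (illustrative only)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


lexical = ["c1", "c2", "c3"]  # hypothetical BM25 order (OpenSearch)
dense = ["c3", "c1", "c4"]    # hypothetical dense order (Qdrant)
merged = rrf_merge([lexical, dense])
print(merged)
```

In a pipeline like the one described, a reranker would then rescore the top of the merged list; that step is omitted here.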

## Stack (canonical)

| Layer    | Tech                                          |
|----------|-----------------------------------------------|
| API      | FastAPI, Pydantic v2, SQLAlchemy 2, Alembic   |
| Workers  | Celery + Redis                                |
| OCR      | OCRmyPDF + Tesseract (rus+eng)                |
| Extract  | Docling                                       |
| Store    | PostgreSQL 16, MinIO, OpenSearch 2.x, Qdrant  |
| ML       | BAAI/bge-m3 (dense, 1024), bge-reranker-v2-m3 |
| Frontend | React 18, TS 5, Vite 5, Tailwind, shadcn, TanStack Query, Zustand, Framer Motion, Recharts |
| Tests    | pytest                                        |
| CI       | GitHub Actions                                |

## Entry points

- **Backend API** — `app/main.py` (`uvicorn app.main:app`)
- **Celery worker** — `celery -A app.workers.celery_app worker`
- **CLI scripts** — `scripts/init_db.py`, `scripts/init_opensearch.py`,
  `scripts/init_qdrant.py`, `scripts/ingest_folder.py`,
  `scripts/reindex_document.py`, `scripts/smoke_test.py`
- **Frontend dev** — `cd frontend && npm run dev` (port 5273)
- **Docker** — `docker compose up -d --build` (dev),
  `docker compose -f docker-compose.yml -f docker-compose.prod.yml ...` (prod)

## Inventory

```text
legacy-knowledge-indexer/
  app/
    api/          routers + Pydantic schemas
    db/           SQLAlchemy models + Alembic migrations
    indexing/     OpenSearch + Qdrant clients, embeddings, reranker, hybrid
    ingestion/    scanner, OCR, Docling, chunker, table/figure processors,
                  quality, pipeline
    storage/      MinIO client + key conventions + ensure_artifact helper
    utils/        hashing, text cleaning, language detection, pdf helpers
    workers/      Celery app + tasks
  scripts/                  init / ingest / reindex / smoke CLIs
  tests/                    pytest suite
  docker/Dockerfile         API + worker image (OCRmyPDF + tesseract-rus+eng)
  docker-compose.yml        dev orchestration
  docker-compose.prod.yml   production overlay
  frontend/                 React app — see frontend/README.md
  .github/workflows         CI gate (ruff + pytest + tsc + vite build + compose config)
```

## Code discovery order

Bounded discovery order for this repo. Use the first available layer that
returns a usable answer; mark the rest "not available" for the task.

1. **Grep / rg** — reliable fallback, always available. First choice for
   strings, configs, docs, scripts, route paths, hashes.
2. **Glob** — file shape lookups (`app/**/*.py`).
3. **Semantic search** (if Sourcegraph, Zoekt, or Serena MCP is configured at
   user level) — go-to-symbol, references. Document the smoke command before
   relying on results.
4. **Docling / extracted Markdown in MinIO** — for content questions about
   ingested documents, not source code.

Smoke command for layer 1:

```bash
rg --version && rg "@router" app/api -n
```

If any indexer times out or returns stale results, capture the error and fall
through. Do not retry the same failing indexer.
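
The fall-through rule above can be sketched in a few lines of Python. This is a minimal illustration, not tooling that exists in the repo: `discover`, the `Layer` alias, the status strings, and the stand-in lookups are all hypothetical.

```python
from typing import Callable, Optional

# A layer is a (name, lookup) pair; lookup returns an empty list or None
# when it has no usable answer, and may raise on timeouts or stale indexes.
Layer = tuple[str, Callable[[str], Optional[list[str]]]]


def discover(query: str, layers: list[Layer]) -> tuple[Optional[list[str]], dict[str, str]]:
    """Try layers in order, stop at the first usable answer, never retry a failure."""
    status: dict[str, str] = {}
    answer: Optional[list[str]] = None
    for name, lookup in layers:
        if answer is not None:
            status[name] = "not available"  # marked for this task, not queried
            continue
        try:
            result = lookup(query)
        except Exception as exc:
            status[name] = f"failed: {exc}"  # capture the error and fall through
            continue
        if result:
            answer = result
            status[name] = "used"
        else:
            status[name] = "no usable answer"
    return answer, status


def timed_out(query: str) -> list[str]:
    raise TimeoutError("timed out")


# Stand-in lookups in the documented order; real ones would shell out to rg, etc.
layers: list[Layer] = [
    ("rg", lambda q: []),
    ("glob", timed_out),
    ("semantic", lambda q: ["app/api/search.py"]),
    ("docling", lambda q: ["unused"]),
]
answer, status = discover("@router", layers)
print(answer, status)
```

Each layer is called at most once per query, which is what keeps the discovery bounded.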

## Module contracts (high level)

- `app/ingestion/pipeline.py::process_document_id(document_id, run_id)` — single
  document end-to-end. Idempotent. Returns `{status, chunks, error?}`.
- `app/indexing/hybrid_search.py::run_search(SearchRequest) -> SearchResponse` —
  the only public search entry. Lexical + semantic + reranker.
- `app/storage/artifacts.py::ensure_artifact(...)` — single source of truth for
  `document_artifacts` upsert. Used by scanner, pipeline, table_processor,
  figure_processor.
- `app/storage/minio_client.py::MinioStorage` — bucket bootstrap + retryable
  put/get. Never bypass for object IO.
- `app/indexing/opensearch_client.py::ensure_index() / index_chunks()` — chunk
  index lifecycle.
- `app/indexing/qdrant_client.py::ensure_collection() / upsert_chunks()` —
vector index lifecycle.
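
To make the first contract more concrete, here is a toy sketch of an idempotent call that returns the `{status, chunks, error?}` shape. It is not the real pipeline: the field types, the `"ok"` / `"failed"` status values, the in-memory store, and the names `process_document_id_sketch` and `_run_pipeline` are assumptions made for illustration.

```python
from typing import TypedDict


class _ResultBase(TypedDict):
    status: str  # assumed values: "ok" / "failed"
    chunks: int  # assumed to be a chunk count


class ProcessResult(_ResultBase, total=False):
    error: str  # optional, mirrors the `error?` field


_completed: dict[str, ProcessResult] = {}  # stand-in for persisted document state
pipeline_calls: list[str] = []             # records real work, to show idempotency


def _run_pipeline(document_id: str) -> int:
    """Hypothetical OCR -> extract -> chunk -> index step; returns a chunk count."""
    pipeline_calls.append(document_id)
    if document_id == "corrupt":
        raise ValueError("unreadable PDF")
    return 3


def process_document_id_sketch(document_id: str, run_id: str) -> ProcessResult:
    """Return the stored result for finished documents instead of redoing work."""
    # run_id is accepted only to match the documented signature; unused here.
    if document_id in _completed:
        return _completed[document_id]
    try:
        result: ProcessResult = {"status": "ok", "chunks": _run_pipeline(document_id)}
    except Exception as exc:
        # Failures are reported, not cached, so a later run can retry them.
        return {"status": "failed", "chunks": 0, "error": str(exc)}
    _completed[document_id] = result
    return result


first = process_document_id_sketch("doc-1", "run-a")
again = process_document_id_sketch("doc-1", "run-b")
failed = process_document_id_sketch("corrupt", "run-b")
print(first, again, failed)
```

The second call for `doc-1` does no pipeline work, which is the property "Idempotent" refers to in this sketch.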

## Runtime vs legacy scope

Everything under `app/` is runtime. `scripts/` are operational tools. `tests/`
are non-runtime. There is no archived/legacy code yet.

## Baseline checks

```bash
# Backend
python -m pip check
python -m compileall -q app scripts tests
python -m pytest tests/ -q

# Frontend
cd frontend
npx tsc --noEmit
npm run lint
npm run build

# Docker
docker compose config --quiet
```

## Operating rules for agents

- Inspect before changing. `git status` first.
- Small reviewable commits. One ownership boundary per commit.
- Do not delete files, routes, migrations, or env vars without evidence (see
  `software-project-delivery-governance` skill).
- Do not invent secret values. Use `.env.example` placeholders.
- Use `ensure_artifact` instead of re-implementing artifact upsert.
- Use existing UI primitives in `frontend/src/components/ui/*` before adding new
  ones.
- Never commit `node_modules/`, `dist/`, `.env`, `data/input/*`, `data/work/*`.
- Failures must be logged via `processing_events` (backend) or `sonner` toast
  (frontend) — not silenced.

## Ownership

- Backend, ingestion, search — Vadim Malanov.
- Frontend, design system — Vadim Malanov.

## Where to update what

- New behavior — update `README.md`.
- New repeated agent rule — update this file.
- New deployment / recovery step — update `RUNBOOK.md`.
- Cleanup findings — `docs/cleanup-report.md` (create on demand).