perf(reranker): add benchmark harness and passage clipping

- scripts/benchmark_reranker.py exercises the configured reranker with
  synthetic queries or live OpenSearch samples and prints p50/p95/p99 latency,
  mean latency, and pairs/sec throughput. Supports --warmup, --candidates,
  --passage-length, --source, and a --json-only mode for CI.
- app/indexing/reranker.py clips passages to 2048 characters before scoring so
  a runaway chunk cannot starve the cross-encoder beyond bge-reranker-v2-m3's
  training window.
- RUNBOOK.md gains a Reranker benchmark section with CPU/GPU SLO targets and a
  remediation ladder (lower top-K, raise batch size, switch device, disable
  reranker) when measured p95 exceeds budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
34
RUNBOOK.md
@@ -117,6 +117,40 @@ docker compose exec postgres psql -U legacyhub -d legacyhub -c \
should not be rolled back casually. Restore from backup via the standard
TeamHUB Suite backup runbook.

## Reranker benchmark

The reranker is the latency-defining stage of the hybrid search path. Run the
benchmark on every hardware change (CPU vs GPU, instance type, batch size)
before promoting the configuration.

```bash
# synthetic warmup + 32 queries x 40 candidates, ~700-char passages
docker compose exec api python scripts/benchmark_reranker.py \
  --queries 32 --candidates 40 --warmup 4

# real corpus sample (after some documents are indexed)
docker compose exec api python scripts/benchmark_reranker.py \
  --source opensearch --query "ГОСТ 21.501-93" --candidates 40
```
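
The commit message says the harness reports p50/p95/p99 latency, mean latency, and pairs/sec throughput. The snippet below is a minimal sketch of how such a summary could be derived from per-query timings; it is not the code in `scripts/benchmark_reranker.py`. The function name `summarize`, the result keys, and the assumption that queries run sequentially (so throughput is total pairs over total time) are all illustrative.

```python
import statistics


def summarize(latencies_ms, pairs_per_query):
    """Summarize per-query latencies in milliseconds (hypothetical helper)."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index k - 1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    # Assumption: queries ran one after another, so wall time is the sum.
    total_seconds = sum(latencies_ms) / 1000.0
    total_pairs = len(latencies_ms) * pairs_per_query
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "mean_ms": statistics.fmean(latencies_ms),
        "pairs_per_s": total_pairs / total_seconds,
    }


if __name__ == "__main__":
    # Ten queries of 100 ms each, 40 candidates per query.
    print(summarize([100.0] * 10, pairs_per_query=40))
```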

Target SLOs (subject to revision once staging numbers land):

| Metric              | CPU target   | GPU target    |
|---------------------|-------------:|--------------:|
| p95 latency / query | < 700 ms     | < 120 ms      |
| Throughput          | > 60 pairs/s | > 600 pairs/s |

If the measured p95 exceeds the budget, try these options in order of preference:

1. Lower `RERANK_CANDIDATES` (default 40; reducing to 20 roughly halves the work).
2. Increase `RERANKER_BATCH_SIZE` (memory permitting).
3. Switch `RERANKER_DEVICE=cuda` and use a GPU-capable image.
4. Disable the reranker (`RERANKER_ENABLED=false`) and accept raw RRF order. The
   API still returns useful results; the `reranked` field reports whether
   reranking was actually applied.
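
The budget check that triggers the ladder could be automated along the following lines. This is a sketch only: the thresholds are copied from the SLO table, but `SLO_TARGETS`, `slo_violations`, and `next_steps` are hypothetical names, and the measured values are passed in as plain numbers because the field names of the `--json-only` output are not shown in this commit.

```python
# Budgets copied from the runbook's SLO table; the dict layout is illustrative.
SLO_TARGETS = {
    "cpu": {"p95_ms": 700.0, "pairs_per_s": 60.0},
    "gpu": {"p95_ms": 120.0, "pairs_per_s": 600.0},
}

# Remediation options in the runbook's order of preference.
REMEDIATION_LADDER = [
    "lower RERANK_CANDIDATES (default 40)",
    "increase RERANKER_BATCH_SIZE, memory permitting",
    "set RERANKER_DEVICE=cuda on a GPU-capable image",
    "set RERANKER_ENABLED=false and accept raw RRF order",
]


def slo_violations(device, p95_ms, pairs_per_s):
    """Return human-readable violations; an empty list means within budget."""
    target = SLO_TARGETS[device]
    violations = []
    if p95_ms >= target["p95_ms"]:
        violations.append(
            f"p95 {p95_ms:.0f} ms is not below {target['p95_ms']:.0f} ms"
        )
    if pairs_per_s <= target["pairs_per_s"]:
        violations.append(
            f"throughput {pairs_per_s:.0f} pairs/s is not above "
            f"{target['pairs_per_s']:.0f} pairs/s"
        )
    return violations


def next_steps(device, p95_ms):
    """Return the ladder when p95 is over budget, otherwise an empty list."""
    if p95_ms < SLO_TARGETS[device]["p95_ms"]:
        return []
    return list(REMEDIATION_LADDER)
```

A CI job could fail the build when `slo_violations` returns a non-empty list and print `next_steps` as a hint for the operator.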

Passages are clipped to 2048 characters before being fed to the cross-encoder
so a runaway chunk cannot blow the latency budget.
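
A minimal sketch of the clipping step, assuming plain character truncation. The real change lives in `app/indexing/reranker.py` and is not shown in this diff, so `MAX_PASSAGE_CHARS`, `clip_passage`, and `build_pairs` are hypothetical names.

```python
MAX_PASSAGE_CHARS = 2048  # limit stated in the runbook and commit message


def clip_passage(text, limit=MAX_PASSAGE_CHARS):
    """Return text truncated to at most `limit` characters."""
    # Python slices by code point, so Cyrillic text counts one per character.
    return text if len(text) <= limit else text[:limit]


def build_pairs(query, passages, limit=MAX_PASSAGE_CHARS):
    """Build (query, passage) pairs with every passage clipped."""
    return [(query, clip_passage(passage, limit)) for passage in passages]


if __name__ == "__main__":
    pairs = build_pairs("ГОСТ 21.501-93", ["short passage", "x" * 5000])
    print([len(passage) for _, passage in pairs])
```

Clipping by characters rather than tokens keeps the step cheap and independent of the tokenizer, at the cost of an imprecise token count.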

## Scaling notes (~70k PDFs)

- Workers scale horizontally: `docker compose up -d --scale worker=8`.