refactor: extract ensure_artifact into app/storage/artifacts.py

The artifact-upsert helper was duplicated four times (scanner.py,
table_processor.py, figure_processor.py, pipeline.py) with slightly
different signatures. Consolidate the copies into a single keyword-only
function keyed on (document_id, storage_key), the identity the schema
already enforces, so re-running the pipeline never creates duplicate rows.

scanner / table_processor / figure_processor now import the shared
helper directly. pipeline.py keeps a thin local wrapper to preserve
the positional call sites at three artifact upsert points (OCR_PDF,
MARKDOWN, DOCLING_JSON).

Tests: 24 passed (5 health + 19 original).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vadim Malanov
2026-05-13 16:51:54 +03:00
parent cd9977f8c3
commit a375ca55b9
6 changed files with 91 additions and 88 deletions

@@ -25,6 +25,7 @@ from app.db.models import (
     Page,
     ProcessingEvent,
 )
+from app.storage.artifacts import ensure_artifact
 from app.db.session import session_scope
 from app.indexing import opensearch_client, qdrant_client
 from app.indexing.embeddings import get_embedder
@@ -330,21 +331,14 @@ def _build_index_payloads(
 def _ensure_artifact(db, document_id: uuid.UUID, artifact_type: str, bucket: str, key: str) -> None:
-    existing = db.execute(
-        select(DocumentArtifact).where(
-            DocumentArtifact.document_id == document_id,
-            DocumentArtifact.storage_key == key,
-        )
-    ).scalar_one_or_none()
-    if existing:
-        return
-    db.add(
-        DocumentArtifact(
-            document_id=document_id,
-            artifact_type=artifact_type,
-            storage_bucket=bucket,
-            storage_key=key,
-        )
+    """Thin wrapper preserving the local positional signature used inside this
+    module while delegating to the shared helper."""
+    ensure_artifact(
+        db,
+        document_id=document_id,
+        artifact_type=artifact_type,
+        bucket=bucket,
+        key=key,
     )
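
The shared helper's body in app/storage/artifacts.py is not shown in this part of the diff. As a rough illustration of the pattern the commit message describes (a keyword-only, idempotent insert keyed on (document_id, storage_key), plus a thin positional wrapper), here is a minimal sketch. It uses the stdlib sqlite3 module in place of the project's SQLAlchemy session so that it runs on its own. The table name, column types, bucket and key values, and the demo function are assumptions made for illustration, not the project's actual code.

```python
import sqlite3
import uuid


def ensure_artifact(
    db: sqlite3.Connection,
    *,  # everything after db must be passed by keyword
    document_id: uuid.UUID,
    artifact_type: str,
    bucket: str,
    key: str,
) -> None:
    """Insert an artifact row unless (document_id, storage_key) already exists."""
    existing = db.execute(
        "SELECT 1 FROM document_artifacts"
        " WHERE document_id = ? AND storage_key = ?",
        (str(document_id), key),
    ).fetchone()
    if existing:
        return
    db.execute(
        "INSERT INTO document_artifacts"
        " (document_id, artifact_type, storage_bucket, storage_key)"
        " VALUES (?, ?, ?, ?)",
        (str(document_id), artifact_type, bucket, key),
    )


def _ensure_artifact(db, document_id, artifact_type, bucket, key) -> None:
    """Positional wrapper that delegates to the keyword-only helper."""
    ensure_artifact(
        db,
        document_id=document_id,
        artifact_type=artifact_type,
        bucket=bucket,
        key=key,
    )


def demo() -> int:
    """Run the same upsert twice and return the resulting row count."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema; the UNIQUE constraint mirrors the identity
    # the commit message says the real schema enforces.
    db.execute(
        "CREATE TABLE document_artifacts ("
        " document_id TEXT NOT NULL,"
        " artifact_type TEXT NOT NULL,"
        " storage_bucket TEXT NOT NULL,"
        " storage_key TEXT NOT NULL,"
        " UNIQUE (document_id, storage_key))"
    )
    doc_id = uuid.uuid4()
    for _ in range(2):  # simulate re-running the pipeline
        _ensure_artifact(db, doc_id, "MARKDOWN", "artifacts", "doc/output.md")
    return db.execute("SELECT COUNT(*) FROM document_artifacts").fetchone()[0]
```

Making the parameters keyword-only means a call site cannot silently swap bucket and key, which is one plausible reason to prefer it once four differing signatures are merged into one. The positional wrapper keeps existing call sites unchanged.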