Configuring Celery for Museum Data Sync

Operational Context

A Python automation engineer schedules the nightly job that pushes fifty thousand object records from a legacy collections system into the public catalog, and by morning the run has died with MemoryError or BrokerConnectionTimeout — leaving the online catalog half-updated and out of step with the source of record. This page resolves that exact failure: turning a default-configured Celery deployment into a deterministic, memory-bounded worker fleet that ingests CSV exports or CMS API payloads, validates each record against LIDO before it lands, and routes bad rows to a dead-letter queue instead of crashing the run. It is the distributed-task engine behind the broader async ingestion pipeline, the stage that runs heavy transforms without blocking the main fetch loop.

Root Cause Analysis

Default Celery settings are tuned for short, lightweight web tasks, and every one of those defaults works against a bulk cultural-heritage sync. Four distinct mechanisms combine to kill the overnight run.

Unbounded prefetch causes the memory spike. With the default worker_prefetch_multiplier of 4, each worker process reserves several tasks ahead of the one it is executing. When a task payload carries IIIF manifest URLs, OCR text blocks, or base64 thumbnails, a handful of reserved batches is enough to push resident set size past the container limit and trigger an OOM kill.
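
As a rough illustration of the mechanism, the sketch below estimates how much payload one worker node can hold in reserved messages. It assumes the prefetch limit is the multiplier times the number of pool processes; the function name, the concurrency of 4, and the 40 MB payload size are hypothetical figures chosen for the example. Real resident memory would be higher once Python object overhead is counted.

```python
def reserved_payload_bytes(concurrency: int, prefetch_multiplier: int, payload_bytes: int) -> int:
    """Rough bound on payload bytes held in one worker node's reserved messages."""
    # Assumption: prefetch limit = multiplier x number of pool processes.
    return concurrency * prefetch_multiplier * payload_bytes


MB = 1024 * 1024
default_bound = reserved_payload_bytes(concurrency=4, prefetch_multiplier=4, payload_bytes=40 * MB)
tuned_bound = reserved_payload_bytes(concurrency=4, prefetch_multiplier=1, payload_bytes=40 * MB)
print(default_bound // MB, tuned_bound // MB)  # 640 160
```

Under these assumed numbers, dropping the multiplier from 4 to 1 cuts the reserved payload to a quarter, which is the effect the configuration change in the next section relies on.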

Long-lived worker processes leak. A worker that never restarts accumulates uncollected reference cycles across thousands of tasks; the cyclic garbage collector’s periodic sweeps over an ever-growing object graph both stall throughput and let memory creep upward until the kernel reaps the process mid-batch.

Broker connection pools saturate. When several sync bursts fan out concurrently, the Redis or RabbitMQ connection pool exhausts its slots, and new task dispatches block until they time out as BrokerConnectionTimeout.

Schema drift corrupts the tail of the run. Malformed controlled-vocabulary terms and missing provenance fields slip past a loose transform and surface much later as database constraint violations — the same class of failure that strict schema validation with Pydantic is designed to catch at the task boundary. The fix is to bound concurrency, recycle workers, cap the connection pool, and reject drift explicitly rather than letting it propagate.

Canonical Solution

Replace the defaults with a memory-safe configuration module. Each override below neutralizes one of the failure vectors above; the inline comments explain the non-obvious choices.

python

from celery import Celery

app = Celery("museum_sync")
app.conf.update(
    broker_url="redis://localhost:6379/0",
    result_backend="redis://localhost:6379/1",
    # One task in flight per worker: kills the prefetch-driven memory spike.
    worker_prefetch_multiplier=1,
    # Cap broker connections so concurrent bursts cannot exhaust the pool.
    broker_pool_limit=10,
    broker_connection_retry_on_startup=True,
    # Ack only after the task finishes, so a killed worker's message requeues.
    task_acks_late=True,
    task_reject_on_worker_lost=True,
    # Recycle each worker after 100 tasks to release leaked memory to the OS.
    worker_max_tasks_per_child=100,
    task_default_queue="ingestion_high",
)

With the fleet bounded, the second half of the solution is deterministic chunking. A fifty-thousand-row export must never be dispatched as one task; split it into fixed-size batches so peak memory is a function of chunk size, not export size. Route the batches to a dedicated queue with a Celery group, which lets long-running asset work run beside metadata validation without one starving the other.

python

from collections.abc import Iterator
from typing import Any

from celery import group


def chunk_records(records: list[dict[str, Any]], size: int = 500) -> Iterator[list[dict[str, Any]]]:
    # Slice lazily so the whole export is never materialized twice in memory.
    for i in range(0, len(records), size):
        yield records[i:i + size]


def dispatch_sync_batch(records: list[dict[str, Any]]) -> None:
    # One subtask per chunk; the group fans them across the worker fleet.
    tasks = [process_chunk.s(chunk) for chunk in chunk_records(records)]
    group(tasks).apply_async(queue="ingestion_high")

Enforce the record contract at the task boundary with a Pydantic v2 model aligned to LIDO. Reject payloads with malformed rights statements or missing provenance before they reach the database, so a schema violation surfaces immediately instead of corrupting the index.

python

from typing import Any

from pydantic import BaseModel, Field, ValidationError


class LIDORecord(BaseModel):
    object_id: str = Field(alias="lido:objectID")
    title: str = Field(alias="lido:title")
    rights_statement: str = Field(alias="lido:rights")
    iiif_manifest_url: str | None = Field(default=None, alias="lido:iiifManifest")


def validate_and_transform(payload: dict[str, Any]) -> LIDORecord:
    try:
        return LIDORecord.model_validate(payload)
    except ValidationError as exc:
        # Re-raise as a domain error the task can branch on for dead-lettering.
        raise ValueError(f"Schema drift detected: {exc}") from exc
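
Because validate_and_transform raises on the first drifted payload, a caller that wants to keep going has to separate good rows from bad ones itself. The following is a minimal sketch of that separation, not the page's own implementation: partition_rows and require_object_id are hypothetical names, and require_object_id is a stand-in validator so the example runs without Pydantic installed.

```python
from collections.abc import Callable
from typing import Any


def partition_rows(
    rows: list[dict[str, Any]],
    validate: Callable[[dict[str, Any]], Any],
) -> tuple[list[Any], list[tuple[dict[str, Any], str]]]:
    """Split rows into validated records and (raw row, error message) pairs."""
    valid: list[Any] = []
    quarantined: list[tuple[dict[str, Any], str]] = []
    for row in rows:
        try:
            valid.append(validate(row))
        except ValueError as exc:
            # Keep the raw payload next to its error so it can be triaged later.
            quarantined.append((row, str(exc)))
    return valid, quarantined


def require_object_id(row: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical stand-in for validate_and_transform.
    if "lido:objectID" not in row:
        raise ValueError("Schema drift detected: missing lido:objectID")
    return row


good, bad = partition_rows(
    [{"lido:objectID": "PNT-004821"}, {"lido:title": "no id"}],
    require_object_id,
)
print(len(good), len(bad))  # 1 1
```

In real use, validate_and_transform would be passed as the validate argument, since it already raises ValueError on drift.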

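The task itself distinguishes recoverable failures from fatal ones. A transient ConnectionError from a CMS poll or a database write retries with exponential backoff. Fatal schema drift is rejected without requeue, so a broker with a dead-letter exchange configured (RabbitMQ) routes the chunk to a dead-letter queue for manual triage rather than looping forever. Reject only takes effect because task_acks_late is enabled. The Redis transport has no native dead-lettering and simply discards a rejected message, so on Redis the error log is the only record of the chunk unless the task persists the payload itself.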

python

import logging
from typing import Any

from celery.exceptions import Reject

logger = logging.getLogger(__name__)


@app.task(bind=True, max_retries=3, default_retry_delay=60)
def process_chunk(self, chunk: list[dict[str, Any]]) -> None:
    try:
        validated = [validate_and_transform(row) for row in chunk]
        # Bulk write / Elasticsearch bulk index over `validated` goes here.
    except ConnectionError as exc:
        # Backoff schedule: 60s, 120s, 240s across successive retries.
        raise self.retry(exc=exc, countdown=60 * (2 ** self.request.retries))
    except ValueError as exc:
        # Fatal drift: reject without requeue -> broker dead-letter queue.
        logger.error("Schema drift, dead-lettering chunk: %s", exc)
        raise Reject(str(exc), requeue=False)
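
The countdown expression in the task can be checked in isolation. The helper below is a hypothetical sketch, not part of Celery; the cap parameter is an added assumption that only comes into play if max_retries is raised well beyond 3.

```python
def backoff_countdown(retries: int, base: int = 60, cap: int = 900) -> int:
    """Delay in seconds before the next attempt, doubling per retry up to a cap."""
    return min(cap, base * (2 ** retries))


# self.request.retries is 0 on the first failure, then 1, then 2.
schedule = [backoff_countdown(n) for n in range(3)]
print(schedule)  # [60, 120, 240]
```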

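Two configuration values do the heavy lifting for memory. worker_max_tasks_per_child replaces each pool process after a fixed number of tasks, which returns its heap to the OS. task_acks_late=True, together with task_reject_on_worker_lost=True, gives at-least-once delivery, so a chunk whose worker is killed mid-task is redelivered rather than lost. At-least-once also means a chunk can run twice, so the bulk write should be idempotent, for example an upsert keyed on object_id. Export a resident-set-size metric through a Prometheus exporter to catch a slow leak before it reaches the container limit.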
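
One way to obtain a memory figure to export is the standard library's resource module. This is a hedged sketch under two assumptions: the worker runs on a Unix host, and peak RSS (rather than current RSS) is an acceptable signal. The function names are hypothetical, and wiring the value into a particular Prometheus client is left out.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def over_budget(limit_mib: float) -> bool:
    # Hypothetical guard: True once the process has ever exceeded the budget.
    return peak_rss_mib() > limit_mib
```

Calling peak_rss_mib() at the end of each task and publishing it as a gauge would show whether memory climbs between worker recycles.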

Edge Cases and Variants

CSV export vs. API payload. For a flat CSV dump, chunk rows straight off disk and pair this worker with the batching patterns in CSV to Database Sync Strategies. For records pulled live from a CMS, front the dispatch with the incremental fetch loop in Polling Museum APIs with Python Requests and hand each page to dispatch_sync_batch.
XML / LIDO harvest input. When the source is an OAI-PMH LIDO XML feed, parse each lido:lido element into a dict before chunking; keep the chunk size lower (250–500) because XML records with nested event and rights blocks are heavier per row than flat CSV.
Strict vs. lenient validation. In a first-time backfill, catch ValueError and route the raw payload plus its error to a quarantine queue so the run completes; in steady-state nightly syncs, keep the strict Reject so drift is small and reviewable.
Chunk size tuning. 500–1000 records per chunk is the safe default. Drop toward 100 when payloads embed OCR text or thumbnails; raise toward 2000 only for thin metadata-only rows where task overhead dominates.
Heavy asset work. Route IIIF manifest validation and image derivation to a separate queue and worker pool with its own concurrency, so a slow manifest fetch never blocks metadata chunks on ingestion_high.
Broker choice. Redis is simplest for a single-node sync. Use RabbitMQ when you need native dead-letter exchanges or a clustered multi-node fleet; Redis Sentinel adds failover to a Redis broker but does not add dead-lettering.
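
For the CSV variant, chunking "straight off disk" can be sketched with the standard library alone. The helper name iter_csv_chunks is hypothetical, and the in-memory sample stands in for a real export file; unlike chunk_records, this version never needs the full list of rows in memory.

```python
import csv
import io
from collections.abc import Iterator
from itertools import islice
from typing import TextIO


def iter_csv_chunks(handle: TextIO, size: int = 500) -> Iterator[list[dict[str, str]]]:
    """Yield lists of at most `size` rows without reading the whole file."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls only the next `size` rows from the reader.
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk


sample = io.StringIO("lido:objectID,lido:title\n1,A\n2,B\n3,C\n")
sizes = [len(chunk) for chunk in iter_csv_chunks(sample, size=2)]
print(sizes)  # [2, 1]
```

Each yielded chunk could then be passed to process_chunk.delay, so only one chunk of rows is held by the dispatcher at a time.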

Validation

Confirm the two invariants that keep the run deterministic — bounded chunking and fail-fast drift detection — with an assert-based test that needs no live broker, because chunk_records and validate_and_transform are pure functions.

python

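# test_museum_sync.py  ->  run with:  pytest -q test_museum_sync.py
import pytest

# Assumes the code above lives in museum_sync.py, matching the `-A museum_sync` app name.
from museum_sync import chunk_records, validate_and_transform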


def test_chunking_is_bounded_and_lossless():
    rows = [{"lido:objectID": str(n)} for n in range(1050)]
    chunks = list(chunk_records(rows, size=500))
    assert [len(c) for c in chunks] == [500, 500, 50]   # no chunk exceeds size
    assert sum(len(c) for c in chunks) == 1050          # every row survives


def test_valid_record_maps_lido_aliases():
    rec = validate_and_transform({
        "lido:objectID": "PNT-004821",
        "lido:title": "Untitled",
        "lido:rights": "InC",
    })
    assert rec.object_id == "PNT-004821"                # alias resolved
    assert rec.iiif_manifest_url is None                # optional field defaulted


def test_schema_drift_raises_before_db_write():
    with pytest.raises(ValueError, match="Schema drift"):
        validate_and_transform({"lido:title": "missing id and rights"})

A green run confirms that no chunk exceeds the configured size, which is what keeps peak memory a function of chunk size rather than export volume, and that a payload missing lido:objectID or lido:rights is rejected at the boundary rather than propagating into the catalog. To watch the queue drain live during a real run, use celery -A museum_sync inspect active and celery -A museum_sync inspect reserved; a healthy fleet shows at most one reserved task per worker process.

Standards Alignment

The record contract keeps the sync conformant on the way in. Field aliases map the payload to LIDO v1.1 element names (lido:objectID, lido:rights, lido:iiifManifest), so a validated chunk is harvest-ready without post-processing — the same LIDO shape consumed by the LIDO-to-database mapping layer downstream. Any iiif_manifest_url that passes validation must be an absolute, dereferenceable URI under the IIIF Presentation API 3.0 before indexing. Where rights and place terms need to be authoritative rather than free text, resolve them against controlled vocabularies as described in Implementing Getty AAT & TGN, so the synced records are searchable across institutions instead of trapped in local strings.
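
LIDORecord types iiif_manifest_url as a plain string, so the absolute-URI requirement is not enforced by the model as written. A minimal sketch of a syntactic check follows; the function name and the example.org URL are hypothetical, and the check covers only the URL's form, not whether it dereferences or whether the manifest conforms to the Presentation API.

```python
from urllib.parse import urlparse


def is_absolute_http_url(value: str) -> bool:
    """True when the value has an http(s) scheme and a host component."""
    parts = urlparse(value)
    return parts.scheme in {"http", "https"} and bool(parts.netloc)


print(is_absolute_http_url("https://example.org/iiif/PNT-004821/manifest"))  # True
print(is_absolute_http_url("/iiif/PNT-004821/manifest"))  # False
```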

Frequently Asked Questions

Why does the overnight sync die with MemoryError when a manual run of the same data works?

The manual run processes one record at a time; the scheduled fleet prefetches. With the default worker_prefetch_multiplier, each worker reserves several heavy payloads ahead of the one it is running, and a few reserved batches of IIIF or OCR data cross the container limit. Set worker_prefetch_multiplier=1 and recycle workers with worker_max_tasks_per_child.

Should a bad record retry or fail immediately?

It depends on the exception. A ConnectionError from the CMS or database is transient and should retry with exponential backoff. A schema violation is fatal — retrying it just loops. Branch on the exception type: self.retry for connectivity, Reject(..., requeue=False) for drift so the message goes to the dead-letter queue.

How large should each chunk be?

Start at 500–1000 records. Lower it toward 100 when payloads embed OCR text blocks or thumbnails, and only raise it for thin metadata-only rows where per-task overhead outweighs memory. The goal is that peak memory tracks chunk size, never total export size.

Redis or RabbitMQ for a museum sync broker?

Redis is the simplest choice for a single-node nightly job. Move to RabbitMQ when you need native dead-letter exchanges, clustered high availability, or broker-level priority queues to keep heavy IIIF work off the metadata lane. Redis Sentinel adds high availability to a Redis broker but does not provide dead-letter exchanges.

Related

Building Async Ingestion Pipelines — parent pipeline stage
Polling Museum APIs with Python Requests — feeding live API payloads
Schema Validation with Pydantic — enforcing record contracts
CSV to Database Sync Strategies — batching flat exports
Mapping LIDO to Internal Databases — landing validated records
