Operational Context
Museum digitization teams routinely ingest high-resolution TIFFs and multi-page PDFs from accession ledgers, donor correspondence, and provenance cards. The operational objective is to extract structured provenance strings for direct synchronization into collection management systems. Legacy workflows rely on ad-hoc OCR calls that produce fragmented output. Modern pipelines require deterministic, memory-bounded extraction engines. Validated JSON payloads must align with institutional metadata standards before ingestion. This guide replaces heuristic scraping with a production-grade architecture.
flowchart LR
    Img["Provenance card<br/>TIFF"] --> Pre["1 · Preprocess<br/>threshold · deskew"]
    Pre --> Tess["2 · Tesseract<br/>PSM 6"]
    Tess --> Batch["3 · Memory-bounded<br/>batch"]
    Batch --> Val["4 · Validate<br/>ProvenanceRecord"]
    Val --> IIIF["5 · IIIF manifest<br/>+ CMS sync"]

Root Cause Analysis
Default Tesseract configurations target modern, high-contrast, single-column documents. Archival provenance materials violate core LSTM engine assumptions. Cascading failures emerge across four primary vectors.
Page Segmentation Mode (PSM) mismatches cause structural fragmentation. Provenance cards contain mixed orientations, marginalia, and institutional stamps. The default --psm 3 runs fully automatic page segmentation, which splits such a card into many independent blocks. Text blocks then interleave with stamp overlays, destroying reading order.
Insufficient preprocessing pipelines degrade contour detection. Faded iron-gall ink and yellowed paper lack binary contrast. Without adaptive thresholding and morphological noise removal, ligatures and diacritics fail recognition.
Subprocess memory bloat destabilizes batch execution. pytesseract launches a separate tesseract CLI process for every image. With high-DPI TIFFs and no concurrency cap, these processes pile up faster than they finish. Shared ingestion servers then trigger OOM kills under concurrent load.
Unvalidated output ingestion corrupts downstream sync queues. Raw OCR strings bypass schema enforcement. Malformed dates and restricted PII propagate into CMS records. This breaks Automated Record Ingestion & Sync Workflows and forces manual reconciliation.
Step 1: Deterministic Image Preprocessing
Archival extraction requires adaptive normalization before engine invocation. OpenCV provides deterministic binarization and geometric correction. The pipeline must isolate text regions while suppressing background degradation.
import cv2
import numpy as np
from pathlib import Path


def preprocess_provenance_image(image_path: Path) -> np.ndarray:
    """Return a deskewed grayscale image ready for Tesseract."""
    img = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise ValueError(f"Invalid image path or unsupported format: {image_path}")

    # Adaptive thresholding for uneven illumination (ink -> 0, paper -> 255).
    binary = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
    )

    # Morphological noise removal. The ink is dark on a light background, so
    # closing (dilate, then erode) removes isolated dark specks; opening
    # would only fill light pinholes inside the strokes.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    denoised = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    # Deskew via minimum area bounding rectangle around the text pixels.
    # After adaptive THRESH_BINARY the ink is 0 and the background 255, so
    # select the zero pixels. cv2.minAreaRect needs float32 points ordered
    # (x, y), whereas np.where returns (row, col) = (y, x).
    ys, xs = np.where(denoised == 0)
    if xs.size == 0:
        return img  # Blank page: no ink to measure, so skip rotation.
    coords = np.column_stack((xs, ys)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]

    # Fold the angle into [-45, 45] so both the older [-90, 0) and the
    # newer (0, 90] OpenCV angle conventions are handled.
    if angle < -45:
        angle = 90 + angle
    elif angle > 45:
        angle = angle - 90

    (h, w) = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    # The binarized image is used only to estimate skew. The grayscale
    # original is rotated and returned, padded with white at the borders.
    rotated = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC, borderValue=255)
    return rotated

Step 2: Tesseract Configuration & Layout Tuning
Engine parameters must override default heuristics. Provenance cards require block-level segmentation and LSTM optimization. Stamp overlays demand explicit exclusion zones or secondary pass filtering.
import pytesseract
import numpy as np

# --oem 1 selects the LSTM engine; --psm 6 assumes one uniform block of text.
TESSERACT_CONFIG = "--oem 1 --psm 6 -c preserve_interword_spaces=1"


def extract_text(image_array: np.ndarray, lang: str = "eng") -> str:
    # Adjust the tessdata path to match the local Tesseract installation.
    custom_config = f"{TESSERACT_CONFIG} --tessdata-dir /usr/share/tessdata"
    return pytesseract.image_to_string(
        image_array, lang=lang, config=custom_config
    ).strip()

PSM 6 treats the image as a single uniform text block. This prevents stamp text from fragmenting primary provenance lines. The preserve_interword_spaces flag maintains typewriter spacing for downstream parsing. Teams handling non-Latin scripts must install corresponding .traineddata files.
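The secondary pass filtering mentioned above could be approximated by discarding low-confidence words, on the assumption that stamp fragments are recognized with lower confidence than typed provenance text. The sketch below works on the dictionary shape returned by pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT), which holds parallel "text" and "conf" lists. The helper name filter_low_confidence and the threshold of 60 are illustrative assumptions, not values from this guide, and the sample data is invented.

```python
from typing import Any


def filter_low_confidence(ocr_data: dict[str, list[Any]], min_conf: float = 60.0) -> str:
    """Join recognized words whose confidence is at least min_conf.

    ocr_data is assumed to hold parallel "text" and "conf" lists, as in
    pytesseract's Output.DICT format. min_conf is an illustrative threshold.
    """
    kept = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        word = str(text).strip()
        try:
            score = float(conf)  # conf may arrive as int, float, or str
        except (TypeError, ValueError):
            continue
        # Layout rows that carry no word have empty text and conf -1.
        if word and score >= min_conf:
            kept.append(word)
    return " ".join(kept)


# Invented sample: "RECEIVED" stands in for a faint stamp fragment.
sample = {
    "text": ["", "Gift", "of", "RECEIVED", "Mrs.", "Hale"],
    "conf": ["-1", 96, 93.5, 31, "88", 91],
}
print(filter_low_confidence(sample))  # Gift of Mrs. Hale
```

The threshold should be tuned on a labeled sample of cards, because faded ink lowers the confidence of genuine provenance text as well.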
Step 3: Memory-Bounded Batch Execution
Concurrent processing requires explicit resource isolation. The built-in generic annotations used below (list[Path], tuple[Path, str]) require Python 3.9+, and concurrent.futures provides worker pools with strict concurrency caps. Memory limits prevent subprocess accumulation during multi-page PDF ingestion.
import concurrent.futures
from collections import deque
from pathlib import Path
from typing import Iterator

import psutil


def process_batch(
    image_paths: list[Path],
    max_workers: int = 4,
    min_free_bytes: int = 512 * 1024 * 1024,
) -> Iterator[tuple[Path, str]]:
    def _run(path: Path) -> str:
        return extract_text(preprocess_provenance_image(path))

    queue = deque(image_paths)  # FIFO, so pages are submitted in input order
    # Map each future to its path so failures can name the page that caused them.
    pending: dict[concurrent.futures.Future, Path] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        while queue or pending:
            # Hold back new submissions while free RAM is below the floor,
            # but always keep at least one task in flight to guarantee progress.
            while queue and len(pending) < max_workers and (
                psutil.virtual_memory().available > min_free_bytes or not pending
            ):
                path = queue.popleft()
                pending[executor.submit(_run, path)] = path
            done, _ = concurrent.futures.wait(
                set(pending), return_when=concurrent.futures.FIRST_COMPLETED
            )
            for future in done:
                path = pending.pop(future)
                try:
                    text = future.result()
                except Exception as exc:
                    text = f"OCR_FAILURE: {exc}"
                yield path, text

The ThreadPoolExecutor works well here because OpenCV and the Tesseract subprocess release the GIL during native execution, so the bounded pool achieves real parallelism rather than merely interleaving. Submission is throttled against available system RAM: when free memory drops below min_free_bytes the loop stops enqueuing new pages until in-flight work drains, capping peak memory during multi-page PDF ingestion. Each failure is reported against the path that produced it, so quarantine handling can identify the affected page. This architecture aligns with Automating OCR Metadata Extraction scaling guidelines.
Step 4: Schema Validation & LIDO Mapping
Raw strings require structural enforcement before CMS transmission. Pydantic v2 models validate dates, redact PII, and prepare fields for mapping to a LIDO provenance event: a lido:eventSet containing a lido:displayEvent for the acquisition and a lido:event that carries the lido:eventDate. Validation failures route to quarantine queues.
from pydantic import BaseModel, Field, field_validator
from datetime import date, datetime
import re


# The "X | None" annotations are evaluated at class creation and need Python 3.10+.
class ProvenanceRecord(BaseModel):
    object_id: str = Field(pattern=r"^[A-Z]{3,4}-\d{4,6}$")
    provenance_text: str = Field(min_length=5, max_length=500)
    acquisition_date: date | None = None
    previous_owner: str | None = None
    transaction_type: str | None = None

    @field_validator("provenance_text")
    @classmethod
    def strip_pii(cls, v: str) -> str:
        # Redact anything formatted like a US Social Security number.
        return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", v)

    @field_validator("acquisition_date", mode="before")
    @classmethod
    def parse_date(cls, v: str | date | None) -> date | None:
        if not v:
            return None
        # mode="before" sees the raw input, which may already be a date object.
        if isinstance(v, datetime):
            return v.date()
        if isinstance(v, date):
            return v
        for fmt in ("%Y-%m-%d", "%d %B %Y", "%m/%d/%Y"):
            try:
                return datetime.strptime(str(v).strip(), fmt).date()
            except ValueError:
                continue
        raise ValueError(f"Unrecognized date format: {v}")

The ProvenanceRecord model enforces institutional ID patterns and redacts identifiers formatted like US Social Security numbers. Date parsing normalizes archival variations into date objects that serialize as ISO 8601. Field validators execute synchronously during model instantiation. Invalid payloads fail fast, preventing CMS corruption.
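To make the LIDO mapping concrete, the following sketch serializes a provenance string and an acquisition date into a lido:eventSet using only the standard library. It takes plain values rather than a ProvenanceRecord so that it runs on its own. The function name to_lido_event_set, the "Acquisition" event type term, and the exact element nesting are assumptions for illustration and should be checked against the LIDO schema version the institution has adopted.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Optional

LIDO_NS = "http://www.lido-schema.org"
ET.register_namespace("lido", LIDO_NS)


def _el(parent: ET.Element, tag: str, text: Optional[str] = None) -> ET.Element:
    """Append a namespaced LIDO child element, optionally with text."""
    child = ET.SubElement(parent, f"{{{LIDO_NS}}}{tag}")
    if text is not None:
        child.text = text
    return child


def to_lido_event_set(provenance_text: str, acquisition_date: Optional[date]) -> ET.Element:
    """Build a lido:eventSet (assumed nesting; verify against the LIDO XSD)."""
    event_set = ET.Element(f"{{{LIDO_NS}}}eventSet")
    _el(event_set, "displayEvent", provenance_text)
    event = _el(event_set, "event")
    event_type = _el(event, "eventType")
    _el(event_type, "term", "Acquisition")  # illustrative term, not a controlled value
    if acquisition_date is not None:
        iso = acquisition_date.isoformat()
        event_date = _el(event, "eventDate")
        _el(event_date, "displayDate", iso)
        date_el = _el(event_date, "date")
        # A single known day is both the earliest and the latest bound.
        _el(date_el, "earliestDate", iso)
        _el(date_el, "latestDate", iso)
    return event_set


lido_xml = ET.tostring(
    to_lido_event_set("Gift of Mrs. Hale, 1923", date(1923, 5, 14)),
    encoding="unicode",
)
print(lido_xml)
```

When the acquisition date is missing, the sketch omits lido:eventDate entirely rather than emitting an empty element.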
Step 5: IIIF Integration & CMS Sync
Validated records must link to source imagery for auditability. IIIF Presentation API 3.0 manifests wrap extracted metadata alongside high-resolution derivatives. The sync pipeline serializes payloads into CMS-compatible JSON.
from typing import Any

# IIIF 3.0 requires every resource id to be a dereferenceable HTTP(S) URI.
IIIF_BASE = "https://iiif.example.org"


def build_iiif_provenance_manifest(record: ProvenanceRecord, image_uri: str) -> dict[str, Any]:
    base = f"{IIIF_BASE}/{record.object_id}"
    canvas_id = f"{base}/canvas"
    metadata = [
        {"label": {"en": ["LIDO Provenance"]}, "value": {"en": [record.provenance_text]}},
    ]
    # Skip the date entry when no date was parsed, instead of emitting "None".
    if record.acquisition_date is not None:
        metadata.append({
            "label": {"en": ["Acquisition Date"]},
            "value": {"en": [record.acquisition_date.isoformat()]},
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base}/manifest",
        "type": "Manifest",
        "label": {"en": [f"Provenance Record: {record.object_id}"]},
        "metadata": metadata,
        "items": [{
            "id": canvas_id,
            "type": "Canvas",
            # Placeholder dimensions; use the pixel size of the source image.
            "height": 3000,
            "width": 2400,
            # Canvas "items" holds the painting annotation that places the image.
            "items": [{
                "id": f"{base}/painting-page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{base}/painting-anno",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {"id": image_uri, "type": "Image", "format": "image/tiff"},
                    "target": canvas_id
                }]
            }],
            # Canvas "annotations" holds non-painting content such as the OCR text.
            "annotations": [{
                "id": f"{base}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{base}/anno",
                    "type": "Annotation",
                    "motivation": "commenting",
                    "body": {
                        "type": "TextualBody",
                        "value": record.provenance_text,
                        "format": "text/plain"
                    },
                    "target": canvas_id
                }]
            }]
        }]
    }

The manifest structure binds OCR output to the source canvas via IIIF annotations. Metadata fields correspond to the LIDO lido:displayEvent and lido:eventDate elements. Serialization produces CMS-ready payloads. Downstream ingestion handlers consume the JSON without transformation overhead.
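Because the manifest relies on every resource id being an HTTP(S) URI, a pre-sync check can catch malformed identifiers before the payload leaves the pipeline. The sketch below walks any nested manifest dictionary and reports ids that fail the check. The helper names iter_resource_ids and find_non_http_ids are hypothetical, the sample manifest is invented, and the check is purely syntactic: it does not verify that a URI actually dereferences.

```python
from typing import Any, Iterator
from urllib.parse import urlparse


def iter_resource_ids(node: Any) -> Iterator[str]:
    """Yield every "id" value found anywhere in a nested manifest."""
    if isinstance(node, dict):
        if "id" in node:
            yield str(node["id"])
        for value in node.values():
            yield from iter_resource_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_resource_ids(item)


def find_non_http_ids(manifest: dict[str, Any]) -> list[str]:
    """Return ids that are not absolute http or https URIs."""
    bad = []
    for rid in iter_resource_ids(manifest):
        parsed = urlparse(rid)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            bad.append(rid)
    return bad


# Invented fragment with one deliberately broken canvas id.
sample_manifest = {
    "id": "https://iiif.example.org/MUS-0042/manifest",
    "type": "Manifest",
    "items": [{"id": "MUS-0042/canvas", "type": "Canvas", "items": []}],
}
print(find_non_http_ids(sample_manifest))  # ['MUS-0042/canvas']
```

An empty result is a necessary condition for a valid manifest, not a sufficient one, so it complements rather than replaces a full IIIF validator.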
Production Considerations
Batch pipelines require deterministic retry logic and structured telemetry. Transient OCR failures route to exponential backoff queues. Persistent errors trigger quarantine alerts with full context payloads. Memory profiling must run continuously during peak ingestion windows.
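A minimal sketch of the retry policy described above, using only the standard library. The function name retry_with_backoff and the defaults (four attempts, 0.5 s base delay, doubling each time) are illustrative assumptions. The delays carry no random jitter, which keeps the schedule deterministic, and the sleep function is injectable so the schedule can be inspected without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    task: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying failures with delays of base_delay * 2**n."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # persistent failure: the caller routes it to quarantine
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Simulated transient failure: the first two calls fail, the third succeeds.
calls = {"n": 0}
delays: list[float] = []


def flaky_ocr() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient OCR failure")
    return "ok"


result = retry_with_backoff(flaky_ocr, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

Catching every Exception is a simplification. A production version would probably retry only errors known to be transient and quarantine validation failures immediately.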
Logging frameworks should capture PSM overrides, preprocessing parameters, and validation rejection reasons. Structured JSON logs enable rapid root cause analysis. Teams must monitor subprocess exit codes and worker thread saturation. Scaling requires horizontal pod expansion with shared tessdata volumes.
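One way to produce such structured logs with the standard logging module is a formatter that emits each record as a JSON object. In this sketch the class name JsonFormatter and the context keys (psm, preprocess, rejection_reason, exit_code) are assumptions chosen to mirror the fields named above. They are supplied per call through the logger's extra argument.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render log records as one JSON object per line."""

    # Hypothetical context fields mirroring the telemetry named in the text.
    CONTEXT_KEYS = ("psm", "preprocess", "rejection_reason", "exit_code")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via extra={...} become attributes on the record.
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ocr.pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning(
    "validation rejected for %s",
    "MUS-0042",
    extra={"psm": 6, "rejection_reason": "Unrecognized date format: 14.05.1923"},
)
```

Keeping the set of context keys explicit avoids leaking arbitrary record attributes, which matters when log lines may otherwise carry donor-restricted text.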
Compliance mandates strict access controls around donor restriction flags. Redaction validators execute before any network transmission. Audit trails preserve original TIFF hashes alongside extracted strings. This ensures full provenance chain integrity.
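The audit trail requirement can be sketched as a small record that pairs a SHA-256 digest of the source file with the extracted string. The function names, the field names in the returned dictionary, and the choice of SHA-256 are assumptions for illustration, since the hash algorithm is not specified here. The demo writes placeholder bytes to a temporary file in place of a real TIFF.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large TIFFs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_audit_entry(tiff_path: Path, extracted_text: str) -> dict[str, str]:
    """Pair the source file hash with the extracted text (hypothetical schema)."""
    return {
        "source_file": tiff_path.name,
        "source_sha256": sha256_of_file(tiff_path),
        "text_sha256": hashlib.sha256(extracted_text.encode("utf-8")).hexdigest(),
        "extracted_text": extracted_text,
    }


# Demo with placeholder bytes standing in for a scanned card.
with tempfile.TemporaryDirectory() as tmp:
    card = Path(tmp) / "card_0001.tif"
    card.write_bytes(b"placeholder bytes, not a real TIFF")
    entry = build_audit_entry(card, "Gift of Mrs. Hale, 1923")

print(json.dumps(entry, indent=2))
```

Hashing the original file before any preprocessing keeps the digest tied to the accessioned scan rather than to a derived image.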
Conclusion
The five-stage pipeline (preprocess, configure Tesseract, bound the batch, validate with Pydantic, publish via IIIF) converts fragile ad-hoc OCR calls into a deterministic, memory-safe extraction service. Each stage has explicit failure boundaries: preprocessing errors surface before Tesseract is invoked, PII redaction runs before network transmission, and invalid payloads are quarantined rather than propagated into the CMS.