Automating OCR Metadata Extraction

Optical character recognition sits at the boundary between rasterized legacy documents and machine-actionable collection metadata. Within the broader automated record ingestion and sync workflows pipeline, this stage converts scanned accession ledgers, catalogue cards, conservation reports, and donor correspondence into structured fields that downstream validation and transformation can trust. The operational goal extends beyond raw text capture: reliable field population, rights-status inference, and provenance normalization are the primary objectives. Production pipelines isolate OCR execution from database commits, enforce strict schema contracts, and route ambiguous extractions to a human review queue without stalling batch throughput.

This page covers the full extraction stage — preprocessing, concurrency-bounded engine execution, confidence gating, schema validation, and access-tier routing — for teams running Tesseract at scale against archival materials.

Workflow Context

Museum digitization produces two classes of asset. Born-structured records (spreadsheet exports, TMS dumps, API payloads) flow directly into CSV to database sync strategies. Born-unstructured records — anything that exists only as a scanned image — must first pass through OCR before any field mapping is possible. This stage owns that second path.

The sub-problem is narrow but unforgiving. Archival scans violate the assumptions Tesseract’s LSTM engine makes about modern documents: faded iron-gall ink, mixed orientations, typewriter spacing, stamp overlays, and marginalia. A naive image_to_string call on these materials produces fragmented, low-confidence output that silently corrupts the catalogue when ingested unfiltered. The workflow this page establishes treats every extraction as provisional until a confidence gate and a Pydantic contract have cleared it. Records that clear advance to the Pydantic schema validation layer; records that fail route to review rather than degrading the collection.

Prerequisites

Before deploying this stage, confirm the following are in place:

Python 3.10+, because the code below uses PEP 604 union syntax (str | None) in Pydantic annotations that are evaluated at runtime. asyncio.to_thread on its own needs only 3.9.
Tesseract OCR 5.x installed at the system level, with the eng (and any script-specific) .traineddata files present under TESSDATA_PREFIX. Verify with tesseract --version.
pytesseract — the Python wrapper that shells out to the Tesseract binary.
Pillow (PIL fork) for image decoding and preprocessing filters; opencv-python if you need adaptive thresholding or deskew (see extracting provenance text with Tesseract OCR).
pydantic v2 for the extraction contract (model_config, field_validator, model_dump).
aiofiles for non-blocking image reads inside the async pipeline.
Schema targets: LIDO v1.1 for descriptive metadata nodes and IIIF Presentation API 3.0 for manifest-level rights and text annotations.
Authority endpoints for normalization: Getty AAT/TGN for material and place terms, and RightsStatements.org for rights URIs.

Schema & Spec Reference

Every extraction resolves to a single canonical record before it leaves this stage. The contract below is the boundary object between OCR and ingestion — nothing advances that cannot populate it.

Field	Type	Constraint	Purpose
`source_file`	`str`	non-empty, basename only	Audit trail back to the scanned asset
`raw_text`	`str`	`min_length=1`	Full extracted text, artifacts stripped
`mean_confidence`	`float`	`0.0`–`100.0`	Tesseract per-word confidence, averaged
`status`	`Literal`	`extracted` / `review` / `failed`	Routing decision for the ingestion queue
`access_tier`	`Literal`	`public` / `research` / `restricted`	Set when PII or donor restrictions are detected
`rights_uri`	`str | None`	valid RightsStatements.org URI	Inferred rights statement, if determinable

The confidence threshold is the single most important tuning parameter in this stage. Set it too low and unreliable text reaches the catalogue; set it too high and legible archival scans are needlessly queued for manual review. Treat the default of 60.0 as a starting point and calibrate against a labelled sample of your own material — the reasoning for public-domain-adjacent thresholds is developed further in the rights automation section’s threshold tuning for public domain guidance.
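One way to approach that calibration is to sweep candidate thresholds over a hand-labelled sample and count both kinds of error at each one. The sketch below is illustrative only: `sweep_thresholds`, `SweepRow`, and the sample values are hypothetical names and data, not part of the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepRow:
    threshold: float
    bad_accepted: int   # unusable text that would reach the catalogue
    good_queued: int    # usable text needlessly sent to review


def sweep_thresholds(
    sample: list[tuple[float, bool]],
    candidates: list[float],
) -> list[SweepRow]:
    """Count both error types at each candidate threshold.

    `sample` holds (mean_confidence, is_usable) pairs from a hand-labelled
    set of your own scans. A record is accepted when its confidence is at
    or above the threshold, mirroring the `<` comparison used by the gate.
    """
    rows = []
    for threshold in candidates:
        bad_accepted = sum(1 for conf, ok in sample if conf >= threshold and not ok)
        good_queued = sum(1 for conf, ok in sample if conf < threshold and ok)
        rows.append(SweepRow(threshold, bad_accepted, good_queued))
    return rows


# Hypothetical labelled sample: (mean_confidence, reviewer judged text usable)
labelled = [(91.4, True), (72.0, True), (58.5, True), (55.0, False), (42.0, False)]

for row in sweep_thresholds(labelled, [50.0, 60.0, 70.0]):
    print(row)
```

Which trade-off is acceptable depends on the collection: a public-facing catalogue will usually tolerate more review-queue volume than a research backfill.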

Data Flow

The pipeline follows a strict stage-gate model. Assets enter via asynchronous I/O handlers, preprocessing applies deterministic image transformations, the OCR engine executes under semaphore-controlled concurrency, and output serialization maps to cultural heritage schemas. Data flows unidirectionally to prevent state corruption, and each stage emits structured logs for auditability. This topology mirrors the producer-consumer design used for building async ingestion pipelines in high-throughput environments.

Step-by-Step Implementation

Step 1 — Bounded, non-blocking extraction

A production processor batches TIFF/PNG assets, applies preprocessing thresholds before extraction, and runs each blocking Tesseract call off the event loop. Consult the official Tesseract Command-Line Usage documentation for PSM and OEM parameter tuning; --psm 6 (single uniform block) and --oem 3 (default engine selection based on what the installed traineddata provides, typically LSTM on Tesseract 5) are the safe defaults for card-style archival layouts.

python

import asyncio
import logging
import io
from pathlib import Path
from typing import Any
from PIL import Image, ImageEnhance, ImageFilter
import pytesseract
import aiofiles

logger = logging.getLogger("ocr_pipeline")
OCR_SEMAPHORE = asyncio.Semaphore(8)

async def preprocess_image(img_path: Path) -> Image.Image:
    async with aiofiles.open(img_path, "rb") as f:
        raw_bytes = await f.read()
    img = Image.open(io.BytesIO(raw_bytes))
    img = img.convert("L")
    img = ImageEnhance.Contrast(img).enhance(1.8)
    img = img.filter(ImageFilter.MedianFilter(size=3))
    return img

async def run_ocr(img_path: Path) -> dict[str, Any]:
    async with OCR_SEMAPHORE:
        img = await preprocess_image(img_path)
        custom_config = r"--oem 3 --psm 6 -c preserve_interword_spaces=1"
        # One pass yields both text and per-word confidence. Tesseract is a
        # blocking subprocess, so run it off the event loop with to_thread.
        data = await asyncio.to_thread(
            pytesseract.image_to_data, img,
            config=custom_config, output_type=pytesseract.Output.DICT
        )
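        words = [t for t in data["text"] if t.strip()]
        # The `conf` column is -1 for non-text regions; keep real scores only.
        # Depending on the pytesseract and Tesseract versions, values may
        # arrive as int, float, or numeric string, so parse with float()
        # instead of str.isdigit(), which would silently drop "96.5".
        confidences = [
            float(c) for c, t in zip(data["conf"], data["text"])
            if t.strip() and float(c) >= 0
        ]
        mean_conf = sum(confidences) / len(confidences) if confidences else 0.0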

        return {
            "source_file": img_path.name,
            "raw_text": " ".join(words),
            "mean_confidence": round(mean_conf, 2),
            "status": "extracted",
        }

async def process_batch(file_paths: list[Path]) -> list[dict[str, Any]]:
    tasks = [run_ocr(p) for p in file_paths]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    output: list[dict[str, Any]] = []
    for path, result in zip(file_paths, results):
        if isinstance(result, Exception):
            logger.error("OCR failed for %s: %s", path, result)
        else:
            output.append(result)
    return output

The semaphore caps the number of concurrent Tesseract subprocesses; running each blocking call through asyncio.to_thread keeps the event loop responsive while the native engine works. image_to_data — not image_to_string — is deliberate: it returns the per-word conf column that the gate in Step 2 depends on. Failed extractions are logged rather than silently dropped, so a single unreadable scan never aborts a batch.

Input variants. The reader above handles single-page raster formats (TIFF, PNG, JPEG). For multi-page PDFs, expand each page to an image with pdf2image before calling run_ocr, and carry a page_index into source_file so the audit trail stays unambiguous. For API-delivered assets, replace the aiofiles read with an async HTTP fetch but keep the semaphore — the concurrency ceiling protects the OCR engine regardless of where bytes originate.
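The multi-page PDF variant might look like the sketch below. It assumes pdf2image and its Poppler dependency are installed, and the `<stem>_p<NNNN>` naming pattern and both function names are illustrative choices rather than anything the pipeline prescribes.

```python
from pathlib import Path


def page_source_name(pdf_path: Path, page_index: int) -> str:
    """Build a basename that records which page a record came from.

    The `<stem>_p<NNNN>.pdf` pattern is an illustrative convention only.
    """
    return f"{pdf_path.stem}_p{page_index:04d}{pdf_path.suffix}"


def iter_pdf_pages(pdf_path: Path, dpi: int = 300):
    """Yield (source_name, PIL image) for each page of a PDF.

    Assumes pdf2image and Poppler are available; the import is deferred so
    the naming helper above still works without them.
    """
    from pdf2image import convert_from_path

    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for index, image in enumerate(pages, start=1):
        yield page_source_name(pdf_path, index), image


print(page_source_name(Path("scans/ledger_1962.pdf"), 3))  # ledger_1962_p0003.pdf
```

run_ocr as written takes a path, so either save each page to a temporary file or factor the OCR call into a helper that accepts a PIL image. convert_from_path loads every page into memory at once; its first_page and last_page parameters allow chunked conversion for long ledgers.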

Step 2 — Validate and gate on confidence

Raw OCR output requires strict structural enforcement before it can touch the catalogue. A Pydantic v2 model coerces types, enforces the confidence range, and — critically — downgrades any record below the threshold to review status. Extracted text maps to LIDO descriptive metadata nodes only after this gate clears.

python

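import logging
from typing import Any, Literal
from pydantic import BaseModel, Field, field_validator, model_validator

# Same logger name as Step 1, so both blocks also work as separate modules.
logger = logging.getLogger("ocr_pipeline")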

CONFIDENCE_THRESHOLD = 60.0

class OCRRecord(BaseModel):
    model_config = {"str_strip_whitespace": True}

    source_file: str = Field(min_length=1)
    raw_text: str = Field(min_length=1)
    mean_confidence: float = Field(ge=0.0, le=100.0)
    status: Literal["extracted", "review", "failed"] = "extracted"
    access_tier: Literal["public", "research", "restricted"] = "public"
    rights_uri: str | None = None

    @field_validator("raw_text")
    @classmethod
    def strip_ocr_artifacts(cls, v: str) -> str:
        # Collapse the stray single-character noise Tesseract emits on
        # speckled backgrounds, then normalize whitespace.
        cleaned = " ".join(tok for tok in v.split() if len(tok) > 1 or tok.isalnum())
        return " ".join(cleaned.split())

    @model_validator(mode="after")
    def gate_low_confidence(self) -> "OCRRecord":
        if self.mean_confidence < CONFIDENCE_THRESHOLD:
            self.status = "review"
        return self

def validate_records(rows: list[dict[str, Any]]) -> list[OCRRecord]:
    validated: list[OCRRecord] = []
    for row in rows:
        try:
            validated.append(OCRRecord.model_validate(row))
        except Exception as exc:  # ValidationError and coercion failures
            logger.error("Validation rejected %s: %s", row.get("source_file"), exc)
    return validated

Because the gate lives in a model_validator, every OCRRecord built through model_validate or the normal constructor passes through it; the confidence check is part of the type, not a downstream if. Two escape hatches remain: model_construct bypasses validation entirely, and attribute assignment after construction is not re-validated by default. This is the same cardinality-first discipline the core architecture applies before records reach the LIDO-to-internal-database mapping layer. For lenient ingestion runs (bulk backfills where partial data is acceptable), swap the hard reject in validate_records for a model_construct fallback that tags the record failed and keeps it for later reprocessing rather than discarding it. Use that fallback deliberately, since it skips the gate.
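A lenient fallback of that kind could be sketched as follows. The model here is a trimmed copy of the contract so the example runs on its own, and `validate_lenient` is a hypothetical name; resetting mean_confidence to 0.0 on failure is an assumption of this sketch, not a rule from the pipeline.

```python
from typing import Any, Literal

from pydantic import BaseModel, Field, ValidationError


class OCRRecord(BaseModel):
    # Trimmed copy of the Step 2 contract so this sketch is self-contained.
    source_file: str = Field(min_length=1)
    raw_text: str = Field(min_length=1)
    mean_confidence: float = Field(ge=0.0, le=100.0)
    status: Literal["extracted", "review", "failed"] = "extracted"


def validate_lenient(rows: list[dict[str, Any]]) -> list[OCRRecord]:
    """Keep rejected rows as `failed` records instead of dropping them."""
    kept: list[OCRRecord] = []
    for row in rows:
        try:
            kept.append(OCRRecord.model_validate(row))
        except ValidationError:
            # model_construct skips all validation, so these values are
            # untrusted; the `failed` status marks them for reprocessing.
            kept.append(
                OCRRecord.model_construct(
                    source_file=str(row.get("source_file", "")),
                    raw_text=str(row.get("raw_text", "")),
                    mean_confidence=0.0,  # assumption: do not trust the score
                    status="failed",
                )
            )
    return kept


records = validate_lenient([
    {"source_file": "card_0001.tif", "raw_text": "Gift of the Hartley Estate", "mean_confidence": 91.4},
    {"source_file": "card_0002.tif", "raw_text": "", "mean_confidence": 12.0},
])
print([r.status for r in records])  # ['extracted', 'failed']
```

Downstream consumers must filter on status, because a failed record built this way has not been checked against any field constraint.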

Step 3 — Normalize fields against authorities

Extracted material terms (“gelatin silver print”, “albumen”) and place names are free text until resolved against a controlled vocabulary. Route these tokens through the Getty authorities before mapping — the identifier-resolution pattern is documented under implementing Getty AAT and TGN. Normalization strips OCR artifacts, enforces controlled vocabularies, and emits canonical URIs that IIIF manifest generation can reference directly.
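As a rough illustration of that resolution step, the sketch below cleans a token, checks a local table, and only then falls back to a remote lookup. Everything named here is hypothetical: the example.org URIs are placeholders standing in for real Getty URIs, and `remote_lookup` represents whatever authority client you use.

```python
import re
from typing import Callable, Optional

# Hypothetical local cache of previously resolved terms. The example.org
# URIs are placeholders; real entries would hold Getty AAT or TGN URIs.
KNOWN_TERMS: dict[str, str] = {
    "gelatin silver print": "https://example.org/vocab/material/gelatin-silver-print",
    "albumen": "https://example.org/vocab/material/albumen",
}


def normalize_term(raw: str) -> str:
    """Lowercase, replace stray punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s-]", " ", raw.lower())
    return " ".join(cleaned.split())


def resolve_term(
    raw: str,
    remote_lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return a canonical URI for an OCR token, or None if unresolved.

    `remote_lookup` is only called when the local table has no match.
    """
    key = normalize_term(raw)
    if key in KNOWN_TERMS:
        return KNOWN_TERMS[key]
    if remote_lookup is not None:
        uri = remote_lookup(key)
        if uri:
            KNOWN_TERMS[key] = uri  # remember for the rest of the batch
        return uri
    return None


print(resolve_term("  Gelatin  silver print. "))
print(resolve_term("unknown medium"))  # None
```

Leaving unresolved terms as None, rather than guessing, keeps free text out of fields that downstream consumers expect to be URIs.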

Rights and Access Routing

OCR frequently surfaces information the catalogue is not licensed to publish. Donor correspondence and provenance cards routinely contain living individuals’ names, addresses, or restriction clauses. This stage must therefore make an access-tier decision before any record reaches a public discovery portal.

Two signals drive routing. First, PII detection on raw_text (Social Security number patterns, contemporary addresses, explicit restriction language) forces the record to the restricted tier regardless of confidence. Second, rights inference attempts to attach a RightsStatements.org URI: documents bearing an explicit copyright notice map to an in-copyright statement, while US federal government works and imprints old enough to have entered the US public domain map toward public-domain URIs. The publication cutoff advances every 1 January (works published before 1929 as of 2024), so derive it from the current date rather than hard-coding it. Ambiguous cases are left None and resolved downstream by automating copyright status checks. When a determinable license exists, routing Creative Commons licenses governs how the URI propagates into the IIIF manifest’s rights block.

The rule of thumb: confidence gates protect data quality; access-tier routing protects data governance. A record can be high-confidence and still restricted. Never let one gate substitute for the other.
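The two signals can be kept as separate functions so that neither depends on the other. The patterns below are deliberately minimal and illustrative; a production detector needs a broader, locally reviewed rule set, and the function names are hypothetical. This sketch never assigns the research tier, which is left to whatever policy your institution defines.

```python
import re
from typing import Optional

# Illustrative patterns only, not a complete PII or rights detector.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
RESTRICTION_PATTERN = re.compile(
    r"\b(restricted|not for publication|confidential|closed until)\b",
    re.IGNORECASE,
)
COPYRIGHT_PATTERN = re.compile(r"(©|\(c\)|\bcopyright\b)", re.IGNORECASE)

IN_COPYRIGHT = "http://rightsstatements.org/vocab/InC/1.0/"


def route_access_tier(raw_text: str) -> str:
    """Force `restricted` when PII or restriction language is present."""
    if SSN_PATTERN.search(raw_text) or RESTRICTION_PATTERN.search(raw_text):
        return "restricted"
    return "public"


def infer_rights_uri(raw_text: str) -> Optional[str]:
    """Attach an in-copyright statement only on an explicit notice.

    Everything else stays None so the downstream copyright check decides.
    """
    if COPYRIGHT_PATTERN.search(raw_text):
        return IN_COPYRIGHT
    return None


sample = "© 1974 Hartley Estate. Not for publication."
print(route_access_tier(sample), infer_rights_uri(sample))
```

Note that neither function looks at mean_confidence: a perfectly legible letter is routed to restricted exactly as a smudged one would be.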

Verification and Testing

Confirm the gate and the artifact stripper behave before wiring this stage into the ingestion queue. The assert-based test below exercises both the happy path and the low-confidence downgrade without invoking Tesseract, so it runs in CI without the native binary installed.

python

def test_ocr_record_gating():
    high = OCRRecord.model_validate({
        "source_file": "card_0001.tif",
        "raw_text": "Gift of the Hartley Estate, 1962",
        "mean_confidence": 91.4,
    })
    assert high.status == "extracted"

    low = OCRRecord.model_validate({
        "source_file": "card_0002.tif",
        "raw_text": "G1ft 0f th3 estat3",
        "mean_confidence": 42.0,
    })
    assert low.status == "review"  # gated below CONFIDENCE_THRESHOLD

    # Artifact stripping drops single-character punctuation noise; single
    # letters and digits survive because the validator keeps alphanumerics.
    noisy = OCRRecord.model_validate({
        "source_file": "card_0003.tif",
        "raw_text": "  accession   |   7   1962  ",
        "mean_confidence": 88.0,
    })
    assert noisy.raw_text == "accession 7 1962"


if __name__ == "__main__":
    test_ocr_record_gating()
    print("OK")

Run it directly with python test_ocr.py or under pytest -q. For an end-to-end smoke check against the real engine, point the batch processor at a single known-good scan and check that the printed mean_confidence clears your threshold (the command assumes the Step 1 code is saved as ocr_pipeline.py):

bash

python -c "import asyncio, json; from pathlib import Path; \
from ocr_pipeline import process_batch; \
print(json.dumps(asyncio.run(process_batch([Path('samples/known_good.tif')])), indent=2))"

Deployment Notes

Horizontal scaling requires stateless worker nodes; container orchestration manages memory allocation for image buffers, which balloon with high-DPI TIFFs. Monitor confidence distributions and review-queue depth as first-class metrics — a sudden drop in mean confidence usually signals a batch of degraded source scans, not a code regression. Correlation IDs threaded across preprocessing, extraction, and validation make per-record debugging tractable. Keep the semaphore ceiling matched to available CPU: each concurrent Tesseract process is a full native subprocess, and oversubscription trades throughput for OOM risk on shared ingestion servers.
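Correlation IDs can be threaded through the stages with the standard library alone. The sketch below uses a context variable plus a logging filter; the ID format, the logger name, and the helper names are assumptions made for illustration.

```python
import contextvars
import logging
import uuid

# Holds the ID of the record currently being processed in this task.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name: str = "ocr_pipeline.demo") -> logging.Logger:
    """Return a logger whose handler prints the correlation ID."""
    log = logging.getLogger(name)
    if not log.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
        )
        handler.addFilter(CorrelationFilter())
        log.addHandler(handler)
        log.setLevel(logging.INFO)
    return log


def start_record(source_file: str) -> str:
    """Assign a fresh ID at the start of a record's journey."""
    cid = f"{source_file}:{uuid.uuid4().hex[:8]}"
    correlation_id.set(cid)
    return cid


log = build_logger()
start_record("card_0001.tif")
log.info("preprocessing")
log.info("extraction")
```

Each asyncio task gets its own copy of the context, and asyncio.to_thread carries the current context into the worker thread, so an ID set at the top of a per-record coroutine follows that record through the blocking OCR call.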

Extended Techniques

For provenance-specific materials — accession ledgers, donor cards, and correspondence where layout and PII handling dominate — the extraction recipe changes materially:

Extracting provenance text with Tesseract OCR — adaptive thresholding, deskew, PSM tuning for stamp overlays, memory-bounded batch execution, and a ProvenanceRecord contract that maps directly to LIDO event nodes and IIIF annotations.

FAQ

Why is mean confidence near zero even though the text looks readable?

The conf column returns -1 for non-text regions and layout containers, not just for failed words. If those -1 values leak into the average, they crater the score. The batch code above filters them explicitly, keeping only scores of 0 or higher; verify your own aggregation does the same before assuming the scan is at fault.
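A small standalone comparison makes the effect visible. The conf and text lists below are invented to resemble image_to_data output, and `mean_confidence` is a hypothetical helper written for this illustration.

```python
def mean_confidence(conf: list, text: list[str]) -> float:
    """Average word confidences, ignoring -1 layout rows and blank tokens."""
    scores = [
        float(c) for c, t in zip(conf, text)
        if t.strip() and float(c) >= 0
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Invented output: three layout rows (conf -1, empty text), then two words.
conf = [-1, -1, -1, 96, 88]
text = ["", "", "", "Hartley", "Estate"]

naive = sum(conf) / len(conf)
print(naive)                        # 36.2
print(mean_confidence(conf, text))  # 92.0
```

On a real page the layout rows often outnumber the words, so the naive average falls much further than it does in this five-row example.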

I get `TesseractNotFoundError` even though pytesseract is installed.

pytesseract is only a wrapper; it shells out to the native tesseract binary. Install the engine at the OS level (not via pip) and either put it on PATH or set pytesseract.pytesseract.tesseract_cmd to its absolute path. In containers, add the Tesseract package and the required .traineddata files to the image and confirm TESSDATA_PREFIX points at them.

Text from catalogue cards comes out scrambled or out of order.

That is a Page Segmentation Mode problem. The default --psm 3 runs fully automatic page segmentation, which can split a card into several detected blocks and interleave stamp and marginalia text with the primary block. Switch to --psm 6 (single uniform block) for card layouts, or --psm 4 for single-column variable-size text. See the provenance guide for handling mixed-orientation stamps.

Memory climbs until workers are OOM-killed during large batches.

Each pytesseract call spawns an independent subprocess, and high-DPI TIFFs hold large buffers. Cap concurrency with the semaphore (start at CPU-count) and, for multi-page PDF ingestion, throttle submission against available RAM rather than firing every page at once. The memory-bounded pool in the provenance guide shows the pattern.
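Throttled submission can be sketched with plain asyncio: tasks are created only when a slot frees up, so queued pages never pile up in memory. Here `run_bounded` and `fake_ocr` are hypothetical names, and the fixed `limit` stands in for a number you would derive from your own RAM budget per page.

```python
import asyncio


async def run_bounded(items, worker, limit: int):
    """Run `worker` over `items` with at most `limit` tasks alive at once.

    Tasks are created lazily, so pages still waiting their turn hold no
    coroutine or image buffer. Results come back in completion order, and
    an exception raised by a worker propagates from task.result().
    """
    results = []
    pending: set[asyncio.Task] = set()
    for item in items:
        if len(pending) >= limit:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(task.result() for task in done)
        pending.add(asyncio.create_task(worker(item)))
    if pending:
        done, _ = await asyncio.wait(pending)
        results.extend(task.result() for task in done)
    return results


in_flight = 0
peak = 0


async def fake_ocr(page: int) -> int:
    """Stand-in for an OCR call that records peak concurrency."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return page


pages = asyncio.run(run_bounded(range(20), fake_ocr, limit=4))
print(sorted(pages) == list(range(20)), peak <= 4)  # True True
```

This complements the semaphore rather than replacing it: the semaphore bounds concurrent Tesseract processes, while lazy submission bounds how much work is materialized ahead of them.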

How do I keep restricted PII from reaching the public catalogue?

Never rely on the confidence gate for this — a restricted document can be perfectly legible. Run PII detection on raw_text and force access_tier="restricted" before the record is eligible for public delivery, then let the rights automation section resolve licensing. Access-tier routing and quality gating are independent controls.

Related

Automated Record Ingestion & Sync Workflows — parent pipeline overview
Extracting Provenance Text with Tesseract OCR — archival layout tuning
Schema Validation with Pydantic — post-extraction contracts
Building Async Ingestion Pipelines — concurrency architecture
Implementing Getty AAT & TGN — vocabulary normalization
