Extracting Provenance Text with Tesseract OCR

Operational Context

A digitization technician runs a batch of high-resolution TIFFs from accession ledgers, donor correspondence, and provenance cards through OCR, expecting clean provenance strings to land in the collection management system — and instead gets fragmented lines, mangled dates, and stamp text bleeding into ownership history. This page resolves that exact failure: turning ad-hoc pytesseract calls into a deterministic, memory-bounded extraction service whose output is validated before it ever reaches ingestion. It is the engine that feeds the broader automated record ingestion pipeline, and it is the concrete implementation behind the OCR stage described in Automating OCR Metadata Extraction.

Five deterministic stages with explicit failure boundaries: preprocessing errors surface before Tesseract runs, and records that fail schema validation branch into a quarantine queue instead of reaching CMS sync.

Root Cause Analysis

Default Tesseract configurations target modern, high-contrast, single-column documents. Archival provenance materials violate the core assumptions of the LSTM engine, and the failures cascade across four vectors.

Page Segmentation Mode (PSM) mismatches cause structural fragmentation. Provenance cards contain mixed orientations, marginalia, and institutional stamps. The default --psm 3 runs fully automatic page segmentation tuned for conventional page layouts, so text blocks interleave with stamp overlays and reading order is destroyed.

Insufficient preprocessing degrades contour detection. Faded iron-gall ink and yellowed paper lack binary contrast. Without adaptive thresholding and morphological noise removal, ligatures and diacritics fail recognition outright.

Subprocess memory bloat destabilizes batch execution. pytesseract spawns an independent tesseract CLI process per image, and with high-DPI TIFFs an unbounded batch piles up concurrent processes until shared ingestion servers trigger OOM kills.

Unvalidated output corrupts downstream sync queues. Raw OCR strings bypass schema enforcement, so malformed dates and restricted PII propagate into records and force manual reconciliation. Enforcing a contract at this boundary is the same discipline applied in schema validation with Pydantic.

Canonical Solution

The service is a five-stage pipeline with explicit failure boundaries: preprocessing errors surface before Tesseract is invoked, PII redaction runs before any network transmission, and invalid payloads are quarantined rather than propagated.

Step 1: Deterministic Image Preprocessing

Archival extraction requires adaptive normalization before engine invocation. OpenCV provides deterministic binarization and geometric correction, isolating text regions while suppressing background degradation.

python

import cv2
import numpy as np
from pathlib import Path

def preprocess_provenance_image(image_path: Path) -> np.ndarray:
    img = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise ValueError("Invalid image path or unsupported format")

    # Adaptive thresholding for uneven illumination
    binary = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
    )

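    # Morphological noise removal. Ink is black (0) on a white (255) background
    # here, so closing (dilate, then erode) is what erases isolated dark specks;
    # opening would remove small bright gaps instead.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    denoised = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)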

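    # Deskew via minimum area bounding rectangle around the text pixels.
    # After adaptive THRESH_BINARY the ink is 0 and the background 255, so
    # select the zero pixels. cv2.minAreaRect needs float32 points ordered
    # (x, y), whereas np.where returns (row, col) = (y, x).
    ys, xs = np.where(denoised == 0)
    if xs.size == 0:
        # Surface blank scans here, before Tesseract is ever invoked.
        raise ValueError("No ink pixels found; image appears blank")
    coords = np.column_stack((xs, ys)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Fold the reported angle into [-45, 45] so the correction is the small
    # residual skew rather than a quarter turn.
    if angle < -45:
        angle = 90 + angle
    elif angle > 45:
        angle = angle - 90

    # Rotate the cleaned binary image (not the raw grayscale) so the
    # thresholding and denoising above actually reach Tesseract.
    (h, w) = denoised.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(denoised, M, (w, h), flags=cv2.INTER_CUBIC, borderValue=255)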
    return rotated
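
The angle folding in the deskew step can be pulled out into a pure function, which makes it checkable without loading an image. This is a minimal sketch: normalize_skew_angle is a hypothetical helper name, not an OpenCV function, and it assumes its input is the raw angle in degrees from cv2.minAreaRect, which may be reported as a negative or a positive value depending on the OpenCV release.

```python
def normalize_skew_angle(angle: float) -> float:
    """Fold a rectangle angle in degrees into the range [-45, 45].

    Hypothetical helper mirroring the if/elif branch in
    preprocess_provenance_image; it is not part of OpenCV.
    """
    if angle < -45:
        # e.g. -88 describes the same rectangle as a 2 degree tilt
        return 90 + angle
    if angle > 45:
        # e.g. 87 describes the same rectangle as a -3 degree tilt
        return angle - 90
    return angle
```

Keeping the fold separate from the image code means a regression in either angle convention shows up in a unit test rather than as sideways cards in production.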

Step 2: Tesseract Configuration & Layout Tuning

Engine parameters must override the default heuristics. Provenance cards require block-level segmentation and LSTM optimization, and stamp overlays demand explicit exclusion zones or secondary-pass filtering.

python

import pytesseract
import numpy as np

TESSERACT_CONFIG = "--oem 1 --psm 6 -c preserve_interword_spaces=1"

def extract_text(image_array: np.ndarray, lang: str = "eng") -> str:
    custom_config = f"{TESSERACT_CONFIG} --tessdata-dir /usr/share/tessdata"
    return pytesseract.image_to_string(
        image_array, lang=lang, config=custom_config
    ).strip()

PSM 6 enforces a single uniform text block, which prevents stamp text from fragmenting primary provenance lines. The preserve_interword_spaces flag maintains typewriter spacing for downstream parsing. Teams handling non-Latin scripts must install the corresponding .traineddata files and pass the matching lang code.
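
Because the PSM and OEM values are easy to mistype in a raw string, it may help to assemble the config from validated parameters. The sketch below is an illustration, not part of pytesseract: build_tesseract_config is a hypothetical helper, and the accepted ranges (PSM 0 to 13, OEM 0 to 3) assume Tesseract 4 or 5.

```python
def build_tesseract_config(psm: int = 6, oem: int = 1, preserve_spaces: bool = True) -> str:
    """Assemble a Tesseract config string, rejecting out-of-range modes.

    Hypothetical helper; the ranges assume Tesseract 4 or 5.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if preserve_spaces:
        # Keeps typewriter column spacing intact for downstream parsing.
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)
```

Called with its defaults, the helper reproduces the TESSERACT_CONFIG string used above; passing psm=11 yields the sparse-text variant for stamp-heavy cards.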

Step 3: Memory-Bounded Batch Execution

Concurrent processing requires explicit resource isolation. concurrent.futures provides worker pools with a hard concurrency cap (the built-in generic annotations used below need Python 3.9+), and a free-memory floor prevents subprocess accumulation during multi-page PDF ingestion.

python

import concurrent.futures
import psutil
from pathlib import Path
from typing import Iterator

def process_batch(
    image_paths: list[Path],
    max_workers: int = 4,
    min_free_bytes: int = 512 * 1024 * 1024,
) -> Iterator[tuple[Path, str]]:
    def _run(path: Path) -> tuple[Path, str]:
        return path, extract_text(preprocess_provenance_image(path))

    queue = list(reversed(image_paths))  # pop() from the end preserves input order
    # Map each future back to its path so a failure can be attributed to a file.
    pending: dict[concurrent.futures.Future, Path] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        while queue or pending:
            # Hold back new submissions while free RAM is below the floor,
            # but always keep at least one task in flight to guarantee progress.
            while queue and len(pending) < max_workers and (
                psutil.virtual_memory().available > min_free_bytes or not pending
            ):
                path = queue.pop()
                pending[executor.submit(_run, path)] = path
            done, _ = concurrent.futures.wait(
                list(pending), return_when=concurrent.futures.FIRST_COMPLETED
            )
            for future in done:
                path = pending.pop(future)
                try:
                    yield future.result()
                except Exception as exc:
                    # Report the real path instead of a placeholder.
                    yield path, f"OCR_FAILURE: {exc}"

The ThreadPoolExecutor works well here because OpenCV and the Tesseract subprocess release the GIL during native execution, so the bounded pool achieves real parallelism rather than merely interleaving. Submission is throttled against available system RAM: when free memory drops below min_free_bytes the loop stops enqueuing new pages until in-flight work drains, capping peak memory during multi-page PDF ingestion. This mirrors the scaling guidance in Automating OCR Metadata Extraction.
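
The submission rule described above can be restated as a small pure function, which makes the throttle testable without psutil or a real worker pool. This is a sketch under stated assumptions: may_submit is a hypothetical name, and the caller is assumed to supply the current free-memory reading.

```python
def may_submit(
    queued: int,
    in_flight: int,
    max_workers: int,
    available_bytes: int,
    min_free_bytes: int,
) -> bool:
    """Return True when the batch loop should hand one more page to the pool.

    Hypothetical helper mirroring the inner while-condition of process_batch.
    """
    if queued == 0 or in_flight >= max_workers:
        return False
    # Below the memory floor, submit only when nothing is running, so the
    # batch slows down under pressure but can never stall completely.
    return available_bytes > min_free_bytes or in_flight == 0
```

The last line is the important design choice: with zero tasks in flight the floor is ignored, trading a possible memory spike on one page for guaranteed forward progress.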

Step 4: Schema Validation & LIDO Mapping

Raw strings require structural enforcement before CMS transmission. Pydantic v2 models validate dates, redact PII, and map to a LIDO provenance event: a lido:eventSet that pairs the acquisition lido:displayEvent with a lido:event carrying the lido:eventDate. Validation failures route to quarantine queues rather than into the LIDO-to-database mapping layer.

python

from pydantic import BaseModel, Field, field_validator
from datetime import date, datetime
import re

class ProvenanceRecord(BaseModel):
    object_id: str = Field(pattern=r"^[A-Z]{3,4}-\d{4,6}$")
    provenance_text: str = Field(min_length=5, max_length=500)
    acquisition_date: date | None = None
    previous_owner: str | None = None
    transaction_type: str | None = None

    @field_validator("provenance_text")
    @classmethod
    def strip_pii(cls, v: str) -> str:
        return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", v)

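    @field_validator("acquisition_date", mode="before")
    @classmethod
    def parse_date(cls, v: object) -> date | None:
        if v is None or v == "":
            return None
        # Pass through values that are already dates; strptime would raise a
        # TypeError on them, which Pydantic does not turn into a ValidationError.
        if isinstance(v, datetime):
            return v.date()
        if isinstance(v, date):
            return v
        for fmt in ("%Y-%m-%d", "%d %B %Y", "%m/%d/%Y"):
            try:
                return datetime.strptime(str(v).strip(), fmt).date()
            except ValueError:
                continue
        raise ValueError(f"Unrecognized date format: {v}")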

The model enforces institutional ID patterns and sanitizes sensitive identifiers. Date parsing normalizes archival variations into ISO 8601, field validators execute synchronously during instantiation, and invalid payloads fail fast. The free-text previous_owner and place names embedded in provenance_text are strong candidates for reconciliation against controlled vocabularies — see Implementing Getty AAT & TGN for resolving those strings to stable authority identifiers.
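
The quarantine branch mentioned at the start of this step can be sketched as a router that separates valid records from failures. This is a minimal illustration, not the pipeline's actual queue: route_records is a hypothetical name, the quarantine is a plain list, and the validate callable is assumed to be something like lambda p: ProvenanceRecord(**p). It catches ValueError on the assumption that Pydantic's ValidationError derives from it.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def route_records(
    payloads: Iterable[dict],
    validate: Callable[[dict], T],
) -> tuple[list[T], list[dict]]:
    """Split raw payloads into validated records and quarantined failures.

    Hypothetical helper; `validate` is any callable that raises ValueError
    (or a subclass of it) on bad input.
    """
    accepted: list[T] = []
    quarantined: list[dict] = []
    for payload in payloads:
        try:
            accepted.append(validate(payload))
        except ValueError as exc:
            # Keep the raw payload and the reason so an operator can re-key it.
            quarantined.append({"payload": payload, "error": str(exc)})
    return accepted, quarantined
```

Keeping the raw payload next to the error message means a quarantined card can be corrected and resubmitted without rerunning OCR.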

Step 5: IIIF Integration & CMS Sync

Validated records must link back to source imagery for auditability. IIIF Presentation API 3.0 manifests wrap extracted metadata alongside the high-resolution derivatives, and the sync pipeline serializes payloads into CMS-compatible JSON — the same JSON-LD discipline covered in structuring JSON-LD for museum objects.

python

from typing import Any
import json

# IIIF 3.0 requires every resource id to be a dereferenceable HTTP(S) URI.
IIIF_BASE = "https://iiif.example.org"

def build_iiif_provenance_manifest(record: ProvenanceRecord, image_uri: str) -> dict[str, Any]:
    base = f"{IIIF_BASE}/{record.object_id}"
    canvas_id = f"{base}/canvas"
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base}/manifest",
        "type": "Manifest",
        "label": {"en": [f"Provenance Record: {record.object_id}"]},
        "items": [{
            "id": canvas_id,
            "type": "Canvas",
            "height": 3000,
            "width": 2400,
            # Canvas "items" holds the painting annotation (the image itself).
            "items": [{
                "id": f"{base}/painting-page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{base}/painting-anno",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {"id": image_uri, "type": "Image", "format": "image/tiff"},
                    "target": canvas_id
                }]
            }],
            # Canvas "annotations" holds non-painting content such as the OCR text.
            "annotations": [{
                "id": f"{base}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{base}/anno",
                    "type": "Annotation",
                    "motivation": "commenting",
                    "body": {
                        "type": "TextualBody",
                        "value": record.provenance_text,
                        "format": "text/plain"
                    },
                    "target": canvas_id
                }]
            }]
        }],
        "metadata": [
            {"label": {"en": ["LIDO Provenance"]}, "value": {"en": [record.provenance_text]}},
            # Omit the date entry when no acquisition date was parsed, rather
            # than emitting the string "None".
            *(
                [{"label": {"en": ["Acquisition Date"]},
                  "value": {"en": [record.acquisition_date.isoformat()]}}]
                if record.acquisition_date else []
            ),
        ]
    }

The manifest binds OCR output to the source canvas via IIIF annotations. Metadata fields map directly to LIDO lido:provenanceText and lido:eventDate, and serialization produces a CMS-ready payload that downstream ingestion handlers consume without transformation overhead.
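
Before a manifest is serialized, it can be useful to walk it and confirm that every id is an absolute HTTP(S) URI. The sketch below uses only the standard library; iter_ids and non_http_ids are hypothetical helper names, and the check is purely syntactic (it does not fetch anything, so it cannot prove a URI actually resolves).

```python
from typing import Any, Iterator
from urllib.parse import urlparse


def iter_ids(node: Any) -> Iterator[str]:
    """Yield every string "id" value found anywhere in a nested structure."""
    if isinstance(node, dict):
        if isinstance(node.get("id"), str):
            yield node["id"]
        for value in node.values():
            yield from iter_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_ids(item)


def non_http_ids(manifest: dict[str, Any]) -> list[str]:
    """Return the ids that are not absolute http(s) URIs (syntax check only)."""
    bad = []
    for value in iter_ids(manifest):
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            bad.append(value)
    return bad
```

A sync handler could call non_http_ids on the output of build_iiif_provenance_manifest and refuse to run json.dumps when the returned list is non-empty.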

Edge Cases and Variants

Single card vs. multi-page PDF. For paginated donor correspondence, rasterize each page (for example with pdf2image at 300 DPI) into individual Path inputs so the memory floor in process_batch throttles per page rather than per document; a 40-page PDF held entirely in memory is the most common OOM trigger.
PSM selection. --psm 6 suits uniform typewritten cards. Use --psm 4 for cards with a single column of variable-size lines, and --psm 11 (sparse text) when stamps and marginalia dominate and you want raw tokens for a second reconciliation pass.
Stamp and seal overlays. Where an accession stamp overprints the text, run a second sparse pass on a masked copy of the region and merge tokens by bounding-box position instead of trusting a single reading order.
Non-Latin and mixed scripts. Pass a combined lang such as "eng+deu" for German dealer records or "eng+fra" for French sale catalogues, and confirm the matching .traineddata files are mounted on the shared tessdata volume.
Strict vs. lenient validation. In backfill runs, catch pydantic.ValidationError and route the record to a quarantine queue with the raw text attached; in interactive re-keying, surface the error to the operator instead of silently dropping the row.
Confidence-gated review. When you need per-word confidence to decide review routing, switch extract_text to pytesseract.image_to_data(..., output_type=Output.DICT) and threshold on the mean conf column, exactly as the parent OCR stage does.
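
The confidence gate in the last variant can be sketched as a function over the dictionary that pytesseract.image_to_data returns with output_type=Output.DICT. The function names and the threshold of 70 are illustrative assumptions, not values from the parent OCR stage. Rows with a conf of -1 are treated as layout-only entries and skipped, and conf values are coerced with float() because they may arrive as numbers or strings.

```python
def mean_word_confidence(data: dict) -> float:
    """Average confidence over recognized words, ignoring layout-only rows.

    Expects the "conf" and "text" lists from pytesseract's DICT output.
    """
    scores = []
    for conf, text in zip(data["conf"], data["text"]):
        value = float(conf)
        # conf == -1 marks block/line rows that carry no recognized word.
        if value >= 0 and str(text).strip():
            scores.append(value)
    return sum(scores) / len(scores) if scores else 0.0


def needs_review(data: dict, threshold: float = 70.0) -> bool:
    """Route a card to manual review when mean confidence is under threshold.

    The default threshold is a placeholder to be tuned per collection.
    """
    return mean_word_confidence(data) < threshold
```

A card with no recognized words scores 0.0 and therefore always goes to review, which is usually the safer default for blank or badly faded scans.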

Validation

Confirm the contract holds end to end with an assert-based smoke test — no live Tesseract binary required, because Steps 4 and 5 are pure functions over a validated record:

python

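# test_provenance_pipeline.py  ->  run with:  pytest -q test_provenance_pipeline.py
# Assumes ProvenanceRecord (Step 4) and build_iiif_provenance_manifest (Step 5)
# are defined earlier in this file or imported from your own pipeline module.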
from datetime import date
import pytest
from pydantic import ValidationError

def test_pii_is_redacted_and_date_normalized():
    rec = ProvenanceRecord(
        object_id="PNT-004821",
        provenance_text="Acquired from J. Doe, SSN 123-45-6789, gift of the estate",
        acquisition_date="14 March 1961",
    )
    assert "[REDACTED]" in rec.provenance_text        # SSN scrubbed before egress
    assert rec.acquisition_date == date(1961, 3, 14)  # archival date -> ISO 8601

def test_malformed_object_id_is_rejected():
    with pytest.raises(ValidationError):
        ProvenanceRecord(object_id="painting_1", provenance_text="Purchased 1901")

def test_manifest_ids_are_absolute_uris():
    rec = ProvenanceRecord(object_id="PNT-004821", provenance_text="Gift, 1961")
    manifest = build_iiif_provenance_manifest(rec, "https://iiif.example.org/PNT-004821/full.tif")
    assert manifest["id"].startswith("https://")       # IIIF 3.0 dereferenceable id
    assert manifest["items"][0]["type"] == "Canvas"

A green run checks the two invariants that keep downstream ingestion clean: SSN-shaped identifiers are redacted before a record is serialized, and the emitted manifest id is an absolute HTTPS URI.

Standards Alignment

The output conforms to three cultural-heritage standards at once. The validated record maps to LIDO by carrying lido:provenanceText and a dated lido:event, so it slots cleanly into a LIDO harvest without post-processing. The manifest follows the IIIF Presentation API 3.0, where every id is a resolvable HTTP(S) URI and the OCR string is attached to its canvas as a commenting annotation distinct from the painting image body. Where previous_owner and place names are reconciled against Getty AAT and TGN, the record gains stable authority URIs that make provenance searchable across institutions rather than trapped in free text.

Frequently Asked Questions

Why does Tesseract mangle dates on typewritten cards even after preprocessing?

The failure is usually in layout analysis rather than character recognition: the page segmenter guesses at reading order and splits the date across blocks, even though the digits themselves are read correctly. Force a single uniform block with --psm 6 and keep preserve_interword_spaces=1 so a typewriter date like 14 March 1961 is not collapsed into an unparseable token before it reaches the parse_date validator.

How do I stop batch runs from being OOM-killed on multi-page PDFs?

Rasterize each PDF page to a separate image and feed the pages individually to process_batch, which holds submissions when free RAM drops below min_free_bytes. Holding a whole high-DPI PDF in memory at once defeats the throttle.

Should redaction happen in OCR or later in the pipeline?

As early as possible. The strip_pii validator runs during model instantiation, before the record is serialized into a manifest or transmitted, so restricted identifiers never reach the network or the CMS.

Related

Automating OCR Metadata Extraction: parent OCR stage overview
Schema Validation with Pydantic: enforcing record contracts
Mapping LIDO to Internal Databases: landing provenance events
Implementing Getty AAT & TGN: resolving owners and places
Automated Record Ingestion & Sync Workflows: the full ingestion pipeline
