Implementing Getty AAT & TGN in Digital Asset Pipelines

Workflow Context

Getty vocabularies are the authoritative backbone for controlled terminology in museum collection management. The Art & Architecture Thesaurus (AAT) standardizes object classification, material composition, and stylistic period; the Thesaurus of Geographic Names (TGN) anchors provenance, creation site, and cultural jurisdiction. The specific sub-problem this page solves within the Core Architecture & Collection Taxonomy pipeline stage is the resolution step: turning a bare Getty URI — or worse, a free-text curator string — into a typed, validated record carrying a preferred label and a hierarchical path, without hammering the Getty Linked Open Data endpoints or letting an unverified term reach a public index.

The role that owns this code is the Python automation engineer; the role that consumes its output is the collections manager who expects a search for a material or a place to return everything, and only what it should. This stage runs immediately after structural validation. When museum object schemas queue a raw medium or place string for enrichment, the resolver described here converts it into a stable identifier before the record is handed to the LIDO-to-database mapping layer for persistence. Treating Getty URIs as immutable identifiers rather than ephemeral labels is what prevents vocabulary drift from fragmenting downstream indexes: a preferred label may change at Getty, but the URI that keys your cache never does.

Prerequisites

Before wiring the resolver into an ingestion pipeline, confirm the following are in place:

Python 3.10+ for the PEP 604 union syntax (str | None), which the Pydantic model evaluates at runtime, and for the native generic hints used throughout.
aiohttp (pip install aiohttp) for concurrent, non-blocking HTTP against the Getty LOD endpoints.
Pydantic v2 (pip install "pydantic>=2.6") for the typed GettyTerm record and its field_validator; see the Pydantic v2 documentation for the validation surface.
The Getty vocabulary endpoints: AAT at http://vocab.getty.edu/aat/{id} and TGN at http://vocab.getty.edu/tgn/{id}. Appending .json to any subject URI returns its JSON-LD representation — the Getty Vocabularies as Linked Open Data reference documents the semantics.
A persistent cache (an in-process dict for a single run, or Redis for a long-lived service) keyed on the canonical URI, to collapse duplicate lookups and survive endpoint rate limiting.
A quarantine or review queue for terms that fail to resolve, so an unmatched string never blocks the structural write and never silently reaches production.
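The cache is only as good as its key, so it helps to reduce every incoming URI to one form before lookup. The helper below is a minimal sketch of that idea, not part of the resolver itself: the name canonical_getty_uri, the choice of the http scheme as the canonical form, and the assumption that subject IDs are purely numeric are all illustrative.

```python
from urllib.parse import urlsplit

GETTY_HOST = "vocab.getty.edu"


def canonical_getty_uri(raw: str) -> str:
    """Normalize a Getty subject URI to a single cache-key form (assumed convention)."""
    parts = urlsplit(raw.strip())
    if parts.netloc.lower() != GETTY_HOST:
        raise ValueError(f"not a Getty vocabulary URI: {raw!r}")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:
        raise ValueError(f"expected /<vocab>/<id>, got path {parts.path!r}")
    vocab, ident = segments[0].lower(), segments[1]
    # Drop a format suffix such as ".json" so the key is the bare subject URI.
    ident = ident.split(".", 1)[0]
    # Assumption: Getty subject IDs are numeric.
    if not ident.isdigit():
        raise ValueError(f"non-numeric subject id in {raw!r}")
    # Assumption: the http form quoted in the prerequisites is the canonical one.
    return f"http://{GETTY_HOST}/{vocab}/{ident}"
```

With this in front of the cache, https://vocab.getty.edu/aat/300041365/ and http://vocab.getty.edu/AAT/300041365.json collapse to the same key instead of costing two lookups.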

Schema Reference

A resolved term is a small, closed record. The GettyTerm model below is the canonical structure every downstream stage consumes; the table is the specification the code enforces.

| Field | Type | Constraint | Source / role |
| --- | --- | --- | --- |
| `uri` | `HttpUrl` | must parse as an absolute URL | The immutable Getty subject URI; primary cache key |
| `preferred_label` | `str` | non-empty | `skos:prefLabel` `@value`; the display term |
| `term_type` | `str` | one of `AAT`, `TGN` (upper-cased) | Selects the LIDO element the term maps to |
| `hierarchical_path` | `str \| None` | optional | `gvp:broaderPreferred` label; the immediate parent concept |
| `resolved_at` | `datetime` | UTC, auto-set | Audit timestamp for re-resolution and cache expiry |

The controlled-vocabulary constraint lives on term_type: a field validator rejects anything outside the {AAT, TGN} set, so a mistyped or foreign identifier fails loudly at construction rather than producing a record that maps to no LIDO element. The Getty JSON-LD payload nests the label inside a @graph array of nodes, and the preferred label itself is an array of language-tagged {"@value": ...} objects — the parser has to reach through both layers, which is the single most error-prone line in a naive implementation.
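One way to make that reach-through less fragile is to isolate it in a small helper that tolerates the shapes JSON-LD allows for a value: a list of objects, a single object, or a plain string. The function below is a sketch under those assumptions; pick_label and its preferred_lang parameter are hypothetical names, and the German label in the usage note is illustrative rather than copied from Getty.

```python
from typing import Any, Optional


def pick_label(labels: Any, preferred_lang: str = "en") -> Optional[str]:
    """Return the label tagged preferred_lang, else the first label, else None."""
    if labels is None:
        return None
    # JSON-LD may serialize a single value as an object or string, not a list.
    if isinstance(labels, (dict, str)):
        labels = [labels]
    values = []
    for item in labels:
        if isinstance(item, str):
            values.append((None, item))
        elif isinstance(item, dict) and item.get("@value"):
            values.append((item.get("@language"), item["@value"]))
    # Prefer an explicit language match over array position.
    for lang, value in values:
        if lang == preferred_lang:
            return value
    return values[0][1] if values else None
```

For example, pick_label([{"@value": "Radierungen", "@language": "de"}, {"@value": "etchings", "@language": "en"}]) returns "etchings" even though the English label is second in the array, and pick_label(None) returns None rather than raising.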

Step-by-Step Implementation

1. Model the resolved term and validate its type

Define the record first, so every later stage has a typed contract. The field_validator closes the term_type enum, and resolved_at stamps each record for audit and cache expiry.

python

import asyncio
import logging
from typing import Any
from datetime import datetime, timezone
from pydantic import BaseModel, Field, HttpUrl, field_validator

logger = logging.getLogger("aat_tgn_pipeline")


class GettyTerm(BaseModel):
    uri: HttpUrl
    preferred_label: str
    term_type: str
    hierarchical_path: str | None = None
    resolved_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    @field_validator("term_type")
    @classmethod
    def validate_type(cls, v: str) -> str:
        if v.upper() not in {"AAT", "TGN"}:
            raise ValueError("term_type must be 'AAT' or 'TGN'")
        return v.upper()

Using HttpUrl rather than a plain str means a malformed identifier is rejected at construction, and normalizing term_type to upper-case in the validator guarantees aat, AAT, and Aat all resolve to one canonical value before any downstream stage branches on it. The cache itself is keyed on the URI, not on term_type.

2. Fetch the JSON-LD with backoff and a shared session

The Getty endpoints enforce rate limits and will return HTTP 429 under batch load. The fetch method retries on 429 with exponential backoff and reuses a single aiohttp.ClientSession across the whole run so connection setup is not repeated per request.

python

import aiohttp


class GettyResolver:
    def __init__(self, batch_size: int = 20, max_retries: int = 3):
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.session: aiohttp.ClientSession | None = None
        self.cache: dict[str, GettyTerm] = {}

    async def __aenter__(self) -> "GettyResolver":
        self.session = aiohttp.ClientSession(
            headers={"Accept": "application/json", "User-Agent": "MuseumPipeline/1.0"}
        )
        return self

    async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
        if self.session:
            await self.session.close()

    async def _fetch_with_retry(self, uri: str) -> dict[str, Any]:
        json_uri = f"{uri.rstrip('/')}.json"
        for attempt in range(self.max_retries):
            try:
                async with self.session.get(
                    json_uri, timeout=aiohttp.ClientTimeout(total=10)
                ) as resp:
                    if resp.status == 429:
                        await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s
                        continue
                    resp.raise_for_status()
                    return await resp.json()
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                logger.warning("Attempt %d failed for %s: %s", attempt + 1, uri, e)
                if attempt == self.max_retries - 1:
                    raise
        raise RuntimeError("Max retries exceeded")

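The async with GettyResolver() as resolver pattern guarantees the session is closed even when a batch raises, and the 2 ** attempt sleep spaces retries out (1s, 2s, 4s) so a transient 429 is not retried immediately. Because every task follows the same fixed schedule, requests that were throttled together will also retry together; adding random jitter to the delay is the usual way to spread them out.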
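A delay calculation with jitter can be sketched separately from the HTTP code. The function below is an illustration, not the resolver's own method: retry_delay, base, and cap are hypothetical names, and whether Getty sends a Retry-After header at all is an assumption. Only the numeric (seconds) form of that header is handled; an HTTP-date value falls back to the jittered delay.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before the next attempt (attempt is zero-based)."""
    # Honor a numeric Retry-After header if the server sent one.
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after.strip()), cap)
    # Otherwise use "full jitter": a random wait in [0, base * 2**attempt],
    # capped so late attempts do not sleep unboundedly.
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

Inside a retry loop this would replace the fixed sleep, for example await asyncio.sleep(retry_delay(attempt, resp.headers.get("Retry-After"))), so tasks throttled at the same moment wake at different times.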

3. Parse the SKOS / GVP graph into a typed record

The Getty JSON-LD nests the concept inside a @graph array, and @type may be a single string or a list. Resolution locates the skos:Concept node, reads the first skos:prefLabel value, and pulls the immediate parent from gvp:broaderPreferred. A cache check short-circuits the whole method on a repeat URI.

python

    async def resolve_term(self, uri: str) -> GettyTerm:
        if uri in self.cache:
            return self.cache[uri]

        payload = await self._fetch_with_retry(uri)
        graph = payload.get("@graph", [])
        concept_data = next(
            (c for c in graph
             if "skos:Concept" in (
                 c["@type"] if isinstance(c.get("@type"), list) else [c.get("@type")]
             )),
            None,
        )
        if not concept_data:
            raise ValueError(f"No concept data found for {uri}")

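        term = GettyTerm(
            uri=uri,
            preferred_label=concept_data.get(
                "skos:prefLabel", [{"@value": "Unknown"}]
            )[0]["@value"],
            # Match the path segment, not a bare substring; any non-AAT URI
            # still falls through to TGN here.
            term_type="AAT" if "/aat/" in uri else "TGN",
            # `or None` turns the empty-string default into a real None, so a
            # parent-less concept matches the optional field in the schema.
            hierarchical_path=concept_data.get(
                "gvp:broaderPreferred", [{}]
            )[0].get("skos:prefLabel", [{"@value": ""}])[0].get("@value") or None,
        )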
        self.cache[uri] = term
        return term

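The nested [0]["@value"] indexing is deliberate and load-bearing: Getty returns labels as language-tagged arrays, so reaching for [0] takes the first label in the array. That is not guaranteed to be the English one, so select on @language if display language matters. The defensive defaults ([{"@value": "Unknown"}], [{}]) keep a concept that legitimately has no parent (a top-of-hierarchy facet in AAT) from raising a KeyError and aborting the batch; the missing parent surfaces as an empty hierarchical_path. They guard against a missing key only: a key that is present but holds an empty list would still raise IndexError.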

4. Resolve a batch without letting one failure sink the run

Batch resolution gathers the per-term coroutines and isolates failures with return_exceptions=True, so a single unresolvable URI routes to the log rather than cancelling its siblings.

python

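    async def resolve_batch(self, uris: list[str]) -> list[GettyTerm]:
        # Deduplicate first: coroutines started together all see an empty
        # cache, so repeated URIs would otherwise each trigger a request.
        unique = list(dict.fromkeys(uris))
        # Resolve in chunks of batch_size to bound in-flight connections.
        for i in range(0, len(unique), self.batch_size):
            chunk = unique[i:i + self.batch_size]
            results = await asyncio.gather(
                *(self.resolve_term(u) for u in chunk), return_exceptions=True
            )
            for uri, r in zip(chunk, results):
                if not isinstance(r, GettyTerm):
                    logger.error("Resolution failed for %s: %s", uri, r)
        # One entry per input URI that resolved, in the original order.
        return [self.cache[u] for u in uris if u in self.cache]

Deduplicating before the fan-out matters because the cache is only written after a response arrives: coroutines launched together by asyncio.gather all check an empty cache, so without the dict.fromkeys step a batch containing the same material URI a thousand times would issue a thousand requests. With it, that batch issues exactly one, and later batches hit self.cache directly, which is the pattern that keeps bulk ingestion within Getty’s rate budget. Chunking by self.batch_size bounds in-flight connections instead of fanning out the entire list at once.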

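Resolution Flow

In sequence: a URI is checked against the cache; on a miss, the JSON-LD is fetched with backoff, the skos:Concept node is parsed into a GettyTerm, the term_type validator runs, and the record is cached and returned. A failure at any step is logged, and the unresolved term belongs in the review queue rather than the public index.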

LIDO Element Placement and IIIF Manifests

A resolved term is only useful once it lands in the right slot of the interchange record. AAT and TGN resolutions map to different LIDO elements, and term_type is what selects between them.

| `term_type` | LIDO element | Purpose |
| --- | --- | --- |
| `AAT` | `objectWorkType` | Object classification (what the thing is) |
| `AAT` | `termMaterialsTech` | Material and technique |
| `TGN` | `place` / `eventPlace` | Creation site, find spot, jurisdiction |

This placement is what makes the controlled-vocabulary output consumable by cross-institutional harvesters — a term dropped into the wrong element resolves for a human reader but breaks faceted search downstream. See the LIDO schema for the element definitions and the How to Structure JSON-LD for Museum Objects guide for binding the same AAT IRI into a schema:material node. When the record reaches delivery, each resolved term becomes a discrete label / value pair in the IIIF Presentation API 3.0 manifest metadata block, with the Getty URI carried through so a viewer can link back to the authority record.
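Because AAT maps to two different elements, term_type alone is not enough to pick the slot; the field the term came from has to be part of the lookup. The sketch below makes that explicit. The role names (classification, material, place, event_place) are hypothetical labels for the source field, and the split between place and eventPlace is an assumption about how a pipeline might divide them.

```python
# Keys are (term_type, source-field role); values are LIDO element names
# taken from the placement table. The role vocabulary is an assumption.
LIDO_ELEMENT = {
    ("AAT", "classification"): "objectWorkType",
    ("AAT", "material"): "termMaterialsTech",
    ("TGN", "place"): "place",
    ("TGN", "event_place"): "eventPlace",
}


def lido_element(term_type: str, role: str) -> str:
    """Return the LIDO element for a resolved term, or fail loudly."""
    key = (term_type.upper(), role)
    try:
        return LIDO_ELEMENT[key]
    except KeyError:
        # A TGN term offered as a material, for example, is a mapping bug.
        raise ValueError(f"no LIDO element for {key}") from None
```

Raising on an unknown pair, rather than guessing a default element, keeps a misrouted term out of the interchange record.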

Rights and Access Routing

TGN resolution does more than label a place — it can gate visibility. A jurisdiction flag derived from a TGN coordinate is the trigger for access tiering: culturally sensitive material tied to a specific geographic origin routes through an embargo path before any public exposure. The resolver therefore feeds two consumers at once. Its preferred_label and hierarchical_path populate the descriptive record, while the underlying TGN identity is handed to the access layer that the security boundaries for collection APIs enforce.

Because this decision hangs off an authoritative identifier rather than a free-text place name, it is deterministic and auditable: “restricted because this object’s find spot resolves under TGN parent X” is a defensible, reproducible routing reason, where “restricted because someone typed a country name” is not. Restricted geographic data routes to the embargo workflow; everything else proceeds to the normal delivery tier.
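A routing decision of this kind can be expressed as a pure function over identifiers, which is what makes it reproducible. The sketch below assumes the access layer already holds the TGN ancestor URIs for a place and a set of restricted URIs; route_by_place, the tier names, and the placeholder URIs in the usage note are all hypothetical.

```python
from typing import Iterable, Set, Tuple


def route_by_place(tgn_uri: str, ancestor_uris: Iterable[str],
                   restricted: Set[str]) -> Tuple[str, str]:
    """Return (tier, reason) for an object whose place resolved to tgn_uri."""
    # Check the place itself first, then each broader place above it.
    for candidate in [tgn_uri, *ancestor_uris]:
        if candidate in restricted:
            reason = f"{tgn_uri} resolves under restricted place {candidate}"
            return "embargo", reason
    return "public", "no restricted TGN place in the hierarchy"
```

Called with tgn_uri="http://vocab.getty.edu/tgn/PLACE-A", ancestors ["http://vocab.getty.edu/tgn/REGION-X"], and a restricted set containing the REGION-X URI, it returns the embargo tier along with a reason string that names the matching parent, which is the audit trail the paragraph above describes.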

Verification and Testing

Confirm the parser and the type gate behave before pointing the resolver at live endpoints. This test exercises graph parsing, the language-tagged label extraction, and the term_type validator against a fixture payload with no network call.

python

import asyncio


def test_getty_parsing() -> None:
    # A minimal AAT-shaped payload: a skos:Concept with a language-tagged
    # prefLabel and a broaderPreferred parent.
    fixture = {
        "@graph": [
            {
                "@type": ["skos:Concept", "gvp:Concept"],
                "skos:prefLabel": [{"@value": "etchings", "@language": "en"}],
                "gvp:broaderPreferred": [
                    {"skos:prefLabel": [{"@value": "prints", "@language": "en"}]}
                ],
            }
        ]
    }

    resolver = GettyResolver()
    resolver._fetch_with_retry = lambda uri: _immediate(fixture)  # type: ignore
    term = asyncio.run(resolver.resolve_term("http://vocab.getty.edu/aat/300041365"))

    assert term.term_type == "AAT"
    assert term.preferred_label == "etchings"
    assert term.hierarchical_path == "prints"

    # The type gate rejects anything outside the AAT/TGN set.
    try:
        GettyTerm(uri="http://vocab.getty.edu/ulan/500", preferred_label="x", term_type="ULAN")
        raise AssertionError("term_type gate did not fire")
    except ValueError:
        pass

    print("getty parsing OK")


async def _immediate(value):
    return value


if __name__ == "__main__":
    test_getty_parsing()

A green run means the @graph traversal, the [0]["@value"] label extraction, and the controlled-vocabulary gate all hold together. Once it passes, a live smoke test is a single command: python -c "import asyncio; from resolver import GettyResolver; ..." against one known AAT URI to confirm that the endpoint, headers, and backoff path work end to end, and that the live payload actually has the @graph shape the fixture assumes.

Production Hardening Checklist

Back the in-process dict with a persistent Redis cache so resolutions survive process restarts and peak ingestion windows.
Set an explicit User-Agent identifying your institution, as the Getty LOD guidance requests, so throttling decisions can be attributed.
Add a circuit breaker that isolates a failing Getty endpoint from the core pipeline instead of retrying indefinitely.
Emit metrics for resolution latency, cache hit ratio, and unresolved-term count, and alert on a rising quarantine rate.
Re-resolve cached terms on a schedule and diff preferred_label against the stored value to catch upstream Getty relabelling.
Route every unresolved URI to the review queue with its original curator string attached, never to a silent default.
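The circuit-breaker item in the checklist above can be sketched in a few lines of stdlib Python. This is one simple variant (consecutive-failure count plus a cooldown), not a prescribed design; CircuitBreaker, threshold, cooldown, and the injectable clock are hypothetical names chosen for the illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a request may be attempted now."""
        if self._opened_at is None:
            return True
        # After the cooldown, let a trial request through (half-open).
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            # (Re)start the cooldown from the most recent failure.
            self._opened_at = self._clock()
```

A caller would check allow() before each fetch, send the term to the review queue when it returns False, and report the outcome with record_success() or record_failure() so the core pipeline keeps moving while the endpoint is down.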

FAQ

Why do I get a KeyError parsing some Getty payloads?

Getty returns skos:prefLabel as an array of language-tagged {"@value": ...} objects, and top-of-hierarchy concepts have no gvp:broaderPreferred at all. Indexing ["skos:prefLabel"][0]["@value"] on a concept that lacks the key, or ["gvp:broaderPreferred"][0] on a facet root, raises KeyError. Keep the defensive defaults ([{"@value": "Unknown"}], [{}]) so a legitimately parent-less term resolves with an empty hierarchical_path instead of crashing the batch.

I’m being rate-limited (HTTP 429) during a bulk harvest. What do I change?

The _fetch_with_retry backoff handles transient 429s, but sustained throttling means you are issuing too many distinct requests. Confirm the cache is actually shared across the run (one GettyResolver instance for the whole batch, not one per record) so a term resolved earlier costs zero HTTP calls, and deduplicate URIs before each batch, because concurrent lookups of the same URI can all miss a cache that has not been filled yet. Back the cache with Redis so a restart does not re-fetch everything. If the endpoint still pushes back, resolve in smaller chunks to bound in-flight connections.

Should I store the Getty label or the URI in my database?

Store the URI as the durable key and the label as a cached, refreshable attribute. Labels change at Getty; URIs do not. Persisting the label as your primary reference reintroduces the vocabulary drift the whole resolution stage exists to prevent — a re-resolution should be able to overwrite preferred_label without breaking any foreign key.
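The separation can be seen in a two-table layout where objects reference the URI and the label lives only on the term row. The SQLite sketch below is illustrative: the table and column names, the object ID, the timestamps, and the replacement label are all made up for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE getty_term (
    uri             TEXT PRIMARY KEY,   -- durable key
    preferred_label TEXT NOT NULL,      -- cached, refreshable
    resolved_at     TEXT NOT NULL
);
CREATE TABLE object_material (
    object_id TEXT NOT NULL,
    term_uri  TEXT NOT NULL REFERENCES getty_term(uri)
);
"""


def relabel_demo() -> str:
    """Overwrite a label and show the object link is unaffected."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    uri = "http://vocab.getty.edu/aat/300041365"
    conn.execute("INSERT INTO getty_term VALUES (?, ?, ?)",
                 (uri, "etchings", "2024-01-01T00:00:00Z"))
    conn.execute("INSERT INTO object_material VALUES (?, ?)", ("OBJ-1", uri))
    # A re-resolution changes only the label row; no foreign key moves.
    conn.execute(
        "UPDATE getty_term SET preferred_label = ?, resolved_at = ? WHERE uri = ?",
        ("etchings (hypothetical new label)", "2024-06-01T00:00:00Z", uri),
    )
    row = conn.execute(
        "SELECT t.preferred_label FROM object_material m "
        "JOIN getty_term t ON t.uri = m.term_uri WHERE m.object_id = ?",
        ("OBJ-1",),
    ).fetchone()
    conn.close()
    return row[0]
```

The object row never stores the label, so the join picks up the refreshed value without any change to object_material.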

How do I tell an AAT URI from a TGN URI programmatically?

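The path segment carries the vocabulary: vocab.getty.edu/aat/... versus vocab.getty.edu/tgn/.... The resolver infers term_type from the presence of aat in the URI and treats everything else as TGN, so the field_validator only confirms that the result is one of the two allowed values; it cannot catch a URI from another Getty vocabulary (ULAN, for example) that has already been labelled TGN. If such URIs can reach the resolver, check the path segment explicitly, or inspect the @type array in the payload rather than the URL string.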
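An explicit path-segment check might look like the following sketch. The name vocabulary_of is hypothetical, and the TGN ID in the usage note is a placeholder rather than a real record.

```python
from urllib.parse import urlsplit

# Only the two vocabularies this pipeline handles are accepted.
KNOWN_VOCABULARIES = {"aat": "AAT", "tgn": "TGN"}


def vocabulary_of(uri: str) -> str:
    """Return 'AAT' or 'TGN' from the first path segment, else raise."""
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"no vocabulary/id path in {uri!r}")
    vocab = segments[0].lower()
    if vocab not in KNOWN_VOCABULARIES:
        # ULAN and any other vocabulary are rejected, not coerced to TGN.
        raise ValueError(f"unsupported vocabulary {vocab!r} in {uri!r}")
    return KNOWN_VOCABULARIES[vocab]
```

vocabulary_of("http://vocab.getty.edu/aat/300041365") returns "AAT", a URI such as http://vocab.getty.edu/tgn/1234567 returns "TGN", and the ULAN URI used in the test above raises ValueError instead of passing as a place.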

Can I resolve a free-text curator string instead of a URI?

Not with this resolver — it takes an identifier. Free-text matching is a separate reconciliation step (fuzzy match against the AAT label set, then present candidates to a curator) that emits a chosen URI. Feed that URI here. Auto-accepting a fuzzy match without human confirmation is exactly how the wrong objectWorkType reaches a public facet.
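The candidate-generation half of that reconciliation step can be sketched with the standard library's difflib. This is an illustration of the shape only: candidate_terms is a hypothetical name, the label index would in practice come from a local copy of the AAT labels, and two of the URIs below are placeholders rather than real Getty identifiers.

```python
import difflib
from typing import Dict, List, Tuple


def candidate_terms(curator_text: str, label_index: Dict[str, str],
                    limit: int = 3, cutoff: float = 0.6) -> List[Tuple[str, str]]:
    """Return up to `limit` (label, uri) candidates, best match first."""
    # Compare case-insensitively but report the label as stored.
    lookup = {label.lower(): label for label in label_index}
    matches = difflib.get_close_matches(
        curator_text.strip().lower(), list(lookup), n=limit, cutoff=cutoff
    )
    return [(lookup[m], label_index[lookup[m]]) for m in matches]


# Illustrative index; only the first URI appears elsewhere in this guide.
SAMPLE_INDEX = {
    "etchings": "http://vocab.getty.edu/aat/300041365",
    "engravings": "http://vocab.getty.edu/aat/PLACEHOLDER-1",
    "prints": "http://vocab.getty.edu/aat/PLACEHOLDER-2",
}
```

candidate_terms("Etching", SAMPLE_INDEX) ranks etchings first, and a string with no plausible match returns an empty list. The output is a shortlist for a curator to confirm, never a value to write directly.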

Related

Core Architecture & Collection Taxonomy — parent pipeline stage
Designing Museum Object Schemas — queues terms for enrichment
Mapping LIDO to Internal Databases — persists resolved terms
Security Boundaries for Collection APIs — enforces jurisdiction routing
How to Structure JSON-LD for Museum Objects — binds AAT IRIs into delivery
Getty AAT & TGN Vocabulary Mapping Tables — bookmarkable term-to-URI tables
Resolving Getty Vocabulary URIs with SPARQL — query-based term resolution
