Engine

The remediation engine.

A six-step deterministic pipeline — the same source PDF in produces the same remediated PDF out. Every structure element, every font program, every metadata entry is rebuilt against ISO 14289-1:2014 and validated by veraPDF on every run.

TrueSection remediation pipeline · automated PDF/UA-1 engine · originally built for a court-system engagement

6: Pipeline steps
106: PDF/UA-1 rules addressed
17: Verified fixes
86: WCAG 2.2 criteria covered

Architecture

The engine accepts any PDF, extracts its semantic structure, and applies a deterministic six-step remediation pipeline. Visual content is never altered. Only the accessibility layer is rebuilt.

Input: Source PDF (non-conforming)
→ Pipeline: Six steps (Extract · Structure · Glyph mapping · Identification · Annotations · Validation)
→ Output: PDF/UA-1 conformant (veraPDF-validated)

Pipeline steps

Each step is applied in strict sequence. A failure at any step halts the pipeline for that document — no partial output, no silent corruption.
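The halt-on-failure behaviour can be sketched roughly as follows. This is a minimal illustration, not the engine's code: `run_pipeline`, `PipelineError`, and the "bytes in, bytes out" step signature are assumptions made for the example. The idea is that the output file is only created once every step has succeeded, by writing to a temporary file and renaming it into place.

```python
import os
import tempfile
from typing import Callable, Sequence

# Hypothetical step signature: each step takes PDF bytes and returns PDF bytes.
Step = Callable[[bytes], bytes]


class PipelineError(RuntimeError):
    """Raised when a step fails; the message names the failing step."""


def run_pipeline(source: bytes, steps: Sequence[tuple[str, Step]], out_path: str) -> None:
    data = source
    for name, step in steps:
        try:
            data = step(data)
        except Exception as exc:
            # Halt immediately. Nothing has been written to out_path yet.
            raise PipelineError(f"step {name!r} failed: {exc}") from exc

    # Write to a temporary file in the target directory, then rename it into
    # place, so out_path never holds a partially written result.
    directory = os.path.dirname(os.path.abspath(out_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, out_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A caller would pass the six steps in order, for example `run_pipeline(pdf_bytes, [("extract", extract), ("structure", structure)], "out.pdf")`, where the step functions are placeholders.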

01
Extract

bytes → structured input

Pull text, structure cues, and metadata out of the source PDF.


The first stage performs structured extraction over the source PDF. The extractor emits XHTML with marked content structure — headings, tables, lists, figures, annotations — plus JSON metadata covering font names, document language, and title. This is the only semantic-analysis step in the pipeline; every downstream structural decision derives from this output. Before extraction, three pre-flight hazards are resolved: encryption (saved unencrypted), dynamic XFA forms (stripped from /AcroForm), and reference XObjects (/Ref deleted).

Rules addressed: 7.10-1 · 7.10-2 · 7.11-1 · 7.15-1 · 7.16-1 · 7.20-1

Failure modes handled

Extractor unavailable

Pipeline aborts with a clear error message. No partial output written.
Out-of-memory on large PDFs

Same abort path. Large PDFs that exhaust the extractor's heap are surfaced as failures, never silently truncated.
Malformed XHTML returned

The structure parser uses XMLParser(recover=True) and pre-strips invalid XML character references and bare control characters.
Encrypted PDFs (B1)

Decrypted on open. Saving without the encryption= parameter produces unencrypted output that passes isEncrypted == false (rule 7.16-1).
Reference XObject deletion (B16)

/Ref XObjects are deleted rather than inlined. May lose visual content if the XObject depends on /Ref for rendering. Acknowledged trade-off for the court-system corpus.
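The pre-strip applied to malformed XHTML before parsing might look something like the sketch below. It uses only the standard library, and `strip_invalid_xml` is a hypothetical name; the engine's actual helper is not shown in this document. It removes numeric character references that point at characters XML 1.0 forbids, then removes bare forbidden characters.

```python
import re


def _is_valid_xml_char(cp: int) -> bool:
    # The Char production of XML 1.0: tab, LF, CR, and three ranges.
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Matches decimal (&#0;) and hexadecimal (&#x1;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def strip_invalid_xml(text: str) -> str:
    def _drop_bad_ref(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep references to legal characters, drop the rest.
        return match.group(0) if _is_valid_xml_char(cp) else ""

    text = _CHAR_REF.sub(_drop_bad_ref, text)
    # Drop bare control characters and other code points XML forbids.
    return "".join(ch for ch in text if _is_valid_xml_char(ord(ch)))
```

Cleaning the input first means the recovering parser only has to cope with structural damage, not with characters that no XML parser may accept.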
02
Structure

tags: 0 → 247

Rebuild the logical document tree so headings, lists, tables, and reading order are explicit.


Walks the extracted XHTML and builds an in-memory tree of typed structure nodes: Document, Part, Sect, H1–H6, P, Table / THead / TBody / TR / TH / TD, L / LI / Lbl / LBody, Figure, Note, Art. Existing tags from the source PDF are removed first — patching an unknown structure tree is more error-prone than replacement. The new tree serializes into PDF structure elements with /StructTreeRoot, /ParentTree number tree, per-page MCID arrays, and bidirectional parent/child references. Every page's content stream is rewritten so every operator sits inside either a tagged BDC...EMC block (with MCID linking to the structure tree) or an /Artifact BMC...EMC block.

Rules addressed: 7.1-1..3 · 7.1-11 · 7.1-12 · 7.2-x · 7.4.2-1 · 7.4.4-x · 7.5-1 · 7.20-2
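As an illustration of the in-memory tree and per-page MCID numbering described above, here is a small sketch. `StructNode` and `assign_mcids` are hypothetical names, and the rule "a node owns content when it has a page index, except Lbl" is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StructNode:
    tag: str                     # e.g. "Document", "H1", "P", "LBody"
    page: Optional[int] = None   # set only on nodes that own page content
    mcid: Optional[int] = None   # filled in by assign_mcids()
    children: list["StructNode"] = field(default_factory=list)


def assign_mcids(root: StructNode) -> dict[int, list[StructNode]]:
    """Number content nodes per page; return page index -> nodes ordered by MCID."""
    per_page: dict[int, list[StructNode]] = {}

    def visit(node: StructNode) -> None:
        # Assumption: Lbl nodes never receive an MCID (compare fix B8).
        if node.page is not None and node.tag != "Lbl":
            owners = per_page.setdefault(node.page, [])
            node.mcid = len(owners)  # the counter restarts at 0 on each page
            owners.append(node)
        for child in node.children:
            visit(child)

    visit(root)
    return per_page
```

Each list in the returned mapping has the shape of one page's ParentTree entry: position `i` holds the structure node that owns MCID `i` on that page.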

Failure modes handled

Per-page MCIDs (B4)

ISO 32000-1 §14.7.4.4 requires MCIDs unique per page. A previous global counter caused cross-page ParentTree contamination.
Heading sequence (B5)

Rule 7.4.2-1 — heading levels must not skip. H1→H3 is clamped to H1→H2.
Phantom Lbl MCID (B8)

List label (Lbl) nodes are now assigned no MCID. The extractor includes bullet text inline in <li>, so a phantom MCID on the label previously stole a real BT block, desynchronizing all subsequent MCIDs.
Artifact wrapping (B9)

On pages with no MCIDs, existing marked-content markers are stripped before the content is wrapped as Artifact. This prevents nested tagged content inside Artifact (rule 7.1-2).
Single-row tables (B10)

No empty THead is created. Single-row tables get TBody only.
Mixed td/th rows (B11)

All cells collected in document order. Previous code dropped <th> in mixed rows.
Empty leading page collapse (B13)

Page index increments after the first page div regardless of content. Empty leading pages no longer collapse into index 0.
Double-tagging in list items (B14)

List items use direct text only, not recursive text, to avoid double MCID assignment.
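The heading clamp from B5 is simple enough to sketch. The function below is an illustration under two assumptions: levels are plain integers 1 to 6, and the level before the first heading counts as 0, so a document that opens with H2 is clamped to H1. The name `clamp_heading_levels` is hypothetical.

```python
def clamp_heading_levels(levels: list[int]) -> list[int]:
    """Limit each descent to one level below the previous (clamped) heading."""
    clamped: list[int] = []
    previous = 0  # assumption: treat "no heading yet" as level 0
    for level in levels:
        if level > previous + 1:
            level = previous + 1  # H1 -> H3 becomes H1 -> H2
        clamped.append(level)
        previous = level
    return clamped
```

Only descents are clamped. Moving back up, for example from H3 to H1, is left unchanged because that does not skip a level.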
03
Glyph mapping

tounicode: missing → present

Repair character mappings so assistive technology can read glyphs back as text.


Three sub-steps. (a) Font embedding: for every font whose descriptor has none of /FontFile, /FontFile2, or /FontFile3, the matching system font file is located, extracted (including from TrueType Collections), and embedded as program data. (b) ToUnicode CMap repair: symbol fonts such as Wingdings use non-standard encodings, so even an embedded font cannot map its glyphs back to Unicode when the /ToUnicode CMap is missing. A 220-entry authoritative mapping table generates a complete CMap. (c) Verification: every font with a descriptor must now have an embedded program; any violation raises FontVerificationError and halts the pipeline.

Rules addressed: 7.21.4.1-1 · 7.21.7-1 · 7.21.7-2 · 7.21.8-1

Failure modes handled

Font not found on system

Halts with FontVerificationError naming the font. Never silently produces non-conforming output.
TrueType Collection (.ttc)

fontTools extracts the matching face by PostScript name (nameID 6). Falls back to face 0 if no match.
Supplementary Unicode (U+1F589 etc.)

_unicode_to_utf16be_hex() uses Python's chr().encode("utf-16-be"), which correctly produces 4-byte surrogate pairs for codepoints above U+FFFF.
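A sketch of how a ToUnicode CMap could be generated from a code-to-codepoint table is shown below. It assumes a simple one-byte font encoding; `build_tounicode_cmap` is a hypothetical name and the sample entries in the usage note are illustrative, not the engine's 220-entry table. Entries are emitted in blocks of at most 100, following the usual CMap convention.

```python
def unicode_to_utf16be_hex(codepoint: int) -> str:
    # Code points above U+FFFF encode as a surrogate pair (4 bytes, 8 hex digits).
    return chr(codepoint).encode("utf-16-be").hex().upper()


def build_tounicode_cmap(mapping: dict[int, int]) -> str:
    """Build ToUnicode CMap text from {character code: Unicode code point}."""
    items = sorted(mapping.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",  # assumption: one-byte character codes
        "endcodespacerange",
    ]
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, codepoint in chunk:
            lines.append(f"<{code:02X}> <{unicode_to_utf16be_hex(codepoint)}>")
        lines.append("endbfchar")
    lines += ["endcmap", "CMapName currentdict /CMap defineresource pop", "end", "end"]
    return "\n".join(lines) + "\n"
```

For example, `build_tounicode_cmap({0x21: 0x1F589, 0x22: 0x2702})` contains the lines `<21> <D83DDD89>` and `<22> <2702>`. The resulting text would then be stored as the font's /ToUnicode stream.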
04
Identification

xmp: missing → present

Stamp title, language, and the PDF/UA-1 conformance identifier — the only conformance marker PDFs natively carry.


Two adjacent actions. First, catalog-level entries: /MarkInfo Marked=true (rule 6.2-1), /MarkInfo Suspects=false (7.1-4), /ViewerPreferences DisplayDocTitle=true (7.1-10), and /Lang sourced from extraction metadata, defaulting to "en" (7.2-29). Second, a complete XMP metadata stream replaces any existing metadata: pdfuaid:part=1 (5-1, 5-2), dc:title sourced from extraction / existing XMP / filename stem (7.1-9), /Type /Metadata and /Subtype /XML on the stream (7.1-8), and the correct pdfuaid: namespace prefix on all three identification properties (5-3, 5-4, 5-5).

Rules addressed: 5-1 · 5-2 · 5-3 · 5-4 · 5-5 · 6.2-1 · 7.1-4 · 7.1-8 · 7.1-9 · 7.1-10 · 7.2-29
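For orientation, a minimal XMP packet carrying the PDF/UA identifier and a title could be built as in the sketch below. `build_xmp_packet` is a hypothetical helper; it emits only pdfuaid:part and dc:title, and it assumes the title has already been cleaned of characters XML forbids.

```python
from xml.sax.saxutils import escape


def build_xmp_packet(title: str, part: int = 1) -> str:
    """Return an XMP packet with pdfuaid:part and dc:title."""
    safe_title = escape(title)  # escapes &, < and > for element content
    return (
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
        ' <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
        '  <rdf:Description rdf:about=""\n'
        '    xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/"\n'
        '    xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
        f'   <pdfuaid:part>{part}</pdfuaid:part>\n'
        '   <dc:title><rdf:Alt>'
        f'<rdf:li xml:lang="x-default">{safe_title}</rdf:li>'
        '</rdf:Alt></dc:title>\n'
        '  </rdf:Description>\n'
        ' </rdf:RDF>\n'
        '</x:xmpmeta>\n'
        '<?xpacket end="w"?>'
    )
```

The packet text would be encoded as UTF-8 and written as the catalog's /Metadata stream, with /Type /Metadata and /Subtype /XML set on the stream dictionary.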

Failure modes handled

Missing language

Extraction returns no language or returns "MISSING" — defaults to "en". Both "en" and "en-US" pass rule 7.2-29.
Control characters in title (B17)

XML 1.0 forbids U+0000–U+0008, U+000B, U+000C, U+000E–U+001F. _xml_escape() strips them before escaping &, <, >, ". Without this the XMP stream is malformed XML.
Null-byte title

Detected and replaced with the filename stem. Never writes an empty dc:title.
ViewerPreferences not a Dictionary

Replaced entirely with {/DisplayDocTitle: true}. Existing valid preferences are preserved when type is correct.
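The title handling above can be approximated as follows. This is a sketch, not the engine's `_xml_escape()`: the names `xml_escape` and `choose_title` are stand-ins, and the fallback order (extraction, then existing XMP, then filename stem) is an assumption based on the order in which the sources are listed.

```python
import re
from pathlib import Path
from typing import Optional

# Control characters that XML 1.0 forbids; tab, LF and CR stay legal.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xml_escape(text: str) -> str:
    """Strip forbidden control characters, then escape &, <, > and "."""
    text = _FORBIDDEN.sub("", text)
    return (
        text.replace("&", "&amp;")  # must run first to avoid double escaping
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )


def choose_title(extracted: Optional[str], existing_xmp: Optional[str], pdf_path: str) -> str:
    """Pick the first usable title; fall back to the filename stem."""
    for candidate in (extracted, existing_xmp):
        if candidate:
            cleaned = _FORBIDDEN.sub("", candidate).strip()
            if cleaned:
                return cleaned
    # A null-byte or whitespace-only title ends up here, so dc:title is never empty.
    return Path(pdf_path).stem
```

Stripping before escaping matters: a forbidden character has no valid escaped form in XML 1.0, so it has to be removed rather than encoded.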
05
Annotations

roles: 0 → 18

Assign required accessibility roles to every interactive element — links, form fields, alt-text slots — so each is reachable.


Per-annotation handling by subtype: /Link → Link struct element with /Contents (rule 7.18.5-1); /Widget → Form struct element with /TU (7.18.4-1); /TrapNet → deleted entirely from /Annots (7.18.2-1); /PrinterMark → no struct element (7.18.8-1); all others → Annot struct element with /Contents (7.18.1-1). Every page with annotations receives /Tabs=/S to enforce structure-order tab navigation (7.18.3-1). Media clip data dictionaries receive /CT and /Alt (7.18.6.2-1, 7.18.6.2-2).

Rules addressed: 7.18.1-1 · 7.18.1-2 · 7.18.1-3 · 7.18.2-1 · 7.18.3-1 · 7.18.4-1 · 7.18.4-2 · 7.18.5-1 · 7.18.5-2 · 7.18.6.2-1 · 7.18.6.2-2 · 7.18.8-1

Failure modes handled

TrapNet left in /Annots (B2)

Previous code skipped struct element creation but left the annotation object. veraPDF evaluates every PDTrapNetAnnot. The rebuilt array now excludes them entirely.
Orphaned StructParent (B3)

Each annotation gets a /StructParent value starting at len(pages); the function returns {StructParent: struct_element} so the caller appends entries to the ParentTree /Nums array.
PrinterMark struct elements (B15)

Explicitly skipped. veraPDF checks structParentType == null for PrinterMark annotations.
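The subtype dispatch and StructParent numbering can be sketched with plain dictionaries standing in for PDF annotation objects. Everything here is illustrative: `remediate_annotations` is a hypothetical name, annotations are modelled as dicts with a "Subtype" key, and structure elements are represented only by their type name.

```python
STRUCT_TYPE_BY_SUBTYPE = {"/Link": "Link", "/Widget": "Form"}
NO_STRUCT_ELEMENT = {"/PrinterMark"}  # kept in /Annots, but never tagged
DELETED = {"/TrapNet"}                # removed from /Annots entirely


def remediate_annotations(
    pages: list[list[dict]],
) -> tuple[list[list[dict]], dict[int, str]]:
    """Return rebuilt per-page annotation lists and new ParentTree entries."""
    # Assumption: keys 0..len(pages)-1 are used by the per-page MCID arrays,
    # so annotation keys start at len(pages).
    next_key = len(pages)
    parent_tree: dict[int, str] = {}
    rebuilt: list[list[dict]] = []
    for annots in pages:
        kept: list[dict] = []
        for annot in annots:
            subtype = annot["Subtype"]
            if subtype in DELETED:
                continue  # excluded from the rebuilt array (B2)
            annot = dict(annot)  # copy so the input is not mutated
            if subtype not in NO_STRUCT_ELEMENT:
                annot["StructParent"] = next_key
                parent_tree[next_key] = STRUCT_TYPE_BY_SUBTYPE.get(subtype, "Annot")
                next_key += 1
            kept.append(annot)
        rebuilt.append(kept)
    return rebuilt, parent_tree
```

Returning the key-to-element mapping alongside the rebuilt arrays lets the caller add every assigned /StructParent value to the ParentTree, which is what avoids the orphaned entries described in B3.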
06
Validation

violations: 251 → 2

Validate against PDF/UA-1 and WCAG 2.2 with veraPDF and emit the violation delta.


The remediated PDF is validated end-to-end with veraPDF (1.30.0 / 1.30.1) against two profiles: PDF_UA/PDFUA-1.xml (ISO 14289-1:2014) and WCAG-2-2-Complete.xml (the PDF Association's PDF surface coverage of WCAG 2.2). Per-file rule-level pass/fail diffs ship as part of the deliverable. The site never claims a result veraPDF does not confirm.

Rules addressed: all 106 PDF/UA-1 rules · all 86 WCAG 2.2 criteria covered by the WCAG-2-2-Complete profile

Failure modes handled

Validation regression

If any deterministic rule fails on the remediated output, the pipeline marks the document as failed and surfaces the rule-level diff. Failed is a terminal status; no silent retry.
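A rule-level delta of the kind described above might be computed as in this sketch. The input format is an assumption: two dictionaries mapping a rule ID to its failed-check count, taken from the reports before and after remediation. Parsing of the veraPDF report itself is not shown, and `rule_delta`, `RuleDelta`, and `document_status` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleDelta:
    fixed: list[str]       # failed before, passes now
    remaining: list[str]   # failed before, still fails
    regressed: list[str]   # passed before, fails now


def rule_delta(before: dict[str, int], after: dict[str, int]) -> RuleDelta:
    failed_before = {rule for rule, count in before.items() if count > 0}
    failed_after = {rule for rule, count in after.items() if count > 0}
    return RuleDelta(
        fixed=sorted(failed_before - failed_after),
        remaining=sorted(failed_before & failed_after),
        regressed=sorted(failed_after - failed_before),
    )


def document_status(delta: RuleDelta, deterministic_rules: set[str]) -> str:
    """Return 'failed' if any deterministic rule still fails, else 'passed'."""
    still_failing = set(delta.remaining) | set(delta.regressed)
    return "failed" if still_failing & deterministic_rules else "passed"
```

The status is derived only from the validator's output, which keeps the pass/fail decision tied to what the report actually confirms.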

Verified fixes.

17 defects · ISO-rule cited · veraPDF-discovered

Each was discovered through veraPDF validation of real court documents — not theoretical, not from an internal scorecard. Tagged in source with the ISO rule and the specific code change.


B1
Rule 7.16-1

Encryption handling

pikepdf decrypts on open; saving without encryption= produces unencrypted output that passes isEncrypted == false.
B2
Rule 7.18.2-1

TrapNet annotations

TrapNet left in /Annots after skipping struct element creation. veraPDF evaluates every PDTrapNetAnnot regardless. Rebuilt array excludes them.
B3
Rule 7.18.x

Orphaned StructParent

Annotation StructParent values assigned but no corresponding ParentTree entries created — orphaned reverse mappings.
B4
ISO 32000-1 §14.7.4.4

Cross-page MCID contamination

MCIDs were globally numbered instead of per-page. Caused cross-page ParentTree contamination where page N's array contained entries for pages 0..N-1.
B5
Rule 7.4.2-1

Heading-level skip

Heading levels could skip (e.g., H1 directly to H3). Now clamped to previous_level + 1 on descent.
B6
Rule 7.20-2

Form XObject orphan MCIDs

Form XObject streams retained old BMC/BDC/EMC markers after page stripping, leaving orphaned MCIDs.
B7
Rule 7.1-3

Excess BT blocks untagged

More BT blocks than MCIDs left text blocks untagged. Excess BT blocks now wrapped as Artifact.
B8
Rule 7.2-20

Phantom Lbl MCID

Lbl (list label) nodes assigned a phantom MCID. The extractor includes bullet text inline in <li>, so the phantom MCID stole a real BT block, desynchronizing all subsequent MCIDs.
B9
Rule 7.1-2

Artifact-wrap nested tags

_wrap_all_as_artifact didn't strip existing markers before wrapping, creating nested tagged content inside Artifact.
B10
Rule 7.2-14

Single-row table empty THead

Single-row tables created an empty THead + TBody pair. THead without TBody violates 7.2-14. Single-row tables now get TBody only.
B11
Rule 7.2-10

Mixed td/th rows

Mixed <td>/<th> rows: code searched for <td> first, fell back to <th> only if zero <td> found, silently dropping <th> cells in mixed rows.
B12
Rule 7.1-3

Skipped-container text loss

Direct text on skipped containers (annotation, acroform, ocr) was lost from the accessibility tree. Now wrapped in a P node with MCID.
B13
ISO 32000-1 §14.7.4.4

Empty leading page collapse

Empty leading pages collapsed into page index 0 because page index only incremented when page_mcids was non-empty.
B14
Rule 7.2-20

Double-tagging in list items

_full_text() in list items captured nested children's text. When children were recursed into, they created additional MCIDs for the same text — double-tagging.
B15
Rule 7.18.8-1

PrinterMark struct elements

PrinterMark annotations incorrectly received struct elements. veraPDF checks structParentType == null for PrinterMark.
B16
Rule 7.20-1

/Ref deletion (acknowledged trade-off)

/Ref deletion on Reference XObjects may lose visual data if the XObject depends on it for rendering. Accepted trade-off for the court-system corpus.
B17
Rule 5-1

Control characters in title

Control characters that XML 1.0 forbids (U+0000–U+0008, U+000B, U+000C, U+000E–U+001F) in the document title produced malformed XMP XML. Now stripped before XML escaping.
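To make one of the fixes above concrete, the B7 behaviour (excess BT blocks wrapped as Artifact) can be sketched on content-stream text. This is a simplified illustration: `wrap_text_blocks` is a hypothetical name, each BT...ET block is assumed to arrive as a separate string, and tags are matched to blocks purely by position.

```python
def wrap_text_blocks(bt_blocks: list[str], tags: list[str]) -> str:
    """Wrap each BT...ET block in marked content.

    Block i gets structure tag tags[i] with MCID i. Blocks beyond the
    end of tags are wrapped as Artifact so nothing is left untagged.
    """
    wrapped: list[str] = []
    for index, block in enumerate(bt_blocks):
        if index < len(tags):
            wrapped.append(f"/{tags[index]} <</MCID {index}>> BDC\n{block}\nEMC")
        else:
            wrapped.append(f"/Artifact BMC\n{block}\nEMC")
    return "\n".join(wrapped)
```

With three text blocks and the tags `["H1", "P"]`, the first two blocks become tagged content with MCIDs 0 and 1 and the third becomes an Artifact.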

Reference

Full machine-checkable rule and criterion references for the two standards the engine validates against.

ISO 14289-1:2014

PDF/UA-1 validation rules

All 106 machine-checkable rules grouped by ISO clause, with object type, requirement, and the error veraPDF emits on failure.


W3C Recommendation · ISO/IEC 40500:2025

WCAG 2.2 success criteria

All 86 criteria across 4 principles and 13 guidelines, with level (A/AA/AAA) and version (2.0, 2.1, 2.2) labels.


The pipeline is deterministic. The validation is third-party. The receipts are per file.

Every customer engagement runs on this engine. Same source PDF in, same remediated PDF out, with the veraPDF report attached. The audit trail is the deliverable.


source: TrueSection remediation pipeline. iso 14289-1:2014. verapdf 1.30.0 / 1.30.1 against PDF_UA/PDFUA-1.xml and WCAG-2-2-Complete.xml profiles.