Engine
The remediation engine.
A six-step deterministic pipeline that transforms non-conforming PDFs into fully PDF/UA-1 conformant documents. Every structure element, font program, and metadata entry is checked against ISO 14289-1:2014 and validated by veraPDF.
TrueSection remediation pipeline · automated PDF/UA-1 engine · originally built for DC Courts
- 6 pipeline steps
- 106 PDF/UA-1 rules addressed
- 17 verified fixes
- 86 WCAG 2.2 criteria covered
Architecture
The engine accepts any PDF, extracts its semantic structure, and applies a deterministic six-step remediation pipeline. Visual content is never altered. Only the accessibility layer is rebuilt.
Input
Source PDF
Non-conforming.
Pipeline
Six steps
Extract · Structure · Glyph mapping · Identification · Annotations · Validation.
Output
PDF/UA-1 conformant
veraPDF-validated.
Pipeline steps
Each step is applied in strict sequence. A failure at any step halts the pipeline for that document — no partial output, no silent corruption.
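The halt-on-failure rule can be sketched as a plain sequential runner. This is a minimal illustration under stated assumptions, not the engine's code: run_pipeline, PipelineError, and the step callables are hypothetical names, and the steps here operate on an in-memory dict rather than a real PDF.

```python
from typing import Callable, Iterable

class PipelineError(Exception):
    """Raised when a step fails; names the failing step (hypothetical type)."""

Step = Callable[[dict], dict]

def run_pipeline(doc: dict, steps: Iterable[tuple[str, Step]]) -> dict:
    """Apply steps in strict order; any failure aborts with no partial result."""
    state = dict(doc)  # work on a copy so the caller's input is never mutated
    for name, step in steps:
        try:
            state = step(state)
        except Exception as exc:
            # Halt immediately: nothing is returned or written for this document.
            raise PipelineError(f"step {name!r} failed: {exc}") from exc
    return state
```

Because the result is only returned after the last step succeeds, a caller that writes output after run_pipeline returns can never persist a half-remediated document.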
- 01
Extract
bytes → structured input
Pull text, structure cues, and metadata out of the source PDF.
The first stage performs structured extraction over the source PDF. The extractor emits XHTML with marked content structure — headings, tables, lists, figures, annotations — plus JSON metadata covering font names, document language, and title. This is the only semantic-analysis step in the pipeline; every downstream structural decision derives from this output. Before extraction, three pre-flight hazards are resolved: encryption (saved unencrypted), dynamic XFA forms (stripped from /AcroForm), and reference XObjects (/Ref deleted).
Rules addressed: 7.10-1 · 7.10-2 · 7.11-1 · 7.15-1 · 7.16-1 · 7.20-1
Failure modes handled
- Extractor unavailable
Pipeline aborts with a clear error message. No partial output is written.
- Out-of-memory on large PDFs
Same abort path. Large PDFs that exhaust the extractor's heap are surfaced as failures, never silently truncated.
- Malformed XHTML returned
The structure parser uses XMLParser(recover=True) and pre-strips invalid XML character references and bare control characters.
- Encrypted PDFs (B1)
Decrypted on open. Saving without the encryption= parameter produces unencrypted output that passes isEncrypted == false (rule 7.16-1).
- Reference XObject deletion (B16)
/Ref XObjects are deleted rather than inlined. This may lose visual content if the XObject depends on /Ref for rendering, an acknowledged trade-off for the DC Courts corpus.
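The pre-strip applied to malformed XHTML can be sketched with the standard library alone. This is an illustrative version, assuming the extractor output is available as a Python string; prestrip_xhtml and its helpers are hypothetical names, and the recovering parse that follows is not shown.

```python
import re

# Characters XML 1.0 forbids in the C0 range: everything below U+0020
# except tab (U+0009), line feed (U+000A), and carriage return (U+000D).
_BARE_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# Numeric character references, hexadecimal (&#xB;) or decimal (&#11;).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def _is_valid_xml_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def prestrip_xhtml(text: str) -> str:
    """Drop invalid character references, then bare control characters."""
    def _keep_or_drop(match: re.Match) -> str:
        cp = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        return match.group(0) if _is_valid_xml_char(cp) else ""
    text = _CHAR_REF.sub(_keep_or_drop, text)
    return _BARE_CONTROL.sub("", text)
```

Valid references such as &#65; pass through untouched, so only input that no conforming XML parser could accept is removed.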
- 02
Structure
tags: 0 → 247
Rebuild the logical document tree so headings, lists, tables, and reading order are explicit.
Walks the extracted XHTML and builds an in-memory tree of typed structure nodes: Document, Part, Sect, H1–H6, P, Table / THead / TBody / TR / TH / TD, L / LI / Lbl / LBody, Figure, Note, Art. Existing tags from the source PDF are removed first — patching an unknown structure tree is more error-prone than replacement. The new tree serializes into PDF structure elements with /StructTreeRoot, /ParentTree number tree, per-page MCID arrays, and bidirectional parent/child references. Every page's content stream is rewritten so every operator sits inside either a tagged BDC...EMC block (with MCID linking to the structure tree) or an /Artifact BMC...EMC block.
Rules addressed: 7.1-1..3 · 7.1-11 · 7.1-12 · 7.2-x · 7.4.2-1 · 7.4.4-x · 7.5-1 · 7.20-2
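The per-page MCID arrays described above can be illustrated with a small allocator. This is a sketch only: Node and assign_mcids are hypothetical names, the node list is assumed to arrive in reading order, and real ParentTree serialization is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    tag: str                    # structure type, e.g. "H1" or "P"
    page: int                   # zero-based page index holding the content
    mcid: Optional[int] = None  # filled in by assign_mcids

def assign_mcids(nodes: list[Node]) -> dict[int, list[Node]]:
    """Number MCIDs from 0 on every page and build per-page parent arrays."""
    next_mcid: dict[int, int] = defaultdict(int)
    parent_arrays: dict[int, list[Node]] = defaultdict(list)
    for node in nodes:
        node.mcid = next_mcid[node.page]      # counter is scoped to the page
        next_mcid[node.page] += 1
        parent_arrays[node.page].append(node) # array index equals the MCID
    return dict(parent_arrays)
```

Keeping one counter per page means each page's array holds only that page's nodes, and the array position of a node always equals its MCID.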
Failure modes handled
- Per-page MCIDs (B4)
ISO 32000-1 §14.7.4.4 requires MCIDs to be unique per page. A previous global counter caused cross-page ParentTree contamination.
- Heading sequence (B5)
Heading levels must not skip (rule 7.4.2-1). H1→H3 is clamped to H1→H2.
- Phantom Lbl MCID (B8)
List label (Lbl) nodes are assigned no MCID. The extractor includes bullet text inline in <li>, so a phantom MCID on the label stole a real BT block and desynchronized all subsequent MCIDs.
- Artifact wrapping (B9)
Pages with no MCIDs have existing markers stripped before being wrapped as Artifact, which prevents nested tagged content inside Artifact (rule 7.1-2).
- Single-row tables (B10)
No empty THead is created. Single-row tables get TBody only.
- Mixed td/th rows (B11)
All cells are collected in document order. Previous code dropped <th> cells in mixed rows.
- Empty leading page collapse (B13)
The page index increments after the first page div regardless of content. Empty leading pages no longer collapse into index 0.
- Double-tagging in list items (B14)
List items use direct text only, not recursive text, to avoid double MCID assignment.
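The heading clamp from B5 is simple enough to show in full. The sketch below is an assumption about how such a clamp might look, not the engine's source: clamp_heading_levels is a hypothetical name, it compares against the previous clamped level, and it treats a first heading deeper than H1 as H1.

```python
def clamp_heading_levels(levels: list[int]) -> list[int]:
    """Limit each descent to one level so the sequence never skips."""
    clamped: list[int] = []
    previous = 0  # no heading seen yet, so the first heading is at most H1
    for level in levels:
        level = min(level, previous + 1)  # descending by more than one is clamped
        clamped.append(level)
        previous = level                  # ascending (e.g. H3 back to H1) is untouched
    return clamped
```

For the case cited above, clamp_heading_levels([1, 3]) returns [1, 2].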
- 03
Glyph mapping
tounicode: missing → present
Repair character mappings so assistive technology can read glyphs back as text.
Three sub-steps. (a) Font embedding: every font missing /FontFile, /FontFile2, or /FontFile3 has its system file located, extracted (including from TrueType Collections), and embedded as program data. (b) ToUnicode CMap repair: symbol fonts (Wingdings) use non-standard encodings; even when embedded, missing /ToUnicode CMaps mean glyphs cannot map back to Unicode. A 220-entry authoritative mapping table generates a complete CMap. (c) Verification: every font with a descriptor must now have an embedded program; raises FontVerificationError on any violation — the pipeline halts.
Rules addressed: 7.21.4.1-1 · 7.21.7-1 · 7.21.7-2 · 7.21.8-1
Failure modes handled
- Font not found on system
Halts with FontVerificationError naming the font. Never silently produces non-conforming output.
- TrueType Collection (.ttc)
fontTools extracts the matching face by PostScript name (nameID 6). Falls back to face 0 if no match.
- Supplementary Unicode (U+1F589 etc.)
_unicode_to_utf16be_hex() uses Python's chr().encode("utf-16-be"), which correctly produces a 4-byte surrogate pair for each codepoint above U+FFFF.
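The UTF-16BE conversion used for ToUnicode entries can be sketched as follows. The function unicode_to_utf16be_hex is a stand-in written from the description above, bfchar_lines is a hypothetical helper, and the sample code-to-codepoint pair in the usage note is illustrative rather than taken from the engine's mapping table.

```python
def unicode_to_utf16be_hex(codepoint: int) -> str:
    """Hex-encode a codepoint as UTF-16BE; astral codepoints become a surrogate pair."""
    return chr(codepoint).encode("utf-16-be").hex().upper()

def bfchar_lines(mapping: dict[int, int]) -> list[str]:
    """Render single-byte character codes as bfchar entries for a ToUnicode CMap."""
    return [
        f"<{code:02X}> <{unicode_to_utf16be_hex(cp)}>"
        for code, cp in sorted(mapping.items())
    ]
```

A BMP codepoint such as U+2022 encodes to 2022, while U+1F589 encodes to the surrogate pair D83DDD89, so bfchar_lines({0x21: 0x1F589}) yields the single entry <21> <D83DDD89>.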
- 04
Identification
xmp: missing → present
Stamp title, language, and the PDF/UA-1 conformance identifier — the only conformance marker PDFs natively carry.
Two adjacent actions. First, catalog-level entries: /MarkInfo Marked=true (rule 6.2-1), /MarkInfo Suspects=false (7.1-4), /ViewerPreferences DisplayDocTitle=true (7.1-10), and /Lang sourced from extraction metadata, defaulting to "en" (7.2-29). Second, a complete XMP metadata stream replaces any existing metadata: pdfuaid:part=1 (5-1, 5-2), dc:title sourced from extraction / existing XMP / filename stem (7.1-9), /Type /Metadata and /Subtype /XML on the stream (7.1-8), and the correct pdfuaid: namespace prefix on all three identification properties (5-3, 5-4, 5-5).
Rules addressed: 5-1 · 5-2 · 5-3 · 5-4 · 5-5 · 6.2-1 · 7.1-4 · 7.1-8 · 7.1-9 · 7.1-10 · 7.2-29
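The fallback order for /Lang and dc:title can be sketched as two small resolvers. These are hypothetical helpers (resolve_language, resolve_title) written from the description above; in particular, letting a candidate that contains a null byte fall through to the next source is an assumption.

```python
from pathlib import Path
from typing import Optional

def resolve_language(extracted: Optional[str]) -> str:
    """Use the extracted language, defaulting to "en" when absent or "MISSING"."""
    if not extracted or extracted.strip() in ("", "MISSING"):
        return "en"
    return extracted.strip()

def resolve_title(extracted: Optional[str], existing_xmp: Optional[str], pdf_path: str) -> str:
    """Pick the title from extraction, then existing XMP, then the filename stem."""
    for candidate in (extracted, existing_xmp):
        # Assumption: a candidate with a null byte or only whitespace is unusable.
        if candidate and "\x00" not in candidate and candidate.strip():
            return candidate.strip()
    return Path(pdf_path).stem  # last resort, so dc:title is never empty
```

The filename stem is always available, which is what guarantees a non-empty title regardless of how poor the source metadata is.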
Failure modes handled
- Missing language
If extraction returns no language, or returns "MISSING", /Lang defaults to "en". Both "en" and "en-US" pass rule 7.2-29.
- Control characters in title (B17)
XML 1.0 forbids U+0000–U+0008, U+000B, U+000C, and U+000E–U+001F. _xml_escape() strips them before escaping &, <, >, and ". Without this, the XMP stream is malformed XML.
- Null-byte title
Detected and replaced with the filename stem. An empty dc:title is never written.
- ViewerPreferences not a Dictionary
Replaced entirely with {/DisplayDocTitle: true}. Existing valid preferences are preserved when the type is correct.
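The strip-then-escape order described for B17 can be sketched like this. The function below is a stand-in named xml_escape, written from the description of _xml_escape() rather than copied from the engine.

```python
import re

# C0 control characters that XML 1.0 forbids; tab, LF, and CR are allowed.
_XML_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def xml_escape(value: str) -> str:
    """Strip forbidden control characters, then escape XML metacharacters."""
    value = _XML_FORBIDDEN.sub("", value)
    return (
        value.replace("&", "&amp;")   # must run first so later entities survive
             .replace("<", "&lt;")
             .replace(">", "&gt;")
             .replace('"', "&quot;")
    )
```

Escaping the ampersand first matters: doing it last would turn an already-produced &lt; into &amp;lt;.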
- 05
Annotations
roles: 0 → 18
Assign required accessibility roles to every interactive element — links, form fields, alt-text slots — so each is reachable.
Per-annotation handling by subtype: /Link → Link struct element with /Contents (rule 7.18.5-1); /Widget → Form struct element with /TU (7.18.4-1); /TrapNet → deleted entirely from /Annots (7.18.2-1); /PrinterMark → no struct element (7.18.8-1); all others → Annot struct element with /Contents (7.18.1-1). Every page with annotations receives /Tabs=/S to enforce structure-order tab navigation (7.18.3-1). Media clip data dictionaries receive /CT and /Alt (7.18.6.2).
Rules addressed: 7.18.1-1 · 7.18.1-2 · 7.18.1-3 · 7.18.2-1 · 7.18.3-1 · 7.18.4-1 · 7.18.4-2 · 7.18.5-1 · 7.18.5-2 · 7.18.6.2-1 · 7.18.6.2-2 · 7.18.8-1
Failure modes handled
- TrapNet left in /Annots (B2)
Previous code skipped struct element creation but left the annotation object in place. veraPDF evaluates every PDTrapNetAnnot, so the rebuilt array now excludes them entirely.
- Orphaned StructParent (B3)
Each annotation gets a /StructParent value starting at len(pages); the function returns {StructParent: struct_element} so the caller appends entries to the ParentTree /Nums array.
- PrinterMark struct elements (B15)
Explicitly skipped. veraPDF checks structParentType == null for PrinterMark annotations.
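The per-subtype handling and StructParent numbering can be sketched over plain dictionaries. This is a simplified model, not pikepdf code: plan_annotations is a hypothetical name, annotations are represented as dicts with a "Subtype" key, and struct elements are reduced to their role.

```python
def plan_annotations(pages: list[list[dict]]) -> tuple[list[list[dict]], dict[int, dict]]:
    """Return rebuilt per-page annotation lists and a {StructParent: element} map."""
    next_key = len(pages)  # lower keys are assumed to belong to the per-page MCID arrays
    parent_tree: dict[int, dict] = {}
    rebuilt: list[list[dict]] = []
    for annots in pages:
        kept: list[dict] = []
        for original in annots:
            annot = dict(original)  # copy so the caller's objects are not mutated
            subtype = annot["Subtype"]
            if subtype == "/TrapNet":
                continue            # dropped from the rebuilt array entirely
            kept.append(annot)
            if subtype == "/PrinterMark":
                continue            # kept, but gets no struct element
            role = {"/Link": "Link", "/Widget": "Form"}.get(subtype, "Annot")
            annot["StructParent"] = next_key
            parent_tree[next_key] = {"S": role}
            next_key += 1
        rebuilt.append(kept)
    return rebuilt, parent_tree
```

Returning the map alongside the rebuilt lists lets the caller add one ParentTree entry per assigned key, so no StructParent value is left without a reverse mapping.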
- 06
Validation
violations: 251 → 2
Validate against PDF/UA-1 and WCAG 2.2 with veraPDF and emit the violation delta.
The remediated PDF is validated end-to-end with veraPDF (1.30.0 / 1.30.1) against two profiles: PDF_UA/PDFUA-1.xml (ISO 14289-1:2014) and WCAG-2-2-Complete.xml (the PDF Association's PDF surface coverage of WCAG 2.2). Per-file rule-level pass/fail diffs ship as part of the deliverable. The site never claims a result veraPDF does not confirm.
Rules addressed: all 106 PDF/UA-1 rules · all 86 WCAG 2.2 criteria covered by the WCAG-2-2-Complete profile
Failure modes handled
- Validation regression
If any deterministic rule fails on the remediated output, the pipeline marks the document as failed and surfaces the rule-level diff. Failed is a terminal status; there is no silent retry.
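A rule-level diff of this kind can be sketched as set arithmetic over failed rule IDs. The function violation_delta and its output keys are hypothetical, and the sketch assumes the failed rule IDs have already been read out of the two veraPDF reports; report parsing is not shown.

```python
def violation_delta(before: set[str], after: set[str]) -> dict:
    """Compare failed rule IDs before and after remediation."""
    return {
        "fixed": sorted(before - after),      # failed originally, pass now
        "remaining": sorted(before & after),  # still failing
        "regressed": sorted(after - before),  # newly failing after remediation
        # Any failure on the remediated output is terminal for the document.
        "status": "failed" if after else "passed",
    }
```

A document is marked passed only when the after-set is empty, which mirrors the terminal failed status described above.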
Verified fixes.
17 defects · ISO-rule cited · veraPDF-discovered
Each was discovered through veraPDF validation of real court documents — not theoretical, not from an internal scorecard. Tagged in source with the ISO rule and the specific code change.
- B1 Rule 7.16-1
Encryption handling
pikepdf decrypts on open; saving without encryption= produces unencrypted output that passes isEncrypted == false.
- B2 Rule 7.18.2-1
TrapNet annotations
TrapNet left in /Annots after skipping struct element creation. veraPDF evaluates every PDTrapNetAnnot regardless. Rebuilt array excludes them.
- B3 Rule 7.18.x
Orphaned StructParent
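Annotation StructParent values were assigned but no corresponding ParentTree entries were created, leaving orphaned reverse mappings. The annotation step now returns a {StructParent: struct_element} map that the caller appends to the ParentTree /Nums array.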
- B4 ISO 32000-1 §14.7.4.4
Cross-page MCID contamination
MCIDs were globally numbered instead of per-page. Caused cross-page ParentTree contamination where page N's array contained entries for pages 0..N-1.
- B5 Rule 7.4.2-1
Heading-level skip
Heading levels could skip (e.g., H1 directly to H3). Now clamped to previous_level + 1 on descent.
- B6 Rule 7.20-2
Form XObject orphan MCIDs
Form XObject streams retained old BMC/BDC/EMC markers after page stripping, leaving orphaned MCIDs.
- B7 Rule 7.1-3
Excess BT blocks untagged
More BT blocks than MCIDs left text blocks untagged. Excess BT blocks now wrapped as Artifact.
- B8 Rule 7.2-20
Phantom Lbl MCID
Lbl (list label) nodes assigned a phantom MCID. The extractor includes bullet text inline in <li>, so the phantom MCID stole a real BT block, desynchronizing all subsequent MCIDs.
- B9 Rule 7.1-2
Artifact-wrap nested tags
_wrap_all_as_artifact didn't strip existing markers before wrapping, creating nested tagged content inside Artifact.
- B10 Rule 7.2-14
Single-row table empty THead
Single-row tables created an empty THead + TBody pair. THead without TBody violates 7.2-14. Single-row tables now get TBody only.
- B11 Rule 7.2-10
Mixed td/th rows
Mixed <td>/<th> rows: code searched for <td> first, fell back to <th> only if zero <td> found, silently dropping <th> cells in mixed rows.
- B12 Rule 7.1-3
Skipped-container text loss
Direct text on skipped containers (annotation, acroform, ocr) was lost from the accessibility tree. Now wrapped in a P node with MCID.
- B13 ISO 32000-1 §14.7.4.4
Empty leading page collapse
Empty leading pages collapsed into page index 0 because page index only incremented when page_mcids was non-empty.
- B14 Rule 7.2-20
Double-tagging in list items
_full_text() in list items captured nested children's text. When children were recursed into, they created additional MCIDs for the same text — double-tagging.
- B15 Rule 7.18.8-1
PrinterMark struct elements
PrinterMark annotations incorrectly received struct elements. veraPDF checks structParentType == null for PrinterMark.
- B16 Rule 7.20-1
/Ref deletion (acknowledged trade-off)
/Ref deletion on Reference XObjects may lose visual data if the XObject depends on it for rendering. Accepted trade-off for the DC Courts corpus.
- B17 Rule 5-1
Control characters in title
Control characters in the document title (U+0000–U+0008, U+000B, U+000C, U+000E–U+001F) produced malformed XMP XML. They are now stripped before XML escaping.
Reference
Full machine-checkable rule and criterion references for the two standards the engine validates against.
ISO 14289-1:2014
PDF/UA-1 validation rules
All 106 machine-checkable rules grouped by ISO clause, with object type, requirement, and the error veraPDF emits on failure.
W3C Recommendation · ISO/IEC 40500:2025
WCAG 2.2 success criteria
All 86 criteria across 4 principles and 13 guidelines, with level (A/AA/AAA) and version (2.0, 2.1, 2.2) labels.
The pipeline is deterministic. The validation is third-party. The receipts are per file.
Every customer engagement runs on this engine. Same source PDF in, same remediated PDF out, with the veraPDF report attached. The audit trail is the deliverable.
Source: TrueSection remediation pipeline. ISO 14289-1:2014. veraPDF 1.30.0 / 1.30.1 against the PDF_UA/PDFUA-1.xml and WCAG-2-2-Complete.xml profiles.