The remediation engine.

A six-step deterministic pipeline that transforms non-conforming PDFs into fully PDF/UA-1 conformant documents. Every structure element, every font program, every metadata entry — verified against ISO 14289-1:2014 and validated by veraPDF.

TrueSection remediation pipeline · automated PDF/UA-1 engine · originally built for DC Courts

6 pipeline steps · 106 PDF/UA-1 rules addressed · 17 verified fixes · 86 WCAG 2.2 criteria covered

Architecture

The engine accepts any PDF, extracts its semantic structure, and applies a deterministic six-step remediation pipeline. Visual content is never altered. Only the accessibility layer is rebuilt.

Input: source PDF (non-conforming).

Pipeline: six steps (Extract · Structure · Glyph mapping · Identification · Annotations · Validation).

Output: PDF/UA-1 conformant PDF (veraPDF-validated).

Pipeline steps

Each step is applied in strict sequence. A failure at any step halts the pipeline for that document — no partial output, no silent corruption.
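
The halt-on-failure behaviour described above can be sketched as a small runner. This is a minimal illustration, not the engine's code: `StepError`, `PipelineResult`, and `run_pipeline` are hypothetical names, and each step is assumed to be a function from bytes to bytes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


class StepError(Exception):
    """Hypothetical error type a step raises when it cannot complete."""


@dataclass
class PipelineResult:
    status: str                        # "ok" or "failed"
    output: Optional[bytes]            # None whenever status is "failed"
    failed_step: Optional[str] = None
    message: str = ""


Step = Tuple[str, Callable[[bytes], bytes]]


def run_pipeline(source: bytes, steps: Sequence[Step]) -> PipelineResult:
    """Run steps in order; stop at the first failure and return no output."""
    data = source
    for name, step in steps:
        try:
            data = step(data)
        except StepError as exc:
            # Halt here: later steps never run, and the partial result is discarded.
            return PipelineResult("failed", None, failed_step=name, message=str(exc))
    return PipelineResult("ok", data)
```

Because the intermediate bytes only leave the function on the success path, a caller cannot accidentally write a half-remediated document.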

  1. Extract

    bytes → structured input

    Pull text, structure cues, and metadata out of the source PDF.

    The first stage performs structured extraction over the source PDF. The extractor emits XHTML with marked content structure — headings, tables, lists, figures, annotations — plus JSON metadata covering font names, document language, and title. This is the only semantic-analysis step in the pipeline; every downstream structural decision derives from this output. Before extraction, three pre-flight hazards are resolved: encryption (saved unencrypted), dynamic XFA forms (stripped from /AcroForm), and reference XObjects (/Ref deleted).

    Rules addressed: 7.10-1 · 7.10-2 · 7.11-1 · 7.15-1 · 7.16-1 · 7.20-1

    Failure modes handled

    Extractor unavailable
    Pipeline aborts with a clear error message. No partial output written.
    Out-of-memory on large PDFs
    Same abort path. Large PDFs that exhaust the extractor's heap are surfaced as failures, never silently truncated.
    Malformed XHTML returned
    The structure parser uses XMLParser(recover=True) and pre-strips invalid XML character references and bare control characters.
    Encrypted PDFs (B1)
    Decrypted on open. Saving without the encryption= parameter produces unencrypted output that passes isEncrypted == false (rule 7.16-1).
    Reference XObject deletion (B16)
    /Ref XObjects are deleted rather than inlined. May lose visual content if the XObject depends on /Ref for rendering. Acknowledged trade-off for the DC Courts corpus.
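
The pre-strip applied to malformed XHTML in the Extract step can be approximated with the standard library. The sketch below is an assumption about how such a filter might look, with hypothetical function names. It covers only the pre-strip: the recovering `XMLParser(recover=True)` named above is replaced here by the strict `xml.etree.ElementTree` parser to keep the example self-contained.

```python
import re
import xml.etree.ElementTree as ET

# Bare C0 control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_BARE_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# Numeric character references, decimal (&#11;) or hexadecimal (&#x0B;).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _is_valid_xml_codepoint(cp: int) -> bool:
    """True if cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def _drop_invalid_ref(match: "re.Match[str]") -> str:
    hex_digits, dec_digits = match.groups()
    cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
    # Keep valid references untouched; delete references to forbidden code points.
    return match.group(0) if _is_valid_xml_codepoint(cp) else ""


def pre_strip_xhtml(text: str) -> str:
    """Remove invalid character references, then bare control characters."""
    text = _CHAR_REF.sub(_drop_invalid_ref, text)
    return _BARE_CONTROLS.sub("", text)


def parse_extracted_xhtml(text: str) -> ET.Element:
    """Parse extractor output after the pre-strip (strict parser, no recovery)."""
    return ET.fromstring(pre_strip_xhtml(text))
```
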
  2. Structure

    tags: 0 → 247

    Rebuild the logical document tree so headings, lists, tables, and reading order are explicit.

    Walks the extracted XHTML and builds an in-memory tree of typed structure nodes: Document, Part, Sect, H1–H6, P, Table / THead / TBody / TR / TH / TD, L / LI / Lbl / LBody, Figure, Note, Art. Existing tags from the source PDF are removed first — patching an unknown structure tree is more error-prone than replacement. The new tree serializes into PDF structure elements with /StructTreeRoot, /ParentTree number tree, per-page MCID arrays, and bidirectional parent/child references. Every page's content stream is rewritten so every operator sits inside either a tagged BDC...EMC block (with MCID linking to the structure tree) or an /Artifact BMC...EMC block.

    Rules addressed: 7.1-1..3 · 7.1-11 · 7.1-12 · 7.2-x · 7.4.2-1 · 7.4.4-x · 7.5-1 · 7.20-2

    Failure modes handled

    Per-page MCIDs (B4)
    ISO 32000-1 §14.7.4.4 requires MCIDs unique per page. A previous global counter caused cross-page ParentTree contamination.
    Heading sequence (B5)
    Rule 7.4.2-1 — heading levels must not skip. H1→H3 is clamped to H1→H2.
    Phantom Lbl MCID (B8)
    List label (Lbl) nodes are no longer assigned an MCID. The extractor includes bullet text inline in <li>, so a phantom Lbl MCID used to steal a real BT block and desynchronize all subsequent MCIDs.
    Artifact wrapping (B9)
    On pages with no MCIDs, existing markers are stripped before the content is wrapped as Artifact, which prevents nested tagged content inside Artifact (rule 7.1-2).
    Single-row tables (B10)
    No empty THead is created. Single-row tables get TBody only.
    Mixed td/th rows (B11)
    All cells collected in document order. Previous code dropped <th> in mixed rows.
    Empty leading page collapse (B13)
    The page index now increments for every page div after the first, whether or not the page has content, so empty leading pages no longer collapse into index 0.
    Double-tagging in list items (B14)
    List items use direct text only, not recursive text, to avoid double MCID assignment.
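
The per-page MCID numbering from B4 can be illustrated with a counter that restarts on every page. This is a simplified sketch under stated assumptions: structure nodes are reduced to `(page_index, node_id)` pairs in reading order, and `assign_page_mcids` is a hypothetical name rather than the engine's function.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def assign_page_mcids(
    nodes: Iterable[Tuple[int, str]],
) -> Tuple[Dict[str, Tuple[int, int]], Dict[int, List[str]]]:
    """Give each node an MCID that is unique within its own page.

    Returns (placements, parent_arrays):
      placements[node_id]      -> (page_index, mcid)
      parent_arrays[page_index] -> node ids, where list position == MCID
    """
    next_mcid: Dict[int, int] = defaultdict(int)   # one counter per page
    placements: Dict[str, Tuple[int, int]] = {}
    parent_arrays: Dict[int, List[str]] = defaultdict(list)

    for page_index, node_id in nodes:
        mcid = next_mcid[page_index]
        next_mcid[page_index] += 1
        placements[node_id] = (page_index, mcid)
        # Each page's array holds only that page's nodes, so no entries
        # from earlier pages can leak into it.
        parent_arrays[page_index].append(node_id)

    return placements, dict(parent_arrays)
```

With a single global counter, the first node on page 1 would receive an MCID equal to the number of nodes on page 0; with per-page counters it receives 0.
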
  3. Glyph mapping

    tounicode: missing → present

    Repair character mappings so assistive technology can read glyphs back as text.

    Three sub-steps. (a) Font embedding: for every font with none of /FontFile, /FontFile2, or /FontFile3, the matching system font file is located, extracted (including from TrueType Collections), and embedded as program data. (b) ToUnicode CMap repair: symbol fonts such as Wingdings use non-standard encodings, so even when the font is embedded, a missing /ToUnicode CMap means glyphs cannot map back to Unicode. A 220-entry authoritative mapping table generates a complete CMap. (c) Verification: every font with a descriptor must now have an embedded program; any violation raises FontVerificationError and halts the pipeline.

    Rules addressed: 7.21.4.1-1 · 7.21.7-1 · 7.21.7-2 · 7.21.8-1

    Failure modes handled

    Font not found on system
    Halts with FontVerificationError naming the font. Never silently produces non-conforming output.
    TrueType Collection (.ttc)
    fontTools extracts the matching face by PostScript name (nameID 6). Falls back to face 0 if no match.
    Supplementary Unicode (U+1F589 etc.)
    _unicode_to_utf16be_hex() uses Python's chr().encode("utf-16-be"), which correctly produces 4-byte surrogate pairs for codepoints above U+FFFF.
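
A ToUnicode CMap of the kind described in sub-step (b) can be generated from a code-to-codepoint table roughly as follows. This is a sketch, assuming a simple font with single-byte character codes. The helper mirrors the described `_unicode_to_utf16be_hex()` but is not the engine's source, and the mapping passed in is illustrative rather than the 220-entry table. Entries are emitted in groups of at most 100 because a `beginbfchar` section may not hold more than that.

```python
from typing import Dict, List


def unicode_to_utf16be_hex(codepoint: int) -> str:
    """UTF-16BE hex for one code point; above U+FFFF this is a surrogate pair."""
    return chr(codepoint).encode("utf-16-be").hex().upper()


def build_tounicode_cmap(mapping: Dict[int, int]) -> str:
    """Build a ToUnicode CMap for single-byte codes -> Unicode code points."""
    items = sorted(mapping.items())
    lines: List[str] = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    # Emit bfchar sections of at most 100 entries each.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, codepoint in chunk:
            lines.append(f"<{code:02X}> <{unicode_to_utf16be_hex(codepoint)}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"
```

For example, `build_tounicode_cmap({0x21: 0x1F589})` yields a single `bfchar` entry whose destination is the four-byte surrogate pair for U+1F589.
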
  4. Identification

    xmp: missing → present

    Stamp title, language, and the PDF/UA-1 conformance identifier, the XMP entry through which a PDF declares PDF/UA conformance.

    Two adjacent actions. First, catalog-level entries: /MarkInfo Marked=true (rule 6.2-1), /MarkInfo Suspects=false (7.1-4), /ViewerPreferences DisplayDocTitle=true (7.1-10), and /Lang sourced from extraction metadata, defaulting to "en" (7.2-29). Second, a complete XMP metadata stream replaces any existing metadata: pdfuaid:part=1 (5-1, 5-2), dc:title sourced from extraction / existing XMP / filename stem (7.1-9), /Type /Metadata and /Subtype /XML on the stream (7.1-8), and the correct pdfuaid: namespace prefix on all three identification properties (5-3, 5-4, 5-5).

    Rules addressed: 5-1 · 5-2 · 5-3 · 5-4 · 5-5 · 6.2-1 · 7.1-4 · 7.1-8 · 7.1-9 · 7.1-10 · 7.2-29

    Failure modes handled

    Missing language
    Extraction returns no language or returns "MISSING" — defaults to "en". Both "en" and "en-US" pass rule 7.2-29.
    Control characters in title (B17)
    XML 1.0 forbids U+0000–U+0008, U+000B, U+000C, U+000E–U+001F. _xml_escape() strips them before escaping &, <, >, ". Without this the XMP stream is malformed XML.
    Null-byte title
    Detected and replaced with the filename stem. Never writes an empty dc:title.
    ViewerPreferences not a Dictionary
    Replaced entirely with {/DisplayDocTitle: true}. Existing valid preferences are preserved when type is correct.
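
The title fallback chain and the control-character stripping in this step can be sketched with the standard library. The function names below are hypothetical stand-ins (the source names only `_xml_escape()`), and the packet is a reduced example carrying just `dc:title` and `pdfuaid:part`, not the complete stream the engine writes.

```python
import re
from pathlib import PurePath
from typing import Optional
from xml.sax.saxutils import escape

# Characters XML 1.0 forbids: U+0000-U+0008, U+000B, U+000C, U+000E-U+001F.
_XML_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xml_escape(value: str) -> str:
    """Strip forbidden control characters, then escape &, <, > and the double quote."""
    return escape(_XML_FORBIDDEN.sub("", value), {'"': "&quot;"})


def choose_title(extracted: Optional[str], existing_xmp: Optional[str], filename: str) -> str:
    """Extraction title, else existing XMP title, else the filename stem."""
    for candidate in (extracted, existing_xmp):
        cleaned = _XML_FORBIDDEN.sub("", candidate or "").strip()
        if cleaned:
            return cleaned
    return PurePath(filename).stem   # never returns an empty title for a named file


def build_xmp_packet(title: str) -> str:
    """Minimal XMP packet declaring PDF/UA part 1 and a dc:title."""
    return (
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
        ' <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
        '  <rdf:Description rdf:about=""\n'
        '    xmlns:dc="http://purl.org/dc/elements/1.1/"\n'
        '    xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">\n'
        '   <dc:title><rdf:Alt>'
        f'<rdf:li xml:lang="x-default">{xml_escape(title)}</rdf:li>'
        '</rdf:Alt></dc:title>\n'
        '   <pdfuaid:part>1</pdfuaid:part>\n'
        '  </rdf:Description>\n'
        ' </rdf:RDF>\n'
        '</x:xmpmeta>\n'
        '<?xpacket end="w"?>'
    )
```

A title consisting only of a null byte cleans down to an empty string, so `choose_title` falls through to the filename stem, matching the null-byte case listed above.
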
  5. Annotations

    roles: 0 → 18

    Assign required accessibility roles to every interactive element — links, form fields, alt-text slots — so each is reachable.

    Per-annotation handling by subtype: /Link → Link struct element with /Contents (rule 7.18.5-1); /Widget → Form struct element with /TU (7.18.4-1); /TrapNet → deleted entirely from /Annots (7.18.2-1); /PrinterMark → no struct element (7.18.8-1); all others → Annot struct element with /Contents (7.18.1-1). Every page with annotations receives /Tabs=/S to enforce structure-order tab navigation (7.18.3-1). Media clip data dictionaries receive /CT and /Alt (7.18.6.2).

    Rules addressed: 7.18.1-1 · 7.18.1-2 · 7.18.1-3 · 7.18.2-1 · 7.18.3-1 · 7.18.4-1 · 7.18.4-2 · 7.18.5-1 · 7.18.5-2 · 7.18.6.2-1 · 7.18.6.2-2 · 7.18.8-1

    Failure modes handled

    TrapNet left in /Annots (B2)
    Previous code skipped struct element creation but left the annotation object. veraPDF evaluates every PDTrapNetAnnot. The rebuilt array now excludes them entirely.
    Orphaned StructParent (B3)
    Each annotation gets a /StructParent value starting at len(pages); the function returns {StructParent: struct_element} so the caller appends entries to the ParentTree /Nums array.
    PrinterMark struct elements (B15)
    Explicitly skipped. veraPDF checks structParentType == null for PrinterMark annotations.
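
The per-subtype handling in this step can be modelled on plain dictionaries. The sketch below is an assumption-laden simplification: annotations are dicts with a `"Subtype"` key, `plan_annotations` is a hypothetical name, and `first_key` stands in for the `len(pages)` starting value mentioned in B3. It shows which annotations are dropped, which are kept without a structure element, and which receive a `/StructParent` key with a matching ParentTree entry.

```python
from typing import Dict, List, Tuple

# Subtypes with a dedicated structure role; everything else becomes Annot.
ROLE_BY_SUBTYPE = {"/Link": "Link", "/Widget": "Form"}


def plan_annotations(
    annots: List[dict], first_key: int
) -> Tuple[List[dict], Dict[int, dict]]:
    """Return (rebuilt /Annots list, ParentTree entries keyed by StructParent)."""
    kept: List[dict] = []
    parent_tree: Dict[int, dict] = {}
    next_key = first_key

    for original in annots:
        annot = dict(original)            # do not mutate the caller's data
        subtype = annot.get("Subtype")

        if subtype == "/TrapNet":
            continue                      # excluded from the rebuilt array entirely

        if subtype != "/PrinterMark":     # PrinterMark stays but gets no struct element
            annot["StructParent"] = next_key
            parent_tree[next_key] = {"S": ROLE_BY_SUBTYPE.get(subtype, "Annot")}
            next_key += 1

        kept.append(annot)

    return kept, parent_tree
```

Returning the ParentTree entries alongside the rebuilt list keeps every assigned `/StructParent` paired with a reverse mapping, which is the property B3 restores.
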
  6. Validation

    violations: 251 → 2

    Validate against PDF/UA-1 and WCAG 2.2 with veraPDF and emit the violation delta.

    The remediated PDF is validated end-to-end with veraPDF (1.30.0 / 1.30.1) against two profiles: PDF_UA/PDFUA-1.xml (ISO 14289-1:2014) and WCAG-2-2-Complete.xml (the PDF Association's coverage of the WCAG 2.2 criteria that apply to PDF). Per-file rule-level pass/fail diffs ship as part of the deliverable. The site never claims a result veraPDF does not confirm.

    Rules addressed: all 106 PDF/UA-1 rules · all 86 WCAG 2.2 criteria covered by the WCAG-2-2-Complete profile

    Failure modes handled

    Validation regression
    If any deterministic rule fails on the remediated output, the pipeline marks the document as failed and surfaces the rule-level diff. Failed is a terminal status; no silent retry.
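
The rule-level diff and the terminal failed status can be sketched as a comparison of two per-rule failure counts. The input shape is an assumption: each dict maps a rule ID to the number of failed checks, and parsing a veraPDF report into that shape is out of scope here. `RuleDelta` and `rule_delta` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RuleDelta:
    fixed: List[str]            # failing before, clean after
    remaining: Dict[str, int]   # still failing after remediation
    introduced: List[str]       # clean before, failing after (regressions)
    status: str                 # "passed" or "failed"


def rule_delta(before: Dict[str, int], after: Dict[str, int]) -> RuleDelta:
    """Compare per-rule failed-check counts from two validation runs."""
    remaining = {rule: n for rule, n in sorted(after.items()) if n > 0}
    fixed = sorted(
        rule for rule, n in before.items() if n > 0 and after.get(rule, 0) == 0
    )
    introduced = sorted(rule for rule in remaining if before.get(rule, 0) == 0)
    # Any remaining failure marks the document as failed; there is no retry here.
    status = "failed" if remaining else "passed"
    return RuleDelta(fixed, remaining, introduced, status)
```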

Verified fixes.

17 defects · ISO-rule cited · veraPDF-discovered

Each was discovered through veraPDF validation of real court documents — not theoretical, not from an internal scorecard. Tagged in source with the ISO rule and the specific code change.

  1. B1
    Rule 7.16-1

    Encryption handling

    pikepdf decrypts on open; saving without encryption= produces unencrypted output that passes isEncrypted == false.

  2. B2
    Rule 7.18.2-1

    TrapNet annotations

    TrapNet left in /Annots after skipping struct element creation. veraPDF evaluates every PDTrapNetAnnot regardless. Rebuilt array excludes them.

  3. B3
    Rule 7.18.x

    Orphaned StructParent

    Annotation StructParent values assigned but no corresponding ParentTree entries created — orphaned reverse mappings.

  4. B4
    ISO 32000-1 §14.7.4.4

    Cross-page MCID contamination

    MCIDs were globally numbered instead of per-page. Caused cross-page ParentTree contamination where page N's array contained entries for pages 0..N-1.

  5. B5
    Rule 7.4.2-1

    Heading-level skip

    Heading levels could skip (e.g., H1 directly to H3). Now clamped to previous_level + 1 on descent.

  6. B6
    Rule 7.20-2

    Form XObject orphan MCIDs

    Form XObject streams retained old BMC/BDC/EMC markers after page stripping, leaving orphaned MCIDs.

  7. B7
    Rule 7.1-3

    Excess BT blocks untagged

    More BT blocks than MCIDs left text blocks untagged. Excess BT blocks now wrapped as Artifact.

  8. B8
    Rule 7.2-20

    Phantom Lbl MCID

    Lbl (list label) nodes assigned a phantom MCID. The extractor includes bullet text inline in <li>, so the phantom MCID stole a real BT block, desynchronizing all subsequent MCIDs.

  9. B9
    Rule 7.1-2

    Artifact-wrap nested tags

    _wrap_all_as_artifact didn't strip existing markers before wrapping, creating nested tagged content inside Artifact.

  10. B10
    Rule 7.2-14

    Single-row table empty THead

    Single-row tables created an empty THead + TBody pair. THead without TBody violates 7.2-14. Single-row tables now get TBody only.

  11. B11
    Rule 7.2-10

    Mixed td/th rows

    Mixed <td>/<th> rows: code searched for <td> first, fell back to <th> only if zero <td> found, silently dropping <th> cells in mixed rows.

  12. B12
    Rule 7.1-3

    Skipped-container text loss

    Direct text on skipped containers (annotation, acroform, ocr) was lost from the accessibility tree. Now wrapped in a P node with MCID.

  13. B13
    ISO 32000-1 §14.7.4.4

    Empty leading page collapse

    Empty leading pages collapsed into page index 0 because page index only incremented when page_mcids was non-empty.

  14. B14
    Rule 7.2-20

    Double-tagging in list items

    _full_text() in list items captured nested children's text. When children were recursed into, they created additional MCIDs for the same text — double-tagging.

  15. B15
    Rule 7.18.8-1

    PrinterMark struct elements

    PrinterMark annotations incorrectly received struct elements. veraPDF checks structParentType == null for PrinterMark.

  16. B16
    Rule 7.20-1

    /Ref deletion (acknowledged trade-off)

    /Ref deletion on Reference XObjects may lose visual data if the XObject depends on it for rendering. Accepted trade-off for the DC Courts corpus.

  17. B17
    Rule 5-1

    Control characters in title

    Control characters in the document title (U+0000–U+0008, U+000B, U+000C, U+000E–U+001F) produced malformed XMP XML. They are now stripped before XML escaping.
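
As one worked example from the list, the heading clamp in B5 (limit each heading to `previous_level + 1` on descent) can be written in a few lines. This is a sketch, not the engine's code: `clamp_heading_levels` is a hypothetical name, and starting the counter at 0 assumes the first heading is forced to H1, which the source does not state.

```python
from typing import Iterable, List


def clamp_heading_levels(levels: Iterable[int]) -> List[int]:
    """Clamp each heading level so a descent never skips a level."""
    clamped: List[int] = []
    previous = 0  # assumption: the first heading may be at most H1
    for level in levels:
        # Going deeper is limited to one level at a time; going back up is unrestricted.
        level = min(level, previous + 1)
        clamped.append(level)
        previous = level
    return clamped
```

Given the sequence H1, H3, H2, H4, the sketch produces H1, H2, H2, H3. Each level is compared with the previous clamped level rather than the previous original one, so the output never contains a skip.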

The pipeline is deterministic. The validation is third-party. The receipts are per file.

Every customer engagement runs on this engine. Same source PDF in, same remediated PDF out, with the veraPDF report attached. The audit trail is the deliverable.

Source: TrueSection remediation pipeline. ISO 14289-1:2014. veraPDF 1.30.0 / 1.30.1 against the PDF_UA/PDFUA-1.xml and WCAG-2-2-Complete.xml profiles.