Nicolas: Import from duralex/spec/QUALITY.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server)

2026-04-23T02:10:16Z


= Content Quality =

== Two quality axes ==

The system tracks two independent quality dimensions:

=== 1. Content quality (tags.content_quality on corpus documents) ===

Quality of the text in <code>body</code>. Progression:

{| class="wikitable"
! Level !! Description !! Searchable? !! Reliable for display?
|-
| ocr_raw || OCR extraction, errors likely || Degraded FTS || No -- warn user
|-
| ocr_cleaned || OCR cleaned by regex or LLM || Better FTS || Partially
|-
| native_raw || Native text (PDF text layer, .doc), structure lost || Good FTS || Yes, but no structure
|-
| native_structured || Structured source (HTML Legifrance, XML Akoma Ntoso) || Optimal FTS || Yes
|-
| machine_reviewed || Reviewed/corrected by LLM || Optimal FTS || Yes
|-
| human_reviewed || Reviewed by a human || Optimal FTS || Yes
|-
| jurist_reviewed || Validated by a jurist || Optimal FTS || Yes (gold standard)
|}

Only the current level is stored. Progression is one-way: a document moves up the table and is not downgraded, except when a dispute is raised.
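The ordering in the table can be sketched as an ordered enumeration. This is a minimal illustration, not the system's implementation: the class <code>ContentQuality</code> and the helpers <code>parse_level</code> and <code>promote</code> are hypothetical names, the numeric ranks simply follow the table's row order, and the dispute exception is left out because this page does not specify how it works.

```python
from enum import IntEnum


class ContentQuality(IntEnum):
    """Content quality levels, lowest first, in the table's row order."""
    OCR_RAW = 0
    OCR_CLEANED = 1
    NATIVE_RAW = 2
    NATIVE_STRUCTURED = 3
    MACHINE_REVIEWED = 4
    HUMAN_REVIEWED = 5
    JURIST_REVIEWED = 6


def parse_level(tag: str) -> ContentQuality:
    """Map a tag value such as 'ocr_raw' to its enum member."""
    return ContentQuality[tag.upper()]


def promote(current: ContentQuality, new: ContentQuality) -> ContentQuality:
    """Return the new level, refusing downgrades (one-way progression).

    Assumption: disputes are handled elsewhere; they are not modelled here.
    """
    if new < current:
        raise ValueError(
            f"cannot downgrade {current.name.lower()} to {new.name.lower()}"
        )
    return new
```

Because only the current level is stored, <code>promote</code> returns a single value that would overwrite the previous one.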

=== 2. Annotation confidence (on graph.annotations -- future, documented here for reference) ===

Quality of knowledge graph annotations. Separate from content quality.

{| class="wikitable"
! Method !! Confidence !! Meaning
|-
| stub || stub || Placeholder, not yet created
|-
| llm || memory_only || LLM-generated, not verified against source
|-
| llm || source_checked || Verification pass confirmed citation exists
|-
| llm || cross_validated || Cross-validation confirmed consistency
|-
| human || (any) || Non-expert human reviewed
|-
| jurist || (any) || Legal professional validated
|}

Progression: stub -> memory_only -> source_checked -> cross_validated. An annotation can be downgraded to <code>disputed</code>.
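The progression rule can be sketched as a small transition check. The function name <code>can_transition</code> is hypothetical. Two points are assumptions that this page does not settle: forward moves may skip a level, and an annotation cannot leave <code>disputed</code> through this function.

```python
# Confidence levels in the order given by the progression above.
PROGRESSION = ("stub", "memory_only", "source_checked", "cross_validated")
DISPUTED = "disputed"


def can_transition(current: str, target: str) -> bool:
    """Return True if an annotation may move from `current` to `target`."""
    if target == DISPUTED:
        # Any level in the progression can be downgraded to disputed.
        return current in PROGRESSION
    if current not in PROGRESSION or target not in PROGRESSION:
        # Assumption: leaving `disputed` is not modelled here,
        # and unknown values are rejected.
        return False
    # Otherwise only forward moves along the progression are accepted.
    return PROGRESSION.index(target) > PROGRESSION.index(current)
```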

=== How they interact ===

A corpus document has content_quality. An annotation on that document has its own confidence. The weakest link determines the trust of any reasoning path:
* An annotation with confidence=cross_validated on a document with content_quality=ocr_raw is still unreliable (the source text may have OCR errors that corrupted the annotation).
* A high-quality document (native_structured) with a stub annotation has no knowledge graph coverage yet.
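The weakest-link rule can be illustrated by taking the minimum on each axis along a reasoning path. This is a sketch under stated assumptions: the function <code>path_trust</code> and the representation of a path as (content_quality, confidence) pairs are hypothetical, and the two axes are kept separate because this page defines no mapping between them. The <code>disputed</code> state is not ranked here, since its position relative to the other levels is not specified.

```python
# Levels on each axis, lowest first, in the order used on this page.
CONTENT_LEVELS = [
    "ocr_raw", "ocr_cleaned", "native_raw", "native_structured",
    "machine_reviewed", "human_reviewed", "jurist_reviewed",
]
CONFIDENCE_LEVELS = ["stub", "memory_only", "source_checked", "cross_validated"]


def path_trust(steps):
    """Return the weakest (content_quality, confidence) seen on a path.

    `steps` is a sequence of (content_quality, confidence) pairs, one per
    annotation used in the reasoning path. Unknown level names raise
    ValueError.
    """
    steps = list(steps)
    if not steps:
        raise ValueError("a reasoning path needs at least one step")
    weakest_content = min((c for c, _ in steps), key=CONTENT_LEVELS.index)
    weakest_confidence = min((a for _, a in steps), key=CONFIDENCE_LEVELS.index)
    return weakest_content, weakest_confidence
```

For the first example above, a path containing <code>("ocr_raw", "cross_validated")</code> yields <code>ocr_raw</code> as its content floor, so the path stays unreliable regardless of the annotation's confidence.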

== Impact on the system ==

* '''MCP tools''': display warnings for content_quality < native_structured
* '''Knowledge graph''': propagate minimum confidence in reasoning paths
* '''Quality pipeline''': prioritize ocr_raw documents for LLM cleaning, then for annotation
* '''body_search''': documents with content_quality >= machine_reviewed should have body_search populated with cleaned text
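The thresholds in the list above can be expressed as simple rank comparisons. This is a hedged sketch: the function names and the document shape (a dict with a <code>tags</code> mapping) are hypothetical. Sorting the whole queue by ascending quality is an assumption that generalizes "prioritize ocr_raw"; the page itself only names <code>ocr_raw</code>.

```python
# Content quality levels, lowest first, in the order of the table on this page.
LEVELS = (
    "ocr_raw", "ocr_cleaned", "native_raw", "native_structured",
    "machine_reviewed", "human_reviewed", "jurist_reviewed",
)


def rank(level: str) -> int:
    """Position of a level in the progression (0 is the lowest)."""
    return LEVELS.index(level)


def needs_warning(level: str) -> bool:
    """MCP tools: warn when content_quality < native_structured."""
    return rank(level) < rank("native_structured")


def expects_body_search(level: str) -> bool:
    """body_search should be populated when content_quality >= machine_reviewed."""
    return rank(level) >= rank("machine_reviewed")


def cleaning_order(docs):
    """Quality pipeline: lowest-quality documents first, so ocr_raw leads.

    Assumption: each document is a dict carrying tags["content_quality"].
    """
    return sorted(docs, key=lambda d: rank(d["tags"]["content_quality"]))
```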

[[Category:Corpus]]
