Corpus/Quality

From Dura Lex Wiki
Revision as of 02:10, 23 April 2026 by Nicolas (talk | contribs) (Import from duralex/spec/QUALITY.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server))

Content Quality

Two quality axes

The system tracks two independent quality dimensions:

1. Content quality (tags.content_quality on corpus documents)

Quality of the text stored in the document's body. Levels, in order of progression:

Level             | Description                                           | Searchable?   | Reliable for display?
ocr_raw           | OCR extraction, errors likely                         | Degraded FTS  | No -- warn user
ocr_cleaned       | OCR cleaned by regex or LLM                           | Better FTS    | Partially
native_raw        | Native text (PDF text layer, .doc), structure lost    | Good FTS      | Yes, but no structure
native_structured | Structured source (HTML Legifrance, XML Akoma Ntoso)  | Optimal FTS   | Yes
machine_reviewed  | Reviewed/corrected by LLM                             | Optimal FTS   | Yes
human_reviewed    | Reviewed by a human                                   | Optimal FTS   | Yes
jurist_reviewed   | Validated by a jurist                                 | Optimal FTS   | Yes (gold standard)

Only the current level is stored. Progression is one-way (except disputes).
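The one-way progression rule can be sketched as a simple ordering check. This is an illustrative sketch, not the actual implementation: the function name and the treatment of the dispute exception are assumptions; only the seven level names come from the table above.

```python
# Content quality levels, in progression order (from the table above).
CONTENT_QUALITY_LEVELS = [
    "ocr_raw",
    "ocr_cleaned",
    "native_raw",
    "native_structured",
    "machine_reviewed",
    "human_reviewed",
    "jurist_reviewed",
]

def can_transition(current: str, proposed: str) -> bool:
    """Allow only forward moves along the progression.

    The dispute path is the single documented exception; how it is
    modeled here ("disputed" as a special target) is hypothetical.
    """
    if proposed == "disputed":
        return True
    return (CONTENT_QUALITY_LEVELS.index(proposed)
            > CONTENT_QUALITY_LEVELS.index(current))

print(can_transition("ocr_raw", "ocr_cleaned"))        # True
print(can_transition("human_reviewed", "native_raw"))  # False
```

Because only the current level is stored, a validator like this would run at write time; there is no history to reconcile.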

2. Annotation confidence (on graph.annotations -- future, documented here for reference)

Quality of knowledge graph annotations. Separate from content quality.

Method | Confidence      | Meaning
stub   | stub            | Placeholder, not yet created
llm    | memory_only     | LLM-generated, not verified against source
llm    | source_checked  | Verification pass confirmed citation exists
llm    | cross_validated | Cross-validation confirmed consistency
human  | (any)           | Non-expert human reviewed
jurist | (any)           | Legal professional validated

Progression: stub -> memory_only -> source_checked -> cross_validated. Can be downgraded to disputed.

How they interact

A corpus document has content_quality. An annotation on that document has its own confidence. The weakest link determines the trust of any reasoning path:

  • An annotation with confidence=cross_validated on a document with content_quality=ocr_raw is still unreliable (the source text may have OCR errors that corrupted the annotation).
  • A high-quality document (native_structured) with a stub annotation has no knowledge graph coverage yet.
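The weakest-link rule can be sketched by projecting both axes onto a shared score and taking the minimum over every link in a path. The numeric scores below are purely illustrative assumptions; only the level names come from this page.

```python
# Hypothetical trust scores (0-1) for each level of the two axes.
CONTENT_TRUST = {
    "ocr_raw": 0.2, "ocr_cleaned": 0.4, "native_raw": 0.6,
    "native_structured": 0.8, "machine_reviewed": 0.8,
    "human_reviewed": 0.9, "jurist_reviewed": 1.0,
}
ANNOTATION_TRUST = {
    "stub": 0.0, "memory_only": 0.3,
    "source_checked": 0.7, "cross_validated": 0.9,
}

def path_trust(steps) -> float:
    """steps: iterable of (content_quality, confidence) pairs.

    The trust of a reasoning path is bounded by its weakest link:
    the minimum over both axes, over every step.
    """
    return min(
        min(CONTENT_TRUST[cq], ANNOTATION_TRUST[conf])
        for cq, conf in steps
    )

# A cross_validated annotation on an ocr_raw document stays weak:
print(path_trust([("ocr_raw", "cross_validated")]))  # 0.2
```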

Impact on the system

  • MCP tools: display warnings for content_quality < native_structured
  • Knowledge graph: propagate minimum confidence in reasoning paths
  • Quality pipeline: prioritize ocr_raw documents for LLM cleaning, then for annotation
  • body_search: documents with content_quality >= machine_reviewed should have body_search populated with cleaned text