Corpus/Quality

From Dura Lex Wiki
Revision as of 02:10, 23 April 2026 by Nicolas (talk | contribs) (Import from duralex/spec/QUALITY.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server))

Content Quality

Two quality axes

The system tracks two independent quality dimensions:

1. Content quality (tags.content_quality on corpus documents)

Quality of the text stored in the document's body. Levels, in order of progression:

Level             | Description                                           | Searchable?   | Reliable for display?
ocr_raw           | OCR extraction, errors likely                         | Degraded FTS  | No -- warn user
ocr_cleaned       | OCR cleaned by regex or LLM                           | Better FTS    | Partially
native_raw        | Native text (PDF text layer, .doc), structure lost    | Good FTS      | Yes, but no structure
native_structured | Structured source (HTML Legifrance, XML Akoma Ntoso)  | Optimal FTS   | Yes
machine_reviewed  | Reviewed/corrected by LLM                             | Optimal FTS   | Yes
human_reviewed    | Reviewed by a human                                   | Optimal FTS   | Yes
jurist_reviewed   | Validated by a jurist                                 | Optimal FTS   | Yes (gold standard)

Only the current level is stored. Progression is one-way (except disputes).
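The one-way progression rule can be sketched as a simple ordering check. This is an illustrative sketch, not the actual implementation: the function name and the treatment of the dispute exception are assumptions; only the seven level names come from the table above.

```python
# Content quality levels, in progression order (from the table above).
CONTENT_QUALITY_LEVELS = [
    "ocr_raw",
    "ocr_cleaned",
    "native_raw",
    "native_structured",
    "machine_reviewed",
    "human_reviewed",
    "jurist_reviewed",
]

def can_transition(current: str, proposed: str) -> bool:
    """Allow only forward moves along the progression.

    The dispute path is the single documented exception; how it is
    modeled here ("disputed" as a special target) is hypothetical.
    """
    if proposed == "disputed":
        return True
    return (CONTENT_QUALITY_LEVELS.index(proposed)
            > CONTENT_QUALITY_LEVELS.index(current))

print(can_transition("ocr_raw", "ocr_cleaned"))        # True
print(can_transition("human_reviewed", "native_raw"))  # False
```

Because only the current level is stored, a validator like this would run at write time; there is no history to reconcile.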

2. Annotation confidence (on graph.annotations -- future, documented here for reference)

Quality of knowledge graph annotations. Separate from content quality.

Method | Confidence      | Meaning
stub   | stub            | Placeholder, not yet created
llm    | memory_only     | LLM-generated, not verified against source
llm    | source_checked  | Verification pass confirmed citation exists
llm    | cross_validated | Cross-validation confirmed consistency
human  | (any)           | Non-expert human reviewed
jurist | (any)           | Legal professional validated

Progression: stub -> memory_only -> source_checked -> cross_validated. Can be downgraded to disputed.

How they interact

A corpus document has content_quality. An annotation on that document has its own confidence. The weakest link determines the trust of any reasoning path:

  • An annotation with confidence=cross_validated on a document with content_quality=ocr_raw is still unreliable (the source text may have OCR errors that corrupted the annotation).
  • A high-quality document (native_structured) with a stub annotation has no knowledge graph coverage yet.
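The weakest-link rule can be sketched by projecting both axes onto a shared score and taking the minimum over every link in a path. The numeric scores below are purely illustrative assumptions; only the level names come from this page.

```python
# Hypothetical trust scores (0-1) for each level of the two axes.
CONTENT_TRUST = {
    "ocr_raw": 0.2, "ocr_cleaned": 0.4, "native_raw": 0.6,
    "native_structured": 0.8, "machine_reviewed": 0.8,
    "human_reviewed": 0.9, "jurist_reviewed": 1.0,
}
ANNOTATION_TRUST = {
    "stub": 0.0, "memory_only": 0.3,
    "source_checked": 0.7, "cross_validated": 0.9,
}

def path_trust(steps) -> float:
    """steps: iterable of (content_quality, confidence) pairs.

    The trust of a reasoning path is bounded by its weakest link:
    the minimum over both axes, over every step.
    """
    return min(
        min(CONTENT_TRUST[cq], ANNOTATION_TRUST[conf])
        for cq, conf in steps
    )

# A cross_validated annotation on an ocr_raw document stays weak:
print(path_trust([("ocr_raw", "cross_validated")]))  # 0.2
```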

Impact on the system

  • MCP tools: display warnings for content_quality < native_structured
  • Knowledge graph: propagate minimum confidence in reasoning paths
  • Quality pipeline: prioritize ocr_raw documents for LLM cleaning, then for annotation
  • body_search: documents with content_quality >= machine_reviewed should have body_search populated with cleaned text