Corpus/Quality
Content Quality
Two quality axes
The system tracks two independent quality dimensions:
1. Content quality (tags.content_quality on corpus documents)
Quality of the text in body. Levels, from lowest to highest:
| Level | Description | Searchable? | Reliable for display? |
|---|---|---|---|
| ocr_raw | OCR extraction, errors likely | Degraded FTS | No -- warn user |
| ocr_cleaned | OCR cleaned by regex or LLM | Better FTS | Partially |
| native_raw | Native text (PDF text layer, .doc), structure lost | Good FTS | Yes, but no structure |
| native_structured | Structured source (HTML Legifrance, XML Akoma Ntoso) | Optimal FTS | Yes |
| machine_reviewed | Reviewed/corrected by LLM | Optimal FTS | Yes |
| human_reviewed | Reviewed by a human | Optimal FTS | Yes |
| jurist_reviewed | Validated by a jurist | Optimal FTS | Yes (gold standard) |
Only the current level is stored. Progression is one-way (except disputes).
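The ordering and the one-way rule can be sketched as an ordered enum. This is a minimal illustration, not the system's implementation: the names ContentQuality and advance, and the disputed flag used to model the dispute exception, are assumptions made for the example.

```python
from enum import IntEnum


class ContentQuality(IntEnum):
    """Content quality levels, ordered as in the table above (lowest first)."""
    OCR_RAW = 0
    OCR_CLEANED = 1
    NATIVE_RAW = 2
    NATIVE_STRUCTURED = 3
    MACHINE_REVIEWED = 4
    HUMAN_REVIEWED = 5
    JURIST_REVIEWED = 6


def advance(current: ContentQuality, proposed: ContentQuality,
            disputed: bool = False) -> ContentQuality:
    """Return the level to store (only the current level is kept).

    Downgrades are refused unless a dispute is flagged. How disputes are
    actually recorded is not specified in this page; the flag is hypothetical.
    """
    if proposed < current and not disputed:
        raise ValueError(
            f"cannot downgrade {current.name} to {proposed.name} without a dispute"
        )
    return proposed


# Example: an OCR document is cleaned, then later reviewed by an LLM.
level = ContentQuality.OCR_RAW
level = advance(level, ContentQuality.OCR_CLEANED)
level = advance(level, ContentQuality.MACHINE_REVIEWED)
```

Using an integer-backed enum keeps comparisons such as "lower than native_structured" a plain `<` check.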
2. Annotation confidence (on graph.annotations -- future, documented here for reference)
Quality of knowledge graph annotations. Separate from content quality.
| Method | Confidence | Meaning |
|---|---|---|
| stub | stub | Placeholder, not yet created |
| llm | memory_only | LLM-generated, not verified against source |
| llm | source_checked | Verification pass confirmed citation exists |
| llm | cross_validated | Cross-validation confirmed consistency |
| human | (any) | Non-expert human reviewed |
| jurist | (any) | Legal professional validated |
Progression: stub -> memory_only -> source_checked -> cross_validated. Can be downgraded to disputed.
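A transition check for this ladder might look like the sketch below. The function name can_transition is hypothetical. The sketch also assumes that any forward move is allowed (not only one step at a time) and that leaving disputed is out of scope, since this page does not say how a dispute is resolved.

```python
# Confidence ladder for LLM-produced annotations, lowest first.
LADDER = ("stub", "memory_only", "source_checked", "cross_validated")


def can_transition(old: str, new: str) -> bool:
    """Return True if an annotation may move from confidence `old` to `new`."""
    if new == "disputed":
        # Any non-disputed annotation can be downgraded to disputed.
        return old != "disputed"
    if old == "disputed" or old not in LADDER or new not in LADDER:
        # Resolving a dispute and unknown values are not modelled here.
        return False
    # Otherwise only forward moves along the ladder are accepted.
    return LADDER.index(new) > LADDER.index(old)
```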
How they interact
A corpus document has content_quality. An annotation on that document has its own confidence. The weakest link determines the trust of any reasoning path:
- An annotation with confidence=cross_validated on a document with content_quality=ocr_raw is still unreliable (the source text may have OCR errors that corrupted the annotation).
- A high-quality document (native_structured) with a stub annotation has no knowledge graph coverage yet.
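The weakest-link rule can be illustrated by taking the minimum of each axis along a path. In this sketch a path is a list of (content_quality, confidence) pairs, one per annotation used. That representation and the function name weakest_link are assumptions. The two axes are kept separate rather than merged into a single score, and disputed is deliberately not ranked.

```python
CONTENT_ORDER = [
    "ocr_raw", "ocr_cleaned", "native_raw", "native_structured",
    "machine_reviewed", "human_reviewed", "jurist_reviewed",
]
CONFIDENCE_ORDER = ["stub", "memory_only", "source_checked", "cross_validated"]


def weakest_link(path):
    """Return (lowest content_quality, lowest confidence) found along a path.

    `path` is a non-empty list of (content_quality, confidence) tuples.
    Values outside the two lists above raise ValueError.
    """
    worst_content = min((cq for cq, _ in path), key=CONTENT_ORDER.index)
    worst_confidence = min((conf for _, conf in path), key=CONFIDENCE_ORDER.index)
    return worst_content, worst_confidence


# First bullet above: a cross-validated annotation on raw OCR text
# still yields a path whose content quality is ocr_raw.
path = [
    ("native_structured", "cross_validated"),
    ("ocr_raw", "cross_validated"),
]
trust = weakest_link(path)
```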
Impact on the system
- MCP tools: display warnings for content_quality < native_structured
- Knowledge graph: propagate minimum confidence in reasoning paths
- Quality pipeline: prioritize ocr_raw documents for LLM cleaning, then for annotation
- body_search: documents with content_quality >= machine_reviewed should have body_search populated with cleaned text
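The thresholds in this list can be written as small predicates over the level ordering. This is a sketch under stated assumptions: the function names and the document shape (a dict with id and content_quality keys) are hypothetical, and the queue orders all documents lowest quality first so that ocr_raw documents come out at the front.

```python
ORDER = [
    "ocr_raw", "ocr_cleaned", "native_raw", "native_structured",
    "machine_reviewed", "human_reviewed", "jurist_reviewed",
]
RANK = {name: i for i, name in enumerate(ORDER)}


def needs_warning(level: str) -> bool:
    """MCP tools: warn when content_quality < native_structured."""
    return RANK[level] < RANK["native_structured"]


def should_have_body_search(level: str) -> bool:
    """body_search is expected when content_quality >= machine_reviewed."""
    return RANK[level] >= RANK["machine_reviewed"]


def cleaning_queue(docs):
    """Quality pipeline: lowest-quality documents first (ocr_raw at the front)."""
    return sorted(docs, key=lambda d: RANK[d["content_quality"]])


docs = [
    {"id": "a", "content_quality": "native_structured"},
    {"id": "b", "content_quality": "ocr_raw"},
    {"id": "c", "content_quality": "ocr_cleaned"},
]
queue_ids = [d["id"] for d in cleaning_queue(docs)]
```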