Corpus/FTS

From Dura Lex Wiki
Revision as of 02:12, 23 April 2026 by Nicolas (talk | contribs) (Import from duralex/spec/FTS.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server))

Full-Text Search


Overview


FTS uses PostgreSQL tsvector with per-language text search configurations. The content_fts column is populated by a trigger from the title, two curated tags, and coalesce(body_search, body). CJK languages bypass tsvector and require the pgroonga extension.

The two-text model

  • body: clean displayable content. Always renderable to a user (HTML, links, formatted text). Immutable after ingestion. Never contains noisy OCR or raw PDF binary.
  • body_search: FTS-optimized text used for indexing. Can be noisy (raw PDF extraction, OCR output) because users never see it directly. Nullable. When NULL, FTS uses body directly.

Why two columns:

  1. Display vs index separation. get_document returns body, so it must always be clean. FTS reads coalesce(body_search, body), so body_search can hold a noisy-but-indexable representation (e.g., extracted PDF text) without polluting display.
  2. PDF-sourced documents. When a document is only available as PDF, body is a clean stub like <p>Source: <a href="...">PDF officiel</a></p> (the LLM follows the link or uses search snippets). body_search holds the pdfminer-extracted text so the document is findable via FTS.
  3. Legal reference normalization (e.g., "L442-1" to "L442.1") must be consistent between index and query. body_search holds the normalized text; the search plugin applies the same normalization to the query.
  4. body is the source of truth for display and annotation anchoring. It never changes.

See ADR-010 for the full rationale behind these semantics.
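As an illustration of the two cases, the statements below sketch an HTML-sourced row and a PDF-sourced row. The column list is an assumption (the real corpus.documents table may have other required columns), and the titles, URL, and texts are made-up placeholders.

<syntaxhighlight lang="sql">
-- HTML-sourced: body is clean and indexable, so body_search stays NULL.
INSERT INTO corpus.documents (kind, jurisdiction, language, title, body)
VALUES ('legislation', 'fr', 'fr',
        'Exemple de texte HTML',
        '<p>Texte affichable et indexable.</p>');

-- PDF-sourced: body is a clean stub, body_search carries the extracted text.
INSERT INTO corpus.documents (kind, jurisdiction, language, title, body, body_search)
VALUES ('notice', 'fr', 'fr',
        'Exemple de document PDF',
        '<p>Source: <a href="https://example.org/doc.pdf">PDF officiel</a></p>',
        'texte brut extrait du PDF, possiblement bruité');

-- The text the trigger indexes under weight C for each row.
SELECT title, coalesce(body_search, body) AS indexed_text
FROM corpus.documents
WHERE title LIKE 'Exemple%';
</syntaxhighlight>

For the first row indexed_text is the HTML body; for the second it is the extracted text, while get_document would still return the clean stub.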

Trigger


<syntaxhighlight lang="sql">
CREATE FUNCTION corpus.update_content_fts() RETURNS trigger AS $$
DECLARE
    lang text;
    cfg  regconfig;
BEGIN
    -- Skip recomputation if searchable content is unchanged (tag-only upserts)
    IF TG_OP = 'UPDATE'
       AND NEW.title IS NOT DISTINCT FROM OLD.title
       AND NEW.body IS NOT DISTINCT FROM OLD.body
       AND NEW.body_search IS NOT DISTINCT FROM OLD.body_search
       AND NEW.language IS NOT DISTINCT FROM OLD.language
       AND NEW.jurisdiction IS NOT DISTINCT FROM OLD.jurisdiction
       AND (NEW.tags->>'summary') IS NOT DISTINCT FROM (OLD.tags->>'summary')
       AND (NEW.tags->>'headnote_classification') IS NOT DISTINCT FROM (OLD.tags->>'headnote_classification') THEN
        NEW.content_fts := OLD.content_fts;
        RETURN NEW;
    END IF;

    -- Use the explicit language; otherwise fall back to the country prefix
    -- of the jurisdiction (gb-sct → gb, es-ct → es)
    lang := coalesce(NEW.language, split_part(NEW.jurisdiction, '-', 1));

    -- CJK: tsvector non-functional, use pgroonga instead
    IF lang IN ('zh', 'ja', 'ko') THEN
        NEW.content_fts := NULL;
        RETURN NEW;
    END IF;

    cfg := CASE lang
        WHEN 'fr' THEN 'french'::regconfig
        WHEN 'de' THEN 'german'::regconfig
        WHEN 'en' THEN 'english'::regconfig
        WHEN 'es' THEN 'spanish'::regconfig
        WHEN 'it' THEN 'italian'::regconfig
        WHEN 'pt' THEN 'portuguese'::regconfig
        WHEN 'nl' THEN 'dutch'::regconfig
        WHEN 'sv' THEN 'swedish'::regconfig
        WHEN 'da' THEN 'danish'::regconfig
        WHEN 'fi' THEN 'finnish'::regconfig
        WHEN 'hu' THEN 'hungarian'::regconfig
        WHEN 'ro' THEN 'romanian'::regconfig
        WHEN 'tr' THEN 'turkish'::regconfig
        WHEN 'ru' THEN 'russian'::regconfig
        WHEN 'no' THEN 'norwegian'::regconfig
        -- Arabic: no built-in PG config. Falls through to 'simple'.
        -- Install a custom Arabic text search config before ingesting AR data.
        ELSE 'simple'::regconfig
    END;

    -- Binary body check: when the text is raw PDF/binary content,
    -- index only the title (A) and the curated tags (B), not the body
    IF coalesce(NEW.body_search, NEW.body) IS NOT NULL
       AND left(coalesce(NEW.body_search, NEW.body), 4) = '%PDF' THEN
        NEW.content_fts :=
            setweight(to_tsvector(cfg, unaccent(
                corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, ''))
            )), 'A') ||
            setweight(to_tsvector(cfg, unaccent(
                coalesce(NEW.tags->>'summary', '') || ' ' ||
                coalesce(NEW.tags->>'headnote_classification', '')
            )), 'B');
        RETURN NEW;
    END IF;

    -- Normal case: title (A), curated tags (B), HTML-stripped body text (C)
    NEW.content_fts :=
        setweight(to_tsvector(cfg, unaccent(
            corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, ''))
        )), 'A') ||
        setweight(to_tsvector(cfg, unaccent(
            coalesce(NEW.tags->>'summary', '') || ' ' ||
            coalesce(NEW.tags->>'headnote_classification', '')
        )), 'B') ||
        setweight(to_tsvector(cfg, unaccent(
            corpus.normalize_for_fts(NEW.jurisdiction,
                regexp_replace(coalesce(NEW.body_search, NEW.body, ''), '<[^>]+>', ' ', 'g'))
        )), 'C');
    RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER trg_content_fts
    BEFORE INSERT OR UPDATE OF title, body, body_search, tags, language, jurisdiction
    ON corpus.documents FOR EACH ROW
    EXECUTE FUNCTION corpus.update_content_fts();
</syntaxhighlight>
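The short-circuit at the top of the function can be exercised with two updates, sketched here under assumptions: the tag key reviewed is hypothetical, and the title filter is only an illustrative way to pick a row.

<syntaxhighlight lang="sql">
-- Tag-only change: none of the compared fields differ, so the trigger
-- copies OLD.content_fts and skips tokenization.
UPDATE corpus.documents
SET tags = coalesce(tags, '{}'::jsonb) || '{"reviewed": true}'::jsonb
WHERE title = 'Exemple de texte HTML';

-- Changing tags->>'summary' alters searchable content, so the full
-- tsvector (including weight B) is recomputed.
UPDATE corpus.documents
SET tags = jsonb_set(coalesce(tags, '{}'::jsonb), '{summary}', '"Résumé indicatif"')
WHERE title = 'Exemple de texte HTML';

-- Language fallback applied when language is NULL.
SELECT split_part('gb-sct', '-', 1);  -- gb
</syntaxhighlight>

Note that the fallback yields a country code, not a language code, so a row with jurisdiction gb and no language falls through to the simple configuration.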

Pluggable normalization (normalize_for_fts)


<syntaxhighlight lang="sql">
-- Core: identity (no-op)
CREATE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
    SELECT p_input;
$$;
</syntaxhighlight>

Jurisdiction plugins override this function:

  • FR plugin installs normalize_legal_refs_final() for French legal references (L442-1 to L442.1).
  • Future DE plugin could install normalize_de_refs() for German references (paragraph 823 BGB).

The same normalization function must be applied to search queries (by the search plugin) to ensure consistency between indexed text and query.
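A minimal sketch of what a jurisdiction override could look like, assuming the plugin replaces the core function in place. The regular expression is a hypothetical simplification and not the actual rule set of normalize_legal_refs_final(), which this page does not show; it only handles the compact form (L442-1) and ignores variants such as "L. 442-1" or multi-level numbers.

<syntaxhighlight lang="sql">
-- Hypothetical FR override: rewrite "L442-1" style references to "L442.1".
-- Parameter names match the core function so CREATE OR REPLACE is accepted.
CREATE OR REPLACE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE
        WHEN p_jurisdiction = 'fr' OR p_jurisdiction LIKE 'fr-%' THEN
            -- \m anchors at the start of a word; [LRD] is the code part prefix
            regexp_replace(p_input, '\m([LRD])(\d+)-(\d+)', '\1\2.\3', 'g')
        ELSE p_input
    END;
$$;

SELECT corpus.normalize_for_fts('fr', 'article L442-1 du code de commerce');
-- article L442.1 du code de commerce

SELECT corpus.normalize_for_fts('de', 'article L442-1');
-- article L442-1 (other jurisdictions pass through unchanged)
</syntaxhighlight>

Because the trigger and the query path call the same function, replacing it changes both sides at once; rows indexed before the change keep their old tsvector until they are rewritten.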

Phased approach to normalization


Phase 1 (now): For HTML-sourced documents, body_search=NULL. normalize_for_fts() in the trigger handles normalization on body.

Phase 2 (PDF ingest, ADR-010): For PDF-sourced documents, body is a clean HTML stub with the source URL, and body_search holds the pdfminer-extracted text. normalize_for_fts() still runs. The trigger uses coalesce(body_search, body), preferring the indexable representation when present.

Phase 3 (LLM pipeline): An LLM writes a normalized, cleaned version to body_search for HTML-sourced documents whose body has irregular references. normalize_for_fts() is then reset to identity.

Query-side normalization (regex) remains necessary in all phases.

Weight system


Three FTS weights:

  • Weight A (highest): title.
  • Weight B: tags.summary + tags.headnote_classification, read directly from JSONB by the trigger. Provides FTS on curated legal metadata (headnotes, summaries) without duplicating body content. Not all documents have these tags — weight B is empty when absent.
  • Weight C: coalesce(body_search, body) with HTML tags stripped via regexp_replace. body_search holds the indexable representation when body is a clean stub (e.g., PDF-sourced documents with body=<a>PDF link</a> and body_search=<extracted text>). body_search is NULL for HTML-sourced documents, so the trigger falls back to body. See ADR-010.

Weight D is unused.
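To show how the weights can influence ordering, here is a hedged ranking sketch. This page does not say which ranking function the search engine uses; ts_rank with an explicit weight array is one standard PostgreSQL option, and the array values below are simply PostgreSQL's documented defaults, listed in the order {D, C, B, A}. The search terms are placeholders.

<syntaxhighlight lang="sql">
-- Title hits (A) count 1.0, tag hits (B) 0.4, body hits (C) 0.2.
-- The D slot (0.1) has no effect because weight D is unused.
SELECT title,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', content_fts, q) AS rank
FROM corpus.documents,
     websearch_to_tsquery('french', unaccent('responsabilité contractuelle')) AS q
WHERE kind = 'legislation'
  AND content_fts @@ q
ORDER BY rank DESC
LIMIT 10;
</syntaxhighlight>

Selecting content_fts for a single row shows the weight letters attached to each lexeme position, which is a quick way to check what landed in A, B, and C.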

CJK languages


PostgreSQL's built-in text search configurations do not support Chinese, Japanese, or Korean. The default parser does not segment these scripts into words, so the simple config produces tokens that are functionally useless for search.

Hard requirement: install the pgroonga or zhparser extension before ingesting CJK data.

The trigger sets content_fts to NULL for CJK. A separate pgroonga index on body handles CJK search:

<syntaxhighlight lang="sql">
-- When pgroonga is available:
CREATE INDEX idx_doc_pgroonga ON corpus.documents
    USING pgroonga (body) WHERE language IN ('zh', 'ja', 'ko');
</syntaxhighlight>

The search engine routes CJK queries to pgroonga and non-CJK queries to tsvector.
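A sketch of what the CJK branch of that routing might issue, assuming pgroonga is installed and the partial index on body exists. The search terms and the LIMIT are placeholders; &@~ is pgroonga's query-syntax full-text operator.

<syntaxhighlight lang="sql">
CREATE EXTENSION IF NOT EXISTS pgroonga;

-- Repeating the index predicate (language IN ...) lets the planner
-- consider the partial pgroonga index.
SELECT title
FROM corpus.documents
WHERE language IN ('zh', 'ja', 'ko')
  AND body &@~ '契約 解除'
LIMIT 10;
</syntaxhighlight>

Since the index covers body rather than coalesce(body_search, body), a CJK document whose body is only a PDF stub would not be findable through this query.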

Partial GIN indexes


FTS GIN indexes are split per kind for performance isolation:

<syntaxhighlight lang="sql">
CREATE INDEX idx_doc_fts_legislation ON corpus.documents
    USING GIN (content_fts) WHERE kind = 'legislation';
CREATE INDEX idx_doc_fts_decision ON corpus.documents
    USING GIN (content_fts) WHERE kind = 'decision';
CREATE INDEX idx_doc_fts_record ON corpus.documents
    USING GIN (content_fts) WHERE kind = 'record';
CREATE INDEX idx_doc_fts_notice ON corpus.documents
    USING GIN (content_fts) WHERE kind = 'notice';
</syntaxhighlight>

Reindexing annotations (graph layer) does not touch corpus FTS indexes. Bulk ingestion of one kind does not bloat another kind's GIN index.
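One consequence of partial indexes is that a query can use one of them only when its WHERE clause implies that index's predicate. The sketch below illustrates this with placeholder search terms; whether the planner actually picks the index also depends on table statistics.

<syntaxhighlight lang="sql">
-- kind = 'decision' matches the predicate of idx_doc_fts_decision,
-- so the plan can use that index.
EXPLAIN
SELECT title
FROM corpus.documents
WHERE kind = 'decision'
  AND content_fts @@ websearch_to_tsquery('french', unaccent('clause abusive'));

-- Searching several kinds: one branch per kind, so each branch can use
-- its own partial index. A single kind IN (...) filter does not imply
-- any one index predicate.
SELECT title FROM corpus.documents
WHERE kind = 'legislation'
  AND content_fts @@ websearch_to_tsquery('french', unaccent('clause abusive'))
UNION ALL
SELECT title FROM corpus.documents
WHERE kind = 'decision'
  AND content_fts @@ websearch_to_tsquery('french', unaccent('clause abusive'));
</syntaxhighlight>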

Accent-insensitive search (unaccent)

unaccent() is applied in the trigger before tokenization. This requires the unaccent PostgreSQL extension and ensures that "responsabilite" matches "responsabilité".

The search engine must also apply unaccent() to the query:

<syntaxhighlight lang="sql"> websearch_to_tsquery(cfg, unaccent(corpus.normalize_for_fts(jurisdiction, query))) </syntaxhighlight>
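A short check of the query-side chain, assuming the french configuration and a placeholder query string. The second statement only prints the tsquery that would be matched against content_fts; its exact lexemes depend on the stemmer, so no output is shown for it.

<syntaxhighlight lang="sql">
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT unaccent('responsabilité');  -- responsabilite

-- Inspect the tsquery produced for a user query.
SELECT websearch_to_tsquery(
    'french',
    unaccent(corpus.normalize_for_fts('fr', 'responsabilité civile')));
</syntaxhighlight>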