Nicolas: Import from duralex/spec/FTS.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server)

2026-04-23T02:12:09Z

= Full-Text Search =

== Overview ==

FTS uses PostgreSQL tsvector with per-language text search configurations. The <code>content_fts</code> column is populated by a trigger from <code>coalesce(body_search, body)</code>. CJK languages require the pgroonga extension.

== The two-text model ==

* <code>body</code>: clean displayable content. Always renderable to a user (HTML, links, formatted text). Immutable after ingestion. Never contains noisy OCR or raw PDF binary.
* <code>body_search</code>: FTS-optimized text used for indexing. Can be noisy (raw PDF extraction, OCR output) because users never see it directly. Nullable. When NULL, FTS uses body directly.

Why two columns:
# '''Display vs index separation.''' <code>get_document</code> returns <code>body</code>, so it must always be clean. FTS reads <code>coalesce(body_search, body)</code>, so <code>body_search</code> can hold a noisy-but-indexable representation (e.g., extracted PDF text) without polluting display.
# '''PDF-sourced documents.''' When a document is only available as PDF, <code>body</code> is a clean stub like <code><nowiki><p>Source: <a href="...">PDF officiel</a></p></nowiki></code> (the LLM follows the link or uses search snippets). <code>body_search</code> holds the pdfminer-extracted text so the document is findable via FTS.
# '''Legal reference normalization''' (e.g., "L442-1" to "L442.1") must be consistent between index and query. <code>body_search</code> holds the normalized text; the search plugin applies the same normalization to the query.
# <code>body</code> is the source of truth for display and annotation anchoring. It never changes.

See ADR-010 for the full rationale behind these semantics.
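
As an illustration of the two-text model, the sketch below inserts one HTML-sourced and one PDF-sourced document. The column list, the example URL, and all values are assumptions made for this example; the actual table definition is not part of this page.

<syntaxhighlight lang="sql">
-- Hypothetical rows; the column list is assumed from the columns named on this page.
INSERT INTO corpus.documents (kind, jurisdiction, language, title, body, body_search)
VALUES
  -- HTML-sourced: body is both displayable and indexable, body_search stays NULL
  ('legislation', 'fr', 'fr', 'Article L442-1',
   '<p>Engage la responsabilité de son auteur.</p>', NULL),
  -- PDF-sourced: body is a clean stub, body_search carries the extracted text
  ('notice', 'fr', 'fr', 'Avis officiel',
   '<p>Source: <a href="https://example.org/avis.pdf">PDF officiel</a></p>',
   'texte brut extrait du PDF par pdfminer');

-- The display path reads body; the index path reads coalesce(body_search, body).
SELECT title,
       body                        AS displayed,
       coalesce(body_search, body) AS indexed
FROM corpus.documents;
</syntaxhighlight>

For the first row both columns of the result are identical; for the second row only the <code>indexed</code> column contains the extracted text.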

== Trigger ==

<syntaxhighlight lang="sql">
CREATE FUNCTION corpus.update_content_fts() RETURNS trigger AS $$
DECLARE
    lang text;
    cfg  regconfig;
BEGIN
    -- Skip recomputation if searchable content unchanged (tag-only upserts)
    IF TG_OP = 'UPDATE'
       AND NEW.title IS NOT DISTINCT FROM OLD.title
       AND NEW.body IS NOT DISTINCT FROM OLD.body
       AND NEW.body_search IS NOT DISTINCT FROM OLD.body_search
       AND NEW.language IS NOT DISTINCT FROM OLD.language
       AND NEW.jurisdiction IS NOT DISTINCT FROM OLD.jurisdiction
       AND (NEW.tags->>'summary') IS NOT DISTINCT FROM (OLD.tags->>'summary')
       AND (NEW.tags->>'headnote_classification') IS NOT DISTINCT FROM (OLD.tags->>'headnote_classification') THEN
        NEW.content_fts := OLD.content_fts;
        RETURN NEW;
    END IF;

    -- Use the document language; when it is NULL, fall back to the country
    -- prefix of the jurisdiction (gb-sct → gb, es-ct → es)
    lang := coalesce(NEW.language, split_part(NEW.jurisdiction, '-', 1));

    -- CJK: tsvector non-functional, use pgroonga instead
    IF lang IN ('zh', 'ja', 'ko') THEN
        NEW.content_fts := NULL;
        RETURN NEW;
    END IF;

    cfg := CASE lang
        WHEN 'fr' THEN 'french'::regconfig
        WHEN 'de' THEN 'german'::regconfig
        WHEN 'en' THEN 'english'::regconfig
        WHEN 'es' THEN 'spanish'::regconfig
        WHEN 'it' THEN 'italian'::regconfig
        WHEN 'pt' THEN 'portuguese'::regconfig
        WHEN 'nl' THEN 'dutch'::regconfig
        WHEN 'sv' THEN 'swedish'::regconfig
        WHEN 'da' THEN 'danish'::regconfig
        WHEN 'fi' THEN 'finnish'::regconfig
        WHEN 'hu' THEN 'hungarian'::regconfig
        WHEN 'ro' THEN 'romanian'::regconfig
        WHEN 'tr' THEN 'turkish'::regconfig
        WHEN 'ru' THEN 'russian'::regconfig
        WHEN 'no' THEN 'norwegian'::regconfig
        -- Arabic: no built-in PG config. Falls through to 'simple'.
        -- Install a custom Arabic text search config before ingesting AR data.
        ELSE 'simple'::regconfig
    END;

    -- Binary body check: never tokenize raw PDF/binary content;
    -- index the title (A) and tags (B) only, with no weight C
    IF coalesce(NEW.body_search, NEW.body) IS NOT NULL
       AND left(coalesce(NEW.body_search, NEW.body), 4) = '%PDF' THEN
        NEW.content_fts :=
            setweight(to_tsvector(cfg, unaccent(
                corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, ''))
            )), 'A') ||
            setweight(to_tsvector(cfg, unaccent(
                coalesce(NEW.tags->>'summary', '') || ' ' ||
                coalesce(NEW.tags->>'headnote_classification', '')
            )), 'B');
        RETURN NEW;
    END IF;

    -- Normal path: title (A), curated tags (B), HTML-stripped text (C)
    NEW.content_fts :=
        setweight(to_tsvector(cfg, unaccent(
            corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, ''))
        )), 'A') ||
        setweight(to_tsvector(cfg, unaccent(
            coalesce(NEW.tags->>'summary', '') || ' ' ||
            coalesce(NEW.tags->>'headnote_classification', '')
        )), 'B') ||
        setweight(to_tsvector(cfg, unaccent(
            corpus.normalize_for_fts(NEW.jurisdiction,
                regexp_replace(coalesce(NEW.body_search, NEW.body, ''), '<[^>]+>', ' ', 'g'))
        )), 'C');
    RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER trg_content_fts
    BEFORE INSERT OR UPDATE OF title, body, body_search, tags, language, jurisdiction
    ON corpus.documents FOR EACH ROW
    EXECUTE FUNCTION corpus.update_content_fts();
</syntaxhighlight>
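
To inspect what the trigger stores, the weighted expression can be evaluated by hand on literal values. This is only a sketch: it assumes the <code>unaccent</code> extension is installed, it leaves out <code>normalize_for_fts()</code> and weight B, and the sample strings are made up.

<syntaxhighlight lang="sql">
-- Stand-alone evaluation of the A and C parts of the trigger expression.
SELECT
  setweight(to_tsvector('french', unaccent('Code de commerce')), 'A') ||
  setweight(to_tsvector('french', unaccent(
    -- same tag-stripping regex as the trigger
    regexp_replace('<p>La <b>responsabilité</b> du vendeur</p>', '<[^>]+>', ' ', 'g')
  )), 'C') AS content_fts;
</syntaxhighlight>

In the result, lexemes that come from the title carry the A label and lexemes from the body carry C; the HTML markup is replaced by spaces before tokenization, so tag names are not indexed.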

== Pluggable normalization (normalize_for_fts) ==

<syntaxhighlight lang="sql">
-- Core: identity (no-op)
CREATE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$ SELECT p_input; $$;
</syntaxhighlight>

Jurisdiction plugins override this function:
* FR plugin installs <code>normalize_legal_refs_final()</code> for French legal references (L442-1 to L442.1).
* Future DE plugin could install <code>normalize_de_refs()</code> for German references (paragraph 823 BGB).

The same normalization function must be applied to search queries (by the search plugin) to ensure consistency between indexed text and query.
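
A plugin override could look roughly like the sketch below. It is hypothetical and is not the FR plugin's actual code: the regular expression covers only the simple <code>L442-1</code> shape given as an example above, and the choice of code letters (L, R, D) is an assumption.

<syntaxhighlight lang="sql">
-- Hypothetical FR override of the core no-op; the real plugin may differ.
CREATE OR REPLACE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
  SELECT CASE
    WHEN split_part(p_jurisdiction, '-', 1) = 'fr' THEN
      -- Rewrite a letter + digits + hyphen + digits reference (L442-1)
      -- into its dotted form (L442.1); \m anchors at the start of a word.
      regexp_replace(p_input, '\m([LRD])(\d+)-(\d+)', '\1\2.\3', 'g')
    ELSE p_input  -- other (or NULL) jurisdictions keep the identity behaviour
  END;
$$;

-- The same call serves the index side (trigger) and the query side (search plugin).
SELECT corpus.normalize_for_fts('fr', 'article L442-1 du code de commerce');
</syntaxhighlight>

Because the trigger and the search plugin both call the same function, a reference typed by a user and the reference stored in the index are rewritten identically.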

== Phased approach to normalization ==

Phase 1 (now): For HTML-sourced documents, <code>body_search=NULL</code>. <code>normalize_for_fts()</code> in the trigger handles normalization on <code>body</code>.

Phase 2 (PDF ingest, ADR-010): For PDF-sourced documents, <code>body</code> is a clean HTML stub with the source URL, and <code>body_search</code> holds the pdfminer-extracted text. <code>normalize_for_fts()</code> still runs. The trigger uses <code>coalesce(body_search, body)</code>, preferring the indexable representation when present.

Phase 3 (LLM pipeline): An LLM writes a normalized/cleaned version to <code>body_search</code> for HTML-sourced documents whose body has irregular references. <code>normalize_for_fts()</code> is then reset to identity.

Query-side normalization (regex) remains necessary in all phases.

== Weight system ==

Three FTS weights:

* '''Weight A''' (highest): title.
* '''Weight B''': <code>tags.summary</code> + <code>tags.headnote_classification</code>, read directly from JSONB by the trigger. Provides FTS on curated legal metadata (headnotes, summaries) without duplicating body content. Not all documents have these tags; weight B is empty when they are absent.
* '''Weight C''': <code>coalesce(body_search, body)</code> with HTML tag stripping via <code>regexp_replace</code>. <code>body_search</code> holds the indexable representation when <code>body</code> is a clean stub (e.g., PDF-sourced documents with <code><nowiki>body=<a>PDF link</a></nowiki></code> and <code><nowiki>body_search=<extracted text></nowiki></code>). <code>body_search</code> is NULL for HTML-sourced documents, so the trigger falls back to <code>body</code> directly. See ADR-010.

Weight D is unused.
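
The weights only matter at ranking time. The query below is a sketch of how a search could exploit them with <code>ts_rank</code>, whose weight array is ordered <code>{D, C, B, A}</code>. The numeric values, the search terms, and the <code>kind</code> filter are illustrative assumptions, not settings taken from this spec.

<syntaxhighlight lang="sql">
-- Hypothetical ranking query; weight values are placeholders to be tuned.
WITH q AS (
  SELECT websearch_to_tsquery('french', unaccent('responsabilite contractuelle')) AS tsq
)
SELECT d.title,
       -- array order is {D, C, B, A}: D is unused, title (A) counts most
       ts_rank('{0.0, 0.2, 0.6, 1.0}'::float4[], d.content_fts, q.tsq) AS score
FROM corpus.documents d
CROSS JOIN q
WHERE d.kind = 'decision'
  AND d.content_fts @@ q.tsq
ORDER BY score DESC
LIMIT 20;
</syntaxhighlight>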

== CJK languages ==

PostgreSQL built-in text search configurations do not support Chinese, Japanese, or Korean. The default parser behind the <code>simple</code> config does not segment CJK text into words, so the tokens it produces are functionally useless for search.

Hard requirement: install the pgroonga extension (or zhparser, which covers Chinese only) before ingesting CJK data.

The trigger sets <code>content_fts</code> to NULL for CJK. A separate pgroonga index on body handles CJK search:

<syntaxhighlight lang="sql">
-- When pgroonga is available:
CREATE INDEX idx_doc_pgroonga ON corpus.documents
USING pgroonga (body) WHERE language IN ('zh', 'ja', 'ko');
</syntaxhighlight>

The search engine routes CJK queries to pgroonga and non-CJK queries to tsvector.
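
On the CJK path, a query could look like the following sketch. It assumes the pgroonga index above exists and uses PGroonga's <code>&@~</code> operator (full-text search with query syntax); the search term is an arbitrary example.

<syntaxhighlight lang="sql">
-- Hypothetical CJK search; content_fts is NULL for these rows, so tsvector is not involved.
SELECT title
FROM corpus.documents
WHERE language IN ('zh', 'ja', 'ko')   -- same predicate as the partial index
  AND body &@~ '損害賠償';
</syntaxhighlight>

Repeating the index predicate in the <code>WHERE</code> clause lets the planner consider the partial pgroonga index for the query.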

== Partial GIN indexes ==

FTS GIN indexes are split per kind for performance isolation:

<syntaxhighlight lang="sql">
CREATE INDEX idx_doc_fts_legislation ON corpus.documents USING GIN (content_fts) WHERE kind = 'legislation';
CREATE INDEX idx_doc_fts_decision ON corpus.documents USING GIN (content_fts) WHERE kind = 'decision';
CREATE INDEX idx_doc_fts_record ON corpus.documents USING GIN (content_fts) WHERE kind = 'record';
CREATE INDEX idx_doc_fts_notice ON corpus.documents USING GIN (content_fts) WHERE kind = 'notice';
</syntaxhighlight>

Reindexing annotations (graph layer) does not touch corpus FTS indexes. Bulk ingestion of one kind does not bloat another kind's GIN index.
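
A partial index is only considered when the query's <code>WHERE</code> clause implies the index predicate, so searches have to filter on <code>kind</code> explicitly. The sketch below uses <code>EXPLAIN</code> to check which index the planner picks; the search terms are placeholders.

<syntaxhighlight lang="sql">
-- The literal kind = 'decision' is what makes idx_doc_fts_decision eligible.
EXPLAIN
SELECT title
FROM corpus.documents
WHERE kind = 'decision'
  AND content_fts @@ websearch_to_tsquery('french', unaccent('clause abusive'));
</syntaxhighlight>

A search across several kinds would need one such condition per kind (for example combined with <code>UNION ALL</code>). Passing <code>kind</code> as a bind parameter in a prepared statement may also prevent the planner from matching the predicate, so this is worth verifying on the target PostgreSQL version.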

== Accent-insensitive search ==

<code>unaccent()</code> is applied in the trigger before tokenization. Requires the <code>unaccent</code> PostgreSQL extension. Ensures "responsabilite" matches "responsabilité".

The search engine must also apply <code>unaccent()</code> to the query:

<syntaxhighlight lang="sql">
websearch_to_tsquery(cfg, unaccent(corpus.normalize_for_fts(jurisdiction, query)))
</syntaxhighlight>
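
A quick way to check the behaviour is to compare an accented document term with an unaccented query term. This sketch assumes sufficient privileges to create the extension and uses the <code>french</code> config as an example.

<syntaxhighlight lang="sql">
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Both sides are unaccented before stemming, so they produce the same lexeme.
SELECT to_tsvector('french', unaccent('responsabilité'))
       @@ websearch_to_tsquery('french', unaccent('responsabilite')) AS matches;
-- matches: true
</syntaxhighlight>

If <code>unaccent()</code> were applied on only one side, the accented and unaccented spellings could be stemmed to different lexemes and the match could fail.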

[[Category:Corpus]]
