Corpus/FTS
Full-Text Search
Overview
FTS uses PostgreSQL tsvector with per-language text search configurations. The content_fts column is populated by a trigger from the title, the summary and headnote_classification tags, and coalesce(body_search, body). CJK languages require the pgroonga extension.
The two-text model
- body: clean displayable content. Always renderable to a user (HTML, links, formatted text). Immutable after ingestion. Never contains noisy OCR or raw PDF binary.
- body_search: FTS-optimized text used for indexing. Can be noisy (raw PDF extraction, OCR output) because users never see it directly. Nullable. When NULL, FTS uses body directly.
Why two columns:
- Display vs index separation. get_document returns body, so it must always be clean. FTS reads coalesce(body_search, body), so body_search can hold a noisy-but-indexable representation (e.g., extracted PDF text) without polluting display.
- PDF-sourced documents. When a document is only available as PDF, body is a clean stub like <p>Source: <a href="...">PDF officiel</a></p> (the LLM follows the link or uses search snippets). body_search holds the pdfminer-extracted text so the document is findable via FTS.
- Legal reference normalization (e.g., "L442-1" to "L442.1") must be consistent between index and query. body_search holds the normalized text; the search plugin applies the same normalization to the query. body is the source of truth for display and annotation anchoring. It never changes.
See ADR-010 for the full rationale of this semantic.
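As an illustration of the two-text model, the query below shows which text the trigger would index for an HTML-sourced row and for a PDF-sourced row. It is a minimal sketch: the rows are inline placeholder values, not real corpus data, and the doc and indexed_text column names are invented for the example.

<syntaxhighlight lang="sql">
-- Illustrative rows only; the real corpus.documents table has more columns.
SELECT doc,
       body,
       body_search,
       -- Same expression the trigger uses to pick the text to index.
       coalesce(body_search, body) AS indexed_text
FROM (VALUES
    -- HTML-sourced: body_search is NULL, so body itself is indexed.
    ('html-sourced', '<p>Texte de la décision</p>', NULL),
    -- PDF-sourced: body is a clean stub, body_search carries the extracted text.
    ('pdf-sourced', '<p>Source: <a href="...">PDF officiel</a></p>', 'texte extrait du PDF')
) AS t(doc, body, body_search);
</syntaxhighlight>

For the first row indexed_text equals body; for the second it equals body_search, while body stays untouched for display.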
Trigger
<syntaxhighlight lang="sql">
CREATE FUNCTION corpus.update_content_fts() RETURNS trigger AS $$
DECLARE
    lang text;
    cfg regconfig;
BEGIN
    -- Skip recomputation if searchable content unchanged (tag-only upserts)
    IF TG_OP = 'UPDATE'
       AND NEW.title IS NOT DISTINCT FROM OLD.title
       AND NEW.body IS NOT DISTINCT FROM OLD.body
       AND NEW.body_search IS NOT DISTINCT FROM OLD.body_search
       AND NEW.language IS NOT DISTINCT FROM OLD.language
       AND NEW.jurisdiction IS NOT DISTINCT FROM OLD.jurisdiction
       AND (NEW.tags->>'summary') IS NOT DISTINCT FROM (OLD.tags->>'summary')
       AND (NEW.tags->>'headnote_classification')
           IS NOT DISTINCT FROM (OLD.tags->>'headnote_classification')
    THEN
        NEW.content_fts := OLD.content_fts;
        RETURN NEW;
    END IF;

    -- Extract country prefix for subdivisions (gb-sct → gb, es-ct → es)
    lang := coalesce(NEW.language, split_part(NEW.jurisdiction, '-', 1));

    -- CJK: tsvector non-functional, use pgroonga instead
    IF lang IN ('zh', 'ja', 'ko') THEN
        NEW.content_fts := NULL;
        RETURN NEW;
    END IF;

    cfg := CASE lang
        WHEN 'fr' THEN 'french'::regconfig
        WHEN 'de' THEN 'german'::regconfig
        WHEN 'en' THEN 'english'::regconfig
        WHEN 'es' THEN 'spanish'::regconfig
        WHEN 'it' THEN 'italian'::regconfig
        WHEN 'pt' THEN 'portuguese'::regconfig
        WHEN 'nl' THEN 'dutch'::regconfig
        WHEN 'sv' THEN 'swedish'::regconfig
        WHEN 'da' THEN 'danish'::regconfig
        WHEN 'fi' THEN 'finnish'::regconfig
        WHEN 'hu' THEN 'hungarian'::regconfig
        WHEN 'ro' THEN 'romanian'::regconfig
        WHEN 'tr' THEN 'turkish'::regconfig
        WHEN 'ru' THEN 'russian'::regconfig
        WHEN 'no' THEN 'norwegian'::regconfig
        -- Arabic: no built-in PG config. Falls through to 'simple'.
        -- Install a custom Arabic text search config before ingesting AR data.
        ELSE 'simple'::regconfig
    END;

    -- Binary body check: when the text starts with the PDF magic bytes,
    -- index only the title (A) and tags (B) and skip the body (C).
    IF coalesce(NEW.body_search, NEW.body) IS NOT NULL
       AND left(coalesce(NEW.body_search, NEW.body), 4) = '%PDF'
    THEN
        NEW.content_fts :=
            setweight(to_tsvector(cfg, unaccent(
                corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, ''))
            )), 'A') ||
            setweight(to_tsvector(cfg, unaccent(
                coalesce(NEW.tags->>'summary', '') || ' ' ||
                coalesce(NEW.tags->>'headnote_classification', '')
            )), 'B');
        RETURN NEW;
    END IF;

    -- Normal case: title (A), curated tags (B), body text with HTML tags stripped (C)
    NEW.content_fts :=
        setweight(to_tsvector(cfg, unaccent(
            corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, ''))
        )), 'A') ||
        setweight(to_tsvector(cfg, unaccent(
            coalesce(NEW.tags->>'summary', '') || ' ' ||
            coalesce(NEW.tags->>'headnote_classification', '')
        )), 'B') ||
        setweight(to_tsvector(cfg, unaccent(
            corpus.normalize_for_fts(NEW.jurisdiction,
                regexp_replace(coalesce(NEW.body_search, NEW.body, ''), '<[^>]+>', ' ', 'g'))
        )), 'C');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_content_fts
    BEFORE INSERT OR UPDATE OF title, body, body_search, tags, language, jurisdiction
    ON corpus.documents
    FOR EACH ROW EXECUTE FUNCTION corpus.update_content_fts();
</syntaxhighlight>
Pluggable normalization (normalize_for_fts)
<syntaxhighlight lang="sql">
-- Core: identity (no-op)
CREATE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
    SELECT p_input;
$$;
</syntaxhighlight>
Jurisdiction plugins override this function:
- FR plugin installs normalize_legal_refs_final() for French legal references (L442-1 to L442.1).
- Future DE plugin could install normalize_de_refs() for German references (paragraph 823 BGB).
The same normalization function must be applied to search queries (by the search plugin) to ensure consistency between indexed text and query.
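To make the override mechanism concrete, here is a minimal sketch of what a jurisdiction plugin might install. It is hypothetical: the body of the real normalize_legal_refs_final() is not shown on this page, and the regular expression below is only an assumed approximation that rewrites references of the form "L442-1" to "L442.1" for French jurisdictions.

<syntaxhighlight lang="sql">
-- Hypothetical override, NOT the actual FR plugin code.
-- Assumes the corpus schema and the core identity function already exist.
CREATE OR REPLACE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE
        -- Match 'fr' and subdivisions such as 'fr-xx' via the country prefix.
        WHEN split_part(p_jurisdiction, '-', 1) = 'fr'
            -- \m anchors at the start of a word; L/R/D prefixes are an assumption.
            THEN regexp_replace(p_input, '\m([LRD])\s?(\d+)-(\d+)', '\1\2.\3', 'g')
        -- Every other jurisdiction (or NULL) keeps the identity behaviour.
        ELSE p_input
    END;
$$;

-- Usage: the same call serves the trigger (index side) and the search plugin (query side).
SELECT corpus.normalize_for_fts('fr', 'article L442-1 du code de commerce');
</syntaxhighlight>

Routing both sides through one function is what keeps the indexed form and the query form identical; a pattern that handles only one hyphenated segment, as this one does, would need extending for longer references.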
Phased approach to normalization
Phase 1 (now): For HTML-sourced documents, body_search=NULL. normalize_for_fts() in the trigger handles normalization on body.
Phase 2 (PDF ingest, ADR-010): For PDF-sourced documents, body is a clean HTML stub with the source URL, and body_search holds the pdfminer-extracted text. normalize_for_fts() still runs. The trigger uses coalesce(body_search, body), preferring the indexable representation when present.
Phase 3 (LLM pipeline): The LLM writes a normalized/cleaned version to body_search for HTML-sourced documents whose body has irregular references. normalize_for_fts() is then reset to identity.
Query-side normalization (regex) remains necessary in all phases.
Weight system
Three FTS weights:
- Weight A (highest): title.
- Weight B: tags.summary + tags.headnote_classification, read directly from JSONB by the trigger. Provides FTS on curated legal metadata (headnotes, summaries) without duplicating body content. Not all documents have these tags; weight B is empty when they are absent.
- Weight C: coalesce(body_search, body) with HTML tag stripping via regexp_replace. body_search holds the indexable representation when body is a clean stub (e.g., PDF-sourced documents with body=<a>PDF link</a> and body_search=<extracted text>). It is NULL for HTML-sourced documents, in which case the trigger falls back to body directly. See ADR-010.
Weight D is unused.
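The weights only matter once a ranking function reads them. The sketch below assumes the search engine ranks with PostgreSQL's ts_rank, which this page does not state; the sample text and the numeric weights (PostgreSQL's documented defaults, in D, C, B, A order) are illustrative.

<syntaxhighlight lang="sql">
-- Build a weighted tsvector the same way the trigger does, from sample text.
WITH doc AS (
    SELECT setweight(to_tsvector('english', 'liability for defective products'), 'A') ||
           setweight(to_tsvector('english', 'summary of the decision'), 'B') ||
           setweight(to_tsvector('english', 'full text mentioning liability'), 'C')
           AS content_fts
)
-- The weights array is ordered {D, C, B, A}; a title hit (A) counts most.
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}'::float4[],
               content_fts,
               websearch_to_tsquery('english', 'liability')) AS rank
FROM doc;
</syntaxhighlight>

Because weight D is unused, the first array element never contributes to the score.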
CJK languages
PostgreSQL built-in text search configurations do not support Chinese, Japanese, or Korean. The simple config performs no word segmentation for these scripts, so the tokens it produces are functionally useless for search.
Hard requirement: install the pgroonga or zhparser extension before ingesting CJK data.
The trigger sets content_fts to NULL for CJK. A separate pgroonga index on body handles CJK search:
<syntaxhighlight lang="sql">
-- When pgroonga is available:
CREATE INDEX idx_doc_pgroonga ON corpus.documents
    USING pgroonga (body) WHERE language IN ('zh', 'ja', 'ko');
</syntaxhighlight>
The search engine routes CJK queries to pgroonga and non-CJK queries to tsvector.
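A CJK query routed to pgroonga might look like the sketch below. It assumes the pgroonga extension and the partial index above are installed; the selected column and the search term are placeholders, and the actual query built by the search engine is not shown on this page.

<syntaxhighlight lang="sql">
-- Sketch of the CJK branch: full-text match on body through pgroonga.
SELECT title
FROM corpus.documents
-- Repeat the index predicate so the planner can consider the partial index.
WHERE language IN ('zh', 'ja', 'ko')
  -- &@~ is pgroonga's full-text search operator with query syntax.
  AND body &@~ '契約';
</syntaxhighlight>

Note that this path searches body, not coalesce(body_search, body), because that is the column the index above covers.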
Partial GIN indexes
FTS GIN indexes are split per kind for performance isolation:
<syntaxhighlight lang="sql">
CREATE INDEX idx_doc_fts_legislation ON corpus.documents
    USING GIN (content_fts) WHERE kind = 'legislation';
CREATE INDEX idx_doc_fts_decision ON corpus.documents
    USING GIN (content_fts) WHERE kind = 'decision';
CREATE INDEX idx_doc_fts_record ON corpus.documents
    USING GIN (content_fts) WHERE kind = 'record';
CREATE INDEX idx_doc_fts_notice ON corpus.documents
    USING GIN (content_fts) WHERE kind = 'notice';
</syntaxhighlight>
Reindexing annotations (graph layer) does not touch corpus FTS indexes. Bulk ingestion of one kind does not bloat another kind's GIN index.
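One consequence of partial indexes is that a query can only use them when its WHERE clause implies the index predicate. The sketch below, with a hard-coded 'french' config and a placeholder search string, shows a query shaped so that idx_doc_fts_decision is eligible; whether the planner actually picks it depends on table statistics.

<syntaxhighlight lang="sql">
-- Inspect the plan for a search restricted to one kind.
EXPLAIN
SELECT title
FROM corpus.documents
-- This equality matches the predicate of idx_doc_fts_decision.
WHERE kind = 'decision'
  AND content_fts @@ websearch_to_tsquery('french', unaccent('responsabilite contractuelle'));
</syntaxhighlight>

A search without a kind filter cannot use any single partial index on its own, so cross-kind queries presumably need one branch per kind or a different access path.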
Accent-insensitive search
unaccent() is applied in the trigger before tokenization. Requires the unaccent PostgreSQL extension. Ensures "responsabilite" matches "responsabilité".
The search engine must also apply unaccent() to the query:
<syntaxhighlight lang="sql"> websearch_to_tsquery(cfg, unaccent(corpus.normalize_for_fts(jurisdiction, query))) </syntaxhighlight>
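The effect can be checked in isolation with literal strings. This sketch assumes permission to create the unaccent extension; the 'french' config and the sample sentence are illustrative, and normalize_for_fts is left out to keep the example self-contained.

<syntaxhighlight lang="sql">
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Index side and query side both pass through unaccent(), so the accented
-- document text and the unaccented query reduce to the same lexeme.
SELECT to_tsvector('french', unaccent('La responsabilité du vendeur'))
       @@ websearch_to_tsquery('french', unaccent('responsabilite')) AS matches;
</syntaxhighlight>

The query should return true. Applying unaccent() on only one side can yield different stems for the accented and unaccented forms, which is why the search engine must mirror the trigger.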