<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.dura-lex.org/index.php?action=history&amp;feed=atom&amp;title=Corpus%2FFTS</id>
	<title>Corpus/FTS - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.dura-lex.org/index.php?action=history&amp;feed=atom&amp;title=Corpus%2FFTS"/>
	<link rel="alternate" type="text/html" href="https://wiki.dura-lex.org/index.php?title=Corpus/FTS&amp;action=history"/>
	<updated>2026-04-23T05:46:04Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://wiki.dura-lex.org/index.php?title=Corpus/FTS&amp;diff=52&amp;oldid=prev</id>
		<title>Nicolas: Import from duralex/spec/FTS.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server)</title>
		<link rel="alternate" type="text/html" href="https://wiki.dura-lex.org/index.php?title=Corpus/FTS&amp;diff=52&amp;oldid=prev"/>
		<updated>2026-04-23T02:12:09Z</updated>

		<summary type="html">&lt;p&gt;Import from duralex/spec/FTS.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Full-Text Search =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
FTS uses PostgreSQL tsvector with per-language text search configurations. The &amp;lt;code&amp;gt;content_fts&amp;lt;/code&amp;gt; column is populated by a trigger from &amp;lt;code&amp;gt;coalesce(body_search, body)&amp;lt;/code&amp;gt;. CJK languages require the pgroonga extension.&lt;br /&gt;
&lt;br /&gt;
== The two-text model ==&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt;: clean displayable content. Always renderable to a user (HTML, links, formatted text). Immutable after ingestion. Never contains noisy OCR or raw PDF binary.&lt;br /&gt;
* &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt;: FTS-optimized text used for indexing. Can be noisy (raw PDF extraction, OCR output) because users never see it directly. Nullable. When NULL, FTS uses body directly.&lt;br /&gt;
&lt;br /&gt;
Why two columns:&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Display vs index separation.&amp;#039;&amp;#039;&amp;#039; &amp;lt;code&amp;gt;get_document&amp;lt;/code&amp;gt; returns &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt;, so it must always be clean. FTS reads &amp;lt;code&amp;gt;coalesce(body_search, body)&amp;lt;/code&amp;gt;, so &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; can hold a noisy-but-indexable representation (e.g., extracted PDF text) without polluting display.&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;PDF-sourced documents.&amp;#039;&amp;#039;&amp;#039; When a document is only available as PDF, &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; is a clean stub like &amp;lt;code&amp;gt;&amp;amp;lt;p&amp;amp;gt;Source: &amp;amp;lt;a href=&amp;quot;...&amp;quot;&amp;amp;gt;PDF officiel&amp;amp;lt;/a&amp;amp;gt;&amp;amp;lt;/p&amp;amp;gt;&amp;lt;/code&amp;gt; (the LLM follows the link or uses search snippets). &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; holds the pdfminer-extracted text so the document is findable via FTS.&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Legal reference normalization&amp;#039;&amp;#039;&amp;#039; (e.g., &amp;quot;L442-1&amp;quot; to &amp;quot;L442.1&amp;quot;) must be consistent between index and query. &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; holds the normalized text; the search plugin applies the same normalization to the query.&lt;br /&gt;
# &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; is the source of truth for display and annotation anchoring. It never changes.&lt;br /&gt;
&lt;br /&gt;
See ADR-010 for the full rationale behind these semantics.&lt;br /&gt;
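&lt;br /&gt;
As an illustration of the two-text model, the sketch below inserts one HTML-sourced and one PDF-sourced document. The column list is an assumption: it uses only columns named on this page, and the real &amp;lt;code&amp;gt;corpus.documents&amp;lt;/code&amp;gt; table may require others. Titles, URL and text values are placeholders.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
-- Hypothetical rows; the column list is limited to columns named on this page.&lt;br /&gt;
-- HTML-sourced: body is clean HTML, body_search stays NULL, so FTS indexes body.&lt;br /&gt;
INSERT INTO corpus.documents (kind, jurisdiction, language, title, body, body_search)&lt;br /&gt;
VALUES (&amp;#039;legislation&amp;#039;, &amp;#039;fr&amp;#039;, &amp;#039;fr&amp;#039;, &amp;#039;Example article&amp;#039;,&lt;br /&gt;
        &amp;#039;&amp;lt;p&amp;gt;Clean displayable text&amp;lt;/p&amp;gt;&amp;#039;, NULL);&lt;br /&gt;
&lt;br /&gt;
-- PDF-sourced: body is a clean stub, body_search carries the extracted text.&lt;br /&gt;
INSERT INTO corpus.documents (kind, jurisdiction, language, title, body, body_search)&lt;br /&gt;
VALUES (&amp;#039;decision&amp;#039;, &amp;#039;fr&amp;#039;, &amp;#039;fr&amp;#039;, &amp;#039;Example decision&amp;#039;,&lt;br /&gt;
        &amp;#039;&amp;lt;p&amp;gt;Source: &amp;lt;a href=&amp;quot;https://example.org/doc.pdf&amp;quot;&amp;gt;PDF officiel&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;&amp;#039;,&lt;br /&gt;
        &amp;#039;raw text extracted from the PDF&amp;#039;);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In both cases the trigger indexes &amp;lt;code&amp;gt;coalesce(body_search, body)&amp;lt;/code&amp;gt;, while &amp;lt;code&amp;gt;get_document&amp;lt;/code&amp;gt; keeps returning &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt;.&lt;br /&gt;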
&lt;br /&gt;
== Trigger ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
CREATE FUNCTION corpus.update_content_fts() RETURNS trigger AS $$&lt;br /&gt;
DECLARE&lt;br /&gt;
    lang text;&lt;br /&gt;
    cfg  regconfig;&lt;br /&gt;
BEGIN&lt;br /&gt;
    -- Skip recomputation if searchable content unchanged (tag-only upserts)&lt;br /&gt;
    IF TG_OP = &amp;#039;UPDATE&amp;#039;&lt;br /&gt;
       AND NEW.title IS NOT DISTINCT FROM OLD.title&lt;br /&gt;
       AND NEW.body IS NOT DISTINCT FROM OLD.body&lt;br /&gt;
       AND NEW.body_search IS NOT DISTINCT FROM OLD.body_search&lt;br /&gt;
       AND NEW.language IS NOT DISTINCT FROM OLD.language&lt;br /&gt;
       AND NEW.jurisdiction IS NOT DISTINCT FROM OLD.jurisdiction&lt;br /&gt;
       AND (NEW.tags-&amp;gt;&amp;gt;&amp;#039;summary&amp;#039;) IS NOT DISTINCT FROM (OLD.tags-&amp;gt;&amp;gt;&amp;#039;summary&amp;#039;)&lt;br /&gt;
       AND (NEW.tags-&amp;gt;&amp;gt;&amp;#039;headnote_classification&amp;#039;) IS NOT DISTINCT FROM (OLD.tags-&amp;gt;&amp;gt;&amp;#039;headnote_classification&amp;#039;) THEN&lt;br /&gt;
        NEW.content_fts := OLD.content_fts;&lt;br /&gt;
        RETURN NEW;&lt;br /&gt;
    END IF;&lt;br /&gt;
&lt;br /&gt;
    -- When language is NULL, fall back to the jurisdiction country prefix (gb-sct → gb, es-ct → es)&lt;br /&gt;
    lang := coalesce(NEW.language, split_part(NEW.jurisdiction, &amp;#039;-&amp;#039;, 1));&lt;br /&gt;
&lt;br /&gt;
    -- CJK: tsvector non-functional, use pgroonga instead&lt;br /&gt;
    IF lang IN (&amp;#039;zh&amp;#039;, &amp;#039;ja&amp;#039;, &amp;#039;ko&amp;#039;) THEN&lt;br /&gt;
        NEW.content_fts := NULL;&lt;br /&gt;
        RETURN NEW;&lt;br /&gt;
    END IF;&lt;br /&gt;
&lt;br /&gt;
    cfg := CASE lang&lt;br /&gt;
        WHEN &amp;#039;fr&amp;#039; THEN &amp;#039;french&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;de&amp;#039; THEN &amp;#039;german&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;en&amp;#039; THEN &amp;#039;english&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;es&amp;#039; THEN &amp;#039;spanish&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;it&amp;#039; THEN &amp;#039;italian&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;pt&amp;#039; THEN &amp;#039;portuguese&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;nl&amp;#039; THEN &amp;#039;dutch&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;sv&amp;#039; THEN &amp;#039;swedish&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;da&amp;#039; THEN &amp;#039;danish&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;fi&amp;#039; THEN &amp;#039;finnish&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;hu&amp;#039; THEN &amp;#039;hungarian&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;ro&amp;#039; THEN &amp;#039;romanian&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;tr&amp;#039; THEN &amp;#039;turkish&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;ru&amp;#039; THEN &amp;#039;russian&amp;#039;::regconfig&lt;br /&gt;
        WHEN &amp;#039;no&amp;#039; THEN &amp;#039;norwegian&amp;#039;::regconfig&lt;br /&gt;
        -- Arabic: no built-in PG config. Falls through to &amp;#039;simple&amp;#039;.&lt;br /&gt;
        -- Install a custom Arabic text search config before ingesting AR data.&lt;br /&gt;
        ELSE &amp;#039;simple&amp;#039;::regconfig&lt;br /&gt;
    END;&lt;br /&gt;
&lt;br /&gt;
    -- Binary body check: for raw PDF/binary content, index title and tags only (no weight C)&lt;br /&gt;
    IF coalesce(NEW.body_search, NEW.body) IS NOT NULL&lt;br /&gt;
       AND left(coalesce(NEW.body_search, NEW.body), 4) = &amp;#039;%PDF&amp;#039; THEN&lt;br /&gt;
        NEW.content_fts :=&lt;br /&gt;
            setweight(to_tsvector(cfg, unaccent(&lt;br /&gt;
                corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, &amp;#039;&amp;#039;))&lt;br /&gt;
            )), &amp;#039;A&amp;#039;) ||&lt;br /&gt;
            setweight(to_tsvector(cfg, unaccent(&lt;br /&gt;
                coalesce(NEW.tags-&amp;gt;&amp;gt;&amp;#039;summary&amp;#039;, &amp;#039;&amp;#039;) || &amp;#039; &amp;#039; ||&lt;br /&gt;
                coalesce(NEW.tags-&amp;gt;&amp;gt;&amp;#039;headnote_classification&amp;#039;, &amp;#039;&amp;#039;)&lt;br /&gt;
            )), &amp;#039;B&amp;#039;);&lt;br /&gt;
        RETURN NEW;&lt;br /&gt;
    END IF;&lt;br /&gt;
&lt;br /&gt;
    NEW.content_fts :=&lt;br /&gt;
        setweight(to_tsvector(cfg, unaccent(&lt;br /&gt;
            corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, &amp;#039;&amp;#039;))&lt;br /&gt;
        )), &amp;#039;A&amp;#039;) ||&lt;br /&gt;
        setweight(to_tsvector(cfg, unaccent(&lt;br /&gt;
            coalesce(NEW.tags-&amp;gt;&amp;gt;&amp;#039;summary&amp;#039;, &amp;#039;&amp;#039;) || &amp;#039; &amp;#039; ||&lt;br /&gt;
            coalesce(NEW.tags-&amp;gt;&amp;gt;&amp;#039;headnote_classification&amp;#039;, &amp;#039;&amp;#039;)&lt;br /&gt;
        )), &amp;#039;B&amp;#039;) ||&lt;br /&gt;
        setweight(to_tsvector(cfg, unaccent(&lt;br /&gt;
            corpus.normalize_for_fts(NEW.jurisdiction,&lt;br /&gt;
                regexp_replace(coalesce(NEW.body_search, NEW.body, &amp;#039;&amp;#039;), &amp;#039;&amp;lt;[^&amp;gt;]+&amp;gt;&amp;#039;, &amp;#039; &amp;#039;, &amp;#039;g&amp;#039;))&lt;br /&gt;
        )), &amp;#039;C&amp;#039;);&lt;br /&gt;
    RETURN NEW;&lt;br /&gt;
END $$ LANGUAGE plpgsql;&lt;br /&gt;
&lt;br /&gt;
CREATE TRIGGER trg_content_fts&lt;br /&gt;
    BEFORE INSERT OR UPDATE OF title, body, body_search, tags, language, jurisdiction&lt;br /&gt;
    ON corpus.documents FOR EACH ROW&lt;br /&gt;
    EXECUTE FUNCTION corpus.update_content_fts();&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Pluggable normalization (normalize_for_fts) ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
-- Core: identity (no-op)&lt;br /&gt;
CREATE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)&lt;br /&gt;
RETURNS text LANGUAGE sql IMMUTABLE AS $$ SELECT p_input; $$;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Jurisdiction plugins override this function:&lt;br /&gt;
* FR plugin installs &amp;lt;code&amp;gt;normalize_legal_refs_final()&amp;lt;/code&amp;gt; for French legal references (L442-1 to L442.1).&lt;br /&gt;
* Future DE plugin could install &amp;lt;code&amp;gt;normalize_de_refs()&amp;lt;/code&amp;gt; for German references (paragraph 823 BGB).&lt;br /&gt;
&lt;br /&gt;
The same normalization function must be applied to search queries (by the search plugin) to ensure consistency between indexed text and query.&lt;br /&gt;
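&lt;br /&gt;
A jurisdiction plugin might override the core function along the following lines. This is a minimal sketch, not the actual &amp;lt;code&amp;gt;normalize_legal_refs_final()&amp;lt;/code&amp;gt; of the FR plugin: the regular expression and the set of article prefixes (L, R, D, A) are assumptions chosen to reproduce the L442-1 to L442.1 example.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
-- Hypothetical override: rewrite &amp;quot;L442-1&amp;quot; as &amp;quot;L442.1&amp;quot; for French jurisdictions only.&lt;br /&gt;
CREATE OR REPLACE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)&lt;br /&gt;
RETURNS text LANGUAGE sql IMMUTABLE AS $$&lt;br /&gt;
    SELECT CASE&lt;br /&gt;
        WHEN p_jurisdiction = &amp;#039;fr&amp;#039; OR p_jurisdiction LIKE &amp;#039;fr-%&amp;#039;&lt;br /&gt;
            -- \m anchors at a word start; \1 and \2 are the captured groups&lt;br /&gt;
            THEN regexp_replace(p_input, &amp;#039;\m([LRDA]\d+)-(\d+)&amp;#039;, &amp;#039;\1.\2&amp;#039;, &amp;#039;g&amp;#039;)&lt;br /&gt;
        ELSE p_input&lt;br /&gt;
    END;&lt;br /&gt;
$$;&lt;br /&gt;
&lt;br /&gt;
-- The same call serves the index side (trigger) and the query side (search plugin).&lt;br /&gt;
SELECT corpus.normalize_for_fts(&amp;#039;fr&amp;#039;, &amp;#039;article L442-1 du code de commerce&amp;#039;);&lt;br /&gt;
-- returns: article L442.1 du code de commerce&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Dispatching on &amp;lt;code&amp;gt;p_jurisdiction&amp;lt;/code&amp;gt; inside a single function keeps the trigger unchanged when a new jurisdiction plugin is installed.&lt;br /&gt;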
&lt;br /&gt;
== Phased approach to normalization ==&lt;br /&gt;
&lt;br /&gt;
Phase 1 (now): For HTML-sourced documents, &amp;lt;code&amp;gt;body_search=NULL&amp;lt;/code&amp;gt;. &amp;lt;code&amp;gt;normalize_for_fts()&amp;lt;/code&amp;gt; in the trigger handles normalization on &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Phase 2 (PDF ingest, ADR-010): For PDF-sourced documents, &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; is a clean HTML stub with the source URL, and &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; holds the pdfminer-extracted text. &amp;lt;code&amp;gt;normalize_for_fts()&amp;lt;/code&amp;gt; still runs. The trigger uses &amp;lt;code&amp;gt;coalesce(body_search, body)&amp;lt;/code&amp;gt;, preferring the indexable representation when present.&lt;br /&gt;
&lt;br /&gt;
Phase 3 (LLM pipeline): An LLM writes a normalized/cleaned version to &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; for HTML-sourced documents whose body has irregular references. &amp;lt;code&amp;gt;normalize_for_fts()&amp;lt;/code&amp;gt; is then reset to identity.&lt;br /&gt;
&lt;br /&gt;
Query-side normalization (regex) remains necessary in all phases.&lt;br /&gt;
&lt;br /&gt;
== Weight system ==&lt;br /&gt;
&lt;br /&gt;
Three FTS weights:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Weight A&amp;#039;&amp;#039;&amp;#039; (highest): title.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Weight B&amp;#039;&amp;#039;&amp;#039;: &amp;lt;code&amp;gt;tags.summary&amp;lt;/code&amp;gt; + &amp;lt;code&amp;gt;tags.headnote_classification&amp;lt;/code&amp;gt;, read directly from JSONB by the trigger. Provides FTS on curated legal metadata (headnotes, summaries) without duplicating body content. Not all documents have these tags; weight B is empty when they are absent.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Weight C&amp;#039;&amp;#039;&amp;#039;: &amp;lt;code&amp;gt;coalesce(body_search, body)&amp;lt;/code&amp;gt; with HTML tag stripping via &amp;lt;code&amp;gt;regexp_replace&amp;lt;/code&amp;gt;. &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; holds the indexable representation when &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; is a clean stub (e.g., PDF-sourced documents with &amp;lt;code&amp;gt;body=&amp;amp;lt;a&amp;amp;gt;PDF link&amp;amp;lt;/a&amp;amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;body_search=&amp;amp;lt;extracted text&amp;amp;gt;&amp;lt;/code&amp;gt;). For HTML-sourced documents &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; is NULL and the trigger falls back to &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; directly. See ADR-010.&lt;br /&gt;
&lt;br /&gt;
Weight D is unused.&lt;br /&gt;
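&lt;br /&gt;
The weights only take effect at ranking time. The sketch below shows one way a query could rank matches with &amp;lt;code&amp;gt;ts_rank&amp;lt;/code&amp;gt;. The weight array is the default documented by PostgreSQL, ordered {D, C, B, A}; the language, kind and search terms are placeholders, not values taken from the search engine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
-- Sketch: rank French legislation matches; weights are ordered {D, C, B, A}.&lt;br /&gt;
SELECT d.title,&lt;br /&gt;
       ts_rank(&amp;#039;{0.1, 0.2, 0.4, 1.0}&amp;#039;, d.content_fts, q) AS rank&lt;br /&gt;
FROM corpus.documents AS d,&lt;br /&gt;
     websearch_to_tsquery(&amp;#039;french&amp;#039;, unaccent(&amp;#039;responsabilité contractuelle&amp;#039;)) AS q&lt;br /&gt;
WHERE d.kind = &amp;#039;legislation&amp;#039;&lt;br /&gt;
  AND d.content_fts @@ q&lt;br /&gt;
ORDER BY rank DESC&lt;br /&gt;
LIMIT 10;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With these values a title hit (weight A) counts five times as much as a body hit (weight C).&lt;br /&gt;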
&lt;br /&gt;
== CJK languages ==&lt;br /&gt;
&lt;br /&gt;
The built-in PostgreSQL text search configurations do not support Chinese, Japanese, or Korean. The &amp;lt;code&amp;gt;simple&amp;lt;/code&amp;gt; config does not segment CJK text into words, so the tokens it produces are functionally useless for search.&lt;br /&gt;
&lt;br /&gt;
Hard requirement: install the pgroonga or zhparser extension when ingesting CJK data.&lt;br /&gt;
&lt;br /&gt;
The trigger sets &amp;lt;code&amp;gt;content_fts&amp;lt;/code&amp;gt; to NULL for CJK. A separate pgroonga index on body handles CJK search:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
-- When pgroonga is available:&lt;br /&gt;
CREATE INDEX idx_doc_pgroonga ON corpus.documents&lt;br /&gt;
    USING pgroonga (body) WHERE language IN (&amp;#039;zh&amp;#039;, &amp;#039;ja&amp;#039;, &amp;#039;ko&amp;#039;);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The search engine routes CJK queries to pgroonga and non-CJK queries to tsvector.&lt;br /&gt;
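&lt;br /&gt;
On the pgroonga side, a routed CJK query could look like the sketch below. It assumes the partial index above exists and uses the pgroonga &amp;lt;code&amp;gt;&amp;amp;amp;@~&amp;lt;/code&amp;gt; query operator; the search terms are placeholders. The &amp;lt;code&amp;gt;language&amp;lt;/code&amp;gt; predicate repeats the WHERE clause of the index so that the planner can consider the partial index.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
-- Sketch: CJK search served by pgroonga instead of content_fts (NULL for CJK rows).&lt;br /&gt;
SELECT d.title&lt;br /&gt;
FROM corpus.documents AS d&lt;br /&gt;
WHERE d.language IN (&amp;#039;zh&amp;#039;, &amp;#039;ja&amp;#039;, &amp;#039;ko&amp;#039;)&lt;br /&gt;
  AND d.body &amp;amp;@~ &amp;#039;損害賠償&amp;#039;&lt;br /&gt;
LIMIT 10;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;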
&lt;br /&gt;
== Partial GIN indexes ==&lt;br /&gt;
&lt;br /&gt;
FTS GIN indexes are split per kind for performance isolation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
CREATE INDEX idx_doc_fts_legislation ON corpus.documents USING GIN (content_fts) WHERE kind = &amp;#039;legislation&amp;#039;;&lt;br /&gt;
CREATE INDEX idx_doc_fts_decision  ON corpus.documents USING GIN (content_fts) WHERE kind = &amp;#039;decision&amp;#039;;&lt;br /&gt;
CREATE INDEX idx_doc_fts_record    ON corpus.documents USING GIN (content_fts) WHERE kind = &amp;#039;record&amp;#039;;&lt;br /&gt;
CREATE INDEX idx_doc_fts_notice    ON corpus.documents USING GIN (content_fts) WHERE kind = &amp;#039;notice&amp;#039;;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Reindexing annotations (graph layer) does not touch corpus FTS indexes. Bulk ingestion of one kind does not bloat another kind&amp;#039;s GIN index.&lt;br /&gt;
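&lt;br /&gt;
A consequence of this layout is that a query can use a partial index only when its WHERE clause implies the index predicate. A hedged sketch, with placeholder search terms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
-- The kind filter lets the planner consider idx_doc_fts_decision;&lt;br /&gt;
-- without a kind predicate, a partial index cannot be used.&lt;br /&gt;
EXPLAIN&lt;br /&gt;
SELECT d.title&lt;br /&gt;
FROM corpus.documents AS d&lt;br /&gt;
WHERE d.kind = &amp;#039;decision&amp;#039;&lt;br /&gt;
  AND d.content_fts @@ websearch_to_tsquery(&amp;#039;french&amp;#039;, unaccent(&amp;#039;clause abusive&amp;#039;));&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;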
&lt;br /&gt;
== Accent-insensitive search ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;unaccent()&amp;lt;/code&amp;gt; is applied in the trigger before tokenization. Requires the &amp;lt;code&amp;gt;unaccent&amp;lt;/code&amp;gt; PostgreSQL extension. Ensures &amp;quot;responsabilite&amp;quot; matches &amp;quot;responsabilité&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
The search engine must also apply &amp;lt;code&amp;gt;unaccent()&amp;lt;/code&amp;gt; to the query:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
websearch_to_tsquery(cfg, unaccent(corpus.normalize_for_fts(jurisdiction, query)))&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Corpus]]&lt;/div&gt;</summary>
		<author><name>Nicolas</name></author>
	</entry>
</feed>