= Full-Text Search =

== Overview ==

FTS uses PostgreSQL <code>tsvector</code> with per-language text search configurations. The <code>content_fts</code> column is populated by a trigger from the title, two summary tags, and <code>coalesce(body_search, body)</code>. CJK languages are not indexed through <code>tsvector</code>; they require a separate extension such as pgroonga.

== The two-text model ==

* <code>body</code>: clean displayable content. Always renderable to a user (HTML, links, formatted text). Immutable after ingestion. Never contains noisy OCR or raw PDF binary.
* <code>body_search</code>: FTS-optimized text used for indexing. Can be noisy (raw PDF extraction, OCR output) because users never see it directly. Nullable. When NULL, FTS uses body directly.

Why two columns:
# '''Display vs index separation.''' <code>get_document</code> returns <code>body</code>, so it must always be clean. FTS reads <code>coalesce(body_search, body)</code>, so <code>body_search</code> can hold a noisy-but-indexable representation (e.g., extracted PDF text) without polluting display.
# '''PDF-sourced documents.''' When a document is only available as PDF, <code>body</code> is a clean stub like <code>&lt;p&gt;Source: &lt;a href="..."&gt;PDF officiel&lt;/a&gt;&lt;/p&gt;</code> (the LLM follows the link or uses search snippets). <code>body_search</code> holds the pdfminer-extracted text so the document is findable via FTS.
# '''Legal reference normalization''' (e.g., "L442-1" to "L442.1") must be consistent between index and query. The indexed text is normalized, either by <code>normalize_for_fts()</code> in the trigger or by storing already-normalized text in <code>body_search</code>; the search plugin applies the same normalization to the query.
# <code>body</code> is the source of truth for display and annotation anchoring. It never changes.

See ADR-010 for the full rationale behind this model.
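
As an illustration of the two cases, a minimal sketch of one HTML-sourced and one PDF-sourced row. The column list is an assumption (the real <code>corpus.documents</code> table may require more columns), and the titles, placeholder text, and <code>example.org</code> URL are made up for the example.

<syntaxhighlight lang="sql">
-- HTML-sourced: body is clean HTML and body_search stays NULL,
-- so the trigger indexes body.
INSERT INTO corpus.documents (kind, jurisdiction, language, title, body, body_search)
VALUES ('legislation', 'fr', 'fr',
        'Example statute (HTML source)',
        '<p>Clean, displayable article text.</p>',
        NULL);

-- PDF-sourced: body is a displayable stub, body_search carries the
-- extracted (possibly noisy) text that the trigger indexes instead.
INSERT INTO corpus.documents (kind, jurisdiction, language, title, body, body_search)
VALUES ('notice', 'fr', 'fr',
        'Example notice (PDF source)',
        '<p>Source: <a href="https://example.org/notice.pdf">PDF officiel</a></p>',
        'raw text extracted from the PDF, line breaks and artefacts included');

-- Inspect which column fed the index and what the trigger produced.
SELECT title,
       body_search IS NOT NULL AS indexed_from_body_search,
       content_fts
FROM corpus.documents
WHERE title LIKE 'Example %';
</syntaxhighlight>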

== Trigger ==

<syntaxhighlight lang="sql">
CREATE FUNCTION corpus.update_content_fts() RETURNS trigger AS $$
DECLARE
    lang text;
    cfg  regconfig;
BEGIN
    -- Skip recomputation if searchable content unchanged (tag-only upserts)
    IF TG_OP = 'UPDATE'
       AND NEW.title IS NOT DISTINCT FROM OLD.title
       AND NEW.body IS NOT DISTINCT FROM OLD.body
       AND NEW.body_search IS NOT DISTINCT FROM OLD.body_search
       AND NEW.language IS NOT DISTINCT FROM OLD.language
       AND NEW.jurisdiction IS NOT DISTINCT FROM OLD.jurisdiction
       AND (NEW.tags->>'summary') IS NOT DISTINCT FROM (OLD.tags->>'summary')
       AND (NEW.tags->>'headnote_classification') IS NOT DISTINCT FROM (OLD.tags->>'headnote_classification') THEN
        NEW.content_fts := OLD.content_fts;
        RETURN NEW;
    END IF;

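    -- Resolve the language: use NEW.language, else fall back to the country prefix
    -- of the jurisdiction (gb-sct → gb, es-ct → es). A prefix that is not one of the
    -- language codes in the CASE below (e.g. gb) ends up on the 'simple' config.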
    lang := coalesce(NEW.language, split_part(NEW.jurisdiction, '-', 1));

    -- CJK: tsvector non-functional, use pgroonga instead
    IF lang IN ('zh', 'ja', 'ko') THEN
        NEW.content_fts := NULL;
        RETURN NEW;
    END IF;

    cfg := CASE lang
        WHEN 'fr' THEN 'french'::regconfig
        WHEN 'de' THEN 'german'::regconfig
        WHEN 'en' THEN 'english'::regconfig
        WHEN 'es' THEN 'spanish'::regconfig
        WHEN 'it' THEN 'italian'::regconfig
        WHEN 'pt' THEN 'portuguese'::regconfig
        WHEN 'nl' THEN 'dutch'::regconfig
        WHEN 'sv' THEN 'swedish'::regconfig
        WHEN 'da' THEN 'danish'::regconfig
        WHEN 'fi' THEN 'finnish'::regconfig
        WHEN 'hu' THEN 'hungarian'::regconfig
        WHEN 'ro' THEN 'romanian'::regconfig
        WHEN 'tr' THEN 'turkish'::regconfig
        WHEN 'ru' THEN 'russian'::regconfig
        WHEN 'no' THEN 'norwegian'::regconfig
        -- Arabic: no built-in PG config. Falls through to 'simple'.
        -- Install a custom Arabic text search config before ingesting AR data.
        ELSE 'simple'::regconfig
    END;

    -- Binary body check: if the text starts with the PDF magic bytes, index only
    -- the title (A) and the tags (B) and leave the body out of the vector
    IF coalesce(NEW.body_search, NEW.body) IS NOT NULL
       AND left(coalesce(NEW.body_search, NEW.body), 4) = '%PDF' THEN
        NEW.content_fts :=
            setweight(to_tsvector(cfg, unaccent(
                corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, ''))
            )), 'A') ||
            setweight(to_tsvector(cfg, unaccent(
                coalesce(NEW.tags->>'summary', '') || ' ' ||
                coalesce(NEW.tags->>'headnote_classification', '')
            )), 'B');
        RETURN NEW;
    END IF;

    NEW.content_fts :=
        setweight(to_tsvector(cfg, unaccent(
            corpus.normalize_for_fts(NEW.jurisdiction, coalesce(NEW.title, ''))
        )), 'A') ||
        setweight(to_tsvector(cfg, unaccent(
            coalesce(NEW.tags->>'summary', '') || ' ' ||
            coalesce(NEW.tags->>'headnote_classification', '')
        )), 'B') ||
        setweight(to_tsvector(cfg, unaccent(
            corpus.normalize_for_fts(NEW.jurisdiction,
                regexp_replace(coalesce(NEW.body_search, NEW.body, ''), '<[^>]+>', ' ', 'g'))
        )), 'C');
    RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER trg_content_fts
    BEFORE INSERT OR UPDATE OF title, body, body_search, tags, language, jurisdiction
    ON corpus.documents FOR EACH ROW
    EXECUTE FUNCTION corpus.update_content_fts();
</syntaxhighlight>
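
The language fallback in the trigger can be tried in isolation. The following sketch evaluates the same <code>coalesce</code>/<code>split_part</code> expression on a few sample pairs; the sample values (including <code>jp</code> as a jurisdiction code) are illustrative assumptions, not data from the corpus.

<syntaxhighlight lang="sql">
SELECT t.language,
       t.jurisdiction,
       coalesce(t.language, split_part(t.jurisdiction, '-', 1)) AS lang
FROM (VALUES
        ('fr', 'fr'),
        (NULL, 'es-ct'),
        (NULL, 'gb-sct'),
        ('ja', 'jp')
     ) AS t(language, jurisdiction);
-- lang column: fr, es, gb, ja
</syntaxhighlight>

With the <code>CASE</code> in the trigger, <code>fr</code> and <code>es</code> map to the <code>french</code> and <code>spanish</code> configs, <code>gb</code> matches no branch and falls through to <code>simple</code>, and <code>ja</code> takes the CJK path with <code>content_fts</code> set to NULL. Setting <code>language</code> explicitly avoids the <code>gb</code> case.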

== Pluggable normalization (normalize_for_fts) ==

<syntaxhighlight lang="sql">
-- Core: identity (no-op)
CREATE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$ SELECT p_input; $$;
</syntaxhighlight>

Jurisdiction plugins override this function:
* FR plugin installs <code>normalize_legal_refs_final()</code> for French legal references (L442-1 to L442.1).
* Future DE plugin could install <code>normalize_de_refs()</code> for German references (paragraph 823 BGB).

The same normalization function must be applied to search queries (by the search plugin) to ensure consistency between indexed text and query.
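
A minimal sketch of what a jurisdiction override might look like. The regular expression is a hypothetical stand-in for the FR plugin's actual rules, which are not shown on this page; it only rewrites references of the form <code>L442-1</code>, <code>R123-4</code>, or <code>D12-3</code>.

<syntaxhighlight lang="sql">
-- Same signature as the core function, so CREATE OR REPLACE swaps the body.
CREATE OR REPLACE FUNCTION corpus.normalize_for_fts(p_jurisdiction text, p_input text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE
        -- fr and its subdivisions (fr-xx); a NULL jurisdiction falls to ELSE
        WHEN split_part(p_jurisdiction, '-', 1) = 'fr'
            THEN regexp_replace(p_input, '\m([LRD])(\d+)-(\d+)', '\1\2.\3', 'g')
        ELSE p_input
    END;
$$;

SELECT corpus.normalize_for_fts('fr', 'article L442-1 du code de commerce');
-- article L442.1 du code de commerce

SELECT corpus.normalize_for_fts('de', 'L442-1');
-- L442-1 (other jurisdictions pass through unchanged)
</syntaxhighlight>

Because the trigger's skip branch returns <code>OLD.content_fts</code> when no searchable column changed, replacing the function does not by itself re-index existing rows; their vectors keep the previous normalization until one of those columns actually changes.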

== Phased approach to normalization ==

Phase 1 (now): For HTML-sourced documents, <code>body_search=NULL</code>. <code>normalize_for_fts()</code> in the trigger handles normalization on <code>body</code>.

Phase 2 (PDF ingest, ADR-010): For PDF-sourced documents, <code>body</code> is a clean HTML stub with the source URL, and <code>body_search</code> holds the pdfminer-extracted text. <code>normalize_for_fts()</code> still runs. The trigger uses <code>coalesce(body_search, body)</code>, preferring the indexable representation when present.

Phase 3 (LLM pipeline): an LLM writes a normalized, cleaned version to <code>body_search</code> for HTML-sourced documents whose body has irregular references. <code>normalize_for_fts()</code> is then reset to identity.

Query-side normalization (regex) remains necessary in all phases.
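
To see how far a corpus has moved through these phases, one option is to count, per kind, which column the trigger reads for weight C. This is only a sketch using the columns described on this page.

<syntaxhighlight lang="sql">
SELECT kind,
       count(*) FILTER (WHERE body_search IS NULL)     AS weight_c_from_body,
       count(*) FILTER (WHERE body_search IS NOT NULL) AS weight_c_from_body_search
FROM corpus.documents
GROUP BY kind
ORDER BY kind;
</syntaxhighlight>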

== Weight system ==

Three FTS weights:

* '''Weight A''' (highest): title.
* '''Weight B''': <code>tags.summary</code> + <code>tags.headnote_classification</code>, read directly from JSONB by the trigger. Provides FTS on curated legal metadata (headnotes, summaries) without duplicating body content. Not all documents have these tags; weight B is empty when they are absent.
* '''Weight C''': <code>coalesce(body_search, body)</code> with HTML tags stripped via <code>regexp_replace</code>. <code>body_search</code> holds the indexable representation when <code>body</code> is a clean stub (e.g., PDF-sourced documents with <code>body=&lt;a&gt;PDF link&lt;/a&gt;</code> and <code>body_search=&lt;extracted text&gt;</code>). <code>body_search</code> is NULL for HTML-sourced documents in Phases 1 and 2, in which case the trigger falls back to <code>body</code>. See ADR-010.

Weight D is unused.
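
The weights take effect at ranking time. A sketch of a ranked query follows, assuming a French corpus; the weight values are illustrative, not the project's tuned values. <code>ts_rank</code> takes its weight array in the order <code>{D, C, B, A}</code>.

<syntaxhighlight lang="sql">
SELECT d.title,
       -- D is unused, so it gets 0.0; A (title) counts most.
       ts_rank('{0.0, 0.2, 0.4, 1.0}'::float4[], d.content_fts, q.query) AS rank
FROM corpus.documents AS d,
     websearch_to_tsquery(
         'french',
         unaccent(corpus.normalize_for_fts('fr', 'responsabilité contractuelle'))
     ) AS q(query)
WHERE d.kind = 'decision'
  AND d.content_fts @@ q.query
ORDER BY rank DESC
LIMIT 10;
</syntaxhighlight>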

== CJK languages ==

PostgreSQL's built-in text search configurations do not support Chinese, Japanese, or Korean. The <code>simple</code> config does not segment CJK text into words, so the tokens it produces are useless for search.

Hard requirement: install the pgroonga or zhparser extension before ingesting CJK data. The trigger above and the index below assume pgroonga.

The trigger sets <code>content_fts</code> to NULL for CJK. A separate pgroonga index on body handles CJK search:

<syntaxhighlight lang="sql">
-- When pgroonga is available:
CREATE INDEX idx_doc_pgroonga ON corpus.documents
    USING pgroonga (body) WHERE language IN ('zh', 'ja', 'ko');
</syntaxhighlight>

The search engine routes CJK queries to pgroonga and non-CJK queries to tsvector.
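
A sketch of what the CJK branch of that routing might issue, assuming pgroonga is installed and the partial index above exists. <code>&@~</code> is pgroonga's query-syntax full-text operator; the search terms are arbitrary examples.

<syntaxhighlight lang="sql">
SELECT title
FROM corpus.documents
-- Repeat the index predicate so the planner can consider the partial index.
WHERE language IN ('zh', 'ja', 'ko')
  AND body &@~ '契約 解除';
</syntaxhighlight>

Note that this index covers <code>body</code> only. For PDF-sourced CJK documents, whose text lives in <code>body_search</code>, an index over <code>coalesce(body_search, body)</code> would presumably be needed instead.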

== Partial GIN indexes ==

FTS GIN indexes are split per kind for performance isolation:

<syntaxhighlight lang="sql">
CREATE INDEX idx_doc_fts_legislation ON corpus.documents USING GIN (content_fts) WHERE kind = 'legislation';
CREATE INDEX idx_doc_fts_decision  ON corpus.documents USING GIN (content_fts) WHERE kind = 'decision';
CREATE INDEX idx_doc_fts_record    ON corpus.documents USING GIN (content_fts) WHERE kind = 'record';
CREATE INDEX idx_doc_fts_notice    ON corpus.documents USING GIN (content_fts) WHERE kind = 'notice';
</syntaxhighlight>

Reindexing annotations (graph layer) does not touch corpus FTS indexes. Bulk ingestion of one kind does not bloat another kind's GIN index.
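
PostgreSQL considers a partial index only when the query's <code>WHERE</code> clause implies the index predicate, so FTS queries need an explicit <code>kind</code> filter. A sketch for checking this with <code>EXPLAIN</code> (the plan actually chosen depends on table statistics):

<syntaxhighlight lang="sql">
EXPLAIN
SELECT title
FROM corpus.documents
WHERE kind = 'decision'   -- matches the predicate of idx_doc_fts_decision
  AND content_fts @@ websearch_to_tsquery(
          'french', unaccent('responsabilité contractuelle'));
</syntaxhighlight>

A query without a <code>kind</code> condition cannot use any of the four indexes; searching several kinds means either <code>kind IN (...)</code> handled per kind by the search engine or one query per kind.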

== Accent-insensitive search ==

<code>unaccent()</code> is applied in the trigger before tokenization, which requires the <code>unaccent</code> PostgreSQL extension. This ensures that "responsabilite" matches "responsabilité".

The search engine must also apply <code>unaccent()</code> to the query, with the same text search configuration and normalization the trigger used for the document:

<syntaxhighlight lang="sql">
websearch_to_tsquery(cfg, unaccent(corpus.normalize_for_fts(jurisdiction, query)))
</syntaxhighlight>
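
One way to keep the query side in step with the trigger is to wrap the expression in a helper. <code>corpus.build_tsquery</code> is a hypothetical name introduced for this sketch and is not part of the documented schema.

<syntaxhighlight lang="sql">
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Hypothetical helper: same normalize -> unaccent order as the trigger.
-- STABLE because unaccent() is declared STABLE.
CREATE FUNCTION corpus.build_tsquery(p_cfg regconfig, p_jurisdiction text, p_query text)
RETURNS tsquery LANGUAGE sql STABLE AS $$
    SELECT websearch_to_tsquery(
        p_cfg,
        unaccent(corpus.normalize_for_fts(p_jurisdiction, p_query))
    );
$$;

SELECT title
FROM corpus.documents
WHERE kind = 'legislation'
  AND content_fts @@ corpus.build_tsquery('french', 'fr', 'responsabilite L442-1');
</syntaxhighlight>

The caller still has to pick <code>p_cfg</code> with the same language mapping as the trigger's <code>CASE</code>; otherwise stemming differs between index and query.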

[[Category:Corpus]]