Editing Corpus/Schema (section)

== Operational notes ==

'''No FK on edges or parent_id.''' Foreign keys on multi-million row tables with parallel ingestion from independent sources create ordering constraints and lock contention that outweigh referential safety. Orphan detection runs as a periodic batch job.

'''Partial GIN indexes per kind.''' See the indexes section. Keeps FTS index sizes small, avoids partitioning complexity.

'''WHERE IS DISTINCT FROM in upserts.''' Use <code>IS DISTINCT FROM</code> instead of <code>!=</code> in <code>ON CONFLICT ... DO UPDATE SET ... WHERE</code> clauses. <code>!=</code> returns NULL when comparing with NULL (and the row is not updated), while <code>IS DISTINCT FROM</code> treats NULL as a comparable value. Critical for columns like <code>date_end</code>, <code>body_search</code>, <code>language</code> that are frequently NULL.

'''Autovacuum tuning.''' Set <code>autovacuum_vacuum_scale_factor = 0.01</code> on <code>corpus.documents</code>. The default (0.2) means autovacuum triggers after 20% of rows are dead -- on a 3.4M row table, that is 680K dead rows before cleanup. At 0.01, it triggers at 34K dead rows, keeping bloat and index performance under control.

'''Estimated counts at scale.''' <code>COUNT(*) OVER()</code> becomes expensive above 100M rows. Use <code>EXPLAIN</code>-based row estimates or <code>pg_class.reltuples</code> for approximate counts in paginated APIs. Exact counts only when explicitly requested.

'''CJK full-text search.''' PostgreSQL built-in FTS does not support Chinese, Japanese, or Korean tokenization. <code>content_fts</code> is NULL for CJK documents. Use pgroonga (all three languages) or zhparser (Chinese only) as an extension. CJK search queries must use a different code path.

'''Partitioning at CN scale.''' At 130M+ rows (realistic for a full Chinese legal corpus), partition <code>documents</code> by <code>jurisdiction</code>. This is the one case where partitioning is justified: queries almost always filter by jurisdiction, and partition pruning eliminates entire shards. The composite PK concern is manageable when partitioning by a column that is already always present in queries.

----