Editing
Corpus/Schema
(section)
Jump to navigation
Jump to search
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Operational notes == '''No FK on edges or parent_id.''' Foreign keys on multi-million row tables with parallel ingestion from independent sources create ordering constraints and lock contention that outweigh referential safety. Orphan detection runs as a periodic batch job. '''Partial GIN indexes per kind.''' See the indexes section. Keeps FTS index sizes small, avoids partitioning complexity. '''WHERE IS DISTINCT FROM in upserts.''' Use <code>IS DISTINCT FROM</code> instead of <code>!=</code> in <code>ON CONFLICT ... DO UPDATE SET ... WHERE</code> clauses. <code>!=</code> returns NULL when comparing with NULL (and the row is not updated), while <code>IS DISTINCT FROM</code> treats NULL as a comparable value. Critical for columns like <code>date_end</code>, <code>body_search</code>, <code>language</code> that are frequently NULL. '''Autovacuum tuning.''' Set <code>autovacuum_vacuum_scale_factor = 0.01</code> on <code>corpus.documents</code>. The default (0.2) means autovacuum triggers after 20% of rows are dead -- on a 3.4M row table, that is 680K dead rows before cleanup. At 0.01, it triggers at 34K dead rows, keeping bloat and index performance under control. '''Estimated counts at scale.''' <code>COUNT(*) OVER()</code> becomes expensive above 100M rows. Use <code>EXPLAIN</code>-based row estimates or <code>pg_class.reltuples</code> for approximate counts in paginated APIs. Exact counts only when explicitly requested. '''CJK full-text search.''' PostgreSQL built-in FTS does not support Chinese, Japanese, or Korean tokenization. <code>content_fts</code> is NULL for CJK documents. Use pgroonga (all three languages) or zhparser (Chinese only) as an extension. CJK search queries must use a different code path. '''Partitioning at CN scale.''' At 130M+ rows (realistic for a full Chinese legal corpus), partition <code>documents</code> by <code>jurisdiction</code>. This is the one case where partitioning is justified: queries almost always filter by jurisdiction, and partition pruning eliminates entire shards. The composite PK concern is manageable when partitioning by a column that is already always present in queries. ----
Summary:
Please note that all contributions to Dura Lex Wiki are considered to be released under the Creative Commons Attribution-ShareAlike (see
Dura Lex Wiki:Copyrights
for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource.
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)
Navigation menu
Personal tools
Not logged in
Talk
Contributions
Create account
Log in
Namespaces
Page
Discussion
English
Views
Read
Edit
Edit source
View history
More
Search
Navigation
Main page
Recent changes
Random page
Help about MediaWiki
Special pages
Tools
What links here
Related changes
Page information