<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.dura-lex.org/index.php?action=history&amp;feed=atom&amp;title=Development%2FAdding_a_source</id>
	<title>Development/Adding a source - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.dura-lex.org/index.php?action=history&amp;feed=atom&amp;title=Development%2FAdding_a_source"/>
	<link rel="alternate" type="text/html" href="https://wiki.dura-lex.org/index.php?title=Development/Adding_a_source&amp;action=history"/>
	<updated>2026-04-23T05:29:39Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://wiki.dura-lex.org/index.php?title=Development/Adding_a_source&amp;diff=46&amp;oldid=prev</id>
		<title>Nicolas: Create guide for adding a new data source to the ingest pipeline (via create-page on MediaWiki MCP Server)</title>
		<link rel="alternate" type="text/html" href="https://wiki.dura-lex.org/index.php?title=Development/Adding_a_source&amp;diff=46&amp;oldid=prev"/>
		<updated>2026-04-23T02:07:32Z</updated>

		<summary type="html">&lt;p&gt;Create guide for adding a new data source to the ingest pipeline (via create-page on MediaWiki MCP Server)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Adding a source =&lt;br /&gt;
&lt;br /&gt;
This guide explains how to add a new data source to the Dura Lex ingest pipeline. A &amp;quot;source&amp;quot; is a public institutional data feed (legislation, case law, company records, official notices) that gets downloaded, parsed, and inserted into the corpus.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Adding a source involves these steps:&lt;br /&gt;
&lt;br /&gt;
# Register the source in the source registry&lt;br /&gt;
# Implement a downloader (and optionally a parser)&lt;br /&gt;
# Follow the tag conventions&lt;br /&gt;
# Follow the edge conventions&lt;br /&gt;
# Document the source on the wiki&lt;br /&gt;
&lt;br /&gt;
== Step 1: Register the source ==&lt;br /&gt;
&lt;br /&gt;
Every source must be registered in the jurisdiction&amp;#039;s &amp;lt;code&amp;gt;source_registry.py&amp;lt;/code&amp;gt; module. Registration happens at import time via the &amp;lt;code&amp;gt;register_source()&amp;lt;/code&amp;gt; function from &amp;lt;code&amp;gt;duralex.ingest.source_registry&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Location ===&lt;br /&gt;
&lt;br /&gt;
* French sources: &amp;lt;code&amp;gt;duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py&amp;lt;/code&amp;gt;&lt;br /&gt;
* EU sources: &amp;lt;code&amp;gt;duralex-ingest/duralex-ingest-eu/src/duralex/ingest/eu/source_registry.py&amp;lt;/code&amp;gt;&lt;br /&gt;
* New jurisdictions: create &amp;lt;code&amp;gt;duralex-ingest/duralex-ingest-{jur}/src/duralex/ingest/{jur}/source_registry.py&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Required fields ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;register_source()&amp;lt;/code&amp;gt; function requires the following fields:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Field !! Type !! Description !! Example&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;source_key&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; (positional) || Unique short identifier for the source || &amp;lt;code&amp;gt;&amp;quot;cada&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;jurisdiction&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || ISO-style jurisdiction code || &amp;lt;code&amp;gt;&amp;quot;fr&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;eu&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;name&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || Human-readable source name || &amp;lt;code&amp;gt;&amp;quot;CADA&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;description&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || What this source contains || &amp;lt;code&amp;gt;&amp;quot;Avis et conseils de la Commission d&amp;#039;accès aux documents administratifs&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;kind&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || One of the 6 structural kinds: &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;record&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;notice&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;section&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;chunk&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;&amp;quot;decision&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;publisher&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || Organization that publishes the data || &amp;lt;code&amp;gt;&amp;quot;CADA&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;DILA&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;publisher_url&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || URL where the data can be found || &amp;lt;code&amp;gt;&amp;quot;https://www.data.gouv.fr/fr/datasets/...&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;license&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || License name || &amp;lt;code&amp;gt;&amp;quot;Licence Ouverte 2.0&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;license_url&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || URL of the license || &amp;lt;code&amp;gt;&amp;quot;https://www.etalab.gouv.fr/licence-ouverte-open-licence/&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;language&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;str&amp;lt;/code&amp;gt; || ISO 639-1 language code || &amp;lt;code&amp;gt;&amp;quot;fr&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;en&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;date_bounds&amp;lt;/code&amp;gt; || &amp;lt;code&amp;gt;DateBounds&amp;lt;/code&amp;gt; || Valid date range for this source || see below&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== DateBounds ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;DateBounds&amp;lt;/code&amp;gt; validates that document dates are within a plausible range. Two factory methods are available:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;DateBounds.strict(min_year=NNNN)&amp;lt;/code&amp;gt; — for sources where documents cannot have future dates (decisions, records, notices). Max year defaults to current year + small margin.&lt;br /&gt;
* &amp;lt;code&amp;gt;DateBounds.permissive(min_year=NNNN, max_year=NNNN)&amp;lt;/code&amp;gt; — for legislation with legitimate forward entry-into-force or expiration dates.&lt;br /&gt;
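&lt;br /&gt;
The difference between the two factories can be pictured with a minimal sketch. &amp;lt;code&amp;gt;DateBoundsSketch&amp;lt;/code&amp;gt;, its &amp;lt;code&amp;gt;contains()&amp;lt;/code&amp;gt; method, and the one-year margin are assumptions made for illustration; the real &amp;lt;code&amp;gt;DateBounds&amp;lt;/code&amp;gt; class in &amp;lt;code&amp;gt;duralex.ingest.date_validation&amp;lt;/code&amp;gt; may be built differently.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateBoundsSketch:
    """Hypothetical stand-in for DateBounds; not the real class."""

    min_year: int
    max_year: int

    @classmethod
    def strict(cls, min_year: int, margin_years: int = 1) -> "DateBoundsSketch":
        # Assumed margin: one year past the current year.
        return cls(min_year, date.today().year + margin_years)

    @classmethod
    def permissive(cls, min_year: int, max_year: int) -> "DateBoundsSketch":
        # The caller chooses the upper bound, so forward dates can pass.
        return cls(min_year, max_year)

    def contains(self, value: date) -> bool:
        # True when the year falls inside the inclusive range.
        return self.max_year >= value.year >= self.min_year
```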
&lt;br /&gt;
=== Example: simple source registration ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
from duralex.ingest.date_validation import DateBounds&lt;br /&gt;
from duralex.ingest.source_registry import register_source&lt;br /&gt;
&lt;br /&gt;
register_source(&lt;br /&gt;
    &amp;quot;cada&amp;quot;,&lt;br /&gt;
    jurisdiction=&amp;quot;fr&amp;quot;,&lt;br /&gt;
    name=&amp;quot;CADA&amp;quot;,&lt;br /&gt;
    description=&amp;quot;Avis et conseils de la Commission d&amp;#039;accès aux documents administratifs&amp;quot;,&lt;br /&gt;
    kind=&amp;quot;decision&amp;quot;,&lt;br /&gt;
    publisher=&amp;quot;CADA&amp;quot;,&lt;br /&gt;
    publisher_url=&amp;quot;https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/&amp;quot;,&lt;br /&gt;
    license=&amp;quot;Licence Ouverte 2.0&amp;quot;,&lt;br /&gt;
    license_url=&amp;quot;https://www.etalab.gouv.fr/licence-ouverte-open-licence/&amp;quot;,&lt;br /&gt;
    language=&amp;quot;fr&amp;quot;,&lt;br /&gt;
    date_bounds=DateBounds.strict(min_year=1900),&lt;br /&gt;
)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Sub-sources ===&lt;br /&gt;
&lt;br /&gt;
If your source has logical sub-feeds (e.g., different court levels within a single API), register them as sub-sources. Sub-sources inherit all metadata from their parent:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
from duralex.ingest.source_registry import register_sub_source&lt;br /&gt;
&lt;br /&gt;
# Parent must be registered first&lt;br /&gt;
for sub_key in (&amp;quot;judilibre_cc&amp;quot;, &amp;quot;judilibre_ca&amp;quot;, &amp;quot;judilibre_tj&amp;quot;, &amp;quot;judilibre_tcom&amp;quot;):&lt;br /&gt;
    register_sub_source(sub_key, &amp;quot;judilibre&amp;quot;)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Step 2: Implement a downloader ==&lt;br /&gt;
&lt;br /&gt;
=== BaseDownloader (for DILA-style tar.gz archives) ===&lt;br /&gt;
&lt;br /&gt;
If your source publishes data as &amp;lt;code&amp;gt;.tar.gz&amp;lt;/code&amp;gt; archives on an HTTP index page (like DILA sources), subclass &amp;lt;code&amp;gt;BaseDownloader&amp;lt;/code&amp;gt; from &amp;lt;code&amp;gt;duralex.ingest.sources.base_downloader&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;BaseDownloader&amp;lt;/code&amp;gt; provides:&lt;br /&gt;
* HTTP retry with exponential backoff and 429/Retry-After support&lt;br /&gt;
* Listing remote files from an HTML index page&lt;br /&gt;
* Downloading with partial-download safety (&amp;lt;code&amp;gt;.tmp&amp;lt;/code&amp;gt; files + atomic rename)&lt;br /&gt;
* Extracting &amp;lt;code&amp;gt;.tar.gz&amp;lt;/code&amp;gt; archives with prefix stripping&lt;br /&gt;
* State tracking (last downloaded diff filename) for incremental updates&lt;br /&gt;
* Freemium base file + incremental diff pattern&lt;br /&gt;
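&lt;br /&gt;
The retry behaviour in the first bullet can be sketched as a delay calculation. The function name, the one-second base, and the 60-second cap are assumptions for illustration, not the actual &amp;lt;code&amp;gt;BaseDownloader&amp;lt;/code&amp;gt; implementation, and only the numeric-seconds form of &amp;lt;code&amp;gt;Retry-After&amp;lt;/code&amp;gt; is handled.&lt;br /&gt;
&lt;br /&gt;
```python
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base_seconds: float = 1.0,
    cap_seconds: float = 60.0,
) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # A 429 response may carry Retry-After as a whole number of seconds.
        return min(float(retry_after.strip()), cap_seconds)
    # Otherwise double the wait on every attempt, up to the cap.
    return min(base_seconds * (2 ** attempt), cap_seconds)
```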
&lt;br /&gt;
Constructor parameters:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Parameter !! Description&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dataset_name&amp;lt;/code&amp;gt; || Identifier for this dataset (e.g. &amp;lt;code&amp;gt;&amp;quot;cass&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;legi&amp;quot;&amp;lt;/code&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;data_directory&amp;lt;/code&amp;gt; || Root directory for extracted data and archives&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;base_url&amp;lt;/code&amp;gt; || URL of the HTTP index listing &amp;lt;code&amp;gt;.tar.gz&amp;lt;/code&amp;gt; files&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;get_state&amp;lt;/code&amp;gt; || Callback returning the last processed filename, or &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;set_state&amp;lt;/code&amp;gt; || Callback to persist the last processed filename&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;archive_prefix_markers&amp;lt;/code&amp;gt; || Directory names marking the start of useful content in tar archives (e.g. &amp;lt;code&amp;gt;[&amp;quot;JURI&amp;quot;, &amp;quot;CETA&amp;quot;]&amp;lt;/code&amp;gt;)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;run()&amp;lt;/code&amp;gt; method is the main entry point. It downloads the freemium base (on first run) and/or incremental diffs, extracts them, and returns a list of extracted XML file paths. &amp;lt;code&amp;gt;run()&amp;lt;/code&amp;gt; does not write state; call &amp;lt;code&amp;gt;commit_state()&amp;lt;/code&amp;gt; after a successful ingest, so that a failed ingest does not advance the stored state.&lt;br /&gt;
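&lt;br /&gt;
A minimal sketch of that call order follows. &amp;lt;code&amp;gt;run()&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;commit_state()&amp;lt;/code&amp;gt; are the methods named above; the &amp;lt;code&amp;gt;ArchiveDownloader&amp;lt;/code&amp;gt; protocol, the &amp;lt;code&amp;gt;ingest_then_commit&amp;lt;/code&amp;gt; helper, and the &amp;lt;code&amp;gt;ingest_files&amp;lt;/code&amp;gt; callback are hypothetical, and calling both methods without arguments is an assumption.&lt;br /&gt;
&lt;br /&gt;
```python
from typing import Callable, List, Protocol


class ArchiveDownloader(Protocol):
    """The two methods this sketch relies on (assumed to take no arguments)."""

    def run(self) -> List[str]: ...

    def commit_state(self) -> None: ...


def ingest_then_commit(
    downloader: ArchiveDownloader,
    ingest_files: Callable[[List[str]], None],
) -> int:
    """Download, ingest, and persist state only if the ingest succeeded."""
    xml_paths = downloader.run()  # state is not written here
    ingest_files(xml_paths)       # an exception here skips commit_state()
    downloader.commit_state()
    return len(xml_paths)
```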
&lt;br /&gt;
=== Inline downloader (for non-archive sources) ===&lt;br /&gt;
&lt;br /&gt;
For sources that do not follow the tar.gz archive pattern (e.g., CSV files, API endpoints), write a standalone downloader class that handles its own parsing and DB insertion. This is called an &amp;quot;inline source&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
The CADA downloader is a good example of an inline source. Key patterns:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
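from __future__ import annotations&lt;br /&gt;
&lt;br /&gt;
from pathlib import Path&lt;br /&gt;
&lt;br /&gt;
# IngestErrors is provided by the ingest package; its import path is not shown in this guide.&lt;br /&gt;
class CadaDownloader:&lt;br /&gt;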
    &amp;quot;&amp;quot;&amp;quot;Downloads CADA CSV from data.gouv.fr and ingests into corpus.documents.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    def __init__(self, dsn: str, data_directory: Path) -&amp;gt; None:&lt;br /&gt;
        self.dsn = dsn&lt;br /&gt;
        self._cache_directory = data_directory / &amp;quot;cada&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    def run(self, errors: IngestErrors | None = None) -&amp;gt; list[str]:&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Download, parse, and insert. Returns [].&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        # 1. Check upstream for changes (compare last_modified)&lt;br /&gt;
        # 2. Download or use cached file&lt;br /&gt;
        # 3. Parse rows into corpus.documents records&lt;br /&gt;
        # 4. Insert in batches using insert_batch()&lt;br /&gt;
        # 5. Clean up orphaned records&lt;br /&gt;
        # 6. Update ingest state&lt;br /&gt;
        ...&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Key conventions for inline downloaders:&lt;br /&gt;
&lt;br /&gt;
* Accept &amp;lt;code&amp;gt;dsn&amp;lt;/code&amp;gt; (PostgreSQL connection string) and &amp;lt;code&amp;gt;data_directory&amp;lt;/code&amp;gt; (root for cached files) in the constructor&lt;br /&gt;
* Accept an optional &amp;lt;code&amp;gt;errors: IngestErrors | None&amp;lt;/code&amp;gt; parameter in &amp;lt;code&amp;gt;run()&amp;lt;/code&amp;gt; for error tracking&lt;br /&gt;
* Cache downloaded files to disk with a &amp;lt;code&amp;gt;.tmp&amp;lt;/code&amp;gt; extension during download and atomic rename on completion&lt;br /&gt;
* Store a sidecar metadata file (e.g., &amp;lt;code&amp;gt;.meta&amp;lt;/code&amp;gt;) for change detection&lt;br /&gt;
* Use &amp;lt;code&amp;gt;get_ingest_state()&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;set_ingest_state()&amp;lt;/code&amp;gt; from &amp;lt;code&amp;gt;duralex.ingest.state&amp;lt;/code&amp;gt; to track whether the upstream data has changed&lt;br /&gt;
* Use &amp;lt;code&amp;gt;insert_batch()&amp;lt;/code&amp;gt; from &amp;lt;code&amp;gt;duralex.ingest.database.batch_writer&amp;lt;/code&amp;gt; for bulk insertion&lt;br /&gt;
* Return an empty list (inline sources handle their own insertion)&lt;br /&gt;
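&lt;br /&gt;
The &amp;lt;code&amp;gt;.tmp&amp;lt;/code&amp;gt; plus atomic rename convention can be sketched with the standard library alone. &amp;lt;code&amp;gt;write_atomically&amp;lt;/code&amp;gt; is a hypothetical helper, not part of the ingest package; it takes already-fetched byte chunks so the sketch stays independent of any HTTP client.&lt;br /&gt;
&lt;br /&gt;
```python
import os
from pathlib import Path
from typing import Iterable


def write_atomically(target: Path, chunks: Iterable[bytes]) -> Path:
    """Write chunks to target.tmp, then rename onto target in one step."""
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = target.with_name(target.name + ".tmp")
    with tmp_path.open("wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
    # os.replace overwrites the destination; on the same filesystem the
    # rename is atomic, so readers never observe a half-written file.
    os.replace(tmp_path, target)
    return target
```
&lt;br /&gt;
If the download is interrupted, only the &amp;lt;code&amp;gt;.tmp&amp;lt;/code&amp;gt; file is left behind and the previously cached file stays intact.&lt;br /&gt;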
&lt;br /&gt;
=== Document record format ===&lt;br /&gt;
&lt;br /&gt;
Each record inserted into &amp;lt;code&amp;gt;corpus.documents&amp;lt;/code&amp;gt; must have:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Field !! Description&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;id&amp;lt;/code&amp;gt; || Globally unique document ID, prefixed with jurisdiction (e.g. &amp;lt;code&amp;gt;&amp;quot;fr.cada_20240001&amp;quot;&amp;lt;/code&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;kind&amp;lt;/code&amp;gt; || One of: &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;record&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;notice&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;section&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;chunk&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;jurisdiction&amp;lt;/code&amp;gt; || Jurisdiction code (e.g. &amp;lt;code&amp;gt;&amp;quot;fr&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;eu&amp;quot;&amp;lt;/code&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;language&amp;lt;/code&amp;gt; || ISO 639-1 language code&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;source&amp;lt;/code&amp;gt; || Source key matching the registered source&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;date&amp;lt;/code&amp;gt; || Primary date (ISO 8601 format &amp;lt;code&amp;gt;YYYY-MM-DD&amp;lt;/code&amp;gt;), or &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;date_end&amp;lt;/code&amp;gt; || End date for documents with a date range, or &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;parent_id&amp;lt;/code&amp;gt; || Parent document ID for hierarchical sources, or &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;title&amp;lt;/code&amp;gt; || Human-readable title&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; || Clean displayable content (HTML or formatted text). Immutable after ingestion.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; || Indexable text for FTS (can be noisy: PDF OCR, etc.), or &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; to use &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;tags&amp;lt;/code&amp;gt; || JSON string of metadata tags (see below)&lt;br /&gt;
|}&lt;br /&gt;
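&lt;br /&gt;
As an illustration, a record with these fields might look like the sketch below. The field names come from the table; the values are made up, and representing a record as a plain &amp;lt;code&amp;gt;dict&amp;lt;/code&amp;gt; is an assumption, since the input type expected by &amp;lt;code&amp;gt;insert_batch()&amp;lt;/code&amp;gt; is not shown in this guide.&lt;br /&gt;
&lt;br /&gt;
```python
import json

# Illustrative values only; the keys mirror the table above.
record = {
    "id": "fr.cada_20240001",   # jurisdiction prefix + source-specific part
    "kind": "decision",
    "jurisdiction": "fr",
    "language": "fr",
    "source": "cada",           # must match the registered source key
    "date": "2024-01-15",       # ISO 8601, or None
    "date_end": None,
    "parent_id": None,
    "title": "CADA 2024 n°20240001",
    "body": "Texte de l'avis.",
    "body_search": None,        # None means: index body for full-text search
    "tags": json.dumps(
        {"type": "administrative_decision", "content_quality": "native_structured"},
        ensure_ascii=False,
    ),
}
```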
&lt;br /&gt;
== Step 3: Follow tag conventions ==&lt;br /&gt;
&lt;br /&gt;
Tags are the primary metadata mechanism. Every document gets a &amp;lt;code&amp;gt;tags&amp;lt;/code&amp;gt; JSON object. Refer to [[Corpus/Tag conventions]] for the full shared vocabulary.&lt;br /&gt;
&lt;br /&gt;
=== Mandatory tags ===&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;type&amp;lt;/code&amp;gt; — legal document type (e.g. &amp;lt;code&amp;gt;&amp;quot;law&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;decree&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;judgment&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;administrative_decision&amp;quot;&amp;lt;/code&amp;gt;). See the tag conventions for the full list.&lt;br /&gt;
* &amp;lt;code&amp;gt;content_quality&amp;lt;/code&amp;gt; — quality of the body content (e.g. &amp;lt;code&amp;gt;&amp;quot;native_structured&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;ocr_raw&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;metadata_only&amp;quot;&amp;lt;/code&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
=== Common optional tags ===&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;nature&amp;lt;/code&amp;gt; — source-specific document nature&lt;br /&gt;
* &amp;lt;code&amp;gt;court&amp;lt;/code&amp;gt; — court identifier (for decisions)&lt;br /&gt;
* &amp;lt;code&amp;gt;authority&amp;lt;/code&amp;gt; — issuing authority&lt;br /&gt;
* &amp;lt;code&amp;gt;case_number&amp;lt;/code&amp;gt; — case/dossier numbers (as a list)&lt;br /&gt;
* &amp;lt;code&amp;gt;solution&amp;lt;/code&amp;gt; — outcome/disposition&lt;br /&gt;
* &amp;lt;code&amp;gt;summary&amp;lt;/code&amp;gt; — headnote or summary&lt;br /&gt;
* &amp;lt;code&amp;gt;headnote_classification&amp;lt;/code&amp;gt; — subject classification&lt;br /&gt;
* &amp;lt;code&amp;gt;_importance_level_default&amp;lt;/code&amp;gt; — default importance for ranking&lt;br /&gt;
&lt;br /&gt;
=== Tag construction pattern ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import json&lt;br /&gt;
&lt;br /&gt;
def _tags(**kwargs: object) -&amp;gt; str:&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot;Build a tags JSON string, stripping None values.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    return json.dumps(&lt;br /&gt;
        {k: v for k, v in kwargs.items() if v is not None},&lt;br /&gt;
        ensure_ascii=False,&lt;br /&gt;
    )&lt;br /&gt;
&lt;br /&gt;
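# Usage (sample values for illustration)&lt;br /&gt;
number = &amp;quot;20240001&amp;quot;&lt;br /&gt;
outcome = &amp;quot;&amp;quot;  # empty when the source row has no outcome&lt;br /&gt;
tags = _tags(&lt;br /&gt;
    type=&amp;quot;administrative_decision&amp;quot;,&lt;br /&gt;
    content_quality=&amp;quot;native_structured&amp;quot;,&lt;br /&gt;
    court=&amp;quot;cada&amp;quot;,&lt;br /&gt;
    case_number=[number],&lt;br /&gt;
    solution=outcome or None,  # an empty string becomes None and is dropped&lt;br /&gt;
)&lt;br /&gt;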
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Step 4: Follow edge conventions ==&lt;br /&gt;
&lt;br /&gt;
If your source contains cross-references to other documents (citations, amendments, transpositions, etc.), create edges in the &amp;lt;code&amp;gt;corpus.edges&amp;lt;/code&amp;gt; table. Refer to [[Corpus/Edge types]] for the full taxonomy of ~75 edge types.&lt;br /&gt;
&lt;br /&gt;
Common edge types for new sources:&lt;br /&gt;
* &amp;lt;code&amp;gt;cites&amp;lt;/code&amp;gt; — document A cites document B&lt;br /&gt;
* &amp;lt;code&amp;gt;amends&amp;lt;/code&amp;gt; — document A amends document B&lt;br /&gt;
* &amp;lt;code&amp;gt;repeals&amp;lt;/code&amp;gt; — document A repeals document B&lt;br /&gt;
* &amp;lt;code&amp;gt;implements&amp;lt;/code&amp;gt; — national law implements an EU directive&lt;br /&gt;
* &amp;lt;code&amp;gt;transposes&amp;lt;/code&amp;gt; — national law transposes an EU directive&lt;br /&gt;
* &amp;lt;code&amp;gt;consolidates&amp;lt;/code&amp;gt; — consolidated version of a text&lt;br /&gt;
&lt;br /&gt;
== Step 5: Document the source ==&lt;br /&gt;
&lt;br /&gt;
Create a wiki page at &amp;lt;code&amp;gt;Sources/{jurisdiction}/{source_name}&amp;lt;/code&amp;gt; documenting:&lt;br /&gt;
&lt;br /&gt;
* What the source contains&lt;br /&gt;
* Publisher and license&lt;br /&gt;
* Data format (XML, CSV, JSON, API)&lt;br /&gt;
* Update frequency&lt;br /&gt;
* Known quirks or data quality issues&lt;br /&gt;
* Volume (approximate number of documents)&lt;br /&gt;
* Coverage dates&lt;br /&gt;
&lt;br /&gt;
== Complete example: CADA source ==&lt;br /&gt;
&lt;br /&gt;
The CADA (Commission d&amp;#039;accès aux documents administratifs) source is a good reference implementation for a simple inline source.&lt;br /&gt;
&lt;br /&gt;
=== Registration ===&lt;br /&gt;
&lt;br /&gt;
In &amp;lt;code&amp;gt;duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
register_source(&lt;br /&gt;
    &amp;quot;cada&amp;quot;,&lt;br /&gt;
    jurisdiction=&amp;quot;fr&amp;quot;,&lt;br /&gt;
    name=&amp;quot;CADA&amp;quot;,&lt;br /&gt;
    description=&amp;quot;Avis et conseils de la Commission d&amp;#039;accès aux documents administratifs&amp;quot;,&lt;br /&gt;
    kind=&amp;quot;decision&amp;quot;,&lt;br /&gt;
    publisher=&amp;quot;CADA&amp;quot;,&lt;br /&gt;
    publisher_url=&amp;quot;https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/&amp;quot;,&lt;br /&gt;
    license=&amp;quot;Licence Ouverte 2.0&amp;quot;,&lt;br /&gt;
    license_url=&amp;quot;https://www.etalab.gouv.fr/licence-ouverte-open-licence/&amp;quot;,&lt;br /&gt;
    language=&amp;quot;fr&amp;quot;,&lt;br /&gt;
    date_bounds=DateBounds.strict(min_year=1900),&lt;br /&gt;
)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Downloader ===&lt;br /&gt;
&lt;br /&gt;
In &amp;lt;code&amp;gt;duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/sources/cada.py&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
* Downloads a consolidated CSV (~184 MB, ~60k records, 1984-present) from data.gouv.fr&lt;br /&gt;
* Caches the CSV to disk with change detection via &amp;lt;code&amp;gt;last_modified&amp;lt;/code&amp;gt; timestamp&lt;br /&gt;
* Parses each CSV row into a &amp;lt;code&amp;gt;corpus.documents&amp;lt;/code&amp;gt; record with structured tags&lt;br /&gt;
* Inserts in batches of 5000 using &amp;lt;code&amp;gt;insert_batch()&amp;lt;/code&amp;gt;&lt;br /&gt;
* Cleans up orphaned records (with a safety threshold of 30,000 to avoid accidental mass deletion)&lt;br /&gt;
* Tracks ingest state so unchanged upstream data is skipped&lt;br /&gt;
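&lt;br /&gt;
The batching step can be sketched as a small generator. &amp;lt;code&amp;gt;batched&amp;lt;/code&amp;gt; is a hypothetical helper, and because the call signature of &amp;lt;code&amp;gt;insert_batch()&amp;lt;/code&amp;gt; is not shown in this guide, the sketch stops at producing the batches.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], size: int = 5000) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # input exhausted
        yield batch
```
&lt;br /&gt;
Because the generator consumes its input lazily, the parsed rows never need to be held in memory all at once; only the current batch is.&lt;br /&gt;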
&lt;br /&gt;
=== Record construction ===&lt;br /&gt;
&lt;br /&gt;
Each CADA record maps to:&lt;br /&gt;
* &amp;lt;code&amp;gt;id&amp;lt;/code&amp;gt;: &amp;lt;code&amp;gt;&amp;quot;fr.cada_{number}&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;kind&amp;lt;/code&amp;gt;: &amp;lt;code&amp;gt;&amp;quot;decision&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;source&amp;lt;/code&amp;gt;: &amp;lt;code&amp;gt;&amp;quot;cada&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;title&amp;lt;/code&amp;gt;: &amp;lt;code&amp;gt;&amp;quot;CADA {year} n°{number} — {subject}&amp;quot;&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;tags&amp;lt;/code&amp;gt;: includes &amp;lt;code&amp;gt;type&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;content_quality&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;nature&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;court&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;authority&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;theme&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;case_number&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;solution&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;summary&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;headnote_classification&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Existing French sources ==&lt;br /&gt;
&lt;br /&gt;
For reference, the following sources are currently registered for France:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Source key !! Name !! Kind !! Publisher&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cass&amp;lt;/code&amp;gt; || Cour de cassation || decision || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;inca&amp;lt;/code&amp;gt; || INCA || decision || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;capp&amp;lt;/code&amp;gt; || Cours d&amp;#039;appel || decision || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;jade&amp;lt;/code&amp;gt; || JADE || decision || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;constit&amp;lt;/code&amp;gt; || Conseil constitutionnel || decision || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cnil&amp;lt;/code&amp;gt; || CNIL || decision || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;legi&amp;lt;/code&amp;gt; || LEGI || legislation || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;kali&amp;lt;/code&amp;gt; || KALI || legislation || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;acco&amp;lt;/code&amp;gt; || ACCO || legislation || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;jorf&amp;lt;/code&amp;gt; || JORF || legislation || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;ce_opendata&amp;lt;/code&amp;gt; || Justice administrative (open data) || decision || Conseil d&amp;#039;État&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cada&amp;lt;/code&amp;gt; || CADA || decision || CADA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;bofip&amp;lt;/code&amp;gt; || BOFiP || legislation || DGFiP&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;judilibre&amp;lt;/code&amp;gt; || Judilibre || decision || Cour de cassation&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;jufi&amp;lt;/code&amp;gt; || Juridictions financières || decision || DILA&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;rne&amp;lt;/code&amp;gt; || Registre national des entreprises || record || INPI&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;bodacc&amp;lt;/code&amp;gt; || BODACC || notice || DILA&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Nicolas</name></author>
	</entry>
</feed>