= Adding a source =

This guide explains how to add a new data source to the Dura Lex ingest pipeline. A "source" is a public institutional data feed (legislation, case law, company records, official notices) that gets downloaded, parsed, and inserted into the corpus.

== Overview ==

Adding a source involves these steps:

# Register the source in the source registry
# Implement a downloader (and optionally a parser)
# Follow the tag conventions
# Follow the edge conventions
# Document the source on the wiki

== Step 1: Register the source ==

Every source must be registered in the jurisdiction's <code>source_registry.py</code> module. Registration happens at import time via the <code>register_source()</code> function from <code>duralex.ingest.source_registry</code>.

=== Location ===

* French sources: <code>duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py</code>
* EU sources: <code>duralex-ingest/duralex-ingest-eu/src/duralex/ingest/eu/source_registry.py</code>
* New jurisdictions: create <code>duralex-ingest/duralex-ingest-{jur}/src/duralex/ingest/{jur}/source_registry.py</code>

=== Required fields ===

The <code>register_source()</code> function requires the following fields:

{| class="wikitable"
! Field !! Type !! Description !! Example
|-
| <code>source_key</code> || <code>str</code> (positional) || Unique short identifier for the source || <code>"cada"</code>
|-
| <code>jurisdiction</code> || <code>str</code> || ISO-style jurisdiction code || <code>"fr"</code>, <code>"eu"</code>
|-
| <code>name</code> || <code>str</code> || Human-readable source name || <code>"CADA"</code>
|-
| <code>description</code> || <code>str</code> || What this source contains || <code>"Avis et conseils de la Commission d'accès aux documents administratifs"</code>
|-
| <code>kind</code> || <code>str</code> || One of the 6 structural kinds: <code>legislation</code>, <code>decision</code>, <code>record</code>, <code>notice</code>, <code>section</code>, <code>chunk</code> || <code>"decision"</code>
|-
| <code>publisher</code> || <code>str</code> || Organization that publishes the data || <code>"CADA"</code>, <code>"DILA"</code>
|-
| <code>publisher_url</code> || <code>str</code> || URL where the data can be found || <code>"https://www.data.gouv.fr/fr/datasets/..."</code>
|-
| <code>license</code> || <code>str</code> || License name || <code>"Licence Ouverte 2.0"</code>
|-
| <code>license_url</code> || <code>str</code> || URL of the license || <code>"https://www.etalab.gouv.fr/licence-ouverte-open-licence/"</code>
|-
| <code>language</code> || <code>str</code> || ISO 639-1 language code || <code>"fr"</code>, <code>"en"</code>
|-
| <code>date_bounds</code> || <code>DateBounds</code> || Valid date range for this source || see below
|}

=== DateBounds ===

<code>DateBounds</code> validates that document dates fall within a plausible range. Two factory methods are available:

* <code>DateBounds.strict(min_year=NNNN)</code>: for sources whose documents cannot have future dates (decisions, records, notices). The maximum year defaults to the current year plus a small margin.
* <code>DateBounds.permissive(min_year=NNNN, max_year=NNNN)</code>: for legislation with legitimate forward entry-into-force or expiration dates.
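To make the difference between the two factories concrete, here is a minimal sketch of year-range validation. <code>YearBoundsSketch</code>, its <code>accepts()</code> method, the <code>margin_years</code> parameter, and the sample years are all hypothetical names and values chosen for illustration; this is not the real <code>DateBounds</code> class, whose internals and exact margin may differ.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class YearBoundsSketch:
    """Hypothetical stand-in for DateBounds; not the real class."""

    min_year: int
    max_year: int

    @classmethod
    def strict(cls, min_year: int, margin_years: int = 1) -> "YearBoundsSketch":
        # Assumption: the upper bound follows the current year plus a small margin.
        return cls(min_year=min_year, max_year=date.today().year + margin_years)

    @classmethod
    def permissive(cls, min_year: int, max_year: int) -> "YearBoundsSketch":
        # The caller chooses the upper bound, so forward dates can be allowed.
        return cls(min_year=min_year, max_year=max_year)

    def accepts(self, value: date) -> bool:
        """Return True when the date's year lies inside the bounds (inclusive)."""
        return self.min_year <= value.year <= self.max_year


# Illustrative bounds only; real sources pick their own years.
decisions = YearBoundsSketch.strict(min_year=1900)
legislation = YearBoundsSketch.permissive(min_year=1789, max_year=2100)

far_future = date(date.today().year + 50, 1, 1)
print(decisions.accepts(date(1984, 6, 1)))    # True
print(decisions.accepts(far_future))          # False
print(legislation.accepts(date(2090, 1, 1)))  # True
</syntaxhighlight>

In this sketch a strict source rejects a date far in the future, while a permissive one accepts it as long as it stays under the chosen <code>max_year</code>.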
=== Example: simple source registration ===

<syntaxhighlight lang="python">
from duralex.ingest.date_validation import DateBounds
from duralex.ingest.source_registry import register_source

register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>

=== Sub-sources ===

If your source has logical sub-feeds (e.g., different court levels within a single API), register them as sub-sources. Sub-sources inherit all metadata from their parent:

<syntaxhighlight lang="python">
from duralex.ingest.source_registry import register_sub_source

# Parent must be registered first
for sub_key in ("judilibre_cc", "judilibre_ca", "judilibre_tj", "judilibre_tcom"):
    register_sub_source(sub_key, "judilibre")
</syntaxhighlight>

== Step 2: Implement a downloader ==

=== BaseDownloader (for DILA-style tar.gz archives) ===

If your source publishes data as <code>.tar.gz</code> archives on an HTTP index page (like DILA sources), subclass <code>BaseDownloader</code> from <code>duralex.ingest.sources.base_downloader</code>.

<code>BaseDownloader</code> provides:

* HTTP retry with exponential backoff and 429/Retry-After support
* Listing remote files from an HTML index page
* Downloading with partial-download safety (<code>.tmp</code> files + atomic rename)
* Extracting <code>.tar.gz</code> archives with prefix stripping
* State tracking (last downloaded diff filename) for incremental updates
* Freemium base file + incremental diff pattern

Constructor parameters:

{| class="wikitable"
! Parameter !! Description
|-
| <code>dataset_name</code> || Identifier for this dataset (e.g. <code>"cass"</code>, <code>"legi"</code>)
|-
| <code>data_directory</code> || Root directory for extracted data and archives
|-
| <code>base_url</code> || URL of the HTTP index listing <code>.tar.gz</code> files
|-
| <code>get_state</code> || Callback returning the last processed filename, or <code>None</code>
|-
| <code>set_state</code> || Callback to persist the last processed filename
|-
| <code>archive_prefix_markers</code> || Directory names marking the start of useful content in tar archives (e.g. <code>["JURI", "CETA"]</code>)
|}

The <code>run()</code> method is the main entry point. It downloads the freemium base (on first run) and/or incremental diffs, extracts them, and returns a list of extracted XML file paths. State is NOT written by <code>run()</code>; call <code>commit_state()</code> after a successful ingest.

=== Inline downloader (for non-archive sources) ===

For sources that do not follow the tar.gz archive pattern (e.g., CSV files, API endpoints), write a standalone downloader class that handles its own parsing and DB insertion. This is called an "inline source".

The CADA downloader is a good example of an inline source. Key patterns:

<syntaxhighlight lang="python">
from __future__ import annotations

from pathlib import Path

# IngestErrors is the pipeline's error collector; its import is omitted here.


class CadaDownloader:
    """Downloads CADA CSV from data.gouv.fr and ingests into corpus.documents."""

    def __init__(self, dsn: str, data_directory: Path) -> None:
        self.dsn = dsn
        self._cache_directory = data_directory / "cada"

    def run(self, errors: IngestErrors | None = None) -> list[str]:
        """Download, parse, and insert. Returns []."""
        # 1. Check upstream for changes (compare last_modified)
        # 2. Download or use cached file
        # 3. Parse rows into corpus.documents records
        # 4. Insert in batches using insert_batch()
        # 5. Clean up orphaned records
        # 6. Update ingest state
        ...
</syntaxhighlight>

Key conventions for inline downloaders:

* Accept <code>dsn</code> (PostgreSQL connection string) and <code>data_directory</code> (root for cached files) in the constructor
* Accept an optional <code>errors: IngestErrors | None</code> parameter in <code>run()</code> for error tracking
* Cache downloaded files to disk, writing to a <code>.tmp</code> file during download and renaming it atomically on completion
* Store a sidecar metadata file (e.g., <code>.meta</code>) for change detection
* Use <code>get_ingest_state()</code> / <code>set_ingest_state()</code> from <code>duralex.ingest.state</code> to track whether the upstream data has changed
* Use <code>insert_batch()</code> from <code>duralex.ingest.database.batch_writer</code> for bulk insertion
* Return an empty list (inline sources handle their own insertion)

=== Document record format ===

Each record inserted into <code>corpus.documents</code> must have:

{| class="wikitable"
! Field !! Description
|-
| <code>id</code> || Globally unique document ID, prefixed with jurisdiction (e.g. <code>"fr.cada_20240001"</code>)
|-
| <code>kind</code> || One of: <code>legislation</code>, <code>decision</code>, <code>record</code>, <code>notice</code>, <code>section</code>, <code>chunk</code>
|-
| <code>jurisdiction</code> || Jurisdiction code (e.g. <code>"fr"</code>, <code>"eu"</code>)
|-
| <code>language</code> || ISO 639-1 language code
|-
| <code>source</code> || Source key matching the registered source
|-
| <code>date</code> || Primary date (ISO 8601 format <code>YYYY-MM-DD</code>), or <code>None</code>
|-
| <code>date_end</code> || End date for documents with a date range, or <code>None</code>
|-
| <code>parent_id</code> || Parent document ID for hierarchical sources, or <code>None</code>
|-
| <code>title</code> || Human-readable title
|-
| <code>body</code> || Clean displayable content (HTML or formatted text). Immutable after ingestion.
|-
| <code>body_search</code> || Indexable text for FTS (can be noisy: PDF OCR, etc.), or <code>None</code> to use <code>body</code>
|-
| <code>tags</code> || JSON string of metadata tags (see below)
|}

== Step 3: Follow tag conventions ==

Tags are the primary metadata mechanism. Every document gets a <code>tags</code> JSON object. Refer to [[Corpus/Tag conventions]] for the full shared vocabulary.

=== Mandatory tags ===

* <code>type</code>: legal document type (e.g. <code>"law"</code>, <code>"decree"</code>, <code>"judgment"</code>, <code>"administrative_decision"</code>). See the tag conventions for the full list.
* <code>content_quality</code>: quality of the body content (e.g. <code>"native_structured"</code>, <code>"ocr_raw"</code>, <code>"metadata_only"</code>)

=== Common optional tags ===

* <code>nature</code>: source-specific document nature
* <code>court</code>: court identifier (for decisions)
* <code>authority</code>: issuing authority
* <code>case_number</code>: case/dossier numbers (as a list)
* <code>solution</code>: outcome/disposition
* <code>summary</code>: headnote or summary
* <code>headnote_classification</code>: subject classification
* <code>_importance_level_default</code>: default importance for ranking

=== Tag construction pattern ===

<syntaxhighlight lang="python">
import json


def _tags(**kwargs: object) -> str:
    """Build a tags JSON string, stripping None values."""
    return json.dumps(
        {k: v for k, v in kwargs.items() if v is not None},
        ensure_ascii=False,
    )


# Usage (number and outcome come from the parsed source row)
tags = _tags(
    type="administrative_decision",
    content_quality="native_structured",
    court="cada",
    case_number=[number],
    solution=outcome or None,  # an empty outcome becomes None and is dropped
)
</syntaxhighlight>

== Step 4: Follow edge conventions ==

If your source contains cross-references to other documents (citations, amendments, transpositions, etc.), create edges in the <code>corpus.edges</code> table. Refer to [[Corpus/Edge types]] for the full taxonomy of ~75 edge types.
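As a rough illustration of turning cross-references into edge rows, the sketch below builds one edge per distinct target and skips self-references. <code>EdgeSketch</code>, <code>build_edges()</code>, and the field names <code>source_id</code>, <code>target_id</code>, and <code>edge_type</code> are assumptions made for this example, as are the placeholder target IDs; the actual <code>corpus.edges</code> columns and insertion helpers may differ.

<syntaxhighlight lang="python">
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeSketch:
    """Hypothetical edge row; the real corpus.edges columns may differ."""

    source_id: str
    target_id: str
    edge_type: str


def build_edges(source_id: str, target_ids: list[str], edge_type: str) -> list[EdgeSketch]:
    """Return one edge per distinct target, skipping self-references."""
    seen: set[str] = set()
    edges: list[EdgeSketch] = []
    for target_id in target_ids:
        # Ignore a document referencing itself and repeated references.
        if target_id == source_id or target_id in seen:
            continue
        seen.add(target_id)
        edges.append(EdgeSketch(source_id, target_id, edge_type))
    return edges


# Placeholder target IDs, for illustration only.
edges = build_edges(
    "fr.cada_20240001",
    ["fr.example_a", "fr.example_a", "fr.cada_20240001", "fr.example_b"],
    "cites",
)
print([edge.target_id for edge in edges])  # ['fr.example_a', 'fr.example_b']
</syntaxhighlight>

Deduplicating before insertion keeps a document that cites the same text several times from producing several identical edges.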
Common edge types for new sources:

* <code>cites</code>: document A cites document B
* <code>amends</code>: document A amends document B
* <code>repeals</code>: document A repeals document B
* <code>implements</code>: national law implements an EU directive
* <code>transposes</code>: national law transposes an EU directive
* <code>consolidates</code>: consolidated version of a text

== Step 5: Document the source ==

Create a wiki page at <code>Sources/{jurisdiction}/{source_name}</code> documenting:

* What the source contains
* Publisher and license
* Data format (XML, CSV, JSON, API)
* Update frequency
* Known quirks or data quality issues
* Volume (approximate number of documents)
* Coverage dates

== Complete example: CADA source ==

The CADA (Commission d'accès aux documents administratifs) source is a good reference implementation for a simple inline source.

=== Registration ===

In <code>duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py</code>:

<syntaxhighlight lang="python">
register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>

=== Downloader ===

In <code>duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/sources/cada.py</code>:

* Downloads a consolidated CSV (~184 MB, ~60k records, 1984-present) from data.gouv.fr
* Caches the CSV to disk with change detection via <code>last_modified</code> timestamp
* Parses each CSV row into a <code>corpus.documents</code> record with structured tags
* Inserts in batches of 5000 using <code>insert_batch()</code>
* Cleans up orphaned records (with a safety threshold of 30,000 to avoid accidental mass deletion)
* Tracks ingest state so unchanged upstream data is skipped

=== Record construction ===

Each CADA record maps to:

* <code>id</code>: <code>"fr.cada_{number}"</code>
* <code>kind</code>: <code>"decision"</code>
* <code>source</code>: <code>"cada"</code>
* <code>title</code>: <code>"CADA {year} n°{number} — {subject}"</code>
* <code>tags</code>: includes <code>type</code>, <code>content_quality</code>, <code>nature</code>, <code>court</code>, <code>authority</code>, <code>theme</code>, <code>case_number</code>, <code>solution</code>, <code>summary</code>, <code>headnote_classification</code>

== Existing French sources ==

For reference, the following sources are currently registered for France:

{| class="wikitable"
! Source key !! Name !! Kind !! Publisher
|-
| <code>cass</code> || Cour de cassation || decision || DILA
|-
| <code>inca</code> || INCA || decision || DILA
|-
| <code>capp</code> || Cours d'appel || decision || DILA
|-
| <code>jade</code> || JADE || decision || DILA
|-
| <code>constit</code> || Conseil constitutionnel || decision || DILA
|-
| <code>cnil</code> || CNIL || decision || DILA
|-
| <code>legi</code> || LEGI || legislation || DILA
|-
| <code>kali</code> || KALI || legislation || DILA
|-
| <code>acco</code> || ACCO || legislation || DILA
|-
| <code>jorf</code> || JORF || legislation || DILA
|-
| <code>ce_opendata</code> || Justice administrative (open data) || decision || Conseil d'État
|-
| <code>cada</code> || CADA || decision || CADA
|-
| <code>bofip</code> || BOFiP || legislation || DGFiP
|-
| <code>judilibre</code> || Judilibre || decision || Cour de cassation
|-
| <code>jufi</code> || Juridictions financières || decision || DILA
|-
| <code>rne</code> || Registre national des entreprises || record || INPI
|-
| <code>bodacc</code> || BODACC || notice || DILA
|}

[[Category:Development]]