Datasets · residency-first registry

Training and evaluation corpora with per-row jurisdiction — not just per-dataset

30 datasets in preview
EU jurisdiction at β
Q4 2026 public β
per-row residency granularity
Pre-β. The catalogue below previews what the EU operator will ship at datasets β. Streaming and cross-jurisdictional manifests ship together.

Catalogue

# Dataset Task Size License Status

Per-row residency

Residency at row level, not just dataset level

Most registries pin residency to the whole dataset. Concord pins it to the row. A single corpus can ship rows tagged eu, na, and any in the same manifest. The serving operator filters rows on the wire: an EU puller sees every row whose residency permits EU; a JP puller sees only rows tagged any or as. The puller cannot fetch what the residency clause refuses, and the refusal is signed.

dataset-manifest.toml · CN-DS-0001 draft
[dataset]
name    = "concord/finance-eval-multilingual"
version = "v1.2"

[[shard]]
role      = "questions"
format    = "parquet"
rows      = 412_080
residency = "any"      # every puller can pull
merkle    = "b3:1f8c…ab02"

[[shard]]
role      = "labels-eu"
format    = "parquet"
rows      = 247_201
residency = "eu"       # EU-pullers only · cross-border refused
privacy   = "pseudonymous"
merkle    = "b3:4d22…7e91"

Streaming + datasets compat

HF datasets library works unchanged · OpenSearch-style byte-range reads

Datasets live as parquet/arrow shards behind a Hugging Face-protocol shim. Set HF_ENDPOINT once and the standard datasets library streams from the EU operator. The concord dataset stream verb consumes a dataset row by row, for pipelines that can't load a whole shard.

python · datasets · streaming
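# Install the datasets library and point it at the EU operator.
$ pip install datasets
$ export HF_ENDPOINT=https://api.eu.concordfaces.org/hf

>>> from datasets import load_dataset
>>> # streaming=True iterates rows lazily instead of downloading every shard first.
>>> ds = load_dataset(
...     "HuggingFaceFW/fineweb-edu",
...     streaming=True,
... )
>>> for row in ds["train"]:
...     process(row)  # process() stands in for your own per-row handler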
concord dataset · byte-range stream
# Pull the whole dataset (parquet/arrow shards).
$ concord dataset pull cais/mmlu

# Stream row-by-row without materialising the file.
$ concord dataset stream cais/mmlu --split=test

# Inspect signed manifest + residency clauses.
$ concord dataset show cais/mmlu

Privacy tiers

Declared on the manifest · enforced by the serving operator

public

No personal data. Pulled by anyone in any jurisdiction the residency clause permits. Default tier for encyclopedic corpora and synthetic data.

pseudonymous

Stable identifiers without direct PII (hashed user ids, region-stripped geocodes, k-anonymous aggregates). Cross-border pull permitted only with a signed pseudonym-respect token.

restricted

Direct PII, regulated data (health, finance KYC), or operator-classified sensitive content. Requires a signed access grant tied to an institutional key; every fetch is logged + countersigned by the operator.
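
One way to read the three tiers is as a small decision function evaluated by the serving operator. The sketch below is an assumption-laden illustration, not Concord's enforcement code: `PullRequest`, `residency_permits`, and `may_pull` are hypothetical names, token and grant verification is reduced to booleans, and "cross-border" is taken to mean that the puller's region differs from the operator's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    puller_region: str                 # e.g. "eu", "na"
    has_pseudonym_token: bool = False  # signed pseudonym-respect token presented
    has_access_grant: bool = False     # signed grant tied to an institutional key


def residency_permits(residency: str, region: str) -> bool:
    # Assumed rule: "any" admits every region, otherwise the tags must match.
    return residency in ("any", region)


def may_pull(privacy: str, residency: str, operator_region: str,
             req: PullRequest) -> bool:
    """Decide whether a pull is allowed under the declared privacy tier."""
    # The residency clause is checked first for every tier.
    if not residency_permits(residency, req.puller_region):
        return False
    cross_border = req.puller_region != operator_region
    if privacy == "public":
        return True
    if privacy == "pseudonymous":
        # Cross-border pulls additionally need the pseudonym-respect token.
        return req.has_pseudonym_token or not cross_border
    if privacy == "restricted":
        # Always requires a signed access grant.
        return req.has_access_grant
    return False  # unknown tier: fail closed


print(may_pull("pseudonymous", "any", "eu", PullRequest("na")))  # False
print(may_pull("pseudonymous", "any", "eu",
               PullRequest("na", has_pseudonym_token=True)))     # True
```

The sketch fails closed on unknown tiers, which matches the spirit of a signed refusal. The logging and countersigning of restricted fetches is not modelled.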

Licenses + audit

SPDX-tagged · provenance on every pull

SPDX license identifiers supported out of the box: CC0-1.0, CC-BY-4.0, CC-BY-SA-4.0, CC-BY-NC-4.0, ODbL-1.0, ODC-By-1.0, MIT, Apache-2.0. A dataset without a declared license never enters the operator catalogue. Every pull writes a signed receipt: dataset hash, version, residency observed, puller's institutional key, timestamp. Receipts are admissible in the jurisdiction of the serving operator.
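
The license gate and the receipt fields listed above could look roughly like this in Python. Everything here is a sketch under stated assumptions: the function names and receipt field names are hypothetical, the gate also rejects identifiers outside the supported list (the text only promises rejection of undeclared licenses), and HMAC-SHA256 with a shared secret stands in for whatever signature scheme the operator actually uses, which is presumably asymmetric.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from typing import Optional

SUPPORTED_LICENSES = frozenset({
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-4.0",
    "ODbL-1.0", "ODC-By-1.0", "MIT", "Apache-2.0",
})


def admit_to_catalogue(license_id: Optional[str]) -> bool:
    """Reject datasets with no declared license (or an unsupported one)."""
    return license_id in SUPPORTED_LICENSES


def _canonical(body: dict) -> bytes:
    # Stable serialisation so the same receipt always signs identically.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def build_receipt(dataset_hash: str, version: str, residency: str,
                  puller_key_id: str, operator_secret: bytes,
                  timestamp: Optional[str] = None) -> dict:
    """Assemble the receipt fields and attach a stand-in HMAC signature."""
    body = {
        "dataset_hash": dataset_hash,
        "version": version,
        "residency_observed": residency,
        "puller_key": puller_key_id,
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
    }
    sig = hmac.new(operator_secret, _canonical(body), hashlib.sha256).hexdigest()
    return {**body, "signature": sig}


def verify_receipt(receipt: dict, operator_secret: bytes) -> bool:
    body = {k: v for k, v in receipt.items() if k != "signature"}
    expected = hmac.new(operator_secret, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])


# Hypothetical values, for illustration only.
secret = b"demo-operator-secret"
receipt = build_receipt("b3:example", "v1.2", "eu", "inst-key-demo", secret,
                        timestamp="2026-01-01T00:00:00+00:00")
print(admit_to_catalogue("CC-BY-4.0"), admit_to_catalogue(None))  # True False
print(verify_receipt(receipt, secret))                             # True
```

Changing any field after signing makes `verify_receipt` return False, which is the property an audit trail needs from a receipt.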

What residency does not enforce. A dataset row tagged eu cannot leave the EU operator's jurisdiction without a signed cross-border token — but once it has left, bytes do not recognise borders. Concord guarantees signed refusal at the boundary + admissible audit if a boundary is crossed. It does not guarantee that an attacker controlling their own infrastructure cannot exfiltrate data they have already received. Same disclaimer as the enforcement section, restated plainly because data carries more legal weight than weights.

Quickstart

concord CLI · install once · pull from your jurisdiction

concord · datasets β
# Install the CLI (Linux / macOS).
$ curl -fsSL https://concordfaces.org/install.sh | sh

# Pull a dataset.
$ concord dataset pull HuggingFaceFW/fineweb-edu

# Stream row-by-row.
$ concord dataset stream cais/mmlu --split=test

# Verify signature + residency clause.
$ concord dataset verify HuggingFaceFW/fineweb-edu