Training and evaluation corpora with per-row jurisdiction — not just per-dataset
Residency at row level, not just dataset level
Most registries pin residency to the whole dataset. Concord pins it to the row. A single corpus can ship rows tagged "eu", "na", and "any" in the same manifest. The serving operator filters rows on the wire: an EU puller sees every row whose residency permits EU; a JP puller sees only rows tagged "any" or "as". The puller cannot fetch what the residency clause refuses, and the refusal is signed.
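The filtering rule above can be sketched in a few lines of Python. This is only an illustration, not Concord's actual implementation: the `PERMITTED_TAGS` table, the `visible_rows` helper and the row dictionaries are hypothetical, and the mapping from puller region to permitted tags is inferred from the two examples in the text (EU and JP).

```python
# Hypothetical sketch of row-level residency filtering; not Concord's real code.
# Residency tags each puller region may receive. The JP entry mirrors the text
# ("any" or "as"); the EU entry is an assumption ("eu" plus "any").
PERMITTED_TAGS = {
    "eu": {"any", "eu"},
    "jp": {"any", "as"},
}


def visible_rows(rows, puller_region):
    """Split rows into (served, refused) for a puller in the given region."""
    # Assumption: an unknown region only receives rows tagged "any".
    allowed = PERMITTED_TAGS.get(puller_region, {"any"})
    served = [r for r in rows if r["residency"] in allowed]
    refused = [r for r in rows if r["residency"] not in allowed]
    return served, refused


rows = [
    {"id": 1, "residency": "eu"},
    {"id": 2, "residency": "na"},
    {"id": 3, "residency": "any"},
    {"id": 4, "residency": "as"},
]

eu_served, eu_refused = visible_rows(rows, "eu")
jp_served, jp_refused = visible_rows(rows, "jp")
print([r["id"] for r in eu_served])  # ids an EU puller would receive
print([r["id"] for r in jp_served])  # ids a JP puller would receive
```

In the real system the refused list would presumably feed the signed refusal; here it is returned only so the split is visible.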
    [dataset]
    name = "concord/finance-eval-multilingual"
    version = "v1.2"

    [[shard]]
    role = "questions"
    format = "parquet"
    rows = 412_080
    residency = "any"          # every puller can pull
    merkle = "b3:1f8c…ab02"

    [[shard]]
    role = "labels-eu"
    format = "parquet"
    rows = 247_201
    residency = "eu"           # EU-pullers only · cross-border refused
    privacy = "pseudonymous"
    merkle = "b3:4d22…7e91"
datasets compat
HF datasets library works unchanged · OpenSearch-style byte-range reads
Datasets live as parquet/arrow shards behind a HuggingFace-protocol shim. Set HF_ENDPOINT once and the standard datasets library streams from the EU operator. The concord dataset stream verb consumes a dataset row by row, for pipelines that can't load a whole shard.
    $ pip install datasets
    $ export HF_ENDPOINT=https://api.eu.concordfaces.org/hf

    >>> from datasets import load_dataset
    >>> ds = load_dataset(
    ...     "HuggingFaceFW/fineweb-edu",
    ...     streaming=True,
    ... )
    >>> for row in ds["train"]:
    ...     process(row)
    # Pull the whole dataset (parquet/arrow shards).
    $ concord dataset pull cais/mmlu

    # Stream row-by-row without materialising the file.
    $ concord dataset stream cais/mmlu --split=test

    # Inspect signed manifest + residency clauses.
    $ concord dataset show cais/mmlu
Declared on the manifest · enforced by the serving operator
No personal data. Can be pulled by anyone in any jurisdiction the residency clause permits. Default tier for encyclopedic corpora and synthetic data.
Stable identifiers without direct PII (hashed user ids, region-stripped geocodes, k-anonymous aggregates). Cross-border pull permitted only with a signed pseudonym-respect token.
Direct PII, regulated data (health, finance KYC), or operator-classified sensitive content. Requires a signed access grant tied to an institutional key; every fetch is logged + countersigned by the operator.
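The three tiers above amount to a small decision rule, sketched below. Everything here is an assumption made for illustration: only "pseudonymous" appears as a value in the manifest example, so the tier names "public" and "restricted", the `PullRequest` fields and the `tier_allows` function are hypothetical. The residency clause is a separate check and is not modelled.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    cross_border: bool = False     # puller is outside the data's jurisdiction
    pseudonym_token: bool = False  # holds a signed pseudonym-respect token
    access_grant: bool = False     # holds a signed grant tied to an institutional key


def tier_allows(tier, req):
    """Return True if the privacy tier alone would let the pull proceed."""
    if tier == "public":        # hypothetical name for the no-personal-data tier
        return True
    if tier == "pseudonymous":
        # Cross-border pulls need the token; in-jurisdiction pulls do not.
        return req.pseudonym_token or not req.cross_border
    if tier == "restricted":    # hypothetical name for the direct-PII tier
        return req.access_grant
    raise ValueError(f"unknown privacy tier: {tier!r}")


print(tier_allows("pseudonymous", PullRequest(cross_border=True)))
print(tier_allows("pseudonymous", PullRequest(cross_border=True, pseudonym_token=True)))
print(tier_allows("restricted", PullRequest(access_grant=True)))
```

A real operator would verify the token and grant signatures rather than trust booleans, and would log and countersign every fetch in the strictest tier.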
SPDX-tagged · provenance on every pull
SPDX license identifiers supported out of the box: CC0-1.0, CC-BY-4.0, CC-BY-SA-4.0, CC-BY-NC-4.0, ODbL-1.0, ODC-By-1.0, MIT, Apache-2.0.
A dataset without a declared license never enters the operator catalogue.
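The admission rule can be sketched as a simple allow-list check. This is an illustration only: the `license` manifest key and the `admit_to_catalogue` function are hypothetical, and rejecting an identifier outside the listed set is an assumption, since the text only states that an undeclared license is refused.

```python
# SPDX identifiers listed as supported out of the box.
SUPPORTED_LICENSES = frozenset({
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-4.0",
    "ODbL-1.0", "ODC-By-1.0", "MIT", "Apache-2.0",
})


def admit_to_catalogue(dataset_meta):
    """Return (admitted, reason) for a dataset's metadata dictionary."""
    license_id = dataset_meta.get("license")  # hypothetical manifest key
    if not license_id:
        return False, "no declared license"
    if license_id not in SUPPORTED_LICENSES:
        # Assumption: identifiers outside the list are also refused.
        return False, f"unsupported license: {license_id}"
    return True, "ok"


print(admit_to_catalogue({"name": "example/corpus", "license": "CC-BY-4.0"}))
print(admit_to_catalogue({"name": "example/unlicensed"}))
```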
Every pull writes a signed receipt: dataset hash, version, residency observed,
puller's institutional key, timestamp. Receipts are admissible in the jurisdiction
of the serving operator.
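To make the receipt concrete, here is a minimal sketch of signing and verifying one. It is not Concord's receipt format: the field names and values are hypothetical placeholders for the five items listed above, and HMAC-SHA256 with a shared secret stands in for whatever signature scheme the operator actually uses (most likely an asymmetric one).

```python
import hashlib
import hmac
import json


def _canonical(fields):
    """Serialise the receipt deterministically so the signature is reproducible."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_receipt(fields, operator_secret):
    """Attach an HMAC-SHA256 signature (a stand-in scheme) to the receipt fields."""
    signature = hmac.new(operator_secret, _canonical(fields), hashlib.sha256).hexdigest()
    return {"receipt": fields, "signature": signature}


def verify_receipt(signed, operator_secret):
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(operator_secret, _canonical(signed["receipt"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


secret = b"demo-operator-secret"  # placeholder key for the example only
signed = sign_receipt(
    {
        "dataset_hash": "b3:example",           # placeholder, not a real hash
        "version": "v1.2",
        "residency_observed": "eu",
        "puller_key": "institution-key-demo",   # placeholder institutional key id
        "timestamp": "2024-01-01T00:00:00Z",    # placeholder timestamp
    },
    secret,
)
print(verify_receipt(signed, secret))

# Changing any field after signing invalidates the receipt.
tampered = {"receipt": dict(signed["receipt"], residency_observed="any"),
            "signature": signed["signature"]}
print(verify_receipt(tampered, secret))
```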
A row tagged "eu" cannot leave the EU operator's jurisdiction without a signed
cross-border token, but once it has left, bytes do not recognise borders.
Concord guarantees signed refusal at the boundary + admissible audit if a
boundary is crossed. It does not guarantee that an attacker controlling their
own infrastructure cannot exfiltrate data they have already received. Same
disclaimer as the enforcement section, restated
plainly because data carries more legal weight than weights.
concord CLI · install once · pull from your jurisdiction
    # Install the CLI (Linux / macOS).
    $ curl -fsSL https://concordfaces.org/install.sh | sh

    # Pull a dataset.
    $ concord dataset pull HuggingFaceFW/fineweb-edu

    # Stream row-by-row.
    $ concord dataset stream cais/mmlu --split=test

    # Verify signature + residency clause.
    $ concord dataset verify HuggingFaceFW/fineweb-edu