Serving · inference inside a jurisdiction

OpenAI-compatible HTTP API · model pinned to operator · every token receipted

OpenAI · HTTP API compatible
vLLM + TGI · back-ends
Q1 2027 · α opens
per-token · residency receipt
Pre-α. The catalogue + protocol are defined; the runtime fleets land with Asia + Pacific operators. Spec is published so a compliant runtime can be built independently.

What it is

Take any model in the Concord registry, ask the operator of your jurisdiction to serve it, get back an HTTPS endpoint that speaks the OpenAI HTTP API. Same /v1/chat/completions, /v1/embeddings, and /v1/audio/transcriptions shape your client already knows — but the inference happens on operator hardware in the operator's jurisdiction, with a signed per-request receipt naming model version, operator, region, and the puller's institutional key.

Use cases: regulated finance running RAG against models without exporting prompts; national health systems on sovereign weights; defence procurement that needs a chain-of-custody from manifest to token; reproducible eval pipelines pinned to a specific model + operator + date.

Endpoints

OpenAI-shaped · base URL https://serve.eu.concordfaces.org/v1

Method  Path                      Description                                                Status
POST    /v1/chat/completions      OpenAI-compatible chat completion. stream=true supported.  α Q1 2027
POST    /v1/completions           Legacy text completion shape, for older clients.           α Q1 2027
POST    /v1/embeddings            Vector embeddings: encoder + sentence-transformer models.  α Q1 2027
POST    /v1/audio/transcriptions  Speech-to-text (Whisper-class).                            β Q2 2027
POST    /v1/images/generations    Diffusion image generation (FLUX, SDXL, SD3.5).            β Q2 2027
GET     /v1/models                List models hosted by this operator endpoint.              α Q1 2027
GET     /v1/receipts/{id}         Fetch the signed receipt for a prior request id.           α Q1 2027
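
The table lists stream=true as supported on chat completions. As a rough sketch, assuming the operator follows the usual OpenAI streaming convention (server-sent events, one JSON chunk per `data:` line, closed by `data: [DONE]`), a client without the openai package could reassemble the text as below. The `iter_deltas` helper and the sample lines are illustrative and not part of the published spec.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines (assumed wire format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Made-up sample stream, for illustration only.
sample = [
    'data: {"id":"x","choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"id":"x","choices":[{"index":0,"delta":{"content":"Hel"}}]}',
    'data: {"id":"x","choices":[{"index":0,"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))
```

In practice the lines would come from the HTTP response body of a POST to /v1/chat/completions with "stream": true in the request.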

Residency pinning

Inference does not cross a border without a signed token

The operator hosts the model, runs the GPU, and emits the tokens — all inside one jurisdiction. Prompts are not forwarded to other operators. A request from outside the operator's region either resolves at the puller's continental operator (if it also hosts the model), or carries a signed cross-border token that countersigns the receipt. There is no shadow back-end on a foreign cloud.
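
The rule above can be read as a small decision procedure. The sketch below is one interpretation of that paragraph, not the federation's routing code: the `Operator` type, the `route` function, and the outcome strings are all hypothetical, and checking the cross-border token's signature is reduced to a boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Operator:
    region: str
    models: frozenset


def route(puller_region: str, model: str, target: Operator,
          home: Optional[Operator], has_valid_token: bool) -> str:
    """Decide where a request is served under the residency rule (illustrative)."""
    if puller_region == target.region:
        return "serve-local"          # request never leaves the jurisdiction
    if home is not None and model in home.models:
        return "serve-at-home"        # puller's continental operator hosts the model
    if has_valid_token:
        return "serve-cross-border"   # token is countersigned on the receipt
    return "reject"                   # no silent fallback to a foreign back-end


eu = Operator("eu", frozenset({"deepseek-ai/DeepSeek-R1"}))
apac = Operator("apac", frozenset({"deepseek-ai/DeepSeek-R1"}))
print(route("apac", "deepseek-ai/DeepSeek-R1", eu, apac, has_valid_token=False))
```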

receipt.toml · CN-RC-0001 draft
[receipt]
id          = "rcp_01HZ2A…7P3K"
issued_at   = 2027-01-14T11:22:09Z
operator    = "eu:concord-eu"
node        = "eu:ams-3"

[model]
name        = "deepseek-ai/DeepSeek-R1"
version     = "v1"
manifest    = "b3:1f8c…ab02"     # content-addressed

[request]
endpoint     = "/v1/chat/completions"
puller_key   = "eu:acme-bank:k/2027-01"
residency    = "eu → eu"      # inferred jurisdiction → served jurisdiction
in_tokens    = 2_134
out_tokens   = 418

[signature]
alg         = "ed25519"
key         = "eu:europa:k/2027-01"
sig         = "5f3c…d091"

Runtimes

Best-of-breed open source · operator picks the engine, the protocol stays the same

LLM

vLLM

Default text-gen back-end. PagedAttention, prefix caching, speculative decoding. Operator-selectable per model.

LLM (alt)

TGI

HuggingFace text-generation-inference. Selected for models with custom tokenizers or unusual quantizations vLLM doesn't yet support.

Embeddings

text-embeddings-inference

HF TEI for sentence-transformers + BGE + nomic + cross-encoders. Sub-millisecond batched embeddings.

Speech

faster-whisper

CTranslate2-backed Whisper variants. Streaming transcription with per-segment receipts.

Image

diffusers

FLUX, SDXL, SD3.5 generation. Operator may enforce a default safety guard depending on its jurisdiction's regulator.

Custom

Bring-your-own

Any runtime that speaks the OpenAI HTTP shape and produces a signed receipt to spec. Operators register their runtimes in the federation gossip layer.

Pricing shape

Cost-passthrough + sovereignty premium

Each operator publishes its own per-token rate per model. Phase-0 pricing targets cost-passthrough on the GPU hour plus a small premium that funds the operator's audit, compliance, and storage obligations. Rates are posted at serve.<op>.concordfaces.org/v1/pricing in machine-readable form and indexed in the federation gossip layer so a client can shop the federation under a residency constraint.
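
Shopping the federation under a residency constraint might look like the sketch below. The pricing schema is not given here, so the field names (`operator`, `jurisdiction`, `model`, `rate_per_mtok`), the `cheapest` helper, and every rate and operator other than eu:concord-eu are invented for illustration.

```python
from typing import Iterable, Optional


def cheapest(offers: Iterable[dict], model: str, allowed: set) -> Optional[dict]:
    """Pick the lowest posted rate for `model` among allowed jurisdictions."""
    candidates = [
        o for o in offers
        if o["model"] == model and o["jurisdiction"] in allowed
    ]
    return min(candidates, key=lambda o: o["rate_per_mtok"], default=None)


# Hypothetical offers, as if collected from each operator's /v1/pricing.
offers = [
    {"operator": "eu:concord-eu", "jurisdiction": "eu",
     "model": "deepseek-ai/DeepSeek-R1", "rate_per_mtok": 2.40},
    {"operator": "eu:example-b", "jurisdiction": "eu",
     "model": "deepseek-ai/DeepSeek-R1", "rate_per_mtok": 2.10},
    {"operator": "apac:example-c", "jurisdiction": "apac",
     "model": "deepseek-ai/DeepSeek-R1", "rate_per_mtok": 1.80},
]

best = cheapest(offers, "deepseek-ai/DeepSeek-R1", allowed={"eu"})
print(best["operator"] if best else "no operator satisfies the constraint")
```

The cheaper out-of-region offer is ignored because the residency filter is applied before the price comparison.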

No metered surveillance. Pricing is per-token but the receipt does not record the content of prompts or completions — only counts, model version, residency, and puller key. The operator cannot reconstruct your requests from billing data; the billing data is the receipt set, hashed.
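
To illustrate billing from counts alone, the sketch below totals tokens across a set of receipts and derives a digest over the receipt ids. Both the single per-million-token rate and the hashing scheme (SHA-256 over sorted ids) are assumptions made for the example. The source only says the billing data is the receipt set, hashed.

```python
import hashlib
from decimal import Decimal


def bill(receipts: list, rate_per_mtok: Decimal):
    """Return (total tokens, amount owed, digest of the receipt set)."""
    tokens = sum(r["in_tokens"] + r["out_tokens"] for r in receipts)
    amount = Decimal(tokens) * rate_per_mtok / Decimal(1_000_000)
    # Sorting makes the digest independent of receipt order (assumed scheme).
    joined = "\n".join(sorted(r["id"] for r in receipts))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return tokens, amount, digest


# Hypothetical receipts: ids and counts only, no prompt or completion text.
receipts = [
    {"id": "rcp_a", "in_tokens": 2134, "out_tokens": 418},
    {"id": "rcp_b", "in_tokens": 866, "out_tokens": 582},
]
tokens, amount, digest = bill(receipts, Decimal("2.50"))
print(tokens, amount, digest[:12])
```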

SLOs + observability

Operator-published · per-model · per-region

Each operator publishes per-model SLO targets (p50 / p95 / p99 first-token latency, tokens-per-second steady-state, monthly availability) and a public status page at status.<op>.concordfaces.org. Breach of a posted SLO is a credit owed by the operator under its standard terms. The federation does not arbitrate operator SLOs; each operator is accountable under its seat's law.
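
A client that wants to check a posted latency SLO against its own measurements could compute the same percentiles locally. This is a minimal sketch with invented target values and synthetic samples. How an operator actually measures first-token latency, and over what window, is not specified here.

```python
import statistics


def percentiles(samples_ms: list) -> dict:
    """p50 / p95 / p99 from raw first-token latency samples in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def breaches(observed: dict, targets: dict) -> list:
    """Names of the percentiles whose observed value exceeds the posted target."""
    return [name for name, limit in targets.items() if observed[name] > limit]


samples = list(range(1, 101))                    # synthetic: 1..100 ms
targets = {"p50": 60, "p95": 90, "p99": 120}     # hypothetical posted SLO
print(breaches(percentiles(samples), targets))
```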

Quickstart

Drop-in OpenAI client · point base_url at the operator

curl · chat/completions
$ curl https://serve.eu.concordfaces.org/v1/chat/completions \
  -H "Authorization: Bearer $CONCORD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
python · openai · base_url override
>>> import os
>>> from openai import OpenAI
>>> client = OpenAI(
...     base_url="https://serve.eu.concordfaces.org/v1",
...     api_key=os.environ["CONCORD_API_KEY"],
... )
>>> r = client.chat.completions.create(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     messages=[{"role":"user","content":"Hello"}],
... )
>>> r.id  # signed receipt id
Receipt every response. The id on every response maps to a signed receipt at /v1/receipts/{id}. Pull and store receipts to discharge audit obligations; they are admissible in the operator's jurisdiction.
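
Pulling a receipt for storage could start from a request like the one below. It is a sketch under two assumptions: that /v1/receipts/{id} accepts the same Bearer key as the other endpoints, and that the id is used in the path as-is. `receipt_request` is a hypothetical helper, and the example only builds the request without sending it.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://serve.eu.concordfaces.org/v1"


def receipt_request(response_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET for the signed receipt of a response id."""
    url = f"{BASE_URL}/receipts/{quote(response_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="GET",
    )


req = receipt_request("rcp_example", "sk-placeholder")
print(req.get_method(), req.full_url)
# To fetch: urllib.request.urlopen(req).read(), then archive the bytes unmodified
# so the operator's signature can still be checked later.
```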