// product

Run open frontier models
through a single API.

Open models, operated for you on EU infrastructure, the full inference stack behind a single endpoint. Sovereignty + flat rate + zero logs.

// architecture

One endpoint. It never leaves the EU.

A request hits a single OpenAI-compatible URL, is routed and rate-limited in our control plane, and answered by open models on managed GPUs — all inside the EU, none of it logged.

Your request enters one EU endpoint and never leaves — no prompt is stored, no data crosses to a US hyperscaler.

// guarantees

Four things, by architecture.

Not features you configure — properties of how the platform is built. They hold for every model and every use case.

Unlimited tokens

No caps on consumption — only RPM and concurrency per API key.

OpenAI-compatible

Change the base URL and key. Every OpenAI-compatible client works as-is.

Zero logs

Prompts are never stored. Your data and code never train a model.

Data in the EU

Processed only on EU infrastructure — not subject to the Cloud Act.

// capabilities

Everything the API can do.

One OpenAI-compatible endpoint, the full feature surface — text, vision, voice, retrieval and agents.

Tool & function calling

Native function calling with the OpenAI JSON schema — agents that act, not just chat.

all LLMs

Structured outputs

Constrain responses to your JSON schema with response_format — typed, every time.

response_format

Vision & multimodal

Image and audio input on Gemma 4 and MiMo — read scans, charts and screenshots.

gemma4 · mimo

Streaming

Token streaming over SSE for real-time chat, copilots and voice UX.

SSE

Long context

Up to a 1M-token context window on DeepSeek V4-Flash — whole corpora in one pass.

up to 1M

Embeddings & reranking

4096-dim multilingual vectors plus cross-lingual reranking — retrieval, built in.

qwen3-embedding · rerank

Speech · STT & TTS

Whisper transcription and Kokoro synthesis — 99+ languages, sub-second voice.

whisper · kokoro

Unlimited tokens

No caps on consumption — limits are RPM and concurrency per API key.

per API key

// by the numbers

The platform, spec'd out.

The hard numbers behind the stack — context, hardware, region and reliability.

Context window up to 1M tokens
Embedding dims 4096
Models 9 in production
Hardware B200 · 192GB
Region EU · Madrid
Uptime SLA 99.9%
API OpenAI-compatible · 6 endpoints
Data retention zero logs

// product faq

The platform, answered.

What teams ask before moving inference onto Helmcode.

What does Helmcode actually run?

Open-weight models — DeepSeek, Qwen, Gemma, plus embeddings, reranking and speech — served behind an OpenAI-compatible API and operated by us on EU GPUs, with zero logs.

How do I get started?

Get an API key from the console, change your base URL and key, and you're running. Any OpenAI-compatible SDK or tool works unchanged — most teams ship the same day.

Which models are available?

Nine in production: DeepSeek V4-Flash, MiMo, Qwen 3.6 and Gemma 4 for text, qwen3-embedding and rerank for retrieval, and Whisper and Kokoro for speech. See the Models page for specs.

Where does inference run?

Exclusively on EU infrastructure — never on US hyperscalers subject to the Cloud Act. GDPR and AI Act native, by architecture rather than configuration.

Is it managed, or do I self-host?

Fully managed: we provision, monitor and operate the whole stack. For stricter needs you can move to dedicated GPUs or a full on-premise deployment inside your own datacenter.

How is it priced?

Per API key — a flat monthly rate, not per token. Unlimited tokens on open models, no usage surprises, no lock-in. See Pricing for plans.

// get started

START BURNING TOKENS

Skip the AI infra work. Deploy your first private inference endpoint today.

Flat rate. EU data. OpenAI API compatible.