// use cases · rag
Ground your models in your documents, code and policies, with embeddings, reranking and generation all running on our infrastructure. Query your knowledge base or grow it without adding to your bill.
// how it works
Embeddings, reranking and generation from a single OpenAI-compatible endpoint — so your data takes one short trip, and only inside the EU.
step 01
Ingest & embed
qwen3-embedding: Turn your documents, code and policies into vectors with 4096 dimensions and support for 100+ languages. Re-index your whole corpus as often as you want; tokens are unlimited.
step 02
rerank: Pull the most relevant chunks from your own vector store, then sharpen the ranking with our cross-lingual reranker so the model sees the right context first.
step 03
deepseek-v4-flash: Answer strictly from the retrieved context, with up to 1M tokens of context window when retrieval alone isn't enough. Nothing is logged, ever.
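As a rough illustration of the glue between retrieval and generation, the sketch below ranks chunk vectors by cosine similarity and builds a grounded prompt. It is a minimal stand-in, not Helmcode's implementation: in practice the similarity search would run inside your own vector store, and the reranker call is omitted because its request shape is not described here. The helper names `cosine`, `top_k` and `build_grounded_messages` are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int = 3) -> list[int]:
    """Indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_messages(chunks: Sequence[str], question: str) -> list[dict]:
    """Number the retrieved chunks and ask the model to answer only from them."""
    context = "\n\n".join(f"[{n}] {text}" for n, text in enumerate(chunks, start=1))
    return [
        {"role": "system", "content": "Answer only from the context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

The returned list can be passed as `messages` to `client.chat.completions.create(...)`. Numbering the chunks is a design choice that makes it easier to ask the model to cite which passage it used.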
// drop-in
Point the OpenAI SDK — or LangChain, LlamaIndex, your own pipeline — at Helmcode. Same calls, same shapes, private models on EU infrastructure.
    from openai import OpenAI

    client = OpenAI(
        api_key="sk-...",
        base_url="https://api.helmcode.com/v1",  # one line changes
    )

    # 1 · embed your documents privately, in the EU
    vectors = client.embeddings.create(
        model="qwen3-embedding",
        input=documents,
    )

    # 2 · answer grounded in the retrieved context
    answer = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": "Answer only from the context."},
            {"role": "user", "content": context + question},
        ],
    )
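The snippet above sends the whole `documents` list in a single embeddings call. For a large corpus, it may be more practical to embed in batches. The following is a minimal sketch under assumptions: `embed_corpus` and `batched` are hypothetical helpers, the batch size of 64 is arbitrary (any per-request input limit is not stated here), and `client` is expected to behave like an OpenAI SDK client pointed at the Helmcode base URL.

```python
from typing import Any, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_corpus(client: Any,
                 texts: Sequence[str],
                 batch_size: int = 64,  # arbitrary; tune to your own limits
                 model: str = "qwen3-embedding") -> list[list[float]]:
    """Embed `texts` batch by batch and return one vector per text, in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=list(batch))
        # Each returned item carries the index of its input; sort to preserve order.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(list(item.embedding))
    return vectors
```

Usage would look like `vectors = embed_corpus(client, documents)`, after which the vectors can be written to whichever vector store you already run.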
// why helmcode
The use case where confidentiality, cost and control all matter at once — and where closed APIs ask you to give up all three.
Your prompts and retrieved context are never stored. Your knowledge base never trains a model — not ours, not anyone's.
Indexing and generation run only on EU infrastructure, not on US hyperscalers subject to the CLOUD Act. Built for GDPR and the EU AI Act.
Embeddings, reranking and generation behind a single OpenAI-compatible endpoint. No three vendors to wire together.
Re-embed your entire corpus as often as you need. Limits are RPM and concurrency per key — never total tokens.
When retrieval isn't enough, deepseek-v4-flash takes whole-corpus prompts — fewer chunks to tune, fewer answers missed.
Change the base URL and key. LangChain, LlamaIndex, Haystack and your own retrieval code keep working as-is.
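Because limits apply to requests per minute and concurrency rather than total tokens, a bulk re-embedding job may occasionally be throttled. The sketch below shows one common client-side response, exponential backoff with jitter. It is an assumption-laden illustration: `call_with_backoff` is a hypothetical helper, the delays are arbitrary, and how the service signals a rate limit is not stated here. With the OpenAI SDK you would likely pass `retry_on=(openai.RateLimitError,)`; the SDK client also accepts its own `max_retries` setting, which may be enough on its own.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      *,
                      retry_on: tuple = (Exception,),
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on `retry_on` with exponentially growing, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads retries out so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

A call such as `call_with_backoff(lambda: client.embeddings.create(model="qwen3-embedding", input=batch))` would wrap a single request; the `sleep` parameter exists so the delay can be replaced in tests.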
// rag faq
What engineering and security teams ask before grounding models in their own data.
No. Zero logs — your inputs, embeddings and retrieved context are never persisted, and nothing you send ever trains a model. Confidentiality is enforced by architecture, not by policy.
qwen3-embedding (8B, 4096 dimensions, 100+ languages, MMTEB 70.58) for embeddings, and rerank (Qwen3 Reranker, cross-lingual) for reranking. Both are served from the same OpenAI-compatible API as the LLMs.
Yes. Helmcode handles embeddings, reranking and generation — you keep your vector store (pgvector, Qdrant, Pinecone, Weaviate…). We don't lock you into a storage layer.
Yes. Point any OpenAI-compatible client or framework at our base URL with your API key. LangChain, LlamaIndex, Haystack and custom pipelines work unchanged.
deepseek-v4-flash supports up to a 1M-token context window, so you can pass large retrieved sets — or whole documents — when chunked retrieval isn't enough.
For strict compliance, run RAG on a dedicated GPU or fully on-premise inside your own datacenter. The same API and code, with data that never leaves your network.
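For the long-context answer in this FAQ, even a 1M-token window benefits from a budget when you pass large retrieved sets. The sketch below keeps ranked chunks until an estimated budget is reached. It rests on a loud assumption: `estimate_tokens` uses a rough four-characters-per-token heuristic rather than the model's real tokenizer, so leave generous headroom. Both helper names are hypothetical.

```python
from typing import Sequence


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def pack_context(chunks: Sequence[str], budget_tokens: int) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    packed: list[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        packed.append(chunk)
        used += cost
    return packed
```

Stopping at the first chunk that does not fit, rather than skipping it and trying smaller ones, keeps the context in strict relevance order; either policy is defensible.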
// get started
Skip the AI infra work. Deploy your first private inference endpoint today.
Flat rate. EU data. OpenAI API compatible.