// use cases · rag

Private RAG over
your own knowledge.

Ground your models in your documents, code and policies, with embeddings, reranking and generation all running on our infrastructure. Query your knowledge base or grow it without adding to your bill.

book_a_call

// how it works

One API, the whole retrieval stack.

Embeddings, reranking and generation from a single OpenAI-compatible endpoint — so your data takes one short trip, and only inside the EU.

step 01

Ingest & embed

qwen3-embedding

Turn your documents, code and policies into vectors — 4096 dimensions, 100+ languages. Re-index your whole corpus as often as you want; tokens are unlimited.
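
A minimal sketch of the ingest step, assuming fixed-size character chunks with a small overlap and batched embedding requests. The chunk size, overlap, batch size and the helper names (`chunk_text`, `batched`, `embed_corpus`) are illustrative assumptions, not Helmcode defaults; `client` is an OpenAI SDK client pointed at the Helmcode base URL.

```python
from typing import Iterator, List, Tuple


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows (sizes are assumptions)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def batched(items: List[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive batches so each embeddings request stays small."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def embed_corpus(client, docs: List[str], model: str = "qwen3-embedding") -> Tuple[List[str], List[List[float]]]:
    """Chunk every document, embed the chunks in batches, return both lists."""
    chunks = [chunk for doc in docs for chunk in chunk_text(doc)]
    vectors: List[List[float]] = []
    for batch in batched(chunks):
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return chunks, vectors
```

Store each chunk next to its vector in your own vector database; re-running `embed_corpus` re-indexes the corpus.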

step 02

Retrieve & rerank

rerank

Pull the most relevant chunks from your own vector store, then sharpen the ranking with our cross-lingual reranker so the model sees the right context first.
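
A rough sketch of retrieve-then-rerank, assuming an in-memory list stands in for your vector store. The reranker's request format is not described on this page, so `score` is a hypothetical callable you would back with a call to the `rerank` model; `top_k` and `rerank` are illustrative helper names.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], index: List[Tuple[str, Sequence[float]]], k: int = 20) -> List[str]:
    """Return the k chunks whose vectors are closest to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def rerank(query: str, chunks: List[str], score: Callable[[str, str], float], keep: int = 5) -> List[str]:
    """Reorder candidates by relevance score and keep the best few.

    `score(query, chunk)` is a placeholder for the reranker call (assumption).
    """
    return sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)[:keep]
```

Retrieving a wide candidate set (here 20) and keeping a narrow reranked set (here 5) is a common pattern; both numbers are assumptions to tune.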

step 03

Generate, grounded

deepseek-v4-flash

Answer strictly from the retrieved context — with up to 1M tokens of context window when retrieval alone isn't enough. Nothing is logged, ever.
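
One way to keep answers grounded is to label the retrieved chunks and separate them clearly from the question. This is a sketch under assumptions: the prompt wording and the `build_messages` helper are illustrative, not a prescribed Helmcode format.

```python
from typing import Dict, List

SYSTEM_PROMPT = (
    "Answer only from the context. "
    "If the context does not contain the answer, say you don't know."
)


def build_messages(question: str, chunks: List[str]) -> List[Dict[str, str]]:
    """Number each chunk so the answer can point back to its source."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```

Pass the result as `messages=` to `client.chat.completions.create(model="deepseek-v4-flash", ...)`.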

// drop-in

Change one line. Keep your code.

Point the OpenAI SDK — or LangChain, LlamaIndex, your own pipeline — at Helmcode. Same calls, same shapes, private models on EU infrastructure.

read_the_docs

rag.py

from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://api.helmcode.com/v1",  # one line changes
)

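# sample inputs (illustrative placeholders)
documents = ["Refunds are issued within 14 days.", "Support is open 9:00-17:00 CET."]
question = "How long do refunds take?"

# 1 · embed your documents, privately, in the EU
response = client.embeddings.create(
    model="qwen3-embedding",
    input=documents,
)
vectors = [item.embedding for item in response.data]  # store these in your vector DB

# retrieval runs against your own vector store; the first document stands in here
context = documents[0]

# 2 · answer grounded in the retrieved context
answer = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "Answer only from the context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)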

// why helmcode

Built for RAG you can't outsource.

The use case where confidentiality, cost and control all matter at once — and where closed APIs ask you to give up all three.

Zero logs, by architecture.

Your prompts and retrieved context are never stored. Your knowledge base never trains a model — not ours, not anyone's.

Embeddings stay in the EU.

Indexing and generation run only on EU infrastructure, not on US hyperscalers subject to the CLOUD Act. GDPR and AI Act native.

The whole retrieval stack.

Embeddings, reranking and generation behind a single OpenAI-compatible endpoint. No need to wire three vendors together.

Re-index without a bill.

Re-embed your entire corpus as often as you need. Limits are RPM and concurrency per key — never total tokens.
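
Since the limits are per-key concurrency and RPM rather than tokens, a bulk re-index mostly needs a cap on in-flight requests. A minimal sketch, assuming a thread pool sized to your key's concurrency limit; the default of 4 and the `embed_concurrently` helper are assumptions, and `embed` wraps a single embeddings request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Vector = List[float]


def embed_concurrently(
    embed: Callable[[Sequence[str]], List[Vector]],
    batches: Sequence[Sequence[str]],
    max_concurrency: int = 4,  # assumed value; use your key's real limit
) -> List[Vector]:
    """Embed batches in parallel without exceeding a concurrency cap.

    `embed` takes one batch of texts and returns one vector per text.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map preserves batch order, so vectors line up with the inputs.
        results = pool.map(embed, batches)
        return [vector for batch_vectors in results for vector in batch_vectors]
```

This caps concurrency only. For RPM, the OpenAI Python SDK already retries rate-limited requests with backoff, and its `max_retries` client option controls how many times.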

Up to 1M tokens.

When retrieval isn't enough, deepseek-v4-flash takes prompts the size of whole documents, so there are fewer chunks to tune and fewer answers missed.

Your pipeline, unchanged.

Change the base URL and key. LangChain, LlamaIndex, Haystack and your own retrieval code keep working as-is.

In production across

Banking & fintech
Insurance
Legal
Healthcare
Pharma & biotech
Public sector
Telco
Energy & utilities
Industry
Education
Dev tools

// rag faq

RAG, answered.

What engineering and security teams ask before grounding models in their own data.

Do you store the documents I embed or the context I retrieve?

No. Zero logs — your inputs, embeddings and retrieved context are never persisted, and nothing you send ever trains a model. Confidentiality is enforced by architecture, not by policy.

Which embedding and reranking models do you offer?

qwen3-embedding (8B, 4096 dimensions, 100+ languages, MMTEB 70.58) for embeddings, and rerank (Qwen3 Reranker, cross-lingual) for reranking. Both are served from the same OpenAI-compatible API as the LLMs.

Can I keep my own vector database?

Yes. Helmcode handles embeddings, reranking and generation — you keep your vector store (pgvector, Qdrant, Pinecone, Weaviate…). We don't lock you into a storage layer.

Does it work with LangChain or LlamaIndex?

Yes. Point any OpenAI-compatible client or framework at our base URL with your API key. LangChain, LlamaIndex, Haystack and custom pipelines work unchanged.

How large a context can I send?

deepseek-v4-flash supports up to a 1M-token context window, so you can pass large retrieved sets — or whole documents — when chunked retrieval isn't enough.
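
Even with a 1M-token window it helps to check that the retrieved set fits before sending it. A rough sketch, assuming about four characters per token; that ratio, the reserve size and the `fit_to_budget` helper are assumptions, and a real tokenizer would give exact counts.

```python
from typing import List

CONTEXT_WINDOW = 1_000_000  # the 1M-token figure quoted above
CHARS_PER_TOKEN = 4         # crude heuristic (assumption), not a tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fit_to_budget(chunks: List[str], reserve: int = 8_000, window: int = CONTEXT_WINDOW) -> List[str]:
    """Keep chunks in ranked order until the token budget is spent.

    `reserve` leaves room for the instructions, the question and the answer.
    """
    budget = window - reserve
    kept: List[str] = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost
    return kept
```

Feeding this the reranked list means the least relevant chunks are the ones dropped when the budget runs out.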

What about highly sensitive corpora?

For strict compliance, run RAG on a dedicated GPU or fully on-premise inside your own datacenter. The same API and code, with data that never leaves your network.
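
To keep the same code across hosted, dedicated and on-premise deployments, the endpoint can come from configuration rather than source. A minimal sketch: `HELMCODE_BASE_URL` and `HELMCODE_API_KEY` are hypothetical variable names, and the hosted URL is the one from the snippet on this page.

```python
import os
from typing import Dict, Mapping

HOSTED_BASE_URL = "https://api.helmcode.com/v1"  # from the rag.py snippet


def client_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the endpoint and key from the environment.

    Falls back to the hosted endpoint when no base URL is set.
    Variable names are hypothetical (assumption).
    """
    return {
        "base_url": env.get("HELMCODE_BASE_URL", HOSTED_BASE_URL),
        "api_key": env["HELMCODE_API_KEY"],
    }
```

Usage: `client = OpenAI(**client_settings())`. Pointing `HELMCODE_BASE_URL` at an endpoint inside your own network switches deployments without a code change.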

// get started

START BURNING TOKENS

Skip the AI infra work. Deploy your first private inference endpoint today.

Flat rate. EU data. OpenAI API compatible.