Inference provider · EU

Open-weight models, served fast from European hardware.

LLM Tech runs open-weight language models on NVIDIA Blackwell GPUs using NVFP4 quantization. Every number below is measured on the production deployment — not quoted from a datasheet.

Measured — Qwen3.5-9B · NVFP4 · production node
single-stream output110 tok/s
median TTFT, sustained load227 ms
context window262,144 tokens
peak aggregate throughput7,723 tok/s
GPQA Diamond (measured)72.2%
quantizationNVFP4 · MLP-only · MSE-calibrated
hardwareNVIDIA Blackwell (RTX Pro series)
serving stackvLLM · streaming · prefix caching
regionEU

TTFT measured over 1,000 requests at a sustained 3.5 req/s with a realistic traffic mix (≈1.8K input / 525 output tokens). GPQA Diamond evaluated with lm-evaluation-harness, zero-shot CoT, flexible-extract, temperature 0.6. Throughput measured with the vLLM serving benchmark.

Why NVFP4

NVFP4 is NVIDIA's 4-bit floating-point format with FP8 block scaling, executed natively on Blackwell tensor cores. Quantization is applied to MLP layers only, with MSE calibration — attention runs at full precision.

The result: 2–3× the single-stream speed of FP8 and BF16 deployments of the same model, with quality measured on standard benchmarks rather than assumed.

Data policy, in one paragraph

Prompts and completions are processed in memory and never written to disk. Nothing is used for training — ours or anyone else's. Zero content retention. Full details in the privacy policy.

zero retention no prompt training EU jurisdiction TLS 1.3

API

The endpoint is OpenAI-compatible (chat completions, streaming, tool calling, reasoning content). Access is currently provisioned per-integration — reach us at artemburej5@gmail.com.