Open-weight models, served fast from European hardware.
LLM Tech runs open-weight language models on NVIDIA Blackwell GPUs using NVFP4 quantization. Every number below is measured on the production deployment — not quoted from a datasheet.
TTFT measured over 1,000 requests at a sustained 3.5 req/s with a realistic traffic mix (≈1.8K input / 525 output tokens). GPQA Diamond evaluated with lm-evaluation-harness, zero-shot CoT, flexible-extract, temperature 0.6. Throughput measured with the vLLM serving benchmark.
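For readers who want to reproduce the arithmetic behind that setup, here is a minimal sketch. It assumes TTFT samples are collected in seconds and summarized with nearest-rank percentiles. The helper names are illustrative, not part of the benchmark tooling named above, and the percentile method actually used for the published figures is not stated here.

```python
import math


def nearest_rank_percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def offered_load(req_per_s, input_tokens, output_tokens):
    """Aggregate token rates implied by a request rate and a per-request token mix."""
    return {
        "input_tok_per_s": req_per_s * input_tokens,
        "output_tok_per_s": req_per_s * output_tokens,
    }


def run_duration_s(n_requests, req_per_s):
    """Wall-clock time needed to issue n_requests at a steady request rate."""
    return n_requests / req_per_s


if __name__ == "__main__":
    # Figures from the methodology note: 3.5 req/s, ~1.8K input / 525 output tokens.
    print(offered_load(3.5, 1800, 525))
    print(round(run_duration_s(1000, 3.5), 1), "seconds to issue 1,000 requests")
```

At the stated mix, 3.5 req/s works out to roughly 6,300 input tokens and about 1,840 output tokens per second across all streams, and issuing 1,000 requests takes just under five minutes.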
Why NVFP4
NVFP4 is NVIDIA's 4-bit floating-point format with FP8 block scaling, executed natively on Blackwell tensor cores. Quantization is applied to the MLP layers only, using MSE calibration; the attention layers are left unquantized.
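To make block scaling and MSE calibration concrete, here is a simplified pure-Python sketch. It assumes the E2M1 value grid and 16-element blocks used by NVFP4, keeps the block scale as an ordinary float rather than rounding it to FP8, and omits the per-tensor scale. The scale search is one plausible reading of "MSE calibration", not a description of the production pipeline, and all function names are illustrative.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_GRID = [-m for m in reversed(E2M1_MAGNITUDES[1:])] + list(E2M1_MAGNITUDES)
BLOCK_SIZE = 16  # values that share one scale factor


def quantize_block(block, scale):
    """Quantize to the FP4 grid with the given scale and return the dequantized values."""
    if scale == 0.0:
        return [0.0] * len(block)
    return [scale * min(FP4_GRID, key=lambda g: abs(x / scale - g)) for x in block]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def absmax_scale(block):
    """Baseline scale: map the largest magnitude in the block onto 6.0, the top of the grid."""
    return max(abs(x) for x in block) / 6.0


def mse_calibrated_scale(block, steps=20):
    """Search scales from 50% to 100% of the absmax scale and keep the lowest-error one."""
    base = absmax_scale(block)
    candidates = [base * (0.5 + 0.5 * i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: mse(block, quantize_block(block, s)))


if __name__ == "__main__":
    block = [0.1 * i for i in range(-8, 8)]  # one 16-value block of toy weights
    for name, scale in (("absmax", absmax_scale(block)),
                        ("mse", mse_calibrated_scale(block))):
        print(name, "scale:", scale, "error:", mse(block, quantize_block(block, scale)))
```

Because the absmax scale is one of the candidates, the calibrated scale can never do worse on the calibration block. With one 8-bit scale shared by 16 four-bit values, storage comes to about 4.5 bits per quantized weight.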
The result: 2–3× the single-stream speed of FP8 and BF16 deployments of the same model, with quality measured on standard benchmarks rather than assumed.
Data policy, in one paragraph
Prompts and completions are processed in memory and never written to disk. Nothing is used for training — ours or anyone else's. Zero content retention. Full details in the privacy policy.
API
The endpoint is OpenAI-compatible (chat completions, streaming, tool calling, reasoning content). Access is currently provisioned per integration; to request it, email artemburej5@gmail.com.
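Because the endpoint follows the OpenAI chat completions format, a streaming call might look like the sketch below, written against the Python standard library only. The base URL and model name are placeholders, not real values; the actual ones come with your integration credentials. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid/v1"  # placeholder, not the real endpoint
MODEL = "example-model"                      # placeholder model name


def build_chat_request(api_key, messages, stream=True, base_url=BASE_URL, model=MODEL):
    """Build a POST request for an OpenAI-style /chat/completions endpoint."""
    body = json.dumps({"model": model, "messages": messages, "stream": stream})
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def delta_text(sse_line):
    """Return the text carried by one server-sent-event line, or None if it has none."""
    line = sse_line.strip()
    if not line.startswith("data:"):
        return None  # blank separators and comment lines
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(api_key, messages):
    """Yield text fragments as they arrive (performs a real network call)."""
    request = build_chat_request(api_key, messages)
    with urllib.request.urlopen(request) as response:
        for raw in response:
            text = delta_text(raw.decode("utf-8"))
            if text:
                yield text
```

Existing OpenAI client libraries should also work by pointing their base URL at the endpoint; the standard-library version is shown only to keep the example dependency-free.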