blogs.nvidia.com
NVIDIA released Nemotron 3 Super, a 120B-parameter mixture-of-experts model that activates only 12B parameters at inference, built specifically for multi-agent workflows where context and cost compound fast. The hybrid Mamba-Transformer architecture handles a million-token context window, and NVIDIA claims 5x throughput over its predecessor along with the top spot in Artificial Analysis efficiency rankings for its class. Multi-token prediction adds a claimed 3x inference speedup, which matters when every step in an agent chain requires reasoning. The model is available open-weight through Hugging Face, build.nvidia.com, and major clouds including Google Vertex and Oracle, with AWS Bedrock and Azure coming. NVIDIA keeps pushing hard on the inference-efficient open-model tier that Meta's Llama and Mistral also target, but the explicit agentic framing here sets a more specific product direction.
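The headline efficiency claim comes down to sparse activation: only 12B of the 120B parameters fire per token. A back-of-envelope sketch of what that ratio implies for a multi-step agent chain (the per-step cost model below is an illustrative assumption, not NVIDIA's benchmark methodology):

```python
# Back-of-envelope sketch of why sparse activation matters for agent chains.
# Numbers from the announcement: 120B total parameters, 12B active per token.
# The compute model (tokens x active params) is a simplification for illustration.

TOTAL_PARAMS_B = 120   # total parameters (billions)
ACTIVE_PARAMS_B = 12   # parameters activated per forward pass (billions)

def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of the model exercised on each token."""
    return active_b / total_b

def dense_vs_sparse_cost(steps: int, tokens_per_step: int) -> tuple[float, float]:
    """Relative compute (param-activations, arbitrary units) for an agent chain:
    a hypothetical dense 120B model vs. the 12B-active sparse configuration."""
    tokens = steps * tokens_per_step
    dense = tokens * TOTAL_PARAMS_B
    sparse = tokens * ACTIVE_PARAMS_B
    return dense, sparse

frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
dense, sparse = dense_vs_sparse_cost(steps=20, tokens_per_step=2_000)
print(f"active fraction: {frac:.0%}")                        # → active fraction: 10%
print(f"dense/sparse compute ratio: {dense / sparse:.0f}x")  # → 10x
```

The point of the toy model: in a 20-step agent chain every step pays inference cost, so a 10x reduction in activated parameters compounds across the whole workflow rather than applying once.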
