#AI #Nvidia #Nemotron
Nvidia has released Nemotron Ultra, a powerful open source AI model with 253 billion parameters that outperforms larger models like DeepSeek R1 and Llama 4 in most tasks. It features a unique “reasoning on” and “reasoning off” mode, allowing it to switch between deep and fast thinking, making it ideal for code generation, math, and complex instruction-following. Built using Neural Architecture Search and optimized for efficiency, it runs on a single 8xH100 setup and supports extended context lengths up to 128,000 tokens.

🔍 Key Topics:
Nvidia unveils *Nemotron Ultra*, a 253B open source AI model optimized for efficiency
Beats DeepSeek R1 and Llama 4 in tasks like code, math, and instruction-following
Features toggleable reasoning modes for shallow or deep thinking on demand

🎥 What You’ll Learn:
How Nemotron Ultra uses Neural Architecture Search and model compression to run on 8xH100
Why its “reasoning on/off” feature boosts performance across multiple benchmarks
What this means for *AI deployment*, cost-effective inference, and commercial applications

📊 Why It Matters:
This video breaks down how Nvidia’s Nemotron Ultra is redefining large language models with powerful reasoning control, massive context windows, and state-of-the-art results—making high-performance AI more accessible than ever.

DISCLAIMER:
This video explores Nvidia’s Nemotron Ultra and its impact on the AI landscape, highlighting key benchmarks, architecture choices, and its practical advantages over larger models.


Category: 🤖 Tech
Transcript
00:00NVIDIA just dropped a monster of a model that's smaller than DeepSeek R1 but still beats it in
00:08most tasks. Yep, even with half the size. It can flip between shallow and deep reasoning like a
00:13switch and it runs on just one setup of eight H100 GPUs. This thing is open source, crazy efficient
00:21and packed with new tricks you'll actually want to try out. Alright, so this new model is built
00:25around Meta's older Llama 3.1 405B Instruct model, which was already known for pretty robust performance
00:32in reasoning and instruction following tasks. The team at NVIDIA did something pretty unique.
00:37They used a technique called Neural Architecture Search or NAS to pick and choose which parts of
00:44Llama's architecture to keep, skip, or fuse together. Some blocks of the network skip attention entirely
00:50or compress feed-forward layers, and others fuse multiple feed-forward networks into bigger,
00:55more efficient ones. It's all about optimizing memory usage, and the end result is a model that
01:00you can run on a single 8x H100 node or even on systems rocking NVIDIA's B100 (Blackwell) or Hopper architectures
01:08with BF16 or FP8 precision. It's honestly impressive that something so large, 253 billion parameters to
01:16be exact, can be whipped into shape so that it doesn't break the bank on hardware resources.
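To make that idea concrete, here's a toy PyTorch sketch, not NVIDIA's actual code, of the kind of search space NAS explores: each block can keep attention, skip it entirely, or shrink its feed-forward layer.

```python
import torch
import torch.nn as nn

# Toy sketch (not NVIDIA's real architecture): a NAS-style search space where
# each transformer block can keep attention, skip it, or compress its FFN.
class VariableBlock(nn.Module):
    def __init__(self, d_model, use_attention=True, ffn_mult=4.0, n_heads=8):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
        hidden = int(d_model * ffn_mult)  # ffn_mult < 4.0 acts as a "compressed" FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        if self.use_attention:
            a, _ = self.attn(x, x, x)
            x = self.norm1(x + a)
        return self.norm2(x + self.ffn(x))

# A heterogeneous stack like the ones NAS might pick: some blocks skip
# attention entirely, others shrink their feed-forward layers.
blocks = nn.Sequential(
    VariableBlock(512, use_attention=True,  ffn_mult=4.0),
    VariableBlock(512, use_attention=False, ffn_mult=4.0),  # attention skipped
    VariableBlock(512, use_attention=True,  ffn_mult=1.5),  # compressed FFN
)
out = blocks(torch.randn(2, 16, 512))  # (batch, seq, d_model)
```

The point is that the layers don't have to be uniform; the search picks a different configuration per block to hit a memory budget while preserving quality.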
01:20Now, one of the coolest features is this built-in reasoning-on and reasoning-off mode. If you're
01:27dealing with complex tasks like detailed math, code generation, or advanced Q&A, you can switch it to
01:34reasoning on, and the model will tap into its deeper chain-of-thought processes. If you need simpler outputs,
01:40like short instructions or quick answers, you toggle reasoning off. NVIDIA wants to give developers that
01:47fine-grained control. They've even tested the toggling feature on all kinds of tasks.
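Based on how the video later describes the toggle (it's just a plain system prompt), a minimal sketch of switching modes might look like this:

```python
# The toggle is just a system prompt (covered later in the video):
# "detailed thinking on" vs. "detailed thinking off".
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    mode = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": mode},
        {"role": "user", "content": user_prompt},
    ]

deep = build_messages("Prove there are infinitely many primes.", reasoning=True)
fast = build_messages("What's the capital of France?", reasoning=False)
```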
01:51For instance, the model absolutely crushed it on the MATH 500 benchmark, going from 80.40% accuracy in
01:59reasoning-off mode to a whopping 97.00% with reasoning on. Another big improvement is on AIME 25, which jumped from
02:0716.67% to 72.50% with reasoning enabled. That's a pretty significant leap, and it shows you how
02:15important that special reasoning feature can be. They also tested something called LiveCodeBench,
02:20which is basically all about generating correct code. The difference was night and day, from 29.03%
02:26pass@1 up to 66.31%. And it's not just code tasks. Another test, GPQA, a graduate-level
02:34question-answering challenge, soared from around 56.60% with reasoning off to 76.01% with reasoning on.
02:42In side-by-side comparisons with DeepSeek R1, a state-of-the-art mixture-of-experts model with a
02:48massive 671 billion parameters, the new Nemotron Ultra manages to beat it on tasks like GPQA
02:55and IFEval instruction following. Even in code generation, it slightly edges out DeepSeek R1.
03:02However, the math tests are kind of a mixed bag. Nemotron Ultra is not quite at the top on the
03:08trickiest math benchmarks. For example, on the AIME 25 test, it got 72.50%, while DeepSeek R1 hit
03:17around 79.8%. And MATH 500 is really close between the two, but DeepSeek still narrowly tops it with
03:2497.3% compared to Nemotron's 97.00%. It's basically a game of trade-offs still. Considering
03:33that Nemotron has fewer parameters, it's pretty amazing. Now, NVIDIA has been keen on telling
03:38everyone that it's fully open source with commercial licensing. They're releasing it under something
03:43called the NVIDIA Open Model License, but it also falls under the Llama 3.1 Community License
03:50Agreement, since it's built on top of Meta's Llama. Either way, you can actually grab the
03:56open weights and even the post-training data on Hugging Face, and yeah, it's ready for commercial
04:01use. Meaning that if you're a developer who wants to deploy an AI assistant or chatbot or some advanced
04:07tool that uses retrieval augmented generation, you can legally do it. NVIDIA is also telling folks to
04:13do their own alignment checks and safety evaluations as usual, because these powerful models can
04:19sometimes produce outputs you don't expect or might not want. So how exactly did they get to this level
04:26of performance? It's not just the architecture search. NVIDIA also ran a multi-phase post-training
04:32pipeline. That pipeline started with supervised fine-tuning on tasks like math, code generation,
04:38chat, and tool use. Then they did a round of reinforcement learning with something called
04:42Group Relative Policy Optimization, or GRPO for short, to hone instruction following. On top of that,
04:48they used knowledge distillation for 65 billion tokens and then continued pre-training on another 88
04:55billion tokens. They name-dropped datasets like FineWeb, Buzz V1.2, and Dolma, plus a bunch of
05:03synthetic data. The synthetic data, by the way, is used to train the model on how to handle these
05:07reasoning-on and reasoning-off modes in real scenarios. NVIDIA says they combined public corpora
05:13with machine-generated prompts, so it's basically a big blend.
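For a feel of what GRPO does, here's a small sketch of its core trick, group-relative advantages. This is the general published method, not NVIDIA's training code:

```python
import numpy as np

# Toy sketch of GRPO's core idea: sample a group of responses per prompt,
# score them, and use each response's reward *relative to its group* as the
# advantage signal for the policy update -- no separate value network needed.
def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (num_prompts, group_size) -- one row per prompt."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)  # normalize within each group

# Example: 2 prompts, 4 sampled responses each, scored by some reward model.
rewards = np.array([[0.1, 0.9, 0.4, 0.6],
                    [1.0, 0.2, 0.3, 0.5]])
print(group_relative_advantages(rewards))
# Responses above their group's mean get positive advantages and are
# reinforced; below-mean responses get pushed down.
```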
05:18Another interesting detail is that the model's maximum sequence length is up to 128,000 tokens, or 131,072 tokens
05:27to be super precise in some references. That's huge, especially when you consider typical context windows used to be more like
05:334K or 8K tokens. The ability to handle extremely long inputs is especially handy for chatbots that
05:39need to remember extended back-and-forth conversations or for analyzing extensive documents. And let's be
05:44real, long context windows can be game changers for folks who want to feed in entire sets of text data
05:49or lengthy code repositories for debugging and summarization.
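As a practical example, here's a rough sketch of checking whether a document fits the window before sending it; the Hugging Face repo ID is my assumption from the video, so verify it on the Hub:

```python
from transformers import AutoTokenizer

# Rough sketch: check that a long document fits in the 131,072-token window
# before sending it. The repo ID below is assumed -- verify on Hugging Face.
MODEL_ID = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"  # assumed repo ID
MAX_LEN = 131072

tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
document = open("big_codebase_dump.txt").read()
n_tokens = len(tok.encode(document))
print(f"{n_tokens} tokens ({n_tokens / MAX_LEN:.0%} of the context window)")
if n_tokens > MAX_LEN:
    print("Too long -- chunk or summarize before sending.")
```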
05:54If you're a developer, hooking into Nemotron Ultra is pretty straightforward. If you're used to Hugging Face
05:59Transformers, NVIDIA suggests version 4.48.3, setting up your pipeline in Python, and controlling the reasoning-on and reasoning-off
06:07modes with a little system prompt that says "detailed thinking on" or "detailed thinking off". For reasoning on,
06:13they recommend a temperature of 0.6 and a top_p of 0.95, which basically introduces
06:21some creativity. For reasoning off, they say to try greedy decoding, or temperature equals
06:26zero. That means your outputs will be more deterministic. All the code you need is shown
06:31in their examples, and they've also put out instructions for using vLLM, so you can serve up
06:36an API for your apps. Just set up the server with the model name, pass trust_remote_code,
06:42pick your device map, and you're off to the races.
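Putting those settings together, a minimal Transformers sketch might look like this. Again, the repo ID is assumed from the video; double-check the actual model card:

```python
import torch
import transformers

# Minimal sketch following the settings described in the video; the repo ID
# is assumed -- check the model card on Hugging Face before relying on it.
MODEL_ID = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"  # assumed repo ID

pipe = transformers.pipeline(
    "text-generation",
    model=MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",         # spread the 253B weights across available GPUs
    trust_remote_code=True,    # required for the custom architecture
)

messages = [
    {"role": "system", "content": "detailed thinking on"},  # or "... off"
    {"role": "user", "content": "Write a function that merges two sorted lists."},
]

# Reasoning on: temperature 0.6 / top_p 0.95, per the video.
# Reasoning off: swap in do_sample=False for greedy, deterministic output.
out = pipe(messages, max_new_tokens=1024, do_sample=True,
           temperature=0.6, top_p=0.95)
print(out[0]["generated_text"][-1]["content"])
```

For serving, the vLLM route the video mentions works the same way in spirit: point its OpenAI-compatible server at the model ID and pass trust_remote_code.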
06:48Now, if you're curious about real hardware performance, NVIDIA tested it on 8x H100 80 GB GPUs in BF16 precision, or sometimes 4x B100 chips.
06:55They even tried FP8 on 4x H100 80 GB cards. So you have multiple ways to run it, but obviously you need
07:03some serious GPU horsepower if you're going to do large-scale inference. The main reason it's more
07:08economical than some even bigger models is that it uses less memory, thanks to skipping certain
07:14attention blocks and compressing feed-forward networks. It's also important to mention that
07:19they do vertical compression, so that even though the model is 253B, it's not as monstrous in memory
07:26usage as you might think.
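Some quick back-of-the-envelope math shows why those hardware configurations line up (weights only; the KV cache and activations add more on top):

```python
# Back-of-the-envelope memory math (weights only, ignoring KV cache and
# activations): why 253B parameters can fit on one 8x H100 80 GB node.
params = 253e9
bytes_per_param_bf16 = 2   # BF16 = 16 bits
bytes_per_param_fp8 = 1    # FP8  = 8 bits

print(f"BF16 weights: {params * bytes_per_param_bf16 / 1e9:.0f} GB")  # ~506 GB
print(f"FP8 weights:  {params * bytes_per_param_fp8 / 1e9:.0f} GB")   # ~253 GB
print(f"8x H100 80GB: {8 * 80} GB total")  # 640 GB -> fits the BF16 weights
print(f"4x H100 80GB: {4 * 80} GB total")  # 320 GB -> fits the FP8 weights
```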
07:32Let's talk about some additional technical details. The base model, Llama 3.1 405B Instruct, was apparently trained on data up until 2023, so any knowledge after that might be
07:40limited unless it's captured in the post-training data, which runs up to 2025. NVIDIA says they
07:48started working on this project in November 2024 and continued right up until April 2025. All that time
07:56they were refining the architecture, testing the skip attention approach, and doing big knowledge
08:01distillation phases. They also highlight something called Qwen, presumably referring to the Qwen family of models
08:06or data used to improve the model's reasoning. And because they want the process to be transparent,
08:11they're releasing the Llama Nemotron post-training dataset so people can understand exactly what they
08:17fed into this AI. The benchmarks were pretty thorough. I've mentioned a few, but let me run through them
08:22the way NVIDIA does. There's MATH 500, which saw a leap to 97.00% in reasoning mode. Then AIME 25 soared up
08:33to 72.50%. BFCL V2 Live was around 74.10% with reasoning on. There's also LiveCodeBench (scored on problems from 2024-08-01 through 2025-02-01), where the pass@1 score more than doubled with reasoning on. And the IFEval test for instruction following got as high as 89.45% with reasoning on. These numbers are all from multiple runs, up to 16 passes, to make sure they're consistent.
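For reference, pass@1 over multiple runs is usually computed with the standard unbiased pass@k estimator from the HumanEval paper; a small sketch:

```python
from math import comb

# The video mentions pass@1 scores averaged over up to 16 runs. pass@k is the
# standard unbiased estimator from the HumanEval paper: given n samples with
# c correct, the chance that at least one of k drawn samples is correct.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 runs, 11 correct -> pass@1 is just the raw success rate.
print(pass_at_k(16, 11, 1))  # 0.6875
```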
08:38And yeah, that's how NVIDIA gets these final figures.
09:06They're pitching it as a general-purpose model for AI agent systems, chatbots, code generation, RAG (retrieval-augmented generation), you name it. It also supports multiple languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. So it's definitely not just for English use cases. The interesting part is the recommended usage: set the system prompt to "detailed thinking on" or "detailed thinking off" and put your instructions in the user turn.
09:35They specifically say you shouldn't add extra system prompts, because you might confuse the mode toggle. For folks wanting advanced reasoned responses, you basically just rely on the on mode; if you want something quick and factual, you set it off.
09:50Now, let's talk about the licensing. Because it's open, you can check out the code, the weights, and the post-training data. But you do need to respect the NVIDIA Open Model License plus the Llama 3.1 Community License.
10:04NVIDIA always emphasizes the importance of building responsibly. So they want teams to do their own testing for bias, safety, and any ethical or compliance issues. They've posted a place where you can report potential security vulnerabilities or AI concerns.
10:19The gist is that they want to be transparent, but they also want to make sure people don't misuse this tech.
10:25A quick note on the Nemotron Ultra naming. Apparently, this is part of the Llama Nemotron collection.
10:31You can also find smaller siblings like Nemotron Nano 8B V1 or the bigger Nemotron Super 49B V1. It's a whole family.
10:42The 253B version is the Ultra. So think of it as that sweet spot between raw horsepower and efficiency.
10:49There are references to some academic papers too, like a puzzle-based distillation approach, reward-aware preference optimization, and FFN fusion for large language models.
11:00So it's not just marketing hype. There's real technical depth behind how they got these results.
11:06Bottom line, NVIDIA's Llama 3.1 Nemotron Ultra 253B V1 proves you don't need a trillion parameters to hit top-tier performance.
11:18It beats DeepSeek R1 in most tasks, runs smoother, and costs less to deploy.
11:24You can grab it now on Hugging Face. It's open source, handles super long inputs, and switches between light and deep reasoning on demand.
11:32That's it. Give it a try and see what it can do.
11:34Thanks for watching and I'll catch you in the next one.
