• last year
Groq builds an AI accelerator application-specific integrated circuit (ASIC) that they call the Language Processing Unit (LPU) and related hardware to accelerate the inference performance of AI workloads.

At Imagination In Action’s ‘Forging the Future of Business with AI’ Summit, Groq’s Chief Technology Advisor Dinesh Maheshwari talks about how this technology works and why it’s important for the future of AI.

Subscribe to FORBES: https://www.youtube.com/user/Forbes?sub_confirmation=1

Fuel your success with Forbes. Gain unlimited access to premium journalism, including breaking news, groundbreaking in-depth reported stories, daily digests and more. Plus, members get a front-row seat at members-only events with leading thinkers and doers, access to premium video that can help you get ahead, an ad-light experience, early access to select products including NFT drops and more:

https://account.forbes.com/membership/?utm_source=youtube&utm_medium=display&utm_campaign=growth_non-sub_paid_subscribe_ytdescript

Stay Connected
Forbes newsletters: https://newsletters.editorial.forbes.com
Forbes on Facebook: http://fb.com/forbes
Forbes Video on Twitter: http://www.twitter.com/forbes
Forbes Video on Instagram: http://instagram.com/forbes
More From Forbes: http://forbes.com

Forbes covers the intersection of entrepreneurship, wealth, technology, business and lifestyle with a focus on people and success.

Category

🤖
Tech
Transcript
00:00 Welcome, thank you for taking all the time to be here.
00:06 Thank you, thank you for being here.
00:10 And good afternoon folks.
00:11 Awesome, so let's get started.
00:14 Grok does something that is different from what Nvidia does.
00:18 You don't make GPUs, you make this thing called an LPU.
00:21 Could you tell us a bit about an LPU?
00:23 LPU stands for Logic Processing Unit, that is, think of it as a product branding that
00:30 we have done based on our unique architectures, a new paradigm, a tensor streaming processor.
00:36 It's a general purpose linear algebra accelerator that's during complete that applies to all
00:42 HPC.
00:43 Deep learning, machine learning happens to be a linear algebra acceleration problem and
00:50 we do obviously exceptionally well at that.
00:54 So I believe you have some benchmarks that we can show.
00:58 Look at the graph on the right hand side.
01:01 The x-axis is tokens per second.
01:06 So let me set the stage for generative AI.
01:12 What matters is how one interacts with the AI, with machine learning.
01:17 What we do is we make machine learning human and making it interact with human beings at
01:24 human latencies.
01:25 For doing that, what matters is the time to first word, whether it's spoken or text, and
01:33 the time to last word.
01:36 And human beings, the cognitive function of the human beings actually varies quite a bit
01:43 whether the response comes in 200 milliseconds or a second or 10 seconds or beyond.
01:50 So it's Google, for example, shows the first word in 200 milliseconds or when you do a
01:56 search.
01:57 They finish the whole page in 400 milliseconds.
01:59 It is what keeps human beings engaged.
02:02 You can imagine if you're talking to someone, if it's voice, and the other person takes
02:07 10 seconds to respond, you'll be wondering if the other person actually heard you or
02:11 not.
02:14 Interacting with computers with "artificial intelligence," if you want to have a semblance
02:23 of human interaction, you need low latency.
02:29 The metrics for that, the ideal ones, are the application of time to first word, time
02:33 to last token.
02:34 So the x-axis here reflects time to last token.
02:39 The y-axis reflects the inverse of time to first token.
02:45 This is a third-party benchmark, so we didn't get to tell them exactly how to plot this.
02:52 Irrespective, the quadrant at the bottom here, the extreme bottom right, which is green,
03:00 being in that quadrant, in the right bottom quadrant, is the Holy Grail.
03:07 And Grok is the only one in that quadrant.
03:10 That is what makes machine learning real.
03:14 That is from three months ago.
03:19 And the way we work is, we won't tell you where we are going, but we just got started.
03:24 So you can expect better and better things.
03:27 That's very interesting.
03:28 So just to recap, on the left side, time to first chunk, and on the x-axis we have tokens
03:35 per second.
03:36 Yes, and that reflects time to last chunk.
03:41 So that means as soon as I start a query, the first chunk, that's on the--
03:46 That's right.
03:47 And the bottom one is the rate at which the chunks come, and the higher it is, the better.
03:55 Again, we didn't choose those axes, so the best way to look at it is, the farther you
04:02 are on the bottom right, the better.
04:05 That's perfect.
04:06 So clearly, Grok's killing it, but what's the secret sauce?
04:09 What's making this--
04:10 It's a new compute paradigm.
04:12 So I'm sure most of you have heard of von Neumann architectures.
04:17 That's how computation essentially started.
04:23 And most of the computation today, general purpose compute, follows von Neumann architecture.
04:29 Putting Harvard and Princeton for the geeks, Harvard and Princeton architecture in the
04:32 same category.
04:35 It essentially has a single-- What we have-- Let me frame it the other way.
04:42 We have a programming assembly line architecture, whereas the existing architectures are more
04:48 like a hub and spoke architecture.
04:50 Very similar to how cars were being manufactured and the buggies were being manufactured before
04:56 Ford introduced the assembly line.
05:00 And the way they were being manufactured, you would have a manager.
05:03 In the case of the computation, you have a centralized instruction queue and a centralized
05:08 register that would instruct the various experts, functional units, and compute.
05:15 And you have a central repository where you have to go back and forth to, therefore, hub
05:18 and spoke.
05:20 And that creates a bottleneck to begin with.
05:26 That works great if you have to do highly branched computation.
05:32 If then else kind of a compute.
05:34 Very similar to what the prefrontal cortex does in human beings.
05:38 We are thinking logically, step by step.
05:40 We're looking at what we need to do and what conditions we need to do something else.
05:47 But your eyes and ears, when they're processing it, most of it is large data being processed
05:54 in parallel.
05:56 And by the way, quite a bit of that happens in your visual cortex.
05:58 Now, going back to the assembly line and the hub and spoke, the hub and spoke limits because
06:07 there's a bottleneck, it limits how the data gets processed.
06:11 It becomes worse.
06:12 Imagine, now, you have to go to the warehouse to get the part every time.
06:16 In the case of the GPUs and CPUs, it's your DRAM.
06:20 That makes it even worse.
06:23 What we have as an assembly line architecture, we have a sequence of functional units that
06:31 are arranged such that they compute as the data let me rephrase it the other way around.
06:40 The data, the instructions come over conveyables.
06:44 The compute station doesn't know where it came from, doesn't know where it's going.
06:49 It just knows at what time it needs to pick up the data and what instruction it needs
06:54 to perform.
06:56 And that way, the data, the instruction, the computation just goes on seamlessly without
07:02 any bottlenecks.
07:04 The memory for it, which is the warehouse, think of it as a warehouse, is on the chip.
07:10 And it is the fastest memory that is two orders of magnitude faster than what the DRAM can
07:17 provide.
07:20 This arrangement is what gives better latency.
07:27 It so happens it also gives better throughput per dollar and per watt.
07:31 Very similar to how, again, assembly line transformed not just car manufacturing, it
07:37 actually transformed industrialization after that.
07:40 And that is what we expect our architecture to do on the compute side.
07:46 My vision for that for computers, make computers cheap as water.
07:52 Because it is going to be as essential as water.
07:56 To be able to process the mountains of data that we have for the past few centuries from
08:04 which we can gather more insight and to be able to enable that real-time human interaction.
08:10 So we also have a graph, a diagram which might make it easier to follow along with.
08:16 But this is basically the assembly line architecture we're seeing from the LPU.
08:21 That's correct.
08:22 So LPU on the right and on my left hand side.
08:29 And on the right hand side, the GPU can see that in the GPU that the compute code is fetched
08:41 from the DRAM, high bandwidth memory, HBM.
08:48 It operates on it, puts the intermediator back into the DRAM, and it has to do that
08:55 thousands of times at a time.
08:56 That's why you see that blinking.
09:01 You can imagine, so this already looks as a hub and spoke design.
09:05 You can see there's a switch in the center and you have the four GPUs on the side.
09:09 Now interestingly, even within the GPUs, the various small codes that you see also have
09:15 been spoke designed.
09:18 As opposed to that, on the LPU side, as you can see, we map the compute volume once.
09:28 And after that, the computation, the tokens in the case of LLMs, are processed very much
09:36 like an assembly line.
09:39 And that is what makes it low latency, better throughput per dollar, and better throughput
09:45 per watt, all at the same time.
09:47 So it's an and.
09:48 We don't do ors.
09:49 Another way of looking at it is, this is, if you innovate on a given solution surface
09:57 for the Geeks solution manifolds, you are doing tradeoffs.
10:05 But given the problem that we are solving is quite different than the GPUs and the CPUs
10:11 started, and we really have to emulate what human beings actually do with their eyes and
10:15 ears.
10:19 We have chosen a solution manifold that allows us to do ands.
10:25 Low latency and high throughput per dollar and high throughput per watt.
10:33 And for the Geeks, the other difference, the way of looking at it would be, the GPUs scale
10:39 in time.
10:41 You get the compute code and the weights from the HBM and you loop over it thousands of
10:45 times.
10:46 So you scale in time.
10:47 The larger the problem, the more you're going to get those partitions from the memory and
10:54 do that.
10:56 We scale in space.
10:58 We map it once.
11:00 We scale in space as a pipeline.
11:03 And after that, the compute just flows through.
11:05 So one of the key benefits, the unique selling points of Grok is this ultra low latency.
11:11 Everyone's going to get access to because of this LPU architecture.
11:16 So I recently read an article about Grok mentioning that they don't want to just be a hardware
11:20 company and it came out with a Grok cloud product which lets people use the open source
11:27 models that have been deployed on Grok.
11:29 Could you speak a bit about this?
11:31 So as a new solution in architecture that we have enabled, you can imagine that in the
11:40 infrastructure space where the spend is pretty large checks and no one wants to make a large
11:47 bet on something very new.
11:50 So one way of breaking into such a market is you enable end to end.
11:54 So in that context, what we have done is we have, think of it as four business units within
12:02 Grok.
12:03 One of them that takes things to the end customer, that's Grok cloud.
12:09 The other one to serve the compute system need of the government.
12:14 We have a government business unit.
12:16 Similarly for the enterprise, for the compute system.
12:18 It's not a cloud.
12:19 It's not API for the various public models.
12:24 The third one of course is enabling the largest of largest hyperscalers.
12:31 On the Grok cloud front, at this point, we don't make our own models.
12:38 We don't want to make our own models.
12:40 We would rather avail of the ecosystem.
12:44 And in the open ecosystem, the open source models that are available, we are hosting
12:50 them today that can be accessed through the APIs as you would access open AI or cloud
12:59 from Anthropic and so on and so forth.
13:03 So as an engineer myself, I can see immense benefits of just using something that has
13:09 incredible inference, especially for stuff that's wearable related.
13:13 Because we're building all these applications where you're giving it a lot of context and
13:18 then you have to wait for the whole output to finish.
13:20 But I see a lot of value in, instead of me having to cache responses, I could potentially
13:25 just run inference multiple times.
13:27 And if it's 10 times cheaper than using NVIDIA, I see a lot of value in this.
13:33 Something very interesting you mentioned was that there's an energy benefit as well to
13:37 using this.
13:38 And I keep seeing these discussions now going around where people are discussing that energy
13:42 is the new currency, because compute is very expensive and you need energy to run it.
13:48 How much better is the energy consumption on Grox architecture?
13:54 Minimum 10x.
13:55 And again, we don't put the final numbers because they're, and we're running it after
14:00 10x.
14:01 Everyone likes 10x numbers.
14:03 Minimum 10x.
14:06 And as we announce things, you'll see where we go from there.
14:10 And the reason for that is, so you can see that in the GPU architecture, you have to
14:16 get the compute code from that external warehouse, the HBM.
14:21 That takes a lot of energy.
14:23 It turns out that the compute energy is few orders of magnitude lower than having to go
14:32 back and forth to the HBM to get that data.
14:37 And on the other hand, once we have mapped everything, compute volume to our sequence
14:46 of chips here, we do not get penalized for that energy at all.
14:54 Assembly line allows you to use the hardware resources and the energy a lot more efficiently.
15:01 As a reason why most industrial production today is done with assembly lines.
15:08 You can expect actually better and better energy and cost structure.
15:13 So NVIDIA had a pretty big announcement last month at GDC.
15:17 They announced the Blackwell platform.
15:20 Is Grox concerned?
15:22 No.
15:23 So the way, again, so around the time when Ford introduced the assembly line, actually
15:30 let's go even a bit earlier than that.
15:33 When the automotives were seen as a viable mode of transportation, the response from
15:38 the buggy manufacturers were, okay, automotive guys, you have a five horsepower engine, we'll
15:42 put two more horses.
15:44 So horses as in horses.
15:46 There were two horses plus two horses.
15:48 So four horses, we are almost there, faster and bigger buggies.
15:52 We'll catch up with you guys.
15:53 But we know what happened with the automotives.
15:56 The architectural space for the automotives allowed for better optimization and it allowed
16:03 for the automotives to go from five horsepower to 10 horsepower to 20 and then came the V8
16:09 that was 50 horsepower.
16:12 So the announcement for NVIDIA was bigger and better buggies.
16:19 We are just getting started with the automotives.
16:22 They are still they didn't catch up even now.
16:27 We don't think they will catch up.
16:31 So the automotives outran the buggies.
16:34 We all know the history.
16:36 It is GPUs are good for certain kind of things.
16:41 They were created for graphics.
16:44 They're good for parallel load execution where for training, for example, they work well.
16:51 But you don't want to use things, tools that were done for something else when you have
16:59 a better architectural solution.
17:01 Do you think so NVIDIA's secret sauce is CUDA, their kernel.
17:07 Do you think that could also be something that's holding them back if they were to pivot
17:10 to something like an LPU?
17:11 Actually, that so CUDA is the reason why we think of NVIDIA as a hardware company.
17:17 But the reason NVIDIA has the top dog space today is because of their software.
17:25 AMD that manufactures the other GPUs and their hardware is at least as good as NVIDIA's if
17:32 not better.
17:34 The reason AMD is not adopted well is because NVIDIA has had 15 years of creating the CUDA
17:44 ecosystem and for the GPU architectures, whether it's AMD, NVIDIA or Intel, Intel also has GPUs.
17:54 Unfortunately to get good utilization from that hardware, from that architecture, you
18:01 need to hand tweak the kernels.
18:04 The compiler out of the box doesn't give you good hardware utilization.
18:08 What that translates to is your cost structure is going to be more expensive, your power
18:13 is going to be more and that, to avoid that, you need actually hand coded kernels.
18:19 We have avoided that problem completely.
18:22 We actually started from the compiler side.
18:24 We didn't start from hardware up.
18:28 We looked at the whole problem and then mathematically reduced it to the bare minimum that we needed
18:34 to put in the hardware.
18:37 Our hardware is, again for the geeks, it is a one dimensional bin array and that translates
18:46 to a problem of 1D packing, one dimensional packing, as opposed to the GPUs which is on
18:53 the chip is a two dimensional packing and when you put the network together it's a multi
18:57 dimensional bin packing.
19:00 The 2D and multi dimensional bin packing is an NP complete problem.
19:05 So the geeks here would recognize that that's a difficult problem to solve.
19:10 1D bin packing problem is not an NP complete problem.
19:15 So our compiler works straight out of the box with high hardware utilization.
19:23 We were able to bring up models that would take NVIDIA six months to nine months to get
19:32 good utilization in a matter of days.
19:40 That's another reason why the new architecture we can actually ramp up and bring up workloads
19:46 that other folks cannot, which AMD cannot today, even with the good hardware that they
19:51 have.
19:53 So a lot of people that have actually heard about Grok here probably heard about it because
19:57 everyone on Twitter was posting pictures of going to Grok chat and as soon as they finish
20:03 typing something, pressing enter, as soon as they finish typing, the text just starts
20:08 generating.
20:09 It's incredibly fast.
20:11 Right now I believe Grok Cloud only supports Gemma, Mistral and Lama.
20:15 When do you, what new models are we going to see on Grok Cloud?
20:20 All the popular models.
20:22 So it depends upon the ecosystem as to what the ecosystem works as the popular ones.
20:28 We can only do open source and obviously some of the folks would like their proprietary
20:34 models to be hosted, but if there's enough business, we'll entertain that.
20:40 That's interesting.
20:43 What do you think, what is next for Grok?
20:44 What is, what should we all be looking forward to over the next year?
20:50 We don't give targets.
20:52 What you can see, I'll reiterate, the automotives are only getting started.
20:57 We are at 10 horsepower.
21:00 I think the buggies are at 2 horsepower, maybe 3, 4.
21:04 You can imagine what happens when the automotive is put on the V4s and the V8s.
21:12 This is not, what we have here is not just for machine learning.
21:18 This is for high performance computing.
21:21 We believe this will change the computation for linear algebra acceleration in general
21:29 for large compute significantly.
21:32 Awesome.
21:33 Well, thank you so much for taking out the time to be here.
21:37 Thank you.
21:38 I hope you all learned something today.
21:39 Awesome.
21:40 Thank you.
21:41 All right.
21:41 All right.
21:42 All right.
21:43 All right.
21:44 All right.
21:45 All right.
21:45 All right.
21:46 All right.
21:47 All right.
21:48 All right.
21:48 All right.
21:49 All right.
21:50 All right.
21:51 All right.
21:51 All right.
21:52 All right.
21:53 All right.
21:54 All right.
21:54 All right.
21:55 All right.
21:56 All right.
21:57 All right.
21:57 All right.
21:58 All right.
21:59 All right.
22:00 All right.
22:00 [BLANK_AUDIO]

Recommended