Three Google engineers, James Rubin, Peter Danenberg and Peter Grabowski, discuss what they've learned so far working on Google's Gemini AI and what's to come next.
Forbes covers the intersection of entrepreneurship, wealth, technology, business and lifestyle with a focus on people and success.
Category: Tech

Transcript
00:00Well, hello everyone.
00:02I'm James Rubin.
00:03I'm going to be the moderator for this fireside chat with my esteemed Google colleagues, Peter
00:09Grabowski and Peter Danenberg.
00:11We're going to be touching on the key blockers and solutions to productionizing enterprise-ready
00:17LLMs and talk about some of the state-of-the-art approaches to really solving those challenges.
00:24We want to really focus on the practical applications today, and we hope this is really a jumping-
00:28off point for folks.
00:30It's a conversation starter, and we'll be in the Media Lab later if we have
00:33any more deep-dive questions.
00:36But I think before we dive in, maybe we'll start with some introductions.
00:40I can go first.
00:41I'm James Rubin.
00:42I'm a product manager at Google for a Gemini applied research team.
00:48I work with Peter Grabowski.
00:50I'm his product counterpart.
00:53Prior to Google, I was at Amazon as a PM for the better part of three years.
00:58Worked across the AI stack, beginning with Zoox, which is their self-driving subsidiary,
01:03but also on custom AI chips and machine learning services at AWS.
01:08With that, I guess I'll pass it off to Peter Danenberg.
01:11Hi.
01:12I'm Peter Danenberg.
01:13I work on Gemini, which was formerly Bard, and before that was Assistant.
01:23And I've been a senior SWE for a while, currently working on Gemini extensions.
01:29My name is Peter Grabowski.
01:30I am lucky enough to work with James on the Gemini applied research team.
01:34I've been at Google for about 10 years.
01:36Came in through the Nest acquisition, then spent some time working on the Google Assistant.
01:40And in the evenings, I moonlight as a teacher.
01:42So I'm on the faculty of UC Berkeley's master's in data science program and teach deep learning
01:46with natural language processing.
01:49And one thing I'll add there, Peter is too humble to say it, but during really the height
01:54of the AI craze in early 2023, he created and taught an LLM bootcamp that has really
02:00become the foundational course for Google.
02:02It's been taken by tens of thousands of Googlers, including myself.
02:05So we definitely have two incredible experts on the stage today.
02:11Before we dive in, though, I want to discuss kind of what the motivation for this talk
02:15was.
02:16A recent survey from Andreessen showed that the overwhelming majority of enterprises adopting
02:21AI are choosing to build AI apps in-house on top of common foundation models, as opposed
02:29to leveraging B2B AI software off the shelf.
02:33There may be a founder in the audience today that disrupts that trend.
02:36But regardless, the thing that we're most interested in is this disconnect between that
02:41enthusiasm to build and this hesitance and slowness that we're seeing with enterprises
02:47to deploy LLMs externally in production.
02:51Internal apps we're seeing, testing we're seeing, use cases for employee productivity;
02:55but when it comes to external apps, comparatively, they lag behind.
02:59So we really want to keep that in focus today, and we want to focus on ways that perhaps
03:04we can make the process of productionizing these LLMs a bit more clear.
03:09So with that, I thought we'd just level set with everyone and maybe start with the basics.
03:13Peter G., how would you describe an LLM, and why should businesses care?
03:19Happy to.
03:20I'll give a really simple example, but hopefully it's got some strong pedagogic value.
03:25LLMs are really nothing more than fancy autocomplete.
03:27You might have heard that metaphor before.
03:29And so if I said, or if I gave the audience a prompt, I went to go see a baseball game
03:33last night, I got to see the Boston Red...
03:37Hopefully everybody's thinking "Sox."
03:39That's at their core what LLMs are doing.
03:42Now with short context windows for just a little bit of data, that's not all that interesting.
03:46But when you give these models billions of parameters to learn
03:50with and start showing them hundreds of thousands or millions or billions of examples, with longer
03:54and longer context windows, you start to see some really interesting behavior emerge.
04:01And I think one thing that we may want to expand on a bit is the fact that LLMs aren't
04:07just good for next word prediction and chat, but actually can be used for a wide range
04:10of traditional ML approaches as well.
04:12So perhaps just unpack that for us and also talk about the ways in which tasks can be
04:17reframed to work with LLMs.
04:19And so that's one of the things that's super interesting.
04:21Once you get into the couple billion parameter size, you start to see these fascinating properties
04:26emerge.
04:27And so if you've heard the term zero shot or few shot learning, that's what folks are
04:30talking about.
04:31If any folks in the audience either have a two-year-old or are ML engineers, this is
04:35the answer to the question you might have had, which is: why can I show my two-year-old
04:38a picture of a zebra, maybe two or three of them,
04:41and then on the fourth, she's able to correctly identify what a zebra is?
04:45But with more traditional machine learning techniques, you need to show it 10, 20, or
04:4930 thousand images of a zebra.
04:51So I think that's what James is alluding to.
04:53The other thing that's really interesting is you can start to recast many of these traditional
04:58ML problem framings as next word prediction problems.
05:01So if I gave it a sentence and then gave it the prompt, the sentiment of this sentence
05:05is blank, it would happily fill in positive, negative, neutral, happy, sad.
05:10And all of a sudden, you've reframed a classification problem into a next word prediction problem.
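To make that reframing concrete, here is a minimal sketch of sentiment classification posed as a fill-in-the-blank completion. The client library, model name, and API-key placeholder are illustrative assumptions rather than anything specified in the talk; any text-completion endpoint would work the same way.

```python
# Minimal sketch: recasting sentiment classification as next-word prediction.
# The client and model name below are assumptions for illustration only.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")             # hypothetical placeholder
model = genai.GenerativeModel("gemini-1.5-flash")   # assumed model name

def classify_sentiment(sentence: str) -> str:
    # The "classifier" is just a completion prompt: the model fills in the blank.
    prompt = (
        "Answer with one word: positive, negative, or neutral.\n"
        f"Sentence: {sentence}\n"
        "The sentiment of this sentence is"
    )
    response = model.generate_content(prompt)
    return response.text.strip().lower()

print(classify_sentiment("The seats were cramped, but the game went to extra innings!"))
```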
05:15Awesome.
05:16So basically, no excuse for businesses not to think creatively about how to apply LLMs
05:20to their use cases.
05:22Plenty of solution space exploration there.
05:25Peter D., I want to get your insights on this, especially how LLMs compare with these traditional
05:31ML approaches in terms of performance and how people are using them.
05:35So one of the things we do every couple of weeks is we bring startups to Google to just
05:38ask how they're using AI.
05:40What are the difficulties they're having with Gemini, this and that?
05:43And so a couple of weeks ago, I learned that a lot of startups are doing this thing where
05:47they basically train a bunch of baby models, something like a Gemma 2B model, on things
05:51like classification tasks.
05:52So they can go to market in something like six to eight weeks, whereas previously, even
05:56just to train a trivial model as a classifier would take six to 12 months.
06:01So we've been seeing this quick time to market using baby LLMs as classification engines,
06:06trained on maybe tens of examples, which is incredible.
06:09And actually, this is a great way to get us back from that slight digression, which
06:13is what is the shift in LLMs that makes them especially attractive to businesses?
06:18Yeah.
06:19That's one of the things my team and I are really excited about, which is all of a sudden,
06:23to train a classifier, to train a model of a fixed quality, the amount of time that it
06:27takes, the amount of data that it takes, the amount of expertise that it takes, the amount
06:31of compute that it takes, has fallen dramatically.
06:34So that's one of the things we're exploring on our team.
06:36Absolutely.
06:37And I think there's actually one important one we missed, which is customizability, right?
06:41The ability to tune and align models to a specific task or domain.
06:46Businesses have vertical use cases, specific customer problems they're trying to solve.
06:49So this is incredibly important.
06:52And I actually want to drill down on that a bit further and get your insights.
06:56Data shows that customizability is one of the top two selection criteria for enterprises
07:01selecting a model provider.
07:04But the process for going about customization is very complex.
07:08There's many different tuning techniques.
07:11There's many quality and cost trade-offs.
07:13It's very difficult to get to the output that you want.
07:16So perhaps, starting with Peter G., give businesses a starting point to navigate
07:22this complexity.
07:23Yeah.
07:24100%.
07:25So I saw this in my own work.
07:26I worked on the Google Assistant for a number of years.
07:28One of the things that we were focused on is building a sentence simplification engine
07:31for kids.
07:32And so if you ask, why is the sky blue?
07:34And you're an adult, you might get an answer like refraction in the ionosphere.
07:38But if you're a kid, that's not a satisfying answer.
07:40You want something like it bounces off a water drop that's in the sky.
07:43And so I spent about, like you were saying, six to eight months trying to build a model
07:47with Google Research and launch it into production.
07:50We were able to build something that works, but it wasn't high enough quality to ship.
07:54Fast forward to a year ago, with the few-shot prompting techniques that we were talking
07:58about, I was able to build something that blew the model we had built five years prior
08:02out of the water.
08:03And so to segue, that's the advice that I would give to businesses.
08:06Think about the problem that you want to focus on and solve using a large model, and then
08:10just get started.
08:11You can start by asking a model a question, just like we were talking about a moment ago.
08:16So let's say they set up a sandbox.
08:18They run an internal pilot.
08:20They've got their metrics set, their gathering data.
08:24It performs really well on general tasks, but it's not quite ready to specialize in
08:29their domain.
08:30So they're trying to replace some of their domain-specific workflows.
08:34What are some more advanced approaches they can now take?
08:37And you touched on one thing that I think is really important to emphasize, which is
08:40make sure that you've got metrics in place so that you can measure when you're improving.
08:44And so in this case, maybe we can talk about a hypothetical example of a legal startup.
08:50Maybe you want your chatbot or your agent to talk and sound like a lawyer.
08:54The first thing you might do is try what's known as role prompting, which is just telling
08:57the model to talk like a lawyer.
09:00From there, I would, again, evaluate, measure, see how you're doing.
09:04And if it's not where you want it to be, there's a couple other techniques you can try.
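As a quick sketch of that role-prompting step for the legal example, something like the following is usually enough to get started; the client, model name, and instruction wording are illustrative assumptions, and if your client has no system-instruction field, prepending the same text to the prompt works too.

```python
# Minimal sketch of role prompting for the legal-assistant example.
# Client, model name, and instruction text are illustrative assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder
model = genai.GenerativeModel(
    "gemini-1.5-flash",  # assumed model name
    system_instruction=(
        "You are a careful corporate lawyer. Answer in precise, formal legal "
        "language and flag anything that needs review by licensed counsel."
    ),
)

answer = model.generate_content("Can we terminate this vendor contract early?")
print(answer.text)
```

From there, as Peter G. says, measure against your metrics before reaching for heavier techniques.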
09:09Definitely dive into those, especially with regards to domain knowledge and domain specificity.
09:14Sure.
09:15So the next thing I would think about trying is a family of techniques
09:18known as domain adaptation.
09:20So the first thing you might try is continued pre-training.
09:23So what you're doing is you're taking that language modeling task that you started with,
09:26predict the next word.
09:27You're using backpropagation to update your weights.
09:29But you're focusing it on a corpus of data that's relevant to your domain.
09:33And so for instance, in this legal example, you might do continued pre-training on a corpus
09:39of law textbooks.
09:40To give a human analogy, that's like telling a first year law student to go read 50 legal
09:45textbooks and come back and talk to me more like a lawyer.
09:49Awesome.
09:50And what about things like classification, given we mentioned it earlier, or chat, where
09:55we're talking about really adapting the task that the LLM is doing rather than the domain
09:59background?
10:00That's a great question.
10:01So if you're asking the model to make, I don't know, a decision about some sort of case law
10:05or something like that, continued pre-training can absolutely help.
10:09You might decide to focus it on specific examples of the task you want it to do.
10:13And so if it's a classification problem, you would train it using backpropagation, using
10:17the next word prediction task, to focus specifically on that task.
10:23And so in that context, it's usually known as supervised fine tuning.
10:27Awesome.
10:28So to summarize, for domain knowledge, domain specificity, continued pre-training is a good
10:33place for people just to start.
10:35If you're looking to improve a very specific problem framing, very specific task, SFT is
10:41a good place to start.
10:42But there are, as we know and you know more than most, there's many other techniques to
10:47explore within that.
10:48But we can talk about that maybe outside later.
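As a rough illustration of the difference between those two starting points, the sketch below contrasts the data you would assemble for each: raw domain text for continued pre-training, explicit prompt/response pairs for supervised fine-tuning. The file names, clause snippets, and JSONL field names are assumptions for illustration; each tuning service expects its own schema.

```python
# Sketch of the two data shapes: domain adaptation vs. task-specific tuning.
import json

# Continued pre-training: plain domain text, still trained on next-word prediction.
with open("legal_corpus.txt", "w") as f:
    f.write("A contract generally requires offer, acceptance, and consideration. ...\n")

# Supervised fine-tuning (SFT): the specific task, framed as prompt/response pairs.
sft_examples = [
    {"prompt": "Classify the clause: 'Either party may terminate with 30 days notice.'",
     "response": "termination clause"},
    {"prompt": "Classify the clause: 'Fees are due within 45 days of invoice.'",
     "response": "payment terms"},
]
with open("legal_sft.jsonl", "w") as f:
    for ex in sft_examples:
        f.write(json.dumps(ex) + "\n")
```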
10:51Peter D., I feel like we've ignored you a bit here, but the examples that Peter G. gave
11:00were with use cases where basically quality and accuracy can be defined pretty concretely.
11:05A legal chatbot you can evaluate against an LSAT; for a classifier, you can measure accuracy
11:10and precision.
11:11What about use cases where quality is defined more ambiguously?
11:16Yeah, it's interesting.
11:17So I don't know if this is a West Coast thing, but we had a bunch of startups come to Google
11:21a few weeks ago who, they're trying to solve this personal companion problem.
11:24And there seems to be a lot of VC money in that.
11:27Is that a West Coast thing, by the way, for you guys on the East Coast?
11:30So anyway, we had this little experiment, right?
11:32So let's see if we can fine tune a Gemini model to be like Sherlock Holmes or Elizabeth
11:36Bennett.
11:38And so we ran this experiment where we tuned a Sherlock Holmes with about 10,000 examples,
11:42right?
11:43And there was this really bizarre phenomenon where this fine tuned Sherlock Holmes didn't
11:48seem to know that he was in a book, right?
11:51And so he would answer in the first person, which was a little bit strange.
11:54If you do the same thing with vanilla Gemini, you'll notice that Gemini speaks as Gemini
11:58with a little bit of Sherlock Holmes lipstick, essentially.
12:02But one of the funny things was, though, is that how do you evaluate this fine tuned Sherlock
12:06Holmes versus a vanilla one?
12:09And is it enough just to say, hey, Holmes, by the way, do you live at 221B Baker Street?
12:14And it turns out it's not.
12:16And especially when you're talking about this AI-as-companion sort of domain, I think there
12:21are a lot of these subtle issues like, is this character personable?
12:25Does this character scratch my itch for some definition of itch?
12:30And for that sort of thing, maybe it turns out you need a human in the loop sort of evaluating
12:37fine tuned Holmes in this case versus vanilla Holmes.
12:39So are you telling me that fine tuning is a little bit like method acting for LLMs?
12:44I think so.
12:45I think so.
12:46But, you know, the funny thing, by the way, is that it turns out you can fine
12:49tune a model with like 400 to 500 examples.
12:52And in this Holmes case, I think we took about 10,000 examples.
12:55And it could have been that some of those examples were actually lower quality.
12:58And so just to give you an example, I just said, hey, Holmes, you know, can you tell
13:01me a little bit about rugby?
13:03And he said, you know what?
13:04I can't tell you.
13:05I've never played rugby.
13:06Basically he said, no, you know, I can't tell you.
13:08And I was just thinking, you know, if we had trained this Holmes on fewer high quality
13:11examples, we might have got a better result.
13:14And so there's this funny thing where more data is not necessarily better.
13:19Right?
13:20Right.
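One way to run the human-in-the-loop comparison described here is a blind pairwise preference test: show a rater the same prompt answered by the fine-tuned persona and by the vanilla model, and count which reply they prefer. The sketch below stubs both model calls out as hypothetical functions just to show the shape of the harness.

```python
# Blind pairwise preference harness; the two model calls are hypothetical stubs.
import random

def fine_tuned_holmes(prompt: str) -> str:        # stub for the tuned persona
    return "I observed the mud on your boots at once."

def vanilla_model_as_holmes(prompt: str) -> str:  # stub for the base model
    return "As Sherlock Holmes might say, deduction is the key."

prompts = ["Tell me about rugby.", "Where do you live?", "Describe your last case."]
wins = {"fine_tuned": 0, "vanilla": 0}

for prompt in prompts:
    candidates = [("fine_tuned", fine_tuned_holmes(prompt)),
                  ("vanilla", vanilla_model_as_holmes(prompt))]
    random.shuffle(candidates)  # blind the rater to which model produced which reply
    print(f"\nPrompt: {prompt}")
    for label, (_, reply) in zip("AB", candidates):
        print(f"  {label}: {reply}")
    choice = input("Which reply is more in character, A or B? ").strip().upper()
    winner = candidates[0][0] if choice == "A" else candidates[1][0]
    wins[winner] += 1

print(wins)
```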
13:21I think that's a really great point.
13:22And before we move on from customizability as a topic, because there's so much more to
13:27cover, I do want to step outside this bubble of fine tuning an LLM for a single task and
13:32maybe talk about how LLMs are being extended for more complex workflows where they're operating
13:38asynchronously, even autonomously.
13:40Peter, I know you have a lot of experience with this.
13:43If you could share your insights.
13:46So, yeah, we did this interesting experiment a few weeks ago where we trained an LLM to
13:54be an asynchronous day trading bot.
13:56And just to show that I had some skin in the game, I threw a thousand bucks at it.
14:00And the funny thing is I made about three bucks.
14:03And so I don't know if I can tell you, but I have a 0.3% return, which is nice.
14:08But the funny thing is that even out of the box, using this thing called function calling,
14:14the LLM will actually learn how to act as an autonomous agent.
14:19And in order to pull that off, we had to do some classification tasks, like, you know,
14:23given these tweets, given these news headlines, are these bullish, are these bearish?
14:27And the funny thing was every single tweet was bearish.
14:30I'm not sure why.
14:32And it always wanted to spend at least half of my money.
14:36But, you know, one of the things I was thinking is if you did something like
14:40a backtesting algorithm and maybe trained the model on the last year of market data, you'd
14:44get an even better result.
14:46But I was kind of amazed what you could do out of the box.
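The skeleton below sketches the function-calling loop behind an agent like that: the model proposes a structured tool call, your code executes it, and the result goes back to the model for its next decision. The model call here is a hypothetical stub; real SDKs (Gemini function calling among them) hand you an equivalent structured call to dispatch.

```python
# Sketch of a function-calling dispatch loop; call_model is a hypothetical stub.
import json

def classify_headline(headline: str) -> str:
    """Toy sentiment tool the model can request."""
    return "bearish" if "falls" in headline.lower() else "bullish"

def place_order(symbol: str, usd: float) -> str:
    """Toy order placement; a real agent would call a brokerage API here."""
    return f"ordered ${usd:.2f} of {symbol}"

TOOLS = {"classify_headline": classify_headline, "place_order": place_order}

def call_model(messages):
    # Hypothetical stand-in for the LLM: it returns which tool to call and with what args.
    return {"function": "classify_headline",
            "args": {"headline": "Acme falls after earnings miss"}}

messages = [{"role": "user", "content": "Should I buy Acme today?"}]
step = call_model(messages)                        # model proposes a tool call
result = TOOLS[step["function"]](**step["args"])   # we execute it
messages.append({"role": "tool", "content": json.dumps({"result": result})})
print(result)  # the result is fed back to the model for its next decision
```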
14:48Awesome.
14:49Well, I'm going to responsibly pivot this conversation to factuality, because I think
14:53it's very relevant.
14:56Factuality is very important to enterprises.
14:59Recently an airline's chatbot famously hallucinated a false refund policy.
15:05They're now being sued as a result of that.
15:08Could Peter G. perhaps describe to us what's happening under the hood when a chatbot hallucinates?
15:13And then we can discuss some approaches to dealing with that.
15:17That's a great question.
15:18So I think what's going on is exactly what we were talking about at the start of this
15:21talk was the model is just trying to do next word prediction.
15:25In some cases, the model is very certain about the next word.
15:28If you imagine a probability distribution over all the possible tokens, you get a really
15:32spiky distribution.
15:33In other cases, it might be much less certain.
15:35So I think that's one dynamic at play.
15:37The second dynamic that I think is really interesting is in a lot of cases, these models
15:41are trained to be helpful.
15:42And so after the pre-training stage, there's often a phase known as instruction tuning,
15:47where that's exactly what you're doing.
15:48You're coaching the model, you're instructing the model, you're teaching the model how to
15:51be helpful, how to give results, how to follow direction.
15:55And so in a case where the model is unsure, especially if you've instruction tuned, instead
16:00of just simply saying, I don't know, or I'm not sure, the model might try to hallucinate
16:05something or make something up to try to be helpful and answer your question.
16:10And what are some more advanced approaches that people can use to deal with hallucination?
16:15There's a couple things that we recommend.
16:17And so one is using a technique that you all might be familiar with, known as retrieval
16:20augmented generation.
16:21And the idea there is that you want to use language models and a database together to
16:26solve the problem.
16:27And so you let language models do what language models are good at, which is generate natural
16:31language.
16:32And you let databases do what databases are good at, which is store, update, and delete
16:37facts.
16:38Then you train the language model to be able to retrieve the relevant information from
16:41the database and give an answer based on that context.
16:43And so in the airline example, hopefully in that case, it would have retrieved the actual
16:48refund policy and then maybe massaged it or summarized it or used that to answer the question.
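Here is a minimal sketch of that retrieval-augmented pattern applied to the refund-policy example. The policy snippets are invented placeholders, and the retriever is naive keyword overlap purely to keep the example self-contained; production systems typically use embeddings and a vector store instead.

```python
# Minimal RAG sketch: retrieve the relevant policy, answer only from that context.
POLICY_DOCS = {
    "refunds": "Refunds are available within 24 hours of booking for any fare.",
    "baggage": "Two checked bags are included on international routes.",
    "changes": "Flight changes incur a fee except on flexible fares.",
}

def retrieve(question: str) -> str:
    # Toy retriever: pick the document sharing the most words with the question.
    def overlap(doc: str) -> int:
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return max(POLICY_DOCS.values(), key=overlap)

def build_prompt(question: str) -> str:
    context = retrieve(question)
    return ("Answer using ONLY the policy below. If it does not cover the question, "
            "say you are not sure.\n\n"
            f"Policy: {context}\n\nQuestion: {question}")

# The resulting prompt is what you would send to the language model.
print(build_prompt("Is a refund available within 24 hours of booking?"))
```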
16:54Another hot topic is guardrails for LLMs, if you could just briefly touch on that as
16:57well.
16:58Yeah, 100%.
16:59And so this is a topic that's super important.
17:01It's not just important for generative applications; it's important any time you're building
17:05a machine learning system.
17:06But the idea is that frequently you take a stochastic machine learning model that's always
17:10going to have a little bit of randomness in it, and then apply a policy layer or a set
17:13of guardrails on top.
17:15And so in the large language model day trading case, you might do something like, no matter
17:23how good the market looks, don't spend more than 10% of my money.
17:26Or no matter how good the market looks, don't put all of my money on GameStop.
17:30And hopefully that might limit the output space that you're thinking about and control
17:35the behavior a little bit.
17:36There's a really interesting case where you can also use LLMs to help evaluate that policy.
17:41And so you might have a layer on top that says, is this in the voice of the company?
17:46And give some examples of the voice of the company.
17:48Or is this a helpful statement?
17:50Is this a short and accurate response?
17:52And that could be another way to use LLMs as a policy layer as well.
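A policy layer like that can be plain deterministic code sitting between the model's proposal and any real action. The sketch below applies the two example rules from the trading scenario; the 10% cap and the blocked ticker are illustrative assumptions.

```python
# Sketch of a guardrail layer: the stochastic model proposes, policy code decides.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProposedTrade:
    symbol: str
    usd: float

MAX_FRACTION_OF_FUNDS = 0.10   # "no matter how good the market looks, cap it at 10%"
BLOCKED_SYMBOLS = {"GME"}      # "don't put all of my money on GameStop"

def apply_guardrails(trade: ProposedTrade, available_funds: float) -> Optional[ProposedTrade]:
    if trade.symbol in BLOCKED_SYMBOLS:
        return None  # refuse the action outright
    cap = available_funds * MAX_FRACTION_OF_FUNDS
    # Clamp rather than reject: keep the model's intent but bound the downside.
    return ProposedTrade(trade.symbol, min(trade.usd, cap))

print(apply_guardrails(ProposedTrade("ACME", 600.0), available_funds=1000.0))
# -> ProposedTrade(symbol='ACME', usd=100.0)
```

The LLM-as-judge check described above ("is this in the voice of the company?") slots into the same position, just with a model call instead of an if-statement.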
17:57It's worth noting, though, that factuality and these approaches, especially guardrails,
18:02can sometimes come at the cost of the user experience.
18:06Peter D., given your experience working with startups, especially these highly creative
18:10AI personas, perhaps you can share some insight on how this balance between factuality and
18:16creativity is met.
18:19This is a really interesting phenomenon.
18:20So I noticed when startups come in now, one of the first things they do with the LLM is
18:24they turn off all of the safety features.
18:27And that's because there's this bizarre sort of optimization problem between safety and
18:33utility.
18:34And there are these cases where, just to give you an example, somebody wanted to do some
18:37multimodal analysis on monuments.
18:40And they couldn't about 75% of the time because there was like a human face in the picture.
18:46And so that's one of these really sort of subtle dances.
18:49Because it's possible, for instance, that you could inadvertently
18:54fine-tune a toxic model, turn off the safety filters, and all of a sudden maybe
18:58there's sort of an embarrassing moment with your customers.
19:02And so anyway, there's this really subtle dance between safety and utility.
19:07And I think as a startup, maybe that's just one of the things that you have to be aware
19:09of when you go to market, right?
19:12Awesome.
19:13For the purposes of time, I do want to shift to data privacy, because this is absolutely
19:17key for enterprises.
19:18I mentioned those two top selection criteria for model providers.
19:22Data privacy is actually number one.
19:25Peter G., businesses are very concerned about training models on sensitive customer data
19:35and about proprietary data being divulged through LLM prompting.
19:40Perhaps touch on what's the basis for this concern, and what are some of the approaches
19:44that enterprises can take?
19:46Absolutely.
19:47So there's a long history of data privacy and machine learning going hand-in-hand.
19:50For as long as there have been machine learning models, people have concerns about data privacy.
19:56These can be well-founded.
19:57Many of you might be familiar with the Netflix challenge from about 10 or 15 years or so
20:01ago at this point.
20:02And even with a relatively constrained output space, either a ranking problem or a classification
20:06problem (would people watch this?), researchers were able to reveal a whole bunch of sensitive
20:11information about the people in the data set.
20:13And so the first piece of advice I would give is don't ever train your model on sensitive
20:18data, whether it is a very simple classification model or whether it's a much more complicated
20:24generative model.
20:25Now, I think the reason that people are thinking about it so much in the generative case is
20:29the output space that you can produce in is much, much larger.
20:33Instead of true or false, yes or no, the model can generate free text.
20:39And so to motivate this concern a little bit, if I prompted a model, you know, Peter Grabowski's
20:44social security number is, hopefully it wouldn't be able to produce a valid response.
20:51Now let's say a company here has a product workflow that is really centered on the exchange
20:58of sensitive data.
21:00What are some approaches to enable that exchange of sensitive data without having to actually
21:03train on it?
21:04Good question.
21:06So one thing I would recommend is that retrieval augmented generation framework that we were
21:09talking about a moment ago.
21:11And that lets you store the sensitive data in a database where it can be appropriately
21:14ACL'd, and then at inference time, at prompt time, you can inject that into the model and
21:18allow it to use it in its response.
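That pattern, keeping sensitive records out of training and injecting them into the prompt only after a permission check, can be sketched in a few lines; the record store, field names, and caller identities below are illustrative stand-ins for a real access-controlled database.

```python
# Sketch: sensitive data never enters training; it is fetched per request,
# checked against the caller's permissions, and placed into the prompt.
RECORDS = {
    "acct-42": {"owner": "alice", "text": "Outstanding balance: $1,250, due June 1."},
}

def fetch_for_prompt(record_id: str, caller: str) -> str:
    record = RECORDS.get(record_id)
    if record is None or record["owner"] != caller:
        raise PermissionError("caller is not allowed to see this record")
    return record["text"]

context = fetch_for_prompt("acct-42", caller="alice")
prompt = f"Using only this account record, answer the customer's question.\n{context}"
print(prompt)
```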
21:20Awesome.
21:21Now, just a tidbit I would add here is, you know, for folks that are concerned about data
21:26privacy regulations like GDPR and HIPAA, RAG is very complementary in the sense that with
21:32a database you can easily permanently delete data, of course.
21:36It's just a question of deleting rows and tables.
21:40Additionally, you can localize that database so you can ensure that data is not transferred
21:44outside of a specified geographic region.
21:46Both of those are very important to things like GDPR.
21:49There's another lens to this though, Peter G, that I want to get your insight into, which
21:53is the mistrust businesses, I think especially startups, have of closed-source model providers
22:00and of using their models, because they're concerned that the logs from that model, sensitive data,
22:06will be used to actually train the closed-source provider's model.
22:10No, you're absolutely right.
22:11And I think to that extent, you know, startups tend to go with something like a Llama 2 stack,
22:15maybe a Gemma stack, a Mistral stack, because they can run a couple of GPUs and they control
22:19the entire thing from beginning to end.
22:21But what I've noticed, though, is that some startups tend to be using something
22:25called a long context window as an ad hoc form of RAG.
22:28And what that means is there's this promiscuous intermingling of kind of inference and possibly
22:33training data.
22:34And that gets dangerous when you're talking about things like law and, you know, insurance
22:38type of matters.
22:40So I think just having RAG is a form of data discipline, right?
22:43And so even if you're running your own open source models, you can still run into privacy
22:47issues if you're not careful.
22:49But I think also that's something that we're trying to do.
22:51You know, with Vertex AI, one of the pitches that
22:55they're making is that your data is safe with Google, right?
22:58And, you know, I think that's at least how we're trying to differentiate ourselves, right?
23:03Awesome.
23:04So, look, both of you, my brilliant colleagues, there is absolutely no way that we can cover
23:12everything you need to know about how to productionize an LLM in 25 minutes.
23:16But you've done a really good job.
23:19And I just want to thank everyone for listening.
23:21I want to thank Peter and Peter.
23:24And if you do have any follow-up questions, if you didn't understand something, we're
23:27going to be in the Media Lab.
23:29So please feel free to come up to us and ask questions.
23:32And I hope you enjoy the rest of the program.
23:34Thanks.
23:35Thanks.
23:36Thanks.
23:37Thanks.