Three Google engineers, James Rubin, Peter Danenberg and Peter Grabowski, discuss what they've learned so far working on Google's Gemini AI and what's to come next.
Forbes covers the intersection of entrepreneurship, wealth, technology, business and lifestyle with a focus on people and success.
Category: Tech

Transcript
00:00Well, hello everyone.
00:02I'm James Rubin.
00:03I'm going to be the moderator for this fireside chat with my esteemed Google colleagues, Peter
00:09Grabowski and Peter Danenberg.
00:11We're going to be touching on the key blockers and solutions to productionizing enterprise-ready
00:17LLMs and talk about some of the state-of-the-art approaches to really solving those challenges.
00:24We want to really focus on the practical applications today, and we hope this is really a jumping-
00:28off point for folks.
00:30It's a conversation starter, and we'll be in the Media Lab later if we have
00:33any more deep-dive questions.
00:36But I think before we dive in, maybe we'll start with some introductions.
00:40I can go first.
00:41I'm James Rubin.
00:42I'm a product manager at Google for a Gemini applied research team.
00:48I work with Peter Grabowski.
00:50I'm his product counterpart.
00:53Prior to Google, I was at Amazon as a PM for the better part of three years.
00:58Worked across the AI stack, beginning with Zoox, which is their self-driving subsidiary,
01:03but also on custom AI chips and machine learning services at AWS.
01:08With that, I guess I'll pass it off to Peter Danenberg.
01:11Hi.
01:12I'm Peter Danenberg.
01:13I work on Gemini, which was formerly Bard, and before that was Assistant.
01:23And I've been a senior SWE for a while, currently working on Gemini extensions.
01:29My name is Peter Grabowski.
01:30I am lucky enough to work with James on the Gemini applied research team.
01:34I've been at Google for about 10 years.
01:36Came in through the Nest acquisition, then spent some time working on the Google Assistant.
01:40And in the evenings, I moonlight as a teacher.
01:42So I'm on the faculty of UC Berkeley's master's in data science program and teach deep learning
01:46with natural language processing.
01:49And one thing I'll add there, Peter is too humble to say it, but during really the height
01:54of the AI craze in early 2023, he created and taught an LLM bootcamp that has really
02:00become the foundational course for Google.
02:02It's been taken by tens of thousands of Googlers, including myself.
02:05So we definitely have two incredible experts on the stage today.
02:11Before we dive in, though, I want to discuss kind of what the motivation for this talk
02:15was.
02:16A recent survey from Andreessen showed that the overwhelming majority of enterprises adopting
02:21AI are choosing to build AI apps in-house on top of common foundation models, as opposed
02:29to leveraging B2B AI software off the shelf.
02:33There may be a founder in the audience today that disrupts that trend.
02:36But regardless, the thing that we're most interested in is this disconnect between that
02:41enthusiasm to build and this hesitance and slowness that we're seeing with enterprises
02:47to deploy LLMs externally in production.
02:51Internal apps we're seeing, testing we're seeing, use cases for employee productivity;
02:55but when it comes to external apps, comparatively, they lag behind.
02:59So we really want to keep that in focus today, and we want to focus on ways that perhaps
03:04we can make the process of productionizing these LLMs a bit more clear.
03:09So with that, I thought we'd just level set with everyone and maybe start with the basics.
03:13Peter G., how would you describe an LLM, and why should businesses care?
03:19Happy to.
03:20I'll give a really simple example, but hopefully it's got some strong pedagogic value.
03:25LLMs are really nothing more than fancy autocomplete.
03:27You might have heard that metaphor before.
03:29And so if I said, or if I gave the audience a prompt, I went to go see a baseball game
03:33last night, I got to see the Boston Red...
03:37Hopefully everybody's thinking "Sox."
03:39That's at their core what LLMs are doing.
03:42Now with short context windows for just a little bit of data, that's not all that interesting.
03:46But when you give these models billions of parameters to learn
03:50with and start showing them hundreds of thousands or millions or billions of examples, with longer
03:54and longer context windows, you start to see some really interesting behavior emerge.
04:01And I think one thing that we may want to expand on a bit is the fact that LLMs aren't
04:07just good for next word prediction and chat, but actually can be used for a wide range
04:10of traditional ML approaches as well.
04:12So perhaps just unpack that for us and also talk about the ways in which tasks can be
04:17reframed to work with LLMs.
04:19And so that's one of the things that's super interesting.
04:21Once you get into the couple billion parameter size, you start to see these fascinating properties
04:26emerge.
04:27And so if you've heard the term zero shot or few shot learning, that's what folks are
04:30talking about.
04:31If any folks in the audience either have a two-year-old or are ML engineers, this is
04:35the answer to the question you might have had, which is: why can I show my two-year-old
04:38a picture of a zebra, maybe two or three of them,
04:41and then on the fourth, she's able to correctly identify what a zebra is?
04:45But with more traditional machine learning techniques, you need to show it 10, 20, or
04:4930 thousand images of a zebra.
04:51So I think that's what James is alluding to.
04:53The other thing that's really interesting is you can start to recast many of these traditional
04:58ML problem framings as next word prediction problems.
05:01So if I gave it a sentence and then gave it the prompt, the sentiment of this sentence
05:05is blank, it would happily fill in positive, negative, neutral, happy, sad.
05:10And all of a sudden, you've reframed a classification problem into a next word prediction problem.
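To make that reframing concrete, here is a minimal sketch of sentiment classification posed as a fill-in-the-blank completion. The client library, model name, and API-key placeholder are illustrative assumptions rather than anything specified in the talk; any text-completion endpoint would work the same way.

```python
# Minimal sketch: recasting sentiment classification as next-word prediction.
# The client and model name below are assumptions for illustration only.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")             # hypothetical placeholder
model = genai.GenerativeModel("gemini-1.5-flash")   # assumed model name

def classify_sentiment(sentence: str) -> str:
    # The "classifier" is just a completion prompt: the model fills in the blank.
    prompt = (
        "Answer with one word: positive, negative, or neutral.\n"
        f"Sentence: {sentence}\n"
        "The sentiment of this sentence is"
    )
    response = model.generate_content(prompt)
    return response.text.strip().lower()

print(classify_sentiment("The seats were cramped, but the game went to extra innings!"))
```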
05:15Awesome.
05:16So basically, no excuse for businesses not to think creatively about how to apply LLMs
05:20to their use cases.
05:22Plenty of solution space exploration there.
05:25Peter D., I want to get your insights on this, especially how LLMs compare with these traditional
05:31ML approaches in terms of performance and how people are using them.
05:35So one of the things we do every couple of weeks is we bring startups to Google to just
05:38ask how they're using AI.
05:40What are the difficulties they're having with Gemini, this and that?
05:43And so a couple of weeks ago, I learned that a lot of startups are doing this thing where
05:47they basically train a bunch of baby models, something like a Gemma 2B model, on things
05:51like classification tasks.
05:52So they can go to market in something like six to eight weeks, whereas previously, even
05:56just to train a trivial model as a classifier would take six to 12 months.
06:01So we've been seeing this quick time to market using baby LLMs as classification engines,
06:06trained on maybe tens of examples, which is incredible.
06:09And actually, this is a great way to get us back from that slight digression, which
06:13is what is the shift in LLMs that makes them especially attractive to businesses?
06:18Yeah.
06:19That's one of the things my team and I are really excited about, which is all of a sudden,
06:23to train a classifier, to train a model of a fixed quality, the amount of time that it
06:27takes, the amount of data that it takes, the amount of expertise that it takes, the amount
06:31of compute that it takes, has fallen dramatically.
06:34So that's one of the things we're exploring on our team.
06:36Absolutely.
06:37And I think there's actually one important one we missed, which is customizability, right?
06:41The ability to tune and align models to a specific task or domain.
06:46Businesses have vertical use cases, specific customer problems they're trying to solve.
06:49So this is incredibly important.
06:52And I actually want to drill down on that a bit further and get your insights.
06:56Data shows that customizability is one of the top two selection criteria for enterprises
07:01selecting a model provider.
07:04But the process for going about customization is very complex.
07:08There's many different tuning techniques.
07:11There's many quality and cost trade-offs.
07:13It's very difficult to get to the output that you want.
07:16So perhaps, starting with Peter G., give businesses a starting point to navigate
07:22this complexity.
07:23Yeah.
07:24100%.
07:25So I saw this in my own work.
07:26I worked on the Google Assistant for a number of years.
07:28One of the things that we were focused on is building a sentence simplification engine
07:31for kids.
07:32And so if you ask, why is the sky blue?
07:34And you're an adult, you might get an answer like refraction in the ionosphere.
07:38But if you're a kid, that's not a satisfying answer.
07:40You want something like it bounces off a water drop that's in the sky.
07:43And so I spent about, like you were saying, six to eight months trying to build a model
07:47with Google Research and launch it into production.
07:50We were able to build something that works, but it wasn't high enough quality to ship.
07:54Fast forward to a year ago, with the few-shot prompting techniques that we were talking
07:58about, I was able to build something that blew the model we had built five years prior
08:02out of the water.
08:03And so to segue, that's the advice that I would give to businesses.
08:06Think about the problem that you want to focus on and solve using a large model, and then
08:10just get started.
08:11You can start by asking a model a question, just like we were talking about a moment ago.
08:16So let's say they set up a sandbox.
08:18They run an internal pilot.
08:20They've got their metrics set, their gathering data.
08:24It performs really well on general tasks, but it's not quite ready to specialize in
08:29their domain.
08:30So they're trying to replace some of their domain-specific workflows.
08:34What are some more advanced approaches they can now take?
08:37And you touched on one thing that I think is really important to emphasize, which is
08:40make sure that you've got metrics in place so that you can measure when you're improving.
08:44And so in this case, maybe we can talk about a hypothetical example of a legal startup.
08:50Maybe you want your chatbot or your agent to talk and sound like a lawyer.
08:54The first thing you might do is try what's known as role prompting, which is just telling
08:57the model to talk like a lawyer.
09:00From there, I would, again, evaluate, measure, see how you're doing.
09:04And if it's not where you want it to be, there's a couple other techniques you can try.
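As a quick sketch of that role-prompting step for the legal example, something like the following is usually enough to get started; the client, model name, and instruction wording are illustrative assumptions, and if your client has no system-instruction field, prepending the same text to the prompt works too.

```python
# Minimal sketch of role prompting for the legal-assistant example.
# Client, model name, and instruction text are illustrative assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder
model = genai.GenerativeModel(
    "gemini-1.5-flash",  # assumed model name
    system_instruction=(
        "You are a careful corporate lawyer. Answer in precise, formal legal "
        "language and flag anything that needs review by licensed counsel."
    ),
)

answer = model.generate_content("Can we terminate this vendor contract early?")
print(answer.text)
```

From there, as Peter G. says, measure against your metrics before reaching for heavier techniques.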
09:09Definitely dive into those, especially with regards to domain knowledge and domain specificity.
09:14Sure.
09:15So the next thing I would think about trying is a family of techniques
09:18known as domain adaptation.
09:20So the first thing you might try is continued pre-training.
09:23So what you're doing is you're taking that language modeling task that you started with,
09:26predict the next word.
09:27You're using backpropagation to update your weights.
09:29But you're focusing it on a corpus of data that's relevant to your domain.
09:33And so for instance, in this legal example, you might do continued pre-training on a corpus
09:39of law textbooks.
09:40To give a human analogy, that's like telling a first year law student to go read 50 legal
09:45textbooks and come back and talk to me more like a lawyer.
09:49Awesome.
09:50And what about things like classification, given we mentioned it earlier, or chat, where
09:55we're talking about really adapting the task that the LLM is doing rather than the domain
09:59background?
10:00That's a great question.
10:01So if you're asking the model to make, I don't know, a decision about some sort of case law
10:05or something like that, continued pre-training can absolutely help.
10:09You might decide to focus it on specific examples of the task you want it to do.
10:13And so if it's a classification problem, you would train it using backpropagation, using
10:17the next word prediction task, to focus specifically on that task.
10:23And so in that context, it's usually known as supervised fine tuning.
10:27Awesome.
10:28So to summarize, for domain knowledge, domain specificity, continued pre-training is a good
10:33place for people just to start.
10:35If you're looking to improve a very specific problem framing, very specific task, SFT is
10:41a good place to start.
10:42But there are, as we know and you know more than most, there's many other techniques to
10:47explore within that.
10:48But we can talk about that maybe outside later.
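As a rough illustration of the difference between those two starting points, the sketch below contrasts the data you would assemble for each: raw domain text for continued pre-training, explicit prompt/response pairs for supervised fine-tuning. The file names, clause snippets, and JSONL field names are assumptions for illustration; each tuning service expects its own schema.

```python
# Sketch of the two data shapes: domain adaptation vs. task-specific tuning.
import json

# Continued pre-training: plain domain text, still trained on next-word prediction.
with open("legal_corpus.txt", "w") as f:
    f.write("A contract generally requires offer, acceptance, and consideration. ...\n")

# Supervised fine-tuning (SFT): the specific task, framed as prompt/response pairs.
sft_examples = [
    {"prompt": "Classify the clause: 'Either party may terminate with 30 days notice.'",
     "response": "termination clause"},
    {"prompt": "Classify the clause: 'Fees are due within 45 days of invoice.'",
     "response": "payment terms"},
]
with open("legal_sft.jsonl", "w") as f:
    for ex in sft_examples:
        f.write(json.dumps(ex) + "\n")
```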
10:51Peter D., I feel like we've ignored you a bit here, but the examples that Peter G. gave
11:00were with use cases where basically quality and accuracy can be defined pretty concretely.
11:05A legal chatbot you can evaluate against an LSAT; for a classifier, you can measure accuracy
11:10and precision.
11:11What about use cases where quality is defined more ambiguously?
11:16Yeah, it's interesting.
11:17So I don't know if this is a West Coast thing, but we had a bunch of startups come to Google
11:21a few weeks ago who, they're trying to solve this personal companion problem.
11:24And there seems to be a lot of VC money in that.
11:27Is that a West Coast thing, by the way, for you guys on the East Coast?
11:30So anyway, we had this little experiment, right?
11:32So let's see if we can fine tune a Gemini model to be like Sherlock Holmes or Elizabeth
11:36Bennett.
11:38And so we ran this experiment where we tuned a Sherlock Holmes with about 10,000 examples,
11:42right?
11:43And there was this really bizarre phenomenon where this fine tuned Sherlock Holmes didn't
11:48seem to know that he was in a book, right?
11:51And so he would answer in the first person, which was a little bit strange.
11:54If you do the same thing with vanilla Gemini, you'll notice that Gemini speaks as Gemini
11:58with a little bit of Sherlock Holmes lipstick, essentially.
12:02But one of the funny things was, though, is that how do you evaluate this fine tuned Sherlock
12:06Holmes versus a vanilla one?
12:09And is it enough just to say, hey, Holmes, by the way, do you live at 221B Baker Street?
12:14And it turns out it's not.
12:16And especially when you're talking about this AI-as-companion sort of domain, I think there
12:21are a lot of these subtle issues like, is this character personable?
12:25Does this character scratch my itch for some definition of itch?
12:30And for that sort of thing, maybe it turns out you need a human in the loop sort of evaluating
12:37fine tuned Holmes in this case versus vanilla Holmes.
12:39So are you telling me that fine tuning is a little bit like method acting for LLMs?
12:44I think so.
12:45I think so.
12:46But, you know, the funny thing, by the way, is that it turns out you can fine
12:49tune a model with like 400 to 500 examples.
12:52And in this Holmes case, I think we took about 10,000 examples.
12:55And it could have been that some of those examples were actually lower quality.
12:58And so just to give you an example, I just said, hey, Holmes, you know, can you tell
13:01me a little bit about rugby?
13:03And he said, you know what?
13:04I can't tell you.
13:05I've never played rugby.
13:06Basically he said, no, you know, I can't tell you.
13:08And I was just thinking, you know, if we had trained this Holmes on fewer high quality
13:11examples, we might have got a better result.
13:14And so there's this funny thing where more data is not necessarily better.
13:19Right?
13:20Right.
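One way to run the human-in-the-loop comparison described here is a blind pairwise preference test: show a rater the same prompt answered by the fine-tuned persona and by the vanilla model, and count which reply they prefer. The sketch below stubs both model calls out as hypothetical functions just to show the shape of the harness.

```python
# Blind pairwise preference harness; the two model calls are hypothetical stubs.
import random

def fine_tuned_holmes(prompt: str) -> str:        # stub for the tuned persona
    return "I observed the mud on your boots at once."

def vanilla_model_as_holmes(prompt: str) -> str:  # stub for the base model
    return "As Sherlock Holmes might say, deduction is the key."

prompts = ["Tell me about rugby.", "Where do you live?", "Describe your last case."]
wins = {"fine_tuned": 0, "vanilla": 0}

for prompt in prompts:
    candidates = [("fine_tuned", fine_tuned_holmes(prompt)),
                  ("vanilla", vanilla_model_as_holmes(prompt))]
    random.shuffle(candidates)  # blind the rater to which model produced which reply
    print(f"\nPrompt: {prompt}")
    for label, (_, reply) in zip("AB", candidates):
        print(f"  {label}: {reply}")
    choice = input("Which reply is more in character, A or B? ").strip().upper()
    winner = candidates[0][0] if choice == "A" else candidates[1][0]
    wins[winner] += 1

print(wins)
```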
13:21I think that's a really great point.
13:22And before we move on from customizability as a topic, because there's so much more to
13:27cover, I do want to step outside this bubble of fine tuning an LLM for a single task and
13:32maybe talk about how LLMs are being extended for more complex workflows where they're operating
13:38asynchronously, even autonomously.
13:40Peter, I know you have a lot of experience with this.
13:43If you could share your insights.
13:46So, yeah, we did this interesting experiment a few weeks ago where we trained an LLM to
13:54be an asynchronous day trading bot.
13:56And just to show that I had some skin in the game, I threw a thousand bucks at it.
14:00And the funny thing is I made about three bucks.
14:03And so I don't know if I can tell you, but I have a 0.3% return, which is nice.
14:08But the funny thing is that even out of the box, using this thing called function calling,
14:14the LLM will actually learn how to act as an autonomous agent.
14:19And in order to pull that off, we had to do some classification tasks, like, you know,
14:23given these tweets, given these news headlines, are these bullish, are these bearish?
14:27And the funny thing was every single tweet was bearish.
14:30I'm not sure why.
14:32And it always wanted to spend at least half of my money.
14:36But, you know, one of the things I was thinking is if you did something like
14:40a backtesting algorithm and maybe trained the model on the last year of market data, you'd
14:44get an even better result.
14:46But I was kind of amazed what you could do out of the box.
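The skeleton below sketches the function-calling loop behind an agent like that: the model proposes a structured tool call, your code executes it, and the result goes back to the model for its next decision. The model call here is a hypothetical stub; real SDKs (Gemini function calling among them) hand you an equivalent structured call to dispatch.

```python
# Sketch of a function-calling dispatch loop; call_model is a hypothetical stub.
import json

def classify_headline(headline: str) -> str:
    """Toy sentiment tool the model can request."""
    return "bearish" if "falls" in headline.lower() else "bullish"

def place_order(symbol: str, usd: float) -> str:
    """Toy order placement; a real agent would call a brokerage API here."""
    return f"ordered ${usd:.2f} of {symbol}"

TOOLS = {"classify_headline": classify_headline, "place_order": place_order}

def call_model(messages):
    # Hypothetical stand-in for the LLM: it returns which tool to call and with what args.
    return {"function": "classify_headline",
            "args": {"headline": "Acme falls after earnings miss"}}

messages = [{"role": "user", "content": "Should I buy Acme today?"}]
step = call_model(messages)                        # model proposes a tool call
result = TOOLS[step["function"]](**step["args"])   # we execute it
messages.append({"role": "tool", "content": json.dumps({"result": result})})
print(result)  # the result is fed back to the model for its next decision
```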
14:48Awesome.
14:49Well, I'm going to responsibly pivot this conversation to factuality, because I think
14:53it's very relevant.
14:56Factuality is very important to enterprises.
14:59Recently an airline's chatbot famously hallucinated a false refund policy.
15:05They're now being sued as a result of that.
15:08Could Peter G. perhaps describe to us what's happening under the hood when a chatbot hallucinates?
15:13And then we can discuss some approaches to dealing with that.
15:17That's a great question.
15:18So I think what's going on is exactly what we were talking about at the start of this
15:21talk was the model is just trying to do next word prediction.
15:25In some cases, the model is very certain about the next word.
15:28If you imagine a probability distribution over all the possible tokens, you get a really
15:32spiky distribution.
15:33In other cases, it might be much less certain.
15:35So I think that's one dynamic at play.
15:37The second dynamic that I think is really interesting is in a lot of cases, these models
15:41are trained to be helpful.
15:42And so after the pre-training stage, there's often a phase known as instruction tuning,
15:47where that's exactly what you're doing.
15:48You're coaching the model, you're instructing the model, you're teaching the model how to
15:51be helpful, how to give results, how to follow direction.
15:55And so in a case where the model is unsure, especially if you've instruction tuned, instead
16:00of just simply saying, I don't know, or I'm not sure, the model might try to hallucinate
16:05something or make something up to try to be helpful and answer your question.
16:10And what are some more advanced approaches that people can use to deal with hallucination?
16:15There's a couple things that we recommend.
16:17And so one is using a technique that you all might be familiar with, known as retrieval
16:20augmented generation.
16:21And the idea there is that you want to use language models and a database together to
16:26solve the problem.
16:27And so you let language models do what language models are good at, which is generate natural
16:31language.
16:32And you let databases do what databases are good at, which is store, update, and delete
16:37facts.
16:38Then you train the language model to be able to retrieve the relevant information from
16:41the database and give an answer based on that context.
16:43And so in the airline example, hopefully in that case, it would have retrieved the actual
16:48refund policy and then maybe massaged it or summarized it or used that to answer the question.
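Here is a minimal sketch of that retrieval-augmented pattern applied to the refund-policy example. The policy snippets are invented placeholders, and the retriever is naive keyword overlap purely to keep the example self-contained; production systems typically use embeddings and a vector store instead.

```python
# Minimal RAG sketch: retrieve the relevant policy, answer only from that context.
POLICY_DOCS = {
    "refunds": "Refunds are available within 24 hours of booking for any fare.",
    "baggage": "Two checked bags are included on international routes.",
    "changes": "Flight changes incur a fee except on flexible fares.",
}

def retrieve(question: str) -> str:
    # Toy retriever: pick the document sharing the most words with the question.
    def overlap(doc: str) -> int:
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return max(POLICY_DOCS.values(), key=overlap)

def build_prompt(question: str) -> str:
    context = retrieve(question)
    return ("Answer using ONLY the policy below. If it does not cover the question, "
            "say you are not sure.\n\n"
            f"Policy: {context}\n\nQuestion: {question}")

# The resulting prompt is what you would send to the language model.
print(build_prompt("Is a refund available within 24 hours of booking?"))
```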
16:54Another hot topic is guardrails for LLMs, if you could just briefly touch on that as
16:57well.
16:58Yeah, 100%.
16:59And so this is a topic that's super important.
17:01It's not just important for generative applications; it's important any time you're building
17:05a machine learning system.
17:06But the idea is that frequently you take a stochastic machine learning model that's always
17:10going to have a little bit of randomness in it, and then apply a policy layer or a set
17:13of guardrails on top.
17:15And so in the large language model day trading case, you might do something like, no matter
17:23how good the market looks, don't spend more than 10% of my money.
17:26Or no matter how good the market looks, don't put all of my money on GameStop.
17:30And hopefully that might limit the output space that you're thinking about and control
17:35the behavior a little bit.
17:36There's a really interesting case where you can also use LLMs to help evaluate that policy.
17:41And so you might have a layer on top that says, is this in the voice of the company?
17:46And give some examples of the voice of the company.
17:48Or is this a helpful statement?
17:50Is this a short and accurate response?
17:52And that could be another way to use LLMs as a policy layer as well.
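A policy layer like that can be plain deterministic code sitting between the model's proposal and any real action. The sketch below applies the two example rules from the trading scenario; the 10% cap and the blocked ticker are illustrative assumptions.

```python
# Sketch of a guardrail layer: the stochastic model proposes, policy code decides.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProposedTrade:
    symbol: str
    usd: float

MAX_FRACTION_OF_FUNDS = 0.10   # "no matter how good the market looks, cap it at 10%"
BLOCKED_SYMBOLS = {"GME"}      # "don't put all of my money on GameStop"

def apply_guardrails(trade: ProposedTrade, available_funds: float) -> Optional[ProposedTrade]:
    if trade.symbol in BLOCKED_SYMBOLS:
        return None  # refuse the action outright
    cap = available_funds * MAX_FRACTION_OF_FUNDS
    # Clamp rather than reject: keep the model's intent but bound the downside.
    return ProposedTrade(trade.symbol, min(trade.usd, cap))

print(apply_guardrails(ProposedTrade("ACME", 600.0), available_funds=1000.0))
# -> ProposedTrade(symbol='ACME', usd=100.0)
```

The LLM-as-judge check described above ("is this in the voice of the company?") slots into the same position, just with a model call instead of an if-statement.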
17:57It's worth noting, though, that factuality and these approaches, especially guardrails,
18:02can sometimes come at the cost of the user experience.
18:06Peter D., given your experience working with startups, especially these highly creative
18:10AI personas, perhaps you can share some insight on how this balance between factuality and
18:16creativity is met.
18:19This is a really interesting phenomenon.
18:20So I noticed when startups come in now, one of the first things they do with the LLM is
18:24they turn off all of the safety features.
18:27And that's because there's this bizarre sort of optimization problem between safety and
18:33utility.
18:34And there are these cases where, just to give you an example, somebody wanted to do some
18:37multimodal analysis on monuments.
18:40And they couldn't about 75% of the time because there was like a human face in the picture.
18:46And so that's one of these really sort of subtle dances.
18:49Because it's possible, for instance, that you could inadvertently
18:54fine-tune a toxic model, turn off the safety filters, and all of a sudden maybe
18:58there's sort of an embarrassing moment with your customers.
19:02And so anyway, there's this really subtle dance between safety and utility.
19:07And I think as a startup, maybe that's just one of the things that you have to be aware
19:09of when you go to market, right?
19:12Awesome.
19:13For the purposes of time, I do want to shift to data privacy, because this is absolutely
19:17key for enterprises.
19:18I mentioned those two top selection criteria for model providers.
19:22Data privacy is actually number one.
19:25Peter G., businesses are very concerned about training models on sensitive customer data
19:35and about proprietary data being divulged through LLM prompting.
19:40Perhaps touch on what's the basis for this concern, and what are some of the approaches
19:44that enterprises can take?
19:46Absolutely.
19:47So there's a long history of data privacy and machine learning going hand-in-hand.
19:50For as long as there have been machine learning models, people have concerns about data privacy.
19:56These can be well-founded.
19:57Many of you might be familiar with the Netflix challenge from about 10 or 15 years or so
20:01ago at this point.
20:02And even with a relatively constrained output space, either a ranking problem or a classification
20:06problem (would people watch this?), researchers were able to reveal a whole bunch of sensitive
20:11information about the people in the data set.
20:13And so the first piece of advice I would give is don't ever train your model on sensitive
20:18data, whether it is a very simple classification model or whether it's a much more complicated
20:24generative model.
20:25Now, I think the reason that people are thinking about it so much in the generative case is
20:29the output space that you can produce in is much, much larger.
20:33Instead of true or false, yes or no, the model can generate free text.
20:39And so to motivate this concern a little bit, if I prompted a model, you know, Peter Grabowski's
20:44social security number is, hopefully it wouldn't be able to produce a valid response.
20:51Now let's say a company here has a product workflow that is really centered on the exchange
20:58of sensitive data.
21:00What are some approaches to enable that exchange of sensitive data without having to actually
21:03train on it?
21:04Good question.
21:06So one thing I would recommend is that retrieval augmented generation framework that we were
21:09talking about a moment ago.
21:11And that lets you store the sensitive data in a database where it can be appropriately
21:14ACL'd, and then at inference time, at prompt time, you can inject that into the model and
21:18allow it to use it in its response.
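That pattern, keeping sensitive records out of training and injecting them into the prompt only after a permission check, can be sketched in a few lines; the record store, field names, and caller identities below are illustrative stand-ins for a real access-controlled database.

```python
# Sketch: sensitive data never enters training; it is fetched per request,
# checked against the caller's permissions, and placed into the prompt.
RECORDS = {
    "acct-42": {"owner": "alice", "text": "Outstanding balance: $1,250, due June 1."},
}

def fetch_for_prompt(record_id: str, caller: str) -> str:
    record = RECORDS.get(record_id)
    if record is None or record["owner"] != caller:
        raise PermissionError("caller is not allowed to see this record")
    return record["text"]

context = fetch_for_prompt("acct-42", caller="alice")
prompt = f"Using only this account record, answer the customer's question.\n{context}"
print(prompt)
```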
21:20Awesome.
21:21Now, just a tidbit I would add here is, you know, for folks that are concerned about data
21:26privacy regulations like GDPR and HIPAA, RAG is very complementary in the sense that with
21:32a database you can easily permanently delete data, of course.
21:36It's just a question of deleting rows and tables.
21:40Additionally, you can localize that database so you can ensure that data is not transferred
21:44outside of a specified geographic region.
21:46Both of those are very important to things like GDPR.
21:49There's another lens to this though, Peter G, that I want to get your insight into, which
21:53is the mistrust businesses, I think especially startups, have of closed-source model providers
22:00and of using their models, because they're concerned that the logs from that model, sensitive data,
22:06will be used to actually train the closed-source provider's model.
22:10No, you're absolutely right.
22:11And I think to that extent, you know, startups tend to go with something like a Llama 2 stack,
22:15maybe a Gemma stack, a Mistral stack, because they can run a couple of GPUs and they control
22:19the entire thing from beginning to end.
22:21But what I've noticed, though, is that some startups tend to be using something
22:25called a long context window as an ad hoc form of RAG.
22:28And what that means is there's this promiscuous intermingling of kind of inference and possibly
22:33training data.
22:34And that gets dangerous when you're talking about things like law and, you know, insurance
22:38type of matters.
22:40So I think just having RAG is a form of data discipline, right?
22:43And so even if you're running your own open source models, you can still run into privacy
22:47issues if you're not careful.
22:49But I think also that's something that we're trying to do.
22:51You know, with Vertex AI, one of the pitches that
22:55they're making is that your data is safe with Google, right?
22:58And, you know, I think that's at least how we're trying to differentiate ourselves, right?
23:03Awesome.
23:04So, look, both of you, my brilliant colleagues, there is absolutely no way that we can cover
23:12everything you need to know about how to productionize an LLM in 25 minutes.
23:16But you've done a really good job.
23:19And I just want to thank everyone for listening.
23:21I want to thank Peter and Peter.
23:24And if you do have any follow-up questions, if you didn't understand something, we're
23:27going to be in the Media Lab.
23:29So please feel free to come up to us and ask questions.
23:32And I hope you enjoy the rest of the program.
23:34Thanks.
23:35Thanks.
23:36Thanks.
23:37Thanks.