Brainstorm AI London 2024: The Future Of Synthetic Media

Victor Riparbelli, Co-founder and CEO, Synthesia
Mati Staniszewski, Co-founder and CEO, ElevenLabs
Moderator: Jeremy Kahn, FORTUNE
Transcript
00:00 - Hi, everybody. Welcome back. I hope you had a good break. I'm very excited to be joined
00:04 by the CEOs of two of the hottest startups in the generative AI space
00:10 at the moment. And as you heard from Ellie, both working in synthetic media, which is
00:16 maybe using text to generate something, but maybe also using existing video, still images,
00:22 or voice to generate media and content. To show you how this works, first, we've got
00:27 two very exciting demos. First, we're going to go to Victor, who's going to show you a
00:30 little bit about how Synthesia's product works. Victor, go ahead.
00:34 - Thank you so much. I think we should get a visual in just a second here, but maybe
00:38 before we jump into that. So at Synthesia, we're on a mission to make video easy for
00:42 everyone. That's something a lot of people have attempted before us, but we take kind
00:46 of a new approach to this, right? We don't think of this as building smaller, more affordable
00:50 cameras. We don't think of this as like slightly better editing apps that run on your phone.
00:54 We're actually building technology that eventually is going to be able to replace the entire
00:58 physical production process of using studios, cameras, actors, microphones, and all that
01:02 stuff. We want to take that entire workflow and make it into something you can do entirely
01:06 from behind your desk. Today, we're a SaaS platform, and we help predominantly enterprise
01:12 create more videos to communicate better with their stakeholders, which could be employees,
01:16 could be customers. And I think we'll get a visual up here now of how it actually works
01:21 and how easy it actually is to do. So this is the platform, and in some ways, it's sort
01:26 of modeled a little bit on PowerPoint. That's how easy we want it to be to use. What
01:31 we're showcasing here is our AI video assistant, which uses LLMs to essentially help you get
01:35 to a draft of your video in just a few moments. So what you're seeing here is someone putting
01:42 in the URL to an article, of course, a Fortune article, because we're here today, giving
01:47 the system a little bit more context of what it is that we're trying to do with this video.
01:52 And in just a second, the system will then go in, it will parse the content, and it will
01:57 give us a draft, an editable copy that we can then kind of tune to what exactly we want
02:03 it to be like. Now, in the context of an enterprise, this may not be an article like this. This
02:09 would probably be a knowledge-base article. It could be a case study that you want to
02:13 share with your customers or anything else really you want to turn into a video. So as
02:16 you can see here now, we begin to actually write the script. And it is, of course, using
02:22 all the data that it's kind of taking from this URL. The idea here is not that this is
02:27 a final video that you can just publish, but it is to take you 70, 80% of the way. Instead
02:32 of having what we internally call the blank screen of death, we give you something to
02:37 actually begin kind of working from. And we don't just write the script. We actually also
02:40 give you the visuals. So as you'll see here, there's some text on the screen. There'll
02:44 be some bullet points. There'll be a bunch of other things that you can sort of edit
02:48 yourself. Of course, everything can be edited. We have these avatars, which is the sort of
02:52 core of our product. And I think after this video is done, we'll of course see what the
02:56 video looks like. But it's just to give you a quick visual of how easy it actually is
03:00 to make videos with these systems. And it's one of the things that excites us the most
03:04 about this new wave of generative tools. It's awesome if we can make it easier for Hollywood
03:10 people and video production professionals to make video. But I think what's much, much
03:13 more interesting is taking everyone in the world, including everyone here, and making
03:17 you into video creators. In some ways, it sounds a bit science fiction. In other ways,
03:23 I think it's just natural evolution. If you go back 40, 50 years in time, it wasn't part
03:28 of most people's job to write text, right? You had secretaries. You had people in companies
03:33 who would write things on typewriters and send them to the right people. Then we all
03:36 got computers and keyboards. And now I'm pretty sure everyone in this room, you write as part
03:41 of your daily tasks, right? Then PowerPoint came along. We all became designers without
03:46 really knowing. Probably all of you here know how to operate a PowerPoint. And obviously,
03:51 we think that the next iteration of this is going to be video, which is just a much, much
03:55 better way of compressing information than text, at least for most people out there.
04:01 I think with that, let's see the demo of what the final video looks like. This uses some
04:04 of our latest technology, our latest model called Express One, which essentially teaches
04:09 the avatars how they should perform and behave. But let's see the video.
04:13 Imagine stepping into a room where the air crackles with the energy of innovation, a
04:17 place where the future of AI and its impact on our world is not just discussed, but shaped.
04:24 Welcome to Fortune Brainstorm AI in London, a gathering that promises to be the nexus
04:28 of AI's brightest minds from leading technology companies.
04:34 At Fortune Brainstorm AI in London, expect to dive deep into conversations that matter.
04:40 Hear from Google DeepMind's Vice President and Faculty AI CEO as they unveil the future
04:46 of AI and its transformative potential on society. Microsoft's Chief Scientist alongside
04:52 Accenture's Chief AI Officer will explore the profound changes generative AI is set to
04:57 bring to the workplace. This event is not just about listening, but engaging with roundtable
05:03 sessions and ample opportunities for networking. I hope you enjoy the event.
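[Editor's note: a minimal sketch of how an article-to-script assistant like the one demoed above might work. It assumes the requests, beautifulsoup4, and openai Python packages; the prompt, model name, and URL are illustrative, not Synthesia's actual pipeline.]

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

def draft_video_script(article_url: str, context: str) -> str:
    # Fetch the article and strip the HTML down to plain text.
    html = requests.get(article_url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    # Ask an LLM for an editable first draft -- the "70, 80% of the way" step.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Write a short, scene-by-scene video narration script."},
            {"role": "user",
             "content": f"Context: {context}\n\nSource article:\n{text[:8000]}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage, mirroring the demo:
print(draft_video_script("https://fortune.com/example-article", "Promote Brainstorm AI London"))
```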
05:09 - Great. It's very impressive given how quickly it all generates. - Yeah, absolutely.
05:16 Thank you. - And now we'll hear from Mati about 11 Labs,
05:20 which, as you may know, does voice cloning. Let's see how this works.
05:23 - Thank you, Jeremy. I will attempt to do it live, and maybe first a very quick introduction.
05:29 11 Labs is an audio AI research and deployment company with the goal of making content universally
05:35 accessible across voices and languages. And what I'll show you now is three demos, quick
05:40 demos of how some of those technologies come to life. You should get a visual any second
05:45 now on the screen. And the first of those building blocks is one that most of you might
05:51 be familiar with: foundational text-to-speech. And that's one of the breakthroughs
05:55 that we've figured out: how you can take existing text and, by understanding the nuance
06:01 of the context, turn it into emotion and the right intonation with a specific
06:06 voice. So, cueing it up, I will use 11 Labs to help me with the introduction to all of you
06:12 and we'll see how that comes across.
06:14 Ladies and gentlemen, welcome to the Fortune AI Brainstorm Summit. We are thrilled to have
06:21 you join us from around the globe for this dynamic gathering of minds and ideas. Today,
06:29 we stand on the brink of new discoveries and innovations in the field of artificial intelligence
06:35 that promise to redefine what's possible in business, society and beyond.
06:44 You could hear the pauses, you could hear the intonation, you could hear the excitement,
06:47 and also a little bit of pondering, with the "hmm" as the AI was speaking. And this is one of
06:52 the voices from one of the voice actors we work with. And actually there are plenty of those
06:58 voice actors as part of our platform. We now have over a thousand such voices from actors
07:03 who created clones of their voices and then shared them; every time a voice is generated, they
07:07 earn compensation in return as well. And whether you want this voice, a voice with
07:11 an Australian accent, with a different style or gender, this is all possible within the
07:16 platform. But this is not the end. One of the things that's super exciting about
07:20 the technology is that it not only allows you to create the content in the language of the
07:24 speaker, but actually take the content and turn it into other languages while preserving
07:29 the same voice and the same characteristics. I'll swap over to a demo that some of you
07:36 might be familiar with. It's John F. Kennedy's moon speech. And we'll see how
07:42 initially he speaks in English. And then I will flip it to Spanish and to Hindi live.
07:48 And you'll see how some of those characteristics come across while we play the demo.
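[Editor's note: a minimal sketch of the foundational text-to-speech Mati describes, using 11 Labs' public HTTP API. The endpoint shape follows their published docs at the time of writing, but treat the voice ID, model name, and key handling as placeholder assumptions; the multilingual model is what lets the same voice speak other languages, as in the dubbing demo.]

```python
import requests

ELEVENLABS_API_KEY = "your-api-key"   # placeholder credential
VOICE_ID = "your-voice-id"            # placeholder: any voice from the platform

def synthesize(text: str, out_path: str = "speech.mp3") -> None:
    # POST the text; the API returns raw MP3 audio bytes.
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVENLABS_API_KEY},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=60,
    )
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)

synthesize("Ladies and gentlemen, welcome to the Fortune AI Brainstorm Summit.")
```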
07:53 >> But why, some say, the moon? Why choose this as our goal? And they may well ask, why climb
08:04 the mountain? Why do we choose to go to the moon? Why do we choose to go to the moon?
08:15 >> Three, two, one, zero. Take off.
08:21 >> We chose to go to the moon this decade to do other things, not because they are easy,
08:27 but because they are difficult. Because the goal is to provide our energy and skill and
08:34 to measure it. Because it is a challenge that we are ready to accept. One that we don't
08:41 want to avoid. And we intend to win.
08:46 >> That's one small step for man, one giant leap for mankind.
08:53 >> You could see the emotions come across as the liftoff and the moon landing happened. And that's
08:58 something we're excited about. How we can take that content and enable those global
09:01 audiences to watch it while still enjoying that original experience. But of course in
09:06 the audio world, there's so much more than speech and the voices. And as we think about
09:10 the future of our work, it's about how we can enable some of that work across some of the
09:15 tangential domains. The sound effects, the music. How we can bring it all together with
09:19 the video and truly make it an immersive experience. And to close it off, I'll show you
09:25 something that you might be familiar with from text-to-video work with Sora from OpenAI,
09:29 and with Synthesia as well, where we'll take some snippets from the videos and we'll
09:34 supplement it with AI-generated sound effects. What we're trying to do is hit that unmute
09:39 button with those videos that you've probably seen online out there.
09:42 [VIDEO PLAYBACK]
10:06 - In a place beyond imagination, where the horizon kisses the heavens, one man dares
10:12 to journey where few have ventured. Armed with nothing but his wit and an unyielding
10:17 spirit, he seeks the answers to mysteries that lie beyond the stars.
10:22 [END PLAYBACK]
10:25 And that's the goal. To make all the content out there accessible across voices,
10:30 across languages, across sounds. And thank you.
10:34 [APPLAUSE]
10:40 - Well, I think you all agree those are pretty impressive demos. And amazing
10:45 technology, also a little bit scary. And I think looking at this, a lot of people
10:49 immediately think about some of the negative use cases around deepfakes, around
10:54 fraud. I know there's already been some concern around 11 Labs, whether people have
10:59 used voice clones to perpetrate frauds. How are you guys trying to prevent that
11:05 from happening and to make sure these technologies are not used for harmful
11:10 uses? Victor, maybe you go first.
11:13 - Yeah, sure. I think as with most new technologies, we immediately go to all the
11:17 things that can go wrong, right? And I think in this case, that's definitely right.
11:20 These technologies will be used by bad actors to do bad things, for sure. I think
11:24 we should not be naive about that at all. And so for us, safeguarding
11:28 the technology has always been part of the company from day one. So we founded the
11:32 company on what we call our ethical framework, which is the three Cs: consent,
11:36 control, and collaboration. It's a long topic, but I think the kind of
11:40 fundamental keystone for us is around consent. So never, ever recreate anyone's
11:46 voice or video avatar without their explicit consent.
11:49 - And how do you guarantee consent?
11:51 - So we have a KYC-style check when you go through, right? You submit your avatar
11:54 footage, it's reviewed by a human being. You have to say out loud some specific
11:58 sentences to make sure you can make your clone. So essentially, it is impossible
12:01 today to go in and take some YouTube videos or something and make a clone of
12:04 someone. That is just not possible. And we intend to keep it that way.
12:08 Control is about content moderation, so we employ pretty heavy content moderation
12:12 and take quite a strong stance on what you're allowed to create and what you're not
12:14 allowed to create. I think for most of us, the sort of what we call the red
12:18 content is easy to agree on, hate speech, violence, swearing, things like that.
12:22 But what gets harder is, well, someone's making a video about cryptocurrencies,
12:25 for example, right? Are you talking about what a great technological invention that
12:28 is, or are you trying to lead me into some fraudulent scheme that promises I'm going
12:32 to get rich in 10 days, right? And we do our best to catch that stuff.
12:36 It's a huge effort. It's, of course, a lot of computers and AI, but it's also a lot
12:40 of humans sort of figuring out who are the bad faith actors on the platform.
12:45 So that's essentially an internal product for us. That said,
12:49 it is an incredibly small number of users we catch trying to do this,
12:52 but we do think it is our responsibility to take some stance on that.
12:55 And the last one is...
12:56 - Okay, good.
12:57 - No, that's fine.
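[Editor's note: a hedged sketch of the automated first pass in the content moderation Victor describes: clear "red content" is rejected outright, while harder, context-dependent cases, like the cryptocurrency example, are routed to human review. The categories, patterns, and policy are illustrative, not Synthesia's actual system.]

```python
import re

# Stand-in patterns for the "easy to agree on" red content.
RED_PATTERNS = [r"\bhate speech\b", r"\bviolence\b"]
# Stand-in patterns for borderline content that needs a human look.
REVIEW_PATTERNS = [r"crypto(currency)?", r"guaranteed returns", r"get rich"]

def moderate(script: str) -> str:
    lowered = script.lower()
    if any(re.search(p, lowered) for p in RED_PATTERNS):
        return "reject"        # clearly disallowed content
    if any(re.search(p, lowered) for p in REVIEW_PATTERNS):
        return "human_review"  # the hard, context-dependent cases
    return "allow"

print(moderate("Invest now and get rich in 10 days!"))  # -> human_review
```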
12:58 - No, Mati, I want to get you in there. How are you at 11 Labs trying to
13:00 prevent this sort of misuse?
13:01 - Yeah, I think, first of all, Jeremy, you're right. Deepfakes are a serious
13:04 concern as we think about this year, with AI-generated content across audio,
13:07 images, and videos coming across. And it's something that's a concern because
13:12 open-source and other commercial models are becoming more active, and there will be
13:16 bad actors that will use and abuse them for nefarious cases. At 11 Labs,
13:21 the first piece is education, making everyone aware of the technology. Second is traceability.
13:27 All the content that's generated by 11 Labs can be traced back to a specific user,
13:30 specific account, and as they use the technology, they will need to verify
13:34 themselves through different steps depending on the technology they use.
13:37 And the last one is detection, and how you can embed the traceability
13:42 as part of detection. So effectively, all the content out there should be known
13:47 as AI-generated, and there should be tools that allow you to quickly get that
13:51 information as a viewer. And we released our tool publicly, which allows everybody
13:55 to check whether a clip was 11 Labs or not and report it back to us.
13:59 - So right now that exists? If I run across some content, an audio clip on the
14:04 internet on X or on LinkedIn or whatever, I can check and see if it was 11 Labs
14:09 generated?
14:10 - Yes, you can. You can directly go on our website, upload that clip,
14:14 and get a probability of how likely it is AI-generated. And now we are working with some partners
14:19 to extend it across other technologies. So some of the open-source models,
14:22 other commercial models, that's part one. And then part two, because as you said,
14:26 you can run into that in social media, but you might not be aware that it's AI-generated
14:30 in the first place. So how we can collaborate with social media,
14:33 with telecommunication companies to check it on the fly while that content is being
14:37 shared.
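[Editor's note: a hypothetical sketch of the check Mati describes: upload a clip, get back a probability that it was generated by 11 Labs. Their classifier is a web tool; the endpoint URL and response field below are invented purely for illustration.]

```python
import requests

def probe_clip(path: str) -> float:
    # Upload the audio file to a (hypothetical) classifier endpoint.
    with open(path, "rb") as f:
        resp = requests.post(
            "https://example.com/ai-speech-classifier",  # placeholder URL
            files={"audio": f},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()["probability"]  # hypothetical response field

score = probe_clip("suspicious_clip.mp3")
print(f"Estimated probability of AI generation: {score:.0%}")
```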
14:38 - I wanted to ask a question about sort of the ethics of some of this.
14:41 There's an interesting use case on the JFK speech. There have already been some
14:45 cases of politicians in various countries wanting to appeal to a certain audience
14:49 in a language that they don't speak. And they have translated their speeches in
14:53 real time, I think maybe even using some of this technology, to appear to speak
14:57 those languages. And they've said, "Oh, that's legitimate. I'm just trying to
15:00 reach a different audience with my content, with my political content."
15:03 But other people have said, "No, that's a complete red flag.
15:06 You don't actually speak that language. You're presenting yourself as something
15:08 you're not." Do you have a policy around this? And also at Synthesia,
15:12 do you have a policy around the creation of political content?
15:15 - Yeah. So we don't allow political content today. I think that will change
15:18 over time. But we take generally a quite permissive stance on what you can use
15:23 these technologies for. We're also mainly serving the enterprise.
15:27 I do think it's a very interesting philosophical question. I think in five
15:29 years' time, something like that, no one will be talking about this anymore,
15:32 and everyone will be doing it. But I think as with any new technology,
15:35 you will have these sort of years, right, where people are trying to figure out
15:38 what's right, what's wrong. You could say, "Is it okay for a politician to have a
15:42 ghostwriter to write a piece in Fortune?" Right? They didn't actually write it
15:45 themselves. Is that worse? Is it the same? Is it okay? So I think, you know,
15:49 all new technologies come with these sort of interesting questions and dilemmas.
15:52 And some of them will turn out to be very important and very true.
15:55 And I think some of them we'll care less about in the future.
15:59 - And Mati, do you have a policy on that?
16:01 - I would certainly echo some of Victor's points as well, where it's going
16:06 to happen at one stage, and now as a society, we need to come together and
16:09 figure out what's the right way of approaching that. Currently, we don't
16:14 allow people to impersonate anyone or pretend that they are one
16:20 of the candidates and make political speeches or political content.
16:25 - Great. I want to take one question from the audience because I think I only have
16:28 time for one. But if there's a question, please raise your hand, and I'll try to
16:31 come to you. Right here, I see there's a gentleman here at this table. If you
16:35 raise your hand again. Oh, that's great. If you could stand up and identify
16:38 yourself.
16:39 - Hi. It's Mark Salmon from the Times. When Sadiq Khan had a deep fake of him
16:44 made, the Met investigated, and they concluded that they couldn't take any
16:48 action. Do you think it's possible to write a law in this country to combat
16:56 against that kind of deep fake, and do you think that's desirable?
17:00 - Should there be laws against deep fakes?
17:03 - I'm obviously not a lawmaker, so I can't give you a definitive answer to that.
17:07 I can give you my personal opinion. I think, ultimately, deep fakes is an
17:11 advanced version of impersonation, and impersonation is, in most cases, illegal
17:16 if you do it for deceptive reasons like fraud, for example. I definitely think we
17:19 should include deepfakes in that, maybe even make the punishment harsher if you're
17:23 using a deep fake for malicious impersonation. But where to draw the lines
17:27 legally, I think, is very difficult to say. That's not my expertise.
17:31 - And to add one quick note to that, and as you rightly asked, this content is out
17:36 there. There's a plethora of tools that can generate that content, and what's the
17:39 true solution? There's definitely the legislation that countries can pass,
17:43 but then also on the technical level, what we can do, and beyond traceability and
17:46 being able to detect it, as you think a few years out, how this could work as part
17:50 of society. And one thing we are advocating for is, beyond watermarking
17:54 what's AI-generated and approved AI, maybe there's a version of watermarking what's real
17:58 content, and for Sadiq Khan or any other candidate to be able to transmit a
18:03 message, and that message is decoded: "This is really Sadiq Khan speaking."
18:07 So we don't only detect what's fake, but actually verify what's true.
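[Editor's note: a minimal sketch of the "verify what's true" idea: a candidate signs a message with a private key, and anyone can check it against the published public key. It uses Ed25519 from the Python cryptography package and illustrates the concept only, not any specific provenance or watermarking standard.]

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# The candidate generates a key pair once and publishes the public key.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

message = b"This is really Sadiq Khan speaking."
signature = private_key.sign(message)

# A viewer or platform verifies the signature before trusting the content.
try:
    public_key.verify(signature, message)
    print("Signature valid: the message is authentic.")
except InvalidSignature:
    print("Signature invalid: do not trust this message.")
```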
18:12 - Excellent. We're out of time. Thank you very much for the demos, and thank you
18:15 very much for the conversation. Put your hands together, please, for Mati and
18:18 Victor.
18:20 ♪ [music] ♪
