Jessica Powell, Co-founder and CEO, AudioShake
Transcript
00:00Hi, thanks all for coming.
00:01I'm Jessica Powell, the CEO and co-founder of AudioShake.
00:06And I am waiting for the slides to start.
00:10Oh, there we go.
00:11All right.
00:12So today, I'm going to talk to you about sound.
00:14But before we do that, we're going to do a sound experiment.
00:17So in just a moment, I'm going to ask you all
00:19to clap and to cheer as if you were at a sports
00:22stadium or a concert.
00:24And I'm going to talk.
00:25And can we get some music?
00:28Great.
00:29OK, so louder, louder.
00:32Can you hear me?
00:33Can you hear me?
00:35Great.
00:37I really just wanted applause.
00:40Great, so now let's see.
00:41If I had asked you what the person two feet from you
00:44had said during that time, you would have had no idea.
00:48Or if it had been a concert, and you had actually
00:50wanted to hear what was happening,
00:52maybe more you wanted to hear the music than the crowd,
00:55you'd be stuck.
00:56And that's generally how we experience sound
00:58in the real world, right?
00:59Think about being in a noisy bar.
01:01Or you're on the street, and you're recording a video.
01:05And all of a sudden, an ambulance
01:06comes by or a police siren.
01:09Or since we're in Silicon Valley,
01:10let's say you're developing an application that
01:12requires voice input.
01:14Well, if your user is in a call center,
01:16on a noisy street corner, or simply
01:18has toddlers running around in the background,
01:21good luck getting the input that you need.
01:23But what if that didn't have to be the case?
01:25What if we could actually extract what we needed to hear?
01:29So let's listen to something that's very, very noisy,
01:31like this.
01:32This is the mission.
01:33And liftoff of the Space Shuttle Discovery,
01:36returning to the Space Station.
01:39And now let's isolate the voice.
01:42And liftoff of the Space Shuttle Discovery,
01:45returning to the Space Station.
01:47So this is what we do at AudioShake.
01:49We make sound work better for everyone.
01:51We split sound into its different components
01:54in order to make audio editable, accessible, and useful.
01:58We're a core audio infrastructure company,
02:00a bit like a Dolby.
02:02Now this is actually pretty hard to do,
02:04and let me show you why.
02:06This is an image of an audio recording.
02:08And you probably can't tell:
02:10is this a whole bunch of people speaking in a crowded room?
02:12Is this a music recording?
02:15Is it a bunch of sound effects in a movie?
02:17And all I'm gonna tell you is that the y-axis
02:19is the frequency, and the x-axis is time.
02:23And that's about all you're gonna get.
02:25Now what I'm gonna do is I'm actually gonna color
02:26this in for you so you can see what's actually going on.
02:29So this is a song recording.
02:31And you can see we've got some vocals, some drums,
02:34some bass, a collection of other instruments.
02:37And again, this is not actually how sound looks.
02:41You have no way of actually knowing
02:42what's in these different parts.
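For anyone curious how an image like that is produced, here is a minimal sketch of computing a spectrogram with a short-time Fourier transform, where rows are frequency and columns are time, just as described above. This is generic Python with scipy, not AudioShake's pipeline, and "recording.wav" is a placeholder filename.

```python
# Minimal sketch: compute the kind of time-frequency image described above.
# Generic short-time Fourier transform (STFT), not AudioShake's pipeline;
# "recording.wav" is a placeholder filename.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

rate, audio = wavfile.read("recording.wav")   # sample rate in Hz, raw samples
if audio.ndim > 1:                            # mix stereo down to mono
    audio = audio.mean(axis=1)

# STFT: rows are frequency bins (y-axis), columns are time frames (x-axis).
freqs, times, spec = stft(audio, fs=rate, nperseg=2048, noverlap=1536)
magnitude_db = 20 * np.log10(np.abs(spec) + 1e-10)

print(f"{len(freqs)} frequency bins x {len(times)} time frames")
```

Plotted without labels, every voice and instrument lands in the same grid of bins, which is why the picture alone tells you so little.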
02:44Now recording engineers have a hack for this,
02:47developed back in the 60s, around the time of the Beatles,
02:50which is multi-track recording.
02:52You send the vocalist into the studio
02:54to lay down their track, then the drummer,
02:56then the bass player, and you now have
02:58all these granular elements that you can do
03:01granular audio editing with.
03:02You can remix the track, you could put those different
03:05sound objects in different perceptual fields,
03:08the bass here, the drum here, to make a surround sound mix.
03:11You could correct imbalances in the audio,
03:13all because you have those parts.
03:16But real-world sound doesn't come to us like this.
03:19It's more like what we just experienced
03:21in our little experiment.
03:23And so we have to turn to AI and deep learning
03:26to help us fix this problem,
03:27and be able to get at these individual parts.
03:30So let me show you how we do that.
03:31And in fact, before I move slides,
03:33if you look here, if you look at the yellow and the orange,
03:36the drums and the bass, this is a really great example.
03:39So you can see that these are not in one tidy band.
03:42They're all over the place, right?
03:44They're overlapping with all the other frequencies as well.
03:46And remember, we don't see any color in this image.
03:49So all we can see is essentially sound patterns.
03:54So what we do is we train on hundreds of thousands
03:57of these parts, which are called stems,
03:59and let's say drums, in order to learn
04:02what these sound patterns are.
04:04So let's say we can then take an audio recording
04:07that we've never seen before,
04:08and recognize that there are drums in there.
04:11This matches to the sound pattern of a drum.
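As a rough illustration of that training idea: pairs of isolated stems are mixed together, and a model learns to pick out the time-frequency pattern of one stem from the mixture. The tiny network and loss below are placeholder assumptions for the sketch, not AudioShake's actual architecture.

```python
# Toy illustration of training on stems, as described above: mix stems
# together, then teach a model to predict a mask that recovers one stem.
# The small model and L1 loss are placeholder assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(                # mixture magnitude -> mask in [0, 1]
    nn.Linear(1025, 512), nn.ReLU(),
    nn.Linear(512, 1025), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(drum_mag, other_mag):
    """drum_mag, other_mag: (frames, 1025) STFT magnitudes of isolated stems."""
    mixture_mag = drum_mag + other_mag     # the "full recording" the model sees
    mask = model(mixture_mag)              # which time-frequency bins are drums?
    estimate = mask * mixture_mag          # keep drum bins, suppress the rest
    loss = torch.nn.functional.l1_loss(estimate, drum_mag)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step with random stand-in data:
loss = training_step(torch.rand(100, 1025), torch.rand(100, 1025))
```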
04:14So we figured out what something is,
04:16but now we need to disentangle it from everything else.
04:19So let's say we want to do that with the bass.
04:20We want to isolate the bass from the recording.
04:24Think of an image editor.
04:25So what we're going to do is take all the pixels
04:27that we've identified as corresponding to the bass
04:31and leave those right where they are.
04:33And everything that's not the bass,
04:34we're going to black out.
04:36And essentially we're going to silence that audio.
04:38And that's going to leave us with the bass.
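That "black out everything that isn't the bass" step can be sketched as a mask applied to the spectrogram and then inverted back to audio. The mask below is assumed to come from a model like the toy one above; the code is illustrative, not AudioShake's implementation.

```python
# Minimal sketch of the masking step described above: keep the
# time-frequency bins identified as bass, silence everything else,
# then convert back to audio. The mask itself is assumed/illustrative.
import numpy as np
from scipy.signal import stft, istft

def isolate(mixture, sample_rate, bass_mask):
    """bass_mask: 0/1 array, same shape as the STFT, where 1 means 'this bin is bass'."""
    freqs, times, spec = stft(mixture, fs=sample_rate, nperseg=2048, noverlap=1536)
    spec = spec * bass_mask          # black out (zero) everything that's not bass
    _, bass_audio = istft(spec, fs=sample_rate, nperseg=2048, noverlap=1536)
    return bass_audio
```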
04:40Now this might sound, as my parents said to me,
04:43like a very weird hobby,
04:46but actually it's very, very practical.
04:48And AudioShake works across a ton of entertainment
04:50and speech workflows today.
04:52So in film and TV, we can isolate dialogue,
04:55music, and effects to take old film
04:57and open it up to localization in other languages.
05:00Or remove on-set wind and noise.
05:04In music, we can separate the different instrument stems
05:06for remixing or immersive sound.
05:09In sports, we can boost what a player is saying
05:12on the field, or remove music from the environment
05:15so that teams and leagues can avoid millions of dollars
05:17in copyright fines.
05:19In transcription and captioning,
05:21our tech is used to isolate the voice
05:23before it goes through transcription
05:25so that you get much higher accuracy.
05:28And in generative AI, we're used by some of the world's
05:30largest foundation models to learn human conversation.
05:34So for example, disentangling overlapping speakers
05:37or extracting the uh-huhs and yeahs
05:40that make human conversation human.
05:43So let me show you a quick demo
05:45of what this all looks like in practice.
05:48♪ Oh oh oh oh oh... ♪
05:52♪ I've a feeling coming down low
05:55♪ Gonna stare at the ceiling but there's n-
06:00The lock of Dream Come True, innit really.
06:03Did you uh, we talked about nerves
06:04and coming into a game like this.
06:05It's the magnitude of this game, massive game.
06:08Did, was there a hangover at all from last season?
06:10The result and the coming to...
06:12Well, when you came to Earth,
06:13you couldn't be more mistaken.
06:14We're here to help you by deactivating them.
06:18The loot, yeah.
06:19Dort hat sich was eingenistet, Noah. (German: "Something has taken up residence in there, Noah.")
06:22I think that you should get to the airport.
06:24No, I think that you should get to the airport
06:26like 30 minutes before boarding, maximum.
06:29I want to walk right through security.
06:31Enjoy it and lounge.
06:33That's what I call a lounge.
06:35In fact, there's going to be so much more that you're going
06:43to be able to power with sound separation in the future,
06:45and much of it in the near future.
06:47So we've already heard a little bit today about music.
06:49In music, you're going to be able to remix and reimagine
06:52everything.
06:53And all content, including UGC, is
06:56going to be able to be made immersive,
06:58both through sound separation and things
07:00that are happening on the vision side as well.
07:04We're also going to be able to do all kinds of measurement
07:06and analysis thanks to sound separation.
07:09So imagine being able to measure the degradation
07:12or the health of an ecological environment
07:15around the presence or the absence
07:17and the change over time in different kinds of animal
07:19sounds.
07:21And in generative AI,
07:24the foundation models are going to go deeper
07:26and deeper into audio, to the point where they're
07:28going to be able to understand audio the way we do.
07:31So all the complexity, the way any human here
07:33could have said, that was crowd noise
07:36and there was a dog barking over there.
07:38Computers will be able to do that too.
07:40And they're going to be able to learn to talk like us
07:42as well, which is going to change
07:44so many workflows from the factory floor
07:46through to kiosks and amusement parks.
07:49Finally, because of advances not just in AI,
07:52but also chips and hardware, we're
07:54going to be able to solve the noisy bar problem.
07:58Now one big technical challenge in all this is speed.
08:01So in sound separation, we are taking large audio files,
08:05running them through large deep learning models,
08:07and then outputting large high resolution audio files.
08:11And that's why for the past year,
08:12AudioShake's been working on building faster and faster
08:15models.
08:16And today, we're launching here at Fortune
08:18the AudioShake Voice SDK, which will isolate voice
08:21in real time from noisy backgrounds,
08:24streaming-capable on edge devices.
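To make that speed requirement concrete: streaming separation means the audio arrives in small buffers, and each buffer has to be processed faster than its own duration. The sketch below only illustrates that budget with a hypothetical separate_chunk stand-in; it is not the AudioShake Voice SDK's actual API.

```python
# Rough sketch of the real-time constraint: audio arrives in small chunks,
# and each chunk must be separated faster than the time it took to record.
# separate_chunk() is a hypothetical stand-in, not the AudioShake Voice SDK.
import time
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 320          # 20 ms of audio per chunk

def separate_chunk(chunk: np.ndarray) -> np.ndarray:
    """Placeholder: a real implementation would run a streaming model here."""
    return chunk             # pass-through stand-in

def stream(chunks):
    for chunk in chunks:
        start = time.perf_counter()
        voice = separate_chunk(chunk)
        elapsed = time.perf_counter() - start
        budget = CHUNK_SAMPLES / SAMPLE_RATE      # 0.02 s of audio per chunk
        # Real time means elapsed must stay under budget, on every chunk,
        # on whatever edge device the audio is coming from.
        yield voice, elapsed < budget

# Example with ten chunks of silence:
for _, on_time in stream(np.zeros(CHUNK_SAMPLES) for _ in range(10)):
    assert on_time
```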
08:27And I think, while we were all talking and doing
08:29our experiment, in the back, they were running exactly that,
08:33or we're about to find out if they did.
08:35Can you play the full mix of what
08:36that room sounded like when everyone was cheering?
08:39Louder.
08:39Can you hear me?
08:41Can you hear me?
08:43All right, and what did you isolate?
08:45Louder.
08:46Can you hear me?
08:47Can you hear me?
08:49Great.
08:50I should have done something like you all want a free car,
08:53or something like that.
08:54So as you go out today into the noisy hall,
08:56or on a street corner later, or at a bar tonight watching
09:00a DJ remix a track, know that all these experiences are going
09:03to be enhanced by sound separation,
09:05and AudioShake's going to help make it possible.
