Jessica Powell, Co-founder and CEO, AudioShake
Transcript
00:00Hi, thanks all for coming.
00:01I'm Jessica Powell, the CEO and co-founder of AudioShake.
00:06And I am waiting for the slides to start.
00:10Oh, there we go.
00:11All right.
00:12So today, I'm going to talk to you about sound.
00:14But before we do that, we're going to do a sound experiment.
00:17So in just a moment, I'm going to ask you all
00:19to clap and to cheer as if you were at a sports
00:22stadium or a concert.
00:24And I'm going to talk.
00:25And can we get some music?
00:28Great.
00:29OK, so louder, louder.
00:32Can you hear me?
00:33Can you hear me?
00:35Great.
00:37I really just wanted applause.
00:40Great, so now let's see.
00:41If I had asked you what the person two feet from you
00:44had said during that time, you would have had no idea.
00:48Or if it had been a concert, and you had actually
00:50wanted to hear what was happening,
00:52maybe more you wanted to hear the music than the crowd,
00:55you'd be stuck.
00:56And that's generally how we experience sound
00:58in the real world, right?
00:59Think about being in a noisy bar.
01:01Or you're on the street, and you're recording a video.
01:05And all of a sudden, an ambulance
01:06comes by or a police siren.
01:09Or since we're in Silicon Valley,
01:10let's say you're developing an application that
01:12requires voice input.
01:14Well, if your user is in a call center,
01:16on a noisy street corner, or simply
01:18has toddlers running around in the background,
01:21good luck getting the input that you need.
01:23But what if that didn't have to be the case?
01:25What if we could actually extract what we needed to hear?
01:29So let's listen to something that's very, very noisy,
01:31like this.
01:32This is the mission.
01:33And liftoff of the Space Shuttle Discovery,
01:36returning to the Space Station.
01:39And now let's isolate the voice.
01:42And liftoff of the Space Shuttle Discovery,
01:45returning to the Space Station.
01:47So this is what we do at AudioShake.
01:49We make sound work better for everyone.
01:51We split sound into its different components
01:54in order to make audio editable, accessible, and useful.
01:58We're a core audio infrastructure company,
02:00a bit like a Dolby.
02:02Now this is actually pretty hard to do,
02:04and let me show you why.
02:06This is an image of an audio recording.
02:08And you probably can't tell:
02:10is this a whole bunch of people speaking in a crowded room?
02:12Is this a music recording?
02:15Is it a bunch of sound effects in a movie?
02:17And all I'm gonna tell you is that the y-axis
02:19is the frequency, and the x-axis is time.
02:23And that's about all you're gonna get.
02:25Now what I'm gonna do is I'm actually gonna color
02:26this in for you so you can see what's actually going on.
02:29So this is a song recording.
02:31And you can see we've got some vocals, some drums,
02:34some bass, a collection of other instruments.
02:37And again, this is not actually how sound looks.
02:41You have no way of actually knowing
02:42what's in these different parts.
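For anyone curious how an image like that is produced, here is a minimal sketch of computing a spectrogram with a short-time Fourier transform, where rows are frequency and columns are time, just as described above. This is generic Python with scipy, not AudioShake's pipeline, and "recording.wav" is a placeholder filename.

```python
# Minimal sketch: compute the kind of time-frequency image described above.
# Generic short-time Fourier transform (STFT), not AudioShake's pipeline;
# "recording.wav" is a placeholder filename.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

rate, audio = wavfile.read("recording.wav")   # sample rate in Hz, raw samples
if audio.ndim > 1:                            # mix stereo down to mono
    audio = audio.mean(axis=1)

# STFT: rows are frequency bins (y-axis), columns are time frames (x-axis).
freqs, times, spec = stft(audio, fs=rate, nperseg=2048, noverlap=1536)
magnitude_db = 20 * np.log10(np.abs(spec) + 1e-10)

print(f"{len(freqs)} frequency bins x {len(times)} time frames")
```

Plotted without labels, every voice and instrument lands in the same grid of bins, which is why the picture alone tells you so little.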
02:44Now recording engineers have a hack for this,
02:47developed back in the 60s, around the time of the Beatles,
02:50which is multi-track recording.
02:52You send the vocalist into the studio
02:54to lay down their track, then the drummer,
02:56then the bass player, and you now have
02:58all these granular elements that you can do
03:01granular audio editing with.
03:02You can remix the track, you could put those different
03:05sound objects in different perceptual fields,
03:08the bass here, the drum here, to make a surround sound mix.
03:11You could correct imbalances in the audio,
03:13all because you have those parts.
03:16But real-world sound doesn't come to us like this.
03:19It's more like what we just experienced
03:21in our little experiment.
03:23And so we have to turn to AI and deep learning
03:26to help us fix this problem,
03:27and be able to get at these individual parts.
03:30So let me show you how we do that.
03:31And in fact, before I move slides,
03:33if you look here, if you look at the yellow and the orange,
03:36the drums and the bass, this is a really great example.
03:39So you can see that these are not in one tidy band.
03:42They're all over the place, right?
03:44They're overlapping with all the other frequencies as well.
03:46And remember, we don't see any color in this image.
03:49So all we can see is essentially sound patterns.
03:54So what we do is we train on hundreds of thousands
03:57of these parts, which are called stems,
03:59and let's say drums, in order to learn
04:02what these sound patterns are.
04:04So let's say we can then take an audio recording
04:07that we've never seen before,
04:08and recognize that there are drums in there.
04:11This matches to the sound pattern of a drum.
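As a rough illustration of that training idea: pairs of isolated stems are mixed together, and a model learns to pick out the time-frequency pattern of one stem from the mixture. The tiny network and loss below are placeholder assumptions for the sketch, not AudioShake's actual architecture.

```python
# Toy illustration of training on stems, as described above: mix stems
# together, then teach a model to predict a mask that recovers one stem.
# The small model and L1 loss are placeholder assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(                # mixture magnitude -> mask in [0, 1]
    nn.Linear(1025, 512), nn.ReLU(),
    nn.Linear(512, 1025), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(drum_mag, other_mag):
    """drum_mag, other_mag: (frames, 1025) STFT magnitudes of isolated stems."""
    mixture_mag = drum_mag + other_mag     # the "full recording" the model sees
    mask = model(mixture_mag)              # which time-frequency bins are drums?
    estimate = mask * mixture_mag          # keep drum bins, suppress the rest
    loss = torch.nn.functional.l1_loss(estimate, drum_mag)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step with random stand-in data:
loss = training_step(torch.rand(100, 1025), torch.rand(100, 1025))
```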
04:14So we figured out what something is,
04:16but now we need to disentangle it from everything else.
04:19So let's say we want to do that with the bass.
04:20We want to isolate the bass from the recording.
04:24Think of an image editor.
04:25So what we're going to do is take all the pixels
04:27that we've identified as corresponding to the bass
04:31and leave those right where they are.
04:33And everything that's not the bass,
04:34we're going to black out.
04:36And essentially we're going to silence that audio.
04:38And that's going to leave us with the bass.
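That "black out everything that isn't the bass" step can be sketched as a mask applied to the spectrogram and then inverted back to audio. The mask below is assumed to come from a model like the toy one above; the code is illustrative, not AudioShake's implementation.

```python
# Minimal sketch of the masking step described above: keep the
# time-frequency bins identified as bass, silence everything else,
# then convert back to audio. The mask itself is assumed/illustrative.
import numpy as np
from scipy.signal import stft, istft

def isolate(mixture, sample_rate, bass_mask):
    """bass_mask: 0/1 array, same shape as the STFT, where 1 means 'this bin is bass'."""
    freqs, times, spec = stft(mixture, fs=sample_rate, nperseg=2048, noverlap=1536)
    spec = spec * bass_mask          # black out (zero) everything that's not bass
    _, bass_audio = istft(spec, fs=sample_rate, nperseg=2048, noverlap=1536)
    return bass_audio
```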
04:40Now this might sound, as my parents said to me,
04:43like a very weird hobby,
04:46but actually it's very, very practical.
04:48And AudioShake works across a ton of entertainment
04:50and speech workflows today.
04:52So in film and TV, we can isolate dialogue,
04:55music, and effects to take old film
04:57and open it up to localization in other languages.
05:00Or remove on-set wind and noise.
05:04In music, we can separate the different instrument stems
05:06for remixing or immersive sound.
05:09In sports, we can boost what a player is saying
05:12on the field, or remove music from the environment
05:15so that teams and leagues can avoid millions of dollars
05:17in copyright fines.
05:19In transcription and captioning,
05:21our tech is used to isolate the voice
05:23before it goes through transcription
05:25so that you get much higher accuracy.
05:28And in generative AI, we're used by some of the world's
05:30largest foundation models to learn human conversation.
05:34So for example, disentangling overlapping speakers
05:37or extracting the uh-huhs and yeahs
05:40that make human conversation human.
05:43So let me show you a quick demo
05:45of what this all looks like in practice.
05:48♪ Oh oh oh oh oh... ♪
05:52♪ I've a feeling coming down low
05:55♪ Gonna stare at the ceiling but there's n-
06:00The lock of Dream Come True, innit really.
06:03Did you uh, we talked about nerves
06:04and coming into a game like this.
06:05It's the magnitude of this game, massive game.
06:08Did, was there a hangover at all from last season?
06:10The result and the coming to...
06:12Well, when you came to Earth,
06:13you couldn't be more mistaken.
06:14We're here to help you by deactivating them.
06:18The loot, yeah.
06:19Dort hat sich was eingenistet, Noah. (German: "Something has taken up residence in there, Noah.")
06:22I think that you should get to the airport.
06:24No, I think that you should get to the airport
06:26like 30 minutes before boarding, maximum.
06:29I want to walk right through security.
06:31Enjoy it and lounge.
06:33That's what I call a lounge.
06:35In fact, there's going to be so much more that you're going
06:43to be able to power with sound separation in the future,
06:45and much of it in the near future.
06:47So we've already heard a little bit today about music.
06:49In music, you're going to be able to remix and reimagine
06:52everything.
06:53And all content, including UGC, is
06:56going to be able to be made immersive,
06:58both through sound separation and things
07:00that are happening on the vision side as well.
07:04We're also going to be able to do all kinds of measurement
07:06and analysis thanks to sound separation.
07:09So imagine being able to measure the degradation
07:12or the health of an ecological environment
07:15around the presence or the absence
07:17and the change over time in different kinds of animal
07:19sounds.
07:21And in generative AI,
07:24the foundation models are going to go deeper
07:26and deeper into audio, to the point where they're
07:28going to be able to understand audio the way we do.
07:31So all the complexity, the way any human here
07:33could have said, that was crowd noise
07:36and there was a dog barking over there.
07:38Computers will be able to do that too.
07:40And they're going to be able to learn to talk like us
07:42as well, which is going to change
07:44so many workflows from the factory floor
07:46through to kiosks and amusement parks.
07:49Finally, because of advances not just in AI,
07:52but also chips and hardware, we're
07:54going to be able to solve the noisy bar problem.
07:58Now one big technical challenge in all this is speed.
08:01So in sound separation, we are taking large audio files,
08:05running them through large deep learning models,
08:07and then outputting large high resolution audio files.
08:11And that's why for the past year,
08:12AudioShake's been working on building faster and faster
08:15models.
08:16And today, we're launching here at Fortune
08:18the AudioShake Voice SDK, which will isolate voice
08:21in real time from noisy backgrounds,
08:24streaming-capable on edge devices.
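To make that speed requirement concrete: streaming separation means the audio arrives in small buffers, and each buffer has to be processed faster than its own duration. The sketch below only illustrates that budget with a hypothetical separate_chunk stand-in; it is not the AudioShake Voice SDK's actual API.

```python
# Rough sketch of the real-time constraint: audio arrives in small chunks,
# and each chunk must be separated faster than the time it took to record.
# separate_chunk() is a hypothetical stand-in, not the AudioShake Voice SDK.
import time
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 320          # 20 ms of audio per chunk

def separate_chunk(chunk: np.ndarray) -> np.ndarray:
    """Placeholder: a real implementation would run a streaming model here."""
    return chunk             # pass-through stand-in

def stream(chunks):
    for chunk in chunks:
        start = time.perf_counter()
        voice = separate_chunk(chunk)
        elapsed = time.perf_counter() - start
        budget = CHUNK_SAMPLES / SAMPLE_RATE      # 0.02 s of audio per chunk
        # Real time means elapsed must stay under budget, on every chunk,
        # on whatever edge device the audio is coming from.
        yield voice, elapsed < budget

# Example with ten chunks of silence:
for _, on_time in stream(np.zeros(CHUNK_SAMPLES) for _ in range(10)):
    assert on_time
```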
08:27And I think, while we were all talking and doing
08:29our experiment, in the back, they were running exactly that,
08:33or we're about to find out if they did.
08:35Can you play the full mix of what
08:36that room sounded like when everyone was cheering?
08:39Louder.
08:39Can you hear me?
08:41Can you hear me?
08:43All right, and what did you isolate?
08:45Louder.
08:46Can you hear me?
08:47Can you hear me?
08:49Great.
08:50I should have done something like you all want a free car,
08:53or something like that.
08:54So as you go out today into the noisy hall,
08:56or on a street corner later, or at a bar tonight watching
09:00a DJ remix a track, know that all these experiences are going
09:03to be enhanced by sound separation,
09:05and AudioShake's going to help make it possible.
