TechTranscript
00:00Human vision is amazingly beautiful and complex.
00:03It all started billions of years ago, when small organisms developed a mutation that
00:08made them sensitive to light.
00:10Fast forward to today, and there's an abundance of life on the planet, which all share very
00:14similar visual systems.
00:16They include eyes for capturing light, receptors for turning it into signals, and a visual
00:21cortex for processing it.
00:23Finely engineered and balanced pieces of a system, which help us do things as simple
00:27as appreciating a sunrise.
00:30But this is really just the beginning.
00:32In the past 30 years, we've made even more strides toward extending this amazing visual ability,
00:36not just to ourselves, but to machines as well.
00:40The first type of photographic camera was invented around 1816, when a small box held a piece
00:45of paper coated with silver chloride.
00:47When the shutter was open, the silver chloride would darken where it was exposed to light.
00:51Now, 200 years later, we have much more advanced versions of that system that can capture photos
00:57directly in digital form, so we've been able to closely mimic how the human eye captures
01:02light and color.
01:03But it's turning out that that was the easy part.
01:06Understanding what's in the photo is much more difficult.
01:09Consider this picture.
01:11My human brain can look at it and immediately know that it's a flower.
01:15Our brains are cheating, since we've got a couple million years' worth of evolutionary
01:18context to help immediately understand what this is.
01:21But a computer doesn't have that same advantage.
01:24To an algorithm, the image looks like this.
01:27Just a massive array of integer values, which represent intensities across the color spectrum.
01:32There's no context here.
01:34Just a massive pile of data.
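To make that concrete, here's a minimal sketch (my own illustration, not from the video) of what an image looks like to a program, assuming Python with the Pillow and NumPy libraries and a hypothetical file named flower.jpg:

```python
from PIL import Image
import numpy as np

# Load the photo and look at the raw numbers an algorithm actually receives.
img = np.array(Image.open("flower.jpg"))  # hypothetical file name

print(img.shape, img.dtype)  # e.g. (480, 640, 3) uint8: rows x columns x RGB channels
print(img[0, 0])             # the top-left pixel: three integers between 0 and 255
```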
01:37It turns out that context is the crux of getting algorithms to understand image content
01:41in the same way that the human brain does.
01:44And to make this work, we use an approach loosely modeled on how the human brain operates:
01:49machine learning.
01:51Machine learning allows us to effectively learn that context from a dataset, so that an
01:55algorithm can understand what all those numbers in a specific arrangement actually represent.
02:00And what if we have images that are difficult for a human to classify?
02:04Can machine learning achieve better accuracy?
02:06For example, let's take a look at these images of sheepdogs and mops, where it's pretty hard
02:11even for us to differentiate between the two.
02:15With a machine learning model, we can take a bunch of images of sheepdogs and mops, and
02:19as long as we feed it enough data, it will eventually be able to properly tell the difference
02:23between the two.
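As a rough illustration (not shown in the video), supervised training like this usually starts from a folder of labeled examples. A minimal sketch using TensorFlow/Keras, assuming hypothetical directories data/sheepdog/ and data/mop/:

```python
import tensorflow as tf

# Assumes images are sorted into data/sheepdog/*.jpg and data/mop/*.jpg (hypothetical paths);
# Keras infers the two class labels from the directory names.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data",
    image_size=(128, 128),
    batch_size=32,
)
print(train_ds.class_names)  # ['mop', 'sheepdog']
```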
02:25Computer vision is taking on increasingly complex challenges and is achieving accuracy that rivals
02:30that of humans performing the same image recognition tasks.
02:33But like humans, these models aren't perfect.
02:35They do sometimes make mistakes.
02:38The specific type of neural network that accomplishes this is called a convolutional neural network,
02:43or CNN.
02:44CNNs work by scanning an image in smaller groups of pixels using what's called a filter.
02:50Each filter is a small matrix of weights, and the network does a series of calculations on these
02:54pixels, comparing them against the specific patterns the filter is looking for.
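For intuition, here's a small sketch of sliding a single hand-written 3x3 edge filter over a grayscale image with NumPy (my own example; a real CNN learns its filter values rather than using hand-picked ones):

```python
import numpy as np

def apply_filter(image, kernel):
    # Slide a small kernel across a 2D grayscale image (no padding).
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # the group of pixels under the filter
            out[i, j] = np.sum(patch * kernel)  # how strongly the patch matches the pattern
    return out

# A simple vertical-edge detector: responds where brightness changes from left to right.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])
```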
02:59The first layers of a CNN are able to detect low-level patterns like rough edges
03:03and curves.
03:04As the network performs more convolutions, it can begin to identify specific objects
03:08like faces and animals.
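A minimal Keras sketch of that stacking idea, with early layers picking up simple patterns and later layers combining them into larger shapes; the layer sizes and the 128x128 input are assumptions for illustration, not values from the video:

```python
import tensorflow as tf

# Two convolutional stages followed by a classifier head (e.g. sheepdog vs. mop).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # early filters: edges and curves
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # later filters: larger, more specific shapes
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),    # probability for each of the two classes
])
```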
03:11How does a CNN know what to look for, and whether its prediction is accurate?
03:15This is done through a large amount of labeled training data.
03:18When the CNN starts, all of the filter values are randomized.
03:22As a result, its initial predictions make little sense.
03:26Each time the CNN makes a prediction against labeled data, it uses an error function to compare
03:30how close its prediction was to the image's actual label.
03:34Based on this error or loss function, the CNN updates its filter values and starts the
03:38process again.
03:40Ideally, each iteration is slightly more accurate than the last.
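That predict-compare-update cycle might look roughly like this, a sketch reusing the model and dataset assumed in the earlier snippets:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

for epoch in range(5):                                    # each pass should be a bit more accurate
    for images, labels in train_ds:
        with tf.GradientTape() as tape:
            predictions = model(images, training=True)    # predict against labeled data
            loss = loss_fn(labels, predictions)           # error between prediction and true label
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update the filter values
```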
03:44What if instead of analyzing a single image, we want to analyze a video using machine learning?
03:48At its core, a video is just a series of image frames.
03:52To analyze a video, we can build on our CNN for image analysis.
03:56In still images, we can use CNNs to identify features.
03:59But when we move to video, things get more difficult, since the items we're identifying
04:04might change over time.
04:06Or, more likely, there's context between the video frames that's highly important to labeling.
04:11For example, if there's a picture of a half-full cardboard box, we might want to label it
04:15packing a box or unpacking a box, depending on the frames before and after it.
04:21This is where CNNs come up lacking.
04:23They can only take into account spatial features, the visual data in an image, but can't handle
04:28temporal or time features, how a frame is related to the one before it.
04:33To address this issue, we have to take the output of our CNN and feed it into another model,
04:38which can handle the temporal nature of our videos.
04:41This type of model is called a Recurrent Neural Network, or RNN.
04:45While a CNN treats groups of pixels independently, an RNN can retain information about what it's
04:51already processed, and use that in its decision making.
04:55RNNs can handle many types of input and output data.
04:58In this example of classifying videos, we train the RNN by passing it a sequence of frame descriptions
05:04(empty box, open box, closing box) and finally a label: packing.
05:09As the RNN processes each sequence, it uses a loss or error function to compare its predicted
05:14output with the correct label.
05:16Then it adjusts the weights and processes the sequence again until it achieves a higher
05:20accuracy.
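Here's a rough sketch of that idea (my own illustration, with assumed sizes): each frame is first encoded by a CNN into a feature vector, and an LSTM, one common kind of RNN, reads the sequence of frame features and predicts a single label such as packing or unpacking:

```python
import tensorflow as tf

NUM_FRAMES, FEATURE_DIM, NUM_CLASSES = 30, 1280, 2   # assumed sizes, not from the video

# Input: one feature vector per frame, produced by running a CNN over each frame.
video_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, FEATURE_DIM)),
    tf.keras.layers.LSTM(128),                                  # carries context from earlier frames forward
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),   # e.g. packing vs. unpacking
])

video_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
# video_model.fit(frame_features, labels, epochs=10)  # hypothetical training data
```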
05:22The challenge with these approaches to image and video models, however, is that the amount
05:25of data we need to truly mimic human vision is incredibly large.
05:30If we train our model to recognize this picture of a duck, then as long as it's shown this one picture,
05:35with this lighting, color, angle, and shape, it can tell that it's a duck. But if you change
05:39any of that, or even just rotate the duck, the algorithm might not recognize what it
05:44is anymore.
05:45Now, this is the big picture problem.
05:47To get an algorithm to truly understand and recognize image content the way the human brain
05:51does, you need to feed it incredibly large amounts of data of millions of objects across
05:56thousands of angles, all annotated and properly defined.
06:00The problem is so big that if you're a small startup or a company lean on funding, there are
06:04just no resources available for you to do that.
06:08This is why technologies like Google Cloud Vision and Video can help.
06:12Google digests and filters millions of images and videos to train these APIs.
06:17We've trained a network to extract all kinds of data from images and video so that your
06:21application doesn't have to.
06:23With just one REST API request, we're able to access a powerful pre-trained model that
06:27gives us all sorts of metadata.
06:30Here's how easy it is to call the Cloud Vision API with CURL.
06:34I'll send this image to the API, and here's the response we get back.
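For reference, an equivalent request can be made from Python; here's a minimal sketch against the public images:annotate REST endpoint, where the API key, file name, and feature list are placeholders rather than the exact on-screen demo:

```python
import base64
import json
import requests

API_KEY = "YOUR_API_KEY"                                   # placeholder
url = f"https://vision.googleapis.com/v1/images:annotate?key={API_KEY}"

with open("flower.jpg", "rb") as f:                        # hypothetical image file
    content = base64.b64encode(f.read()).decode("utf-8")

body = {
    "requests": [{
        "image": {"content": content},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 5}],
    }]
}

response = requests.post(url, json=body)
print(json.dumps(response.json(), indent=2))               # labels like "flower" with confidence scores
```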
06:39Billions of years since the evolution of our sense of sight, we've found that computers
06:43are on their way to matching human vision, and it's all available as an API.
06:47If you'd like to know more about the Cloud Vision and Video APIs, check out their product
06:51pages at the links here to see how you can easily add machine learning to your application.
06:56Thanks.
06:57Thanks for watching.