TechTranscript
00:00Human vision is amazingly beautiful and complex.
00:03It all started billions of years ago, when small organisms developed a mutation that
00:08made them sensitive to light.
00:10Fast forward to today, and there's an abundance of life on the planet, which all share very
00:14similar visual systems.
00:16They include eyes for capturing light, receptors for turning it into signals, and a visual
00:21cortex for processing it.
00:23Finely engineered and balanced pieces of a system, which help us do things as simple
00:27as appreciating a sunrise.
00:30But this is really just the beginning.
00:32In the past 30 years, we've made even more strides toward extending this amazing visual ability,
00:36not just to ourselves, but to machines as well.
00:40The first type of photographic camera was invented around 1816, when a small box held a piece
00:45of paper coated with silver chloride.
00:47When the shutter was open, the silver chloride would darken where it was exposed to light.
00:51Now, 200 years later, we have much more advanced versions of that system that can capture photos
00:57directly in digital form, so we've been able to closely mimic how the human eye captures
01:02light and color.
01:03But it's turning out that that was the easy part.
01:06Understanding what's in the photo is much more difficult.
01:09Consider this picture.
01:11My human brain can look at it and immediately know that it's a flower.
01:15Our brains are cheating, since we've got a couple million years' worth of evolutionary
01:18context to help immediately understand what this is.
01:21But a computer doesn't have that same advantage.
01:24To an algorithm, the image looks like this.
01:27Just a massive array of integer values, which represent intensities across the color spectrum.
01:32There's no context here.
01:34Just a massive pile of data.
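To make that concrete, here's a minimal sketch (my own illustration, not from the video) of what an image looks like to a program, assuming Python with the Pillow and NumPy libraries and a hypothetical file named flower.jpg:

```python
from PIL import Image
import numpy as np

# Load the photo and look at the raw numbers an algorithm actually receives.
img = np.array(Image.open("flower.jpg"))  # hypothetical file name

print(img.shape, img.dtype)  # e.g. (480, 640, 3) uint8: rows x columns x RGB channels
print(img[0, 0])             # the top-left pixel: three integers between 0 and 255
```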
01:37It turns out that context is the crux of getting algorithms to understand image content
01:41in the same way that the human brain does.
01:44And to make this work, we use an approach loosely modeled on how the human brain operates:
01:49machine learning.
01:51Machine learning allows us to effectively learn that context from a dataset, so that an
01:55algorithm can understand what all those numbers in a specific arrangement actually represent.
02:00And what if we have images that are difficult for a human to classify?
02:04Can machine learning achieve better accuracy?
02:06For example, let's take a look at these images of sheepdogs and mops, where it's pretty hard
02:11even for us to differentiate between the two.
02:15With a machine learning model, we can take a bunch of images of sheepdogs and mops, and
02:19as long as we feed it enough data, it will eventually be able to properly tell the difference
02:23between the two.
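As a rough illustration (not shown in the video), supervised training like this usually starts from a folder of labeled examples. A minimal sketch using TensorFlow/Keras, assuming hypothetical directories data/sheepdog/ and data/mop/:

```python
import tensorflow as tf

# Assumes images are sorted into data/sheepdog/*.jpg and data/mop/*.jpg (hypothetical paths);
# Keras infers the two class labels from the directory names.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data",
    image_size=(128, 128),
    batch_size=32,
)
print(train_ds.class_names)  # ['mop', 'sheepdog']
```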
02:25Computer vision is taking on increasingly complex challenges and is achieving accuracy that rivals
02:30that of humans performing the same image recognition tasks.
02:33But like humans, these models aren't perfect.
02:35They do sometimes make mistakes.
02:38The specific type of neural network that accomplishes this is called a convolutional neural network,
02:43or CNN.
02:44CNNs work by scanning an image in smaller groups of pixels using what's called a filter.
02:50Each filter is a small matrix of weights, and the network does a series of calculations on these
02:54pixels, comparing them against the specific patterns the filter is looking for.
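For intuition, here's a small sketch of sliding a single hand-written 3x3 edge filter over a grayscale image with NumPy (my own example; a real CNN learns its filter values rather than using hand-picked ones):

```python
import numpy as np

def apply_filter(image, kernel):
    # Slide a small kernel across a 2D grayscale image (no padding).
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # the group of pixels under the filter
            out[i, j] = np.sum(patch * kernel)  # how strongly the patch matches the pattern
    return out

# A simple vertical-edge detector: responds where brightness changes from left to right.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])
```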
02:59The first layers of a CNN are able to detect low-level patterns like rough edges
03:03and curves.
03:04As the network performs more convolutions, it can begin to identify specific objects
03:08like faces and animals.
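A minimal Keras sketch of that stacking idea, with early layers picking up simple patterns and later layers combining them into larger shapes; the layer sizes and the 128x128 input are assumptions for illustration, not values from the video:

```python
import tensorflow as tf

# Two convolutional stages followed by a classifier head (e.g. sheepdog vs. mop).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # early filters: edges and curves
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # later filters: larger, more specific shapes
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),    # probability for each of the two classes
])
```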
03:11How does a CNN know what to look for, and whether its prediction is accurate?
03:15This is done through a large amount of labeled training data.
03:18When the CNN starts, all of the filter values are randomized.
03:22As a result, its initial predictions make little sense.
03:26Each time the CNN makes a prediction against labeled data, it uses an error function to compare
03:30how close its prediction was to the image's actual label.
03:34Based on this error or loss function, the CNN updates its filter values and starts the
03:38process again.
03:40Ideally, each iteration is slightly more accurate than the last.
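That predict-compare-update cycle might look roughly like this, a sketch reusing the model and dataset assumed in the earlier snippets:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

for epoch in range(5):                                    # each pass should be a bit more accurate
    for images, labels in train_ds:
        with tf.GradientTape() as tape:
            predictions = model(images, training=True)    # predict against labeled data
            loss = loss_fn(labels, predictions)           # error between prediction and true label
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update the filter values
```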
03:44What if instead of analyzing a single image, we want to analyze a video using machine learning?
03:48At its core, a video is just a series of image frames.
03:52To analyze a video, we can build on our CNN for image analysis.
03:56In still images, we can use CNNs to identify features.
03:59But when we move to video, things get more difficult, since the items we're identifying
04:04might change over time.
04:06Or, more likely, there's context between the video frames that's highly important to labeling.
04:11For example, if there's a picture of a half-full cardboard box, we might want to label it
04:15packing a box or unpacking a box, depending on the frames before and after it.
04:21This is where CNNs come up lacking.
04:23They can only take into account spatial features, the visual data in an image, but can't handle
04:28temporal or time features, how a frame is related to the one before it.
04:33To address this issue, we have to take the output of our CNN and feed it into another model,
04:38which can handle the temporal nature of our videos.
04:41This type of model is called a Recurrent Neural Network, or RNN.
04:45While a CNN treats groups of pixels independently, an RNN can retain information about what it's
04:51already processed, and use that in its decision making.
04:55RNNs can handle many types of input and output data.
04:58In this example of classifying videos, we train the RNN by passing it a sequence of frame descriptions
05:04(empty box, open box, closing box) and finally a label: packing.
05:09As the RNN processes each sequence, it uses a loss or error function to compare its predicted
05:14output with the correct label.
05:16Then it adjusts the weights and processes the sequence again until it achieves a higher
05:20accuracy.
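Here's a rough sketch of that idea (my own illustration, with assumed sizes): each frame is first encoded by a CNN into a feature vector, and an LSTM, one common kind of RNN, reads the sequence of frame features and predicts a single label such as packing or unpacking:

```python
import tensorflow as tf

NUM_FRAMES, FEATURE_DIM, NUM_CLASSES = 30, 1280, 2   # assumed sizes, not from the video

# Input: one feature vector per frame, produced by running a CNN over each frame.
video_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, FEATURE_DIM)),
    tf.keras.layers.LSTM(128),                                  # carries context from earlier frames forward
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),   # e.g. packing vs. unpacking
])

video_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
# video_model.fit(frame_features, labels, epochs=10)  # hypothetical training data
```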
05:22The challenge with these approaches to image and video models, however, is that the amount
05:25of data we need to truly mimic human vision is incredibly large.
05:30If we train our model to recognize this picture of a duck, then as long as it's shown this one picture,
05:35with this lighting, color, angle, and shape, it can tell that it's a duck. But if you change
05:39any of that, or even just rotate the duck, the algorithm might not recognize what it
05:44is anymore.
05:45Now, this is the big picture problem.
05:47To get an algorithm to truly understand and recognize image content the way the human brain
05:51does, you need to feed it incredibly large amounts of data of millions of objects across
05:56thousands of angles, all annotated and properly defined.
06:00The problem is so big that if you're a small startup or a company lean on funding, there are
06:04just no resources available for you to do that.
06:08This is why technologies like Google Cloud Vision and Video can help.
06:12Google digests and filters millions of images and videos to train these APIs.
06:17We've trained a network to extract all kinds of data from images and video so that your
06:21application doesn't have to.
06:23With just one REST API request, we're able to access a powerful pre-trained model that
06:27gives us all sorts of metadata.
06:30Here's how easy it is to call the Cloud Vision API with CURL.
06:34I'll send this image to the API, and here's the response we get back.
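For reference, an equivalent request can be made from Python; here's a minimal sketch against the public images:annotate REST endpoint, where the API key, file name, and feature list are placeholders rather than the exact on-screen demo:

```python
import base64
import json
import requests

API_KEY = "YOUR_API_KEY"                                   # placeholder
url = f"https://vision.googleapis.com/v1/images:annotate?key={API_KEY}"

with open("flower.jpg", "rb") as f:                        # hypothetical image file
    content = base64.b64encode(f.read()).decode("utf-8")

body = {
    "requests": [{
        "image": {"content": content},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 5}],
    }]
}

response = requests.post(url, json=body)
print(json.dumps(response.json(), indent=2))               # labels like "flower" with confidence scores
```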
06:39Billions of years since the evolution of our sense of sight, we've found that computers
06:43are on their way to matching human vision, and it's all available as an API.
06:47If you'd like to know more about the Cloud Vision and Video APIs, check out their product
06:51pages at the links here to see how you can easily add machine learning to your application.
06:56Thanks.
06:57Thanks for watching.