
Can Self-Driving Cars Learn Depth Perception? 🚘

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

When we humans look at an image or a piece of video footage, such as this one, we all understand that this is just a 2D projection of the world around us. So much so that, if we have the time and patience, we could draw a depth map that describes the distance of each object from the camera. This information is highly useful, because we can use it to create real-time defocus effects for virtual reality and computer games, or even perform this Ken Burns effect in 3D, or in other words, zoom and pan around in a photograph, but with a beautiful twist, because in the meantime, we can reveal the depth of the image.
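As a quick aside, a depth-driven defocus effect of this kind can be illustrated with a toy sketch (my own illustration, not code from any of the papers discussed): blend a sharp and a blurred copy of the image, weighting each pixel by how far its depth lies from the focal plane.

```python
# A toy sketch (illustration only, not from the paper): given a per-pixel depth
# map, a simple defocus effect can be faked by blending a sharp and a blurred
# copy of the image, weighted by each pixel's distance from the focal depth.
import numpy as np
from scipy.ndimage import gaussian_filter

def fake_defocus(image, depth, focus_depth, blur_sigma=5.0):
    """image: HxWx3 float array; depth: HxW array in the same units as focus_depth."""
    blurred = np.stack(
        [gaussian_filter(image[..., c], sigma=blur_sigma) for c in range(image.shape[-1])],
        axis=-1,
    )
    # Weight is 0 where the pixel sits on the focal plane and grows toward 1 away from it.
    weight = np.clip(np.abs(depth - focus_depth) / (depth.max() + 1e-8), 0.0, 1.0)[..., None]
    return (1.0 - weight) * image + weight * blurred
```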
However, when we show the same images to a machine, all it sees is a bunch of numbers. Fortunately, with the ascendancy of neural network-based learning algorithms, we now have a chance to do this reasonably well. For instance, we discussed this depth perception neural network in an earlier episode, which was trained using a large number of input-output pairs, where the inputs are a bunch of images and the outputs are their corresponding depth maps for the neural network to learn from. The authors implemented this with a random scene generator, which creates a bunch of these crazy configurations with a lot of occlusions and computes, via simulation, the appropriate depth map for each of them. This is what we call supervised learning, because we have all these input-output pairs. The solutions are given in the training set to guide the training of the neural network. This is supervised learning: machine learning with crutches.
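To make the supervised setup concrete, here is a minimal sketch of what one training step could look like (the model and tensor names are hypothetical, not the exact architecture from that episode): the network sees an image and is penalized by how far its predicted depth map lands from the simulator's ground-truth depth.

```python
# A minimal sketch of supervised depth training (hypothetical model and tensors,
# not the exact method from the earlier episode): the loss directly compares the
# prediction against the ground-truth depth rendered by the random scene generator.
import torch
import torch.nn.functional as F

def supervised_depth_step(model, optimizer, images, gt_depths):
    """images: Bx3xHxW RGB batch; gt_depths: Bx1xHxW depth maps from the simulator."""
    pred_depths = model(images)               # Bx1xHxW predicted depth
    loss = F.l1_loss(pred_depths, gt_depths)  # supervision: match the labels pixel by pixel
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```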
We can also use this depth information to enhance the perception of self-driving cars, but this application is not like the previous two I just mentioned. It is much, much harder, because in the earlier, supervised learning example, we trained the network in a simulation and then also used it later in a computer game, which is, of course, another simulation. We control all the variables and the environment here. However, self-driving cars need to be deployed in the real world. These cars also generate a lot of video footage with their sensors, which could be fed back to the neural networks as additional training data…if we had the depth maps for them, which, of course, unfortunately, we don't.

And now, with this, we have arrived at the concept of unsupervised learning. Unsupervised learning is proper machine learning, where no crutches are allowed. We just unleash the algorithm on a bunch of data, with no labels, and if we do it well, the neural network will learn something useful from it. It is very convenient, because any video we have may be used as training data. That would be great. But we have a tiny problem, and that tiny problem is that this sounds impossible. Or it may have sounded impossible, until this paper appeared.

This work promises us no less than unsupervised depth learning from videos. Since this is unsupervised, it means that during training, all it sees is unlabeled videos from different viewpoints, and somehow, it figures out a way to create these depth maps from them. So how is this even possible? Well, it is possible by adding just one ingenious idea. The idea is that since we don't have the labels, we can't teach the algorithm how to be right, but instead, we can teach it to be consistent. That doesn't sound like much, does it? Well, it makes all the difference, because if we ask the algorithm to be consistent, it will find out that a good way to be consistent is to be right!
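In practice, this family of methods usually enforces consistency across viewpoints: predict a depth map for one frame, predict the camera motion to a neighboring frame, re-render that neighbor into the first frame's viewpoint, and penalize the photometric difference. Here is a rough sketch of that idea (the network names and the warping helper are hypothetical, and the paper's exact formulation differs in its details):

```python
# A rough sketch of consistency-based, unsupervised depth learning (simplified;
# warp_to_target is a hypothetical differentiable re-projection helper, and the
# real paper adds further losses and details). There are no depth labels anywhere:
# the only training signal is that the re-rendered neighboring frame should look
# like the real one, which in turn requires a plausible depth map.
import torch
import torch.nn.functional as F

def unsupervised_consistency_step(depth_net, pose_net, optimizer,
                                  frame_t, frame_t1, intrinsics, warp_to_target):
    """frame_t, frame_t1: Bx3xHxW consecutive video frames; intrinsics: camera matrix."""
    depth_t = depth_net(frame_t)                 # Bx1xHxW depth, learned from video alone
    pose_t_to_t1 = pose_net(frame_t, frame_t1)   # Bx6 relative camera motion (rotation + translation)
    # Re-render frame_t1 into frame_t's viewpoint using the predicted depth and pose.
    rebuilt_t = warp_to_target(frame_t1, depth_t, pose_t_to_t1, intrinsics)
    loss = F.l1_loss(rebuilt_t, frame_t)         # photometric consistency, not ground-truth "rightness"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```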
While we are looking at some results, to make this clearer, let me add one more real-world example that demonstrates how cool this idea is. Imagine that you are a university professor overseeing an exam in mathematics, and someone tells you that for one of the problems, most of the students gave the same answer. If this is the case, there is a good chance that this was the right answer. It is not guaranteed, but if most of the students have the same answer, it is much less likely that they all failed in the same way. There are many different ways to fail, but there is only one way to succeed. Therefore, if there is consistency, there is often success. And this simple, but powerful thought leads to far-reaching conclusions.

Let's have a look at some more results! Wo-hoo! Now this is something. Let me explain why I am so excited about this.
This is the input image, and this is the perfect depth map that is concealed from our beloved algorithm and is there only so that we can evaluate its performance. These are two previous works, both of which use crutches: the first was trained via supervised learning by showing it input-output image pairs with depth maps, and it does reasonably well, while the other one gets even less supervision, a worse crutch if you will, and it came up with this. Now, the new unsupervised technique was not given any crutches, and it came up with this. Holy mother of papers. It looks like a somewhat coarser, but still very accurate version of the true depth maps. So what do you know! This neural network-based method just looks at unlabeled videos and finds a way to create depth maps by not trying to be right, but by trying to be consistent. This is one of those amazing papers where one simple, brilliant idea can change everything and make the impossible possible. What a time to be alive!
What you see here is an instrumentation of this depth learning paper we have talked about, which was made by Weights & Biases. I think organizing these experiments really showcases the usability of their system. Also, Weights & Biases provides tools to track your experiments in your deep learning projects. Their system is designed to save you a ton of time and money, and it is actively used in projects at prestigious labs, such as OpenAI, Toyota Research, GitHub, and more. And the best part is that if you are an academic or have an open source project, you can use their tools for free. It really is as good as it gets. Make sure to visit them through wandb.com/papers or just click the link in the video description and you can get a free demo today. Our thanks to Weights & Biases for their long-standing support and for helping us make better videos for you.

Thanks for watching and for your generous support, and I'll see you next time!


60 thoughts on "Can Self-Driving Cars Learn Depth Perception? 🚘"

  1. HOLY MOTHER OF PAPERS INDEED! Such an incredible idea to use consistency as a measurement for unsupervised learning! Will be interesting to see how people try to exploit it 😀

  2. Thank you Károly for getting me through the lockdown without losing focus on ML, that is the most generous support of all!

  3. Given the source image and a depth map, I wish they had reprojected the image from a different point of view (slightly above or to the side of the original camera). This is my favorite way to evaluate the performance of depth maps.

  4. I'll consider this for my VR projects! Great paper as usual, keep up the amazing work. Support from the Philippines

  5. I think the main reason the AI is not as good as humans at perceiving depth is that we have prior knowledge of how big the objects are compared to people, unlike the AI, which has to guess how big the objects are to be right.

  6. What a time to be alive! I'm afraid I didn't hold on to my papers… they're flying through the air from excitement.

  7. I don't really understand how this "consistency" thing works. Aren't there many different ways in which such a network could be consistent? Say for instance that the network consistently extracts the blue color channel from the image. Isn't it pretty unlikely for the consistent outputs to be perfect heightmap representations of the input images? I haven't read the paper, though, so I don't know if they address this.

  8. from my uneducated perspective, it seems this may be a monumental leap in machine learning. meaningful outcomes with messy data might be in our reach, and that's one more step towards AGI.

  9. I'm not really good at this topic, but wouldn't it be a solution to give the AI two videos with perspective, like the human eyes, and then let it figure it out on a 2D video? We humans didn't develop our sense of depth on 2D videos either.

  10. Karoly, this seems like an unusually old paper to be featured here. Is it possible that your sponsor Weights & Biases is the reason? They conveniently have an instrumentation ready…

  11. I'm just curious as to how it didn't produce some other kind of consistent result like edge detection or something simpler.

  12. This paper is relatively old; there are much better works (for those who are planning to use a depth estimator):
    https://youtu.be/b62iDkLgGSI

  13. Wouldn't it make sense to use two (or more) cameras/views? Then you could extract 3d information from each pair of input images in addition to the sequence of images. A useful application would be for new cars that have multiple cameras without having to use LIDAR systems like a few cars have. Very good paper regardless.

  14. I mean, the idea of consistency itself is not remarkable, but it is heavily underrated at the moment because people aren't courageous/smart enough to move away from supervised learning……

  15. I remember one exam in college where a majority of the students answered "?" on one question, so I agree that consistency isn't always right!

  16. I will tell all my class peers to write 1 on the next math exam question; if the answer is consistent, it must be right? haha

  17. It actually looks like some parts of the new method are more accurate where the ground truth gets it wrong. The depth map misinterprets the reflections on the car in the sixth row down, while the ML method, although blurrier, doesn't have that same problem.

  18. I can't wait for an AI that will take a cropped, square photo, learn from it, then pick one pixel and resize it to something like 1000×1000. Example: a photo of trees. The AI learns it, then you pick a single pixel of it and it will resize it to 1000×1000.

  19. To be fair: He makes it seem like one simple idea that changed everything, but the really amazingly genius part of this paper (at least for me) is that they made the reprojection of the images according to the depth maps *differentiable*! I would have liked Károly to at least mention this.

  20. I'm not in STEM or computer-science professionally, but this sounds like a method that could be super-useful in general AI, especially combined with those "curious" systems and systems that can emulate "imagination" (generate their own learning samples). Fantastic stuff.

  21. Did you automate the process of making this video? Training a supervised learning algorithm with your video as a label and the corresponding paper as an input. Just wondering if this could be a deepfake video as well. 😛

  22. I'd be really freaked out if there was a self-driving car going thru the area at 5:45 and I was one of the pedestrians. Sure, it might never hit anyone, but that's a complicated situation, even for a human driver.

  23. The "summer" footage at 2:00–2:14 — looks like the AI really wanted there to be lots of birds on lots of wires, which is cute, if true. 🙂

  24. Feels like they're going about it the wrong way. Why not just give the thing binocular vision, like every other real-world organism, and train it to stitch the two images together, again like real organisms do…

  25. I don't feel like I understand this one as much as the other videos you have done. It seems like the consistency based network is still being supervised, even if it doesn't know the exact "rightness". You gave the metaphor of a student guessing the correct answer by picking the most consistent answer, but the neural network has a whole host of options to get consistent. For example, making all the pixels equal zero would be a lot more consistent than depth mapping. In addition, there are millions of extrapolations you can make from a 2d image. You can highlight edges, corners, pictures that look like frogs, anything. A neural network can be consistent in all those other cases, so how does the network know that it is supposed to track depth rather than pictures of frogs?

  26. An interesting trick but I don't see people taking this very far. This is like saying I'm right because I'm right. More importantly for the application of robotics it isn't safe enough. Interesting to see the results tho.

  27. As a maths teacher, I can say that all the students having the same answer is not evidence that they're correct. It's evidence that they either cheated or all made the same mistake for the same reason.
    It's obvious with multiple-choice questions, where students can perform statistically significantly worse than random. For example, fewer than 10% of them being correct when asked to choose between answers a, b, c and d.

  28. Tesla is already using this in their Autopilot system. Karpathy explains the same paper starting at 2:21:20: https://www.youtube.com/watch?v=Ucp0TTmvqOE&feature=youtu.be&t=8480. Also, current state-of-the-art is already much better than this. See for example https://arxiv.org/pdf/1904.04998.pdf

  29. They shouldn't need to learn depth perception because you should have multiple cameras for stereo and something like lidar on top of that.

  30. I guessed that "consistency" meant throughout the video frames, but the example seemingly being a single frame would contradict that.
    What a time to be confused.

  31. So I skipped through this video without the audio (too long 😉 ), but I reckon you don't need ML for the task of extrapolating depth from a video. What I mean is, I think you can do it algorithmically. But I guess at some part of the video you said it is much faster with ML now…
