He ran Tesla's ML division, but still doesn't know what a simple Kalman filter is (in the sense that he claimed lidar would be hard to integrate with cameras).
The Kalman filter examples I've seen always involve estimating a very simple quantity, like the location of a single 3D point, from noisy sensors. It's clear how multiple estimates can be combined into a new estimate.
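For the simple point case, the combination step is just a covariance-weighted average. A minimal sketch, with made-up sensor variances (camera poor at depth, lidar good at it):

```python
# Minimal sketch: fusing two Gaussian estimates of one 3D point with a
# Kalman-style update. The variances are illustrative, not from any real sensor.
import numpy as np

def fuse(x_a, P_a, x_b, P_b):
    """Combine two (mean, covariance) estimates of the same state."""
    K = P_a @ np.linalg.inv(P_a + P_b)   # gain: weight by relative uncertainty
    x = x_a + K @ (x_b - x_a)            # pull toward the less noisy estimate
    P = (np.eye(len(x_a)) - K) @ P_a     # fused covariance is tighter than either input
    return x, P

cam   = (np.array([1.0, 2.0, 10.0]), np.diag([0.1, 0.1, 4.0]))   # camera: poor depth
lidar = (np.array([1.1, 2.1,  9.2]), np.diag([0.5, 0.5, 0.05]))  # lidar: good depth
x, P = fuse(*cam, *lidar)
print(x)  # fused point: near the camera laterally, near the lidar in depth
```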
I'd guess that cameras on a self-driving car are trying to estimate something much more complex, something like 3D surfaces labeled with categories ("person", "traffic light", etc.). It's not obvious to me how estimates of such things from multiple sensors and predictions can be sensibly and efficiently combined to produce a better estimate. For example, what if there is a near red object in front of a distant red background, so that the camera estimates just a single object, but the lidar sees two?
1. Make a prediction about the next state of some measurable n-dimensional quantity, and estimate the covariance matrix across those n dimensions, which roughly describes how much the i-th dimension tends to increase (or decrease) together with the j-th, where i and j are indices into the vector.
2. Gather sensor data (which can be noisy), and reconcile the predicted measurement with the measured one to get a best guess. The covariance matrix acts as a kind of weight for each of the elements.
3. Update the covariance matrix based on the measurement from the previous step. (See the sketch after this list.)
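Here's a minimal sketch of those three steps, assuming a toy 1D constant-velocity model; the F, H, Q, R matrices are illustrative placeholders, not from any real system:

```python
# Standard predict/update Kalman filter loop for a [position, velocity] state.
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition: position += velocity
H = np.array([[1.0, 0.0]])              # we only measure position
Q = np.eye(2) * 0.01                    # process noise covariance
R = np.array([[0.5]])                   # measurement noise covariance

x = np.array([0.0, 1.0])                # initial state guess [position, velocity]
P = np.eye(2)                           # initial state covariance

def step(x, P, z):
    # 1. predict the next state and its covariance
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # 2. reconcile the prediction with the noisy measurement z;
    #    the gain K weighs each element by relative uncertainty
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - H @ x_pred)
    # 3. update the covariance to reflect the information gained
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [np.array([1.2]), np.array([1.9]), np.array([3.1])]:
    x, P = step(x, P, z)
    print(x)  # estimate converges toward position ≈ t, velocity ≈ 1
```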
You can do this for any vector of numbers. For example, instead of tracking individual objects, you can have a grid where each element represents a physical spot the car should not drive into, with a value representing the certainty of something being there. Then when you combine sensor readings, you can still use your vision model, but it gets enhanced by what the lidar detects, both in terms of seeing things the camera doesn't pick up and rejecting things that aren't there.
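A hedged sketch of that grid idea, assuming a log-odds occupancy grid where independent sensors just add their evidence per cell (the evidence values are invented placeholders); note how it would handle the red-on-red case above, with lidar subtracting the cells that are actually free space:

```python
# Sketch of a log-odds occupancy grid. Evidence values are made up.
import numpy as np

grid = np.zeros((100, 100))   # log-odds of occupancy; 0 means 50/50

def integrate(grid, evidence):
    """Bayesian update in log-odds form: independent sensors just add evidence."""
    return grid + evidence

# Hypothetical output of each sensor's inverse model for one frame:
camera = np.zeros((100, 100))
camera[40:60, 50] = 1.5       # camera sees one big red blob across these cells

lidar = np.zeros((100, 100))
lidar[40:45, 50] = 2.0        # lidar confirms the near object...
lidar[45:60, 50] = -2.0       # ...and reports free space where the far background is

grid = integrate(grid, camera)
grid = integrate(grid, lidar)
obstacles = grid > 0          # cells the planner must treat as occupied
```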
And the concept is generic enough that you can set up a system to plug in any additional sensor with its own noise, and it all works out in the end. This is used all the time. You can even extend the concept past Gaussian noise and linearity; there are a number of other filters that deal with that, broadly under the umbrella of sensor fusion.
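To make the plug-in-a-sensor point concrete, here's a sketch where each sensor carries its own (hypothetical) measurement model H and noise R, and fusing a new sensor is just one more call to the same update:

```python
# Each sensor brings its own H and R; fusion is running the same update
# once per reading. All numbers here are illustrative only.
import numpy as np

def update(x, P, z, H, R):
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # gain from this sensor's noise
    return x + K @ (z - H @ x), (np.eye(len(x)) - K @ H) @ P

x, P = np.array([0.0, 1.0]), np.eye(2)             # [position, velocity] estimate
H_cam,   R_cam   = np.array([[1.0, 0.0]]), np.array([[0.5 ]])   # noisy camera range
H_lidar, R_lidar = np.array([[1.0, 0.0]]), np.array([[0.05]])   # precise lidar range

x, P = update(x, P, np.array([3.0]), H_cam,   R_cam)
x, P = update(x, P, np.array([3.1]), H_lidar, R_lidar)  # second sensor tightens P
```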
The problem is that Karpathy is more of a computer scientist, so he's on his Software 2.0 train of having ML models do everything. I don't know if he's like that himself or if Musk's "I'm smarter than everyone else that came before me" attitude rubbed off.
And of course when you think like that, it's going to be difficult to integrate lidar into the model. But the problem with that thinking is that a forward-inference LLM is not AI, and it will never be able to drive a car as well as a true "reasoning" AI with feedback loops.