CoTracker3: Revolutionizing Point Tracking with Simplicity

Dalton Anderson (00:01.39)
Welcome to the Venture Step podcast where we discuss entrepreneurship, industry trends, and the occasional book review. Imagine a world where we could seamlessly track objects. Why do you think that's important? Think robotics. Think about your favorite dog emoji filter on Snapchat and Instagram. In this episode, we're gonna be discussing Meta's CoTracker 3 research paper.

Two episodes ago, I talked about Meta's shipping spree. Within that episode, I had a live demo of CoTracker3 with their online model. If you're interested, please check out that episode. In this episode, we're gonna be diving into their research paper and discussing some of the things that they're solving and why it's pretty cool.

But before we dive in, I'm your host, Dalton Anderson. My background is a bit of a mix of programming, data science, and insurance. Offline, you can find me running, building my side business, or lost in a good book. You can listen to this podcast in video and audio format on YouTube. And if audio is more your thing, you can find this podcast on Apple Podcasts, Spotify, and YouTube, or wherever else you get your podcasts.

The first thing I'd like to touch on is: what is point tracking? I talked about it two episodes ago, and we're going to touch on it again. Point tracking involves what it sounds like: tracking an object within a video or an image and understanding its interactions frame by frame. What does that mean? Well, in the live demo, what CoTracker did was track a man walking down a street. And the way that it tracked this person was it put down a grid of points, pixel by pixel, frame by frame, on the video. Within that video, when the object moved around, there were tracks.

Dalton Anderson (02:21.632)
So within this, there were identified points on the subject that we wanted to follow and link to the subject. So there would be a couple of points on the body: the legs, the knees, the elbows, the hands, the head. And so as that person walked in the video, the point tracking would track that object.
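To make the idea of a grid of points and their tracks concrete, here is a minimal sketch of the kind of data a point tracker produces: one position and one visibility flag per point, per frame. This is not CoTracker's API. The names `TrackPoint`, `make_grid`, and `follow_constant_motion` are hypothetical, and the "tracker" simply assumes every point moves by the same amount each frame.

```python
from dataclasses import dataclass


@dataclass
class TrackPoint:
    x: float
    y: float
    visible: bool  # False once the point has left the frame


def make_grid(width, height, step):
    """Return (x, y) query points laid out on a regular grid."""
    return [(x, y) for y in range(0, height, step) for x in range(0, width, step)]


def follow_constant_motion(points, num_frames, dx, dy, width, height):
    """Toy stand-in for a tracker: shift every point by (dx, dy) per frame."""
    tracks = []
    for t in range(num_frames):
        frame = []
        for x0, y0 in points:
            x, y = x0 + dx * t, y0 + dy * t
            inside = 0 <= x < width and 0 <= y < height
            frame.append(TrackPoint(x, y, inside))
        tracks.append(frame)
    return tracks


grid = make_grid(width=8, height=8, step=4)  # four query points
tracks = follow_constant_motion(grid, num_frames=3, dx=3, dy=0, width=8, height=8)
print(len(tracks), "frames,", len(tracks[0]), "points per frame")
```

The result is indexed as `tracks[frame][point]`, which mirrors the "frame by frame" description: each tracked point has a trajectory through time plus a flag saying whether it can currently be seen.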

So why is that important?

Well, think about what is called 3D reconstruction.

That is a basic and crucial application for autonomous driving, virtual reality, and medical imaging. Think about video editing and special effects, like tracking a subject and providing that subject with super armor or whatever powers they have and linking those powers with their hand. Or motion analysis: understanding players on a field or the different things that they're doing. All that stuff involves tracking.

Dalton Anderson (03:37.954)
Now that we understand the crucial applications of tracking, let's talk about how this has been evolving. So there are three models that are going to be discussed. The first one is the model called PIPs, which led to the evolution of tracking. And I'm gonna read this off the screen because it's quite long and I can't remember all of it. So PIPs is "Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories." And this was the first model that used deep learning for tracking. And then someone built on top of that.

So another group of people came out with TAPIR, which is T-A-P-I-R, which stands for Tracking Any Point with Per-frame Initialization and Temporal Refinement. Okay, so that's a lot going on. That's why I read it off. And this tracking any point with per-frame initialization, what is that? So it's a bit different than the initial approach, but builds off of the same concept. Instead of doing basic tracking, it does global tracking, or global matching is what they're calling it, to improve tracking accuracy. One of the key issues with global matching is that it is quite resource intensive.

And now there is what is called CoTracker, which is what this episode is about.

Dalton Anderson (05:45.164)
So CoTracker3 incorporates correlation between different tracks to improve performance, particularly when an object is occluded, which is when it's hidden or has moved out of the frame.

Okay, so what are some of the challenges with tracking? We have occlusions, which is when an object becomes temporarily hidden. Complex motions: think about rapid or erratic movements. And then we have changing appearances, where there's some kind of lighting change or object deformation. Like maybe the object was in a crash and had a lot of kinetic energy exerted onto it and it's deformed. Or say that I turned up the lights or turned down the lights: the exposure and the way that I look and how I appear on the video will drastically change depending on the lighting.

So those are the kind of things that trackers have trouble with. How do they deal with it with CoTracker3? They use joint tracking. So joint tracking uses multiple points together, and the thought process on this is pretty simple. Think about it in the sense that CoTracker is trying to digitalize vision, as I discussed in the last episode with Meta's Sparsh, which is the ecosystem for robotics to digitalize touch. If you think about how touch works in your body, think about how vision works in your body.

Dalton Anderson (07:45.11)
Joint tracking, what does that really mean? So if you were looking at a car going down the road, you could identify separate points of the car that are critical to the structure. So you can think about the hood, the roof, the headlights, the wheels. All of those things are dependent on the structure and they have to move together because they're part of the structure. So what joint tracking is, is okay, I have the car, and the car's hood and the roof, those are kind of together. So I'm going to track those things together and they're grouped. With joint tracking, it has some logic baked in saying, hey, this is one structure, let's track them together.

And then there are these 4D correlation features, which I found pretty confusing when I was reading the paper, and I had to reread the paragraph a couple of times because it kind of turned my head a bit sideways. I was thinking about it: the real world is 3D. Videos are 2D. So how does a 2D video that films a 3D world become 4D? And what they're saying is it's the height and the width and time, which is 3D. And then 4D is correlation between the different tracks. And so it compares the features that are identified of an object and then extracts them at different frames and creates these, you know, patterns of motion, and then those patterns are part of this fourth dimension.

And then there are iterative updates. So the way that this architecture is created is that, one, they have the joint tracking, and then they have the 4D correlation features, and then they also have these iterative updates.

Dalton Anderson (09:54.746)
So the iterative updates change the correlation over time. While the video is being processed, the transformer is constantly updating, and it uses things like, okay, object texture, speed, motion, how visible is the object? When is the object not visible, or do you think it changes directions? And it improves the estimates because it's constantly updating upon itself in real time.
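The iterative-update idea can be illustrated with a toy sketch: start from a rough guess for where a point is, compare the point's original feature with the features around the guess, move to the best match, and repeat until nothing improves. This is only a hand-written stand-in. In the model discussed in the episode a transformer produces the updates; the function name `refine_position`, the scalar features, and the 3x3 search neighborhood here are all assumptions made for illustration.

```python
def refine_position(feature_map, query_value, start, max_iters=10):
    """Repeatedly move to the neighboring cell whose feature best matches the query.

    feature_map: 2D list of floats (one toy feature per pixel).
    query_value: the feature of the point when it was first selected.
    start: initial (row, col) guess in the current frame.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    r, c = start
    for _ in range(max_iters):
        # Lower mismatch is better; begin with the current position.
        best = (abs(feature_map[r][c] - query_value), r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    cand = (abs(feature_map[nr][nc] - query_value), nr, nc)
                    if cand < best:
                        best = cand
        if (best[1], best[2]) == (r, c):
            break  # no neighbor matches better, so the estimate has settled
        r, c = best[1], best[2]
    return (r, c)


# Toy frame: the feature value peaks at row 3, column 4.
frame = [[10 - (abs(r - 3) + abs(c - 4)) for c in range(5)] for r in range(5)]
estimate = refine_position(frame, query_value=10, start=(0, 0))
print(estimate)
```

Each pass only looks near the current estimate, which is why several passes are needed: the guess is refined a little at a time rather than found in one global search.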

Okay, so now we're gonna move over to CoTracker3 and how it works. What are some of the key differences between CoTracker3 and these other models?

Dalton Anderson (10:41.678)
The first one I think is crucial to identify is that these tracking models, similar to the last episode, if you listened to it, are not very generalized, and they rely on synthetic data because it's quite expensive to label the video datasets that allow for training. And one of the issues in the last episode was that it was so expensive that companies chose to build in-house models and in-house datasets, and they weren't being shared. That's not as much of an issue here as in the previous one, because I think there's less direct commercial application, but there's a similar issue with the synthetic data piece. It's not that there's not enough training data, because there are videos everywhere, right? There aren't enough training videos that are labeled, and they need the labeling to be able to be trained. So in the last episode, I talked about how they use a self-supervised learning approach that had the model figure out what and where and how they should label these images and videos.

Similar thing here, where they use knowledge distillation, which I talked about in the last episode. Knowledge distillation is when you have a teacher, or multiple teachers in this case. With CoTracker3, they use multiple teachers. These teachers are models already pre-trained on training data. And what the teachers are doing is, one, labeling the videos. So they're using the teachers to label real-world videos instead of using synthetic data. And the issue with synthetic data is, one, it's not generalized: it doesn't model off the real world. And then in addition to that, synthetic datasets don't scale and they're quite expensive to create.

Dalton Anderson (13:04.036)
That being said, CoTracker3 doesn't use synthetic videos, it uses real videos. It's such a weird phrase by the way, real videos. Okay, what do you mean? Real, what is real? But anyways, at that point, you have the teachers providing the labeled videos. They're providing the output to the student model, which is what becomes CoTracker3.

And I would think about it as an analogy where each teacher has different strengths and weaknesses. Like think about your math teacher, think about your history teacher, think about your science teacher. They all have different personality characteristics given the subject matter that they're interested in, different quirks and different strengths. The student's job is to, one, take the results from the teachers, study them, and figure out how to get similar results. And then two, take the strengths of each teacher and embody them. And eventually the student is more competent than the teachers.
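The teacher-and-student setup described above can be sketched in a few lines: teachers label unlabeled videos, and the student is scored on how closely it reproduces those labels. Everything here is illustrative. The two teacher functions are hypothetical stand-ins for pre-trained trackers, picking one teacher at random per video is an assumption about how several teachers might be combined, and the mean absolute error is just one simple choice of loss.

```python
import random


def teacher_shift_right(video):
    """Hypothetical teacher A: predicts the point moves +1 in x per frame."""
    return [(float(t), 0.0) for t in range(len(video))]


def teacher_shift_diag(video):
    """Hypothetical teacher B: predicts +1 in x and +1 in y per frame."""
    return [(float(t), float(t)) for t in range(len(video))]


def make_pseudo_labels(videos, teachers, rng):
    """Label each unlabeled video with the track from one randomly chosen teacher."""
    return [rng.choice(teachers)(video) for video in videos]


def track_loss(student_track, teacher_track):
    """Mean absolute position error between the student and the pseudo-label."""
    total = sum(abs(sx - tx) + abs(sy - ty)
                for (sx, sy), (tx, ty) in zip(student_track, teacher_track))
    return total / len(teacher_track)


videos = [["frame0", "frame1", "frame2"] for _ in range(4)]  # placeholder frames
teachers = [teacher_shift_right, teacher_shift_diag]
labels = make_pseudo_labels(videos, teachers, random.Random(0))

# An untrained student that predicts "the point never moves".
student_track = [(0.0, 0.0)] * 3
losses = [track_loss(student_track, label) for label in labels]
print(losses)
```

Training would then adjust the student to push these losses down. No human labeling is involved at any step, which is the point of the approach: the only inputs are raw videos and existing models.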

So that's a great approach to strengthen the model, and not only does it strengthen the model, it allows the model to use real-world data. And then another cool thing is that CoTracker3, with this simplified architecture and approach, is no longer as resource intensive.

Dalton Anderson (14:45.654)
But then the other piece is that now it's able to get similar results to the competitor models while training on a thousand times less data. The competitor model trained on 15 million videos. CoTracker3 trained on 15,000. Big difference: better performance, more efficient, and real-world data, which makes it more generalized. So they're solving quite a few things there. They simplified the architecture. They allowed the processing of real-world data. They use knowledge distillation to allow the model to pick up on things a bit faster than it would normally. And I talked about inheriting the strengths and the ensemble effect, and understanding, okay, it's a group of individuals that all have strengths and weaknesses, and these multiple teacher models have all of these pros and cons, and the student's taking all the pros, none of the cons.

Dalton Anderson (16:02.254)
There are some other pieces that I found quite interesting. The first one I wanna talk about is feature mapping. So they use CNNs, which are convolutional neural networks. A CNN is a spatial model that understands and vectorizes images. I would think about it as maybe like a magnifying glass. It has multiple layers, so think about the magnifying glass having different zoom points. The first magnifying glass might be very close and fine grained, so it's only checking for different pixel colors. And these colors might be, okay, the background is white and something blue is there. So it identifies the pixels in that video or image that have different colors, and the different edges and textures involved. And then the next layer will go over, and maybe the magnifying glass is zoomed out a bit and you can see a bigger picture. And from there, you can see, okay, this is an object. This is a blue hat or this is a blue car. This is the sky, or whatever it is. And then the third one might be a little bit more detailed: okay, within this object, there are different things going on. That's feature mapping.

The next thing I was going to talk about was the 4D correlation features, but I remember that we already talked about that, and iterative updates we've touched on. So the next thing is to talk about technical aspects. One that I thought was interesting was what's called downsampling. Downsampling is the approach of having an image, right? Say that your image on its own is a thousand and eighty pixels.

Dalton Anderson (18:18.446)
I don't know how many pixels are in an HD image off the top of my head. I know you could look it up pretty easily, but for this instance, let's just go with a hundred. I think that's easier numbers. So this original image has a hundred pixels. Downsampling decreases the pixels of the image. Why would you want to do that? After all, then it would become blurry and things are less clear, and it gives you less understanding of what's going on in this frame, which is an image within a video. You do that because it's faster to process. So downsampling allows the model to understand roughly what's going on and then from there, identify, okay, this is an object moving this way, blah, blah, blah.

Dalton Anderson (19:19.618)
Then it's able to process more information than it normally would be able to. It just basically reduces the resolution of a frame within a video. It's refined enough to understand some of the details, high level, but not necessarily everything. And the next thing would be multi-scale features.
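The downsampling idea described above can be sketched using the same round number of a hundred pixels. This is a minimal illustration, assuming a 10 by 10 grayscale image and simple 2x2 averaging; the function name `downsample_2x` is made up for the example, and the paper's actual downsampling may differ.

```python
def downsample_2x(image):
    """Average each non-overlapping 2x2 block into one pixel.

    image: 2D list of numbers with even height and width.
    """
    out = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            block = (image[r][c] + image[r][c + 1] +
                     image[r + 1][c] + image[r + 1][c + 1])
            row.append(block / 4)
        out.append(row)
    return out


image = [[float(r * 10 + c) for c in range(10)] for r in range(10)]  # 100 pixels
small = downsample_2x(image)
print(len(image) * len(image[0]), "pixels ->", len(small) * len(small[0]), "pixels")
```

Halving the width and height leaves a quarter of the pixels, so every later step has roughly a quarter of the work to do. The cost is the blurriness mentioned above: each output pixel is an average, so fine detail inside a 2x2 block is lost.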

The multi-scale features are, I would say, like different zoom points of looking at a frame: zooming out and zooming in to understand the different nuances. It's, I would say, maybe similar to the CNN, but it just understands, okay, here are the features identified at this zoom level, and here are the features identified here. And then from that they can understand the movement of the object or a different camera perspective. If you look at a photo or a piece of art and you look at it very close up, you can see every minute detail. You can feel the emotion of each breaststroke, right? Or brush stroke, not breaststroke. I'm using swimming terminology. Brush stroke. And then when you step five steps back, you'll see something that you couldn't see before because you have a different perspective. And then you step 15 steps back and you might see something that people have never seen before. It's kind of a similar thing.
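The "different zoom levels" idea can be sketched by computing the same simple feature on progressively smaller copies of a frame. This is a toy illustration only: the feature is a hand-written horizontal edge detector rather than a learned CNN, the shrinking step just keeps every second pixel, and the names `halve`, `horizontal_edges`, and `multi_scale_features` are hypothetical.

```python
def halve(image):
    """Keep every second pixel in each direction (a crude way to zoom out)."""
    return [row[::2] for row in image[::2]]


def horizontal_edges(image):
    """Absolute difference between horizontally adjacent pixels."""
    return [[abs(row[c + 1] - row[c]) for c in range(len(row) - 1)]
            for row in image]


def multi_scale_features(image, levels):
    """Return one edge map per zoom level, finest first."""
    features = []
    for _ in range(levels):
        features.append(horizontal_edges(image))
        image = halve(image)
    return features


# Toy 8x8 frame: dark left half, bright right half.
frame = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(8)]
for level, edges in enumerate(multi_scale_features(frame, levels=3)):
    print("level", level, "first row:", edges[0])
```

The boundary between the dark and bright halves shows up at every level, just at a coarser position each time. That is the stepping-back effect from the painting analogy: the close-up view gives precise location, while the zoomed-out view summarizes a larger area in fewer values.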

So where does that leave applications?

Dalton Anderson (21:15.362)
Well, as I mentioned earlier, 3D tracking and 3D reconstruction for robotics, autonomous driving, all those really cool applications that are gonna be interesting and revolutionize society, they're important. You need to be able to track objects that are around you. You need to be able to predict where objects might go in an instant. For you to be able to have autonomous driving, there needs to be advanced tracking, and not only tracking of moving objects, but stationary ones like a stop sign, or a pedestrian or a bird or whatever it is. You need to be able to track these things when you are operating a vehicle that can cause physical harm to other people and/or things.

And if you can't track, you can't drive, is what I say to the autonomous vehicle. If you can't track, you can't drive.

Video generation and editing. So I mentioned earlier the special effects, like generating these special effects. Say that I have some hand movement and I shoot out lightning from my hand. How do you track my arm movement and my hand? And maybe with my muscle flexing, that's when the lightning's supposed to shoot out. That's a different level of detail you'd be able to get if you had more advanced tracking.

Dalton Anderson (22:50.03)
Or you have these situations where, instead of shooting with multiple costumes, maybe you could just put the costume on virtually. You're on a green screen, you have your original costume, and maybe you turn into the Hulk and now you're the Hulk, or whatever. That's a bad example because the Hulk is definitely special effects.

Dalton Anderson (23:23.726)
Well, you know what I mean? You could potentially maybe get in some kind of mood or something going on in your movie and you could change outfits. Not necessarily change outfits, but change demeanor and outfits altogether. Maybe you're in a white outfit, and now that you're all mad, it's this rustic red or something like that, without actually having to change. And then you can go back to your other mood and boom, you're back to normal.

That's probably not the best example, but I don't do movies, so it's the best I could come up with on the fly. And as I mentioned before, robotics: alongside understanding and digitalizing touch, you're also digitalizing vision, and for a robot to understand how things are interacting with them and their reality, they need to also understand reality itself.

That means mimicking a lot of these human functions that we have that are related to thought, vision, sight, touch, hearing. If you can see and understand touch, you're pretty far along. If you could hear, that might help as well, but I think that is gonna be later on.

Dalton Anderson (25:05.238)
But if you could figure out the very complex nuances between vision, tracking objects within your vision, and then how to interact with objects via touch, you're pretty far along on just being able to assign tasks to robots and have robots working independently. And maybe they have little robot teams and there's a robot team leader and all sorts of crazy stuff that's going to be going on in the future. And hopefully it happens in my lifetime. I'd love to see how everything plays out.

Dalton Anderson (25:31.958)
Last thing would be sports analysis. So sports analysis is understanding where or what is interacting with the ball or the objects that you're tracking, whether that's humans playing on a field, or I guess horses or dogs or other things of that nature. Maybe you could use it to track animals. If you had a three-minute video of an animal, a four-minute video, or a real-time video, where would an object be going if you could only see it for a little bit in certain areas of the tree line, for hunting?

But I was mostly thinking about coaching, like a professional coach or a professional trainer being able to provide in-depth details of what a player is doing and how they could do things better. And I talked about physical therapy before, but this would be quite advanced. It'd be physical therapy for professional athletes or very wealthy individuals.

But the sports analysis thing is quite interesting. I would love to have real-time tracking of an individual, and maybe you could track stuff on your phone. I don't know. I mean, you could run this online. They've got the online and offline model. The offline model works a little bit better at tracking objects, but is limited by your memory. So if you don't have enough memory, then it won't be able to track indefinitely. The online model, if you run that online, doesn't have as good tracking, but the tracking can be done indefinitely in real time. And it's not limited to your memory, it's limited to Meta's memory, I guess, and their memory is sufficient for you, for the most part.
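One way to picture the memory trade-off between the two modes is to assume, as is common for streaming video models, that an "online" tracker works through the video in short overlapping windows, while an "offline" tracker looks at every frame at once. That reading is an assumption here rather than something stated in the episode, and the window and stride sizes below are made up for illustration.

```python
def sliding_windows(num_frames, window, stride):
    """Yield (start, end) frame ranges covering the video, end exclusive.

    Assumes 0 < stride <= window so that consecutive windows overlap or touch.
    """
    start = 0
    while True:
        end = min(start + window, num_frames)
        yield (start, end)
        if end == num_frames:
            break
        start += stride


num_frames = 10
windows = list(sliding_windows(num_frames, window=4, stride=2))
print("windowed:", windows)
print("frames held at once, windowed:", max(end - start for start, end in windows))
print("frames held at once, whole video:", num_frames)
```

With windows, the number of frames held in memory stays fixed however long the video runs, which is why tracking can continue indefinitely. Holding the whole video gives the model more context to work with, at the cost of memory that grows with the video's length.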

Dalton Anderson (27:38.244)
But the future of tracking, this point tracking feature, it's gonna be great for robotics, and it could potentially help shape some of the future research that comes out. One of the key concerns with this technology might be the tracking of people. We have seen some articles about that throughout the world. What about tracking people, or tracking them in a military context?

Sorry, I had a cough. Tracking them in a military context, where you're using it for military applications of tracking vehicles and then launching ammunition, like missiles, towards the object that you're tracking. I mean, it happens anyways, but I don't know if that might be a competing interest on the funding there. I'm not sure. I know Meta isn't funding militaries, but they're funded by the military. But that is something that you could think about: okay, tracking of people, or using this to assassinate individuals with drone strikes, like autonomous drone strikes.

Dalton Anderson (29:09.86)
Not only are you not having someone drive the drone anymore, you're just identifying, okay, here's a picture of this person that I wanna assassinate. Find them, track them, and then find the most optimal place to perform this deed. It's kind of this sleeper cell drone thing from the Skyverse, or Skynet, that just takes over. I don't know, I don't know. I hope that we don't go that route, but I'm sure humans left to their own devices typically choose destruction before peace. So we'll see how it plays out.

But anyways, I'm super excited about this technology and how it's going to move the world forward. And I hope that you are too. And if you are, share a little bit in the comments about what you think. And if you found this episode interesting, of course, let me know. And next week I'll be discussing

MovieGen. MovieGen's another model. It's not publicly released, but it's in beta testing with creators. And I don't think they're gonna publicly release this model, as it has a little bit more commercial application than these other things. But they did release the research paper. So I do plan on reading that research paper and talking about it, the Meta MovieGen. From the stuff that they were doing, it seemed like black magic. So I'm interested in how much information they share in the research paper. Hopefully it's interesting and I learn a lot. I hope so.

This paper was quite dense. It's only 16 pages, but man, some of the stuff that they're talking about, the whole paragraph is just math. It's like, okay, V delta this, blah, blah, blah. And it's like, all right, let me read this again. Hopefully if I read it again, I'll understand a little bit more because

some of the stuff, or the way that they explained it in this paper, was quite complicated. But it is a complicated topic, so that makes sense. But yeah, sometimes I was having trouble with this paper. But I did have trouble with the Herd of Llamas paper. The Llama paper was super complex and it was 90 pages, so.

Dalton Anderson (31:32.696)
I just thought it was interesting how complex some of the topics are within the paper when it's only 16 pages. But it was quite concise and digestible, if you're willing to read something a couple of times and potentially look up some side information to get a better grounding in the subject matter. But of course, wherever you are in this world, I hope that you have a good day, a good morning, good afternoon, or good evening.

I hope that you'll tune in next week. Really appreciate you and everyone's continued interaction with the show. I recently got to 60 subscribers on YouTube, which is crazy that people actually watch these videos and find them interesting. Overall, really appreciate everyone that participates with the show or consistently listens.

Thank you. See you next week. Bye.

Creators and Guests

Dalton Anderson
Host