From Pixels to Perception: How Sparsh is Changing Touch


Dalton Anderson (00:01.346)
Welcome to Venture Step Podcast, where we discuss entrepreneurship, industry trends, and the occasional book review. Quite a few technologies have completely altered the way that we live and communicate with each other. The easiest of these technologies to talk about would be computers. Computers originally were the size of a car, and then we optimized the cost and the size of the technology required to

execute tasks, and that computer that was the size of a car is now within your home. And then the next step was phones. Similar things are happening with robotics, and it's pretty interesting. This episode is about a paper that Meta came out with from FAIR, where they discuss digitizing touch. The paper is Sparsh: Self-Supervised Touch Representations for Vision-Based

Tactile Sensing. And that's what we're going to be discussing today. Robots have vision, and they're integrating these LLMs for voice interactions, or audio inputs, that can be

encoded to understand that audio, turn it into text, process it, and then respond back with a voice. But one thing that we're missing is the generalization of touch. There have been quite a few proprietary or custom-built models for

touch, but they have been task-based. What that means is we would train a robot to do one thing very well, but if you tried to have it do anything else, it wouldn't work. And Meta saw a couple of issues with how we're going about touch and this application. One of the crucial things is

Dalton Anderson (02:14.434)
changing the way that the data is processed and what kind of data we can use. So instead of using supervised learning, they use self-supervised learning. I'll explain the differences later in the episode. Then they talked about how everyone was building custom-made sensors and there was no standardization of benchmarks. What benchmarks are we collecting? How do we collect them? How do we store them? And then another piece was, okay, well, it's kind of difficult to...

integrate all these sensors with these benchmarks. So they try to create an ecosystem, which is what Sparsh is. And today we're going to be discussing the research paper and some of the findings. Once again, for these types of episodes, I am not a robotics researcher or AI researcher. If you're very interested in the stuff I talk about, I encourage you to go to the paper and

read it. It's free. It's online. The paper will be linked in the show notes and you can read it on your own time. It's 32 pages. About 12 or 15 of them are the Sparsh paper itself. And then the other 15 or so pages are references and support for their research: work that they've been building on, or other people's work that was used in their findings.

Okay. So that being said, we're going to talk about the limitations of current touch technology, what Sparsh tries to solve, the difference between self-supervised and supervised learning, some benchmarks that they created, and the implications of Sparsh in general. Okay. But before we dive in, of course, I'm your host, Dalton Anderson.

My background's a bit of a mix of programming, data science, and insurance. Offline, you can find me running, building my side business, or lost in a good book. You can listen to this podcast in both video and audio format on YouTube. If audio is more your thing, you can find the podcast on Apple Podcasts, Spotify, or wherever else you get your podcasts.

Dalton Anderson (04:33.248)
Okay. So the first section is the challenges of robot touch, I would say. I talked about it briefly in the intro of this episode, but in a general sense, there was an issue of being able to share data with others. There's an issue with how the models were being trained, the cost of the data, and the limitation of

one, access to the data, and two, the expense of obtaining labeled data. And then the other thing was understanding these

Dalton Anderson (05:17.25)
metrics that are being applied, right? It was a little confusing at first because, once again, I am not a robotics researcher, so I had to look these things up. Some of the metrics they use are pose, the shape and size of an object, the force applied to alter the pose of the object, the texture of the object, and something called slip. And all of those things need to be understood digitally before

you can actually apply them in the real world. So all this stuff is being done in the virtual world. And what I was finding confusing was that you have an image or a video, and that video or image would have this data overlaid on top of it to help predict the slippage. Or, the way that you're clasping the object virtually, it might

result in some slippage, and do you have the reaction time, or are your hands placed in the right way digitally, to react to slippage? If you're running at 60 frames a second and you are going off the normal reaction time of a human, you have about 80 milliseconds to react. So there are all these weird things going on, and I find it quite intriguing because it's similar to

how you would interact in the real world as a human in your day-to-day, but you don't necessarily realize all the nuances when it comes to touch and how your fingers can sense the texture of the object. You kind of just know that when I pick this up and grip it this way, with this certain texture, the angle at which the object is shaped, and its supposed weight given the

texture and what the material looks like, I know that I can pick this up with two fingers and lightly grasp it, or that if I really press on it, it's going to break, or these other things that are second nature. It's difficult to make that happen, one, virtually with this robot, and then also to consistently perform these tasks

Dalton Anderson (07:44.364)
without error, which is something else. So, that being said, all this stuff is related to how humans interact in their day-to-day world. One of the key issues with what these programs were doing before was that they were making their own custom models and training them on supervised training data. Supervised training data is something that you would use in a regression or classification model.

And basically that means that you have a data set. I think the easiest example is: is this a dog or a cat? You're not specifying the type of dog; we're not doing English mastiff versus poodle. We're just saying, is this a dog or is it a cat? Right? So you would just give your model, here are my images, my training set.

And then you could partition it in different ways. You might do five partitions and split it X, Y, Z, and then have your test, your validation, your training sets, blah, blah, blah. But for this instance, we're going to keep it really simple and just talk about the actual training data and how you go about training. So in supervised learning, you just have the data set. And then the data set has

this metadata associated with each image. In this case, you have an image of a cat, a black cat, and then

associated with it would be this additional data, like, this is a black cat. So then the model would know, okay, this is a black cat. That's what a black cat looks like. And then you just repeat that many times; you might have 10,000 images of black cats. After that, you have the same thing for white cats, and then some for spotted cats, and then you have these dogs: a white dog, a black dog, a brown dog. So when you train your model, it will know that this is a dog.
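The labeled-data setup described here can be sketched in a few lines. This is only a toy under stated assumptions: the two-number "features" (ear pointiness, snout length) and the nearest-centroid classifier are hypothetical stand-ins for real images and a real model, chosen to keep the example small.

```python
import math

def train_centroids(examples):
    """'Training': average the feature vectors that share a label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the known label whose centroid is closest to the input."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

# Hypothetical labeled data: (ear pointiness, snout length) -> label
training_set = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.3, 0.8), "dog"),
    ((0.2, 0.9), "dog"),
]
centroids = train_centroids(training_set)

print(predict(centroids, (0.85, 0.25)))  # an input that resembles the labeled cats

# A made-up "bird" is still forced into one of the labels seen in training.
bird = (0.5, 0.1)
print(predict(centroids, bird))
```

Whatever goes in, the answer can only ever be one of the labels that appeared in the training set, and every training example needed a human-provided label.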

Dalton Anderson (09:49.794)
Or this is a cat. Okay. But then what if you try to give the model a purple dog, or some kind of cat or dog that it wasn't trained on? Well, then it doesn't work. Or what if you didn't want to train just for dogs, but you wanted to do dogs, cats, and maybe birds? Well, it wouldn't know what a bird was because it has never seen a bird before.

One, you have to have everything labeled and classified, which is very time-consuming and expensive. Thus there isn't that much data out there. And if there is data, it's too costly to allow other people to use it, so it's normally closed source. So what Meta is doing is they made their own data set of about 475,000

tactile images that other people can train their models on. And then they changed the way that they went about it: instead of using supervised learning, they use self-supervised learning. So the model has to understand what the actual image is and what the relationship of the image

is to how it interacts with the robot and touch: the different textures, the shape, the force applied in these virtual images and videos, how that affects the slippage, and the different poses of the object. So understanding all those little nuances, like how you would pick up an object in real life, is what they're trying to do digitally,

Dalton Anderson (11:45.74)
which is, one, difficult and, two, pretty cool. But it's way different from how typical approaches are done. And then we have this data set that people can train their models on, and they also came out with these models. The general idea is, one, make it easier for people to play around with

Digitalizing touch. Like get it into as many people's hands as possible. You never know what's gonna happen with a curious mind.

and then release a data set for the group, the cohort of robotics experts, engineers, researchers, all those people. And then the last thing is to generalize these tasks. Instead of task-specific training, transform this into a generalized training set. So there is no, I'm training to know if it's blue or black, or if this is a blueberry or

whatever; it would just understand. The idea is to understand your world and how to interact with objects in it. Instead of training for one object or a couple of objects in a specified task, train for everything. And then once you have a ground-truth understanding of your interactions and where you are in this 3D space, you should be able to understand how to do your task, which I find pretty interesting.
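One common way to turn "train for everything, then do your task" into software is a single shared encoder feeding several small task-specific heads. The sketch below is only an illustration of that pattern: `frozen_encoder`, `force_head`, `contact_head`, and every number in them are made up for this example and are not taken from the paper.

```python
def frozen_encoder(tactile_image):
    """Stand-in for a pretrained touch encoder; nothing here is ever retrained."""
    pixels = [p for row in tactile_image for p in row]
    mean = sum(pixels) / len(pixels)
    contrast = max(pixels) - min(pixels)
    return (mean, contrast)

# Small task-specific heads that all read the same general-purpose features.
def force_head(features):
    mean, _ = features
    return 10.0 * mean  # pretend force estimate, arbitrary units

def contact_head(features):
    _, contrast = features
    return contrast > 0.2  # pretend "is something pressing on the gel?"

image = [[0.0, 0.1], [0.6, 0.9]]  # hypothetical 2x2 tactile image
features = frozen_encoder(image)  # computed once, reused by every task
outputs = {"force": force_head(features), "contact": contact_head(features)}
print(outputs)
```

Adding a new task in this pattern means adding another small head, not collecting a new data set and training a whole new model.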

But yeah, in a general sense, we had task-specific models, and we're moving to a general-purpose model. That's what a lot of people are going towards. Researchers at Meta, OpenAI, Anthropic, or Google are pushing out a lot of papers on general

Dalton Anderson (13:48.494)
AI knowledge versus it being a task or something else.

Dalton Anderson (13:57.55)
But yeah, that's what they're trying to do. Are they getting it done? I think the research looks promising, both from the results they have and from the general reaction of AI and robotics researchers on social media; they seem very excited about it. In a general sense, it seems to be solving some problems that are crucial to moving robotics forward.

Dalton Anderson (14:29.158)
This would just make robots more versatile, adaptable, and efficient. There was a blatant inefficiency in private companies continuously gathering data, training, and creating their own custom models. Now we have this massive open public data set, and it's open for researchers to

publish whatever they need to. And it also would create a more versatile robot. What if the situation changes, or what if the color of the objects that you were trained on changes with different manufacturers, or what if the environment that you're placed in as a robot is slightly different from the virtual world? You've got to be able to adapt, and you've got to be able to think on your feet, robot or not. Everyone's got to think on their feet. So.

I think it's pretty interesting and I enjoyed this paper, but one thing it didn't explain was the visualization of touch, that is, the vision-based tactile sensors, which are quite interesting. They come with markers and without markers, and the ones with markers have these little dots on

the gel; it's like a squishy gel thing. And there are little dots on the, I would call them nubs, let's call them gel nubs, because it's like the first nub of your finger. Little gel nubs, little robot gel nubs. And there are little markers on them, like little dots. And those dots are pushed when you touch something. Think about the tip of your finger:

when you touch something, the skin on the tip of your finger kind of changes a bit. You can see how it slides or presses against the object that you're holding. Same thing with these little gels. The gels with the markers understand, okay, one of the points on my gel moved from this position to another position, and there's some light reflection and

Dalton Anderson (16:52.674)
these other things going on. So, okay, this was the amount of touch. This was the amount of slippage. This was the amount of force applied before slippage happened. And then when I released my grip is when slippage started. So it's able to mathematically understand the interactions between the robot gel nubs, the robot's grip, and the object it's touching. It's pretty cool. I think you would explain it as a little gel nub that's squishy: when

you touch it, the gel deforms. And when the gel deforms, measurements are taken. From there, that information is processed, encoded, and then stored. And later on, analysis is done on how it worked: what worked and what didn't. Okay. So that was that. We talked about the new approach to tactile sensing, the self-supervised learning,

and why it's important. It's an important approach. And it allows

you to not have all the answers. It's described as a teacher-student situation. There are two parallel models: one, the teacher, has some of the answers, and the other, the student, doesn't. The teacher is teaching the student, and the student has to use the teachings from the teacher, or the professor, to get a result similar to what the teacher would have gotten.
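The episode doesn't go into how the teacher gets its answers. In self-distillation methods such as DINO, a common setup is that the teacher is not trained directly; its weights are a slow-moving average (an exponential moving average) of the student's weights. A minimal sketch of just that update, with plain lists standing in for network weights and the momentum value chosen arbitrarily for illustration:

```python
def ema_update(teacher_weights, student_weights, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_weights, student_weights)
    ]

teacher = [0.0, 0.0]
student = [1.0, -1.0]  # pretend the student has just been updated by training

for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
    print(step, teacher)
```

The teacher drifts toward the student without ever jumping to it, which is what makes it a stable target for the student to imitate.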

Dalton Anderson (18:34.198)
And it's good at things like predicting missing parts of the image, understanding images that aren't labeled, and grouping similar images. Like, all right, I don't know what these things are, right? But these are all animals with four legs, they're similar sizes, and the hair texture is similar. I think that

these all go together, and you group those. Okay. And then I've got these metal-looking objects with four round circles at the bottom. They're all different colors and different shapes, but they have similar texture and look the same-ish; the outline of them looks the same. And then eventually you find out that those are dogs and those are cars.
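The grouping idea can be illustrated without any labels at all. This is a toy sketch, not how a self-supervised model works internally: the three-number feature vectors and the distance threshold are invented for the example, and the grouping rule is a deliberately simple greedy one.

```python
import math

def group_by_similarity(features, threshold):
    """Greedy grouping: join the first group whose first member is close enough."""
    groups = []  # each group is a list of item indices
    for i, item in enumerate(features):
        for group in groups:
            if math.dist(item, features[group[0]]) <= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # nothing similar yet, start a new group
    return groups

# Hypothetical features: (legs or wheels, fur-like texture, metal-like texture)
features = [
    (4, 0.9, 0.0),  # furry, four legs
    (4, 0.8, 0.1),
    (4, 0.1, 0.9),  # metallic, four round things at the bottom
    (4, 0.0, 1.0),
]
groups = group_by_similarity(features, threshold=0.5)
print(groups)
```

Nothing in the code knows the words "dog" or "car"; the names can be attached to the groups afterwards.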

And so it's able to understand better than supervised learning, because self-supervised learning is kind of how we would go about learning as humans. And that's one of the reasons why I find this stuff intriguing.

Dalton Anderson (19:59.608)
These papers are a reflection of oneself. These things are trying to imitate humans, and that makes you think about yourself and your interactions as a human, and how everything could be mathematically measured, like emotions, touch,

you know, knowledge. I don't know. I just find it interesting. But I don't want to go on a tangent. I want to stay focused because this might get long.

Dalton Anderson (20:34.542)
They have self-supervised learning instead of supervised learning. And from there, they were able to experiment with different approaches. They had self-supervised learning, and then they had self-supervised learning with masked image modeling. And what is that? Masked image modeling is what it sounds like: you hide parts of the image.

So maybe you hide three quarters of an image, and the model has to figure out what the image is. If you have three fourths of it covered and it's some kind of berry-looking thing, then you predict, this is a blueberry, even though I can only see a quarter of it. You have to figure it out from the texture and the shape of it and how it looks and blah, blah. Oh yeah, this is a blueberry.
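The hiding step itself is simple to sketch. Masked image modeling typically splits an image into a grid of patches and hides a random subset of them; the example below assumes a 4x4 grid (16 patches) and the three-quarters ratio mentioned here. The function name and the grid size are illustrative choices, not details from the paper.

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return one boolean per patch: True means the patch is hidden from the model."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    hidden = set(rng.sample(range(num_patches), num_masked))
    return [i in hidden for i in range(num_patches)]

mask = random_patch_mask(num_patches=16, mask_ratio=0.75)
visible = [i for i, is_hidden in enumerate(mask) if not is_hidden]

# Show the 4x4 grid: '#' is hidden, '.' is what the model is allowed to see.
for row in range(4):
    print("".join("#" if mask[row * 4 + col] else "." for col in range(4)))
print("visible patches:", visible)
```

The model only receives the visible patches and is trained to predict what belongs in the hidden ones, so the image itself supplies the "answer" and no human label is needed.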

And then there's a combination of approaches. They came out with three models, and they all have different strengths and weaknesses, but for the most part, they are...

They're pretty close in certain areas, and then in other edge cases one is better than the other. They have Sparsh (MAE), Sparsh (DINO), and Sparsh (IJEPA). And yeah, they have different strengths and weaknesses, and

Dalton Anderson (22:13.39)
The general sense is that

Dalton Anderson (22:18.272)
each of them is a tool, and it's not necessarily that one is better than the others for everything. It's the same kind of thing as, what programming language is the best? People get really caught up with, is it Python or R or Java or Go or Kotlin or Julia, whatever the language is, which one's the best? And people get really caught up on it, especially in college. They're like, yeah, Python's the best. You can do everything, but

you're not the best at anything, but you're probably second or third. So Python's the best general-purpose language, and other people are like, well, C++ or C# or whatever it is. They all have their own little nuances of what they're really good at and what they're not so good at. You could potentially do anything in any language, but it doesn't always make sense. Kind of the same thing here. Not as extreme, but close enough.

And then they made this other thing called TacBench, a standardized benchmark for evaluating touch representations. The things that they standardized were slip detection, pose estimation, grasp stability, texture recognition, and force estimation. So when you see an object, how much do I have to press on it before it slips? How can the model properly detect

objects slipping digitally? Pose estimation: understanding where this object is in 3D space, getting that right, and understanding the distance between you and the object. Grasp stabilization, or grasp stability. Texture recognition: understanding the texture of the object and how that might affect, one, how

much force is required and, two, how that will affect your grip, which would affect your slip. And then,

Dalton Anderson (24:25.102)
the bead maze problem, which I had no idea about. I was looking it up on the internet and I was like, what is this? I'd never heard of this before. A bead maze is a children's toy that's used for hand-eye coordination. Basically it's a wire toy that has beads on it, and you try to get the bead through the maze. They do the same thing with the robot, which sounds like torture because

they don't train it on what the right answer for the bead maze is. They trained it on different images of the bead maze with the bead in different places on the maze, but they never trained it for the answer. And so it just sounds brutal, where the robot has to figure it out, because it can't really see, right? Like,

you've got to think about it. It's like a blind man trying to go through a maze with only his hand. He's in the dark, can't see, only has his hand. And so he can feel that, one, I am pressing on the bead, I'm holding it, and I'm moving with little resistance. Okay. So this might be the right answer. And then you keep moving, and then there's resistance that comes up.

Oh, that's the wrong answer. Okay. Let me go back to where I started. Okay. Let me try this way. Oh, that one doesn't have resistance. All right. And then it keeps doing that over and over and over. It just sounds like torture to me. And then there was this other thing that they were talking about, which was video tubing, or tubing. Tubing in video modeling is a technique of hiding

not an image, but video: hiding like 30 to 40 seconds of a video at random, cutting it all out of the video, and having the model predict what happened. That sounds brutal as well. Some of this stuff just sounds straight up like torture. This bead maze thing just sounds excruciating. I would never want to do that. I couldn't imagine being blindfolded and having

Dalton Anderson (26:47.416)
to navigate a complicated maze with only my hand. Not cool, Meta, not cool.

Dalton Anderson (26:58.702)
So then, yeah, how does Sparsh understand these tactile images? Well, remember, it has a standardized benchmark. It was trained on tactile data. And so once it's in action, it's basically just understanding and visualizing the force fields, and that includes this kind of weird-ish.

They had two comparisons. One was the previous method, how other companies or methodologies would go about it: when you touched an object, it would come back with a whole bunch of metrics and numbers. The way that Meta did it was they have a kind of pixelized, not image, but like a pixelized grid map. And then

to understand the force and the force manipulation that was required, it would have a heat map of where the object was touched and how hard: how much force was put on the object with the grip, where it was touched, and how it was touched. So then you can kind of measure, okay, you grabbed the object here, you slipped, or you didn't grab it hard enough, and you can understand what was going on and what was going wrong. Their method makes a lot more sense than the other method. The other method was overly complicated, and you couldn't see where it was being touched

Dalton Anderson (28:40.3)
and how much force. I mean, you could, but you couldn't visualize it. A visualization is always easier to understand when you're trying to translate localized data, meaning what the robot was touching, put it back onto a computer, and understand what happened previously when you weren't even there. So.

In a general sense, you're utilizing, one, the generalization of touch. You are standardizing the benchmarks required to understand what touch is. And then you're visualizing how your grip interacted with the object. So it's understanding what the right amount of force is: preventing damage, but also not allowing slippage.
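To make the heat map idea concrete, here is a small text-only sketch. The grid values are invented for illustration (they are not measurements from the paper), and the character shading is just one simple way to show where the contact was and how hard it was.

```python
# Hypothetical per-cell force readings (arbitrary units) over a 4x5 sensor grid.
force_grid = [
    [0.0, 0.1, 0.2, 0.1, 0.0],
    [0.1, 0.6, 1.4, 0.5, 0.0],
    [0.0, 0.8, 2.0, 0.7, 0.1],
    [0.0, 0.1, 0.3, 0.1, 0.0],
]

SHADES = " .:*#"  # from no contact to the hardest contact

def render_heat_map(grid):
    """Turn each reading into a shade character, scaled to the largest reading."""
    peak = max(max(row) for row in grid)
    lines = []
    for row in grid:
        line = ""
        for value in row:
            level = 0 if peak == 0 else int(value / peak * (len(SHADES) - 1))
            line += SHADES[level]
        lines.append(line)
    return lines

def peak_location(grid):
    """Row and column of the cell that was pressed hardest."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return max(cells, key=lambda rc: grid[rc[0]][rc[1]])

for line in render_heat_map(force_grid):
    print(line)
print("hardest contact at (row, col):", peak_location(force_grid))
```

A picture like this answers "where was it touched and how hard" at a glance, which is the advantage over a long list of raw numbers.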

How do you manipulate objects without crushing them? How do you perform tasks that require precise control, like assembly or surgery? That's where this bead maze problem comes in, which sounded not so fun, but I guess it's important for the robot. I don't want to be doing it. So has this stuff worked? Yes, it has. And if you listened to the last episode,

I talked about how it was like 93% better than the other methods. And so it's been outperforming traditional

force estimation models.

Dalton Anderson (30:19.798)
And that is a factor of, one, standardizing the data; two, using a good generalized approach with self-supervised learning, which is different from supervised learning; and three, understanding and visualizing these touches and how this works.

Dalton Anderson (30:49.72)
I think that gives you a general sense of how this works. I could go into more detail about pose, slippage, or these other things, but I think that you listeners have a good grasp of what's going on. I think I provided enough information for this podcast episode to be useful. I don't necessarily want to ramble. I want to make sure that it is concise, to the point, and gives you a good understanding. And if you want to go a little deeper,

then you can, of course, with the paper. I encourage everyone who's interested to read the paper. I think it's a well-done paper. I still think that the best paper I've read this year is Herding the Llamas. It always sounds like hurting, like I'm hurting the llamas, but I'm herding the llamas. That's the best paper I've read this year by far, by a long shot. So if you haven't read a paper and you have

a yearning to read one, I would read that one. It is quite dense. It is almost a hundred pages; I think it's 95 pages. Give it a go. So fun. Loved it. Loved every minute of it. This paper is also quite good. I became very energized. Last week I had COVID, and I was struggling today. I didn't have that much energy. I'm still behind on the podcast, where I'm recording an episode just before it's supposed to come out.

And I had just worked late and didn't necessarily want to do this podcast episode, but I read the paper and I was just so energized by the knowledge, intrigued, and overall fascinated with the information, how they went about solving the problems that they're researching, and what they're trying to provide to the free market. Overall, I thought the paper was great and they explained everything.

And then I didn't go through it in this episode, but maybe I could in a future episode. I didn't deem it worth talking about, but at the back end of the paper, they talked about the different modeling techniques, which models they used and how they went about it, the splits of their data, and how the model performed on different types of data and when they reduced the amount of data they had.

Dalton Anderson (33:17.612)
I thought the difference in performance was quite interesting when you trained on

one third of the data, half the data, or a tenth of the data. It seems like in certain scenarios you could get away with training on only a tenth of the data; it's only like a half a percent, 0.5%, of a difference. And I was like, yeah, that's decent. That's grand. So there are some cool snippets that I didn't necessarily get to talk about in this episode. I didn't want it to be a two-hour episode of me just talking very technically. But once again, I encourage you to read the paper.

And if you thought this podcast was interesting, leave me a comment and let me know what you think, what you found interesting, or if you have any additional insights, if you've worked in this space. Once again, I'm not a robotics researcher. I do not work in computer vision. I work in insurance, but I did find this paper quite interesting. And overall, I encourage you to interact with the podcast episode or look into

the research paper provided in the show notes. Next week we're going to be discussing the next research paper, which I think is going to be CoTracker, or CoTracker3, or maybe another one. I don't know. I have been taking your input very seriously, and one of the things that I've gotten, I wouldn't say roasted about, but that people have talked about, was

make sure, and this is from people like my mom and others, make sure that you look into the camera and make it seem like we're having a conversation, which I've been working on. It's more difficult. I've had a bad habit since I work remote and I'm not always on camera. I typically try to be for big meetings, but I'm not always on camera. So...

Dalton Anderson (35:18.368)
I just had a habit of looking around when I was talking and looking at various things. It's not that I wasn't paying attention; it was just a habit of moving around. So I've been very focused on talking to the camera, and I hope that, one, you appreciate these changes and, two, you feel that I'm listening to you. Although I am not some large YouTuber or content creator, I do read the comments and I respond to everyone's comment,

unless you're a bot, and if you are a bot, then I will just respond with some nonsense back, as you're sending me nonsense. But otherwise I will respond and engage, and I'll listen to you. Some comments aren't that nice. Some comments are nice. It's fine. You put yourself out there on the internet, you're going to attract all sorts of people. It's okay. So that being said, I wish you a good day, a good afternoon, good evening, good morning, wherever you are in this world.

Thank you for listening and I hope that you tune in next week. See ya. Bye.
