The Imperfect Echo: AI Voice Cloning and Its Current Limits
Dalton Anderson (00:01.966)
Welcome to Venture Step podcast, where we discuss entrepreneurship, industry trends, and the occasional book review. In this episode, we'll be talking about a phenomenon where you can't trust what you see and you hear and what you listen to on the internet. And that might be these artificial personalities or things that you wanted to look up. It could be just falsified AI slop.
And I hope that you listen to some AI slop earlier today. And so I'm going to pretend like the recordings that I made with the AI voice did not work. And I'm going to run through the agenda of things that we wanted to discuss today. So the agenda, AI personas, right? Real time photo editing, which is something that is really cool that came out from OpenAI and Gemini as of like last month.
And then there is an increasing trend where GPUs are becoming more advanced. There is gonna be more accessibility to power. Then there is also going to be an efficiency gain through optimizing the resources that you have. All of which would make more advanced systems.
Dalton Anderson (01:26.85)
This creates ethical questions and entrepreneurial opportunities. But I am concerned about these issues bubbling up and rupturing the fabric of trust of society where you can't trust things. You can't trust things online. What if you can't trust things that you hear? You can't trust things that you see? And we're always careful of what we read online.
Who would you verify that with? I watched the video from so-and-so who's the director or the CEO of the company and this is what he said or she said. And if you can't trust anything that you see or hear, then how do you know whether or not what you're listening to or looking at is real? And that's a head scratcher. That's a weird one. And
Hopefully this AI audio recording went well and we will just start our episode.
is a bit odd because the AI audio is going to be the first part of the episode and I'm coming in second. So let's do it.
Dalton Anderson (02:52.088)
regenerated, yeah, personas.
Hey everybody, welcome to the episode. I hope you appreciated and thought, or, hey everybody, welcome to the episode. This is me actually now. Okay. Hey everybody, welcome to the episode. This is me, Dalton. I'm your host and it's actually me this time. It's not some AI generated voice or video. It's me, the actual host of this podcast episode. I hope that you thought,
the AI generated me was thought provoking. And I know that there'd be some instances where the audio would mess up, but you got to think about it. In my recordings of my episodes, sometimes the audio messes up or things happen or I mispronounce words or I scramble a little bit and you can call those out and be like, Oh, well that's, that's when I noticed it was AI or that's when I noticed there was something up. The audio got jumbled a couple of times. Some of my tone was a little monotone.
Think about it, when I'm talking in the podcast episode, my tone is monotone. And sometimes I like to change my frequency, try to vary my tone and my pace of my voice differently throughout the episode, just so it adds a little bit of spontane-ness, spontaneous-ness to, see, I just did it, I just messed up a word. Mid episode. AI does it all the time.
Dalton Anderson (04:25.922)
And I want to emphasize that this is an episode that I have been pondering and thinking about for a while, and I just never had the time to sit down, create these AI prompts, then run them through an AI voice generator to match my voice, and then from there, add it into my episode and edit my audio and put that stuff in front. And then I always questioned what I was gonna do with the video.
Cause the video piece is what I deem is important. And one of the reasons why I do video podcasts is because one, they're more interactive. Two, it gives people the ability to verify that I am me and I've been doing this for a long time. And I guess a long time now, the grand scheme of things, podcasts, think podcasts, people have short bursts of inspiration.
and they do a couple episodes and then they stop. But I think I'm getting close to 70 episodes. But regardless of that, I do the video podcasts because once again, they're more interactive. And the second is the most important part, is I felt that the way that the technology was going, you would easily be able to create some AI podcasts, some AI persona, online persona.
have personality traits, you can have a personality, your different traits, hairstyle, skin tone, create a nice AI voice that does well with other podcast listeners. You know, in this area here are the top five people that are the best and they could be male or female, you could do two hosts, you could do co-hosts. What podcasts perform well with certain demographics?
and interests and then create and curate an AI persona and person and tone and visuals that would resonate with that audience.
Dalton Anderson (06:39.054)
that would be wrong. And if that does happen, then how do you know, how do you know it's just not people telling you what you wanna know? Or how do you know that what they're talking about is any real relevancy? And so there's all those questions that are gonna become increasingly more prominent.
as we go on in this world with this technology. And I'm not some kind of doomer, by the way. I'm not like, wow, AI is going to take over the world and we're going to be their slaves and all sorts of stuff like that. This AI piece, we're in a very early stage of the development of this technology and the technology has been developing at a very rapid rate and
Sooner or later we'll get to a recursion where the AI is improving itself and the AI is optimizing itself.
And once we get there, then AI should improve a lot faster and then we've got to build infrastructure for AI to communicate with other AIs and then there's got to be some kind of currency for them to trade jobs or hire humans or vice versa. Then you would need some kind of decentralized electronic currency that would support these transactions. All that being said is that things are rapidly changing.
I don't think I've ever seen a technology change this fast. Like from one year ago to two years ago to three years ago to four, the difference is night and day. We always had trouble getting consistent, consistent character creation in your images. You always had trouble printing out any text. You can now stylize your text. You can create consistent characters.
Dalton Anderson (08:42.23)
You can do all sorts of things without any effort. I mean, there's still some effort. Like, I'll show you some stuff that I did to prepare for this episode, and some of it took time. And I actually put a timer together, and it took me 42 minutes to generate some Ghibli-style images, because it kept messing up, and it was one, because I was being lazy with my prompts, two, it just takes forever to generate.
and it might take you like eight minutes to generate an image. So I generated three images and that's 32 minutes right there. So keep that in mind that if you're going to do these things or generate these images, it does take time, but I don't want to get distracted about the core premise of the episode. The core premise of the episode is to be discussing OpenAI's image generator, native image generation.
upgrade and I was going to make an anime Ghibli-style podcast studio setup and then I'm going to turn that podcast studio setup, the image, I'm going to turn it into hopefully a video if everything works out well, turn it into a video using the image, lip sync it to my audio and create an animated Ghibli-style
podcast episode using a screenshot or not a screenshot because I took the photo, a photo from my room using my webcam, turning that into this generated episode where the first 10 or so minutes are AI generated. The video is AI generated. Everything is just AI generated except this segment of the episode. And I am having AI generate some of the
parts of the audio and the scripts, AI generated all that script and the audio, and then I'm coming over with some human interaction at the end, but the video overlay is all gonna be animated, or it is animated, but it's gonna be all AI. And that's the plan at least. Why am I doing that? Well, I wanna emphasize the point. Two, I think it's interesting. Three,
Dalton Anderson (11:08.716)
Why not? I saw somebody else do it a couple of weeks ago and it looked pretty cool. And so it's like, okay, this is a perfect example of be careful of what you watch and listen to on the internet. It's not only what you read anymore. You have to verify these people are legit. And there's quite a few AI generated videos on TikTok with AI generated voices and they just say whatever. There's
AI models that are created and trained to make AI social media videos.
and steal people's videos and then overlay text and stories over the top of them, like some kind of subway runner game or some Minecraft jumping things or whatever it may be, a racing thing. And then there's some kind of AI story over the top of it to get views and to get engagement to generate revenue. And there's this whole shops and like that's all they do is they just make these.
AI slop videos that people watch.
So that being said, let's talk about OpenAI's recent release of their ChatGPT 4o native image generation update. And I want to start out with some examples, and then start out with my example. So I'm going to start out with some examples, and then I will show you my example. That's what I meant to say.
Dalton Anderson (12:47.96)
So let me share my screen and I'm first going to start off with, hey, if you don't know what Studio Ghibli is, then that's fine. Not everybody does, but Studio Ghibli is a very famous studio and they're known for only making just works of art. They don't make anything that isn't serious. They only work on building the best
animated movies ever. And I think probably one of the most notable ones is Howl's Moving Castle. So let me see.
That came out in 2004 and I'll show you in a second, but let's do this. Let me share my screen. So share screen, tab, Ghibli Studio. Okay, so this is what Ghibli Studio looks like. They have a very unique art style and it's this.
It's dreamlike, it's whimsical, it's light, it's fun. A lot of the movies that they create are centered around kids and teenagers, I would say. Not kids, but teenagers. They make incredible stuff and the stories are captivating and intriguing and the world building is great. But if you zoom in, if I zoom in, you look at this image.
Like there's a...
Dalton Anderson (14:33.986)
There is a very clear art direction in these images.
Dalton Anderson (14:41.678)
And if you move over to Howl's Moving Castle, I'll share this screen, share this tab instead.
you can see that it's a consistent art direction with the studio. Like it's very unique to Ghibli. And legendary, legendary movie.
Dalton Anderson (15:04.27)
I think it's about this woman who is a young woman and gets cursed by a witch and then this person who is living in the castle. I don't wanna talk about it too much, but living in the castle.
helps cure the woman who was cursed and they fall in love and all sorts of stuff. So that's Ghibli Studio. That's kind of what Ghibli looks like in the images. And then, so now I have some Ghibli stuff. So this is one thing that was pretty big on X a couple weeks ago was making memes, traditional memes into Ghibli-like memes or Ghibli-styled memes.
And so this meme is the child outside of her house that's burning down. And it's normally like a text like, this is fine or whatever. And it's a house is burning down, the firefighters are putting out the fire. There's onlookers. And then there is this child who is smirking kind of. God, it's fine. Share this tab. Maybe I didn't share the tab.
But you see the image, see the styling, it's very Ghibli. And this is a weird one. So video.
Dalton Anderson (16:36.556)
Also the video is, let me restart it, sorry. Stop sharing, share this tab instead. The video is also odd, by the way, and the sounds are odd, like disturbing.
Dalton Anderson (17:35.339)
Okay. As I said, the video is unpleasant and unsettling. I don't like that video, but it's fine because it was a good example. That being said, these are some easygoing, lighthearted videos. This person made a pig temple where there's all these pigs in a pool, giant pigs, maybe like pig gods. I don't know what's going on. And then here's this cat shrine, big cat in a massive tub.
Very Ghibli-like art direction. There's another big cat.
Dalton Anderson (18:12.536)
But yeah. And then the last one was, it turned Resident Evil characters into Ghibli-like
characters, which I think was interesting. Share this tab.
Dalton Anderson (18:38.816)
Okay, so that's what was created and then I'll show you what I made.
So here are my two images. I'll do the first one that was bad. So I did this three times. Share this tab instead.
Okay, so I originally asked for a room to feature high quality headphones, turn it into like this cozy atmosphere and give it like Ghibli-like vibes. And it gave me this, which I didn't like because it's not like my room. And then I was like, okay, I need to be more direct. So I was like, okay, like let's make a Ghibli studio style image. And then it gave me this, which is fine, but it...
I think it lacked the character of my room and where I'm at. It didn't feel like my room. It felt like I'd been transported somewhere else. And I didn't want that vibe for the viewers. I wanted an image that...
was similar to how my room was. So I got this one. This one's actually decent, right? The only thing is the mic, I forgot when I took the original image. Sorry. Sorry, this one's decent. The only thing is when I took the original image, I didn't have the mic out like I was gonna start talking, because I wasn't ready to record the episode. I just wanted to see and trial how this would go, because I hadn't done it before.
Dalton Anderson (20:17.422)
And this turned out great. It's just the only thing is the mic. The mic isn't what I want the mic to be. I want the mic to be like my mic. But this might work. I think the room and everything going on is perfect. Like the stuff in the back looks great. Everything looks great except the mic and the way that the light is kind of trickling in.
It's really nice. So let me share this one instead. So this one is really good, I think. The only problem is it's not.
horizontal like for a video. So I'm wondering if I could ask AI to change the
the image from portrait to landscape. Same thing, just landscape. And I hope that would not result in some weird generated stuff, but I would love, I would love for this image to be horizontal for my videos. So we'll see. We'll see how it goes. Okay, so that was OpenAI and the
generation of, stop sharing. That was OpenAI and the image generation piece of what you can do natively now.
Dalton Anderson (21:50.899)
I feel that.
Dalton Anderson (21:56.079)
Since now you can natively generate images, you'll easily be able to create videos using consistent characters, because you can create a couple images and then say like, turn this into a video in the future. And as I'd mentioned before, it calls into question the authenticity of the story of what you're trying to listen to. And having a system in place, either your own system or
a system in place for these social media apps to verify that you're a real person will be increasingly more important. And the AI generated audio that I talked in the beginning of the episode spoke about how the pendulum might have to switch from what is now the anonymous social media where people previously, like when I was growing up, I'm still growing, hopefully, right?
I'm not done growing. But when I was a kid, everyone wanted the social media to be their first and last name or their last name, first name, something like that. And after that point, there was a certain shift at one point where people didn't want to be monitored. People didn't want to know what other people were doing. And so they started making Finstas and Finstas were where you'd post your real stuff. Like you'd have your professional and then you'd have your Finsta.
I always thought it was too much work to have two different social medias, but whatever. So you'd have your Finsta where you'd publish your not so good photos, you out drinking with your friends or you doing whatever, raunchy photos, yada yada.
After that point, there was a shift with the next generation where they're Finsta only, like where their Instagram is anonymous by default, where it's not their name, it's just like some random like sky, skyfaller, green hat baller, is their user. I was actually pretty good on the fly, by the way.
Dalton Anderson (24:11.384)
They have this anonymous username and they just publish whatever memes, yada. They don't have to worry about what they're commenting or anything like that. And also with TikTok, it makes it a little bit odd or it makes it where people are monitoring in a higher frequency the things you're commenting, because if you comment on a post, it gets suggested to somebody else.
So if you're commenting on like a social issue or any weird things that you see on social media, whatever you're doing in your darkest days on your phone, I don't know. But whatever comments you make, they're typically.
Dalton Anderson (24:56.096)
up ranked to your friends and if your friends are your friends and family and your coworkers or something like that and you're commenting on weird stuff, it doesn't look good. And so I'm not saying that's by default what people are doing. What I'm saying is there's an increased visibility on your activity on social media, simply because the algorithm integrates everything that you do with other folks. And so this
Connectivity to everyone makes it so people are cautious of what they're saying and what they're doing. And if that's the case, then it's easier instead of just making sure that you don't make any mistakes, just to make yourself anonymous. But now we're in the realm of AI, and AI is becoming more and more advanced. And that means that if everyone's anonymous, then you can't tell who's wearing the mask.
of the human and who is the human. And so I think the pendulum swims, swings backwards to verification of your identity and X has done that in some fashion. Like if you wanted to do premium and be able to do these things and run ads and make certain posts and have increased visibility, you need to send in your ID and they'll take a photo of it and verify it with some kind of third party. I'm not sure how the whole process works, but they verify your identity.
Facebook has something like that as like a beta for their premium program, if you want the blue check or not. But I don't know how it would work. Maybe you have two feeds, like one is your verified feed with only verified people. And when I say verified, it doesn't mean that you have to pay premium or something like that. It's more as maybe you pay a $2 fee to verify your identity and then once you're verified,
then you then are on this verified feed. So the algorithm would only suggest you verified posts and things. I think that's an easier problem to figure out. I think the biggest problem is how do you know what you're reading and what you, or sorry, let me backtrack. What you're reading you always questioned before.
Dalton Anderson (27:18.668)
You'd always get the question, okay, did you verify that with those three sources or a couple other sources? Like, how do you know what you're saying is true? There's that piece. And increasingly in the world, people have just been blatantly lying. They are just straight up lying. And like it's easily verifiable and they'll go on video and they'll just lie or they'll publish something on social media that is a straight-up lie and false. And so I think that there has been
a weird acceptance of people who've just been lying. And there's a difference from misspeaking or maybe wrongly interpreting or explaining your point, but not in a malicious way. And then the difference is like if you're maliciously stating falsehoods to
sway people's opinions or their outlook on things, that's a problem. And it's been increasingly prevalent on social media where people just straight up lie and they'll be famous or they'll be political people or whatever the issue is or who the person is or the public figure. They'll just say whatever and
and then people are just cool with it. No one really is like, it's whatever. And so that's what you read on the internet, but what if that...
transfers to video and audio. And so then you're questioning everything that you watch, everything you hear. And then how do you know, like how would you then know like what is real and what is not? Like, I would question everything. And I published something, where's my phone? I put something on X today and I said, let's see what I said, the exact words.
Dalton Anderson (29:26.904)
I said something like, while this loads: the statement "you can't trust the internet" is becoming even more true today. You can't trust things you hear or see to be real. Maybe we were all schizophrenic from the start.
Because you don't know what is going on. You have no idea what is going on when things are gonna escalate in a couple years where AI gets more advanced.
how would you know what is real and what is not? And maybe we're just listening all from the start. I don't know. But to get out of the doomer phase, the promising piece is for videos and for audio and images, since those are created by, audio's pretty easy, because an audio file has
like these frequency waves within the file. So you could encode a synthetic key, an AI synthetic ID to the audio. So you could easily verify whether or not that key was either disrupted or if that key was altered or if it has a key.
With images, you can do the same thing. You can, because a pixel is just numbers, so you can encode your own little synthetic key, and Google's been doing this for a while for the AI images that they create. They have a digital synthetic key embedded within the photo, so wherever you put the photo, Google's able to verify that it's an AI-generated photo.
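To make the "a pixel is just numbers" point concrete, here is a minimal toy sketch in Python. It hides a short key in the least significant bit of a few pixel values and then checks for it. This is an illustration only, not how Google or any other vendor actually watermarks images; production systems use far more robust schemes. The function names (embed_key, extract_key, has_key) and the sample pixel values are hypothetical.

```python
def embed_key(pixels, key_bits):
    """Return a copy of pixels with key_bits written into the lowest bit
    of the first len(key_bits) values (toy least-significant-bit scheme)."""
    if len(key_bits) > len(pixels):
        raise ValueError("image too small to hold the key")
    marked = list(pixels)
    for i, bit in enumerate(key_bits):
        # Clear the lowest bit, then set it to the key bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_key(pixels, n_bits):
    """Read the lowest bit of the first n_bits pixel values."""
    return [p & 1 for p in pixels[:n_bits]]


def has_key(pixels, key_bits):
    """True if the embedded bits match the expected key exactly."""
    return extract_key(pixels, len(key_bits)) == key_bits


# Hypothetical 8-pixel grayscale "image" (values 0-255) and a 4-bit key.
pixels = [200, 13, 77, 254, 9, 128, 64, 31]
key = [1, 0, 1, 1]

marked = embed_key(pixels, key)
print(has_key(marked, key))    # True: the key is present
print(has_key(pixels, key))    # False: the original carries no key

# Flipping a single low bit is enough to break this toy key.
tampered = list(marked)
tampered[1] ^= 1
print(has_key(tampered, key))  # False
```

Each marked pixel differs from the original by at most 1, which is why a mark like this is invisible to the eye. It is also why this naive version is fragile: resizing or recompressing the image would wipe it out, which real watermarking schemes are designed to survive.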
Dalton Anderson (31:19.502)
Where it gets a little wonky is text. And I talked to a friend named Fabriz and he had completed some research on this, but overall it's a very complicated problem because text is not an image. It's not built by these numbers. I mean, an individual character has numbers associated with it, but if you made any edits
to the text or you proofread it and fixed some grammar or you made any changes, then the key doesn't work. So the next step is figuring out, okay, so now that we're able to detect images and videos and audio, and these things need to get built. Audio and video, I'm not sure that there is a key in the works yet.
But I do know for images there is kind of an industry standard of, this is how we're gonna do it. Meta has adopted something similar, so has OpenAI. So that's squared away. But the text piece, I don't know if there's an easy solution for that. But people are already careful what they read online. All that's being said is this episode was really to talk about this
Potential authenticity crisis that is coming amongst us sooner than later. And this is an episode I've been thinking about doing for a while and I thought it'd be cool and interesting to create this. And the fact that now you can generate an image and do so natively and then turn that image into a video and then lip sync your video that you're displaying from this podcast episode.
to the...
Dalton Anderson (33:18.494)
image and it creates a video from the image that you gave it, it's pretty cool that you could take one image, a video, overlay both of them together, create this AI generated audio and video.
Pretty cool stuff, I would say.
Dalton Anderson (33:38.799)
I will encourage you to be cautious on what you're viewing online. If you're picking up a new podcast, make sure that they have a video or make sure that they've been doing it a while. And if they don't have a video and they haven't been doing a while, try to verify or look up the person and verify that person's real. you, I'm really, I'm just like.
Being cautious about the stuff that you're listening to and watching. There's a lot of AI videos on YouTube now. I go on Facebook. There's so much interaction towards AI videos and AI images that people presumably have no idea is AI. And so I don't want you guys to be that. And I don't think you are, but it's not the things that are blatantly AI. It's the things that you can't tell are AI that.
is the worst. Things are becoming more advanced and there is increasingly
opportunities that are being discussed on doing unethical things for these type of
or using these type of technologies to do it on unethical things. Like, for example, someone had sent me a video about using women on social media, like say, TikTok, using their face and then finding a woman who has a good physique, combining the two together and make an AI model that
Dalton Anderson (35:24.82)
you would create videos with and stitch on the person's face and their body into a video and then use that to create an OnlyFans sales funnel where you use social media to create this AI model that's using people's real face.
Then using the funnel to get people to subscribe to their OnlyFans, which is all AI generated. And that stuff is horrible. They use somebody's real face and just a complete disregard of other people. It's horrific. And I don't know how all this stuff's gonna play out. We're always gonna be in this whack-a-mole situation.
But I get concerned when people are talking about, yeah, I'm making so much money doing this. Like you just got to do it. And I'm like looking around and people are nodding their head. Like I wasn't, you know, I'm not there. I just watched the video just because somebody sent it to me. It's like, wow, this could be a topic for your podcast.
And I didn't want to talk about it for a while because it's like, why even put that energy out in the world? But then when it's easy to do these things or becoming increasingly easier to do so, then I was like, okay, this is probably going to have to be discussed. And so when people are bragging about how much money they're making online by doing these unethical things that are just blatantly wrong,
Dalton Anderson (37:13.434)
It's going to be tough. We're going to be in this whack-a-mole situation, as I mentioned, for a while. You're always going to have the technology outpacing the detection. But I haven't seen anything from TikTok or from Meta or even YouTube about trying to create a system to make sure that people's personas or their identity isn't being used in nefarious ways either
by someone who's trying to make a buck or someone who's trying to discredit you. And so I hope that doesn't happen. My mom and I have a...
secret password, and my Nana too, a secret password that we never sent via text. We didn't discuss it in a room with electronics. And so this password will be used to verify whether or not you're really in trouble if someone calls or if something happened while I was traveling and I was like, hey, I really need help. Send me 5,000 right now.
You'd be asked to verify.
Dalton Anderson (38:27.014)
And for myself, I was telling them like, hey, someone could easily create my voice. I have all these videos of myself on the internet. It wouldn't be that hard. I don't think we're a target right now, but it couldn't hurt to put this in place. And it was originally my mom's idea, but I had validated that idea with these AI-generated voices that I created like 10 months ago and some other stuff and then also this episode. So that's probably my takeaway for you.
is be cautious about what you're listening to and watching on the internet and then create a secret password with your family and don't send it via text. And if they don't know the password, then they will have to figure it out on their own or they'll have to meet you in person. Once again, I really appreciate everything that you guys are doing and supporting me and watching these episodes or listening to them.
I really do appreciate it. I'm always surprised that people listen to these episodes and I'm just thankful that you guys are still around and consistently checking in every couple of weeks or every week. But of course, wherever you are in this world, have a great day, a good afternoon, a good evening. Thanks for listening and I can't wait for you to listen in next week. Goodbye.