Hey everyone, welcome back to the Elon Musk Podcast. I'm thrilled to share some exciting news with you: over the next two weeks, we're evolving. We'll be broadening our focus to cover all the tech titans shaping our world, and with that, our show will become Stage 0. You'll still get the latest insights on Elon Musk, plus so much more. So stay tuned for our official relaunch as Stage 0, coming soon. I'd like to thank the sponsor of this show, Fish dot Audio, a
great service. So if you're a podcaster, a voiceover artist, or a content creator like myself, and you're looking for hyper-realistic AI-generated voices, you need to check out Fish Audio. Ranked among the top two most realistic voice generators in 13 languages, Fish Audio lets you create high-quality voiceovers with just a few clicks.
Now, I'm going to be honest with you: sometimes I make mistakes during the show and I need to replace some words because I misspoke, or there was some noise in the background I just can't get rid of. So sometimes I use an AI-generated version of me to fill in those gaps. I tried Fish Audio for this and it worked so well, so much better than the others I've tried. It's open-source friendly and it's super fast, even faster than ElevenLabs, if you're
familiar with that. And best of all, it's affordable. It has the lowest prices on the market. So whether you need multilingual voice cloning, real-time speech-to-speech conversion, or professional-quality narration, Fish Audio has you completely covered with whatever you need done. Try it today for free at Fish dot Audio and see why it's the absolute best AI voice tool out there today. Go check them out at Fish dot Audio. That's Fish dot Audio.
Head to Fish dot Audio and start creating with AI powered voices right now. Welcome to this week's AI update. So how many times can artificial intelligence reinvent itself in a single week? And what happens when it starts to reshape the very definition of creativity, design, and human likeness? This week in AI has introduced more than just flashy demos or iterative upgrades. It has opened a new chapter in how we visualize the future of interaction, media and autonomy.
From image generation that understands multiple references, to hyper-realistic avatars that lip-sync in multiple languages, to humanoid robots executing front flips and punches, all within a few days, the question is no longer what AI can do, but what it will leave for us to do. ByteDance, Google, Meta, Alibaba, Vivago AI, and a constellation of open-source contributors have each launched new products, updates, or jaw-dropping showcases, many of which feel like they came from
different futures. This isn't about incremental progress, it's about a cascade of systems, each pushing boundaries once thought unscalable. But buried beneath the spectacle is an unsettling realization: how fast are we letting machines do what once required human hands, eyes, and voices, or even our own emotions? So let's start with ByteDance's latest creation, Uno.
This new AI model redefines what it means to generate an image. Uno responds to text prompts, but it also fuses multiple image references into a single coherent output, something previous models often struggled with. Most impressively, it retains character fidelity and object accuracy regardless of how surreal or stylistically diverse the prompt might be. Now, say you upload a logo and a photo of a T-shirt.
Uno doesn't just overlay the logo, it visualizes it printed on the shirt with convincing fabric, shadows, and appropriate scale. Or consider uploading a doll and a plush toy: Uno will place them together in a new, realistic scene. It even lets users generate a stylized portrait of a person from a single input photo, in anime, Pixar, or even video game style, all while maintaining identity traits.
Now for creators, Uno is a silent assassin. If you're running a fashion brand, Uno is basically your model and your whole photo-shoot team. Upload two clothing items and prompt the model to generate a person wearing them, set in a cityscape or a flower field at sunset. You can even try different models, body types, or urban settings. And for influencers, Uno becomes a face-swapping studio. Give it one video of yourself and it'll generate your likeness in any scene, doing anything,
wearing anything. This kind of tool doesn't just help with photorealistic image creation, it enables entirely new workflows for people who don't have a studio full of people. Now, what used to require hours and possibly days in Photoshop and multiple production rounds is now available with just a few clicks and a smart prompt. Now, Uno's value isn't just in flexibility, it's in fidelity.
In benchmark comparisons with tools like IP-Adapter and OminiControl, Uno consistently produces more accurate generations. For instance, when fed a reference image of a uniquely designed clock, Uno was the only tool able to correctly recreate it on green grass surrounded by sunflowers, with even the yellow numeral 3 intact. The same held true for altering the color of toys or preserving intricate shapes. Accuracy at this level means less trial and error and more trust in AI-generated visuals.
That's a quiet revolution for designers, developers, and content creators who work under very tight deadlines. Now, ByteDance has made Uno freely available through a Hugging Face demo, with an optional GitHub repository for local installs. You'll need a machine with at least 16 gigs of RAM and some familiarity with launching Python scripts. Quantized model variants are also available for those with limited GPU resources, and this dual offering, browser-based ease plus local control, makes Uno accessible without compromising its power. It gives professionals and hobbyists alike the ability to integrate next-gen visual tooling into their own workflows.
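To give you a rough idea of what driving that hosted demo from a script could look like, here's a minimal sketch using the gradio_client library. The Space name, endpoint name, and argument order below are my assumptions for illustration, not the project's documented interface, so check the demo's own API panel for the real signature.

```python
# A minimal sketch, assuming the UNO Hugging Face Space exposes a Gradio endpoint
# that takes a text prompt plus reference images. The Space path, endpoint name,
# and argument order are placeholders, not ByteDance's documented API.
from gradio_client import Client, handle_file

client = Client("bytedance-research/UNO-FLUX")  # placeholder Space name
result = client.predict(
    "the logo printed on the T-shirt, studio lighting, realistic fabric folds",  # prompt
    handle_file("logo.png"),     # first reference image
    handle_file("tshirt.jpg"),   # second reference image
    api_name="/predict",         # assumed endpoint name
)
print(result)  # typically a local path to the generated image
```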
Meanwhile, a whole different team has pushed video generation to the next level. Their latest tool, called One Minute Video Generation with Test-Time Training, allows AI to produce full-length, minute-long videos that maintain consistent characters, environments, and art styles. Instead of relying on a single long prompt, you break a video down into storyboard-style scenes. Each scene contains a short description, like "Jerry the brown mouse holds the cheese" or "Tom the cat snatches the cheese," and the model stitches these together to form a coherent, evolving story.
Yes, the generated videos have rough edges: the lip sync is glitchy, backgrounds jitter, and text in scenes is often illegible. But the consistency of character, motion, and design throughout a full minute is a major leap forward; prior tools could barely sustain coherence over three to five seconds. This was made possible by layering test-time training modules over CogVideo's base model. These modules act like memory units, absorbing and maintaining the visual state across scene transitions.
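To make that concrete, here's a rough sketch of the storyboard workflow in code. The generate_scene function and the carried state object are stand-ins I've made up to illustrate the idea; they are not the project's actual API.

```python
# A minimal sketch of the storyboard workflow, with made-up stand-ins rather than
# the project's real API. The carried `state` plays the role of the test-time-training
# memory layered on top of the CogVideo base model.
from typing import Any, List, Optional, Tuple

def generate_scene(prompt: str, state: Optional[Any]) -> Tuple[str, Any]:
    """Stand-in: a real implementation would return video frames plus updated TTT state."""
    return f"<clip for: {prompt}>", state

storyboard: List[str] = [
    "Jerry the brown mouse holds the cheese in the kitchen",
    "Tom the cat sneaks up behind the counter",
    "Tom snatches the cheese and Jerry gives chase",
]

clips, state = [], None
for scene in storyboard:
    # The state carried across scenes is what keeps characters, environments,
    # and art style consistent over the full minute.
    clip, state = generate_scene(scene, state)
    clips.append(clip)

print(clips)  # in the real tool, these clips are stitched into one continuous video
```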
It points to a near future where AI can generate a ten-minute short film with nothing but a script and a few references. Then there's HoloPart, another new AI system, this one focused on 3D modeling. The tool doesn't just segment objects, it reconstructs what's hidden from view. If you upload a ring, the model identifies the band and the settings, then completes the invisible side of each. That completion isn't guesswork.
It uses a two-stage system: first, it segments visible parts; next, it sends each piece through a 3D diffusion model to fill in occluded areas. The end result is a fully editable mesh with complete, logical subparts. For 3D artists, this means they can now expand or customize individual parts of a model, resize the diamond, change a rim texture, or separate a chair's legs from its seat, all without needing to manually remodel or make speculative guesses.
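As a rough illustration of that two-stage idea, here's what the pipeline could look like in code. Both functions are stand-ins for the sake of the example, not HoloPart's real interface.

```python
# A minimal sketch of the two-stage pipeline described above, with stand-in
# functions rather than HoloPart's actual API: stage one segments the visible
# parts of a mesh, stage two runs each part through a 3D diffusion model to
# fill in the occluded geometry.
from typing import List

def segment_visible_parts(mesh_path: str) -> List[str]:
    """Stand-in for stage one: returns one partial mesh per detected part."""
    return ["band_partial.obj", "setting_partial.obj"]  # e.g. for a ring

def complete_part(partial_mesh: str) -> str:
    """Stand-in for stage two: a 3D diffusion model reconstructs hidden surfaces."""
    return partial_mesh.replace("_partial", "_complete")

parts = segment_visible_parts("ring_scan.obj")
full_parts = [complete_part(p) for p in parts]  # each part becomes a closed, editable mesh
print(full_parts)  # ['band_complete.obj', 'setting_complete.obj']
```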
This kind of automation saves hours of work in production design, animation, and game development. Like many other tools this week, HoloPart is available on Hugging Face for interactive testing and through GitHub for full installation.
Open source models just got better than commercial ones. One major development this week came from a lesser-known but quickly rising company called Vivago AI. It launched HiDream, a new open-source image generation model that has overtaken the competition, at least in benchmark ratings. Unlike other high-performing systems that remain closed or partially commercialized, HiDream is fully uncensored and
openly available. According to Artificial Analysis, an independent model-ranking organization, HiDream now holds third place among all text-to-image models. It is the highest-ranked open-source model on the list, surpassing not only Stable Diffusion but also Flux 1 Dev. Flux Pro, which ranks slightly higher, remains closed source, making HiDream the best available option for developers and researchers working without licensing restrictions.
Early tests show that HiDream is more than a benchmark leader; it performs well under practical creative conditions. Text prompts that previously produced vague or distorted visuals now return precise, detailed, and highly stylized results. This has opened up new use cases for illustrators, concept designers, and product marketers who want control over the visual fidelity of AI-generated art. Importantly, HiDream allows full creative expression without the
usual guardrails or censors. This is a rare stance in the AI space, where most companies apply aggressive filtering or content restrictions. That raises ethical questions, but for many users, particularly in animation, satire, and adult design, it solves a problem they've long encountered with other models. Now, text-to-vector generation is also on the rise. Another standout is OmniSVG, an AI tool designed to generate scalable vector graphics, or SVGs, from text prompts or images.
Unlike pixel-based images, SVGs are resolution independent, which means designers can scale these visuals from handheld to billboard sizes without losing fidelity, and developers can drop them into apps without worrying about file size or weight. OmniSVG stands out for its ability to produce complex, detailed vector illustrations that actually match user prompts. Previous models' SVG generation often failed to capture layout or character details, but this one completely nails it. Examples include cartoon characters with mushroom hats, stylized buttons and web UI concepts, and even photo-based SVG conversions that retain line fidelity. And vectors still matter, because the web still runs on them. Icons, logos, app UI elements, and infographics all rely on SVGs. For developers who need sharp design across multiple screen sizes, or for companies localizing assets into dozens of languages, tools like OmniSVG reduce the time between idea and development. What's more, the model also accepts raster images and converts them into precise SVG renderings. This brings it into direct competition with some commercial vectorization tools, and according to testing it outperforms most of them in detail preservation and line geometry. Code and datasets for OmniSVG are expected to be released soon. Meanwhile, Alibaba has debuted a new version of its talking-head generator, dubbed OmniTalker. The tool takes a short video of a person talking, just a few seconds, and allows users to change what that person says using a custom transcript. The result is a realistic-looking video where the person says anything you want them to say.
The output is not only visually smooth, but supports different languages, maintains mouth movements in sync with the audio, and preserves facial expressions. This means you can take an input of Jackie Chan speaking in Mandarin and generate a video of him delivering a speech in English, accent intact, expressions unbroken. Of course, it's not perfect. The output can become uncanny if the person moves their head rapidly or if there are long
monologues without breaks, and in some cases the AI renders the face too smoothly or loses subtle emotional transitions. But the lip sync and frame continuity far surpass most tools currently available. The more fascinating part is emotional modulation. If the reference video shows a sad expression, the generated face will appear somber even if the transcript is upbeat. A happy reference video injects cheerful expressions into whatever script is applied.
This emotional anchoring makes the system feel less robotic and more like a tool for actors or presenters who want to alter delivery without rerecording. Now, most avatar generators top out at about 10 to 15 seconds. OmniTalker handles multi-minute videos, maintaining voice sync and facial movements throughout. In one test, a two-minute political speech was generated using just a headshot and a transcript, no voice acting
needed. You can even interact with the avatars live in real time, giving them queries and getting real-time spoken answers. This makes the system useful for educators, sales, customer service, or political messaging, any scenario where one person must deliver dynamic scripts at scale without reshooting or editing video. It moves AI from behind the scenes to center stage.
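To picture that workflow in code, here's a minimal sketch with made-up stand-in functions rather than Alibaba's actual OmniTalker API: a few seconds of reference video drive both the cloned voice and the lip-synced face for whatever transcript you supply.

```python
# A minimal sketch of the workflow described above, using stand-in functions
# rather than OmniTalker's real API: a short reference clip supplies the voice
# and facial style, and a new transcript drives the generated talking head.

def clone_voice_and_style(reference_video: str) -> dict:
    """Stand-in: extracts the speaker's voice timbre and facial mannerisms."""
    return {"voice": "speaker_embedding", "face": "expression_profile"}

def render_talking_head(profile: dict, transcript: str, language: str) -> str:
    """Stand-in: synthesizes speech in the target language and lip-syncs the face."""
    return f"output_{language}.mp4"

profile = clone_voice_and_style("reference_clip.mp4")  # a few seconds is enough
video = render_talking_head(
    profile,
    transcript="Welcome, everyone, to today's product launch.",
    language="en",  # the reference could be in Mandarin; the output keeps the accent
)
print(video)
```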
And another related tool launched this week, though not from Alibaba, can take a static photo and animate the face using either an audio clip or a full reference video of facial movement. Feed it a Steve Jobs speech and a photo of Einstein, and it'll make Einstein deliver Jobs' words, lip-syncing each syllable with eerie accuracy. What's remarkable here isn't just the lip sync, it's the ability to mimic non-verbal cues: blinking, nodding, pausing, or even glancing away.
The AI tracks motion and emotional shifts, mapping them to any face, real or fictional. Whether it's Anne Hathaway or a digital clone, the face mimics both speech and sentiment, and compared to rivals like X-Portrait or LivePortrait, this tool performs better on both fidelity and emotional nuance. The head doesn't stay locked, the eyes shift naturally, and when combined with expressive audio, it creates a convincing
illusion of live speech. The GitHub repository is listed, but the code hasn't been released yet. Its dual-mode functionality, supporting both speech-only input and full motion transfer, means that users can choose between simple lip-sync generation and full-body animation. The results are strong enough to enter entertainment production, especially for animated interviews, explainer videos, or social content.
Now, robots. Aside from software, the most viral moments this week came from humanoid robots. Unitree's robot has performed flips and complex dances before, but EngineAI's robot raised eyebrows by boxing humans, reacting to punches and landing counter blows. The demo, notably, was captured during a live stream with a well-known streamer visiting the company's facility in China.
Critics have long claimed these humanoid demos were faked, CGI stitched into real environments, but the live stream confirmed otherwise. The robot fell, got back up, aimed some punches, and adjusted its stance, all autonomously, with AI. Not pre-recorded, not choreographed. Unlike dancing routines or rehearsed acrobatics, boxing requires real-time adaptation. The robot had to interpret opponent movements, adjust its balance, and select appropriate actions, all within
milliseconds, like a human. This is about machines competing with humans physically, a human versus a robot. We've seen this movie before, and the humans don't fare too well in it. Now, what sets this apart is the autonomy: the system wasn't remote controlled. It executed decisions based on internal algorithms, tracking visual input and calculating body position. If it fell, it knew how to stand up. If it got too close, it backed off.
For the first time, the idea of reactive humanoid robotics actually feels real. Then there's Kawasaki. They introduced a concept that sounds absurd until you watch it: an AI-powered mechanical horse. Unlike traditional scooters or two-wheelers, this machine walks on four robotic legs. Its body mimics the movement of an actual horse, and it's designed to be powered by hydrogen, releasing only water vapor. At this stage the prototype is non-functional.
The riding demo is CGI, and the model shown at the Osaka Expo is a static mock-up. But the idea is absolutely serious. Kawasaki envisions a transport method that's clean, adaptive, and capable of navigating uneven terrain, though early feedback has been skeptical: most people agree that wheels still outperform legs when it comes to energy efficiency and stability. Now, all of these tools, models, and demos are very
flashy, but they also represent tangible shifts in how humans can work, communicate, and create. If you're a designer, AI will now generate not only your visuals, but your style. If you're a performer, AI will now speak with your face and your voice. If you're a robotics engineer, the next leap may come not from code, but from carbon fiber and real-time planning algorithms. The acceleration isn't evenly distributed, but the opportunities and risks are everywhere.
These tools aren't just for labs or corporations anymore. They're for us. They're available on GitHub, Hugging Face, or even browser-based apps, and anyone can build with them. Hey, thank you so much for listening today. I really do appreciate your support. If you could take a second and hit the subscribe or follow button on whatever podcast platform you're listening on right now, I'd greatly appreciate it. It helps out the show tremendously and you'll never miss an episode.
And each episode is about 10 minutes or less, to get you caught up quickly. And please, if you want to support the show even more, go to Patreon dot com slash Stage Zero. Please take care of yourselves and each other, and I'll see you tomorrow.