OpenAI image generation and voice models breakthrough. Google releases Gemini 2.5

Mar 27, 2025 | 18 min | Ep. 60

Episode description

KBLaM - A New Approach to Knowledge Integration in AI
Apple Wants to Turn its Watches Into Wearable AI
OpenAI image generation breakthrough
OpenAI Shakes Up Voice AI with New Speech Models
DeepSeek V3-0324: Enhanced Coding AI with Improved Performance and Security Concerns
Gemini 2.5: Advanced AI Model with Superior Reasoning and Coding Capabilities
A new, challenging AGI test stumps most AI models

Transcript

Welcome to Innovation Pulse, your quick, no-nonsense update on the latest in AI. First, we will cover the latest news: ARC-AGI-2 tests AI's general intelligence, new models excel in reasoning and coding, and advanced AI improves voice tech. After this, we'll dive deep into Apple's AI enhancements for the Apple Watch.

The ARC Prize Foundation, co-founded by AI researcher François Chollet, has unveiled ARC-AGI-2, a test to measure AI models' general intelligence.

Currently, most models, including OpenAI's o1-pro and DeepSeek's R1, score between 1% and 1.3%. Human participants average 60% accuracy. The test challenges models with visual pattern puzzles, aiming to assess their ability to adapt to new problems without relying on brute computing power. ARC-AGI-2 improves upon its predecessor by emphasizing efficiency and interpretation, not just problem-solving ability. It prevents models from using extensive computing power as a shortcut.

This comes amid a call for more comprehensive AI benchmarks. The Foundation has also announced a contest for developers to achieve 85% accuracy on ARC-AGI-2, while maintaining low costs per task.

Join us as we explore the advancements in AI reasoning. Gemini 2.5 is the latest AI model showcasing advanced reasoning and coding capabilities. This model, especially the experimental Gemini 2.5 Pro, excels in benchmarks like LMArena, indicating its superior performance in complex tasks.

It has been designed to think and reason, analyzing information and drawing logical conclusions effectively. This makes it adept at coding, math and science tasks, scoring high in benchmarks like GPQA and AIME 2025. With its ability to handle a wide range of inputs, including text, audio, images and video, Gemini 2.5 offers a robust context window, soon expanding to 2 million tokens. Available now in Google AI Studio, it offers developers a powerful tool for creating sophisticated applications.

Users can expect further improvements and expanded availability in the coming weeks.

Chinese AI startup DeepSeek has released an updated version of its V3 model, named V3-0324. This model, available since December, is open source under an MIT license, with public weights. DeepSeek claims the update offers improved coding skills for web development and enhanced reasoning performance, though they advise using it for simpler reasoning tasks.

V3-0324 surpasses its predecessor on benchmarks, notably scoring higher on the challenging AIME math test. However, the ease of these benchmarks has led to concerns about benchmark saturation. The model also boasts an improved writing style and quality, especially for longer content. Speculation on Reddit suggests this upgrade might precede the anticipated release of R2.

Users can access V3-0324 on Hugging Face and DeepSeek's platforms, but should be cautious about security and privacy issues, as previous models were easily compromised.

OpenAI has unveiled new models for automatic speech recognition and text-to-speech, advancing AI voice technology. These models offer greater accuracy and affordability, making them appealing for businesses deploying AI voice agents.

The new ASR models, GPT-4o Transcribe and GPT-4o Mini Transcribe, surpass OpenAI's Whisper in handling diverse languages and noise, with the mini version being cost-effective for scalable use. The new TTS model generates lifelike voices and allows customization of tone and emotion.

OpenAI highlights two voice agent architectures: the fast but less controllable speech-to-speech model and the more robust chained approach, which splits processes for better control and is aimed at enterprises needing compliance and accuracy. OpenAI's focus on enterprise solutions positions it as a key provider for AI voice interactions, putting pressure on no-code platforms and CCaaS vendors to differentiate on usability and features like analytics and compliance.
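As a rough illustration of the chained approach, the sketch below wires three stages together: speech-to-text, a text-only reply step, and text-to-speech. Every function here is a hypothetical placeholder written for this example (none of them is an OpenAI API call), and audio is represented as plain strings so the sketch stays runnable. It only aims to show why splitting the stages gives more control: the intermediate text can be logged or checked before anything is spoken.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    transcript: str
    reply_text: str
    audio_out: str
    log: list = field(default_factory=list)


def transcribe(audio: str) -> str:
    # Placeholder ASR stage: the fake "audio" is just text with a prefix.
    prefix = "audio:"
    return audio[len(prefix):] if audio.startswith(prefix) else audio


def generate_reply(transcript: str) -> str:
    # Placeholder text-only reasoning stage.
    return f"You said: {transcript}"


def passes_policy(text: str, banned=("account number",)) -> bool:
    # Hypothetical compliance check applied to the intermediate text.
    return not any(term in text.lower() for term in banned)


def synthesize(text: str) -> str:
    # Placeholder TTS stage.
    return "audio:" + text


def chained_turn(audio_in: str) -> Turn:
    log = []
    transcript = transcribe(audio_in)
    log.append(("asr", transcript))
    reply = generate_reply(transcript)
    log.append(("llm", reply))
    if not passes_policy(reply):
        # The text can be replaced before it is ever turned into speech.
        reply = "Sorry, I can't help with that."
        log.append(("policy", "blocked"))
    return Turn(transcript, reply, synthesize(reply), log)


ok_turn = chained_turn("audio:what is my balance")
blocked_turn = chained_turn("audio:read my account number")
print(ok_turn.audio_out)
print(blocked_turn.log)
```

A speech-to-speech model would replace all three stages with a single call, which is faster but leaves no intermediate text to inspect.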

Up next, we're exploring GPT-4o's impact on creativity.

OpenAI says it spent a year using human workers to train its GPT-4o model to generate more realistic images with comprehensible text. The company unveiled an updated version of GPT-4o that can generate more realistic images, the result of a year-long effort with human trainers.

GPT-4o replaces DALL-E 3 as the default image generation model behind OpenAI's ChatGPT chatbot, and the ability to use it is now available to ChatGPT Free, Plus, Team and Pro users, the company said. Billed as a less expensive version of its most advanced AI model at the time, GPT-4o debuted last year as a multimodal model capable of creating and understanding text, video, audio and images.

Today's refined GPT-4o model makes it easier for consumers and businesses to create more lifelike images, paragraphs of comprehensible text, and even company logos and slide decks, OpenAI said. Behind the improvement to GPT-4o is a group of human trainers who labeled training data for the model, pointing out where typos and errant hands and faces had been made in AI-generated images, said Gabriel Goh, the lead researcher on the project.

Through that technique, the AI model was trained to follow human directions more closely, thereby generating more accurately rendered and useful images, he said. The process, usually referred to as reinforcement learning from human feedback, or RLHF, is a common technique used by AI companies to improve their models after they are initially trained.

Given the sheer reach of OpenAI's AI systems (it says it has over 400 million weekly users of ChatGPT), the impact these human trainers can have is significant. OpenAI said it worked with a little more than 100 human workers for the reinforcement learning process. The base model is already intelligent in its own way, Goh said, and the reinforcement learning from human feedback process then brings out the intelligence and refines it.

With the improvements in research made to GPT-4o, ChatGPT's image generation is now a lot more useful for consumers and businesses, OpenAI said. Whereas prior iterations of its AI systems weren't able to generate paragraphs of readable text with images, for instance, GPT-4o is capable of doing so, it said.

The model is also able to create transparent backgrounds, making it possible for businesses to create logos or other iconography, said Jackie Shannon, an OpenAI product lead for ChatGPT multimodal. Other uses the company suggested include asking ChatGPT to generate images based on a user-uploaded brand style guide. GoDaddy chief data and analytics officer Travis Muhlestein said the technology and web-hosting company's use of GPT-4o is helping it embrace AI-driven content creation.

That includes things like using AI to create stock images and logos, the company said. Still, the image generation in GPT-4o isn't perfect, Goh said. In one example the company showed, a user uploaded a photo of their living room with two windows to ChatGPT. The AI system was only able to reproduce one window when recreating the image of the living room with new furniture.

The use of AI image generators remains controversial. Some artists have said AI image generators plagiarize their work and threaten their livelihoods. OpenAI said GPT-4o was trained on publicly available data, as well as proprietary data from its partnerships with companies like Shutterstock. We're respectful of the artists' rights in terms of how we do the output, and we have policies in place that prevent us from generating images that directly mimic any living artist's work, said Brad Lightcap, OpenAI's chief operating officer.

Apple is reportedly planning to enhance its Apple Watch with AI capabilities, despite challenges in consumer adoption of wearable AI devices. According to Bloomberg, the company aims to integrate cameras into the watch, expanding the Visual Intelligence features currently found on the latest iPhones. The Series model may have a camera in the display, while the Ultra might feature one on the side.

This could enable the watch to identify objects or translate text in real time, adding practical value beyond novelty. However, Apple faces hurdles with AI, having delayed a smarter Siri and received mixed reviews on its AI features. The company has also struggled to introduce promised health features like blood pressure tracking. As Apple explores this AI expansion, success remains uncertain, with potential for both innovation and added complexity.

And now, let's pivot our discussion towards the main AI topic. Today, we're exploring a fascinating new technology called KBLaM. It's the Knowledge Base-Augmented Language Model, which could revolutionize how AI systems access and use information. Let's start with a problem we've all noticed with large language models like GPT-4 and others. These AI systems are incredible at many tasks. They can write, reason, and even get creative. However, they have a significant limitation.

They struggle to stay up to date with new information. Think about it this way. These models are like brilliant students who memorize a massive set of textbooks before taking an exam. They know what they were trained on, but anything that happened after their training, new scientific discoveries, current events, or changing facts, is completely unknown to them. And that's a serious problem if we want AI to work with the latest information. So how have researchers tried to solve this problem?

There are three main approaches, each with its own drawbacks. First, there's fine-tuning. This is like sending our AI student back to school for additional courses on new material. It works, but it's incredibly expensive and time-consuming. Imagine retraining a multi-billion parameter model every time a fact changes. It's not realistic for regular updates. Second, there's retrieval-augmented generation, or RAG. This approach is like giving our AI student access to a library and a librarian.

When asked a question, the AI first asks the librarian to fetch relevant books, reads them, and then answers based on its training and this new information. This works better but adds complexity with separate retrieval systems and can't be trained seamlessly end-to-end. The third approach is in-context learning, where we provide the new information directly in the prompt. This is like handing our AI student a stack of notes right before the exam. It's simple but extremely inefficient.

As the amount of information grows, the computational demands grow quadratically. Try to feed too much information this way, and the system slows to a crawl or crashes entirely. This is where KBLaM offers a clever new solution. Instead of these approaches, KBLaM finds a way to efficiently plug knowledge into an existing language model. Here's how it works in simple terms.

KBLaM first organizes knowledge into a structured format, specifically triples that contain an entity, a property, and a value. For example: (KBLaM, creator, Microsoft Research). These knowledge triples get encoded into special vector pairs, compressed information packets the AI can easily process. What makes KBLaM special is how it integrates these knowledge packets. It uses something called rectangular attention, a modification of the standard attention mechanism that powers modern AI.

In normal language models, every word pays attention to all previous words, which is why the cost grows quadratically as more context is added. But KBLaM changes this dynamic. In KBLaM, when a user asks a question, the words in that question can pay attention to all the knowledge packets. Importantly, the knowledge packets don't need to pay attention to each other or back to the question. This might sound like a small change, but it has profound implications.

It means KBLaM's computational requirements grow linearly rather than quadratically as we add more knowledge. The results are remarkable. While traditional approaches might struggle with a few dozen facts in context, KBLaM can handle over 10,000 knowledge triples, equivalent to about 200,000 text tokens, on a single GPU. It achieves this while extending a base model that originally had a context length of only 8,000 tokens. But the benefits go beyond efficiency. KBLaM is also more interpretable.
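To make the rectangular attention idea described above concrete, here is a minimal pure-Python sketch for a single attention head. It is an illustration under stated assumptions, not the KBLaM implementation: the names `rectangular_attention` and `score_count`, the toy 2-dimensional vectors, and the single shared softmax over knowledge entries and earlier question tokens are all choices made for this example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rectangular_attention(q, k, v, kb_keys, kb_values):
    """q, k, v: query/key/value vectors for the N question tokens.
    kb_keys, kb_values: one key/value vector per knowledge triple (M entries).
    Returns (outputs, weights); weights[i] has M + i + 1 entries."""
    scale = math.sqrt(len(q[0]))
    outputs, weights = [], []
    # Only question tokens issue queries. Knowledge entries never act as
    # queries, so they do not attend to each other or back to the question.
    for i in range(len(q)):
        keys = kb_keys + k[: i + 1]        # all M entries + causal prefix
        values = kb_values + v[: i + 1]
        w = softmax([dot(q[i], key) / scale for key in keys])
        dim = len(values[0])
        out = [sum(wj * val[d] for wj, val in zip(w, values)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def score_count(n_tokens, n_kb):
    # N*M scores against the knowledge block plus the causal triangle.
    return n_tokens * n_kb + n_tokens * (n_tokens + 1) // 2


# Toy example: two knowledge entries, one question token aligned with entry 0.
kb_keys = [[1.0, 0.0], [0.0, 1.0]]
kb_values = [[10.0, 0.0], [0.0, 10.0]]
q, k, v = [[4.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]
outputs, weights = rectangular_attention(q, k, v, kb_keys, kb_values)
most_attended = weights[0].index(max(weights[0]))
print("most attended knowledge entry:", most_attended)

# Doubling the knowledge base adds a fixed N*M block of work: linear growth.
print(score_count(10, 1000), score_count(10, 2000))
```

Because the attention weights over the knowledge entries are returned explicitly, the same sketch also hints at why this design is easy to inspect: the entry with the largest weight is the one the question token relied on most.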

Researchers can see exactly which knowledge packets the model is paying attention to when answering a question. It's also more reliable, learning when to refrain from answering if the necessary information is missing from its knowledge base. This helps reduce hallucinations, those confident but incorrect answers AI sometimes provides. Perhaps most importantly, KBLaM enables dynamic updates. If a fact changes, you only need to update that specific knowledge triple.
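The per-fact update described above can be sketched as follows. This is a hypothetical illustration: the `KnowledgeStore` class, the `toy_encode` function, and the ExampleCo facts are invented for this example, and the hash-based encoder merely stands in for a learned encoder that is not reproduced here. Encoding the key from the entity and property and the value from the value text is likewise an assumption of this sketch. The only point it shows is that each triple maps to its own key/value pair, so changing one fact re-encodes one entry.

```python
import hashlib


def toy_encode(text, dim=4):
    # Hypothetical stand-in for a learned encoder: a deterministic
    # pseudo-embedding derived from a hash of the text.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


class KnowledgeStore:
    def __init__(self):
        # (entity, property) -> (key_vector, value_vector, value_text)
        self.entries = {}
        self.encode_calls = 0

    def upsert(self, entity, prop, value):
        # Insert or overwrite exactly one triple; no other entry is touched.
        key_vec = toy_encode(f"{prop} of {entity}")
        value_vec = toy_encode(value)
        self.encode_calls += 1
        self.entries[(entity, prop)] = (key_vec, value_vec, value)


store = KnowledgeStore()
store.upsert("ExampleCo", "ceo", "Alice")        # made-up facts
store.upsert("ExampleCo", "hq", "Springfield")
hq_before = store.entries[("ExampleCo", "hq")]

# One fact changes: only that triple is re-encoded.
store.upsert("ExampleCo", "ceo", "Bob")
print(store.entries[("ExampleCo", "ceo")][2])
print(store.entries[("ExampleCo", "hq")] is hq_before)
```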

No need to retrain the entire model or recompute the whole knowledge base. Think about what this means for the future of AI. We could have systems that stay current with the latest research findings, news developments, or changing business data, all without the enormous costs and delays of retraining. In fields where accuracy is critical, such as medicine, finance, and science, this approach could transform how AI systems interact with real-world information.

The Microsoft Research team behind KBLaM has released their code and datasets to the research community, hoping to inspire further advances in this promising direction. While there's still work to be done before it can be deployed at scale, KBLaM represents an important step toward AI systems that can efficiently access, update, and reason with external knowledge. The future of AI isn't just about generating text.

It's about generating knowledge that's accurate, adaptable, and deeply integrated with our evolving world. KBLaM is helping to build that future.

That's a wrap for today's podcast. We explored how new AI models like Gemini 2.5 and KBLaM are pushing the boundaries in reasoning, coding, and knowledge integration, while OpenAI and Apple continue to innovate in voice technology and wearable AI. Stay tuned for more updates.

Transcript source: Provided by creator in RSS feed