🎙️ EP 211: The Chip That Hardwires AI (17,000 Tokens/sec?!)

00:00

When you think about the word hardware, what image pops into your head? Probably a blank slate, right? Right. A piece of silicon just waiting for code. But what if the software wasn't just running on the chip? What if it was physically baked into it? It sounds like science fiction, but it's actually happening. We are talking about a chip that runs 17,000 tokens per second. Because the AI isn't on the chip. It is the chip. Exactly. That is... That is really hard to wrap your head

00:26

around. It is a total paradigm shift from software to hardwired intelligence. Welcome to the Deep Dive. I am really glad you are here with us today. We have a fascinating stack of reports to get through. And I want to approach this with a measured curiosity. There is just so much hype in this space right now. Our job is to slow down and really look at the mechanics of what is changing. I love that approach. And we have a pretty wild roadmap ahead. We are going to start with this

00:54

Silicon Llama. The chip that breaks the speed limit. Right. By cementing the model directly into the hardware. Then we're going to pivot to a Google research paper. The one that says simply repeating yourself makes an AI smarter. Which sounds too simple to be true, but the data is there. Then we will zoom out to the massive geopolitical power struggles. Anthropic chasing OpenAI. And the U.S. rejecting global AI governance

01:19

entirely. And finally, we will look at new tools and the software hiding in your spreadsheets. Let us unpack this first big story, the Silicon Llama. This comes from a company called Taalas. They launched something called the HC1. Now, usually when we talk about AI chips, we talk about GPUs. Right. Graphics processing units. They are flexible. But this HC1 is an ASIC. An ASIC. That is a chip hardwired for one single task. To understand why this matters, you have

01:46

to understand the memory wall. Right, the memory wall. In a standard setup, the compute core is incredibly fast. It does the math instantly. But the data, the actual weights of the AI model, they live in memory chips nearby. So every time you ask a question, the GPU has to fetch those weights. Move them over, process them, send them back. It is a constant traffic jam. The chip spends a lot of time just waiting for data. Exactly. It is highly inefficient. Enter Taalas. They built

02:13

an ASIC instead. Think of a GPU like a Swiss army knife. Right. It can do graphics or crypto mining or run different AI models. But the Taalas HC1 is not a Swiss army knife. It is a scalpel. It is built to do exactly one thing, run Meta's Llama 3.1 8B model. And because it is so specialized, the performance numbers are just staggering. We are seeing reports of up to 17,000 tokens per second. Tokens are basically pieces of words the AI processes. Yeah. Just pause on that number

02:44

for a second. 17,000. When I use a standard chatbot, I am thrilled with 50 tokens a second. It feels like someone typing fast, but 17,000 isn't typing. Whoa, imagine scaling that to a billion queries. You are not reading a book at that speed. You are downloading the entire library instantly. Forbes reports this is 10 times faster than Cerebras. Which was already the speed king. And potentially 100 times faster than standard GPUs. The cost is the other part that jumped

03:10

out at me. Roughly 75 cents per 1 million tokens. Which is practically free. But the real killer stat is the power usage. A rack of these things pulls 12 to 15 kilowatts. Compared to a GPU rack pulling up to 600 kilowatts. So it's 10 times faster and uses a tenth of the power. But there's no free lunch in engineering. You don't get that performance without giving something up. What is the catch here? The catch is the absolute rigidity. They literally hardwire the Llama model

03:40

weights onto the silicon die. So if Meta releases Llama 4 next week. You cannot upgrade the software. Yeah. Because the software is the hardware. That is a massive gamble. There is another trade-off too. Quantization. Which means compressing the AI's math to save space. Right. To get everything to fit, they use mixed 3-bit and 6-bit weights instead of high precision numbers. So you lose some subtle accuracy to gain all that speed. Exactly. Though they are aiming for 4-bit floating

04:07

point in future chips to close that gap. So is the speed worth the risk of hardware obsolescence if the model updates? So it is a Ferrari engine welded shut. Fast but unchangeable. I really like that image. Let us pivot from locked-in hardware to a software hack that seems almost too easy. This Google research paper is fascinating. The stutter trick. Yeah, the stutter trick. Yeah. Google researchers found that for non-reasoning models, if you simply repeat the prompt twice...

04:35

Just paste it a second time. ...performance absolutely skyrockets. We are not talking about a 5% bump.

04:42

No. On search-style tasks, accuracy jumped from 21%... to 97%. 21 to 97, just by asking twice? Just by saying it again. Why does that work? It comes down to how these models process information. They read left to right. They interpret early words before seeing later clarifications. Right. They are predicting the next word based on what they have seen so far. So if I give a complex instruction at the end of a sentence, it is already committed to a trajectory before it gets there.
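A minimal sketch of what "saying it again" could look like in practice, assuming the trick is nothing more than concatenating the same prompt text twice before sending it to a model. The function name `repeat_prompt`, the blank-line separator, and the sample question are all hypothetical choices for illustration. The episode does not state the exact format the Google researchers used, and no model call is made here.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by `separator`.

    Assumption: plain concatenation with a blank line between copies.
    The exact format used in the research is not given in the episode.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    cleaned = prompt.strip()  # drop stray leading/trailing whitespace
    return separator.join([cleaned] * times)


if __name__ == "__main__":
    # Hypothetical prompt where the key constraint arrives at the very end.
    question = (
        "Here is a list of cities: Paris, Tokyo, Rome, Lima. "
        "Name only the ones that are in Europe."
    )
    # The doubled text is what would be pasted or sent to a chat model.
    print(repeat_prompt(question))
```

On the second copy, the constraint at the end of the first copy is already in the context the model has read, which is the "second pass" idea the hosts describe.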

05:11

But repeating the prompt gives it a second pass. The first iteration puts the full context into its working memory. By the time it generates an answer after the second prompt, it has future knowledge of the entire request. It creates a perfect buffer for context awareness. I have to admit something here. I still wrestle with prompt drift myself. We all do. Sometimes I get lazy with instructions. It is incredibly comforting to know the fix is just copy-pasting. And the

05:38

data backs it up. Repetition beat the normal prompt in 47 out of 70 cases. And crucially, it never performed worse in a statistically meaningful way. So there's really no downside. But does this prove models aren't actually thinking but just predicting linearly? Right. They aren't reasoning. They are just auto-completing with better hindsight. It is a great reminder of what is actually under the hood. Now let us look at

06:03

the engine room of the industry itself. The business side of this deep dive is moving so fast right now. Anthropic is on an absolute tear. Their revenue scaled 10 times recently. Compared to OpenAI at 3.4 times. OpenAI is still massive, but Anthropic is accelerating much faster. Some projections say they could overtake OpenAI by mid-2026. And then you have NVIDIA making a huge move. NVIDIA is nearing a $30 billion equity stake in OpenAI. Right. This replaces a previous

06:34

chip supply pact. This helps value OpenAI at $830 billion. We are creeping into trillion-dollar territory for a private company. But the map of who uses and regulates this tech is fracturing. We have to talk about the Delhi Declaration. Over 70 countries signed this declaration in India, focusing on AI safety. It is a massive move by the Global South to have a voice here. But we need to be clear about the U.S. response. The White House completely rejected it. The exact

07:02

phrase was they totally reject global AI governance. We are just reporting what the sources state here, but that is a very definitive stance. It shows a clear prioritization of speed and domestic control. And when you look at the demographics, India's push makes sense. Young Indians are powering ChatGPT usage. Nearly 50% of their users are 18 to 24. India has over 100 million weekly users. So you have the users in India, the hardware in Taiwan and the U.S. The capital in Silicon

07:30

Valley. It is a highly volatile mix. If the hardware maker owns the software maker, who actually controls the industry? The arms dealer is essentially buying the army. Sponsor break. We are back. Let us bring this down from geopolitics to something a bit more grounded. Literally down to your desktop spreadsheets. The tool that runs the world. The source highlights this concept of software hiding in your spreadsheets. They call it the big seed

07:56

or the blueprint. Your messy Excel sheet with client data and notes is actually a blueprint for a custom app. You just need the right tool to translate it. This is where platforms like Glide come in. Wrapping a user interface around your raw data. It is total democratization of software. Yeah. But we are also seeing highly specialized micro tools. Like Claude in PowerPoint. Right. It reads your layouts and fonts. So when it generates a slide, it stays perfectly on brand.

08:21

No more generic corporate clip art. And then there is Wordy. Wordy is fun. You watch movie clips and it gives you quizzes. Gamified learning powered by AI to check comprehension. Then on the totally opposite end of the spectrum, we have Ineffable Intelligence. They just raised a $1 billion seed round. A $1 billion seed round led by ex-DeepMind star David Silver. Their explicit goal is building superhuman intelligence. The capital intensity required right now is just

08:52

wild. With $1 billion seed rounds, are we in a bubble or just starting the curve? High-stakes poker. But the chips are worth billions. Let us pull all these threads together. We covered a lot of ground today. If we look at the big picture, we are seeing a massive move towards specialization. Starting with the Silicon Llama. Chips hardwired for specific thoughts. Moving away from general purpose to extreme focus. At the same time, we are learning the weird psychology

09:17

of the machines. The Google stutter trick proves we are still just figuring out how to talk to them. And globally, the map is fracturing. The U.S. goes it alone, while the Global South drives massive usage. It makes you wonder, if we are baking models into silicon, are we stabilizing or just building faster obsolescence? What happens when you bake Llama into a chip and it gets outdated next Tuesday? You get a very expensive doorstop. Before we go, I want to encourage you to try

09:44

that double prompt trick on your next task. Just paste your complex instruction twice and see what happens. Thank you for joining us on this deep dive. Stay curious. Outro music.

Transcript source: Provided by creator in RSS feed
