Welcome to the deep dive. Today. We're diving into well, a really fundamental challenge in computing these days, handling that absolute flood of data, you know, the high speed stuff.
Yeah, it's relentless. I think social media, IoT sensors, everywhere, web clicks, it just keeps coming.
And it's not just the sheer volume, right, it's how bursty it is. You get these sudden peaks exactly.
That borestiness is the real killer. Our sources. They show that building systems that don't just fall over when you get a sudden rush, like say everyone trying to check out a bike in a smart city right at five pm, that needs special thinking. Traditional approaches just won't.
Cut it right. So our mission today is basically to give you a solid shortcut to understanding these kinds of architectures. Will use the Amazon Kinesis family, you know, data streams, fire hose, data analytics, video streams, kind of as our case study.
Yeah. The goal is by the end you'll really get the difference between these tools, and more importantly, the core idea is behind making these systems scalable and crucially fast.
Okay, let's kick off with the big why why is speed so important in data analysis. Now why the big rush?
Well, it's become a strategic must have. Really, we've moved way beyond just looking backwards. Yeah, like those old batch jobs running overnight giving you a report.
On yesterday, right, a snapshot of the past exactly.
Now, it's all about immediate insights, the freshest possible data. Let's you make the best decisions right now, perishable insights.
Basically, I've heard about this thing, the O day loop observe, orient, decide, act. It comes from military strategy. I think, how does real time data fit in there?
It fits perfectly. Real time analytics fundamentally strengths that observe window. Something happens, you know about.
It instantly, which means you can orient, decide, and act much faster.
Precisely gives you a huge edge. Think about frog detection or optimizing ad placements on the fly, or in that smart city example, rerouting maintenance crews instantly based on real time bike usage patterns.
Okay, so that's the why. Now the how? A single server obviously can't cope, so we use distributed systems. Lots of servers network together. But does not just explode the complexity failure modes everywhere.
Oh absolutely, It gets complex fast, but engineers figured it out ways to manage it. Two main patterns really help tame that initial complexity.
Beast Okay, what were they.
First standardized interfaces? Things like APIs This really paved the way for micro service architectures.
Ah, so breaking the big system down into smaller independent services exactly.
It lets different teams own their piece and you know, iterate faster without stepping on each other's toes.
It sounds a bit like Conway's law in action, the system design mirroring the ORG structure.
That's a great way to put it. Yeah, small independent teams tend to build small independent services. Those standardized interfaces are what make it work smoothly.
Okay, interfaces help services talk. But what about failures? If one micro service crashes while it's processing data, how do you stop it from dragging down everything else that depends on it?
Right, that's the castgating failure problem. And that's where the second pattern comes in. Decoupling. You use an asynchronous message broker.
A message broker like a middleman for data sort of.
Yeah. It acts as the stable buffer, an invariant in the system. If a downstream service, a consumer fails, the broker just holds onto the incoming messages. It builds up a backlog.
So the producer can keep sending data unaware of the downstream problem.
Exactly, the failure is contained. Once the consumer service comes back online, it just starts processing the backlog from the broker. No data lost, no castkating crash.
That makes sense. The broker isolates the fault. But if that backlog keeps growing, that sounds like trouble brewing. What metrics do we watch?
The key capacity metric is transactions per second tps. That's usually limited by either the number of records per second or the total data size like megabytes per second.
Okay, tps, but what's the danger signal the real data.
Your signal is back pressure. That's when your producers are sending data faster than your consumers can process it. Input tps is consistently higher than output tps.
So back to the bike sharing rush hour hits every bike starts pinging its location like crazy. How do you handle that surge, that back pressure without the system just drowning.
You've got a few strategies ranging from let's say gentle to pretty aggressive. Okay, you can throttle the producers tell the bikes, hey, maybe only report your location every minute instead of every second during peak times.
Reduce the input rate makes.
Sense, Or you can scale the consumers. If you're in the cloud, you might automatically spin up more instances of your processing application to handle the load.
Add more workers, right.
You can also use bigger buffers in the message broker itself to just absorb temporary spikes. But the most drastic option is to just drop messages drop data.
That sounds risky.
It is. You'd only ever do that for non critical data. Like maybe dropping some routine sensor readings is okay if the systems overloaded, but you'd never drop something like a customer order completion message. Never.
Okay, got it? Critical versus non critical. Let's dig into the stream itself. What are the absolute basic parts of any streaming system for main components?
Yeah, you can break it down pretty simply. First you have the producers. These are the apps sending the data, our bike sensors, the user's mobile app checking out a bike, that kind of thing.
Okay, producers send data.
Then you have the messages or records. That's the actual data payload, usually small, maybe I a few kill bytes often capped around a megabyte. Each record has the data itself and a header, usually with a unique ID the broker assigns.
Got it, producers messages.
Third is the stream or the broker. That's the component doing the buffering, like canisis itself.
Or Kafka the fault isolator we talked about, right, And finally you have the consumers, the applications that pull data from the stream and actually do something with it, analysis, storage, triggering, alerts.
Whatever, producers, messages, stream consumers. Okay, now latency you mentioned speed is critical the source of stress. We need to be really precise here. It's not just latency. They're two specific measures.
Yes, absolutely critical distinction. There's propagation delay that's simply the time it takes from the moment of producer writes a message to the moment a consumer reads it raw transmission speed.
Basically okay, right to read time. What's the other one?
The other and often more telling metric is the age of the message. This measures how long a message has actually been sitting in the stream before a consumer picked it up.
Ah, So propagation delay could be low, but the message might still be old. If the consumers are lagging exactly.
If that average message age starts creeping up, that's your big warning sign. It means your consumers can't keep up. The backlog is growing, and performance problems are right around the corner, even if the network itself is fast.
That's a really important distinction. Now, failure modes distributed systems retries network glitches. It means data streams all and have this at least once delivery guarantee right, a message might show up more than once.
That's the standard reality. Yes, because ensuring exactly once delivery across a distributed system is incredibly hard. Most brokers guarantee at least once. The producer might retry sending if it doesn't get confirmation the consumer by process and in crash before confirming. Duplicates happen, and that.
Sounds potentially disastrous. If you're, say, processing payments or updating inventory accounts, double counting.
It would be disastrous, which is why the responsibility for handling duplicates shifts downstream to the consumer or the final destination system. Those systems must be designed to be idempatant idempatent, meaning meaning processing the same message multiple times has the exact same effect as processing it just once.
How do you achieve that two main ways.
Either the operation itself is naturally idempatant like setting a value as adempatent adding to a count, or is not, or you use that unique ideed. It's typically in the message payload to explicitly check if you've already processed this specific message before de duplication.
Basically okay, So the consuming application needs to be smart about duplicates. What about errors within a message? If a consumer gets a record it just can't process.
That leads to a particularly nasty problem called the poison pill.
Poisman pill sounds bad, It is.
Because message streams usually guarantee the order of messages, at least within a certain context. Like all messages for a specific bike, you need to process the unlock event before the return event takes sense.
Order matters.
But if one message in that sequence causes an error in the consumer, maybe it's malformed bad data, and the consumer keeps retrying and failing on that specific message, it gets stuck. It gets completely stuck. That single poison pill record blocks the processing of all the perfectly valid records sitting behind it in the stream for that same bike or whatever the ordering context is. The whole sequence grinds to a halt because of one bad message.
Wow, so a single bite of bad data could effectively block a whole partition of the stream, causing a latency for everything behind it to skyrocket exactly.
It really highlights why robust error handling and potentially dead letter cues for problematic messages are so crucial in consumer design.
Okay, that sets the stage really well. Let's move on to the specific tools Amazon offers with Kinesis to handle all this at scale. Starting with the foundation canesis data streams or KDS. When do you reach for this?
KTS is your go to when you need maximum control, high durability, and really low latency. We're talking subsecond processing for your own custom applications that read directly from the stream.
And the core unit of KDS is the shard, right, what does that actually do?
The shard is the fundamental unit of capacity and parallelism in KDS. Each chard has fixed limits on how much data it can ingest typically one megabyte per second or one thousand records per second, whichever comes first, and also how much data can be read out.
If I need more capacity, I just add more shards pretty much.
Yeah, you scale the stream by scaling the number of shards.
Okay, and if we're tracking millions of bike journeys, how does KDS decide which shard a specific bike's data goes to and keep it in order.
That's determined by the partition key. When your producer application sends a record to KDS, it includes a partition key. This could be the bike ID for instance.
And KTS uses that key to route.
The data exactly. KTS hashes the partition key and uses the result to assign the record to a specific shard. All records with the same partition key always go to the same.
Chard ah, and that's how it guarantees order. Within that key, all data for bike one twenty three lands on, say, shard five, in the correct sequence precisely.
But this also highlights a potential pitfall, the hot partition key or hot shard. If one partition ke, like a super popular bike or a single central sensor, sends way more data than others, its shark can get overwhelmed while others are idle.
So choosing a good part two key that distributes data evenly is critical for performance, maybe not just the bike idea. If usage is very uneven.
Right, sometimes using a more random key or combining fields is necessary to spread the load effectively.
Now reading the data, Let's say we have multiple apps wanting that bike data, one for maintenance alarts, one for route analysis. How do they read efficiently? You mentioned Enhanced fan out EFO.
Yes, so, standard KDS consumers work on a pull model. They pull the shard for data, but each chard has a total read capacity limit about two megabytes per second that all standard consumers share.
So if you have lots of consumers pulling the same shard, they start slowing each other down exactly.
They compete for that shared bandwidth. Enhanced fan out EFO solves this with EFO. Each registered consumer gets its own dedicated throughput limit per shard, delivered via push model from Kinesis.
So EFO consumers don't interfere with each other.
Correct. It allows multiple real time applications to consume from the same stream at high speed independently. If you need scale on the consumer side, EFO is often essential.
Okay, KTS for flexible, low latency custom apps. What about Kinesis Data fire Hose KDF. People often call it the easy loader.
Why because it is KDF is fully managed, totally serverless and Its whole purpose is to make it super simple to capture streaming data and load it into specific destinations. Think S three data laks, redshift data warehouses, elastic search for search, that kind of thing.
How easy is easy?
Often zero code is needed for the delivery part. You can figure fire Hose in the console, point it at your stream source which could even be KDS, tell it where to send the data, and it handles the scaling, buffering, delivery, and retries automatically.
Does it do anything else like transform the data?
Yes, you can. It has built in capabilities for data format conversion like Jayson to park, and it can also invoke an AWS lambda function to perform custom inline transformations basic epl before the data lands and the destination.
Okay, KTS gives us speed and flexibility. KDF gives us ease of use for loading data, sinks. What's the catch with KDF? What's the tradeoff?
The main trade off is latency because fire Hose buffers data internally, often for several seconds or even minutes to batch it efficiently for delivery to the destination.
Ah, It's not truly real time like KDS.
Can be exactly. It optimizes for delivery throughput and cost effectiveness by batching, not for subsecond end to end latency, So if you need that immediate subsecond processing, KDS is the choice. If your goal is reliable, easy loading into F three or redshift and near real time is good enough, KDF is often way simpler.
Makes sense speed versus simplicity. Now, analysis kinesis data analytics kDa. How does this fit in analyzing the stream as it flows?
Precisely? kDa is also serverless, and it lets you run continuous queries or applications against your streaming data, either from KDS or KDF. It offers two different.
Engines for this, Okay, what are the engine choices?
First is a sqel engine. If you're familiar with standard SQL, you can use ans icql to query the stream. kDa presents the stream data within windows, like tumbling windows of the last five minutes or sliding windows, so.
You could write a query like select count from bike stream where status in US, group by location district over a rolling time.
Window exactly that kind of thing. It treats the incoming stream like a continuously updating table. You can query. Very powerful for many common analytics tasks and accessible if you know SQL.
And the second engine you mentioned too.
The second is based on a patche flink. This gives you much more power and flexibility. You typically write your applications in Javar Scala.
What can flink do that the SQL engine can't.
Well, flink apps can handle much more complex logic, maintain sophisticated state across events, and importantly can connect the data sources and sinks outside of just kinesis and aws like maybe an on premises KOFKA cluster or a custom database.
And I remember reading flink is key for achieving exactly once processing semantics. How does it manage that?
Yes, that's a major advantage. Flank achieves exactly once by carefully managing the application's internal state and periodically saving checkpoints of that state to durable storage typically S three or something like rock sysdb. If there's a failure, it can restore from the last successful checkpoint and resume processing without missing data or processing duplicates.
So SQL for simpler windowed queries, think for complex stateful potentially exactly one's processing and connecting anywhere.
That's a good summary.
Yeah, okay, one more service kinesis video streams KBS. This seems more specialized. Video and audio.
Yeah, KBS is specifically designed for handling time encoded data streams, primarily video and audio, but potentially other time series data too, like radar or liar feeds.
How does play out in our smart city bike scenario?
KBS actually addresses two pretty distinct use cases. There first is real time low latency interaction. Imagine a user wanting to see a live video feed of a bike stand on their phone to check if bikes.
Are actually available okay, live streaming.
KVS provides capabilities, often using Web RTC standards for that kind of low latency peer to peer or small group video streaming. It manages the signaling channels and things like stunt and turn servers needed to establish the connections. So KVS WebRTC.
For live use got it? And the second use case.
The second is more about storage, playback and analysis. This is KVS storage. You stream video from say security cameras that the bike stands into KVS for durable storage.
And then what just store it?
You can store it for compliance or later review, but the real power comes from integrating it with AI and mL services. For instance, you could feed that stored video stream into Amazon Recognition.
AH for analysis.
What like running facial recognition to detect known vandals near the bike stands in real time, or maybe object detection to count bikes automatically or spot obstructions. Recognition analyzes the frames ingested via KVS and can trigger alerts or other actions.
So KVS handles both the live interaction piece and the jest for later analysis piece for video.
Data correct two sides of the video coin.
That's a really helpful tour of the whole Kinesis family. Let's try to quickly recap the core job of each one for the listener.
Okay, KTS, that's your foundational low latency, high throughput stream for custom apps needing maximum flexibility, thinks subsecond speed right KTF KDF the easy serverlust loader, great for getting streaming data into S three redshift elastic search without much code,
but sacrifices that subsecond latency for batching efficiency. kDa the real time Analytics engine, use it SQL interface for simpler windowed queries, or the Apache frink engine for complex stateful processing exactly one semantics and broader connectivity.
And finally, KVS KDS.
The specialist for video and time encoded data, handles both low latency web RTC streaming for live interaction and gerbil ingestion for storage playback, and AML analysis like recognition.
Perfect summary. Now, before we wrap up, there's a crucial security point that underpins all of this distributed complexity. How to secure all these producers and consumers talking to kinesis, Yeah, this.
Is super important. The absolute standard best practice is use IMA roles. Do not embed long term access keys or secrets directly into your producer or consumer applications. Why roles specifically, because IM roles provide temporary, short lived security credentials that are automatically generated and rotated by AWS. Your application running on EC two or Lambda or wherever assumes a role gets temporary permissions, does its work, and those credentials expire.
So even if an application instance gets compromised somehow, the attacker doesn't get permanent keys.
Exactly, the potential blast radius is dramatically reduced compared to static, long lived keys being compromised. It's all about implementing least privileged access using temporary role based credentials. It's fundamental to secure cloud architecture.
That's a critical takeaway. Okay, let's finish with that provocative thought we talked about at least once delivery and the need for dempotency. How should listeners think about that?
Right, we established that GETT duplicates is just a fact of life in most high performance, resilient streaming systems like those built on Kinesis data streams. You have to design for it.
So the provocative thought.
Is think about how profoundly that changes how you have to design your applications. Yeah, you can never assume an input an event, A message will arrive only once, might arrive twice or even more time. So that's rare. So how does that constraint The absolute certainty that you might get duplicates ripple through your entire design? How does it affect your database schemas, your transaction logic, your state management.
Realizing that you must build identity consumers isn't just a detail, it's a fundamental shift in mindset needed to truly master streaming architectures.
A great point to ponder designing for repeats, not just requests. Thanks for breaking all that down.
My pleasure. It's complex, but hopefully a bit clearer now.
