Building Real-Time Analytics Systems: From Events to Insights with Apache Kafka and Apache Pinot

Speaker 1

00:00

Okay, so you're trying to get a real grip on something complex, right, and you want it fast, but without getting totally buried in jargon and detail.

Speaker 2

00:08

Yeah, information overload is real.

Speaker 1

00:10

Exactly. So think of this as your shortcut. We're diving deep into real-time analytics today, trying to give you that core understanding, you know, without all the noise.

Speaker 2

00:20

And we're basing this on the book Building Real-Time Analytics Systems. It just came out in September 2023, first edition.

Speaker 1

00:27

Right, and the book's goal seems pretty practical: just helping you, the listener, get your job done if you're working in this space.

Speaker 2

00:34

Pretty much. It cuts through the theory to the how-to.

Speaker 1

00:38

Now, think back, maybe to the early 2000s. Data analytics often felt like something you did after everything else happened, you know, batch processing.

Speaker 2

00:47

Oh, definitely. Reports would tell you what happened yesterday or last week. Hindsight, basically.

Speaker 1

00:51

But things have really shifted, haven't they? There's this, like, massive appetite now for knowing things the moment they happen. Real time.

Speaker 2

00:59

Absolutely. The book uses fraud detection as a great example. Yeah, finding out about fraud hours later?

Speaker 1

01:04

Uh-uh, too late, right? The money's probably gone.

Speaker 2

01:07

Exactly. The real win is spotting it now, flagging it, maybe blocking it instantly. That immediacy is key. It's not just nice to have anymore. Often it's, well, essential.

Speaker 1

01:18

And that brings us to this idea of streaming. It's not about waiting for a whole file to finish downloading or collecting.

Speaker 2

01:25

No, not at all. Think of it more like a continuous flow, a river of data that just keeps coming. It never really ends.

Speaker 1

01:33

And the crucial part is you can dip into that river and act on what you see right then and there.

Speaker 2

01:38

Precisely. A data stream fundamentally is just a series of data points ordered by time. Each one represents some kind of event or a change.

Speaker 1

01:46

Like what? Give us an example.

Speaker 2

01:48

Well, like every single purchase on an e-commerce site, or every reading from an IoT sensor, maybe temperature or pressure. It's like a constant pulse of information.

Speaker 1

01:56

Okay, okay. And here's a point the book really stresses, which I found fascinating: events have a shelf life.

Speaker 2

02:01

A very short one.

Speaker 1

02:04

Sometimes their value can just, like, plummet super fast. Think about an online shopping cart someone just abandoned, right?

Speaker 2

02:10

If you can ping them with an SMS or an email, maybe with a little discount voucher...

Speaker 1

02:15

Like, immediately. You might get that sale back.

Speaker 2

02:17

You've got a decent shot, yeah. But wait even just a couple of hours, and they've moved on.

Speaker 1

02:20

Bought somewhere else, or just changed their mind.

Speaker 2

02:24

Exactly. The timing, that immediate reaction, makes all the difference. And that is the heart of real-time analytics, or RTA. It's all about squeezing value from those events basically as soon as they happen.

Speaker 1

02:36

The book mentions soft real time. What's that about? Does that mean it's not quite real time?

Speaker 2

02:41

Well, yeah, kind of. It just acknowledges that, you know, perfection is hard. There might be tiny delays, milliseconds, maybe seconds, because of network latency or system hiccups. It's not instantaneous, but it's very, very close.

Speaker 1

02:53

Okay, so practical real time. The big difference is compared to batch processing, right?

Speaker 2

02:59

Batch is where you collect data over time, maybe an hour, maybe a day, put it in a big chunk, and then analyze it.

Speaker 1

03:05

We used to set up these artificial deadlines, didn't we? Run the report at midnight for yesterday's data.

Speaker 2

03:11

Yeah, those time boundaries. The problem is your analysis is always looking backwards. You're getting insights about what was happening.

Speaker 1

03:17

Which might be stale news by the time you get it.

Speaker 2

03:21

Totally. RTA aims to give you a view of the present, so your decisions are actually relevant to now.

Speaker 1

03:25

So our mission for this deep dive, drawing from the book, is to really get into those core concepts and, importantly, the benefits. What do you actually gain from this?

Speaker 2

03:35

Let's talk benefits, then. Speed is obviously a big one. The book argues it's often a decisive factor. Market leaders tend to be faster.

Speaker 1

03:44

Faster at understanding, faster at reacting.

Speaker 2

03:47

Exactly, and RTA helps achieve that. For one thing, it can actually open up totally new revenue streams. How so? Well, think about turning your real-time data itself into a product: offering your end users, maybe customers, the ability to query data with analytical capabilities almost live. They'd likely pay for that kind of access.

Speaker 1

04:06

Ah, interesting. So the insight itself becomes a premium service. That makes sense. It's not just about making more money, though, is it? The book talks about infrastructure costs too.

Speaker 2

04:15

Yes, that's a really important one. Traditional batch systems often tie storage and compute together very tightly.

Speaker 1

04:21

Meaning if your data grows.

Speaker 2

04:23

Your costs for both storage and the processing power needed can just explode, often exponentially. Ouch. But with RTA, you're processing data more incrementally as it arrives. It sort of breaks that tight coupling. You don't necessarily need to store everything forever just to process it later in huge batches.

Speaker 1

04:40

So you avoid building those massive, expensive legacy systems just for batch jobs.

Speaker 2

04:45

Potentially, yes, significant cost savings are possible there. You're handling smaller streams continuously.

Speaker 1

04:51

Like managing a steady creek instead of building dams for unpredictable floods. And what about us, the customers? How does this improve customer experience?

Speaker 2

05:00

Well, think about customer support. Traditionally it's reactive, right? You have a problem, you call or email, they investigate.

Speaker 1

05:07

And maybe fix it eventually.

Speaker 2

05:09

Maybe. With RTA, companies can constantly monitor streams of data, usage patterns, error logs, sensor data, looking for anomalies or signs of trouble.

Speaker 1

05:18

Ah, so they can spot problems before I even notice them.

Speaker 2

05:21

That's the goal. They can potentially identify and even resolve issues proactively, even automatically: maybe reroute traffic, restart a service, or even reach out to you before it becomes a major headache.

Speaker 1

05:32

That sounds much better, moving from reactive firefighting to proactive care.

Speaker 2

05:36

Exactly. It leads to much higher customer satisfaction. It feels like the company is actually looking out for you.

Speaker 1

05:43

Okay, so RTA sounds powerful but also complex. The book introduces this term, the real-time analytics ecosystem, or stack. What is that, in simple terms?

Speaker 2

05:56

Yeah, you'll hear ecosystem, stack, streaming stack. They basically mean the same thing. It's the whole collection of tools, technologies, and processes you use to get from those raw, unending streams of data...

Speaker 1

06:07

To actual insights you can use.

Speaker 2

06:09

Precisely, it's the entire pipeline, all the components working together.

Speaker 1

06:13

And why is understanding that whole picture important?

Speaker 2

06:15

Well, if you're an architect designing these systems, or a developer building the apps, or even an operator keeping it all running, you need to understand how the pieces fit together.

Speaker 1

06:24

To make the right choices about tools and how they connect.

Speaker 2

06:27

Absolutely, it helps you build systems that are robust, scalable, and actually deliver those real time insights effectively.

Speaker 1

06:33

Okay. Now, before diving into the modern stack, the book briefly mentions something called the Lambda architecture. Sounds a bit, I don't know, dated.

Speaker 2

06:43

It is a bit older. Yeah.

Speaker 1

06:44

Yeah.

Speaker 2

06:44

It was kind of an early attempt to deal with having both real-time needs and needing accurate historical analysis on huge data sets.

Speaker 1

06:53

Trying to do both at once?

Speaker 2

06:56

Sort of. It had three layers: a big, slow batch layer for processing all the historical data accurately; a fast speed layer for handling the incoming real-time streams, providing quick, maybe slightly less perfect answers; and a third layer, a serving layer, that would try to merge the results from both the batch and speed layers when you actually queried the system.
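As a rough illustration of the three layers just described, here is a minimal Python sketch. It is not taken from the book: the function names (`batch_view`, `speed_view`, `serve`) and the event shape are hypothetical, and in-memory lists stand in for real storage and streams.

```python
from collections import Counter

def batch_view(historical_events):
    """Batch layer: recompute exact counts from the full, immutable event log."""
    return Counter(e["product"] for e in historical_events)

def speed_view(recent_events):
    """Speed layer: incremental counts for events the batch run has not seen yet."""
    view = Counter()
    for e in recent_events:
        view[e["product"]] += 1
    return view

def serve(batch, speed, product):
    """Serving layer: merge both views at query time."""
    return batch[product] + speed[product]

# Hypothetical data: three events already batch-processed, one fresh event.
historical = [{"product": "book"}, {"product": "book"}, {"product": "pen"}]
recent = [{"product": "book"}]

batch = batch_view(historical)
speed = speed_view(recent)
print(serve(batch, speed, "book"))  # 3
```

Note that the counting logic appears twice, once per layer, which is the kind of duplication the hosts go on to describe as a drawback.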

Speaker 1

07:15

Okay, so it tried to give you fast answers and eventually correct complete answers. What was the upside?

Speaker 2

07:20

The main benefit was that your original raw data was kept safe and sound in the batch layer, so if you messed up your processing logic or wanted to try a new analysis.

Speaker 1

07:29

You could always go back and rerun it on the original data. Exactly, data immutability was a plus. But I sense a "but" coming. The book implies it wasn't the perfect solution. What were the drawbacks?

Speaker 2

07:42

There were quite a few, actually. First, it was complex. You essentially had to build and maintain two separate data pipelines, batch and speed. That's a lot of engineering effort.

Speaker 1

07:52

Double the work, potentially double the problems.

Speaker 2

07:55

Pretty much. Also, many early stream processors relied heavily on the JVM, the Java Virtual Machine, which was fine if you were a Java shop, but maybe less ideal otherwise.

Speaker 1

08:04

Vendor lock-in or skill set mismatch.

Speaker 2

08:07

Yeah, and maybe the biggest headache was often having to write and maintain the same or very similar processing logic in both the batch and the speed layers.

Speaker 1

08:16

Duplication. That sounds like a nightmare for consistency and updates.

Speaker 2

08:20

It really was. Keeping them perfectly in sync was hard, leading to potential inconsistencies in the final results. So yeah, lots of overhead and complexity.

Speaker 1

08:27

Okay, so Lambda served a purpose, but we've moved on. What does a more modern real-time analytics stack look like? What are the essential pieces?

Speaker 2

08:36

Right. The contemporary approach is generally more streamlined. It typically starts with event producers.

Speaker 1

08:40

The things generating the data in the first place.

Speaker 2

08:42

Exactly, systems that detect that something happened, a state change, and fire off an event. Like an order management system sees a new order and generates an order-received event.

Speaker 1

08:52

And that event contains the details, like order ID, customer info, items.

Speaker 2

08:57

All the relevant data. And a key thing here, mentioned in the book, is you really need to benchmark your producers. Make sure they can actually handle the volume and speed of events you expect without becoming a bottleneck. Scalability and latency are critical right from the start.
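To make the order-received event and the advice to benchmark producers concrete, here is a small Python sketch. The field names and the `benchmark_producer` helper are assumptions for illustration, not the book's schema or method, and a plain in-memory list stands in for a real streaming client.

```python
import json
import time
import uuid

def make_order_received_event(order_id, customer_id, items):
    """Build an 'order received' event; field names are illustrative only."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": "order_received",
        "timestamp_ms": int(time.time() * 1000),
        "order_id": order_id,
        "customer_id": customer_id,
        "items": items,
    }

def benchmark_producer(send, n_events):
    """Return events per second achieved when calling send() n_events times."""
    start = time.perf_counter()
    for i in range(n_events):
        event = make_order_received_event(
            f"order-{i}", "customer-1", [{"sku": "A1", "qty": 1}]
        )
        send(json.dumps(event).encode("utf-8"))
    elapsed = time.perf_counter() - start
    return n_events / elapsed

# Stand-in sink: a real producer would hand these bytes to a streaming client.
sent = []
rate = benchmark_producer(sent.append, 10_000)
print(f"{rate:,.0f} events/sec into an in-memory list")
```

Swapping `sent.append` for a function that writes to the actual streaming platform would turn the same loop into a rough end-to-end producer benchmark.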

Speaker 1

09:14

Okay, makes sense. The source needs to keep up. Where do those events go next?

Speaker 2

09:18

They flow into the event streaming platform. This is like the central highway or message bus for all your events. The backbone, yeah, exactly. Its job is to ingest potentially huge volumes of events, store them reliably, usually for some configurable period, and deliver them to whatever needs to consume them. Apache Kafka is probably the most well-known example here.

Speaker 1

09:36

Right, Kafka comes up a lot. What makes a good streaming platform?

Speaker 2

09:41

Key things are scalability: can it handle growth? Fault tolerance: does it lose data if a server fails? High throughput: can it handle a massive continuous flow? And low latency: how quickly does data get through?

Speaker 1

09:56

Got it. So data producers feed events onto this Kafka-like highway. Then what?

Speaker 2

10:02

Then you typically have a stream processing platform. This is where the real-time analysis starts happening.

Speaker 1

10:07

This is where the magic happens.

Speaker 2

10:08

Well, some of it. This is where you take those raw event streams and transform them: maybe enrich them by joining them with other data streams or static data, filter them, aggregate them, run calculations. Basically, turn raw data into intermediate insights.
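The filter, enrich, and aggregate steps mentioned here can be sketched with plain Python generators. This is only an illustration of the idea, not how any particular stream processor implements it; the event fields and the `customers` lookup table are hypothetical.

```python
def filter_events(events, event_type):
    """Keep only events of one type."""
    for e in events:
        if e["type"] == event_type:
            yield e

def enrich(events, customers):
    """Join each event with static reference data (customer -> region)."""
    for e in events:
        yield {**e, "region": customers.get(e["customer_id"], "unknown")}

def running_revenue_by_region(events):
    """Aggregate incrementally, emitting an updated snapshot after every event."""
    totals = {}
    for e in events:
        totals[e["region"]] = totals.get(e["region"], 0) + e["amount"]
        yield dict(totals)

# Hypothetical input: a finite list stands in for an unbounded stream.
stream = [
    {"type": "purchase", "customer_id": "c1", "amount": 20},
    {"type": "page_view", "customer_id": "c2", "amount": 0},
    {"type": "purchase", "customer_id": "c2", "amount": 5},
    {"type": "purchase", "customer_id": "c1", "amount": 10},
]
customers = {"c1": "EU", "c2": "US"}

purchases = filter_events(stream, "purchase")
enriched = enrich(purchases, customers)
snapshots = list(running_revenue_by_region(enriched))
print(snapshots[-1])  # {'EU': 30, 'US': 5}
```

Because each stage is a generator, an event flows through all three steps as soon as it arrives rather than waiting for a batch to fill up.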

Speaker 1

10:22

Can you give examples of tools here?

Speaker 2

10:24

Sure. Popular ones include Apache Flink, which is a powerful stream processing framework. There's also Apache Spark Streaming, which extends the Spark batch engine for streaming, and Kafka Streams, which is a library that lets you build stream processing apps directly on top of Kafka.

Speaker 1

10:38

What are the important features for these stream processors?

Speaker 2

10:41

You need things like good state management, because your analysis often depends on past events; windowing capabilities for doing calculations over specific time periods, like the last five minutes; fault tolerance, obviously, so processing doesn't stop if something breaks; and support for different data formats.
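As a small illustration of windowing and state, here is a Python sketch that counts events in fixed five-minute (tumbling) windows keyed by event timestamp. The names are hypothetical, and real stream processors also have to deal with late or out-of-order events and with persisting this state, which the sketch ignores.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # five-minute tumbling windows

def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Map a timestamp to the start of the window that contains it."""
    return timestamp_ms - (timestamp_ms % size_ms)

def count_per_window(events, size_ms=WINDOW_MS):
    """Count events per window; the dict is the state a processor must keep."""
    counts = defaultdict(int)
    for e in events:
        counts[window_start(e["timestamp_ms"], size_ms)] += 1
    return dict(counts)

# Hypothetical timestamps in milliseconds since some starting point.
events = [{"timestamp_ms": t} for t in (0, 60_000, 299_999, 300_000, 610_000)]
window_counts = count_per_window(events)
print(window_counts)  # {0: 3, 300000: 1, 600000: 1}
```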

Speaker 1

11:00

Okay, so processing happens, insights are generated. How do we actually use them or see them?

Speaker 2

11:04

That's the final piece, usually the serving layer. This is the system that stores the results of your real-time processing and makes them available for querying, fast.

Speaker 1

11:13

This is what applications or dashboards actually talk to?

Speaker 2

11:17

Exactly. It's the primary access point. Now, this serving layer could be a few different types of systems. Like what? It could be a fast key-value store, think MongoDB, maybe Elasticsearch or Redis. These are great if you primarily need to look up results based on a specific key, like getting the current status for a particular user ID.

Speaker 1

11:35

Quick lookups. What's the alternative?

Speaker 2

11:37

The alternative, especially for more complex analytics, is a real-time OLAP database. OLAP stands for online analytical processing.

Speaker 1

11:45

Ah, okay, designed for analysis.

Speaker 2

11:48

Right. Tools like Apache Pinot, Apache Druid, Rockset, or ClickHouse fall into this category. They are built for slicing and dicing data, running aggregations, filtering across lots of dimensions, much more complex queries than just a key lookup.

Speaker 1

12:02

So if I want to see, say, sales trends by region and product category for the last hour, I'd want an OLAP database.
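The difference between the two query shapes can be sketched in Python. This is only an illustration: a dict stands in for a key-value store, and SQLite (from the standard library) stands in for the SQL interface of an OLAP database. SQLite is not a real-time OLAP system, and the table and column names are made up.

```python
import sqlite3

# Key-value style: one precomputed answer per key.
user_status = {"user-42": "active", "user-7": "churn_risk"}
print(user_status["user-42"])  # active

# OLAP style: aggregate across dimensions at query time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, category TEXT, amount REAL, ts_ms INTEGER)"
)
rows = [
    ("EU", "books", 20.0, 1_000),
    ("EU", "books", 10.0, 2_000),
    ("US", "toys", 5.0, 3_000),
    ("US", "books", 7.5, 4_000),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Revenue by region and category for everything at or after a cutoff time.
query = """
    SELECT region, category, SUM(amount) AS revenue
    FROM sales
    WHERE ts_ms >= ?
    GROUP BY region, category
    ORDER BY region, category
"""
result = conn.execute(query, (0,)).fetchall()
print(result)  # [('EU', 'books', 30.0), ('US', 'books', 7.5), ('US', 'toys', 5.0)]
```

The lookup needs to know the key in advance, while the grouped query answers a question nobody precomputed, which is the trade-off the hosts are describing.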

Speaker 2

12:09

Generally, yes, that's where they shine. The crucial thing for any serving layer in this context is speed. You need really fast data ingestion. The results from the stream processor need to show up almost instantly, and query latency needs to be low, often in the milliseconds.

Speaker 1

12:24

Milliseconds again. Wow.

Speaker 2

12:25

Yeah, and it also needs to handle high concurrency, potentially thousands or even hundreds of thousands of queries per second, depending on the application.

Speaker 1

12:32

That's incredible scale. How do you choose between key-value and real-time OLAP beyond just the query type?

Speaker 2

12:38

Well, query type is the main driver, but you also look at how data gets in. Does it support direct streaming ingestion from Kafka or Flink, or do you need an extra step? How fast is that ingestion, really? Can it handle your expected data volume and rate? Does it need complex indexing or pre-aggregation to meet your query speed goals? Lots to consider, definitely. And the book makes a very sensible point: don't just trust the marketing hype.

13:04

Do your own benchmarking with your own data and query patterns. See what actually works best for your specific needs.
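One simple way to act on that advice is to time your own queries and look at latency percentiles rather than averages. The harness below is a minimal Python sketch under that assumption; `fake_query` is a placeholder for a call to the serving layer under test, and the run count is arbitrary.

```python
import statistics
import time

def measure_latencies(run_query, n_runs=200):
    """Call run_query repeatedly and record each call's latency in milliseconds."""
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_query()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms

def summarize(latencies_ms):
    """Report median and tail latencies, which matter more than the mean."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }

def fake_query():
    """Placeholder workload; replace with a real query against your own data."""
    sum(range(10_000))

stats = summarize(measure_latencies(fake_query))
print(stats)
```

A sequential loop like this measures single-client latency only; testing the high-concurrency case mentioned earlier would need many clients issuing queries in parallel.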

Speaker 1

13:10

Test it yourself. Always good advice. Okay, so we have producers, the streaming platform, the processor, the serving layer. How do people, like actual users, see this stuff?

Speaker 2

13:19

Ah, the front end. Good point. If your users are internal, like data analysts or engineers, maybe they query the serving layer directly using SQL or an API.

Speaker 1

13:27

Okay, but for less technical users or external customers?

Speaker 2

13:31

Then you'll likely need a user interface, a front end, and you've got a few options here.

Speaker 1

13:36

What are they?

Speaker 2

13:36

You could go fully custom: build your own web application using standard tools like React.js, Angular, Vue.js. That gives you total control over the look, feel, and functionality.

Speaker 1

13:47

The high-effort, high-control option. What else?

Speaker 2

13:49

Then there are low-code frameworks. Things like Streamlit or Plotly Dash are popular, especially in the Python world. They let you build interactive dashboards and web apps with much less front-end coding effort.

Speaker 1

14:00

Faster development, maybe less customization?

Speaker 2

14:04

Generally, yes. And the third category is data visualization tools, think Apache Superset, Redash, Grafana. These often provide drag-and-drop interfaces to build dashboards directly on top of your data sources, often with no coding required at all.

Speaker 1

14:17

The quickest way to get a dashboard up.

Speaker 2

14:19

Often, yes. So how you choose depends on a few things. What's the front-end coding skill level of your team? How much time do you realistically have? And who are the users: internal experts, or external customers needing a polished experience?

Speaker 1

14:32

A spectrum of choices matching needs and resources. Makes sense. Now, the book also notes that sometimes the lines between these components get fuzzy.

Speaker 2

14:40

Yeah, technology evolves and tools sometimes wear multiple hats. Apache Pulsar is a good example. It's mainly an event streaming platform like Kafka, but it also has built-in capabilities called Pulsar Functions that let you do some lightweight stream processing directly within Pulsar itself.

Speaker 1

14:57

Ah, so the streaming platform is doing some processing tasks.

Speaker 2

15:01

Exactly. It blurs the line a bit between the streaming platform and the stream processing platform. It just shows that these categories aren't always rigid silos. The landscape is pretty dynamic.

Speaker 1

15:11

Okay, that's a great tour through the stack. So wrapping up this main section, what's the big takeaway from the book about building these systems?

Speaker 2

15:18

I think the fundamental message is that embracing real-time analytics isn't just a technical upgrade. It's a strategic move that can give you a serious competitive edge.

Speaker 1

15:28

By making you faster, more informed.

Speaker 2

15:31

And ultimately making more accurate, relevant decisions because you're acting on what's happening now, not what happened yesterday.

Speaker 1

15:38

Fantastic, and this deep dive has really been an introduction. We've touched on the core ideas, the benefits, those key building blocks of the RTA stack, all pulled from the insights in Building Real-Time Analytics Systems.

Speaker 2

15:51

Absolutely, it just scratches the surface, but hopefully gives you a solid foundation.

Speaker 1

15:55

So a final thought for you, the listener, to chew on. Think about your own work, your own organization. Where could real-time analytics unlock something new, maybe a new data product or a way to significantly improve a process you already have?

Speaker 2

16:10

Yeah, ask yourself, what's the current shelf life of your data? Is its value decaying rapidly? What can you gain by acting on it immediately? And maybe think about that stack we discussed: producers, streaming, processing, serving, front end. Which piece might offer the biggest immediate win for your specific situation?

Speaker 1

16:29

Something to definitely consider. Where could that immediate insight make the biggest difference?

Transcript source: Provided by creator in RSS feed.