Hands-on Data Virtualization with Polybase: Administer Big Data, SQL Queries and Data Accessibility

Speaker 1

00:00

Okay, let's unpack this. You've given us, well, quite a stack of sources here, all focused on data virtualization, specifically digging into Microsoft's Polybase tool.

Speaker 2

00:09

That's right. And our mission today really is to cut through all that complexity for you. We want to show you how this kind of technology tackles what's probably the biggest challenge for modern businesses: analyzing these huge, scattered data sets.

Speaker 1

00:25

We're talking big data, IoT streams, data mining stuff. Yeah, massive amounts.

Speaker 2

00:29

Exactly, petabytes of it, and doing it quickly, affordably, and, crucially, without making your team learn every single complex system your company happens to use.

Speaker 1

00:38

Yeah, it starts with just grappling with the sheer size, doesn't it? That gravity of big data. You imagine your data is like a small notebook, manageable, right? But then it explodes. Suddenly it's not a notebook. It's like a million books scattered everywhere. When you hit petabyte scale, your normal ways of working just can't keep up. They get swamped.

Speaker 2

00:57

Absolutely, the systems slow down, costs spiral, it's a mess. The old problem was, if you wanted to analyze that giant library, you literally had to move every single book onto one enormous table, your data warehouse, before you could even start reading.

Speaker 1

01:12

A huge, costly effort up front, totally.

Speaker 2

01:14

So the whole point of virtualization is to find a way to analyze these massive, sprawling data sets fast using sensible resources, without that gigantic data moving phase first.

Speaker 1

01:26

Okay, so that gets us right to the heart of data virtualization. It's about solving that movement problem. But maybe we should quickly touch on why the data is so scattered to begin with. Yeah, why not just one big system for everything?

Speaker 2

01:38

Well, it really boils down to different tools for different jobs. It's this conflict between, say, speed and data integrity. Relational databases, things like SQL Server, they're fantastic for transactional stuff, day-to-day business ops.

Speaker 1

01:50

Updates, deletes, making sure the customer record is accurate.

Speaker 2

01:54

Precisely. They guarantee that integrity, but that comes with overhead, especially when you're just dumping massive amounts of sequential data like logs.

Speaker 1

02:02

Right, for just raw logging or archiving old stuff, you'd look at file systems, maybe HDFS, or cloud storage like Azure blobs.

Speaker 2

02:10

Yeah, they give you super fast reads and writes because they kind of sacrifice that strict integrity and transaction control.

Speaker 1

02:16

Great for archiving, yeah, terrible for updates.

Speaker 2

02:19

Right. I think one source mentioned that updating a single customer record in a file system could mean touching hundreds of different pieces. Slow, costly.

Speaker 1

02:29

Exactly. So now you've got this split. Your really valuable transactional data lives in the relational system, and this huge bulk of sequential, maybe less frequently updated data is out in the file system or the cloud.

Speaker 2

02:43

And you need to join them together for analysis. That's where the pain starts.

Speaker 1

02:46

That's where the movement problem really bites. Think about that analogy your sources used: computer A and computer B.

Speaker 2

02:52

Ah, yes. Computer A has the million entries, the big archive.

Speaker 1

02:56

And computer B has just one thousand entries, maybe the current customers from the relational database. Okay, so historically, the default way to join these: you'd try to move all one million entries from A over to B.

Speaker 2

03:08

Yeah, instantly your network gets hammered. You're trying to ship potentially petabytes across the wire.

Speaker 1

03:14

And then poor computer B is struggling to even store it all, let alone process it.

Speaker 2

03:18

Exactly. Its CPU, memory, disk, everything is strained trying to sift through all this data you mostly don't even need, just to find maybe ten relevant records. It's just incredibly inefficient.

Speaker 1

03:30

Slow, expensive, duplicates data.

Speaker 2

03:32

Yeah. So data virtualization, and specifically the thinking behind Polybase, just flips that whole idea on its head. The smart way, the efficient way, is to move the small data, those thousand entries from B, over to the large environment on A.

Speaker 1

03:48

Ah okay, send the query to where the data lives.

Speaker 2

03:51

Precisely. You push the query logic, the filtering, the joining, down to the system that has the bulk of the data and the resources to handle it, do the work there, and then you only bring back the final, small, relevant result set to B.

Speaker 1

04:03

Got it. So no network saturation, no data duplication on B, and you let the heavy-duty system do the heavy lifting.

Speaker 2

04:09

You got it. It's a fundamental shift in where the computation happens.
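
To make the computer A and computer B arithmetic concrete, here is a minimal Python sketch. It is a toy model, not anything Polybase runs: the table contents, the key layout, and the helper name join_on_key are all assumptions made up for illustration. It simply counts how many rows would cross the network under each strategy.

```python
# Toy model of the "computer A / computer B" join from the discussion.
# All names and sizes are illustrative assumptions, not measurements.

def join_on_key(big_rows, small_rows):
    """Join two lists of (key, payload) tuples on key."""
    small_index = {key: payload for key, payload in small_rows}
    return [(key, payload, small_index[key])
            for key, payload in big_rows if key in small_index]

# Computer A holds 1,000,000 archive entries; computer B holds 1,000 customers.
archive_on_a = [(i, f"event-{i}") for i in range(1_000_000)]
customers_on_b = [(i * 1_000, f"customer-{i}") for i in range(1_000)]

# Strategy 1: ship every archive row from A to B, then join on B.
rows_shipped_naive = len(archive_on_a)

# Strategy 2: ship B's small table to A, join on A, ship back only the result.
result = join_on_key(archive_on_a, customers_on_b)
rows_shipped_pushdown = len(customers_on_b) + len(result)

print(rows_shipped_naive, rows_shipped_pushdown)
```

In this made-up setup the second strategy moves a few thousand rows instead of a million, which is the shift in where the computation happens that the speakers describe.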

Speaker 1

04:13

That efficiency gain is huge, and that leads us neatly into Polybase itself, because this is where that technical integration gets really clever. What's the core promise of Polybase for someone using, say, SQL Server?

Speaker 2

04:28

The promise is really about seamless power through familiarity. That's the key. Polybase lets you query pretty much any external data source, HDFS, Azure blobs, even other databases, using the tool you already know, SQL Server, and the language you already know, T-SQL.

Speaker 1

04:43

Okay, so I write my standard T-SQL query.

Speaker 2

04:46

Yep. But here's the clever bit.

Speaker 1

04:47

Yep.

Speaker 2

04:47

While you're writing familiar T-SQL, Polybase is working behind the scenes, translating that query and leveraging the native capabilities of that external system, especially things like its parallel processing power or its optimized storage access.

Speaker 1

05:00

So it's like a universal translator for data queries. You speak T-SQL, and Polybase figures out how to ask the question in Hadoop-speak or whatever.

Speaker 2

05:08

That's a great way to put it. Yeah, it handles that translation and execution.

Speaker 1

05:11

Now, this wasn't an overnight thing. You mentioned it has roots in Microsoft's earlier efforts, particularly with Parallel Data Warehouse.

Speaker 2

05:17

Absolutely, essential context. Polybase was officially announced, I think it was November twenty twelve, at the SQL PASS Summit. But the underlying tech, the architecture, really relied on the groundwork laid by Parallel Data Warehouse, or PDW, which came out back in twenty ten.

Speaker 1

05:32

And the sources really emphasize how quickly that PDW tech evolved. PDW version two in twenty thirteen, the performance jump was apparently staggering.

Speaker 2

05:42

It was revolutionary. We're talking like one hundred times faster query performance compared to v1. That's not incremental, that's a different class of machine.

Speaker 1

05:50

Wow.

Speaker 2

05:51

And at the same time they slashed the price per petabyte. It proved that this massively parallel processing, or MPP, architecture, which is the foundation for Polybase too, was really the way forward for handling big data within a relational database context.

Speaker 1

06:05

And that investment paid off. By SQL Server twenty sixteen, Polybase wasn't just a PDW thing. It was generally available in standard SQL Server editions.

Speaker 2

06:13

Yeah, that move really cemented it. Microsoft was clearly using the same core codebase, bringing that big data query power to its mainstream database product.

Speaker 1

06:20

Which brings us nicely to the technical secret sauce: this idea of pushdown computation.

Speaker 2

06:25

Yes, this is critical to understanding why Polybase is so much better than the older ways.

Speaker 1

06:30

We touched on the older ways failing, specifically SQL Server's linked servers. Can you elaborate on how they fell short with big data?

Speaker 2

06:38

Sure, they were, let's just say, not very optimized for remote filtering. If you wrote a standard query like select from my linked server's table where filter column equals ten, the linked server wouldn't push that where filter column equals ten part down to the remote system. It would actually read the entire table from the remote source. The whole thing.

Speaker 1

06:57

Even if it was billions of rows?

Speaker 2

06:59

The whole thing. Pulled it all across the network, then applied the filter locally on your SQL Server instance.

Speaker 1

07:04

Oh my goodness. So if I just wanted ten records, I might still be pulling gigabytes or terabytes across the network first?

Speaker 2

07:11

Precisely. A guaranteed performance killer for large data sets. To actually force the filter to run remotely, you had to jump through hoops using things like OPENQUERY, embedding your remote query as a string. It was awkward, error prone, and didn't scale well for complex logic. Right?
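
The awkwardness of embedding a remote query as a string is easier to see written out. The following Python sketch only builds the two T-SQL statements as text; it does not connect to anything. OPENQUERY is a real T-SQL function, but the linked server, database, table, and column names here are hypothetical, and the two helper functions are invented for this illustration.

```python
# Builds the two query shapes discussed above as plain strings.
# MyLinkedServer, SalesDb, dbo.BigTable and the column names are hypothetical.
# String formatting like this is for illustration only; it is not a safe way
# to build SQL from untrusted input.

def four_part_query(linked_server, database, schema, table, where):
    """Four-part-name query: the filter is written on the local side."""
    return (f"SELECT * FROM [{linked_server}].[{database}].[{schema}].[{table}] "
            f"WHERE {where};")

def openquery_query(linked_server, remote_sql):
    """OPENQUERY form: the whole remote query travels as a string literal,
    so every single quote inside it has to be doubled."""
    escaped = remote_sql.replace("'", "''")
    return f"SELECT * FROM OPENQUERY([{linked_server}], '{escaped}');"

print(four_part_query("MyLinkedServer", "SalesDb", "dbo", "BigTable",
                      "FilterColumn = 10"))
print(openquery_query("MyLinkedServer",
                      "SELECT * FROM SalesDb.dbo.BigTable WHERE Region = 'West'"))
```

The doubled quotes in the second statement are a small taste of why nesting anything more complex inside that string gets error prone quickly.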

Speaker 1

07:29

That sounds painful. So the magic, as the sources call it, of predicate pushdown in Polybase, it fixes that?

Speaker 2

07:37

It fixes exactly that. Polybase enables intelligent pushdown. The query optimizer looks at your T-SQL query and figures out which parts, the filtering predicates, maybe some joins, maybe aggregations, can actually be executed on the external data source itself.

Speaker 1

07:50

So it does the work remotely.

Speaker 2

07:52

And then it only brings back the much smaller, pre-filtered, maybe even pre-aggregated result set. You only get the ten customers you asked for. The billion stay put.
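
Predicate pushdown can be sketched in a few lines of Python. This is a simulation under stated assumptions, not Polybase's optimizer: the RemoteSource class, its scan method, and the sample rows are all hypothetical. The only point is to count how many rows leave the remote side when the filter runs locally versus remotely.

```python
# Simulation of filtering locally versus pushing the predicate to the source.
# RemoteSource and its API are invented for this sketch.

class RemoteSource:
    def __init__(self, rows):
        self._rows = rows
        self.rows_transferred = 0  # rows that "cross the network"

    def scan(self, predicate=None):
        """Yield rows; if a predicate is given, apply it before transfer."""
        for row in self._rows:
            if predicate is None or predicate(row):
                self.rows_transferred += 1
                yield row

rows = [{"id": i, "region": i % 100} for i in range(100_000)]

def wanted(row):
    return row["region"] == 10

# No pushdown: pull every row, then filter on the local side.
no_pushdown = RemoteSource(rows)
local_result = [row for row in no_pushdown.scan() if wanted(row)]

# Pushdown: the source applies the predicate and returns only matches.
with_pushdown = RemoteSource(rows)
remote_result = list(with_pushdown.scan(predicate=wanted))

print(no_pushdown.rows_transferred, with_pushdown.rows_transferred)
```

Both approaches produce the same answer; what changes is how many rows had to travel to get it.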

Speaker 1

08:01

That must completely change the game for data warehousing.

Speaker 2

08:03

Oh, massively. Think about traditional ETL: extract, transform, load. Huge, complex processes, often running for hours overnight just to move and reshape data before you can even query it.

Speaker 1

08:15

Right, the daily or nightly load window.

Speaker 2

08:17

Polybase lets you potentially bypass a lot of that heavy lifting. You don't necessarily need to physically load all the external data into the warehouse first. You can query it in place. You focus on the analysis, the queries, the calculations, not the complex data plumbing, all through one connection point in SQL Server. And I guess this pushdown benefit is amplified by parallelism? Definitely. Polybase, especially when you set up scale-out groups with multiple SQL Server nodes, is designed for

08:45

parallel data transfer. It can read data from multiple nodes in a Hadoop cluster or multiple partitions in cloud storage simultaneously.

Speaker 1

08:52

So it's pulling data in parallel, processing parts of the query in parallel on the remote system?

Speaker 2

08:58

Exactly. That parallel operation capability is a hallmark of those high-end MPP systems, and Polybase brings that capability to SQL Server interacting with external data.
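
The parallel idea can be illustrated with a small Python sketch using the standard library's thread pool. It is an analogy only: the partitions are in-memory lists standing in for Hadoop nodes or cloud storage partitions, and partial_aggregate is a hypothetical stand-in for work done next to the data. Each partition is filtered and aggregated independently, and only small partial results are combined.

```python
# Analogy for parallel reads with per-partition partial aggregation.
# The partitions and the helper are invented for this sketch.
from concurrent.futures import ThreadPoolExecutor

# Eight "partitions" of 1,000 values each, covering 0..7999.
partitions = [list(range(start, start + 1_000))
              for start in range(0, 8_000, 1_000)]

def partial_aggregate(partition):
    """Filter and aggregate one partition; return only (count, sum)."""
    matching = [value for value in partition if value % 2 == 0]
    return len(matching), sum(matching)

# Process the partitions concurrently, then combine the small partials.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_aggregate, partitions))

total_count = sum(count for count, _ in partials)
total_sum = sum(subtotal for _, subtotal in partials)

print(total_count, total_sum)
```

Only eight (count, sum) pairs are combined at the end, rather than eight thousand raw values.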

Speaker 1

09:07

Fantastic. Okay, let's broaden the view a bit. Section four: Polybase within the wider modern data ecosystem. Interoperability seems key here.

Speaker 2

09:15

It really is its main strength. It has those native, highly optimized connectors for the big ones: Hadoop, HDFS, and Azure Blob Storage using the WASB protocol.

Speaker 1

09:25

WASB. That's Windows Azure Storage Blob?

Speaker 2

09:27

Yep. The standard way to talk to Azure blobs for a while.

Speaker 1

09:30

But what about everything else? The world isn't just Microsoft and Hadoop. What if you need data from, say, Cassandra or MongoDB, or even other relational systems like MySQL or PostgreSQL?

Speaker 2

09:42

Good question. For many of those other systems, Polybase relies on ODBC drivers, Open Database Connectivity. It's like a standard adapter.

Speaker 1

09:51

Okay, ODBC. So you can still use T-SQL?

Speaker 2

09:53

Yes, and that's the big win. You still get to query those diverse sources using familiar T-SQL from within SQL Server. Huge for developer productivity and ease of adoption.

Speaker 1

10:02

But there's always a but, isn't there? What's the trade-off with ODBC?

Speaker 2

10:05

Well, using a generic bridge like ODBC can sometimes introduce a bit of overhead, and more importantly, it can sometimes limit those powerful pushdown capabilities we just talked about.

Speaker 1

10:15

Uh, so the intelligence might not always translate perfectly through the ODBC layer.

Speaker 2

10:20

Exactly. One of your sources had a perfect, if slightly worrying, example with MySQL. A simple COUNT aggregation failed to push down because of some subtle difference in how white space was handled by the ODBC driver.

Speaker 1

10:33

Seriously? A COUNT query failed to push down?

Speaker 2

10:35

Yeah, and the workaround? They had to explicitly disable pushdown for that query, meaning it fell back to pulling more data than necessary.
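
For reference, SQL Server documents two query hints for controlling this per query: OPTION (FORCE EXTERNALPUSHDOWN) and OPTION (DISABLE EXTERNALPUSHDOWN). Whether the source used exactly this hint is an assumption. The Python sketch below just appends the hint to a query string; the helper name and the external table dbo.ExternalOrders are hypothetical.

```python
# Appends a pushdown query hint to a T-SQL statement held as a string.
# The helper and dbo.ExternalOrders are invented for this illustration.

def with_pushdown_hint(query, pushdown):
    """Return the query with FORCE or DISABLE EXTERNALPUSHDOWN appended."""
    hint = "FORCE EXTERNALPUSHDOWN" if pushdown else "DISABLE EXTERNALPUSHDOWN"
    base = query.rstrip().rstrip(";")
    return f"{base} OPTION ({hint});"

count_query = "SELECT COUNT(*) FROM dbo.ExternalOrders;"
print(with_pushdown_hint(count_query, pushdown=False))
```

Disabling pushdown this way trades performance for correctness on the one query that misbehaves, while other queries keep their default behavior.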

Speaker 1

10:44

So while ODBC gives you broad connectivity, you might occasionally lose some of that peak performance or intelligent push down you get with the native connectors.

Speaker 2

10:52

That's the trade-off, essentially. It highlights why systems with truly native, deeply integrated connections often perform best.

Speaker 1

10:59

Speaking of best performers, the sources bring up Teradata quite a bit as a kind of gold standard in this MPP world.

Speaker 2

11:06

Yeah, Teradata is often seen as the benchmark, especially for petabyte-scale warehousing. Their architecture goes way back, nineteen seventy nine. They really pioneered many of these MPP concepts, like shared-nothing architecture.

Speaker 1

11:19

So they've been doing native pushdown and parallel data movement for decades?

Speaker 2

11:24

Pretty much. Their maturity and optimization are their big strengths. Polybase is, in many ways, bringing those proven MPP concepts, refined over years by systems like Teradata, into the more mainstream SQL Server ecosystem.

Speaker 1

11:37

Makes sense. And shifting to the cloud, Polybase is vital there too, right? Connecting SQL Server to cloud storage?

Speaker 2

11:41

Absolutely essential. We mentioned Azure blobs via WASB. It also supports reading from Azure Data Lake Store, both Gen one and Gen two.

Speaker 1

11:50

Does it use the newer native protocols for ADLS Gen two, like ABFS?

Speaker 2

11:53

Often it still relies on the WASB protocol compatibility layer, even for Gen two, depending on the specific SQL Server or Synapse version. The native ABFS support is getting better, but WASB is often the fallback.

Speaker 1

12:07

And another key point mentioned was read versus write right now.

Speaker 2

12:10

Yes, that's an important current limitation to be aware of, especially in cloud scenarios like Azure Synapse Analytics. Polybase is primarily fantastic for reading data from these external sources, Hadoop, ADLS, blobs, but writing data back out to them via Polybase is often not supported or more limited. It's mainly a consumption tool, a virtualization tool, not necessarily a two-way synchronization engine yet.

Speaker 1

12:32

Okay, good clarification. So let's tie this all together. Why should you, our listener, really care about this? What are the killer real world use cases?

Speaker 2

12:41

Well, there are two immediate ones that jump out, offering huge savings and capabilities. First, aging and archiving.

Speaker 1

12:47

Moving old data out of expensive databases?

Speaker 2

12:50

Exactly. Think about old log files, transaction history older than, say, five years, data you need to keep for compliance but don't query often. You can set up partitioning in SQL Server to automatically move those old partitions to cheaper storage, HDFS, Azure Data Lake. And the beauty is Polybase makes that archived data still look like it's part of the original table. Your legacy applications can query it using the same T-SQL, no code changes needed. Instant cost savings on primary storage.
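
One common way to make archived rows look like part of the original table is a view that unions the hot table with the archive. The Python sketch below shows that pattern with the standard library's sqlite3 module, purely as a stand-in: SQLite is not SQL Server, the archive here is an ordinary table rather than a Polybase external table, and the table, view, and column names are hypothetical.

```python
# Stand-in for the archive pattern: applications query one name ("orders"),
# while rows physically live in a hot table and an archive table.
# SQLite replaces SQL Server here; all object names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_hot     (order_id INTEGER, order_year INTEGER, amount REAL);
    CREATE TABLE orders_archive (order_id INTEGER, order_year INTEGER, amount REAL);
    INSERT INTO orders_hot     VALUES (3, 2024, 30.0), (4, 2025, 40.0);
    INSERT INTO orders_archive VALUES (1, 2015, 10.0), (2, 2016, 20.0);
    CREATE VIEW orders AS
        SELECT order_id, order_year, amount FROM orders_hot
        UNION ALL
        SELECT order_id, order_year, amount FROM orders_archive;
""")

# The application's query does not know or care where each row lives.
totals = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
old_ids = conn.execute(
    "SELECT order_id FROM orders WHERE order_year < 2020 ORDER BY order_id"
).fetchall()

print(totals, old_ids)
conn.close()
```

In a real Polybase setup the archive side would be an external table over HDFS or cloud storage, but the query the application sends stays the same.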

Speaker 1

13:18

That's incredibly practical. Okay, what's the second big one?

Speaker 2

13:21

The second one is maybe more transformational: creating those three hundred and sixty degree customer views, especially for things like AI and machine learning.

Speaker 1

13:27

Combining different data types.

Speaker 2

13:29

Right, imagine joining your core customer data from your relational database, names, addresses, purchase history, with massive unstructured or semi-structured data streams. Right, like what? Web clickstream data, social media interactions, maybe sensor data from devices, even anonymized location data, stuff that lives outside your traditional database. Polybase lets you

13:52

bring all that disparate data together virtually. You can then run ML models across that unified view to do really powerful things: customers with incredible accuracy, predict churn, detect fraud, personalize offers, things you just couldn't do easily when the data was siloed.
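
As a rough picture of what a unified view feeds into a model, here is a small Python sketch that joins relational-style customer records with clickstream events to build per-customer features. Everything in it is made up for illustration: the sample records, the field names, and the choice of features are assumptions, and no actual model is trained.

```python
# Joins customer records with clickstream events into per-customer features.
# All records, field names and features are hypothetical sample data.
from collections import Counter

customers = [
    {"customer_id": 1, "name": "Ada", "purchases": 12},
    {"customer_id": 2, "name": "Grace", "purchases": 1},
]
clickstream = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 1, "page": "/checkout"},
    {"customer_id": 2, "page": "/cancel"},
]

# Aggregate the event stream per customer before joining.
clicks = Counter(event["customer_id"] for event in clickstream)
visited_cancel = {event["customer_id"] for event in clickstream
                  if event["page"] == "/cancel"}

# One feature row per customer, combining both sources.
features = [
    {**customer,
     "clicks": clicks.get(customer["customer_id"], 0),
     "visited_cancel": customer["customer_id"] in visited_cancel}
    for customer in customers
]

for row in features:
    print(row)
```

A churn or fraud model would consume rows like these; the virtualization layer's job is to make the join possible without first copying both sources into one place.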

Speaker 1

14:09

That opens up a lot of possibilities. Yeah, so wrapping things up, then, what's the big takeaway here?

Speaker 2

14:13

The big takeaway is that data virtualization, with tools like Polybase leading the charge in the Microsoft world, fundamentally changes the role of the relational database.

Speaker 1

14:21

It's not just a container anymore.

Speaker 2

14:24

Exactly. It becomes more of a central hub and analytical control plane. By giving you familiar T-SQL access to these vast, varied external data sets, and using clever tech like predicate pushdown to do it efficiently, it saves potentially huge amounts of time and money.

Speaker 1

14:39

Less complex ETL, less need for specialized skills for every single data source.

Speaker 2

14:44

Precisely, it makes leveraging diverse data much more accessible.

Speaker 1

14:47

Okay, a powerful shift. So we've seen how Polybase makes reading and linking data from dozens of sources much easier. It makes dealing with massive static files almost trivial compared to the old ways.

Speaker 2

14:58

Yeah, the read side is pretty well tackled.

Speaker 1

15:01

But you alluded to the difficulty of updating data in those distributed file systems earlier, which leaves us with a final thought for you to chew on. If Polybase has made reading and analyzing virtualized data so seamless, how long will it be until that other major big data headache, the complexity and cost of ensuring real-time consistency and updates across all these different virtualized sources, is also virtualized away just as elegantly?

15:24

When can we update that archive record as easily as we can query it?

Speaker 2

15:28

That's the multi-billion-dollar question, isn't it? How do you handle distributed transactions and consistency at scale in a virtualized world? That's the next frontier.

Transcript source: Provided by creator in RSS feed
