Hello, and welcome to the Microsoft Community Insights podcast, where we share insights from community experts on Azure. My name is Nicholas and I'll be your host.
Today in this podcast we will dive into data platforms, but before we get started, I just want to remind you to follow us on social media so you don't miss an episode, and to help us reach more amazing people like yourself. Today we have a special guest, Michael Tobin. Sorry if I pronounce it wrong. Can you start by introducing yourself, please?
Yeah, hi everyone, my name is Michael Tobin. I'm an Azure consultant at ANS, so I'm primarily responsible for the delivery of infrastructure on Azure. Pretty much all things infrastructure: landing zones, PaaS services, migrations, and data platforms specifically as well.
Which is what we're going to be talking about today a little bit. Okay, brilliant.
So, before we get started, for those who don't know what a data platform is, can you briefly explain what it is?
Yeah, of course. So a data platform is essentially a set of technologies. It's not specific to Azure or anything like that; you could have a data platform in AWS, you could have one in GCP. But it's a set of technologies that are designed to manage, process, store and analyze large volumes of data.
That data could be structured, it could be semi-structured, it could be unstructured, and it's really to get analytics based on the data set that you're ingesting into the platform.
Okay, so how does that differ from a traditional database?
Yes. So the difference with an end-to-end platform is that a database might be a source for your data platform, and then you want to basically take that data.
You want to do a process called ETL, which stands for extract, transform and load. Sometimes, if your data is raw or unstructured, it might not be useful to report on, so you might not be able to get the analytics you want out of it.
So ETL involves extract, which is taking your data from source systems and putting it into a staging area for your data platform, and transformation, which is running logic and business rules against the data. That could be SQL queries.
It could be things that ensure the data is in the correct structure, because there might be issues with the data you're ingesting. And then load, which is where you move that transformed data into a target repository. If you're looking at something like a data lake, that could be the next layer in your data lake, for example. Okay, great.
So, in your experience, what are the key components required in a data platform?
Yeah, absolutely. So one of your key areas is ingestion. You need some kind of ingestion; typically that's using what's called an Azure integration runtime or a self-hosted integration runtime, which allows you to pull your data either from Azure or an on-premise source.
It doesn't have to be on-premise; it could also be other cloud providers with a self-hosted integration runtime. Another component is storage. In Azure that's blob storage, or something called Azure Data Lake Storage Gen2.
And that's essentially large-scale data storage that's built on blob storage and allows you to containerize data into different layers. Now, there are different concepts around that. Generally, day to day, I see the medallion architecture, which is a Databricks model where you've got
the bronze, silver and gold layers: three containers in a blob storage account, typically. Bronze is your raw data ingestion. Silver is your first level of ETL, where you've filtered, cleaned and augmented the data. And then gold is where you've got business-level reporting.
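To make the medallion flow just described a bit more concrete, here is a minimal sketch in Python using pandas. The column names and data are hypothetical, not from the episode; in a real platform each stage would read from and write to its own bronze, silver or gold container rather than in-memory DataFrames.

```python
import pandas as pd

# Bronze: raw ingested data, exactly as it arrived from the source system.
bronze = pd.DataFrame({
    "customer": ["Alice", "Bob", "Bob", None],
    "amount": ["10.5", "20", "20", "7"],
})

# Silver: first level of ETL - filter out bad rows, deduplicate, fix types.
silver = bronze.dropna(subset=["customer"]).drop_duplicates().copy()
silver["amount"] = silver["amount"].astype(float)

# Gold: business-level reporting, e.g. total spend per customer.
gold = silver.groupby("customer", as_index=False)["amount"].sum()
print(gold)
```

Each step only ever reads from the previous layer, which mirrors the bronze-to-silver-to-gold progression described above.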
Okay. So earlier, when you introduced yourself, you mentioned you work on landing zones. So what's the difference between a typical landing zone and a data platform landing zone?
Yeah, so generally, there are some data landing zone architectures, but we recommend having a landing zone in place before you look to do a data platform.
It's really important, especially when you're ingesting data from on-premise, that you've got your hybrid connectivity set up and your private DNS is set up correctly, whether that's a Private DNS Resolver or domain controllers.
We strongly recommend having a landing zone. There's a big focus on data at the minute, and obviously, with AI coming up, it's really important to have that data processed and governed at a business level, as we call it. So it's really important to have that.
And some people come into projects like this without a landing zone and say, "Oh, I just want a data platform," but it's really important to have that key landing zone deployment done in the first place, not just for the hybrid connectivity element, but to have a good model of role-based access control, your firewall and
things like that. So it is really important to have a landing zone in place when you come to look at deploying a modern data platform and the elements of it. Okay, brilliant.
So how do you do that? For example, when you create the landing zone, do you just create it with infrastructure as code, like Terraform and Bicep, and then put all those data resources, like data lakes and so on, within the landing zone?
So, we typically do it like this. There's not one size fits all, but what we generally recommend is that you have dev, test and prod environments, and we typically recommend separating them out into different subscriptions.
So they would probably sit under the Corp management group, and then you'd have a subscription per environment for your data platform. So it fits into the landing zone, but in terms of the resource deployment, we usually keep that separate from landing zones. Okay, brilliant.
So security is very important for a data platform as well. What are the best practices for securing your Azure data platform when you create those?
Yeah, absolutely. It kind of splits into two parts for me, and obviously there's a lot more than just two parts, but at a very high level: networking security and a role-based access control strategy.
So when it comes to networking security, pretty much all the resources that I see in a typical data platform can be connected up with a private endpoint. A storage account can have a private endpoint. Generally we see key vaults used in data platforms to store things like connection strings to databases, and obviously that can be secured with a private endpoint.
Azure Synapse has four different private endpoints attached to it: three for the back end and one for the front-end UI.
So if you want, you can have that only accessed from your internal network, which we strongly recommend. Synapse can be exposed to the internet, and it is possible to do; if you have a good RBAC strategy and things like that, it's not too much of a worry, but generally we recommend locking that down.
Private endpoints are really the key here, along with making sure they're configured correctly and public access is turned off. Obviously it comes in with policy as well and your governance strategy, so generally we recommend having the Azure policy for denying public access to PaaS resources turned on.
That's a really important one here, because a lot of data platforms are based on these PaaS resources. So again, storage accounts, Key Vault, Synapse: they're all PaaS resources and they all have public access enabled by default. So locking that down is really key. It's really important.
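As a rough illustration of the "deny public access" check described here (this is not the Azure Policy engine itself), a compliance audit over a resource inventory might look like the sketch below. The resource names are hypothetical, and real data would come from Azure Resource Graph or the SDK rather than a hard-coded list.

```python
# Hypothetical inventory of PaaS resources, shaped loosely like ARM resource
# properties; in practice this would be queried from Azure, not hard-coded.
resources = [
    {"name": "stdatalake01", "type": "Microsoft.Storage/storageAccounts",
     "publicNetworkAccess": "Enabled"},
    {"name": "kv-platform", "type": "Microsoft.KeyVault/vaults",
     "publicNetworkAccess": "Disabled"},
    {"name": "syn-analytics", "type": "Microsoft.Synapse/workspaces",
     "publicNetworkAccess": "Enabled"},
]

def non_compliant(inventory):
    """Return names of resources that still allow public network access.

    Treats a missing setting as 'Enabled', matching the point above that
    these services are public by default."""
    return [r["name"] for r in inventory
            if r.get("publicNetworkAccess", "Enabled") != "Disabled"]

print(non_compliant(resources))
```

Here the storage account and the Synapse workspace would be flagged, while the key vault with public access disabled passes.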
The second thing is thinking about your audience and who's going to use your data platform. Do you have data engineers? Do you have data architects? It's about defining personas and then making sure you build out a role-based access control strategy that fits those different personas.
So, does a data engineer need access to a database, for example? Probably. Do they need access to Synapse? Probably. Do they need access to a key vault? Probably not; they might not need to add credentials in, whereas an administrator might need to do that.
So it's just about making sure the roles are defined and what everyone's doing, and building out RBAC groups in Entra. And then, further on top of that, if you've got P2 licenses, integrating that with PIM, integrating that with access packages and things like that.
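The persona reasoning above, which users need which services, could be sketched as a simple mapping. The role names below are illustrative Azure built-in role names chosen for the example, not a definitive design from the episode.

```python
# Hypothetical persona-to-role mapping, echoing the examples in the discussion.
# In practice these would be Entra groups with role assignments, not a dict.
personas = {
    "data_engineer": {"Synapse Contributor", "Storage Blob Data Contributor"},
    "data_architect": {"Reader"},
    "platform_admin": {"Key Vault Administrator", "Contributor"},
}

def can_access_key_vault(persona):
    """A data engineer probably doesn't need Key Vault; an admin might."""
    return any("Key Vault" in role for role in personas.get(persona, set()))

print(can_access_key_vault("data_engineer"))  # False
print(can_access_key_vault("platform_admin"))  # True
```

The value of writing it down like this is that the question "does this persona need this service?" gets answered once, per persona, instead of per user.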
So it's really important to have a good role-based access control strategy, as well as your networking strategy. Yeah, because I know that a data platform involves a large amount of data.
I think it's very crucial that you secure your data. And you also need scaling features, to scale it up or scale it on demand according to the organization. Yeah.
So in terms of scaling, depending on what tools you're using, you've got a lot of different options. When it comes to Synapse, you've got dedicated SQL pools and you've got serverless ones. The serverless ones will scale; for your dedicated ones, you generally need to define the SKU yourself.
When it comes to other tools like Databricks, you've got clusters, and you can set a minimum and maximum number of workers, which are just virtual machines essentially, and you can define the SKU as well.
So for things like the ETL processes I mentioned at the start, some smaller ETL processes probably don't need too much compute, and clusters and things like that allow you to account for that. The serverless options are probably the best way to go in terms of cost saving.
Databricks has just brought serverless compute into public preview in Azure. That's getting rolled out pretty soon, so that'll really help; it'll probably save a lot of costs when it comes to scaling.
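The min/max workers idea mentioned for Databricks clusters might look something like the sketch below, shaped loosely like the JSON payload the Databricks Clusters API accepts. The runtime label, node type and worker counts are illustrative assumptions, not recommendations from the episode.

```python
import json

# A sketch of an autoscaling cluster definition. Field names follow the
# Databricks cluster spec loosely; values here are purely illustrative.
cluster_spec = {
    "cluster_name": "etl-small",
    "spark_version": "15.4.x-scala2.12",   # assumed runtime label
    "node_type_id": "Standard_DS3_v2",     # the worker VM SKU
    "autoscale": {
        "min_workers": 1,                  # scale down for small ETL jobs
        "max_workers": 4,                  # cap cost for bursty workloads
    },
}

print(json.dumps(cluster_spec, indent=2))
```

The point of the `autoscale` block is exactly what's described above: small ETL processes idle at one worker, and the cluster only grows toward the cap when the workload demands it.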
Yeah. So for those who don't know what ETL is, do you want to explain it for the viewers?
Yeah, of course. I did touch on it briefly before, but it's essentially a process, and it stands for extract, transform and load. Extract is pulling data from a source system; that could be ingesting tables from a database into a storage account.
The transformation process is usually defined by data engineers, and they'll typically work with different use cases to shape that data in a certain way.
So it could be doing things like validation. An example I've got: you could pull in a table which has loads of people's postcodes in, but someone might have put two spaces in a postcode, which gives incorrect data.
So it's taking data like that and just making sure it's in a fit shape. And then load is taking that data that's been cleaned up and storing it somewhere else. That could be a data warehouse, it could be a data mart, it could be a different type of storage system.
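The postcode example just given might look like this in a transform step. The normalization rule here is a simplified sketch, collapsing repeated whitespace and uppercasing, not a full UK postcode validator.

```python
import re

def clean_postcode(raw):
    """Collapse repeated whitespace and uppercase, e.g. 'sw1a  1aa' -> 'SW1A 1AA'."""
    return re.sub(r"\s+", " ", raw.strip()).upper()

# Hypothetical raw rows, including the double-space problem described above.
rows = ["SW1A  1AA", " m1 2ab ", "LS1 4DT"]
cleaned = [clean_postcode(p) for p in rows]
print(cleaned)  # ['SW1A 1AA', 'M1 2AB', 'LS1 4DT']
```

A real transform stage would apply rules like this across every validated column, not just one field.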
Again, usually what we see is the data comes into the storage account in a container, it then moves to a different container, and then a third time, after the last level of ETL, it moves into another container.
So three containers in the storage account, bronze, silver and gold, and as it goes through that ETL process, each time it progresses from bronze to silver to gold. Yeah, because I know where I work,
we currently use Purview. Could you use Purview in a data platform?
Yeah, so Purview and data platforms go hand in hand. There are connections from Purview to things like Synapse, and connections from Purview to data storage as well.
Obviously, it's a massive thing these days, and what I see used the most when it comes to Purview is the data classification feature, so you're classifying the data that's been ingested into the platform. Purview is a massive product and it encompasses a lot of other things; you've got data loss prevention and things like that.
But yeah, we see a lot of Purview deployments go hand in hand with data platforms, and you can also hook Purview up to your source systems that are on-premise, so it's not just limited to Azure. That's really handy. I just touched on it briefly earlier,
but you've got these virtual machines called self-hosted integration runtimes, and when it comes to Data Factory and Synapse, their function is to pull data into your platform. Now, Purview has self-hosted integration runtimes as well, but they have a slightly different job.
Rather than pulling data in, they're used to scan on-premise data, so you can categorize your on-premise data as well.
So we do see those go hand in hand with data platforms, and that's becoming really useful for organizations who need to categorize their on-premise databases and source systems as well. Yeah, because I know, when you categorize your resources or databases, you always use labels in Purview.
Yeah, exactly. Very helpful. Okay, so what's the best way to monitor a data platform, performance-wise?
Yes, so with these being PaaS services, they are pretty scalable, but there's a lot of monitoring you can do, especially with things like Databricks.
There are a lot of logs you can ingest into Log Analytics workspaces, which we generally recommend centrally managing by having a single Log Analytics workspace in your landing zone and ingesting your data in there. Generally, we see logs like pipeline triggers: who's running pipelines? Have they been triggered automatically?
That can be really useful to monitor who's kicking off what pipelines. Is anyone manually kicking off ETL pipelines or processes? Like I said, in terms of scalability, there isn't too much you need to do, because a lot of these are based on PaaS resources.
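The pipeline-trigger question raised here, who is kicking off runs manually, could be sketched over exported log records like this. The record shape below is hypothetical, not the actual Log Analytics schema; in practice you'd run a query in the workspace rather than Python over a list.

```python
from collections import Counter

# Hypothetical pipeline-run records, loosely modelled on what a Data Factory
# or Synapse diagnostic log might contain once exported.
runs = [
    {"pipeline": "ingest_sales", "triggered_by": "ScheduleTrigger"},
    {"pipeline": "ingest_sales", "triggered_by": "Manual"},
    {"pipeline": "clean_customers", "triggered_by": "ScheduleTrigger"},
    {"pipeline": "ingest_sales", "triggered_by": "Manual"},
]

# Count manual kicks per pipeline to spot ad-hoc ETL runs worth investigating.
manual_runs = Counter(r["pipeline"] for r in runs if r["triggered_by"] == "Manual")
print(manual_runs)
```

A pipeline showing up repeatedly in `manual_runs` is exactly the signal described above: someone bypassing the schedule.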
But yeah, when it comes to monitoring, I'd definitely recommend having a good Log Analytics strategy, making sure you're pulling in the right audit logs, especially from a security perspective, and making sure your users have the right access and things like that. That's brilliant. How can someone new learn about data platforms?
Are there any other resources that you recommend? There are some really good resources out there, and I think it's just worth noting that it's not limited to Azure. Obviously, there are loads of tools out there; Amazon Redshift is the AWS equivalent of Synapse on Azure, and there are a lot of other tools as well.
I think that's probably the hardest thing about learning about data platforms: trying to narrow down the tool set that you use, especially with so many products out there. There are a lot of SaaS products out there now as well, like Snowflake, which is warehousing and a database rolled into one SaaS product.
So yeah, there are a lot of good resources out there. Definitely MS Learn, especially around Synapse and Azure Data Factory; they're the two most common products we see used. It's worth noting it's not one-size-fits-all, but the common architecture I see is Data Lake and Synapse, and obviously Key Vault as well.
They're kind of the three core parts of a data platform. But then you've also got Data Factory and Databricks, for example. Now, the key difference between Synapse and Data Factory is that Data Factory is more of an orchestration tool, so it doesn't do ETL processes like Synapse does.
It doesn't have the analytics part, but it does do orchestration, so essentially data movement. You can use it to pull your data in from on-premise, and you can use it to push your data into something like Databricks or into a different system. So there are some key differences there that are definitely worth looking into.
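The orchestration role described for Data Factory, moving data rather than transforming it, could be sketched as a minimal copy-pipeline definition. The dataset names are hypothetical, and the structure only loosely follows the shape of an ADF pipeline JSON, purely to show where the "movement, not analytics" idea lives.

```python
import json

# A sketch of a pure data-movement pipeline: one Copy activity, no transform.
pipeline = {
    "name": "CopyOnPremToLake",
    "activities": [
        {
            "name": "CopySqlToBronze",
            "type": "Copy",  # data movement only: no transformation logic here
            "inputs": [{"referenceName": "OnPremSqlTable"}],    # hypothetical dataset
            "outputs": [{"referenceName": "BronzeContainer"}],  # hypothetical dataset
        }
    ],
}

print(json.dumps(pipeline, indent=2))
```

Everything analytical would happen downstream, in Synapse or Databricks, after this pipeline lands the data in the bronze layer.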
Okay, that's brilliant.
So, as this episode is almost coming to an end, we would love to hear about you as an individual. Are you going to any events, whether that's tech events or internal events?
Yes, so I'm hoping to go to the next Yorkshire Azure user group. I think that's coming up next month. I went to the last one in Sheffield and that was really good; I like to try and get to them as much as possible. The Azure user groups are really good.
At ANS we have an internal event called TechCom every year, where we do talks from everyone across the business, and that's really useful to learn about what everyone's doing. But yeah, that's pretty much it for me. What about you, Nicholas? Any events coming
up for you? Yeah, so I'll give myself a little plug. I'm part of the organizers for Experts Live UK, and it's coming to London next month, on the 20th. So if you're free, you're welcome to join.
Yeah, that'd be good. I'll hope to get down there.
Yeah, so it's just that one, and maybe I might be going to the Netherlands for Experts Live. That's quite a big one. Yeah, that's a big one.
Yeah, I'm familiar with the Netherlands one.
It's quite a big group. Yeah, so how can someone get in touch with you for any questions regarding data platforms?
Yeah, absolutely. So I'm available primarily on LinkedIn, just Michael Tobin, that's M-I-C-H-A-E-L T-O-B-I-N, and then I've got my blog as well; that's hosted on mtobinuk. So yeah, LinkedIn is probably the best place to get in touch with me, and feel free to connect or reach out if you don't have me already. Okay, brilliant.
Thank you for joining.