Data Science and Security: Proceedings of IDSCS 2021

Speaker 1

00:00

You know, usually when we talk about a medical diagnosis, there's this expectation of like pure precision.

Speaker 2

00:06

Oh totally, it feels like engineering.

Speaker 1

00:07

Right, exactly. Like, you break your arm, the X-ray shows that jagged white line, and the doctor just points and says, well, there it is.

Speaker 2

00:15

It's binary: broken or not broken. It's clean, and honestly, it's very comforting for patients.

Speaker 1

00:21

We like things to be visible, easily categorized.

Speaker 2

00:23

Yeah, we really do.

Speaker 1

00:24

But then you step into the world of analyzing the microscopic landscape of our own blood, or you try to extract a single meaningful diagnostic fact from thousands of pages of dense medical texts.

Speaker 2

00:39

Oh yeah, that's a nightmare.

Speaker 1

00:41

And suddenly that clean X-ray machine is useless. We're looking at a data landscape that is murky. It's chaotic.

Speaker 2

00:47

It is the absolute definition of diagnostic muddy waters.

Speaker 1

00:51

Okay, let's unpack this, because if you are trying to make sense of this modern avalanche of information, you need to understand how the tools are evolving today. Our grounding material is a really fascinating collection of research from the proceedings of the International Conference on Data Science, Computation and Security.

Speaker 2

01:09

Which is a really long title, yeah.

Speaker 1

01:13

It is. Specifically, we're looking at the IDSCS 2021 papers.

Speaker 2

01:16

Right, and this collection is really a treasure trove of how modern computational techniques are being applied to incredibly messy, real-world problems.

Speaker 1

01:28

Exactly. And our mission today on this deep dive is to explore how cutting-edge data science is solving three massive, interconnected challenges.

Speaker 2

01:35

Yeah, three really big ones.

Speaker 1

01:37

We are going to look at how algorithms are finally learning to digest human knowledge without losing the meaning, which is a crucial step; how they are identifying microscopic threats hiding in our bloodstreams; and, crucially, how they are figuring out how to do all of this without demanding access to our private raw data.

Speaker 2

01:53

Because at the core of all these papers is a single, unifying mathematical challenge, which is extracting the vital signal from an overwhelming amount of noise.

Speaker 1

02:01

Well, let's start with the loudest noise of all, which is a problem I know you the listener deal with every single day.

Speaker 2

02:09

Oh, information overload.

Speaker 1

02:11

Yes, we are drowning in text: emails, hundred-page reports, articles, research papers.

Speaker 2

02:18

It's endless.

Speaker 1

02:19

The traditional AI summary tools we've been using for the last few years often feel frankly kind of dumb.

Speaker 2

02:26

Why is that? Well, because traditional text summarization algorithms often just skim the surface. The core problem is their methodology. They essentially look for frequently repeated words and just grab the sentences containing them.

Speaker 1

02:40

So they're just, like, scanning for matches.

Speaker 2

02:41

Right. But by doing that, they lose the crucial context. They miss the nuances, and they frequently drop highly useful specialized entities that maybe only appear once or twice but change the entire meaning of a paragraph. They don't actually understand what they are summarizing. They're literally just counting.

Speaker 1

02:59

So instead of actually reading the textbook, it's blindly highlighting repeated words. It's like a stressed college student highlighting the word mitochondria fifty times without knowing what it does.

Speaker 2

03:10

That is a very accurate, albeit depressing analogy.

Speaker 1

03:14

Yeah. So the first paper we are looking at today, by Siddhant Singh and Gerard Deepak, proposes a knowledge-centric semantic approach.

Speaker 2

03:21

Yeah they do.

Speaker 1

03:22

How does this actually fix the counting problem?

Speaker 2

03:24

By building what they call a term based ontology model.

Speaker 1

03:27

Okay, term based ontology.

Speaker 2

03:29

Yeah, let's break down how this algorithm actually reads. After it cleans up the text...

Speaker 1

03:33

Like removing basic stop words and punctuation.

Speaker 2

03:36

Exactly. After that, it applies something called TF-IDF.

Speaker 1

03:40

Okay, what is TF-IDF? Because that sounds like, I don't know, heavy military jargon.

Speaker 2

03:46

It does. It stands for term frequency-inverse document frequency.

Speaker 1

03:50

It is a mouthful.

Speaker 2

03:51

It is. Think of it as a way to weigh the importance of a word. It looks at the most frequent words in a specific document, but it compares that to how rare those words are across a massive general corpus of text.

Speaker 1

04:04

Wait, can you give me an example.

Speaker 2

04:06

Sure. If the word "blood" appears fifty times in a medical paper but rarely in general English, TF-IDF flags it as a highly distinctive identifier for that specific text.
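
A minimal Python sketch of the TF-IDF weighting described here. The toy corpus, the function name `tf_idf`, and the smoothed IDF formula are assumptions for illustration; the conversation does not say which TF-IDF variant the paper uses.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each distinct word of one document against a list of tokenized documents."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total  # how often the word appears in this document
        docs_with_word = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency (this smoothing is an assumption).
        idf = math.log((1 + n_docs) / (1 + docs_with_word)) + 1
        scores[word] = tf * idf
    return scores

# Toy corpus, invented for illustration only.
corpus = [
    "the blood sample shows the blood cells".split(),
    "the weather is nice and the sky is clear".split(),
    "the train leaves in the morning".split(),
]
scores = tf_idf(corpus[0], corpus)
# "blood" and "the" are equally frequent in the first document, but "the"
# appears in every document, so "blood" ends up with the higher weight.
```

The point of the design is that raw frequency alone cannot separate "blood" from "the"; the rarity term does that.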

Speaker 1

04:17

Okay, So it finds the unique keywords, but how does it know what they mean?

Speaker 2

04:22

Well, this is where it gets semantic. It takes these extracted features and cross-references them with external knowledge sources, specifically Wikidata.

Speaker 1

04:32

Oh wow.

Speaker 2

04:34

Yeah, it is actively looking up the concepts it finds to build a domain-based ontology.

Speaker 1

04:39

So, like a literal mathematical map of how all these terms relate to each other in the real world. So it's not just looking at the document in a vacuum. It's using Wikidata to build a web of meaning, linking terms together before it even tries to summarize.

Speaker 2

04:52

You've got it. But then comes the hard part, which is deciding which sentences to actually keep and which to throw away. This is where they bring in heavy-duty statistical tools, starting with cross entropy.

Speaker 1

05:03

All right, slow down. What is cross entropy in plain English?

Speaker 2

05:07

Right?

Speaker 1

05:07

Sorry?

Speaker 2

05:08

In information theory, cross entropy essentially measures the difference between two probability distributions.

Speaker 1

05:14

Still a bit technical, okay.

Speaker 2

05:16

In the context of reading text, the algorithm is calculating the surprise factor of a new sentence.

Speaker 1

05:22

The surprise factor, Yeah.

Speaker 2

05:23

It's mathematically asking, based on the web of meaning I've already built from the previous sentences, how much genuinely new surprising information does this next sentence actually give me?

Speaker 1

05:34

Well, that is brilliant, it really is. So instead of just reading blindly, this algorithm acts like a genius friend who actually understands the meaning of the words. If the cross entropy is low, it means the algorithm isn't surprised at all.

Speaker 2

05:47

Precisely, it already knows this information. So the sentence is redundant.
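
The "surprise factor" idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions: word distributions are simple unigram counts, add-one smoothing is used so unseen words get a nonzero probability, and the sentences are invented. The paper's exact formulation is not given in the conversation.

```python
import math
from collections import Counter

def cross_entropy(new_tokens, seen_tokens, vocab):
    """H(p, q) in bits: p is the word distribution of the new sentence,
    q is the smoothed word distribution of the text already read."""
    seen = Counter(seen_tokens)
    denom = len(seen_tokens) + len(vocab)  # add-one (Laplace) smoothing
    p = Counter(new_tokens)
    n = len(new_tokens)
    return -sum((c / n) * math.log2((seen[w] + 1) / denom) for w, c in p.items())

# Invented example sentences.
seen = "sickle cells block small blood vessels".split()
repeat = "sickle cells block vessels".split()
novel = "treatment includes regular transfusion therapy".split()
vocab = set(seen) | set(repeat) | set(novel)

low_surprise = cross_entropy(repeat, seen, vocab)
high_surprise = cross_entropy(novel, seen, vocab)
# The sentence that only repeats known words scores lower than the one
# that introduces new words.
```

A low value marks a sentence as a candidate for removal, which matches the redundancy argument made here.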

Speaker 1

05:50

And it mathematically proves which sentences are redundant.

Speaker 2

05:53

Yes. And to further eliminate redundancy, it pairs this with NPMI, or normalized pointwise mutual information, alongside ANOVA.

Speaker 1

06:02

Okay, NPMI, what does that do?

Speaker 2

06:04

NPMI looks at co-occurrence. If two concepts, say interest rates and inflation, almost always show up together in the text, NPMI flags that strong relationship.

Speaker 1

06:12

Makes sense. And ANOVA?

Speaker 2

06:15

The algorithm then uses an analysis of variance, or ANOVA, to generate statistical p-values for these term relationships.
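
For the co-occurrence score mentioned just before the ANOVA step, here is a small Python sketch of NPMI using the standard definition, PMI divided by the negative log of the joint probability. Counting co-occurrence per sentence, the function name `npmi`, and the sample sentences are assumptions; the conversation does not describe the window the paper uses.

```python
import math

def npmi(term_a, term_b, sentences):
    """Normalized pointwise mutual information over sentence-level
    co-occurrence. Each sentence is a set of tokens. Range: -1 to 1."""
    n = len(sentences)
    p_a = sum(term_a in s for s in sentences) / n
    p_b = sum(term_b in s for s in sentences) / n
    p_ab = sum(term_a in s and term_b in s for s in sentences) / n
    if p_ab == 0:
        return -1.0  # the terms never appear together (limit value)
    if p_ab == 1:
        return 1.0   # together in every sentence; avoids dividing by log(1) = 0
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Invented sentences: "rates" and "inflation" always appear together.
sentences = [
    {"rates", "inflation", "rose"},
    {"rates", "inflation", "fell"},
    {"weather", "rain"},
    {"sports", "score"},
]
always_together = npmi("rates", "inflation", sentences)
never_together = npmi("rates", "weather", sentences)
```

A score near 1 signals the kind of strong relationship described above; a score of -1 means the terms never share a sentence.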

Speaker 1

06:24

So it's assigning a strict mathematical grade to every single word relationship.

Speaker 2

06:28

Yes, and the grading is ruthless.

Speaker 1

06:31

Really?

Speaker 2

06:32

Oh yeah. The system groups sentences based on these p-values. It uses a strict threshold, like a cutoff point. If the calculated value of cross entropy and the intersection of those NPMI scores is less than 0.5, that sentence is entirely eliminated.

Speaker 1

06:49

Just gone, gone.

Speaker 2

06:50

It is mathematically proving that the sentence adds no new value to the summary.
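
The cutoff itself is simple to sketch. In the Python below, each sentence arrives with a precomputed combined score; how the paper combines cross entropy and NPMI into that number is not spelled out in the conversation, so the scores and the function name `filter_sentences` are hypothetical. Only the 0.5 threshold comes from the discussion.

```python
def filter_sentences(scored_sentences, threshold=0.5):
    """Keep sentences whose combined score reaches the threshold,
    preserving their original order. Scores are assumed precomputed."""
    return [sentence for sentence, score in scored_sentences if score >= threshold]

# Hypothetical scores, for illustration only.
scored = [
    ("Sickle cells are rigid.", 0.82),
    ("The cells are rigid.", 0.31),   # mostly repeats the previous sentence
    ("They can block capillaries.", 0.67),
]
kept = filter_sentences(scored)
```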

Speaker 1

06:55

But wait, if you mathematically chop up a fifty-page document based on variance and entropy, the resulting summary might contain the right facts, but it's going to sound like a glitching robot trying to speak English.

Speaker 2

07:05

Right, it would be super disjointed.

Speaker 1

07:07

Sentences will just abruptly smash into each other. Which is exactly why the authors included a final polishing step using two distinct agents. Oh, they fixed the flow.

Speaker 2

07:15

Yes. First, a lexical agent using WordNet 2.0 steps in. WordNet acts like a massive conceptual dictionary to ensure the vocabulary transitions naturally and captures the right lexemes.

Speaker 1

07:27

Okay, that helps the vocabulary.

Speaker 2

07:28

And then a grammatical agent restructures the phrasing to fix the grammatical errors that inevitably happen when you stitch disparate sentences together.

Speaker 1

07:37

So what were the actual results of this highly semantic, mathematically ruthless approach?

Speaker 2

07:42

Well, they tested this on the DUC 2007 dataset.

Speaker 1

07:46

Which is what, like a benchmark?

Speaker 2

07:47

Yeah, it's a standard academic benchmark containing hundreds of documents with manually created, human-written summaries to test algorithms against. The Singh and Deepak model achieved an F-measure of 88.20 percent.

Speaker 1

08:01

Wow, that's high.

Speaker 2

08:03

It is. And perhaps more importantly, a false negative rate of just 0.14.

Speaker 1

08:07

Wait, 0.14? That's tiny. Let's translate that false negative rate into the real world. A false negative means the algorithm looked at a crucial piece of information and mistakenly decided to delete it. A rate of 0.14 means it is almost never deleting vital information.

Speaker 2

08:24

Exactly. For context, they compared it to baseline models like IKTWA, which only hit a 78.36 percent F-measure.
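
For readers who want the arithmetic behind these two metrics, here is a short Python sketch using the standard definitions: the F-measure as the harmonic mean of precision and recall, and the false negative rate as the share of relevant items that were wrongly dropped. The counts and the precision and recall values are invented; they are not the paper's data.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def false_negative_rate(false_negatives, true_positives):
    """Share of truly relevant items that were wrongly discarded."""
    return false_negatives / (false_negatives + true_positives)

# Invented counts: 14 relevant sentences dropped, 86 kept.
fnr = false_negative_rate(false_negatives=14, true_positives=86)  # 0.14
recall = 1 - fnr  # recall and false negative rate always sum to 1
score = f_measure(precision=0.9, recall=recall)
```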

Speaker 1

08:31

If I'm relying on an AI to summarize a massive legal contract or a dense medical history, I need to know it didn't accidentally delete the hidden fee clause or the patient's drug allergy.

Speaker 2

08:42

It's vital.

Speaker 1

08:43

This math actually provides that confidence. It's so efficient. This directly benefits anyone trying to learn faster.

Speaker 2

08:50

It is a massive leap forward. It shows that by mapping the ontology of words, computers can finally move past just counting text to actually distilling human thought.

Speaker 1

09:00

If we can use statistical thresholds to filter out useless sentences, can we use that exact same logic to filter out useless noise in a medical scan?

Speaker 2

09:08

That is exactly what we're looking at next.

Speaker 1

09:09

We're moving from processing human language to processing human biology, because data science isn't just about reading text faster. It's about seeing what the human eye misses.

Speaker 2

09:19

It is the transition from semantic analysis to geometric analysis using the exact same underlying principle.

Speaker 1

09:26

Extracting vital features from a sea of noise. This brings us to the second paper, by Ali Sadam, Hazim G. Daway, and Jamela Jouda. They're detecting abnormal red blood cells, or RBCs, using morphology and rotation. Now, why is this a problem that needs a data science solution?

Speaker 2

09:45

Because the stakes are incredibly high for conditions like hemolytic anemia, with sickle cell anemia being a prime example.

Speaker 1

09:52

Okay, hemolytic anemia.

Speaker 2

09:53

Yeah. In a healthy person, red blood cells are perfectly circular and flexible, but a genetic abnormality can cause these cells to become deformed. They get all misshapen. They turn elliptical, rectangular, or sickle-shaped, like a crescent moon.

Speaker 1

10:07

And because of that elongated jagged shape, they become rigid. They get stuck in blood vessels, and they rupture easily as they pass through our capillaries.

Speaker 2

10:15

Right. Now, the traditional way to diagnose this involves a highly trained hematology technician sitting at a microscope.

Speaker 1

10:21

Staring through the lens all day.

Speaker 2

10:23

Manually examining a glass slide smeared with blood and looking for these deformed cells among hundreds or thousands of normal ones.

Speaker 1

10:31

That sounds exhausting.

Speaker 2

10:32

It is tedious, it is painstakingly slow, and it is highly prone to fatigue. You are relying entirely on a tired human eye.

Speaker 1

10:41

So Sadam and his team built an automated solution. And what fascinated me is how they prep the image before the computer even looks for the cells.

Speaker 2

10:51

The preprocessing is key.

Speaker 1

10:52

They have to find the region of interest, or ROI. They take the standard grayscale microscope image and essentially peel it apart by converting it into pure black-and-white binary images.

Speaker 2

11:03

But they don't just do it once.

Speaker 1

11:05

Right. They process it at very specific intensity thresholds, like 60, 70, 80, 90, and 100. Why those specific numbers? Why not just make the dark stuff black and the light stuff white?

Speaker 2

11:18

Well, because a blood smear is messy. Lighting under a microscope isn't perfectly even. Some cells overlap, some are faded.

Speaker 1

11:24

Oh, so it's not a uniform image.

Speaker 2

11:27

Not at all. By running multiple thresholds, the algorithm is essentially adjusting the exposure step by step.

Speaker 1

11:31

Oh, like changing the settings on a camera.

Speaker 2

11:34

Exactly. It's finding the optimal contrast where the true edge of the cell separates from the background fluid.
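
The multi-threshold step can be sketched in plain Python. The five threshold values come from the discussion; treating the image as an 8-bit grayscale grid where cells are darker than the background, and the function names, are assumptions made for this example.

```python
THRESHOLDS = (60, 70, 80, 90, 100)  # intensity levels quoted in the discussion

def binarize(gray, threshold):
    """Return a 0/1 mask: 1 where the pixel is darker than the threshold.
    Assumes cells are darker than the background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def binarize_all(gray, thresholds=THRESHOLDS):
    """One binary mask per threshold, keyed by the threshold value."""
    return {t: binarize(gray, t) for t in thresholds}

# Tiny invented image: low numbers are dark pixels.
gray = [
    [50, 65, 200],
    [95, 75, 255],
]
masks = binarize_all(gray)
# At 60 only the darkest pixel is foreground; at 100 the faded pixels
# (65, 75, 95) are picked up as well.
```

Stepping through thresholds is what lets faded cells appear in at least one mask without letting the background flood every mask.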

Speaker 1

11:39

And as it creates these binary images, it runs a cleaning protocol. Anything that shows up as a smooth region smaller than 100 pixels is instantly deleted.

Speaker 2

11:49

It mathematically decides this is too small to be a red blood cell. It must be a speck of dust or an artifact on the glass.
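
A minimal Python sketch of that cleaning step: find each connected blob in a binary mask and erase the ones below a minimum area. The 100-pixel limit is from the discussion; using 4-connectivity and a breadth-first search is an assumption, since the conversation does not say how the paper labels regions.

```python
from collections import deque

def remove_small_regions(mask, min_area=100):
    """Zero out 4-connected foreground regions with fewer than min_area pixels."""
    rows, cols = len(mask), len(mask[0])
    cleaned = [row[:] for row in mask]
    visited = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or visited[r][c]:
                continue
            # Breadth-first search collects one connected region.
            region = []
            queue = deque([(r, c)])
            visited[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not visited[ny][nx]):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for y, x in region:
                    cleaned[y][x] = 0  # too small to be a cell: treat as dust
    return cleaned

# Small demo with a lowered limit so the effect is visible on a 5x5 grid.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
cleaned = remove_small_regions(mask, min_area=3)
```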

Speaker 1

11:55

This leaves the algorithm with a clean map of distinct objects. But an object is just a blob of pixels. How does the computer know if it's a normal circle or an abnormal sickle cell?

Speaker 2

12:06

That's the real challenge.

Speaker 1

12:07

Here's where it gets really interesting. I was looking at the paper, and it explains that once the algorithm isolates a cell, it actually rotates the image of that cell by 10, 20, 30, and 40 degrees counterclockwise.

Speaker 2

12:21

Yes, it spins the image.

Speaker 1

12:22

And my first thought was, wait, why does the algorithm rotate the images? Why not just look at the cell as it is? If a sickle cell is shaped like a crescent moon, rotating it on a slide doesn't magically turn it into a circle. Why does the computer care what angle it's sitting at?

Speaker 2

12:37

It's a great question. It's because computers don't see shapes the way human eyes do. They don't look at a cluster of pixels and instantly recognize a crescent. They understand geometry through bounding boxes.

Speaker 1

12:49

Bounding boxes.

Speaker 2

12:50

Yeah. Think of a bounding box as drawing a strict square or rectangle around the absolute furthest edges of the object.

Speaker 1

12:57

Okay, I'm visualizing drawing a tight box around a cell.

Speaker 2

12:59

And that box is aligned perfectly with a horizontal x-axis and a vertical y-axis. Now, think about how blood is smeared onto a slide. The cells land completely randomly.

Speaker 1

13:13

Just splattered everywhere.

Speaker 2

13:14

Right. They are oriented at all possible chaotic angles. A normal, healthy red blood cell is circular.

Speaker 1

13:20

So it's the same in every direction.

Speaker 2

13:23

Exactly. Its height and its width are roughly equal no matter how you spin it inside that bounding box.

Speaker 1

13:28

But a sickle cell has a distinct long axis and a short axis. So if an elongated sickle cell happens to land diagonally on the slide and the computer draws a straight up-and-down bounding box around it, the box has to stretch out horizontally and vertically to capture the diagonal corners.

Speaker 2

13:45

Exactly. If it lands diagonally, the bounding box might actually look perfectly square.

Speaker 1

13:49

Oh well, I wouldn't have thought of that.

Speaker 2

13:52

The computer won't capture the cell's true maximum length versus its true minimum width. It'll just see a big square box.

Speaker 1

13:59

So by rotating the object by 10, 20, 30, and 40 degrees, the algorithm forces the cell into alignment with the x and y axes.

Speaker 2

14:07

Yes, it tests different angles until it finds the orientation where the bounding box is stretched to its absolute maximum limit.
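
The bounding-box argument can be checked with a few lines of Python. This sketch represents a cell as a list of outline points rather than an image, rotates them about the origin, and measures the axis-aligned box at each angle. The point-based representation, the inclusion of the unrotated 0 degree case, and all names are assumptions for illustration.

```python
import math

def bounding_box(points):
    """Width and height of the axis-aligned box around (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return max(xs) - min(xs), max(ys) - min(ys)

def rotate(points, degrees):
    """Rotate points counterclockwise about the origin (y axis pointing up)."""
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in points]

def deltas_by_angle(points, angles=(0, 10, 20, 30, 40)):
    """|height - width| of the bounding box at each rotation angle."""
    result = {}
    for angle in angles:
        width, height = bounding_box(rotate(points, angle))
        result[angle] = abs(height - width)
    return result

# A thin, elongated outline lying on the 45-degree diagonal (hypothetical cell).
diagonal_cell = [(i, i) for i in range(-20, 21)]
deltas = deltas_by_angle(diagonal_cell)
# Unrotated, the box is a perfect square, so the elongation is invisible.
# After a 40-degree turn the long axis is nearly vertical and the box
# becomes tall and narrow.
```

This is the situation described above: the diagonal cell hides inside a square box until rotation exposes its long axis.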

Speaker 1

14:15

That is so incredibly clever. It's essentially testing the geometry at every angle to find the cell's true stretched-out shape.

Speaker 2

14:24

And the math they use to flag it as diseased is so beautifully simple.

Speaker 1

14:27

It really is just basic subtraction. Right.

Speaker 2

14:29

Yeah, they calculate the difference between the height and width of that bounding box.

Speaker 1

14:33

They call it delta, right.

Speaker 2

14:35

Delta equals the absolute value of height minus width. If the minimum difference they find during all those rotations is greater than seven pixels...

Speaker 1

14:42

And the cell's total area falls within the biological norm of 450 to 1,000 pixels.

Speaker 2

14:48

Then the algorithm officially flags that cell as abnormal.

Speaker 1

14:51

That's it. That simple. If the height and width remain relatively equal, so a delta of less than seven, it's classified as a normal, healthy cell.
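
The decision rule as stated in this conversation can be written as a short Python function. It follows the wording used here (the minimum delta across rotations must exceed seven pixels, and the area must fall between 450 and 1,000 pixels); whether the paper aggregates the rotations in exactly this way is not something the discussion settles. The "ignored" label for out-of-range areas and the handling of a delta of exactly seven are assumptions.

```python
def classify_cell(deltas, area, delta_limit=7, area_range=(450, 1000)):
    """Label one object from its per-rotation deltas (|height - width|, in
    pixels) and its pixel area, using the rule described in the discussion."""
    low, high = area_range
    if not (low <= area <= high):
        return "ignored"  # outside the expected cell size (assumed handling)
    # A delta of exactly delta_limit is treated as normal (assumption).
    return "abnormal" if min(deltas) > delta_limit else "normal"

# Invented measurements for three objects.
elongated = classify_cell(deltas=[12, 15, 20, 26, 30], area=600)
round_cell = classify_cell(deltas=[1, 2, 0, 3, 2], area=600)
speck = classify_cell(deltas=[12, 15, 20, 26, 30], area=50)
```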

Speaker 2

15:00

And the results from this geometric approach? They tested it on a dataset of 40 real blood smear images from erythrocytesIDB.

Speaker 1

15:09

This is a very solid test set. Yeah.

Speaker 2

15:11

It achieved an 86 percent detection rate with only a 14 percent false alarm rate.

Speaker 1

15:16

That's incredibly promising.

Speaker 2

15:18

We are talking about taking a diagnostic process that usually requires a highly trained hematologist in a specialized lab and codifying it into an automated algorithm that could run on a basic computer in a remote clinic.

Speaker 1

15:31

It's the democratization of diagnostics. It really is.

Speaker 2

15:34

Okay, so we have an AI that can read and summarize complex documents like a genius, and another AI that can tirelessly diagnose our blood based on rotating bounding boxes.

Speaker 1

15:43

Two massive leaps forward.

Speaker 2

15:45

But both of these incredible tools, an algorithm that digests our personal documents and a system that diagnoses our blood, share a massive vulnerability.

Speaker 1

15:53

They absolutely do.

Speaker 2

15:54

They are completely reliant on digesting massive amounts of data, which brings up the elephant in the room: data hunger.

Speaker 1

15:59

Who owns this data, and how do we keep it safe? Can we just throw all of our most intimate data onto a giant, centralized public server so an algorithm can practice on it? No, we really can't, which brings us to the third paper, a systematic review by Kapil Tiwari, Samiksha Shukla, and Jossy P. George on privacy-preserving machine learning, or PPML.

Speaker 2

16:21

What's fascinating here is that this addresses the central paradox of modern artificial intelligence.

Speaker 1

16:26

The paradox yeah.

Speaker 2

16:28

ML's effectiveness relies entirely on the amount, distribution, and variety of training data.

Speaker 1

16:33

Right, it needs to see a lot of examples to learn.

Speaker 2

16:35

Exactly. An AI trained only on data from one hospital in London won't be very accurate at diagnosing patients in a rural clinic in India. It needs diverse data to avoid bias.

Speaker 1

16:46

But getting that data from multiple diverse sources is a nightmare because of privacy concerns, security threats, data sovereignty laws.

Speaker 2

16:54

HIPAA in the US, GDPR in Europe. A hospital in London legally cannot hand over its patient files to a tech company in Silicon Valley.

Speaker 1

17:03

And companies want to protect their competitive advantages too.

Speaker 2

17:06

Exactly, even if it were legal, they wouldn't want to share. So we face a roadblock. We have these algorithms, but the data is locked away in disconnected silos.

Speaker 1

17:16

So what does this all mean? We're basically stuck between wanting these life-saving, time-saving AI tools and not wanting to hand over our private medical records or personal notes to a giant central server.

Speaker 2

17:29

Yeah, we are stuck. But PPML is the crucial bridge here. It's an emerging suite of techniques designed to aggregate data, train models, and serve inferences without ever actually exposing the underlying raw data.

Speaker 1

17:41

So it's the silent security guard making the text summarization and the medical imaging possible in the real world. But how does it actually work? A security guard just stops people at the door.

Speaker 2

17:51

Well, PPML isn't just one algorithm. It's an entire suite of cryptographic and statistical techniques. One of the most powerful mechanisms they review in this paper is called federated learning.

Speaker 1

18:03

Okay, walk me through federated learning.

Speaker 2

18:05

In traditional AI, you take all the data from all over the world, move it to one giant central server, and train the model there. Federated learning completely reverses that architecture. Instead of moving the data to the model, we move the model to the data.

Speaker 1

18:19

Oh wow, It's like asking a thousand hospitals to bake a cake together. You want the ultimate perfect recipe, but you aren't allowed to know whose kitchen provided the eggs, or who sifted the flour, or what their kitchens even look like.

Speaker 2

18:32

That is a brilliant way to conceptualize it. In federated learning, a central server sends a blank, untrained copy of the AI model out to thousands of local hospitals or, say, smartphones. The model trains locally on that private data behind the hospital's own firewalls. The raw data never, ever leaves the building.

Speaker 1

18:53

Wait, if the data never leaves, how does the central AI get any smarter?

Speaker 2

18:57

Because the local model doesn't send back the medical records. It only sends back the math. It sends back the updated weights and biases, the microscopic adjustments it made to its own internal logic. The central server collects these mathematical tweaks from thousands of hospitals, averages them out, and creates a master model that has learned the patterns of the disease without ever seeing a single patient's name or scan.
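
The "send back only the math" loop can be sketched in a few lines of Python. This is a toy illustration, assuming each hospital runs one plain gradient-descent step and the server takes an unweighted element-wise mean; practical systems often weight clients by how much data they hold. The gradients, learning rate, and function names are invented.

```python
def local_update(global_weights, local_gradient, lr=0.1):
    """One hypothetical local training step: plain gradient descent.
    The gradient is computed on private data that never leaves the client."""
    return [w - lr * g for w, g in zip(global_weights, local_gradient)]

def federated_average(client_weights):
    """Element-wise mean of the weight vectors reported by clients."""
    n_clients = len(client_weights)
    return [sum(ws) / n_clients for ws in zip(*client_weights)]

global_weights = [0.0, 0.0]
# Gradients each hospital derived from its own records (invented numbers).
private_gradients = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]

# Each client trains locally and reports only its updated weights.
reported = [local_update(global_weights, g) for g in private_gradients]
# The server sees weight vectors only, never the underlying records.
new_global = federated_average(reported)
```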

Speaker 1

19:21

That is wild. It learns the lesson but immediately forgets the teacher. But what if a hacker intercepts those mathematical tweaks? Couldn't they reverse engineer them to figure out the original patient data?

Speaker 2

19:34

That is a real risk, which is why federated learning is often paired with another PPML technique reviewed in the paper: differential privacy.

Speaker 1

19:42

How does differential privacy work?

Speaker 2

19:44

It works by intentionally injecting mathematical noise into the data before it's even analyzed.

Speaker 1

19:49

Hold on, my brain is stuck again. If you inject noise into a medical diagnostic tool, don't you ruin the accuracy? Why would you purposefully make the data blurrier?

Speaker 2

19:58

Because it's a very specific, carefully calculated statistical noise.

Speaker 1

20:02

Like static on a TV?

Speaker 2

20:04

Sort of. Think of it like taking a photograph of a massive crowd in a stadium. If you apply a specific blurring filter, you can completely obscure the individual faces of every single person.

Speaker 1

20:15

In the crowd.

Speaker 2

20:16

No one can be identified. They're just blobs, but you can still easily tell what color jerseys the crowd is wearing, or which section of the stadium is the most crowded.

Speaker 1

20:25

Oh, okay. You destroy the micro details to protect the individual, but you preserve the macro trends so the AI can still learn the big picture.

Speaker 2

20:35

Precisely. Differential privacy ensures that the presence or absence of any single individual's data in the training set does not significantly affect the final output of the model.
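
One common way to get this guarantee is the Laplace mechanism, sketched below in Python for a simple counting query. This is an illustrative choice, not necessarily the mechanism the reviewed paper emphasizes. The privacy budget `epsilon`, the sensitivity of 1 (one person changes a count by at most one), and the function names are assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two
    exponential samples with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.
    Smaller epsilon means more noise and stronger privacy."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)  # seeded only so the demo is repeatable
noisy = private_count(true_count=120, epsilon=0.5, rng=rng)
# Any single release is blurred, but averages over many releases stay
# close to the true value, which is the "macro trend" being preserved.
```

The noise scale depends only on how much one individual can change the answer, which is why the individual is hidden while the aggregate survives.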

Speaker 1

20:44

Wow. By combining federated learning and differential privacy, these researchers are building a world where an algorithm can master human language and diagnose human biology without causing a privacy apocalypse.

Speaker 2

20:56

It's the only way forward. Really.

Speaker 1

20:58

Okay, we've covered some serious intellectual ground today on this deep dive. Let's recap this journey for you, the learner. We started by exploring how data science is conquering information overload. Algorithms are moving past blindly counting words.

Speaker 2

21:12

Right, using Wikidata to build semantic maps.

Speaker 1

21:16

Exactly. And by applying strict statistical thresholds like cross entropy and ANOVA, they can mathematically prove which sentences carry the most meaning, summarizing dense text with an F-measure of over 88 percent.

Speaker 2

21:29

And then we saw those same principles of feature extraction applied to the microscopic world.

Speaker 1

21:34

Right, the blood cells.

Speaker 2

21:36

We learned how algorithms use binary contrast thresholds and the rotating geometry of bounding boxes to catch the elongated, dangerous shapes of sickle cells that a tired human eye might easily miss.

Speaker 1

21:48

And finally, we addressed the necessary shield for all this innovation. We explored how privacy-preserving techniques like federated learning and differential privacy allow these algorithms to travel to the data.

Speaker 2

21:59

They learn the mathematical lessons and blur out the faces.

Speaker 1

22:02

Ensuring they can learn from our most intimate documents and our biology without actually spying on us.

Speaker 2

22:07

It all comes back to extracting the signal from the noise, whether that noise is redundant text, background cellular artifacts, or the logistical nightmare of locked data silos.

Speaker 1

22:16

It really does. It's all incredibly connected.

Speaker 2

22:18

And if I can leave you with one final thought, think about where these three streams of technology inevitably converge. If data science can perfectly summarize the complexities of human thought and diagnose our diseases just by measuring the geometry of our cells, what happens when privacy-preserving machine learning allows these systems to securely cross-reference both on a global scale?

22:41

Oh wow. Will the ultimate AI of the future know the intricacies of our minds and our bodies better than we know ourselves, all while never actually knowing our names?

Speaker 1

22:49

Wow. That is a massive thought to chew on. You came here today to connect the dots and get well informed, and I think we definitely hit the mark. Thank you so much for joining us on this deep dive. Until next time, keep questioning, keep learning, and we'll see you on the next one.

Transcript source: Provided by creator in RSS feed