Navigating Data Quality and Scalability Challenges when Using Real-World Data from EHRs

Speaker 1

00:00

I would like to now invite up Meredith Zozus from UT Health San Antonio. Thanks so much for having me. So, you know, using clinical research as a care option really relies on taking data from routine care and using it in research, and/or taking the data from research and getting it back into routine care where it's relevant to routine care. And so I'm going to talk a little bit about the going-in part of that, which is using routine care data for research. So doing this, using routine care

Speaker 2

00:37

Data for research.

Speaker 1

00:39

Literally has been the holy grail. It's been a dream since the first uses of computers in medicine, and doing so would hugely facilitate our ability to integrate care and research because both care and research are incredibly information intensive endeavors, and so people can't make the decisions that they need

01:01

in either of those aspects without having the correct information. Unfortunately, we're not there yet, and I would say we're probably not nearly as close as we would like to be, even though there's a ton of standards and a ton of really good technology out there.

Speaker 2

01:23

So why is using.

Speaker 1

01:25

EHR data such a big deal in care and research? Well, first, the cost of redundant data collection is high. I'll never forget the day that I sat in a meeting at my old institution and one of my colleagues walked in and he said, oh my god, I just had a clinical trial visit and you're not going to believe this. And so all of us clinical trialists sitting around the table were like, Doc, what happened? He looked around the table and he said, I got stuck twice. We said,

01:52

what do you mean, you got stuck twice? Well, the study needed a blood draw and my doctor needed one for my annual physical, and they couldn't use one piece of information for the other. They took two blood draws from me thirty minutes apart. Thus, we really want to end that redundant data collection and the redundant participant burden, and the redundant burden on sites. Other things: it's hard to find investigators to conduct studies. We have many, many, many,

02:21

many questions that are still unanswered in clinical care. Evidence based medicine is not exactly evidence based, and I've seen estimates as high as about half of the cases, and I put some interesting examples on the bottom of the slide. The other is that just from a healthcare facility operations standpoint.

Speaker 2

02:39

I mean we care.

Speaker 1

02:39

About research at healthcare facilities all across the spectrum. Sure, we want to make new therapeutics and discover new things, but doggone it, we also want our operations to perform better and to provide safer and better care. So we care about research in all of those aspects and have the same needs as others that conduct research in all of those aspects. So using EHR data in research is a big deal to the FDA. They have released a slew of guidance recently to try and help

03:11

folks do this. It's a big deal to an organization known as the Patient-Centered Outcomes Research Institute, or PCORI. One of its fundamental goals is this PCORnet network, which is a network of networks. In each of those networks are multiple health care facilities that provide electronic health record data into a large national data set that can be used to support research. It's also a

03:38

big deal to the National Institutes of Health. Not only do they have probably the largest single collection of EHR data, for the COVID clinical cohort, but they have also recently finished a pilot for broadening that up to all EHR data, and making those data available in a research enclave for secondary use for research.

Speaker 2

03:57

So it's a big deal to a lot of people.

Speaker 1

03:59

Yet using EHR data in research has still very much eluded us from an automated, helpful standpoint. Now for years, we've gone over to the medical record and, whether it's paper or electronic, looked at values, scanned through page by page, read the notes, gotten the data that we need, and entered it into an electronic data capture system. So everybody uses EHR data in that way, but doggone it.

04:25

Not only is that expensive, but it's also the most error prone process in collecting and managing clinical research data. Error rates of up to ten or twenty percent of the data are common, and that's high enough to flip a P value in a study. So it has eluded us until now. And I'm going to talk about two different example ways of using EHR data in research and some very recent data. For one of the studies, the data literally came out of the field last week. The other one is an older study.

04:59

In the other two or three year in between. But I want to mention that in looking from prospective to retrospective studies, the gentleman that came up had a similar concern, asking if the policy was going to cover both clinical trials and observational research.

Speaker 2

05:14

So there's a spectrum.

Speaker 1

05:15

There that we need to cover, and there are different ways of using EHR data and thinking about use of EHR data when we move from observational studies to some interventional studies and maybe all the way forward to randomized controlled clinical studies. So the first example that I'm going to give, this is the oldest of the studies. It was a seven thousand patient study.

Speaker 2

05:40

About fifty nine.

Speaker 1

05:42

Hundred of them were eligible for this data quality study, meaning that they had EHR data, and they also had participant self report data for thirty four medical conditions, eight procedures, hospitalizations, and smoking status. And so we compared them. The goal of the study was to compare them and to measure the quality of patient reported data versus electronic health record data.
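
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's actual method: the participant IDs, item names, and values are made up, and `compare_sources` is a hypothetical helper. It counts the share of participants with at least one discrepancy between self report and EHR, and computes overall agreement per item.

```python
from typing import Dict, List, Tuple

# participant id -> {item name -> condition present?}; structure is an assumption
Records = Dict[str, Dict[str, bool]]


def compare_sources(
    self_report: Records, ehr: Records, items: List[str]
) -> Tuple[float, Dict[str, float]]:
    """Return (share of participants with >= 1 discrepancy, per-item overall agreement)."""
    if not self_report:
        raise ValueError("no participants to compare")
    agree = {item: 0 for item in items}
    with_discrepancy = 0
    for pid, reported in self_report.items():
        recorded = ehr[pid]
        mismatch = False
        for item in items:
            if reported[item] == recorded[item]:
                agree[item] += 1
            else:
                mismatch = True
        with_discrepancy += mismatch  # True counts as 1
    n = len(self_report)
    return with_discrepancy / n, {item: agree[item] / n for item in items}


# Made-up example data, for illustration only.
example_self_report = {
    "p1": {"diabetes": True, "smoker": False},
    "p2": {"diabetes": False, "smoker": True},
}
example_ehr = {
    "p1": {"diabetes": True, "smoker": False},
    "p2": {"diabetes": True, "smoker": True},
}
share, agreement = compare_sources(example_self_report, example_ehr, ["diabetes", "smoker"])
print(share, agreement)
```

Overall agreement counts matching "no" answers as well as matching "yes" answers, which is one reason it is a comparatively easy bar to clear for rare conditions.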

Speaker 2

06:05

And so when we.

Speaker 1

06:07

Did that, first of all, we found out that ninety four point five percent of the participants had one or more discrepancies. Okay, that's kind of high. We were hoping for a lower number, but you know, it's reasonable. It's a lot of data, right, lots of opportunities for disagreement in the data. What we also found was that ten out of the forty five assessed parameters, those thirty four conditions, eight procedures, hospitalizations, and smoking status, had less than eighty

06:32

percent overall agreement. And I will tell you that of the many measures (and we can discuss later in the hall if you want a more detailed discussion), overall agreement is the easiest bar to get over for data quality measurement. So we then took six hundred and eleven of the participants and we interviewed them about the discrepancies to figure out, well, your EHR says this, and you reported this, help us

06:57

understand the difference. And then after the interview with the participant, which was often quite conclusive in many of the cases, they're like, oh, yeah, no, that record's from my healthcare record in Kannapolis, North Carolina, and I had them down in Florida five years ago. Oh okay, we get it. You showed up at the Kannapolis ER because you fell

07:17

and you broke your ankle. So for those six hundred and eleven patients, we interviewed them, and for the Arkansas data (and this data was collected in regions across two states, one in North Carolina and the other in the state of Arkansas), the sensitivity of the EHR data was less than eighty percent for thirty items. So sensitivity basically means the ability of the EHR to detect that a patient has a diagnosis, given

07:50

that they have a diagnosis. So sensitivity of eighty percent means that I missed twenty percent of the diagnoses in the EHR, or that the EHR itself missed twenty percent of the diagnoses. And so when we look at the data, the bars are the ninety five percent confidence intervals.
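
The definition of sensitivity given above can be written out as a short calculation. The sketch below assumes an adjudicated reference value is available for each participant (for example, from the follow-up interview); the function names and the sample values are hypothetical and are not taken from the study.

```python
from typing import Sequence


def sensitivity(truth: Sequence[bool], source: Sequence[bool]) -> float:
    """Of the participants who truly have the condition, the share the source captured."""
    tp = sum(1 for t, s in zip(truth, source) if t and s)
    fn = sum(1 for t, s in zip(truth, source) if t and not s)
    if tp + fn == 0:
        raise ValueError("no true cases in the reference standard")
    return tp / (tp + fn)


def specificity(truth: Sequence[bool], source: Sequence[bool]) -> float:
    """Of the participants who truly lack the condition, the share the source left blank."""
    tn = sum(1 for t, s in zip(truth, source) if not t and not s)
    fp = sum(1 for t, s in zip(truth, source) if not t and s)
    if tn + fp == 0:
        raise ValueError("no true non-cases in the reference standard")
    return tn / (tn + fp)


# Made-up adjudicated truth and EHR flags for one condition.
adjudicated = [True, True, True, True, True, False, False]
ehr_flags = [True, True, True, True, False, False, True]
print(sensitivity(adjudicated, ehr_flags), specificity(adjudicated, ehr_flags))
```

In this made-up sample the EHR captures four of five true diagnoses, so its sensitivity is 0.8: it misses twenty percent of the diagnoses, as in the speaker's example.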

Speaker 2

08:09

This is the accuracy data.

Speaker 1

08:11

So it's only in that six hundred and eleven patient sample. We've got the top limit of the ninety five percent confidence interval, meaning the area where we're ninety five percent sure that that region covers the error

08:26

rate that we're seeing. For many of them, the top region of that confidence interval is under eighty percent, which, if you talk to a clinical trialist who is writing a clinical study report submitting this data to FDA, they would fall out of their chair if you said the error rate of the data might be twenty percent.
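
To make the "top limit of the confidence interval is under eighty percent" check concrete, here is a minimal sketch. The talk does not say which interval method was used; the Wilson score interval below is an assumption chosen because it behaves well for proportions near 0 or 1, and the counts in the example are invented.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def upper_limit_below(successes: int, n: int, threshold: float = 0.8) -> bool:
    """True when even the upper confidence limit falls short of the threshold."""
    _, upper = wilson_interval(successes, n)
    return upper < threshold


# Invented counts: 60 of 100 adjudicated diagnoses found in the EHR.
print(wilson_interval(60, 100), upper_limit_below(60, 100))
```

When the upper limit sits below the threshold, sampling noise alone is an unlikely explanation for the shortfall.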

Speaker 2

08:49

I also want to just point out.

Speaker 1

08:50

For those of you that are using hospitalization as an outcome: man, EHR data is terrible for that.

Speaker 2

08:58

You have to have.

Speaker 1

08:59

Another option for that. So can you find the culprit? Why so many errors and discrepancies? What do you think it could have been? Or which parameter up there is the culprit? Since we're running out of time, I'll make it easy. It's the sensitivity of the EHR data.

Speaker 2

09:19

But it's the sensitivity of the EHR data.

Speaker 1

09:21

Note North Carolina and Arkansas look very different from an EHR sensitivity basis, and that.

Speaker 2

09:29

Really caused a lot of concern.

Speaker 1

09:31

That kind of caused us to scratch our heads a bit, and that was a little painful. So the first thing that we concluded from that, and we did sort of find the smoking gun with the sensitivity for the EHR data. The data in Arkansas came from an integrated health system with community sites around the four corners of the state, but all the data from those community sites were integrated centrally at the Academic Medical Center in Arkansas, so it was all integrated and warehoused in.

Speaker 2

10:03

In North Carolina.

Speaker 1

10:04

It was very different. The data came primarily from community sites, to echo another theme of reaching people out in the communities, and their care tended to be much more fragmented. There were two large health systems in the area; we had EHR data from one. The other refused to participate in the study, and the care was incredibly fragmented for those patients. And that really for us when we look

10:30

in detail into it, that's what we pinned it on. Given the limitations of an observational study, what we think is the smoking gun is the care fragmentation and subsequent data fragmentation in the region.
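
One way to see why a difference between sites matters is to compute sensitivity per site next to the pooled figure. The sketch below uses invented counts (they are not the study's results), and `sensitivity_by_site` is a hypothetical helper.

```python
from typing import Dict, Tuple


def sensitivity_by_site(counts: Dict[str, Tuple[int, int]]) -> Tuple[Dict[str, float], float]:
    """counts maps site -> (true positives, false negatives).

    Returns (per-site sensitivity, pooled sensitivity across all sites).
    """
    per_site = {}
    total_tp = total_fn = 0
    for site, (tp, fn) in counts.items():
        if tp + fn == 0:
            raise ValueError(f"site {site!r} has no true cases")
        per_site[site] = tp / (tp + fn)
        total_tp += tp
        total_fn += fn
    return per_site, total_tp / (total_tp + total_fn)


# Invented counts: an integrated system versus a fragmented region.
per_site, pooled = sensitivity_by_site({"integrated": (90, 10), "fragmented": (50, 50)})
print(per_site, pooled)
```

With these invented numbers the pooled value lands between the two sites and describes neither of them, which is the kind of masking a per-site assessment is meant to expose.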

Speaker 2

10:44

In general.

Speaker 1

10:45

The sensitivity of the self report data was higher than that for the EHR data, meaning maybe we should just ask the patients, or at a minimum, maybe we should ask the patients whenever we use EHR data, because asking them can help increase the quality of the data. Together, huge benefit for us. And you know what, nobody we called, of all those six hundred and eleven people in Arkansas and North Carolina, nobody we called was offended that we, you know,

11:13

said that there was a discrepancy in their data. They were engaged in helping us figure it out, and nobody that we called, even for consent to participate in the broad study of the fifty nine hundred people, nobody said, wait, you have my EHR data? What do you mean you're going to use my EHR data? The North Carolina people were prospectively consented for use of EHR data, and the Arkansas people were consented for use of EHR data at

11:39

the time of their self report collection. And so even in North Carolina, when that initial consent could have been seven years before, nobody came back and said, what? I didn't consent to that. I don't remember that. For hospitalization, neither source alone demonstrated good sensitivity. So for a hospitalization outcome, we need another option. And the last sort of conclusion, and I put the reference for the PCORI technical report for the seven thousand patient study. It's out

12:11

on the web, the full report. So there's a lot of detail out there. But one of the recommendations from the report echoes that which has been said elsewhere, that the sensitivity and specificity of data in a multi center study from EHRs really needs to be assessed in each center. We saw a huge difference in the two sites. You remember the red bar in the Where's Waldo: huge difference. And when you pool data all together, combining the data all together can just wash out any of that

12:43

and give you a very wrong answer. So then, what does it mean for the use of EHR data to support integration of healthcare and research? And what about all those unanswered questions that we hoped using the EHR data was going to make a little easier for us to answer? Well, in comes example

13:04

number two, completely different situation. The first was taking EHR data that's already been collected in routine care and looking back over it, using it for some totally different purpose.

Speaker 2

13:16

This next is.

Speaker 1

13:17

Something that's been proposed and pursued in clinical research trials in particular, but broader than trials is taking the data in the context of.

Speaker 2

13:26

A structured protocol, where.

Speaker 1

13:28

A site happens to enter the data in the EHR as the source, so the original documentation of the data. Given that the study is done in the context of a structured clinical protocol, maybe those data are higher quality because there is that structure there. So we did this in two oncology studies conducted in the United States at one site. It happened to be our site in Texas, thanks to a group and software known as nCartes, or nCoup. And these two studies were SWOG

14:02

studies. SWOG was formerly known as the Southwest Oncology Group. It's an NIH funded cancer cooperative group.

Speaker 2

14:09

The next the.

Speaker 1

14:10

Single study was a study conducted at Sheba Medical Center in Israel, so outside the US. It didn't even use the Health Level Seven FHIR interoperability standards, so a very different way to even access the data in the electronic health records. So what we found is, in all cases in the three separate studies, we measured a zero percent error rate. Literally found no errors in the data.
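
A measured error rate of zero still comes with uncertainty that depends on how many data values were checked. The sketch below shows one standard way to bound it: the exact one-sided upper confidence limit when zero errors are observed in n values, which is well approximated by the "rule of three" (3/n) at 95%. The talk does not report the number of values checked or the method used, so the n values here are purely hypothetical.

```python
def zero_error_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper confidence limit for an error rate when 0 errors are seen in n values.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return 1 - (1 - confidence) ** (1 / n)


# Hypothetical sample sizes; more values checked gives a tighter bound.
for n in (100, 1000, 10000):
    print(n, zero_error_upper_bound(n), 3 / n)
```

The bound shrinks roughly in proportion to 1/n, so even a small audit with no errors found can rule out error rates in the ten to twenty percent range.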

14:40

And it was mainly labs, meds, vital signs, this simple kind of common data that was mapped, not the full study CRF.

Speaker 2

14:48

And even in.

Speaker 1

14:49

Just the few patients that we used, really reasonable confidence intervals, and confidence intervals that don't overlap. The error rate, the real error rate, as in all-cause errors, all errors counted in the data compared to an adjudicated gold standard, is very different than the EDC error rates.

Speaker 2

15:08

So there's an indication there.

Speaker 1

15:10

Granted, these are observational data quality studies themselves, but there's an indication that actually using the data from the EHR that's collected in the context of a structured protocol can better the data quality that we get by today's standards. So comparing the examples: we just did that, so I'll skip that and save more time for questions, but it's there in the slides if you need it.

Speaker 2

15:38

Some closing thoughts.

Speaker 1

15:39

First, EHR-to-EDC, in what we've seen so far, is associated with a better error rate. The program that we're doing this under is called ACE-RWD. It's Ancillary Studies to Evaluate Real World Data quality, and we're working with sponsors and technology providers who are willing to have their

Speaker 2

15:59

Data independently evaluated.

Speaker 1

16:01

They send us all the discrepancies, we adjudicate them one by one, go through them, call the site, all sorts of stuff, have the site look back in the EHR, so really an independent evaluation, and then we adjudicate a gold standard based on that full medical record review at the site by the site study coordinator.

Speaker 2

16:21

And so in doing.

Speaker 1

16:22

That, ACE is just starting to generate this look at what we can achieve with integration of care and research at a site, where we're reducing what's a huge burden for a site in terms of data collection. Hopefully we're reducing the "I got stuck twice" phenomenon for our

16:41

patients in clinical studies. So, making sort of the final point: it's not one size fits all, it's not all equal, right? When you compare use of EHR data retrospectively, where it was collected in the course of routine care outside the context of a clinical study, it's completely different from when the site has the context of that clinical study protocol and may even have aspects of that protocol implemented

17:07

in their EHR. And as we as sites all get better and more facile with our EHR systems, that aspect is going to get even better. So we have a ways to go to catch up with Denise and what she's implemented at Duke with respect to that. The last thing I'll mention is the ACE-RWD program is ongoing.

Speaker 2

17:25

We've got two years left in it.

Speaker 1

17:27

We're always happy to do these independent assessments or work with somebody that's got both to do an independent assessment

17:35

of the data. We're doing this so that technology providers, sponsors, regulators, everybody has a shared pool of information that's cross site, cross therapeutic area, cross study, so that they can have much better information about what the actual quality of the data is moving forward, and that we can get best practices out of how to implement this and keep the errors low. And, uh, Thursday, or I guess it's actually

18:04

Wednesday afternoon, sorry, we'll talk more about how we actually measured it and measurement methods, if folks are interested in doing that. Did I mention ACE is still recruiting? I'm just saying. I do, in the interest of fully disclosing there: this work is really hard to get funded, right, because very few people care about data quality at this little detail level and want to pay extra money to

18:31

get the studies done. So huge shout out to PCORI for the funding for the older study; NCI, National Institutes of Health, for the cancer study; our Cancer Institute at Texas; the Burroughs Wellcome Fund, which funds the Coordinating Center for the ACE-RWD program; nCoup and the nCartes platform for EHR-to-EDC; and Yonalink for their platform and in-kind contribution of the data. Huge thank you to these guys. I'm happy for questions if we have time.

Speaker 2

19:10

We do have time.

Speaker 1

19:11

For one or two questions.

Speaker 3

19:13

Go ahead. Thank you for that wonderful study and presentation. One thing that I would add, if I might, to the general topic of EHR data for research is the insecurity of EHRs, the most hacked and breached data sources

19:37

in the United States. And I understand that the Office of the National Coordinator is promoting TEFCA, the Trusted Exchange Framework and Common Agreement program, as a way of making more EHR data available, using tokens presumably to link, to put together the pieces of patient data, so you have supposedly single individual data for research purposes, for pandemic use, for major

20:08

healthcare uses. And yet what they're really doing is cascading the error that can come in, not just the kind of errors that you were talking about, but from actual hacking. And I think the White House Office of National Security does have a major concern about that as a national security risk, because TEFCA is really just expanding the contact surface and the opportunity for bad

20:39

actors to throw wrong data in. But what could be happening even now, as I understand it, is that bad actors are doing precisely that: they can get in and, without anyone knowing it, just alter data here and there, and so researchers have really no sense of what they're... It might seem accurate, but anyway, that's just another concern: the whole security question.

Speaker 1

21:08

So, yes and no. I mean, definitely there are those of us who have been hacked who don't even know it yet, and there are hundreds of thousands of hacking tries, if the number's that low, across our country today in EHR systems. I will say, though, that the hacker's primary interest is selling the data or getting funds for the data, and they don't get much for going in and playfully changing values in our EHRs. It's

21:34

when they sell the identifiers on the record. So there's a lack of a financial motivation so much from.

Speaker 3

21:43

Concern of state actors deliberately trying to create problems in the United States.

Speaker 1

21:50

Yep. Well, those of us that haven't been hacked, unfortunately, will, uh, deal with it in the future and, you know, be hacked at some point, most of us.

Speaker 4

22:02

So thank you.

Speaker 2

22:03

That was great.

Speaker 4

22:04

So my question relates to practice based research networks, and I just got a big thirty million dollar fund we're all going after.

Speaker 2

22:14

My question really.

Speaker 4

22:16

Relates to the quality of data in clinical practice sites: family medicine, internal medicine, primary care, general pediatrics. What we've found with our PBRN is that smaller sites that aren't necessarily part of a larger health system actually need help understanding how to make data driven decisions. And so if data is not all that accurate going into their electronic health record to start with, and we're trying to help them build capacity for data driven decisions as part of

22:58

building research capacity, do you have recommendations for those of us who are coordinating and administering large networks of primary care practices to help them develop capacity for data quality?

Speaker 1

23:16

Yeah, so that is a really big issue. The PBRNs and PBRN-like sites that I've worked with, instead of saying, sure, Meredith, we'll send you that, let me get you in the queue, they say, Meredith, if you want that data, come down here yourself and get it. You're going to have to email it to yourself. Now, not that we would email it, but you get the picture. It's more of a self service kind of thing, because usually their EHR is hosted and someone else does

23:47

all the work on it for them. Many of those sites wouldn't know how to go in and extract the data if they had to. The other thing I'll say is that there's sort of a fundamental principle in data quality that information that's used is of better quality. So your point about, if we can just get them the Kool-Aid and get them hooked on the power of being able to make data driven decisions, then this

24:13

will be won. The case of data quality will be won. Unfortunately, it takes an awful lot of work to get the skills in data analytics, not just framing the questions, not just programming, not just formatting the reports, not just formatting the data visualizations, or even getting to tuning an AI algorithm.

Speaker 2

24:38

Heaven forbid.

Speaker 1

24:40

That's a lot of skill, and those skills are much, much less common way out in the community. And in today's world, it's a hand to hand combat thing. It's one site at a time, helping them figure out what data they themselves can get out of their EHR, and what questions they have, to better manage their facility or care quality, that they can answer based on the data that they have. Thanks for the question. We're

25:08

at time, so I'm just gonna... Nope, you can ask your question, and I'm just going to invite you to condense it as much as possible.

Speaker 2

25:16

Then we'll move on to our next.

Speaker 5

25:18

Sure, this was fantastic and disturbing at the same time.

Speaker 2

25:23

Thank you.

Speaker 5

25:23

I think yours was great. I guess I'm just interested in hearing your thoughts on our industry's latest shiny penny, and that is the impact of AI on such a data set.

Speaker 1

25:36

Yeah, sure, so you know, because that's a short question. Yeah, what I'll say, and try to be short, is that the penny is really not so shiny, and AI, like any algorithm, takes work and takes proving and takes ongoing human monitoring. When we do the FHIR-based data extraction, there's a study coordinator sitting there in the middle that looks at it and says, oh, yeah, okay, that was the data from that visit, send it through. The same thing

26:03

With the AI that we're working with for adverse event identification or fact extraction out of unstructured data, we make it a human in the loop and supervised learning, where the human looks at the AI ongoing, and once you're confident and the data source is stable, you can back off a little and go to just things that are at a lower confidence. But yeah, it's not the shiny bullet. Short as I could be, sorry. Perfect, thank you.
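
The human-in-the-loop approach described in this answer can be sketched as a simple routing rule. Everything below is an illustrative assumption rather than the speaker's system: the `Extraction` fields, the 0.9 threshold, and the `review_all` switch are hypothetical. The idea is that a reviewer sees everything at first, and later only the items the model is less sure about.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extraction:
    record_id: str
    value: str
    confidence: float  # model-reported score in [0, 1]; hypothetical field


def route_for_review(
    extractions: List[Extraction], threshold: float = 0.9, review_all: bool = False
) -> Tuple[List[Extraction], List[Extraction]]:
    """Split extractions into (auto-accepted, needs human review).

    review_all=True models the early phase in which a human checks every item.
    """
    accepted: List[Extraction] = []
    needs_review: List[Extraction] = []
    for item in extractions:
        if review_all or item.confidence < threshold:
            needs_review.append(item)
        else:
            accepted.append(item)
    return accepted, needs_review


# Made-up extractions for illustration.
batch = [
    Extraction("r1", "nausea, grade 1", 0.97),
    Extraction("r2", "possible rash", 0.55),
]
accepted, needs_review = route_for_review(batch)
print([e.record_id for e in accepted], [e.record_id for e in needs_review])
```

Keeping `review_all` as an explicit switch makes the decision to back off from full review a deliberate one, revisited if the data source changes.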

Transcript source: Provided by creator in RSS feed
