Assessing the fitness for use of real-world electronic health records and log data with and without the application of privacy preserving technologies

Author: Thomas, Jason
Contributor: Wilcox, Adam
Publisher: My University
Date: 2021-10-29
Keywords: data quality; data sharing; data utility; electronic health records; fitness for use; synthetic data
Subjects: Bioinformatics; Computer science; Public health; Biomedical and health informatics
Language: en-US
Type: Thesis
File: Thomas_washington_0250E_23471.pdf (application/pdf)
URI: http://hdl.handle.net/1773/47880
License: CC BY
Note: Thesis (Ph.D.)--University of Washington, 2021

Over the past decade, electronic health record (EHR) adoption has led to an explosion in the volume of electronic health record and log data, followed by efforts to harness the potential of these data for knowledge discovery (KD) and quality improvement (QI). In parallel, recent gains in artificial intelligence have produced powerful methods to analyze, use, and even create synthetic data, which are generated by a computer algorithm yet statistically or mathematically reflect real data. However, limitations in data utility (e.g. bias, data quality, comprehensiveness) and accessibility (e.g. privacy, interoperability, availability), as well as limited means to measure and manage tradeoffs between the two, are significant barriers to using these data effectively. Determining whether data are suitable for a specific analysis or context, known as “fitness for use”, is neither included in current frameworks for general health record data quality characterization nor evaluated by data quality assessment (DQA) tools. Use of EHR log data for QI and KD is particularly unrefined because validated standards and methods are absent. Thus, users of electronic health record and log data remain uninformed about the baseline fitness for use of their data and cannot effectively assess the subsequent tradeoffs between utility and privacy when applying privacy preserving technologies.
To address these challenges, we sought to assess the fitness for use of electronic health record and log data, both synthetic and real, across three use cases. First, we 1) developed a framework for data utility assessment of electronic health records, then 2) adapted open-source tools to implement this framework and applied them to assess the utility of real and synthetic EHR data for observational research related to COVID-19 and future influenza pandemics. Second, we evaluated whether synthetic data derived from a national COVID-19 dataset could be used for geospatial and temporal epidemic analyses. To do so, we conducted replication studies and computed general summary statistics on the original and synthetic data, then compared the results between the two datasets. Third, we conducted a retrospective, observational analysis, with and without privacy preserving technology, of clinical workstation authentication behaviors from the UW Medicine health system to inform customized solutions that balance usability and security. The three use cases advance our understanding of 1) the fitness for use of varied electronic health record and clinical workstation log data with and without privacy preserving technologies and 2) methods to conduct these assessments. As the use of synthetic data rises, so will the importance of fitness for use assessments on both original and synthetic data. Synthetic data that are broadly distributed will reach users with less expertise than those who have access to the original data.
Thus, in addition to helping those who create synthetic data manage tradeoffs, fitness for use assessments will give end-users of synthetic data guidance on 1) the approximate similarity between the synthetic data and the original data and 2) the overall limitations of the original data. Those limitations carry over to the synthetic data, yet the original data are likely inaccessible to end-users, at least at the time they analyze the synthetic data.