Presentation description
In this study, we are validating a clinical data mining algorithm that extracts patient clinicodemographic data related to cancer and metabolic diseases. We randomly selected 100 patient records for data extraction using the random sample function in R. From these records, we extracted select data elements related to diabetes, prediabetes, hypertension, hypertriglyceridemia, and low high-density lipoprotein (HDL) diagnoses. The elements were collected in REDCap, a self-service web-based application for creating and managing databases that is subsidized for University of Utah research needs. For each diagnosis, the data elements included ICD codes, biomarker tests, medication usage, and medical history from clinician notes. Two student researchers independently extracted the data elements for all 100 patients, and we computed percent agreement (p0), expected agreement (pe), and Cohen's Kappa (k). The two abstractors agreed on 98.16% of all extracted variables, with a Cohen's Kappa of 0.962 (p0 = 0.9816, pe = 0.51, k = 0.962), suggesting near-perfect agreement. Overall, 92% of patients had at least one element related to diabetes, the most common being an elevated fasting blood glucose or HbA1c (88%). Forty-seven percent of patients had at least one element related to prediabetes, the most common being a history recorded in provider clinical notes (43%). Ninety-eight percent of patients had at least one element related to hypertension, the most common being elevated blood pressure measurements (94%). Lastly, 88% of patients had at least one element related to dyslipidemia, the most common being a history of high triglycerides (80%). So far, we have observed high agreement in manual data extraction, as well as a high prevalence of diabetes, hypertension, and dyslipidemia in our dataset. The next stage of this investigation is to calculate the agreement between manual data extraction and clinical data mining methods.
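
For readers who want to check the agreement statistics, the following is a minimal Python sketch of how percent agreement, expected agreement, and Cohen's Kappa relate to one another. It is not the study's own analysis code (the description mentions R only for the random sampling step); the function names `cohens_kappa` and `kappa_from_agreement` and the ten toy ratings are hypothetical and serve only as an illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return (p0, pe, kappa) for two raters' labels on the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the two raters match.
    p0 = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance that both raters pick the same category,
    # given each rater's own marginal category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    pe = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return p0, pe, kappa_from_agreement(p0, pe)


def kappa_from_agreement(p0, pe):
    """Cohen's Kappa from observed (p0) and expected (pe) agreement."""
    if pe == 1:
        raise ValueError("kappa is undefined when expected agreement equals 1")
    return (p0 - pe) / (1 - pe)


# Hypothetical toy data: 1 = element present, 0 = element absent.
toy_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
toy_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
toy_p0, toy_pe, toy_kappa = cohens_kappa(toy_a, toy_b)

# Consistency check against the values reported in the description.
reported_kappa = kappa_from_agreement(0.9816, 0.51)
```

Plugging the reported p0 of 0.9816 and pe of 0.51 into `kappa_from_agreement` gives approximately 0.962, which matches the reported Kappa.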