VA, ORNL and Harvard develop new method to identify complex medical relationships

Newswise – A team of researchers from the Department of Veterans Affairs, Oak Ridge National Laboratory, Harvard T.H. Chan School of Public Health, Harvard Medical School and Brigham and Women’s Hospital has developed a novel machine learning-based technique for exploring and identifying relationships between medical concepts using electronic health record data from multiple health care providers.
The method, called Knowledge Extraction via Sparse Embedding Regression, or KESER, was recently published in Nature Digital Medicine. The process integrates data from electronic health records from two leading institutions – the VA and Boston-based Partners Healthcare – and provides automated feature selection that leads to phenotype identification algorithms and knowledge discovery.
“KESER provides a high-level view of the relationships between clinical insights that we can’t always see when caring for patients at the individual or group level,” said Dr. Katherine Liao, KESER principal investigator at VA Boston and associate professor of medicine at Harvard Medical School. “We look forward to translating the study’s methods and results from applications in clinical research to advances in clinical care.”
The project is part of the groundwork on phenomics led by Drs. Kelly Cho and Mike Gaziano of VA Boston and Harvard as part of VA’s Million Veteran Program, or MVP, a “national research program to learn how genes, lifestyle, and military exposures affect health and disease,” according to the VA Office of Research and Development MVP website.
In 2016, ORNL began collaborating with VA on MVP-CHAMPION, a big data initiative under the MVP program, to create a large precision medicine platform to house the VA’s vast medical records dataset, made up of records for some 24 million veterans. To strengthen cross-cutting innovation in support of many research projects under this joint VA-DOE program, ORNL worked closely with the MVP Data Core from VA Boston and Harvard to identify specific research areas to pursue. Among these was an effort to answer the question: what elements do we need to find in electronic health records to correctly identify a given phenotype?
Working with what they believe to be the largest cohort of health data used for this type of research in the United States, the team set out to automate the identification of phenotypic relationships while providing visibility into the factors underlying the machine learning assumptions and decision process.
To do this, they designed and built the KESER methodology in four steps: converting the data into a structured format, constructing a low-dimensional vector representation of each medical code, selecting features and assigning them importance, and mapping the resulting relationships as a network.
Data processing and representation learning
ORNL played a key role in the tedious but essential work of processing and structuring a variety of medical data – procedures, diagnoses and patient measurements, as well as doctors’ notes, prescription information and more – for millions of patients across the VA and Partners Healthcare.
“There’s a lot of unstructured data processing that takes place before you end up with structured information that can be put into statistical methods,” said Edmon Begoli, head of ORNL’s AI Systems Section and principal researcher on the MVP-CHAMPION project. “The team spent years working on the data to get it into a state where we could start using it for research.”
With the processed data, the team built a co-occurrence matrix, made up of over 100,000 event types, or healthcare codes – essentially a massive, but sparse, data table with one row and one column for each possible health care code. Each co-occurrence in time between two events helps create a clearer and more detailed picture of a given phenotype.
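As a rough illustration of what such a table holds, the sketch below counts how often two codes appear close together in time for the same patient. It is a minimal sketch, not the team's pipeline: the function name `cooccurrence_counts`, the 30-day window, the code labels and the toy records are all assumptions made for this example. Storing only the nonzero pairs in a dictionary mirrors the sparsity of the full table.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(patient_events, window_days=30):
    """Count code pairs that occur within `window_days` of each other.

    patient_events maps a patient id to a list of (day, code) tuples.
    The result is keyed by sorted (code_a, code_b) pairs and carries
    no patient ids, only aggregate counts.
    """
    counts = Counter()
    for events in patient_events.values():
        ordered = sorted(events)  # chronological order, so day_b >= day_a
        for (day_a, code_a), (day_b, code_b) in combinations(ordered, 2):
            if code_a != code_b and day_b - day_a <= window_days:
                counts[tuple(sorted((code_a, code_b)))] += 1
    return counts


# Toy, made-up records (hypothetical codes, not real patient data).
records = {
    "p1": [(0, "RA_dx"), (5, "methotrexate"), (200, "flu_shot")],
    "p2": [(10, "RA_dx"), (12, "methotrexate")],
    "p3": [(3, "flu_shot"), (4, "RA_dx")],
}
counts = cooccurrence_counts(records)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Only pairs that actually co-occur take up space, which is what makes a table with more than 100,000 rows and columns workable.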
Leveraging ORNL’s big data infrastructure and expertise in scientific computing – essential when working at this scale of data – the team worked to automate data pre-processing and make the process publicly accessible.
“A researcher or institution can download the code, store their data in the correct format, and our process will perform all the necessary steps to integrate their data with everyone else’s,” said Everett Rush, ORNL researcher and the project’s lead data engineer.
The research team took great care to protect patient privacy throughout the project. The team processed all VA data inside ORNL’s secure Protected Health Data infrastructure. After aggregating it to an anonymous summary level, they shared it with Harvard and other collaborators. The resulting KESER matrix retains no connection to individual patients.
“There’s no way to trace the end results back to an individual patient because they’re aggregates,” said Dallas Sacca, ORNL’s senior solutions engineer. Sacca manages the protected health data enclave at ORNL and reviews each piece of data to ensure it meets HIPAA guidelines for anonymization before allowing it to leave the enclave.
Knowledge extraction
The matrix is full of anonymized information about this huge patient cohort that can be probed with different methods, such as KESER, to gain new insights into human health. Using a series of modern statistical methods, the team transformed summary data into vectors, tuned a model that encodes the relationships among the vectors, and extracted the most important features and feature weights for each phenotype.
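The general idea of turning co-occurrence counts into vectors and then picking out a small set of weighted features can be sketched as follows. This is an illustration under stated assumptions, not the published KESER procedure: it uses positive pointwise mutual information (PPMI) rows as the vectors, skips the low-dimensional reduction step entirely, and selects features with a plain lasso (L1-penalized) regression fit by coordinate descent. The function names, code labels, counts and penalty value are all hypothetical.

```python
import math


def ppmi_vectors(pair_counts, codes):
    """Build one positive-PMI row vector per code from unordered pair counts.

    Assumes `codes` lists every code that appears in `pair_counts`.
    """
    sym = {}
    for (a, b), n in pair_counts.items():
        sym[(a, b)] = sym.get((a, b), 0) + n
        sym[(b, a)] = sym.get((b, a), 0) + n
    total = sum(sym.values())
    row_sum = {c: sum(sym.get((c, o), 0) for o in codes) for c in codes}

    def ppmi(a, b):
        n = sym.get((a, b), 0)
        if n == 0:
            return 0.0
        return max(0.0, math.log(n * total / (row_sum[a] * row_sum[b])))

    return {a: [ppmi(a, b) for b in codes] for a in codes}


def soft_threshold(x, lam):
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso(columns, y, lam, n_iter=200):
    """Coordinate descent for 0.5*||y - sum_j w[j]*columns[j]||^2 + lam*||w||_1."""
    w = [0.0] * len(columns)
    residual = list(y)  # y minus the current prediction
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            z = sum(v * v for v in col)
            if z == 0.0:
                continue
            # Correlation of column j with the residual, leaving w[j] out.
            rho = sum(v * (r + w[j] * v) for v, r in zip(col, residual))
            new_w = soft_threshold(rho, lam) / z
            delta = new_w - w[j]
            if delta:
                residual = [r - delta * v for r, v in zip(residual, col)]
                w[j] = new_w
    return w


# Toy, made-up counts (hypothetical codes, not real data).
codes = ["RA_dx", "methotrexate", "rf_lab", "flu_shot", "statin"]
toy_counts = {
    ("RA_dx", "methotrexate"): 40,
    ("RA_dx", "rf_lab"): 30,
    ("methotrexate", "rf_lab"): 20,
    ("RA_dx", "flu_shot"): 5,
    ("flu_shot", "statin"): 25,
}
vectors = ppmi_vectors(toy_counts, codes)
target = "RA_dx"
candidates = [c for c in codes if c != target]
weights = lasso([vectors[c] for c in candidates], vectors[target], lam=0.05)
selected = {c: round(w, 3) for c, w in zip(candidates, weights) if w != 0.0}
print(selected)
```

The L1 penalty drives the weights of uninformative codes to exactly zero, so the codes left in `selected` serve as a short, weighted feature list for the target code. A larger `lam` yields a shorter list.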
“These statistical methods, which include graphical Gaussian models for sparse modeling of covariance structures, are particularly capable of assigning importance that exposes potential causal relationships, a concept with which AI technology, like deep learning, tends to struggle,” said George Ostrouchov, senior researcher at ORNL and chief statistician of the MVP-CHAMPION project.
After running the KESER method, the team selected eight phenotypes – including depression, rheumatoid arthritis and ulcerative colitis – to explore. Using the features selected by KESER, they trained models to identify phenotypes of interest.
Future research
The possibilities offered by KESER’s new ability to anonymize, integrate and analyze data from multiple healthcare facilities seem limitless.
Tianxi Cai, professor of biomedical informatics at Harvard Medical School and a principal investigator of KESER, said, “We are excited to have a highly scalable approach that can handle arrays an order of magnitude larger than what we are currently working with.”
The team is already integrating more clinical descriptors into knowledge graphs. Additionally, the team has begun to explore knowledge graphs to better understand emerging diseases.
“In a situation like COVID, for example, where everyone needs to share data and we need to start investigating all the different things that are related to this specific disease, you could potentially do that with this system,” said Chuan Hong, assistant professor at Duke University, who led research on the KESER project as an instructor at Harvard last year. “It’s essentially plug-and-play; you access the data warehouse, follow the four-step process and integrate your results directly.”
The potential for future collaboration and discovery may be the project’s greatest success. “This innovation will facilitate multi-center collaborations,” the team wrote in Nature, “and bring the field closer to the promise of creating distributed networks for learning across institutions while maintaining patient privacy.”