Modeling family disease risk in EHRs with graph neural networks

Family history is known to be an important risk factor for many diseases, yet machine learning methods for electronic health record (EHR) data have left the complex relational structure of family health data beyond first-degree relatives largely unexplored. In research carried out at the Finnish Center for Artificial Intelligence (FCAI) and the Institute for Molecular Medicine Finland (FIMM), and published in the proceedings of the Machine Learning for Healthcare (MLHC) conference, my colleagues and I examine the prediction of disease risk from family history, recast as a geometric deep learning problem.

Finland has outstanding health data infrastructure, with national health and socio-demographic registries covering the Finnish population. The de-identified data is made available to researchers in a secure computing environment. For this project, we used a registry containing family relationships to construct a pedigree graph connecting over 7 million individuals and their first- to third-degree relatives. The result is a large network with over 66 million unique family connections.
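To make the idea of a pedigree graph concrete, here is a minimal Python sketch. It is not the pipeline used in the project: the `PARENT_LINKS` records, the helper names and the toy family are all hypothetical, and it assumes the registry can be reduced to (child, parent) pairs. It also simplifies by treating anyone who shares at least one parent as a full sibling and by only drawing edges from the target person to each relative.

```python
from collections import defaultdict

# Hypothetical toy registry of (child_id, parent_id) links; not real data.
PARENT_LINKS = [
    ("anna", "maria"), ("anna", "juha"),
    ("otto", "maria"), ("otto", "juha"),
    ("maria", "helmi"), ("liisa", "helmi"),
    ("ville", "liisa"),
]


def build_pedigree(links):
    """Index the links in both directions: child -> parents, parent -> children."""
    parents, children = defaultdict(set), defaultdict(set)
    for child, parent in links:
        parents[child].add(parent)
        children[parent].add(child)
    return parents, children


def step(people, relation):
    """Follow one parent (or child) link from every person in `people`."""
    found = set()
    for person in people:
        found |= relation.get(person, set())
    return found


def relatives_by_degree(person, parents, children):
    """Group relatives into first, second and third degree.

    Simplifications: anyone sharing at least one parent counts as a full
    sibling, and the pedigree is assumed to contain no inbreeding loops.
    """
    par = step({person}, parents)
    kids = step({person}, children)
    sibs = step(par, children) - {person}
    grandpar = step(par, parents)
    grandkids = step(kids, children)
    aunts_uncles = step(grandpar, children) - par
    nieces_nephews = step(sibs, children)
    great_grandpar = step(grandpar, parents)

    first = par | kids | sibs
    second = grandpar | grandkids | aunts_uncles | nieces_nephews
    third = (
        great_grandpar
        | step(grandkids, children)                    # great-grandchildren
        | (step(great_grandpar, children) - grandpar)  # great-aunts/uncles
        | step(nieces_nephews, children)               # grand-nieces/nephews
        | step(aunts_uncles, children)                 # first cousins
    )
    return {1: first, 2: second - first, 3: third - second - first}


parents, children = build_pedigree(PARENT_LINKS)
anna_relatives = relatives_by_degree("anna", parents, children)

# Edge list for anna's family graph, with degree of relatedness as the edge feature.
edges = sorted(
    ("anna", relative, degree)
    for degree, group in anna_relatives.items()
    for relative in group
)
print(edges)
```

In this toy family, the parents and brother come out as first-degree relatives, the grandmother and aunt as second-degree, and the cousin as third-degree. A full-scale version would also need to handle half-siblings and missing parent records, which this sketch ignores.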

Having such detailed data on family health history allows us to go beyond first-degree relatives and identify more complex genetic and environmental risk factors shared within families. To analyze this massive graph dataset we needed a scalable and interpretable method, so we developed a new deep learning approach using graph neural networks and recurrent neural networks.

The key innovation is jointly learning both family-level and individual-level representations for predicting an individual's disease risk. We evaluated our model on five common diseases with known heritable impact and differing genetic architectures, using historical diagnosis code data to predict 10-year risk. We also applied graph explainability techniques to identify the most informative risk factors from an individual's relatives for a given disease.
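The joint-representation idea can be illustrated with a small NumPy sketch. This is a simplified stand-in, not the architecture from the paper or the repository: it assumes static binary diagnosis flags instead of longitudinal records (so there is no recurrent component), uses a single round of mean-aggregation message passing, and runs with random, untrained weights. All names, shapes and inputs are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def family_embedding(node_feats, adj, w_self, w_neigh):
    """One round of mean-aggregation message passing, then mean pooling."""
    degree = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ node_feats) / np.maximum(degree, 1.0)
    hidden = relu(node_feats @ w_self + neigh_mean @ w_neigh)
    return hidden.mean(axis=0)  # one vector for the whole family graph


def predict_risk(patient_feats, node_feats, adj, params):
    """Concatenate family-level and patient-level vectors, then classify."""
    fam = family_embedding(node_feats, adj, params["w_self"], params["w_neigh"])
    ind = relu(patient_feats @ params["w_patient"])
    joint = np.concatenate([fam, ind])
    return float(sigmoid(joint @ params["w_out"] + params["b_out"]))


# Toy inputs (all hypothetical): 4 family members, 5 binary diagnosis flags each.
rng = np.random.default_rng(0)
n_nodes, n_feats, hidden = 4, 5, 8
node_feats = rng.integers(0, 2, size=(n_nodes, n_feats)).astype(float)
adj = np.array([  # symmetric adjacency; node 0 is the target patient
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)
params = {  # small random weights, untrained
    "w_self": rng.normal(scale=0.1, size=(n_feats, hidden)),
    "w_neigh": rng.normal(scale=0.1, size=(n_feats, hidden)),
    "w_patient": rng.normal(scale=0.1, size=(n_feats, hidden)),
    "w_out": rng.normal(scale=0.1, size=2 * hidden),
    "b_out": 0.0,
}
risk = predict_risk(node_feats[0], node_feats, adj, params)
print(f"untrained toy risk score: {risk:.3f}")
```

The point of the sketch is the final concatenation: the classifier sees one vector summarising the family graph and one vector for the patient alone, so training can weigh the two sources of information against each other. In a trained model the weights would be fitted to the 10-year outcome labels rather than drawn at random.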

We show that our method outperforms standard clinical approaches and other deep learning methods across several evaluation metrics. We view this work as a starting point for future research on a variety of use cases, including population health analyses of how family history affects disease risk, and tools to help clinicians understand the impact of a patient’s family history in a personalized manner.

Figure 1: Overview of modeling approach. a) Graph inputs contain the EHRs for (up to third-degree) family members as node features and descriptors of genetic relationships as edge features; b) the ML model jointly learns family (graph) and patient-specific representations, which are concatenated in the final layers to classify patient outcomes; c) graph explainability analysis identifies the relatives (nodes) and risk factors/co-morbidities (node features) that the model found important for predicting the patient’s risk.

Figure 2: Neural network architecture diagram, illustrating how representations for supervised classification tasks are learned from both the target patient's EHR and the EHRs of their family members.

Table 1. Results for 5 disease phenotypes, showing that graph-based methods consistently outperform clinical baseline approaches for family history. See the paper for further details and experiments.

Publication details and other resources

See below for links to the full publication, code and video resources. The software developed for this work is available at https://github.com/dsgelab/family-EHR-graphs. More information about the dataset used in this work can be found at https://www.finregistry.fi/. Due to the sensitive nature of the data, we do not share the original dataset in the code repository, but we do share code for creating synthetic data that models the key properties of the real data.

Journal/conference publication

Sophie Wharrie, Zhiyu Yang, Andrea Ganna, Samuel Kaski. (2023). Characterizing personalized effects of family information on disease risk using graph representation learning. Proceedings of the 8th Machine Learning for Healthcare Conference, New York, USA, in Proceedings of Machine Learning Research (PMLR), 219:824-845. https://proceedings.mlr.press/v219/wharrie23a.html



© 2024 Sophie Wharrie