Research

My PhD research develops advanced probabilistic machine learning and deep learning methods for the health and biological domains.

Research background

Artificial intelligence (AI) has advanced rapidly in recent years, driven by innovations in machine learning and deep learning algorithms, computing hardware and software, and data availability. My academic research focuses largely on machine learning for health, biology, and personalised medicine. These domains offer great potential for AI applications, yet even with larger datasets, the inherent characteristics of the data and the complex needs of computational biology, bioinformatics, and digital health applications call for new approaches. Collaborating with international teams and using large-scale datasets such as the UK Biobank, Finnish national health registry data, and the FinnGen project, I develop probabilistic ML and deep learning methods aimed at building robust models that can handle the complexities of real-world data and applications.

My recent projects include:

  • Synthetic Data for Personalised Medicine Research: Introduced an ML framework for generating synthetic but realistic datasets for genotypes and phenotypes, collaborating with the Europe-wide INTERVENE consortium to enable researchers to test new computational methods while protecting sensitive health information.

  • Geometric Deep Learning for Family Networks: Developed a novel deep learning approach using graph representation learning to predict health outcomes in families, working with the Institute of Molecular Medicine Finland on nationwide electronic health records (EHRs) and family networks for over 7 million patients.

  • Meta-learning for Health Record Tasks: Studied meta-learning approaches that "learn how to learn" from related machine learning tasks, and examined how causal task similarity affects the generalizability and negative transfer properties of these algorithms in health-related prediction tasks. This work included a stroke prediction case study using the UK Biobank and FinnGen datasets.

Working in collaboration with

Research publication list

Preprint

Sophie Wharrie, Samuel Kaski, Meta-Learning With Hierarchical Models Based on Similarity of Causal Mechanisms, arXiv preprint, 2024, https://arxiv.org/abs/2310.12595

Journal/conference publication

Sophie Wharrie, Zhiyu Yang, Andrea Ganna, Samuel Kaski, Characterizing personalized effects of family information on disease risk using graph representation learning, Proceedings of the 8th Machine Learning for Healthcare Conference, Proceedings of Machine Learning Research (PMLR), 219:824-845, New York, USA, 2023, https://proceedings.mlr.press/v219/wharrie23a.html

Journal/conference publication

Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O’Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna, HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes, Bioinformatics, Volume 39, Issue 9, September 2023, https://doi.org/10.1093/bioinformatics/btad535

Journal/conference publication

Sophie Wharrie, Lamiae Azizi, Eduardo G. Altmann, Micro-, meso-, macroscales: The effect of triangles on communities in networks, Physical Review E, Volume 100, Issue 2, August 2019, https://link.aps.org/doi/10.1103/PhysRevE.100.022315

Workshop paper

Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O’Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna, HAPNEST: An efficient tool for generating large-scale genetics datasets from limited training data, NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, New Orleans, USA, 2022

Journal/conference publication

Remo Monti, Lisa Eick, Georgi Hudjashov, Kristi Läll, Stavroula Kanoni, Brooke N Wolford, Benjamin Wingfield, Oliver Pain, Sophie Wharrie, Bradley Jermy, Aoife McMahon, Tuomo Hartonen, Henrike O Heyne, Nina Mars, Genes & Health Research Team, Kristian Hveem, Michael Inouye, David A van Heel, Reedik Mägi, Pekka Marttinen, Samuli Ripatti, Andrea Ganna, Christoph Lippert, Evaluation of polygenic scoring methods in five biobanks reveals greater variability between biobanks than between methods and highlights benefits of ensemble learning, The American Journal of Human Genetics, 2024

© 2024 Sophie Wharrie