Apr - Yongjin Park¶
Thursday, April 20th, 2023 6:00pm - 9:00pm PT
Location: St. Paul's Hospital (1081 Burrard Street, Vancouver, BC V6Z 1Y6), Cullen Family Lecture Theatre (Providence Building, Level 1, Room 1477)
Featured Speaker: Dr. Yongjin Park
Affiliation: Assistant Professor, Pathology and Statistics, University of British Columbia
Talk Title: Learning deep biology with shallow statistical models
As deep learning approaches gained much popularity, typical statistical learning methods were replaced by deep generative models and black-box classification algorithms in many types of biological data analysis, including genetic variant calling, regulatory genomics, high-dimensional data embedding, missing value imputations, and risk predictions. Despite being readily accessible with general-purpose libraries, so-called deep methods demand long hours of training, specialized hardware resources, and a large amount of data; yet, we are often startled at seeing only marginally improved classification performance, model overfitting, or lack of generalizability. Not undermining important advancements made possible by deep models, this talk will seek to showcase that we can deepen our understanding of biological systems with shallow statistical models.
First, I will discuss our scalable algorithm for probabilistic topic modelling in single-cell genomics data. Based on probabilistic topic assignments in each cell, we identify the latent representation of cellular states and heterogeneity, and the latent topic vectors often yield much-improved clustering results than other types of dimensionality reduction methods. We were initially inspired by several empirical observations: (1) Data sets compressed by repeatedly applying random projection operations highlight cell type-specific signature genes. (2) We also noted that rare cell types are better characterized with lowly-expressed genes (in total data) that are typically removed in quality control steps, thus, not included in embedding model estimations. (3) Bulk sequencing data sets are generally less prone to zero inflation or measurement errors. Based on these key findings, we designed a statistical framework termed ASAP--short for Annotating Single-cell data matrix by Approximate Pseudo-bulk projection) to identify cell topics. ASAP seeks to reduce sample size by interactively collapsing cell-level expression vectors into pseudo-bulk vectors in order to accurately perform non-negative matrix factorization to learn topic-specific gene frequency patterns.
Another example will be a sparse regression model with a causal inference flavour that can effectively handle putative confounding issues in a genetic fine-mapping problem. Our idea is rooted in Rubin's causal inference framework (Rubin and Rosenbaum, 1983), with which non-genetic and indirect genetic effects can be cancelled out in implicit adjustment steps. In order to handle the non-binary nature of exposure variables (genetic dosage), our method first matches individuals with one another based on a covariate similarity matrix. If paired individuals share non-genetic factors, then any gene expression changes between them will be removed so that genetic dosage will become a causal factor in the expression divergence. We implemented the algorithm in the SuSiE framework (Wang et al. 2020).
Overall, this talk will reassure that it is okay to work on a shallow model. In fact, if our scientific goal is straightforward enough to be coded in a simple model armed with intuitive algorithms, such modelling approach can help uncover deeper layers of biological mechanisms underneath high-dimensional genomics data.
I have been an Assistant Professor in the Department of Pathology and Statistics since the late Fall of 2020 and a Scientist at BC Cancer Research. I am generally a Bayesian statistician who is interested in developing causal inference algorithms. I have working experience in statistical genetics, regulatory genomics, epigenomics, and systems biology (network science). I received PhD in Biomedical Engineering at Johns Hopkins (advisor: Joel Bader), MSc in Computational Biology at Carnegie Mellon University (advisor: Russell Schwartz), and BSc in Biology and CSE at Seoul National University.
Trainee Speaker: Hans Ghezzi
Affiliation: PhD Student, Dr. Carolina Tropini Lab, University of British Columbia
Talk Title: PUPpy: an automated pipeline to design taxon-specific primers in microbial communities