Biostatistics Seminar | March 07, 2022

UTHSC
Department of Preventive Medicine
Biostatistics Seminar

Systems Genetics of Diet Responsivity

DATE & TIME: July 15, 2024, 2:00 PM - 3:00 PM CT
Virtual: Zoom Meeting
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105


Speaker: Dr. Chris Emfinger, Department of Biochemistry, University of Wisconsin-Madison

Abstract: Diabetes, resulting from insufficient insulin secretion to match metabolic demand, affects hundreds of millions of people. Diabetes risk is highly heritable and the majority of single-nucleotide polymorphisms conferring diabetes risk are thought to influence insulin production. Conversely, environmental factors such as obesity and diet can elevate metabolic demand and drive resistance to insulin. Individuals display wide variation in diet responsivity. Consequently, despite many years of research, there is no consensus on a single diet most compatible with health and most useful in preventing or reversing metabolic dysfunction in susceptible individuals. Our lab focuses on understanding the genetic factors driving insulin production and diet responsivity. In contrast to many labs studying metabolism using only a handful of highly inbred mouse strains, we focus on mice with high genetic and phenotypic variation. This allows us to perform genetic screens in well-controlled experiments with high power, identifying novel target genes regulating metabolism, which we can then validate experimentally. This talk will be an overview of our genetics pipeline and some of the techniques we are currently applying to nominate and validate interesting targets.
Bio: https://attielab.biochem.wisc.edu/staff/emfinger-chris/

PAST SEMINARS

Modeling GeoSpatial Data: Turkish Health Studies

DATE & TIME: July 1, 2024, 2:00 PM - 3:00 PM CT
Virtual: Zoom Meeting
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105

Speaker: Dr. Mehmet Koçak, Istanbul Medipol University, Istanbul, Turkey
Abstract: Spatial autocorrelation is a fundamental concept in spatial statistics, describing the degree to which a set of spatial data points are correlated with each other across a geographical space. This presentation provides a comprehensive overview of spatial autocorrelation, discussing its definitions, detection methods, and modeling techniques. Firstly, we delve into the definition of spatial autocorrelation, highlighting its significance in understanding geographical data patterns. We cover various aspects such as self-correlation due to geographical ordering, information content in geo-referenced data, and its role as a diagnostic tool for spatial model misspecification. Next, we explore methods for detecting spatial autocorrelation, focusing on Moran’s I and Geary’s C statistics, which provide formal tests for spatial dependency. The presentation explains how to compute these statistics and interpret their results, supported by visual examples and practical applications. We then shift to modeling spatial autocorrelation, demonstrating the use of SAS procedure PROC SPATIALREG. The SPATIALREG procedure enables the application of various spatial models, including Spatial Auto-regressive (SAR), Spatial Durbin Model (SDM), Spatial Error Model (SEM), etc. We also provide examples of geospatial modeling using sets of Turkish Health Studies, which are national surveys having more than 20,000 participants, considering spatial autocorrelation in both the response and predictor variables. We discuss the generation of spatial weight matrices and the fitting of spatial models to the data, highlighting key findings and model performance metrics.
Bio: https://sabita.medipol.edu.tr/index.php/portfolio-item/prof-dr-mehmet-kocak/

Joint Modeling of Multivariate Nonparametric Longitudinal Data and Survival Data by A Local Smoothing Approach

DATE & TIME: May 6, 2024, 2:00 PM - 3:00 PM CT
Virtual: Zoom Meeting

Speaker: Dr. Lu You, Health Informatics Institute, University of South Florida
Abstract: In many clinical studies, evaluating the association between longitudinal and survival outcomes is of primary concern. For analyzing data from such studies, joint modeling of longitudinal and survival data becomes an appealing approach. In some applications, there are multiple longitudinal outcomes whose longitudinal pattern is difficult to describe by a parametric form. For such applications, existing research on joint modeling is limited. In this paper, we develop a novel joint modeling method to fill the gap. In the new method, a local polynomial mixed-effects model is used for describing the nonparametric longitudinal pattern of the multiple longitudinal outcomes. Two model estimation procedures, i.e., the local EM algorithm and the local penalized quasi-likelihood estimation, are explored. Practical guidelines for choosing tuning parameters and for variable selection are provided. The new method is justified by some theoretical arguments and numerical studies.
Bio: https://usf.discovery.academicanalytics.com/scholar/599629/LU-YOU

Probabilistic methods to identify multi-scale enrichment in genomic sequencing studies

DATE & TIME: April 22, 2024, 2:00 PM - 3:00 PM CT
Virtual: Zoom Meeting

Speaker: Dr. Lorin Crawford, Brown University and Microsoft Research New England
Abstract: A consistent theme of the work done in my lab group is to take modern computational approaches and develop theory that enables their interpretations to be related back to classical genomic principles. The central aim of this talk is to address variable selection questions in nonlinear and nonparametric regression. Motivated by statistical genetics, where nonlinear interactions and non-additive variation are of particular interest, we introduce a novel, interpretable, and computationally efficient way to summarize the relative importance of predictor variables. Methodologically, we present flexible and scalable classes of Bayesian models which provide interpretable probabilistic summaries such as posterior inclusion probabilities and credible sets for association mapping tasks in high-dimensional studies. We illustrate the benefits of our methods over state-of-the-art linear approaches using extensive simulations. We also demonstrate the ability of these methods to recover both novel and previously discovered genomic associations using real human complex traits from the Wellcome Trust Case Control Consortium (WTCCC), the Framingham Heart Study, and the UK Biobank.
Bio: https://www.lorincrawford.com/

Error analysis of generative adversarial network

DATE & TIME: March 28, 2024, 2:00 PM - 3:00 PM CT
Virtual: Zoom Meeting
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105.
Please park in the multi-story parking garage adjacent to the Doctors Office Building, and bring your parking ticket with you so we can validate it.

Speaker: Dr. Hailin Sang, University of Mississippi
Abstract: The generative adversarial network (GAN) is an important model developed for high-dimensional distribution learning in recent years. However, there is a pressing need for a comprehensive method to understand its error convergence rate. In this research, we focus on studying the error convergence rate of the GAN model that is based on a class of functions encompassing the discriminator and generator neural networks. These functions are VC type with bounded envelope function under our assumptions, enabling the application of the Talagrand inequality. By employing the Talagrand inequality and the Borel-Cantelli lemma, we establish a tight convergence rate for the error of GAN. This method can also be applied to existing error estimations of GAN and yields improved convergence rates. In particular, the error defined with the neural network distance is a special case of the error in our definition. This talk is based on a joint project with Mahmud Hasan.
Bio: https://math.olemiss.edu/hailin-sang/

Quality control of GWAS summary statistics

DATE & TIME: February 12, 2024, 10:00 AM - 11:00 AM CT
Virtual: Zoom Meeting

Speaker: Dr. Florian Privé, Aarhus University, Denmark
Abstract: Results from genome-wide association studies (GWAS summary statistics) have been extensively used in different applications such as estimating the genetic architecture of complex traits and diseases, identifying causal variants with fine-mapping, and predicting complex traits with polygenic scores. One reason behind the popularity of GWAS summary statistics is that they are widely available and shared, e.g. in the GWAS Catalog. However, these GWAS summary statistics come with varying degrees of quality, and from many different tools and studies. In this presentation, I go over several different cases that could go wrong when using GWAS summary statistics and what we can do about it. Most of these are still unknown to many people using GWAS summary statistics on a regular basis, which can cause results they derive to be biased or suboptimal. I present and discuss past and current work on how to perform some quality control and ultimately improve the quality of GWAS summary statistics, in order to make best use of them.
Bio: https://privefl.github.io/

Big Data is Low Rank

DATE & TIME: January 22, 2024, 2:00 PM - 3:00 PM CT
Virtual: Zoom Meeting

Speaker: Dr. Madeleine Udell, Stanford University
Abstract: Data scientists are often faced with the challenge of understanding a high dimensional data set organized as a table. These tables may have columns of different (sometimes, non-numeric) types, and often have many missing entries. This talk surveys methods based on low rank models to analyze these big messy data sets. We show that low rank models perform well — indeed, suspiciously well — across a wide range of data science applications, including in social science, medicine, and machine learning. This good performance demands (and this talk provides) a simple mathematical explanation for their effectiveness, which identifies when low rank models perform well and when to look beyond low rank.
Bio: https://web.stanford.edu/~udell/bio.html

Generalized kernel machine regression

DATE & TIME: November 13, 2023, 2:00 PM - 3:00 PM CT
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105.
Please park in the multi-story parking garage adjacent to the Doctors Office Building, and bring your parking ticket with you so we can validate it.
Virtual: Zoom Meeting

Speaker: Dr. Xichen Mou, University of Memphis
Abstract: Kernel Machine Regression (KMR) serves as a nonparametric regression approach fundamental in numerous scientific domains. By utilizing a map determined by the kernel function, KMR transforms original predictors into a higher-dimensional feature space, simplifying the recognition of patterns between outcomes and independent variables. KMR is invaluable in studies within the biomedical and environmental health sectors, where it aids in identifying crucial exposure points and gauging their impact on results. In our study, we introduce the Generalized Bayesian Kernel Machine Regression (GBKMR) which integrates the KMR model within the Bayesian context. GBKMR not only complements the conventional KMR but also suits a range of outcome data, from continuous to binary and count data. Simulation studies confirm GBKMR's superior precision and robustness. We further employ this method on a real data set to pinpoint specific cytosine phosphate guanine (CpG) locations correlated with health-related outcomes or exposures.
Bio: https://www.memphis.edu/publichealth/contact/faculty_profiles/mou.php

A Clustering Approach to Non-Equal Length Joint Pattern Genetic and Epigenetics Factors Weighted by Covariates

DATE & TIME: October 30, 2023, 2:00 PM - 3:00 PM CT
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105.
Please park in the multi-story parking garage adjacent to the Doctors Office Building, and bring your parking ticket with you so we can validate it.
Virtual: Zoom Meeting

Speaker: Joseph (Kip) Handwerker, University of Tennessee Health Science Center and University of Memphis
Abstract: Clustering analysis is a popular approach to gaining insight into the structure of data, especially on a large scale. Some of the most popular approaches are the K-means and K-prototype algorithms, which are partitioning methods that use distance measures to assign groups. While these methods are good, especially for large datasets, when it comes to genetics data they fail to consider potential joint effects and require the same dimensionality across variables. The Vector in Partition (VIP) algorithm fills this gap with a distance measure designed to partition genetic and epigenetic data with non-equal length dimensions; specifically, gene expression (GE), DNA methylation (CpG), and single nucleotide polymorphisms (SNP). The VIP extension method extends this framework by adding another layer of complex joint effects of genetic and epigenetic data with other potential health-related variables to dictate clustering. The extension algorithm performs well on simulated data when the clustering of the covariates follows the same clustering scheme as the genetics data. Like other distance measures, when the data does not follow a clear clustering scheme, the algorithm tends to underperform, especially against numeric data. The results highlight many aspects of the algorithm’s performance capabilities, as well as multiple areas for future improvements.
Bio: https://kiphandwerker.github.io/

Real-time Linear Mixed Model Implementation for Association Mapping on Large Numbers of Quantitative Traits

DATE & TIME: September 18, 2023, 2:00 PM - 3:00 PM CT
Virtual: Zoom Meeting

Speaker: Zifan (Fred) Yu, University of Tennessee Knoxville
Abstract: Linear mixed models (LMMs) are used widely in genome-wide association studies (GWAS) to account for population structure and genetic relatedness among the study individuals. The advancements of high-throughput genotyping technologies have enabled GWAS to be conducted on large numbers of traits and created a demand for more efficient and scalable implementations of LMMs. We developed a new software package for fast LMM association scans, BulkLMM, that is designed for GWAS of large numbers of quantitative traits with modest sample sizes, which are common when analyzing animal model data. We applied BulkLMM to BXD Individual Liver Proteome data for genome scans on the scale of over 35k traits and 7k markers and obtained real-time (in a few seconds) performance on high-end desktop hardware. BulkLMM provides additional features valuable for performing GWAS, such as permutation testing and the ability to incorporate prior knowledge on the residual variance. Our open-source implementation in the Julia programming language has the combined benefits of high efficiency and easy prototyping, which enable seamless downstream analysis and manipulation.
Bio: https://learningmalanya.github.io/

Inferring Within-Subject Variances From Intensive Longitudinal Data

DATE & TIME: May 22, 2023, 2:00 PM - 3:00 PM CT
Virtual: Zoom Meeting

Speaker: Dr. Hua Zhou, University of California Los Angeles
Abstract: The availability of vast amounts of longitudinal data from electronic health records (EHR) and personal wearable devices opens the door to numerous new research questions. In many studies, individual variability of a longitudinal outcome is as important as the mean. Blood pressure fluctuations, glycemic variations, and mood swings are prime examples where it is critical to identify factors that affect the within-individual variability. We propose a scalable method, within-subject variance estimator by robust regression (WiSER), for the estimation and inference of the effects of both time-varying and time-invariant predictors on within-subject variance. It is robust against the misspecification of the conditional distribution of responses or the distribution of random effects. It performs similarly to correctly specified likelihood methods but is 10³ to 10⁵ times faster. The estimation algorithm scales linearly in the total number of observations, making it applicable to massive longitudinal data sets. The effectiveness of WiSER is illustrated using the accelerometry data from the Women's Health Study and a clinical trial for longitudinal diabetes care.
Bio: https://ph.ucla.edu/about/faculty-staff-directory/hua-zhou

Statistical Frameworks for Longitudinal Metagenomic and Transcriptomic Data

DATE & TIME: April 24, 2023, 2:00 PM - 3:00 PM CT
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105.
Please park in the multi-story parking garage adjacent to the Doctors Office Building, and bring your parking ticket with you so we can validate it.
Virtual: Zoom Meeting

Speaker: Dr. Qian Li, St. Jude Children's Research Hospital
Abstract: Longitudinal sampling has become popular in omics studies, such as microbiome and transcriptome studies. To identify operational taxonomic units (OTUs) signaling disease onset, a powerful and cost-efficient strategy is to select participants by matched sets and profile their temporal metagenomes, followed by trajectory analysis. We proposed a joint model with matching and regularization (JMR) to detect OTU-specific trajectories predictive of host disease status. The between- and within-matched-sets heterogeneity in OTU relative abundance and disease risk were linked by nested random effects. The inherent negative correlation in microbiota composition was adjusted by incorporating and regularizing the top-correlated taxa as longitudinal covariates, pre-selected by Bray-Curtis distance and elastic net regression. For the longitudinal bulk transcriptomes, we propose a statistical framework, ISLET, to infer individual-specific and cell-type-specific transcriptome reference panels. ISLET models the repeatedly measured bulk gene expression data to optimize the usage of shared information within each subject. ISLET is the first available method to achieve individual-specific reference estimation in repeated samples. In the simulation study and an application to a large-scale metagenomic study, JMR outperformed the competing methods and identified important taxa in infants’ fecal samples with dynamics preceding host autoimmune status. We also show outstanding performance of ISLET in the reference estimation and downstream testing of cell-type-specific differentially expressed genes in simulation. An application of ISLET to the longitudinal PBMC transcriptomes in the same study confirms the cell-type-specific gene signatures for early-life autoimmunity.
Bio: https://www.stjude.org/directory/l/qian-li.html

False Discovery Rates for Penalized Regression Models

DATE & TIME: March 27, 2023, 2:00 PM - 3:00 PM CT
LOCATION: Zoom Meeting

Speaker: Dr. Patrick Breheny, University of Iowa
Abstract: Penalized regression is an attractive methodology for dealing with high-dimensional data where classical likelihood approaches to modeling break down. However, its widespread adoption has been hindered by a lack of inferential tools. In particular, penalized regression is very useful for variable selection, but how confident should one be about those selections? How many of those selections would likely have occurred by chance alone? In this talk, I will review recent developments in this area, with an emphasis on my work and that of my recent graduate students.
Bio: https://myweb.uiowa.edu/pbreheny/

Methods for Phylogenetic Networks

DATE & TIME: February 20, 2023, 2:00 PM - 3:00 PM CT
LOCATION: Zoom Meeting

Speaker: Dr. Cécile Ané, University of Wisconsin-Madison
Abstract: Phylogenetic networks and admixture graphs can represent the past history of a group of species or populations and how they diversified. Unlike trees, networks can represent events such as migration between populations, admixture, hybridization between species, or recombination between viral strains. I will give an overview of how phylogenetic networks are used and the difficulties of estimating phylogenetic networks from genome-wide data. Then I will focus on the characterization of what is (or is not) knowable about the network based on genetic distance data.
Bio: https://pages.stat.wisc.edu/~ane/index.html
https://botany.wisc.edu/staff/ane-cecile/

MatrixLM: a flexible, interpretable framework for high-throughput data

DATE & TIME: October 31, 2022, 2:00 PM - 3:00 PM CT
LOCATION: Zoom Meeting

Speaker: Chenhao Zhao, Geisel School of Medicine, Dartmouth College
Abstract: The Matrix Linear Model (MLM) is an efficient and computationally feasible solution to association analysis for biomedical high-throughput data. Sen and Liang (2018) developed the MatrixLM.jl package in the Julia programming language, providing core functions to estimate matrix linear models.
The project's main goal was to collaborate on user-friendly documentation, increase testing features, and improve certain coding functionalities. We used simulated data to demonstrate how to use the package and used a case study example based on an actual disease metabolomics study to showcase MLM's benefits. Nonalcoholic fatty liver disease (NAFLD) is a progressive liver disease that is strongly associated with type II diabetes. Using Matrix Linear Models, our analysis investigated the association between metabolite characteristics (e.g., pathways) and patient characteristics such as type II diabetes.
Bio: https://geiselmed.dartmouth.edu/qbs/profile/chenhao-zhao/

A multi-omics approach to understanding the role of APOL1 in CKD amongst African Ancestry individuals

DATE & TIME: October 17, 2022, 2:00 PM - 3:00 PM CT
LOCATION: Zoom Meeting

Speaker: Dr. Heather M. Highland, The University of North Carolina at Chapel Hill
Abstract: APOL1 is an integral part of the complement system, a component of innate immunity that serves as the first line of defense against pathogens. The trypanosome that causes African sleeping sickness developed mechanisms to evade the innate immunity system after the migration out of Africa. This created a selective pressure on non-synonymous variants in the APOL1 gene. These variants are only observed in people with recent African ancestry. People carrying two variants in APOL1 are at increased risk of developing a variety of kidney diseases. The mechanism by which APOL1 alters kidney function is currently unclear. To investigate potential mechanisms, we have looked at differences in DNA methylation across the genome and metabolomic profiles. Using 1740 African Americans (AA) in the ARIC study and 3886 Hispanic/Latinos in SOL, we did not observe any statistically significant differences in metabolism in preliminary analyses after adjusting for multiple comparisons. In 947 AAs in ARIC, 949 AAs in JHS and 332 AAs in MESA, we identified extensive differences in methylation near the APOL gene family region on chromosome 22 (p = 2.7 × 10⁻⁸⁰). Additional methylation differences were seen near FEZF2, FAM20A, and KIAA0556. Investigation of the role of these loci in the development of kidney disease in people with two APOL1 risk alleles continues.
Bio: https://sph.unc.edu/adv_profile/heather-m-highland-phd/

Bidirectional Causal Modeling With Instrumental Variables and Data From Relatives

DATE & TIME: September 19, 2022, 2:00 PM - 3:00 PM CT
LOCATION: Zoom Meeting

Speaker: Dr. Luis FS Castro-de-Araujo, The University of Melbourne, Australia; Virginia Institute for Psychiatric and Behavioral Genetics
Abstract: Establishing (or falsifying) causal associations is an essential step towards developing effective interventions for psychiatric and substance use problems. While randomized controlled trials (RCTs) are considered the gold standard for causal inference in health research, they are impossible or unethical in many common scenarios. Mendelian randomization (MR) can be used where RCTs are not feasible, but it requires stringent assumptions that can be fundamentally flawed when applied to complex traits. Some assumptions of MR can be avoided by using structural equation modeling. In this paper we developed an extension of the Direction of Causation twin model (Neale 1994) that includes two polygenic risk scores (PRS) in the specification, as an approach to avoid some inherent restrictions of both MR and RCT. We hypothesize that adding a second PRS will generate a more flexible model in terms of identification, whilst maintaining reasonable power and allowing for bidirectional causation. OpenMx software is used to explore the power of such a model and its identification. We arrive at an extension of the Direction of Causation model that can be used in either a twin design or an extended family design, while relaxing some of MR's assumptions. We further report that the model is sufficiently powered for current data set sizes (from around 13000 observations or less, depending on the variance of the instruments), and in a range of additive, shared and environmental variances found in common clinical scenarios.
Bio: https://findanexpert.unimelb.edu.au/profile/821977-luis-fernando-silva-castro-de-araujo

Using Ontologies for Knowledge Engineering and Management in Medicine and Healthcare

DATE & TIME: August 29, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Arash Shaban-Nejad, University of Tennessee Health Science Center
Abstract: Health intelligence relies on the systematic collection and integration of data from diverse distributed and heterogeneous sources at various levels of granularity. These sources include data from multiple disciplines represented in different formats, languages, and structures posing significant integration and analytics challenges. Using a series of clinical and population health applications and use cases, this seminar highlights the contribution made by emerging semantic technologies that offer enhanced interoperability, interpretability, and explainability through the adoption of ontologies (a computational artifact capturing domain knowledge using concepts, relations, and complex logical rules and axioms), and knowledge graphs.
Bio: https://www.uthsc.edu/faculty/profile/?netid=ashabann

Using Outbred HS Rats to Study the Genetic Basis of Almost Everything You Can Think Of

DATE & TIME: August 22, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Abraham Palmer, University of California San Diego
Abstract: Whereas there are well-established methods for translating findings about single genes from humans to non-humans, there is an urgent need for methods to translate the polygenic signals obtained from GWAS across species. This is difficult because GWAS produces information about SNPs rather than genes; however, SNPs are inherently species specific. My lab is helping to develop two complementary methods to address this problem. Both methods depend on translation of GWAS signals from SNPs to genes. In one method, this is done by choosing the gene that is nearest to an implicated SNP. The lists of orthologous genes from two or more species are then projected into a previously defined gene network and a random walk is used to diffuse the signal to neighboring genes. The overlap between the networks defined by each species is then assessed for significance relative to permuted gene sets. In the second method, SNPs are used to predict gene expression and these predictions are used to estimate the effect of each gene’s expression on phenotype, creating what we term a polygenetic transcriptomic risk score (PTRS). A PTRS can then be used in conjunction with orthologous genes such that a PTRS defined in one species can be used to estimate an analogous trait in individuals from another species. In preliminary work we found that both methods identify highly statistically significant overlap in the signals associated with both BMI and body length. We are extending these methods to behavioral traits, including those relevant for substance use disorders.
Bio: https://profiles.ucsd.edu/abraham.palmer

A Single-Cell and Spatially Resolved Atlas of Human Breast Cancers

DATE & TIME: April 11, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Abstract: Breast cancers are complex cellular ecosystems where heterotypic interactions play central roles in disease progression and response to therapy. However, our knowledge of their cellular composition and organization remains limited. Recently we published an integrated cellular and spatial atlas of 26 primary human breast cancers spanning all major molecular subtypes. This provided a systematic, high-resolution, characterization of the cellular diversity of the epithelial, immune and stromal cellular landscape. To investigate neoplastic cell heterogeneity, we developed a single cell classifier of intrinsic subtype (scSubtype) and revealed recurrent transcriptional gene modules that define the neoplastic cells. This detailed cellular taxonomy was then used to deconvolute large breast cancer cohorts, allowing their stratification into nine clusters, termed ‘ecotypes’, with unique cellular compositions and association with clinical outcome. Further, Visium spatial profiling provided an initial view of how stromal, immune and neoplastic cells are spatially organized in tumours, offering insights into tumour regulation. This work is now being expanded to the generation and integration of 100s of cellular profiles and a matching dataset of spatially resolved tumour transcriptomes. We propose that this will identify cellular niches that are spatially organized in breast tumours, offering insights into anti-tumour immune regulation and neoplastic heterogeneity. In particular, I’ll discuss recent work, using whole-transcriptome Nanostring spatial profiling, that analyses T-cell and cancer rich regions of the tumour micro-environment in triple-negative breast cancers, showing how integration with our cellular taxonomy allows for deconvolution of the cell-type abundances in these specific tissue regions.
Our work highlights the potential of large-scale, integrated, cellular and spatial genomics to unravel the complex cellular heterogeneity within tumours and identify novel cell types, niches, and regulatory states that will inform treatment response.
Publication: https://www.nature.com/articles/s41588-021-00911-1

A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomic Data

DATE & TIME: March 28, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Li Hsu, Fred Hutchinson Cancer Research Center
Abstract: Genome-wide association studies (GWAS) have successfully identified thousands of genetic variants for many complex diseases; however, these variants explain only a small fraction of the heritability. Recent developments in genotype-omics studies have shown promise for discovering novel loci by leveraging genetically regulated molecular phenotypes (e.g., gene expression, methylation, proteomics) into GWAS. However, there is a limitation in the existing approaches. Some variants can individually influence disease risk through alternative functional mechanisms. Existing approaches of testing only the association of imputed molecular phenotypes will potentially lose power. To tackle this challenge, we consider a unified mixed effects model that formulates the association of intermediate phenotypes such as imputed gene expression through fixed effects, while allowing for residual effects of individual variants as random effects. We consider a set-based score testing framework, MiST (Mixed effects Score Test), and propose data-driven combination approaches to jointly test for the fixed and random effects. We also provide p-values for fixed and random effects separately to enhance interpretability of the association signals. Recently, we extended MiST to depend on only GWAS summary statistics instead of individual level data, allowing for a broad application of MiST to GWAS data. Extensive simulations demonstrate that MiST is more powerful than existing approaches and that summary statistics-based MiST (sMiST) agrees well with results obtained from individual level data, with substantively improved computational speed. We apply sMiST to a large-scale GWAS of colorectal cancer using summary statistics from >120,000 study participants and gene expression data from the Genotype-Tissue Expression (GTEx) project. We identify several novel and secondary independent genetic loci.
Bio: https://www.fredhutch.org/en/faculty-lab-directory/hsu-li.html
Publication: https://pubmed.ncbi.nlm.nih.gov/29727690/

A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples

DATE & TIME: March 21, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Hongkai Ji, Johns Hopkins Bloomberg School of Public Health
Abstract: Pseudotime analysis with single-cell RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many computational methods have been developed to infer the pseudo-temporal trajectories of cells within a biological sample, methods that compare pseudo-temporal patterns with multiple samples (or replicates) across different experimental conditions are lacking. Lamian is a comprehensive and statistically-rigorous computational framework for differential multi-sample pseudotime analysis. It can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions, and also to detect changes in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that are not generalizable to new samples. Using both simulations and real scRNA-seq data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.
Bio: https://publichealth.jhu.edu/faculty/1943/hongkai-ji
Publication: https://www.biorxiv.org/content/10.1101/2021.07.10.451910v1

Inference of disease associated genomic segments in post-GWAS analysis

DATE & TIME: March 14, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Lin Hou, Tsinghua University
Abstract: Identification and interpretation of disease-associated loci remain a paramount challenge in genome-wide association studies (GWAS) of complex diseases. We develop post-GWAS analysis tools, which leverage pleiotropy and functional annotations to dissect the genetic architecture of complex traits. In this talk, I will first introduce LOGODetect, a powerful and efficient statistical method to identify small genome segments harboring local genetic correlation signals. LOGODetect automatically identifies genetic regions showing consistent association with multiple phenotypes through a scan statistic approach. Applied to seven neuropsychiatric traits, we identify hub regions showing concordant effects on five or more traits. Next, I will introduce Openness Weighted Association Studies (OWAS), a computational approach that leverages and aggregates predictions of chromatin accessibility in personal genomes for prioritizing GWAS signals. In extensive simulations and real data analysis, OWAS identifies genes/segments that explain more heritability than existing methods, and has a better replication rate in independent cohorts than GWAS. Moreover, the identified genes/segments show tissue-specific patterns and are enriched in disease-relevant pathways.
Bio: http://www.stat.tsinghua.edu.cn/en/teambuilder/faculty/lin-hou/
Seminar Record

Durability of Covid-19 Vaccines

DATE & TIME: February 14, 2022, 2:00 PM - 3:00 PM CST
LOCATION: Zoom Meeting

Speaker: Dr. Danyu Lin, University of North Carolina at Chapel Hill
Abstract: Evaluating the durability of protection afforded by Covid-19 vaccines is a public health priority, with the results needed to inform policies around booster vaccinations as well as those around non-pharmaceutical interventions. In this talk, I will present a general framework for estimating the effects of Covid-19 vaccines over time in phase 3 clinical trials and observational studies. I will show some results on the duration of vaccine protection from the Moderna pivotal trial and from the North Carolina statewide surveillance data. The latter data, which were published in the New England Journal of Medicine in January, provided rich information about the effectiveness of the Pfizer, Moderna, and Johnson & Johnson vaccines in reducing the risks of Covid-19, hospitalization, and death over time. I will discuss the implications of these results for booster vaccinations.
Bio: https://sph.unc.edu/adv_profile/danyu-lin-phd/

Democratizing data-driven biology: Tackling incomplete data, unstructured metadata, and hidden curricula

DATE & TIME: February 7, 2022, 2:00 PM - 3:00 PM CST
LOCATION: Zoom Meeting

Speaker: Dr. Arjun Krishnan, Michigan State University
Abstract: There is much enthusiasm about using omics and biomedical data collections to fuel research on complex traits and diseases. However, there are still some well-known fundamental challenges in seamlessly and effectively using these data to drive research. For instance, there are >1.5 million human gene expression profiles that are publicly available, but, depending on the technology/platform used to record each profile, different subsets of genes in the genome are measured in these transcriptomes, leading to thousands of unmeasured genes in many of these profiles. These gaps in data are major hurdles for integrative analysis. Critical problems also exist with data descriptions: the majority of >2 million publicly available omics samples lack structured metadata, including information about tissue of origin, disease status, and environmental conditions. Thus, discovering samples and datasets of interest is not straightforward. In this seminar, I will present recent work from our group on developing machine learning approaches to address these fundamental challenges. In addition, I will discuss the need for improving advanced research training in biological data analysis by formalizing concepts in statistical procedures, study design, data/code management, critically consuming data-driven findings, and reproducible research.
Bio: https://cmse.msu.edu/directory/faculty/arjun-krishnan/

Meta-clustering of Genomic Data

DATE & TIME: January 31, 2022, 3:00 PM - 4:00 PM CST
LOCATION: Zoom Meeting

Speaker: Dr. Yingying Wei, Chinese University of Hong Kong
Abstract: Like traditional meta-analysis that pools effect sizes across studies to improve statistical power, it is of increasing interest to conduct clustering jointly across datasets to identify disease subtypes for bulk genomic data and discover cell types for single-cell RNA-sequencing (scRNA-seq) data. Unfortunately, due to the prevalence of technical batch effects among high-throughput experiments, directly clustering samples from multiple datasets can lead to wrong results. The recently emerging meta-clustering approaches require all datasets to contain all subtypes, which is not feasible for many experimental designs.
In this talk, I will present our Batch-effects-correction-with-Unknown-Subtypes (BUS) framework. BUS is capable of correcting batch effects explicitly, grouping samples that share similar characteristics into subtypes, identifying features that distinguish subtypes, and enjoying a linear-order computational complexity. We prove the identifiability of BUS for not only bulk data but also scRNA-seq data whose dropout events suffer from missing not at random. We mathematically show that under two very flexible and realistic experimental designs—the “reference panel” and the “chain-type” designs—true biological variability can also be separated from batch effects. Moreover, despite the active research on analysis methods for scRNA-seq data, rigorous statistical methods to estimate treatment effects for scRNA-seq data—how an intervention or exposure alters the cellular composition and gene expression levels—are still lacking. Building upon our BUS framework, we further develop statistical methods to quantify treatment effects for scRNA-seq data.
Bio: https://www.sta.cuhk.edu.hk/peoples/ywei/

Novel Genetic Association Test and Cardiomyopathy Risk Prediction in Cancer Survivors

DATE & TIME: January 24, 2022, 2:00 PM - 3:00 PM CST
LOCATION: Zoom Meeting

Speaker: Dr. Xuexia Wang, University of North Texas
Abstract: This talk includes two projects. Project 1: Gene-based association tests are widely used in GWAS. The power of a test is often limited by the sample size, the effect size, and the number of causal variants or their directions in a gene. In addition, access to individual-level data is often limited. To resolve the existing limitations, we proposed an optimally weighted combination (OWC) test based on summary statistics from GWAS. We analytically proved that aggregating the variants in one gene is the same as using the weighted combination of Z-scores for each variant based on the proposed score test. Several existing methods are special cases. We also numerically illustrated that our proposed test outperforms several existing methods via simulation studies. Furthermore, we utilized schizophrenia GWAS data and fasting glucose GWAS meta-analysis data to demonstrate that our method outperforms the existing methods in real data analyses. Project 2: We used a carefully curated list of 87 previously published genetic variants to determine whether the incorporation of genetic variants with non-genetic variables could improve the identification of cancer survivors at risk for anthracycline-related cardiomyopathy. We used anthracycline-exposed childhood cancer survivors from a Children’s Oncology Group study (COG-ALTE03N1: 146 cases; 195 matched controls) as the discovery set. Replication was performed in two anthracycline-exposed survivor populations: i) childhood cancer survivors from the Childhood Cancer Survivor Study (CCSS: 126 cases; 250 controls); ii) autologous blood or marrow transplantation (BMT) survivors from the BMT Survivor Study (BMTSS: 80 cases; 78 controls). The Clinical+Genetic model performed better than the Clinical Model in COG-ALTE03N1 (AUC of Clinical+Genetic Model = 0.88 vs. AUC of Clinical Model = 0.81) and BMTSS (AUC of Clinical+Genetic Model = 0.72 vs. AUC of Clinical Model = 0.64), but not in CCSS (AUC of Clinical+Genetic Model = 0.88 vs. AUC of Clinical Model = 0.89). However, the Clinical+Genetic model performed marginally better in CCSS patients without CVRFs where cardiomyopathy developed within 30 years of anthracycline exposure (AUC of Clinical+Genetic Model = 0.90 vs. AUC of Clinical Model = 0.85). Conclusions: Adding a comprehensively assembled genetic profile to clinical characteristics improves the identification of cancer survivors at risk for anthracycline-related cardiomyopathy.
Bio: https://math.unt.edu/people/xuexia-wang
Publication: https://pubmed.ncbi.nlm.nih.gov/32413235/

Advances in Mendelian Randomization

DATE & TIME: November 22, 2021, 2:00 PM - 3:00 PM CST
LOCATION: Zoom Meeting

Speaker: Dr. Zoltán Kutalik, Associate Professor, University of Lausanne
Abstract: First, I will motivate the need for causal inference in contrast to observational correlations. In particular, I’ll describe the principle of an instrumental variable approach heavily applied in genetic research, termed Mendelian Randomization (MR). Next, I will show four extensions of this method to different settings/assumptions: (i) In-depth quantification and correction of the bias of the most popular MR method (IVW); (ii) Modelling genetic architecture simultaneously with bidirectional causal effects in the presence of a heritable confounder; (iii) Causal inference for composite trait type exposures; (iv) estimation of non-linear causal effects. I’ll finish with two applications to omics data: first, we link differential gene expression analyses with bi-directional causal effects and finally, I’ll touch on how causal effects of different omics layers are mediated.
Bio: https://applicationspub.unil.ch/interpub/noauth/php/Un/UnPers.php?PerNum=1048220&LanCode=8&menu=coord
Publication: https://pubmed.ncbi.nlm.nih.gov/33737653/
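
For readers unfamiliar with the IVW method named in the abstract above, the following is a minimal sketch of the standard fixed-effect inverse-variance-weighted estimate computed from per-variant summary statistics. It is not the speaker's extension or bias correction; the function name `ivw_estimate` and the toy numbers are hypothetical and chosen only for illustration.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of a causal effect and its standard error.

    Each variant j contributes the ratio beta_outcome[j] / beta_exposure[j],
    weighted by beta_exposure[j]**2 / se_outcome[j]**2.
    """
    numerator = 0.0
    denominator = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        numerator += bx * by / se**2
        denominator += bx**2 / se**2
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for three variants (illustration only).
bx = [0.10, 0.20, 0.15]    # variant-exposure effects
by = [0.05, 0.10, 0.075]   # variant-outcome effects
se = [0.01, 0.02, 0.01]    # standard errors of the variant-outcome effects
estimate, std_err = ivw_estimate(bx, by, se)
```

Every variant in the toy data has the same ratio, so the weighted average equals that common ratio regardless of the weights.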

Standardization and Penalty Parameter Selection in Penalized Linear Mixed Models

DATE & TIME: November 1, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Anna Reisetter, PhD, University of Iowa
Abstract: Penalized linear mixed models (LMMs) have been developed to accurately identify genotype-phenotype associations in the presence of dependent samples. In spite of this, the statistical properties of these models are not well understood. In addition, there is a lack of available software for their implementation. In this talk, we provide an overview of penalized LMMs for the analysis of structured genetic data, while examining their statistical properties in the genetic association setting. We then focus on the statistical properties of penalized LMMs in a general setting, and provide recommendations for key components of their implementation, including appropriate standardization and penalty parameter selection. We demonstrate the benefits of our recommendations in both a general setting and one specific to genetic data. We conclude with a detailed analysis of a large, empirical GWAS data set which contains complex correlation among samples. We use this analysis to illustrate the benefits of penalized LMMs compared to traditional genome-wide association methods, and to demonstrate the utility of penalizedLMM, an R package we have developed for the flexible and user-friendly implementation of penalized LMMs.
Publication: https://onlinelibrary.wiley.com/doi/full/10.1002/gepi.22384

PanelPRO: A General Framework for Multi-Gene, Multi-Cancer Mendelian Risk Prediction Models

DATE & TIME: October 18, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Jane Liang, Ph.D. in Biostatistics, Harvard University
Abstract: Risk evaluation to identify individuals who are at greater risk of cancer as a result of heritable pathogenic variants is a valuable component of individualized clinical management. Using principles of Mendelian genetics, Bayesian probability theory, and variant-specific knowledge, Mendelian models derive the probability of carrying a pathogenic variant and developing cancer in the future, based on family history. Existing Mendelian models are widely employed, but are generally limited to specific genes and syndromes. However, the upsurge of multi-gene panel germline testing has spurred the discovery of many new gene-cancer associations that are not presently accounted for in these models. We have developed PanelPRO, a flexible, efficient Mendelian risk prediction framework that can incorporate an arbitrary number of genes and cancers, overcoming the computational challenges that arise because of the increased model complexity. Using simulations and a clinical cohort with germline panel testing data, we evaluate model performance, validate the reverse-compatibility of our approach with existing Mendelian models, and illustrate its usage.
Publication: https://arxiv.org/abs/2108.12504
Seminar Recording Link

Principled Approaches to the Practical Challenges of Real-World Data

DATE & TIME: October 4, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Rebecca Hubbard, Professor of Biostatistics, University of Pennsylvania
Abstract: Interest in conducting research using real-world data, data generated as a by-product of digital transactions, has exploded over the past decade and has been further spurred by the 21st Century Cures Act. Real-world data facilitate understanding of treatment utilization and outcomes as they occur in routine practice, and studies using these data sources can potentially proceed rapidly compared to trials and observational studies that rely on primary data collection. However, using data sources that were not collected for research purposes comes at a cost, and naïve use of such data without considering their complexity and imperfect quality can lead to bias and inferential error. Real-world data frequently violate the assumptions of standard statistical methods, but it is not practicable to develop new methods to address every possible complication arising in their analysis. The statistician is faced with a quandary: how to effectively utilize real-world data to advance research without compromising best practices for principled data analysis. In this talk I will use examples from my research on methods for the analysis of electronic health record (EHR)-derived data to illustrate approaches to understanding the data generating mechanism for real-world data. Drawing on this understanding, I will then discuss approaches to identify, use, and develop principled methods for incorporating EHR data into research. The overarching goal of this presentation is to raise awareness of challenges associated with the analysis of real-world data and demonstrate how a principled approach can be grounded in an understanding of the scientific context and data generating process.
Seminar Recording Link

Generalized V-Learning Framework for Estimating Dynamic Treatment Regimes

DATE & TIME: September 27, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Duyeol Lee, Quantitative Analytics Specialist, Model Innovations Team, Wells Fargo Bank
Abstract: Precision medicine is an approach that incorporates personalized information to efficiently determine which treatments are best for which types of patients. A key component of precision medicine is creating mathematical estimators for clinical decision-making. Dynamic treatment regimes formalize tailored treatment plans as sequences of decision rules. Recently, the V-learning method was introduced to estimate optimal dynamic treatment regimes. This method showed good performance compared to the existing reinforcement learning methods such as greedy gradient Q-learning. However, the complicated functional form of its loss function makes it difficult to apply modern machine learning methods for the estimation of value functions of treatment policies. We propose a generalized V-learning framework for estimating optimal treatment regimes. The proposed method adopts widely used loss functions and an iterative method to estimate value functions. Simulation studies show that the proposed method provides better performance compared with the original V-learning method.

OSGA Webinar - Organizing Data in Spreadsheets

DATE & TIME: September 24, 2021, 12:00 PM - 1:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Karl Broman, Professor, Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison
Abstract: Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this presentation will offer practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plaintext files.
Broman KW, Woo KH (2018) Data organization in spreadsheets. The American Statistician 72:2–10
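
Two of the principles listed in the abstract above (write dates like YYYY-MM-DD, do not leave any cells empty) can be checked mechanically once a sheet is saved as a plaintext CSV file. The sketch below uses only the Python standard library; the function name `check_rows`, the column names, and the sample data are hypothetical, and the date check looks only at the format, not at whether the date is valid.

```python
import csv
import io
import re

# Matches the YYYY-MM-DD layout only; it does not validate the calendar date.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_rows(csv_text, date_columns):
    """Return a list of (row_number, column, problem) tuples."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus cells under the key None.
                problems.append((row_number, "(extra)", "more cells than header columns"))
                continue
            if value is None or value.strip() == "":
                problems.append((row_number, column, "empty cell"))
            elif column in date_columns and not ISO_DATE.match(value.strip()):
                problems.append((row_number, column, "date not YYYY-MM-DD"))
    return problems


# Hypothetical data: the second record has a non-ISO date and a missing weight.
sample = "id,visit_date,weight\n1,2021-09-24,70.5\n2,9/24/21,\n"
issues = check_rows(sample, date_columns={"visit_date"})
```

Here `issues` lists the two problems found in spreadsheet row 3, in header order.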

Previous Webinar Series

Mendelian Randomization Analyses in a Multivariate Framework

DATE & TIME: September 13, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Phillip Hunter Allman (University of Alabama at Birmingham School of Public Health)
Summary: Mendelian randomization (MR) is an application of instrumental variable (IV) methods to observational data in which the IV is a genetic variant. An IV is a variable which functions similarly to the random treatment group assignment seen in clinical trials. Numerous statistical methods exist for subject-level MR or IV analysis, such as two-stage predictor substitution (2SPS) or the correlated errors model (CEM). These methods have some limitations depending on the distributions involved and the number of IVs, particularly when it comes to asymptotic variance estimation and hypothesis testing. Our research explores extensions of the CEM to scenarios with non-normal variates. This talk will provide a brief introduction to MR; describe the popular statistical methods for such analyses; describe our extensions to the correlated errors model; and compare these methods with simulations and real data analysis.
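
As a rough illustration of two-stage predictor substitution with a single genetic instrument and linear models, the sketch below regresses the exposure on the instrument, then regresses the outcome on the fitted exposure. It is not the speaker's CEM extension. All names and the toy data are hypothetical; the data were constructed so that the true effect is 0.5 and an unmeasured confounder biases the naive regression. Standard errors from the second-stage fit are not valid as-is, which is part of the variance-estimation difficulty the summary mentions.

```python
def ols_slope_intercept(x, y):
    """Simple least-squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_stage_predictor_substitution(instrument, exposure, outcome):
    """Point estimate of the exposure effect using one instrument."""
    # Stage 1: predict the exposure from the instrument.
    a1, a0 = ols_slope_intercept(instrument, exposure)
    predicted = [a0 + a1 * g for g in instrument]
    # Stage 2: regress the outcome on the predicted exposure.
    b1, _ = ols_slope_intercept(predicted, outcome)
    return b1


# Hypothetical toy data: genotype g, confounder u (uncorrelated with g),
# exposure x = 2*g + u, outcome y = 0.5*x + 3*u.
g = [0, 1, 2, 0, 1, 2]
u = [1, -1, 0, -1, 1, 0]
x = [2 * gi + ui for gi, ui in zip(g, u)]
y = [0.5 * xi + 3 * ui for xi, ui in zip(x, u)]

iv_estimate = two_stage_predictor_substitution(g, x, y)
naive_estimate, _ = ols_slope_intercept(x, y)
```

In this toy example the instrument-based estimate recovers 0.5, while the naive regression of the outcome on the exposure gives 1.1 because of the confounder.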

Summer Intern Presentation II

DATE & TIME: August 30, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Zifan (Fred) Yu (University of Washington)
Title: Fast Algorithms for Elastic-net Matrix Linear Models
Abstract: Our work builds upon Matrix Linear Models, which have been established and well studied by Dr. Sen and Jane W. Liang, a previous intern. The models are suitable for high-throughput data where the responses are multivariate, with each column being, for example, a genetic strain and each row being an experimental run, and they take full advantage of the column covariates (which usually represent the environmental conditions) and the row covariates (features that are associated with the type of the response). Extending univariate linear models and standard statistical approaches such as the t-test, Matrix Linear Models provide flexibility in modeling and computational speed. The previous work developed solutions to estimate the models under Lasso penalization, with the assumption that the effects will be rare (sparsity in the coefficients). In order to overcome some of the limitations while still preserving some of the strengths of the Lasso type of solutions, we developed solutions to the models with Elastic-net, a regularization approach that combines the Lasso and Ridge types of regression. As in the previous work, we extended the proximal algorithms (ISTA, FISTA and ADMM) to be suitable for solving the Elastic-net problems. The outcome of our work is an extension of MatrixLMnet, a package written in Julia (a relatively young but promising programming language) and initially developed by Jane W. Liang and Dr. Saunak Sen, with Elastic-net solutions that offer the same functionality as the Lasso solutions.
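
To make the proximal-algorithm idea concrete, here is a minimal ISTA sketch for an ordinary (vector-response) Elastic-net regression. It is not the MatrixLMnet implementation and does not handle the matrix-response structure described above. It assumes the objective 0.5*||y - Xb||^2 + lam*(alpha*||b||_1 + 0.5*(1 - alpha)*||b||^2); the function names and the toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value| (the Lasso part)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_ista(X, y, lam, alpha, n_iter=500):
    """ISTA for 0.5*||y - Xb||^2 + lam*(alpha*||b||_1 + 0.5*(1-alpha)*||b||^2)."""
    n, p = len(X), len(X[0])
    # The squared Frobenius norm bounds the largest eigenvalue of X'X,
    # so its reciprocal is a conservative but valid step size.
    step = 1.0 / sum(x_ij ** 2 for row in X for x_ij in row)
    b = [0.0] * p
    for _ in range(n_iter):
        residual = [sum(X[i][j] * b[j] for j in range(p)) - y[i] for i in range(n)]
        gradient = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        # Gradient step on the squared-error term, then the Elastic-net prox:
        # soft-thresholding (Lasso part) followed by shrinkage (Ridge part).
        b = [
            soft_threshold(b[j] - step * gradient[j], step * lam * alpha)
            / (1.0 + step * lam * (1.0 - alpha))
            for j in range(p)
        ]
    return b


# Toy check with an identity design, where the solution has the closed form
# soft_threshold(y_j, lam*alpha) / (1 + lam*(1 - alpha)).
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
coefficients = elastic_net_ista(X, y, lam=1.0, alpha=0.5)
```

With the identity design, the first coefficient converges to 2.5/1.5 and the second is set exactly to zero, showing both the Ridge-type shrinkage and the Lasso-type sparsity.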

Speaker: Phillip Winston Miller (University of Memphis)
Title: bigBERD: An R package to generate reproducible analysis
Abstract: bigBERD is an R package which automatically generates code for general analysis plans in a piecewise and robust manner. Nearly all standard analysis plans contain the same basic steps: visualization, summarization of variables, univariate testing, and some type of regression analysis. Much time is spent copying, modifying, or developing new code to analyze methodologically similar problems. My package aims to circumvent this issue by generating customized code for any dataset at hand. bigBERD operates in a stepwise manner. First, the user generates a Report directory tree to facilitate good data handling practices. Then the package reads the clean dataset and creates a short summary file, which contains useful information for the user and forms the basis of automated decision making. After these prerequisites are generated, the user can call one of four document generating functions: exploratory analysis, univariate analysis, multiple regression analysis, or time-to-event analysis. Each of these will generate a .Rmd document containing code customized to the task and dataset at hand. All code can be run immediately after generation or modified as the user sees fit. The exploratory document contains code for univariate plots of each variable, bivariate plots, and a summary table. The univariate analysis function writes code to perform univariate analysis on all variables, stratified by a user-defined condition of interest. It also generates code for simple linear/logistic/multinomial/ordinal regression on an outcome of interest, with accompanying tables and plots. The multiple regression analysis function generates code to build a main effects model and a stepwise AIC reduced model, with summary tables and plots. The time-to-event analysis function generates KM plots with log-rank tests for categorical variables and a main-effects Cox proportional hazards model, with a summary table.

Summer Intern Presentation I

DATE & TIME: August 23, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Nadeesha Thewarapperuma (University of Kansas Medical Center)
Title: Analysis of DNA methylation data using a functional data approach
Abstract: We will use a data series that examines oral and pharyngeal carcinoma, and which includes 154 cases and 72 controls. A functional data approach will be used to analyze CpG island data for each gene to determine if there is a difference between case and control groups. The package fdANOVA will be used for the hypothesis tests. The null hypothesis is that the case and control groups have the same mean function, and the alternative is that the mean functions differ.

Speaker: Ye Eun Bae (Florida State University)
Title: Building tools for understanding the genetic architecture of allopolyploidy
Abstract: Analysis of quantitative trait loci (QTL) is a useful approach to identify putatively causal genetic factors underlying quantitative trait variation in species. While many QTL analysis tools and methods have been proposed, it is challenging to extend them to allopolyploid species, which require polyploidy-aware genetic mapping that uses the fact that there is preferential pairing between pairs of homologous chromosomes. In this project, we developed a QTL analysis tool for allopolyploids to provide insights into their genetic architecture and genotype-phenotype associations. As a starting point, we applied our tool to switchgrass (Panicum virgatum) datasets and visualized the results to help understand the allopolyploid nature of switchgrass.

Seminar Recording Link

We look forward to seeing you!
