Quality control of GWAS summary statistics
DATE & TIME: February 12, 2024, 10:00 AM - 11:00 AM CDT
Virtual: Zoom Meeting

Speaker: Dr. Florian Privé, Aarhus University, Denmark
Abstract: Results from genome-wide association studies (GWAS summary statistics) have been extensively used in different applications such as estimating the genetic architecture of complex traits and diseases, identifying causal variants with fine-mapping, and predicting complex traits with polygenic scores. One reason behind the popularity of GWAS summary statistics is that they are widely available and shared, e.g. in the GWAS Catalog. However, these GWAS summary statistics come with varying degrees of quality, and from many different tools and studies. In this presentation, I go over several different cases that could go wrong when using GWAS summary statistics and what we can do about it. Most of these are still unknown to many people using GWAS summary statistics on a regular basis, which can cause results they derive to be biased or suboptimal. I present and discuss past and current work on how to perform some quality control and ultimately improve the quality of GWAS summary statistics, in order to make best use of them.
Bio: https://privefl.github.io/

PAST SEMINARS
Big Data is Low Rank
DATE & TIME: January 22, 2024, 2:00 PM - 3:00 PM CDT
Virtual: Zoom Meeting

Speaker: Dr. Madeleine Udell, Stanford University
Abstract: Data scientists are often faced with the challenge of understanding a high dimensional data set organized as a table. These tables may have columns of different (sometimes, non-numeric) types, and often have many missing entries. This talk surveys methods based on low rank models to analyze these big messy data sets. We show that low rank models perform well — indeed, suspiciously well — across a wide range of data science applications, including in social science, medicine, and machine learning. This good performance demands (and this talk provides) a simple mathematical explanation for their effectiveness, which identifies when low rank models perform well and when to look beyond low rank.
Bio: https://web.stanford.edu/~udell/bio.html

Generalized kernel machine regression 
DATE & TIME: November 13, 2023, 2:00 PM - 3:00 PM CDT
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105.
Please park in the multi-story parking garage adjacent to the Doctors Office Building, and bring your parking ticket with you so we can validate it.
Virtual: Zoom Meeting

Speaker: Dr. Xichen Mou, University of Memphis
Abstract: Kernel Machine Regression (KMR) serves as a nonparametric regression approach fundamental in numerous scientific domains. By utilizing a map determined by the kernel function, KMR transforms original predictors into a higher-dimensional feature space, simplifying the recognition of patterns between outcomes and independent variables. KMR is invaluable in studies within the biomedical and environmental health sectors, where it aids in identifying crucial exposure points and gauging their impact on results. In our study, we introduce the Generalized Bayesian Kernel Machine Regression (GBKMR) which integrates the KMR model within the Bayesian context. GBKMR not only complements the conventional KMR but also suits a range of outcome data, from continuous to binary and count data. Simulation studies confirm GBKMR's superior precision and robustness. We further employ this method on a real data set to pinpoint specific cytosine phosphate guanine (CpG) locations correlated with health-related outcomes or exposures.
Bio: https://www.memphis.edu/publichealth/contact/faculty_profiles/mou.php

A Clustering Approach to Non-Equal Length Joint Pattern Genetic and Epigenetics Factors Weighted by Covariates 
DATE & TIME: October 30, 2023, 2:00 PM - 3:00 PM CDT
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105.
Please park in the multi-story parking garage adjacent to the Doctors Office Building, and bring your parking ticket with you so we can validate it.
Virtual: Zoom Meeting

Speaker: Joseph (Kip) Handwerker, University of Tennessee Health Science Center and University of Memphis
Abstract: Clustering analysis is a popular approach to gaining insight into the structure of data, especially on a large scale. Some of the most popular approaches are the K-means and K-prototype algorithms, which are partitioning methods that use distance measures to assign groups. While these methods are good, especially for large datasets, when it comes to genetics data they fail to consider potential joint effects and require the same dimensionality across variables. The Vector in Partition (VIP) algorithm fills this gap with a distance measure designed to partition genetic and epigenetic data with non-equal length dimensions; specifically, gene expression (GE), DNA methylation (CpG), and single nucleotide polymorphisms (SNP). The VIP extension method extends this framework by adding another layer of complex joint effects of genetic and epigenetic data with other potential health-related variables to dictate clustering. The extension algorithm performs well on simulated data when the clustering of the covariates follows the same clustering scheme as the genetics data. Like other distance measures, when the data do not follow a clear clustering scheme the algorithm tends to underperform, especially against numeric data. The results highlight many aspects of the algorithm’s performance capabilities, as well as multiple areas for future improvements.
Bio: https://kiphandwerker.github.io/
       
Real-time Linear Mixed Model Implementation for Association Mapping on Large Numbers of Quantitative Traits 
DATE & TIME: September 18, 2023, 2:00 PM - 3:00 PM CDT
Virtual: Zoom Meeting

Speaker: Zifan (Fred) Yu, University of Tennessee Knoxville
Abstract: Linear mixed models (LMMs) are used widely in genome-wide association studies (GWAS) to account for population structure and genetic relatedness among the study individuals. Advances in high-throughput genotyping technologies have enabled GWAS to be conducted on large numbers of traits and created a demand for more efficient and scalable implementations of LMMs. We developed a new software package for fast LMM association scans, BulkLMM, that is designed for GWAS of large numbers of quantitative traits with modest sample sizes, which are common when analyzing animal model data. We applied BulkLMM to BXD Individual Liver Proteome data for genome scans on the scale of over 35k traits and 7k markers and obtained real-time (within a few seconds) performance on high-end desktop hardware. BulkLMM provides additional features valuable for performing GWAS, such as permutation testing and the ability to incorporate prior knowledge on the residual variance. Our open-source implementation in the Julia programming language combines high efficiency with easy prototyping, which enables seamless downstream analysis and manipulation.
Bio: https://learningmalanya.github.io/
       
Inferring Within-Subject Variances From Intensive Longitudinal Data
DATE & TIME: May 22, 2023, 2:00 PM - 3:00 PM CDT
Virtual: Zoom Meeting

Speaker: Dr. Hua Zhou, University of California Los Angeles
Abstract: The availability of vast amounts of longitudinal data from electronic health records (EHR) and personal wearable devices opens the door to numerous new research questions. In many studies, individual variability of a longitudinal outcome is as important as the mean. Blood pressure fluctuations, glycemic variations, and mood swings are prime examples where it is critical to identify factors that affect the within-individual variability. We propose a scalable method, within-subject variance estimator by robust regression (WiSER), for the estimation and inference of the effects of both time-varying and time-invariant predictors on within-subject variance. It is robust against the misspecification of the conditional distribution of responses or the distribution of random effects. Its performance is similar to that of correctly specified likelihood methods, but it is 10³ to 10⁵ times faster. The estimation algorithm scales linearly in the total number of observations, making it applicable to massive longitudinal data sets. The effectiveness of WiSER is illustrated using the accelerometry data from the Women's Health Study and a clinical trial for longitudinal diabetes care.
Bio: https://ph.ucla.edu/about/faculty-staff-directory/hua-zhou
       
Statistical Frameworks for Longitudinal Metagenomic and Transcriptomic Data
DATE & TIME: April 24, 2023, 2:00 PM - 3:00 PM CDT
LOCATION: 4th Floor Conference Room 400 in the Doctors Office Building at 66 N. Pauline Street, Memphis, TN 38105.
Please park in the multi-story parking garage adjacent to the Doctors Office Building, and bring your parking ticket with you so we can validate it.
Virtual: Zoom Meeting

Speaker: Dr. Qian Li, St. Jude Children's Research Hospital
Abstract: Longitudinal sampling has become popular in omics studies, such as those of the microbiome and transcriptome. To identify operational taxonomic units (OTUs) signaling disease onset, a powerful and cost-efficient strategy is to select participants by matched sets and profile their temporal metagenomes, followed by trajectory analysis. We proposed a joint model with matching and regularization (JMR) to detect OTU-specific trajectories predictive of host disease status. The between- and within-matched-sets heterogeneity in OTU relative abundance and disease risk were linked by nested random effects. The inherent negative correlation in microbiota composition was adjusted for by incorporating and regularizing the top-correlated taxa as longitudinal covariates, pre-selected by Bray-Curtis distance and elastic net regression. For the longitudinal bulk transcriptomes, we propose a statistical framework, ISLET, to infer individual-specific and cell-type-specific transcriptome reference panels. ISLET models the repeatedly measured bulk gene expression data to optimize the usage of shared information within each subject. ISLET is the first available method to achieve individual-specific reference estimation in repeated samples. In the simulation study and an application to a large-scale metagenomic study, JMR outperformed the competing methods and identified important taxa in infants’ fecal samples with dynamics preceding host autoimmune status. We also show outstanding performance of ISLET in reference estimation and in downstream testing for cell-type-specific differentially expressed genes in simulation. An application of ISLET to the longitudinal PBMC transcriptomes in the same study confirms the cell-type-specific gene signatures for early-life autoimmunity.
Bio: https://www.stjude.org/directory/l/qian-li.html
       
False Discovery Rates for Penalized Regression Models 
DATE & TIME: March 27, 2023, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Patrick Breheny, University of Iowa
Abstract: Penalized regression is an attractive methodology for dealing with high-dimensional data where classical likelihood approaches to modeling break down. However, its widespread adoption has been hindered by a lack of inferential tools. In particular, penalized regression is very useful for variable selection, but how confident should one be about those selections? How many of those selections would likely have occurred by chance alone? In this talk, I will review recent developments in this area, with an emphasis on my work and that of my recent graduate students.
Bio: https://myweb.uiowa.edu/pbreheny/
       
Methods for Phylogenetic Networks
DATE & TIME: February 20, 2023, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Cécile Ané, University of Wisconsin-Madison
Abstract: Phylogenetic networks and admixture graphs can represent the past history of a group of species or populations and how they diversified. Unlike trees, networks can represent events such as migration between populations, admixture, hybridization between species, or recombination between viral strains. I will give an overview of how phylogenetic networks are used and the difficulties of estimating phylogenetic networks from genome-wide data. Then I will focus on the characterization of what is (or is not) knowable about the network based on genetic distance data.
Bio: https://pages.stat.wisc.edu/~ane/index.html
        https://botany.wisc.edu/staff/ane-cecile/
       
MatrixLM: a flexible, interpretable framework for high-throughput data
DATE & TIME: October 31, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Chenhao Zhao, Geisel School of Medicine, Dartmouth College
Abstract: The Matrix Linear Model (MLM) is an efficient and computationally feasible solution to association analysis for biomedical high-throughput data. Sen and Liang (2018) developed the MatrixLM.jl package in the Julia programming language, providing core functions to estimate matrix linear models.
The project's main goal was to collaborate on user-friendly documentation, increase testing features, and improve certain coding functionalities. We used simulated data to demonstrate how to use the package and used a case study example based on an actual disease metabolomics study to showcase MLM's benefits. Nonalcoholic fatty liver disease (NAFLD) is a progressive liver disease that is strongly associated with type II diabetes. Using Matrix Linear Models, our analysis investigated the association between metabolite characteristics (e.g., pathways) and patient characteristics such as type II diabetes.
Bio: https://geiselmed.dartmouth.edu/qbs/profile/chenhao-zhao/

A multi-omics approach to understanding the role of APOL1 in CKD amongst African Ancestry individuals
DATE & TIME: October 17, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Heather M. Highland, The University of North Carolina at Chapel Hill
Abstract: APOL1 is an integral part of the complement system, a component of innate immunity that serves as the first line of defense against pathogens. The trypanosome that causes African sleeping sickness developed mechanisms to evade the innate immune system after the migration out of Africa. This created a selective pressure on non-synonymous variants in the APOL1 gene. These variants are only observed in people with recent African ancestry. People carrying two variants in APOL1 are at increased risk of developing a variety of kidney diseases. The mechanism by which APOL1 alters kidney function is currently unclear. To investigate potential mechanisms, we have looked at differences in DNA methylation across the genome and metabolomic profiles. Using 1740 African Americans (AA) in the ARIC study and 3886 Hispanic/Latinos in SOL, we did not observe any statistically significant differences in metabolism in preliminary analyses after adjusting for multiple comparisons. In 947 AAs in ARIC, 949 AAs in JHS and 332 AAs in MESA, we identified extensive differences in methylation near the APOL gene family region on chromosome 22 (p = 2.7 × 10⁻⁸⁰). Additional methylation differences were seen near FEZF2, FAM20A, and KIAA0556. Investigation of the role of these loci in the development of kidney disease in people with two APOL1 risk alleles continues.
Bio: https://sph.unc.edu/adv_profile/heather-m-highland-phd/

Bidirectional Causal Modeling With Instrumental Variables and Data From Relatives
DATE & TIME: September 19, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Luis FS Castro-de-Araujo, The University of Melbourne, Australia; Virginia Institute for Psychiatric and Behavioral Genetics
Abstract: Establishing (or falsifying) causal associations is an essential step towards developing effective interventions for psychiatric and substance use problems. While randomized controlled trials (RCTs) are considered the gold standard for causal inference in health research, they are impossible or unethical in many common scenarios. Mendelian randomization (MR) can be used where RCTs are not feasible, but it requires stringent assumptions that can be fundamentally flawed when applied to complex traits. Some assumptions of MR can be avoided by using structural equation modeling. In this paper we developed an extension of the Direction of Causation twin model (Neale 1994) that includes two polygenic risk scores (PRS) in the specification, as an approach to avoid some inherent restrictions of both MR and RCTs. We hypothesize that adding a second PRS will generate a more flexible model in terms of identification, whilst maintaining reasonable power and allowing for bidirectional causation. OpenMx software is used to explore the power of such a model and its identification. We arrive at an extension of the Direction of Causation model that can be used in either a twin design or an extended family design, while relaxing some of MR's assumptions. We further report that the model is sufficiently powered for current data set sizes (from around 13000 observations or less, depending on the variance of the instruments), and in a range of additive, shared and environmental variances found in common clinical scenarios.
Bio: https://findanexpert.unimelb.edu.au/profile/821977-luis-fernando-silva-castro-de-araujo

Using Ontologies for Knowledge Engineering and Management in Medicine and Healthcare
DATE & TIME: August 29, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Arash Shaban-Nejad, University of Tennessee Health Science Center
Abstract: Health intelligence relies on the systematic collection and integration of data from diverse distributed and heterogeneous sources at various levels of granularity. These sources include data from multiple disciplines represented in different formats, languages, and structures, posing significant integration and analytics challenges. Using a series of clinical and population health applications and use cases, this seminar highlights the contribution made by emerging semantic technologies that offer enhanced interoperability, interpretability, and explainability through the adoption of ontologies (computational artifacts capturing domain knowledge using concepts, relations, and complex logical rules and axioms) and knowledge graphs.
Bio: https://www.uthsc.edu/faculty/profile/?netid=ashabann

Using Outbred HS Rats to Study the Genetic Basis of Almost Everything You Can Think Of
DATE & TIME: August 22, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting

Speaker: Dr. Abraham Palmer, University of California San Diego
Abstract: Whereas there are well-established methods for translating findings about single genes from humans to non-humans, there is an urgent need for methods to translate the polygenic signals obtained from GWAS across species. This is difficult because GWAS produces information about SNPs rather than genes; however, SNPs are inherently species specific. My lab is helping to develop two complementary methods to address this problem. Both methods depend on translation of GWAS signals from SNPs to genes. In one method, this is done by choosing the gene that is nearest to an implicated SNP. The lists of orthologous genes from two or more species are then projected into a previously defined gene network and a random walk is used to diffuse the signal to neighboring genes. The overlap between the networks defined by each species is then assessed for significance relative to permuted gene sets. In the second method, SNPs are used to predict gene expression and these predictions are used to estimate the effect of each gene’s expression on phenotype, creating what we term a polygenetic transcriptomic risk score (PTRS). A PTRS can then be used in conjunction with orthologous genes such that a PTRS defined in one species can be used to estimate an analogous trait in individuals from another species. In preliminary work we found that both methods identify highly statistically significant overlap in the signals associated with both BMI and body length. We are extending these methods to behavioral traits, including those relevant for substance use disorders.
Bio: https://profiles.ucsd.edu/abraham.palmer

A Single-Cell and Spatially Resolved Atlas of Human Breast Cancers
DATE & TIME: April 11, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Abstract: Breast cancers are complex cellular ecosystems where heterotypic interactions play central roles in disease progression and response to therapy. However, our knowledge of their cellular composition and organization remains limited. Recently we published an integrated cellular and spatial atlas of 26 primary human breast cancers spanning all major molecular subtypes. This provided a systematic, high-resolution characterization of the cellular diversity of the epithelial, immune and stromal cellular landscape. To investigate neoplastic cell heterogeneity, we developed a single cell classifier of intrinsic subtype (scSubtype) and revealed recurrent transcriptional gene modules that define the neoplastic cells. This detailed cellular taxonomy was then used to deconvolute large breast cancer cohorts, allowing their stratification into nine clusters, termed ‘ecotypes’, with unique cellular compositions and association with clinical outcome. Further, Visium spatial profiling provided an initial view of how stromal, immune and neoplastic cells are spatially organized in tumours, offering insights into tumour regulation. This work is now being expanded to the generation and integration of 100s of cellular profiles and a matching dataset of spatially resolved tumour transcriptomes. We propose that this will identify cellular niches that are spatially organized in breast tumours, offering insights into anti-tumour immune regulation and neoplastic heterogeneity. In particular, I’ll discuss recent work, using whole-transcriptome Nanostring spatial profiling, that analyses T-cell and cancer rich regions of the tumour micro-environment in triple-negative breast cancers, showing how integration with our cellular taxonomy allows for deconvolution of the cell-type abundances in these specific tissue regions.
Our work highlights the potential of large-scale, integrated, cellular and spatial genomics to unravel the complex cellular heterogeneity within tumours and identify novel cell types, niches, and regulatory states that will inform treatment response.
Publication: https://www.nature.com/articles/s41588-021-00911-1

A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomic Data
DATE & TIME: March 28, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Li Hsu, Fred Hutchinson Cancer Research Center
Abstract: Genome-wide association studies (GWAS) have successfully identified thousands of genetic variants for many complex diseases; however, these variants explain only a small fraction of the heritability. Recent developments in genotype-omics studies have shown promise for discovering novel loci by incorporating genetically regulated molecular phenotypes (e.g., gene expression, methylation, proteomics) into GWAS. However, there is a limitation in the existing approaches. Some variants can individually influence disease risk through alternative functional mechanisms. Existing approaches that test only the association of imputed molecular phenotypes will potentially lose power. To tackle this challenge, we consider a unified mixed effects model that formulates the association of intermediate phenotypes such as imputed gene expression through fixed effects, while allowing for residual effects of individual variants as random effects. We consider a set-based score testing framework, MiST (Mixed effects Score Test), and propose data-driven combination approaches to jointly test for the fixed and random effects. We also provide p-values for fixed and random effects separately to enhance interpretability of the association signals. Recently, we extended MiST to depend on only GWAS summary statistics instead of individual level data, allowing for a broad application of MiST to GWAS data. Extensive simulations demonstrate that MiST is more powerful than existing approaches and that summary statistics-based MiST (sMiST) agrees well with results obtained from individual level data, with substantially improved computational speed. We apply sMiST to a large-scale GWAS of colorectal cancer using summary statistics from >120,000 study participants and gene expression data from the Genotype-Tissue Expression (GTEx) project. We identify several novel and secondary independent genetic loci.
Bio: https://www.fredhutch.org/en/faculty-lab-directory/hsu-li.html
Publication: https://pubmed.ncbi.nlm.nih.gov/29727690/

A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples
DATE & TIME: March 21, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Hongkai Ji, Johns Hopkins Bloomberg School of Public Health
Abstract: Pseudotime analysis with single-cell RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many computational methods have been developed to infer the pseudo-temporal trajectories of cells within a biological sample, methods that compare pseudo-temporal patterns with multiple samples (or replicates) across different experimental conditions are lacking. Lamian is a comprehensive and statistically-rigorous computational framework for differential multi-sample pseudotime analysis. It can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions, and also to detect changes in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that are not generalizable to new samples. Using both simulations and real scRNA-seq data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.
Bio: https://publichealth.jhu.edu/faculty/1943/hongkai-ji
Publication: https://www.biorxiv.org/content/10.1101/2021.07.10.451910v1

Inference of disease associated genomic segments in post-GWAS analysis
DATE & TIME: March 14, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Lin Hou, Tsinghua University
Abstract: Identification and interpretation of disease associated loci remain a paramount challenge in genome-wide association studies (GWAS) of complex disease. We develop post-GWAS analysis tools, which leverage pleiotropy and functional annotations to dissect the genetic architecture of complex traits. In this talk, I will first introduce LOGODetect, a powerful and efficient statistical method to identify small genome segments harboring local genetic correlation signals. LOGODetect automatically identifies genetic regions showing consistent association with multiple phenotypes through a scan statistic approach. Applying it to seven neuropsychiatric traits, we identify hub regions showing concordant effects on five or more traits. Next, I will introduce Openness Weighted Association Studies (OWAS), a computational approach that leverages and aggregates predictions of chromatin accessibility in personal genomes for prioritizing GWAS signals. In extensive simulation and real data analysis, OWAS identifies genes/segments that explain more heritability than existing methods, and has a better replication rate in independent cohorts than GWAS. Moreover, the identified genes/segments show tissue-specific patterns and are enriched in disease relevant pathways.
Bio: http://www.stat.tsinghua.edu.cn/en/teambuilder/faculty/lin-hou/

Durability of Covid-19 Vaccines
DATE & TIME: February 14, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Danyu Lin, University of North Carolina at Chapel Hill
Abstract: Evaluating the durability of protection afforded by Covid-19 vaccines is a public health priority, with the results needed to inform policies around booster vaccinations as well as those around non-pharmaceutical interventions. In this talk, I will present a general framework for estimating the effects of Covid-19 vaccines over time in phase 3 clinical trials and observational studies. I will show some results on the duration of vaccine protection from the Moderna pivotal trial and from the North Carolina statewide surveillance data. The latter data, which were published in the New England Journal of Medicine in January, provided rich information about the effectiveness of the Pfizer, Moderna, and Johnson & Johnson vaccines in reducing the risks of Covid-19, hospitalization, and death over time. I will discuss the implications of these results for booster vaccinations.
Bio: https://sph.unc.edu/adv_profile/danyu-lin-phd/

Democratizing data-driven biology: Tackling incomplete data, unstructured metadata, and hidden curricula
DATE & TIME: February 7, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Arjun Krishnan, Michigan State University
Abstract: There is much enthusiasm about using omics and biomedical data collections to fuel research on complex traits and diseases. However, there are still some well-known fundamental challenges in seamlessly and effectively using these data to drive research. For instance, there are >1.5 million human gene expression profiles that are publicly available, but, depending on the technology/platform used to record each profile, different subsets of genes in the genome are measured in these transcriptomes, leading to thousands of unmeasured genes in many of these profiles. These gaps in data are major hurdles for integrative analysis. Critical problems also exist with data descriptions: the majority of >2 million publicly available omics samples lack structured metadata, including information about tissue of origin, disease status, and environmental conditions. Thus, discovering samples and datasets of interest is not straightforward. In this seminar, I will present recent work from our group on developing machine learning approaches to address these fundamental challenges. In addition, I will discuss the need for improving advanced research training in biological data analysis by formalizing concepts in statistical procedures, study design, data/code management, critically consuming data-driven findings, and reproducible research.
Bio: https://cmse.msu.edu/directory/faculty/arjun-krishnan/

Meta-clustering of Genomic Data
DATE & TIME: January 31, 2022, 3:00 PM - 4:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Yingying Wei, Chinese University of Hong Kong
Abstract: Like traditional meta-analysis that pools effect sizes across studies to improve statistical power, it is of increasing interest to conduct clustering jointly across datasets to identify disease subtypes for bulk genomic data and discover cell types for single-cell RNA-sequencing (scRNA-seq) data. Unfortunately, due to the prevalence of technical batch effects among high-throughput experiments, directly clustering samples from multiple datasets can lead to wrong results. Recently emerging meta-clustering approaches require all datasets to contain all subtypes, which is not feasible for many experimental designs.
In this talk, I will present our Batch-effects-correction-with-Unknown-Subtypes (BUS) framework. BUS is capable of correcting batch effects explicitly, grouping samples that share similar characteristics into subtypes, identifying features that distinguish subtypes, and enjoying a linear-order computational complexity. We prove the identifiability of BUS for not only bulk data but also scRNA-seq data whose dropout events suffer from missing not at random. We mathematically show that under two very flexible and realistic experimental designs—the “reference panel” and the “chain-type” designs—true biological variability can also be separated from batch effects. Moreover, despite the active research on analysis methods for scRNA-seq data, rigorous statistical methods to estimate treatment effects for scRNA-seq data—how an intervention or exposure alters the cellular composition and gene expression levels—are still lacking. Building upon our BUS framework, we further develop statistical methods to quantify treatment effects for scRNA-seq data.
Bio: https://www.sta.cuhk.edu.hk/peoples/ywei/
Novel Genetic Association Test and Cardiomyopathy Risk Prediction in Cancer Survivors 
DATE & TIME: January 24, 2022, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Xuexia Wang, University of North Texas
Abstract: This talk includes two projects. Project 1: Gene-based association tests are widely used in GWAS. The power of a test is often limited by the sample size, the effect size, and the number of causal variants or their directions in a gene. In addition, access to individual-level data is often limited. To resolve the existing limitations, we proposed an optimally weighted combination (OWC) test based on summary statistics from GWAS. We analytically proved that aggregating the variants in one gene is the same as using the weighted combination of Z-scores for each variant based on the proposed score test. Several existing methods are special cases. We also numerically illustrated that our proposed test outperforms several existing methods via simulation studies. Furthermore, we utilized schizophrenia GWAS data and fasting glucose GWAS meta-analysis data to demonstrate that our method outperforms the existing methods in real data analyses. Project 2: We used a carefully curated list of 87 previously published genetic variants to determine whether the incorporation of genetic variants with non-genetic variables could improve the identification of cancer survivors at risk for anthracycline-related cardiomyopathy. We used anthracycline-exposed childhood cancer survivors from a Children’s Oncology Group study (COG-ALTE03N1: 146 cases; 195 matched controls) as the discovery set. Replication was performed in two anthracycline-exposed survivor populations: i) childhood cancer survivors from the Childhood Cancer Survivor Study (CCSS: 126 cases; 250 controls); ii) autologous blood or marrow transplantation (BMT) survivors from the BMT Survivor Study (BMTSS: 80 cases; 78 controls). The Clinical+Genetic model performed better than the Clinical Model in COG-ALTE03N1 (AUC of Clinical+Genetic Model = 0.88 vs. AUC of Clinical Model = 0.81) and BMTSS (AUC of Clinical+Genetic Model = 0.72 vs. AUC of Clinical Model = 0.64), but not in CCSS (AUC of Clinical+Genetic Model = 0.88 vs. AUC of Clinical Model = 0.89). However, the Clinical+Genetic model performed marginally better in CCSS patients without CVRFs where cardiomyopathy developed within 30 years of anthracycline exposure (AUC of Clinical+Genetic Model = 0.90 vs. AUC of Clinical Model = 0.85). Conclusions: Adding a comprehensively assembled genetic profile to clinical characteristics improves the identification of cancer survivors at risk for anthracycline-related cardiomyopathy.
Bio: https://math.unt.edu/people/xuexia-wang
Publication: https://pubmed.ncbi.nlm.nih.gov/32413235/
Advances in Mendelian Randomization
DATE & TIME: November 22, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Zoltán Kutalik, Associate Professor, University of Lausanne
Abstract: First, I will motivate the need for causal inference in contrast to observational correlations. In particular, I’ll describe the principle of an instrumental variable approach heavily applied in genetic research, termed Mendelian Randomization (MR). Next, I will show four extensions of this method to different settings/assumptions: (i) In-depth quantification and correction of the bias of the most popular MR method (IVW); (ii) Modelling genetic architecture simultaneously with bidirectional causal effects in the presence of a heritable confounder; (iii) Causal inference for composite trait type exposures; (iv) estimation of non-linear causal effects. I’ll finish with two applications to omics data: first, we link differential gene expression analyses with bi-directional causal effects and finally, I’ll touch on how causal effects of different omics layers are mediated.
Bio: https://applicationspub.unil.ch/interpub/noauth/php/Un/UnPers.php?PerNum=1048220&LanCode=8&menu=coord
Publication: https://pubmed.ncbi.nlm.nih.gov/33737653/
Standardization and Penalty Parameter Selection in Penalized Linear Mixed Models
DATE & TIME: November 1, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Anna Reisetter, PhD, University of Iowa
Abstract: Penalized linear mixed models (LMMs) have been developed to accurately identify genotype-phenotype associations in the presence of dependent samples. Despite this, the statistical properties of these models are not well understood, and there is a lack of available software for their implementation. In this talk, we provide an overview of penalized LMMs for the analysis of structured genetic data, while examining their statistical properties in the genetic association setting. We then focus on the statistical properties of penalized LMMs in a general setting and provide recommendations for key components of their implementation, including appropriate standardization and penalty parameter selection. We demonstrate the benefits of our recommendations in both a general setting and one specific to genetic data. We conclude with a detailed analysis of a large, empirical GWAS data set that contains complex correlation among samples. We use this analysis to illustrate the benefits of penalized LMMs compared to traditional genome-wide association methods, and to demonstrate the utility of penalizedLMM, an R package we have developed for the flexible and user-friendly implementation of penalized LMMs.
Publication: https://onlinelibrary.wiley.com/doi/full/10.1002/gepi.22384
PanelPRO: A General Framework for Multi-Gene, Multi-Cancer Mendelian Risk Prediction Models 
DATE & TIME: October 18, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Jane Liang, Ph.D. in Biostatistics, Harvard University
Abstract: Risk evaluation to identify individuals who are at greater risk of cancer as a result of heritable pathogenic variants is a valuable component of individualized clinical management. Using principles of Mendelian genetics, Bayesian probability theory, and variant-specific knowledge, Mendelian models derive the probability of carrying a pathogenic variant and developing cancer in the future, based on family history. Existing Mendelian models are widely employed, but are generally limited to specific genes and syndromes. However, the upsurge of multi-gene panel germline testing has spurred the discovery of many new gene-cancer associations that are not presently accounted for in these models. We have developed PanelPRO, a flexible, efficient Mendelian risk prediction framework that can incorporate an arbitrary number of genes and cancers, overcoming the computational challenges that arise because of the increased model complexity. Using simulations and a clinical cohort with germline panel testing data, we evaluate model performance, validate the reverse-compatibility of our approach with existing Mendelian models, and illustrate its usage.
Publication: https://arxiv.org/abs/2108.12504
Seminar Recording Link
Principled Approaches to the Practical Challenges of Real-World Data
DATE & TIME: October 4, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Rebecca Hubbard, Professor of Biostatistics, University of Pennsylvania
Abstract: Interest in conducting research using real-world data, data generated as a by-product of digital transactions, has exploded over the past decade and has been further spurred by the 21st Century Cures Act. Real-world data facilitate understanding of treatment utilization and outcomes as they occur in routine practice, and studies using these data sources can potentially proceed rapidly compared to trials and observational studies that rely on primary data collection. However, using data sources that were not collected for research purposes comes at a cost, and naïve use of such data without considering their complexity and imperfect quality can lead to bias and inferential error. Real-world data frequently violate the assumptions of standard statistical methods, but it is not practicable to develop new methods to address every possible complication arising in their analysis. The statistician is faced with a quandary: how to effectively utilize real-world data to advance research without compromising best practices for principled data analysis. In this talk I will use examples from my research on methods for the analysis of electronic health record (EHR)-derived data to illustrate approaches to understanding the data generating mechanism for real-world data. Drawing on this understanding, I will then discuss approaches to identify, use, and develop principled methods for incorporating EHR data into research. The overarching goal of this presentation is to raise awareness of challenges associated with the analysis of real-world data and demonstrate how a principled approach can be grounded in an understanding of the scientific context and data generating process.
Seminar Recording Link
Generalized V-Learning Framework for Estimating Dynamic Treatment Regimes
DATE & TIME: September 27, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Duyeol Lee, Quantitative Analytics Specialist, Model Innovations Team, Wells Fargo Bank
Abstract: Precision medicine is an approach that incorporates personalized information to efficiently determine which treatments are best for which types of patients. A key component of precision medicine is creating mathematical estimators for clinical decision-making. Dynamic treatment regimes formalize tailored treatment plans as sequences of decision rules. Recently, the V-learning method was introduced to estimate optimal dynamic treatment regimes. This method showed good performance compared to the existing reinforcement learning methods such as greedy gradient Q-learning. However, the complicated functional form of its loss function makes it difficult to apply modern machine learning methods for the estimation of value functions of treatment policies. We propose a generalized V-learning framework for estimating optimal treatment regimes. The proposed method adopts widely used loss functions and an iterative method to estimate value functions. Simulation studies show that the proposed method provides better performance compared with the original V-learning method.
OSGA Webinar - Organizing Data in Spreadsheets
DATE & TIME: September 24, 2021, 12:00 PM - 1:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Karl Broman, Professor, Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison
Abstract: Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this presentation will offer practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plaintext files.
Broman KW, Woo KH (2018) Data organization in spreadsheets. The American Statistician 72:2–10

Previous Webinar Series
Mendelian Randomization Analyses in a Multivariate Framework
DATE & TIME: September 13, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Dr. Phillip Hunter Allman (University of Alabama at Birmingham School of Public Health)
Summary: Mendelian randomization (MR) is an application of instrumental variable (IV) methods to observational data in which the IV is a genetic variant. An IV is a variable that functions similarly to the random treatment group assignment seen in clinical trials. Numerous statistical methods exist for subject-level MR or IV analysis, such as two-stage predictor substitution (2SPS) or the correlated errors model (CEM). These methods have some limitations depending on the distributions involved and the number of IVs, particularly when it comes to asymptotic variance estimation and hypothesis testing. Our research explores extensions of the CEM to scenarios with non-normal variates. This talk will provide a brief introduction to MR, describe the popular statistical methods for such analyses, describe our extensions to the correlated errors model, and compare these methods with simulations and real data analysis.
Summer Intern Presentation II
DATE & TIME: August 30, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Zifan (Fred) Yu (University of Washington)
Title: Fast Algorithms for Elastic-net Matrix Linear Models
Abstract: Our work builds upon Matrix Linear Models, which have been established and well studied by Dr. Saunak Sen and Jane W. Liang, a previous intern. The models are suitable for high-throughput data in which the responses are multivariate, with each column being, for example, a genetic strain and each row being an experimental run. They take full advantage of the column covariates (which usually represent the environmental conditions) and the row covariates (features that are associated with the type of the response). Extending univariate linear models and standard statistical approaches such as the t-test, Matrix Linear Models provide flexibility in modeling and computational speed. Previous work developed solutions to estimate the models under Lasso penalization, with the assumption that effects are rare (sparsity in the coefficients). To overcome some of the limitations of the Lasso while preserving some of its strengths, we developed solutions with the Elastic-net, a regularization approach that combines the Lasso and Ridge types of regression. As in the previous work, we extended the proximal algorithms (ISTA, FISTA, and ADMM) to solve the Elastic-net problems. The outcome of our work is an extension of MatrixLMnet, a package in Julia (a relatively young but promising programming language) initially developed by Jane W. Liang and Dr. Saunak Sen, which now offers Elastic-net solutions with the same functionality as the Lasso solutions.

Speaker: Phillip Winston Miller (University of Memphis)
Title: bigBERD: An R package to generate reproducible analysis
Abstract: bigBERD is an R package which automatically generates code for general analysis plans in a piecewise and robust manner. Nearly all standard analysis plans contain the same basic steps: visualization, summarization of variables, univariate testing, and some type of regression analysis. Much time is spent copying, modifying, or developing new code to analyze methodologically similar problems. My package aims to circumvent this issue by generating customized code for any dataset at hand. bigBERD operates in a stepwise manner. First, the user generates a Report directory tree to facilitate good data handling practices. Then the package reads the clean dataset and creates a short summary file, which contains useful information for the user and forms the basis of automated decision making. After these prerequisites are generated, the user can call one of four document generating functions: exploratory analysis, univariate analysis, multiple regression analysis, or time-to-event analysis. Each of these will generate a .Rmd document containing code customized to the task and dataset at hand. All code can be run immediately after generation or modified as the user sees fit. The exploratory document contains code for univariate plots of each variable, bivariate plots, and a summary table. The univariate analysis function writes code to perform univariate analysis on all variables, stratified by a user-defined condition of interest. It also generates code for simple linear/logistic/multinomial/ordinal regression on an outcome of interest, with accompanying tables and plots. The multiple regression analysis function generates code to build a main effects model and a stepwise AIC reduced model, with summary tables and plots. The time-to-event analysis function generates KM plots with log-rank tests for categorical variables and a main-effects Cox proportional hazards model, with a summary table.
Summer Intern Presentation I
DATE & TIME: August 23, 2021, 2:00 PM - 3:00 PM CDT
LOCATION: Zoom Meeting
Speaker: Nadeesha Thewarapperuma (University of Kansas Medical Center)
Title: Analysis of DNA methylation data using a functional data approach
Abstract: We will use a data series on oral and pharyngeal carcinoma that includes 154 cases and 72 controls. A functional data approach will be used to analyze CpG island data for each gene to determine if there is a difference between case and control groups. The package fdANOVA will be used for the hypothesis tests. The null hypothesis is that the case and control groups have the same mean function, and the alternative is that the mean functions differ.

Speaker: Ye Eun Bae (Florida State University)
Title: Building tools for understanding the genetic architecture of allopolyploidy
Abstract: Analysis of quantitative trait loci (QTL) is a useful approach to identify putatively causal genetic factors underlying quantitative trait variation in species. While many QTL analysis tools and methods have been proposed, it is challenging to extend them to allopolyploid species, which require polyploidy-aware genetic mapping that uses the fact that there is preferential pairing between pairs of homologous chromosomes. In this project, we developed a QTL analysis tool for allopolyploids to provide insights into their genetic architecture and genotype-phenotype associations. As a starting point, we applied our tool to switchgrass (Panicum virgatum) datasets and visualized the results to help understand the allopolyploid nature of switchgrass.

Seminar Recording Link
