MarinStatsLectures

Bivariate Analysis

Bivariate is just another word for two variables, and bivariate analysis is one of the simplest forms of statistical analysis. It examines two variables, X (the independent or explanatory variable) and Y (the dependent or outcome variable), to determine the relationship between them. Bivariate analysis explores how the dependent (“outcome”) variable depends on, or is explained by, the independent (“explanatory”) variable, or it explores the association between the two variables without any cause-and-effect relationship. When dealing with bivariate data, most often we know which variable is the outcome (dependent) and which is the explanatory (independent), but sometimes there is no natural choice: for example, when examining the relationship between individuals' IQs and EQs, neither variable is naturally the outcome or the explanatory. In this series, we have divided the methods of bivariate analysis based on the type of the dependent/outcome (Y) variable and the type of the independent/explanatory (X) variable. That is, deciding whether our X and Y variables are numeric (quantitative, continuous) or categorical (qualitative, factor) helps us decide which types of plots (graphical summaries) and which analysis methods are appropriate.

These statistics video tutorials use simple, easy-to-follow examples to explain concepts in statistics. The videos are produced in a real-life lecture format so students can watch and learn as the teacher works through the examples at the pace of a real classroom.

  

 

Concepts in Statistics: Bivariate Analysis Video Tutorials 

Bivariate Analysis Meaning: In this tutorial, we provide a big-picture overview of bivariate data analysis. This video is intended to set up all of the bivariate analysis that follows. Our approach is to divide the methods of analysis based on the type of the dependent or outcome (Y) variable and the type of the independent or explanatory (X) variable. That is, deciding whether our X and Y variables are numeric (quantitative, continuous) or categorical (qualitative, factor) helps us decide which types of plots (graphical summaries) are appropriate, as well as which sorts of analyses may be appropriate.

Bivariate Analysis: Categorical & Numerical: In this tutorial, you will get an overview of bivariate analysis when the Y variable (dependent/outcome variable) is numeric (quantitative) and the X variable (independent/explanatory variable) is categorical (qualitative). This tutorial is an introduction to the paired t-test, the two-sample t-test, the Wilcoxon signed-rank test, the Wilcoxon rank-sum test (also known as the Mann-Whitney U test), one-way analysis of variance (ANOVA), the Kruskal-Wallis one-way analysis of variance, and multiple comparisons in the context of one-way ANOVA. We also discuss the pros and cons of matching or pairing, as well as when it is and isn't possible.
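
Before diving into the individual tests, here is a compact sketch, purely for orientation, of how the tests named above map onto base-R functions; the numeric outcome y and the grouping factor g below are simulated, not data from the videos.

    set.seed(1)
    y <- rnorm(60, mean = 10, sd = 2)                 # made-up numeric outcome (Y)
    g <- factor(rep(c("A", "B", "C"), each = 20))     # made-up 3-level categorical X
    y2 <- y[g != "C"]; g2 <- droplevels(g[g != "C"])  # two-group subset for the 2-sample tests

    t.test(y2 ~ g2)         # two-sample t-test (Welch form by default)
    wilcox.test(y2 ~ g2)    # Wilcoxon rank-sum / Mann-Whitney U test
    summary(aov(y ~ g))     # one-way ANOVA for 3+ groups
    kruskal.test(y ~ g)     # Kruskal-Wallis one-way ANOVA on ranks
    # The paired t-test and signed-rank test need matched pairs; see the sketches below.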

Paired t-Test: In this video, we learn how to use the paired t-test to compare the means of two matched, paired, or dependent groups. The paired t-test (also known as the paired two-sample t-test, paired-sample t-test, or dependent-sample t-test) essentially becomes the univariate (one-sample) t-test: we take the difference between the observations in the two groups and then conduct a test on the mean difference. We also cover building a confidence interval for the mean difference, as well as how it can be used to test a hypothesis. While we do show the calculations in the video, the focus is on the concepts underlying the test. Common applications of the paired-sample t-test include case-control studies and repeated-measures designs. Make sure to check out our video tutorial on how to conduct the paired t-test using R (Link Here)
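
As a quick companion to the video, here is a minimal R sketch, with simulated before/after measurements rather than the video's data, showing that the paired t-test is equivalent to a one-sample t-test on the differences:

    set.seed(42)
    before <- rnorm(15, mean = 120, sd = 10)        # hypothetical measurements before treatment
    after  <- before - rnorm(15, mean = 5, sd = 4)  # hypothetical measurements after treatment

    t.test(after, before, paired = TRUE)  # paired t-test; output includes a 95% CI for the mean difference
    t.test(after - before)                # one-sample t-test on the differences: identical t, df, and p-value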

Wilcoxon Signed-Rank Test: In this video, we learn what the Wilcoxon signed-rank test and the sign test are and when we should use them. The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used to test whether the median difference between two paired or dependent groups is significantly different from 0. It can be thought of as a non-parametric alternative to the paired-sample (dependent-samples) t-test. We use the sign test as a simple introduction to a non-parametric approach, then discuss why this simple test is inadequate and move on to the signed-rank test. We use the explanation of the sign test and the Wilcoxon signed-rank test to lay the foundation for non-parametric testing in general. This helps build an understanding of the concept behind a non-parametric test without spending too much time on the details of the calculations, and it will help us understand the other non-parametric tests that follow. While they all use slightly different "formulas", they are all based on the same general principle: non-parametric tests tend to work with the ranks of the observations rather than the numeric values themselves. Make sure to check out the tutorial on how to conduct the Wilcoxon signed-rank test with R (Link Here)
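
A brief R sketch with made-up paired data (not from the video): base R has no dedicated sign-test function, but the sign test is just a binomial test on the number of positive differences, while wilcox.test() with paired = TRUE gives the signed-rank test.

    set.seed(7)
    before <- rnorm(12, mean = 50, sd = 8)
    after  <- before + rnorm(12, mean = 3, sd = 6)
    d <- after - before

    binom.test(sum(d > 0), n = sum(d != 0))    # sign test: uses only the signs of the differences
    wilcox.test(after, before, paired = TRUE)  # Wilcoxon signed-rank test: also uses the ranks of |d|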

Two-Sample t-Test: In this video, we talk about the two-sample t-test for independent groups. The two-sample t-test (independent-samples or unpaired-samples t-test) is used to compare the means of two independent groups. The test requires assuming that the variance (or standard deviation) of the two groups being compared is either equal or not equal (at the level of the population). We also cover building a confidence interval for the difference in means, as well as how it can be used to test a hypothesis. We also mention the non-parametric alternative to the independent-samples t-test: the Wilcoxon rank-sum test (also known as the Mann-Whitney U test). While we do show the calculations in the video, the focus is on the concepts underlying the test. Make sure to check out our video tutorial on how to conduct the two-sample t-test with R (Link Here)
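
A minimal R illustration with made-up data for two independent groups (not the video's example); t.test() with a formula runs the two-sample test, using the unequal-variance (Welch) form unless told otherwise.

    set.seed(3)
    dat <- data.frame(
      group = rep(c("control", "treatment"), each = 20),
      y     = c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 12, sd = 2))
    )

    t.test(y ~ group, data = dat)                    # Welch (unequal-variance) two-sample t-test
    t.test(y ~ group, data = dat, var.equal = TRUE)  # pooled (equal-variance) version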

Two-Sample t-Test for Independent Groups: In this statistics video, we learn to compare the means of two independent groups using the two-sample t-test (independent-samples or unpaired-samples t-test). The two-sample t-test requires assuming that the variance (or standard deviation) of the two groups being compared is either equal or not equal (at the level of the population). Here, we also learn to build a confidence interval for the difference in means, as well as how it can be used to test a hypothesis. We also mention the non-parametric alternative to the independent-samples t-test: the Wilcoxon rank-sum test (also known as the Mann-Whitney U test). While we do show the calculations in the video, the focus is on the concepts underlying the test. To learn how to conduct the independent two-sample t-test and calculate a confidence interval with R, watch this video (Link Here), and to learn how to conduct the Wilcoxon rank-sum (Mann-Whitney U) test, the non-parametric alternative to the two-sample t-test, in R, watch this video (Link Here)
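
Continuing the same made-up two-group setup, this sketch pulls the confidence interval for the difference in means out of the t.test() result and runs the non-parametric alternative:

    set.seed(11)
    dat <- data.frame(
      group = rep(c("A", "B"), each = 25),
      y     = c(rnorm(25, mean = 5, sd = 1.5), rnorm(25, mean = 6, sd = 1.5))
    )

    fit <- t.test(y ~ group, data = dat)
    fit$conf.int                        # 95% CI for (mean of group A) - (mean of group B)
    wilcox.test(y ~ group, data = dat)  # Wilcoxon rank-sum (Mann-Whitney U) test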

Two-Sample t-Test: Equal vs Unequal Variance Assumption: In this statistics tutorial, we learn about the assumption of equal variance (or standard deviation) versus unequal variance (or standard deviation). When working with the two-sample t-test, we must make one of these two assumptions. We also learn how to decide whether we can assume equal variance or should assume unequal variance, covering both a subjective way of making this decision and more formal tests that can be used. Assuming equal variance is also referred to as 'pooling', or a 'pooled estimate' of the variance. The video also shows how to calculate the standard error for the difference in means under each of the assumptions, although the focus is on what each assumption means in concept, not on the calculations. To learn how to conduct the independent two-sample t-test and calculate a confidence interval with R, watch this video (Link Here)
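
A short sketch with simulated data in which one group is deliberately more variable; var.test() is one example of a formal test of equal variances, and the two t.test() calls show the pooled and Welch versions side by side.

    set.seed(9)
    g <- rep(c("A", "B"), each = 30)
    y <- c(rnorm(30, mean = 10, sd = 1), rnorm(30, mean = 11, sd = 3))  # group B has a larger SD

    var.test(y ~ g)                   # F test for equality of two variances
    t.test(y ~ g, var.equal = TRUE)   # pooled standard error (equal-variance assumption)
    t.test(y ~ g, var.equal = FALSE)  # Welch standard error (unequal-variance assumption; the default)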

One-Way ANOVA (Analysis of Variance): Introduction: In this video tutorial, we learn about the one-way analysis of variance (ANOVA) test, its purpose, the null and alternative hypotheses in the ANOVA test, and the required assumptions for one-way ANOVA. One-way ANOVA is used to compare the means of three or more independent groups. The test requires assuming independent observations, independent groups, and either that the variance (or standard deviation) of the groups being compared is approximately equal or that the sample size for each group is large. The non-parametric alternative to ANOVA is the Kruskal-Wallis one-way analysis of variance (also called the Kruskal-Wallis test by ranks, the Kruskal-Wallis H test, named after William Kruskal and W. Allen Wallis, or one-way ANOVA on ranks); bootstrap (resampling) approaches are another alternative to this test. To learn how to conduct ANOVA, ANOVA multiple comparisons, and the Kruskal-Wallis test in R, watch this video (Link Here)
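
A minimal base-R sketch with simulated data for three groups (invented for illustration, not the video's example):

    set.seed(5)
    dose <- factor(rep(c("low", "medium", "high"), each = 15))
    y    <- c(rnorm(15, mean = 20, sd = 4), rnorm(15, mean = 23, sd = 4), rnorm(15, mean = 26, sd = 4))

    fit <- aov(y ~ dose)
    summary(fit)            # F statistic and p-value for H0: all group means are equal
    kruskal.test(y ~ dose)  # Kruskal-Wallis: the rank-based (non-parametric) alternative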

ANOVA (Analysis of Variance) and Sum of Squares: In this ANOVA video tutorial, we learn about sum-of-squares calculations and interpretations, the explained sum of squares, the unexplained sum of squares, between-group and within-group variability, signal and noise, and the larger concept of analysis of variance (ANOVA) through examples. Analysis of variance is a broad concept in the statistical sciences (one-way analysis of variance is the hypothesis test many are familiar with). The total sum of squares (SST) is a measure of the total variability in Y; in fact, it is the numerator in the formula for calculating the variance of Y. What are the between-group sum of squares (SSB) and the within-group sum of squares (SSW)? The first thing to mention is that SST = SSB + SSW. The total sum of squares can be divided into two parts: the amount of the sum of squares that can be explained by X (SSB), the variability between groups, and the amount that is unexplained by X (SSW), the variability within groups. The between-group sum of squares (SSB) is also known as the explained sum of squares, treatment sum of squares (SSt), model sum of squares (SSm), or regression sum of squares (SSr), among others; these names can also appear with the order reversed, such as sum of squares treatment, and so on. The within-group sum of squares (SSW) is also known as the unexplained sum of squares, sum of squared error (SSe), or sum of squared residuals (SSr), among others; again, these can appear with the order reversed, such as residual sum of squares, and so on. To learn how to conduct ANOVA, ANOVA multiple comparisons, and the Kruskal-Wallis test in R, watch this video (Link Here)
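
To make the SST = SSB + SSW identity concrete, here is a small R sketch (data simulated for illustration) that computes each piece by hand and checks it against the ANOVA table:

    set.seed(8)
    g <- factor(rep(c("A", "B", "C"), each = 10))
    y <- c(rnorm(10, mean = 5), rnorm(10, mean = 7), rnorm(10, mean = 9))

    grand.mean  <- mean(y)
    group.means <- tapply(y, g, mean)

    SST <- sum((y - grand.mean)^2)                       # total variability in y
    SSW <- sum((y - group.means[g])^2)                   # within-group (unexplained) variability
    SSB <- sum(table(g) * (group.means - grand.mean)^2)  # between-group (explained) variability

    all.equal(SST, SSB + SSW)  # TRUE: the decomposition holds
    anova(aov(y ~ g))          # the same SSB and SSW appear in the "Sum Sq" column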

ANOVA, the F Statistic, and the P-Value: The analysis of variance test involves comparing the variability explained by group differences (between-group variability) to the variability that cannot be explained by group differences (within-group variability). This is done by calculating an F statistic that is the ratio of the two. Large values of the test statistic indicate that there is likely a difference in means for at least one group. One-way ANOVA is used to compare the means of three or more independent groups, and requires assuming independent observations, independent groups, and either that the variance (or standard deviation) of the groups being compared is approximately equal or that the sample size for each group is large. In this statistics tutorial, we learn what the F statistic tells us in ANOVA, how to find the F statistic, how to interpret the F value and p-value in ANOVA, what the degrees of freedom for the F test are, what a large F statistic means, and more.
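
Building on the sum-of-squares sketch above (same simulated-data idea), this snippet shows how the F statistic is assembled from the between- and within-group mean squares and where its p-value comes from:

    set.seed(8)
    g <- factor(rep(c("A", "B", "C"), each = 10))
    y <- c(rnorm(10, mean = 5), rnorm(10, mean = 7), rnorm(10, mean = 9))

    tab <- anova(aov(y ~ g))
    SSB <- tab["g", "Sum Sq"];         dfB <- tab["g", "Df"]          # between groups: df = #groups - 1
    SSW <- tab["Residuals", "Sum Sq"]; dfW <- tab["Residuals", "Df"]  # within groups:  df = n - #groups

    F.stat <- (SSB / dfB) / (SSW / dfW)                 # ratio of mean squares (signal / noise)
    p.val  <- pf(F.stat, dfB, dfW, lower.tail = FALSE)  # upper-tail area of the F distribution
    c(F = F.stat, p = p.val)                            # matches "F value" and "Pr(>F)" in the ANOVA table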

ANOVA & Bonferroni Correction for Multiple Comparisons: In this ANOVA video tutorial, we learn about Bonferroni's multiple-testing correction (the Bonferroni correction) for analysis of variance (ANOVA). When comparing multiple groups, if the null hypothesis is rejected (with a small p-value), the conclusion is that there is evidence that at least one of the means differs from the rest, but there is no indication of which ones differ. To decide which we believe differ, we can conduct "multiple comparisons" of all pairwise sets of means. While working through an example with multiple comparisons, we will see that because we are making several comparisons at once, the chance of making a type I error (false positive) increases. The family-wise error rate (FWER) is the probability of at least one type I error (at least one false positive) when performing multiple hypothesis tests. Bonferroni proposed a method to correct the inflated type I error rate. The Bonferroni correction assumes that all pairwise tests are independent; this may not be true, but as we will see in this video, independence makes the calculations simpler and also makes the correction a bit more conservative. Bonferroni's approach is to use an adjusted alpha level: the significance cut-off for each test is set at α / (number of tests), in order to keep the overall type I error rate at approximately α (alpha). While Bonferroni's method is not necessarily the 'optimal' correction to use, it is easy to understand, and it is conservative. Other methods of correction for multiple comparisons do exist, Tukey's or Dunnett's, for example. They are all based on the same concept, so once you understand Bonferroni's correction, you will be able to understand the concepts behind the other options.
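
A minimal sketch with simulated three-group data, showing the pairwise p-values with and without the Bonferroni adjustment (and Tukey's correction for comparison), using base R:

    set.seed(13)
    g <- factor(rep(c("A", "B", "C"), each = 12))
    y <- c(rnorm(12, mean = 10), rnorm(12, mean = 10.5), rnorm(12, mean = 12))

    pairwise.t.test(y, g, p.adjust.method = "none")        # raw p-values for the 3 pairwise comparisons
    pairwise.t.test(y, g, p.adjust.method = "bonferroni")  # Bonferroni: p-values scaled by the number of tests
    TukeyHSD(aov(y ~ g))                                   # Tukey's honestly-significant-difference alternative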

Chi-Square Test of Independence: Pearson's chi-square test of independence can be used to test whether two variables are independent or dependent, and is often used with categorical data. The chi-square test can also be used to test how well a particular distribution fits a set of observed data, in which case it is referred to as Pearson's goodness-of-fit test. Pearson's chi-square test works by comparing the observed contingency table to what the table would be expected to look like if the null hypothesis were true and X and Y were independent. While the chi-square test is technically referred to as a non-parametric test, its assumptions and approach look more like those of a parametric test. If the null hypothesis is rejected, this test tells us nothing about the strength or direction of the association between X and Y, and we must use other measures of association to address this. In this statistics video, we learn how to use Pearson's chi-square test of independence to test whether two categorical variables are independent or dependent; while we show the formula and calculations, our focus is on the concepts of the chi-square test, not the calculations. To learn how to use R for the chi-square test, watch this video (Link Here)
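
A small R sketch using a made-up 2x2 contingency table (the exposure labels and counts are invented for illustration):

    tab <- matrix(c(30, 70,
                    15, 85),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(exposed = c("yes", "no"),
                                  disease = c("yes", "no")))

    test <- chisq.test(tab, correct = FALSE)  # Pearson's chi-square test of independence (no continuity correction)
    test                                      # X-squared statistic, degrees of freedom, p-value
    test$expected                             # counts expected if exposure and disease were independent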

Odds Ratio, Relative Risk, and Risk Difference: In this tutorial, we discuss measures of association between two categorical variables (or factors): the risk difference, the relative risk, and the odds ratio. The relative risk is also referred to as a risk ratio, rate ratio, relative rate, incidence rate ratio, or prevalence rate ratio; the risk difference is also referred to as the attributable risk. The odds ratio, relative risk, and risk difference are slightly different ways of describing the association between two categorical variables, and here we learn how to use each of them. The video does show the calculations for the odds ratio, relative risk, and risk difference, but the focus is on the concepts. To learn how to use R to calculate the relative risk, odds ratio, and risk difference, watch this video (Link Here)
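
Using the same style of invented 2x2 table (exposure by disease status), the three measures can be computed directly from the cell counts:

    tab <- matrix(c(40, 160,
                    20, 180),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(exposure = c("exposed", "unexposed"),
                                  disease  = c("yes", "no")))

    risk <- tab[, "yes"] / rowSums(tab)  # risk of disease within each exposure group
    odds <- risk / (1 - risk)            # odds of disease within each exposure group

    risk["exposed"] - risk["unexposed"]  # risk difference: 0.20 - 0.10 = 0.10
    risk["exposed"] / risk["unexposed"]  # relative risk:   0.20 / 0.10 = 2.0
    odds["exposed"] / odds["unexposed"]  # odds ratio:      0.25 / 0.111 = 2.25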

Case-Control Study and Odds Ratio: The odds ratio is a measure of association for a case-control study: it is the odds of getting the disease for someone who is exposed, relative to the odds of the disease for someone who is not exposed. While a case-control study does not allow one to estimate the incidence or prevalence of a disease, and hence we cannot estimate a relative risk or a risk difference, we can estimate an odds ratio. In this statistics video tutorial, we explain with an example exactly why we are able to estimate an odds ratio for a case-control study design. We also note that case-control studies are often used for studying rare diseases, and that in the case of a rare disease the odds ratio is approximately equal to the relative risk. This is an extremely nice result, as the relative risk is what we are most interested in. For a rare disease, we often must work with a case-control design, and with this design we cannot directly estimate the relative risk... but we can estimate the odds ratio, which happens to be approximately equal to the relative risk when the disease is rare. What a beautiful loop!
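
A quick numeric illustration of the rare-disease approximation, with cohort-style counts invented purely to make the point (not data from the video):

    # Hypothetical counts with a rare outcome:
    #               disease   no disease
    # exposed            20         9980
    # unexposed          10         9990
    risk.exp   <- 20 / (20 + 9980)  # 0.0020
    risk.unexp <- 10 / (10 + 9990)  # 0.0010

    RR <- risk.exp / risk.unexp      # relative risk = 2.000
    OR <- (20 / 9980) / (10 / 9990)  # odds ratio   ~ 2.002
    c(RR = RR, OR = OR)              # nearly identical because the disease is rare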

 
