5 Visual Statistical Analysis

Hands-On Exercise for Week 4

Author

KB

Published

May 7, 2023

(First Published: May 4, 2023)

5.1 Learning Outcome

We will learn to create visual graphics with rich statistical information, visualise model diagnostics, and model parameters.

5.2 Getting Started

5.2.1 Install and load the required r libraries

Install and load the the required R packages. The name and function of the new package that will be used for this exercise are as follow:

  • ggstatsplot: offers various types of statistical plots and functions for statistical tests such as t-tests, ANOVA, correlation tests, and regression analysis

  • performance: offers functions for computing model evaluation metrics for model evaluation and comparison

  • parameters: provides for managing complex experiments with many parameters and for automating parameter tuning in machine learning workflows

  • see: provides a host of functions and tools to produce a range of publication-ready visualizations for model parameters, predictions, and performance diagnostics

Show the code
pacman::p_load(tidyverse,readxl, ggstatsplot, performance, parameters, see)

5.3 Visual Statistical Analysis with ggstatsplot package

ggstatsplot is an extension of ggplot2 package for creating graphics enriched with statistical test details. It:

  • Provides alternative statistical inference methods by default.

  • Follows the best practices for statistical reporting. For all statistical tests reported in the plots, the default template abides by the APA gold standard for statistical reporting. For example, here are the results from a t-test:

5.3.1 Import the data

We will be using the same exam scores data-set that was featured in my Hands-On Exercise for Week 1.

Show the code
exam <- read_csv('data/Exam_data.csv', show_col_types = FALSE )

5.3.2 One-sample Mean Test

We use gghistostats() to to build an visual of one-sample test on English scores.

In the following case:

  • type = "bayes": specifies the type of statistical test to perform on the data to generate the interval estimate. In this case, it is a Bayesian analysis, which provides a posterior distribution of plausible values based on prior knowledge and observed data.

  • test.value = 60: specifies the value for the null hypothesis that will be used to calculate the probability of the observed data. In this case, it is 60, which assumes that the average English score is 60.

Show the code
set.seed(1234)

gghistostats(
  data = exam,
  x = ENGLISH,
  type = "bayes",
  test.value = 60,
  xlab = "English scores"
)

Unpacking the Bayes Factor

A Bayes factor is the ratio of the likelihood of one particular hypothesis to the likelihood of another. It can be interpreted as a measure of the strength of evidence in favor of one theory among two competing theories.

The Bayes factor gives us a way to evaluate the data in favor of a null hypothesis, and to use external information to do so. It tells us what the weight of the evidence is in favor of a given hypothesis.

When we are comparing two hypotheses, H1 (the alternate hypothesis) and H0 (the null hypothesis), the Bayes Factor is often written as B10. It can be defined mathematically as:

The Schwarz criterion is one of the easiest ways to calculate rough approximation of the Bayes Factor.

How to interpret Bayes Factor

A Bayes Factor can be any positive number. One of the most common interpretations is this one—first proposed by Harold Jeffereys (1961) and slightly modified by Lee and Wagenmakers in 2013:

5.3.3 Two-sample Mean Test

Next, we use ggbetweenstats() to build a visual for two-sample mean test of Maths scores by gender. This generates a combination of box and violin plots along with jittered data points for between-subjects designs with statistical details included in the plot as a subtitle.

In the following case:

  • type = "np": specifies the type of test to be used to compare the groups, in this case, a non-parametric test (Wilcoxon-Mann-Whitney test).

  • messages = FALSE: specifies whether or not to display informative messages about the statistical test being performed.

Show the code
ggbetweenstats(
  data = exam,
  x = GENDER, 
  y = MATHS,
  type = "np",
  messages = FALSE
)

5.3.4 One-way ANOVA Test

We can also use ggbetweenstats() to build a visual for One-way ANOVA test on English score by race.

In the following case:

  • type = "p": specifies the type of test to be used to compare the groups, in this case, a parametric test (one-way ANOVA).

  • mean.ci = TRUE: specifies whether or not to display confidence intervals for the group means.

  • pairwise.comparisons = TRUE: specifies whether or not to display pairwise comparisons between the groups.

  • pairwise.display = "s": specifies the format of the pairwise comparison display, in this case, “s” for compact letter display.

  • p.adjust.method = "fdr": specifies the method used for p-value adjustment for multiple comparisons, in this case, false discovery rate (FDR) correction.

Show the code
ggbetweenstats(
  data = exam,
  x = RACE, 
  y = ENGLISH,
  type = "p",
  mean.ci = TRUE, 
  pairwise.comparisons = TRUE, 
  pairwise.display = "s",
  p.adjust.method = "fdr",
  messages = FALSE
)

Summary of tests with ggbetweenstats

5.3.5 Significant Test of Correlation

We can use ggscatterstats() to build a visual for Significant Test of Correlation between Maths scores and English scores. This creates a scatter plot with overlaid regression lines between the variables “MATHS” and “ENGLISH” in the “exam” dataset.

In the following case:

  • marginal = TRUE: specifies whether or not to display marginal histograms or density plots along the axes of the scatter plot. In this case, marginal plots are not displayed.
Show the code
ggscatterstats(
  data = exam,
  x = MATHS,
  y = ENGLISH,
  marginal = TRUE,
  )

5.3.5 Significant Test of Association (Dependence)

First, we bin the Maths scores into a 4-class variable using cut() function.

Show the code
exam1 <- exam %>% 
  mutate(MATHS_bins = cut(MATHS, 
               breaks = c(0,60,75,85,100))
  )

We use ggbarstats() to build a visual for Significant Test of Association between the categorised Maths scores and gender.

Show the code
ggbarstats(exam1, 
           x = MATHS_bins, 
           y = GENDER)

5.4 Visualise Models

In this section, we will learn how to visualise model diagnostic and model parameters.

5.4.1 Import the data

Toyota Corolla case study will be used and the datat-set is imported. We will build a model to discover factors affecting prices of used-cars by taking into consideration a set of explanatory variables.

Show the code
car_resale <- read_xls("data/ToyotaCorolla.xls", 
                       "data")
car_resale
# A tibble: 1,436 × 38
      Id Model    Price Age_08_04 Mfg_Month Mfg_Year     KM Quarterly_Tax Weight
   <dbl> <chr>    <dbl>     <dbl>     <dbl>    <dbl>  <dbl>         <dbl>  <dbl>
 1    81 TOYOTA … 18950        25         8     2002  20019           100   1180
 2     1 TOYOTA … 13500        23        10     2002  46986           210   1165
 3     2 TOYOTA … 13750        23        10     2002  72937           210   1165
 4     3  TOYOTA… 13950        24         9     2002  41711           210   1165
 5     4 TOYOTA … 14950        26         7     2002  48000           210   1165
 6     5 TOYOTA … 13750        30         3     2002  38500           210   1170
 7     6 TOYOTA … 12950        32         1     2002  61000           210   1170
 8     7  TOYOTA… 16900        27         6     2002  94612           210   1245
 9     8 TOYOTA … 18600        30         3     2002  75889           210   1245
10    44 TOYOTA … 16950        27         6     2002 110404           234   1255
# ℹ 1,426 more rows
# ℹ 29 more variables: Guarantee_Period <dbl>, HP_Bin <chr>, CC_bin <chr>,
#   Doors <dbl>, Gears <dbl>, Cylinders <dbl>, Fuel_Type <chr>, Color <chr>,
#   Met_Color <dbl>, Automatic <dbl>, Mfr_Guarantee <dbl>,
#   BOVAG_Guarantee <dbl>, ABS <dbl>, Airbag_1 <dbl>, Airbag_2 <dbl>,
#   Airco <dbl>, Automatic_airco <dbl>, Boardcomputer <dbl>, CD_Player <dbl>,
#   Central_Lock <dbl>, Powered_Windows <dbl>, Power_Steering <dbl>, …

5.4.2 Create a Multi-variate Linear Regression Model

We use lm() function of R Base Stats to calibrate a multi-variate linear regression model.

Show the code
model <- lm(Price ~ Age_08_04 + Mfg_Year + KM + 
              Weight + Guarantee_Period, data = car_resale)
model

Call:
lm(formula = Price ~ Age_08_04 + Mfg_Year + KM + Weight + Guarantee_Period, 
    data = car_resale)

Coefficients:
     (Intercept)         Age_08_04          Mfg_Year                KM  
      -2.637e+06        -1.409e+01         1.315e+03        -2.323e-02  
          Weight  Guarantee_Period  
       1.903e+01         2.770e+01  

5.4.3 Model Diagnostic: Check for Multicolinearity

We use check_collinearity() of performance package to conduct the test.

Show the code
check_collinearity(model)
# Check for Multicollinearity

Low Correlation

             Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
               KM 1.46 [ 1.37,  1.57]         1.21      0.68     [0.64, 0.73]
           Weight 1.41 [ 1.32,  1.51]         1.19      0.71     [0.66, 0.76]
 Guarantee_Period 1.04 [ 1.01,  1.17]         1.02      0.97     [0.86, 0.99]

High Correlation

      Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 Age_08_04 31.07 [28.08, 34.38]         5.57      0.03     [0.03, 0.04]
  Mfg_Year 31.16 [28.16, 34.48]         5.58      0.03     [0.03, 0.04]
Show the code
plot(check_collinearity(model))

5.4.4 Model Diagnostic: Check Normality Assumption

There is the check_normality() function to conduct the test.

Show the code
model1 <- lm(Price ~ Age_08_04 + KM + 
              Weight + Guarantee_Period, data = car_resale)

check_n <- check_normality(model1)

plot(check_n)

5.4.5 Model Diagnostic: Check model for homogeneity of variances

We can use the check_heteroscedasticity() to do this check.

Show the code
check_h <- check_heteroscedasticity(model1)

plot(check_h)

5.4.6 Model Diagnostic: Complete Check

Thank goodness! There’s also a check_model()function to conduct the diagnostic tests discussed above.

Show the code
check_model(model1)

5.4.7 Visualise Regression Parameters

We can use plot() function of see package and parameters() of parameters package to visualise the parameters of a regression model.

Show the code
plot(parameters(model1))

Alternatively, we can use ggcoefstats() of ggstatsplot package to visualise the parameters of a regression model.

Show the code
ggcoefstats(model1, 
            output = "plot")

\(**That's\) \(all\) \(folks!**\)