2.5 Numerical Summaries for Bivariate Data

2.5.1 Covariance

Previously, we used the var() function to calculate the sample variance of a single variable. A closely related statistic for a pair of variables is the sample covariance, calculated by the cov() function.

cov(weight, age)
## [1] 26.63442

The covariance is positive when the two variables tend to be above their means at the same time and below their means at the same time; in other words, they tend to increase together and decrease together. A negative covariance means that when one variable is above its mean, the other tends to be below its mean. Unfortunately, we cannot judge whether a covariance is “large” without knowing how much the two variables vary individually, because its size depends on the units and spread of each variable. The next statistic addresses this.
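To see why the raw size of a covariance is hard to interpret, here is a minimal sketch using made-up data. The vectors height_m and mass_kg are hypothetical and are not the weight and age data used in this chapter. Changing the units of one variable changes the covariance by the same factor, even though the relationship between the variables is unchanged.

```r
# Hypothetical toy data (NOT the weight/age data used in this chapter)
height_m <- c(1.50, 1.62, 1.70, 1.78, 1.85)   # heights in metres
mass_kg  <- c(52, 58, 66, 74, 83)             # masses in kilograms

cov_m  <- cov(height_m, mass_kg)         # covariance with height in metres
cov_cm <- cov(height_m * 100, mass_kg)   # same data, height in centimetres

# The ratio is 100 (up to floating-point rounding): the covariance grew
# only because the units changed, not because the relationship did
cov_cm / cov_m
```

So a covariance of 26.6 could reflect a strong or a weak relationship, depending on the scales involved.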

Below, we break the covariance down into the contribution made by each point. The black lines mark the mean of each variable and divide the plot into four quadrants around the centroid (the point whose coordinates are the two means). Notice that the points in the first and third quadrants (upper right and lower left) have positive contributions, while the points in the second and fourth quadrants have negative contributions. Also note that the points farthest into the corners, far from the means of both variables, have the largest contributions. (Don’t worry about the large block of code. We will not expect you to plot things like this; it is simply an illustration of what R is capable of.)

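# Contribution of each point: the product of its deviations from the two means
covparts <- (age - mean(age)) * (weight - mean(weight))
# Blue = positive contribution, red = negative; point size grows with |contribution|
plot(age, weight,
     main = "Scatterplot with point size=covariance contribution",
     col = ifelse(covparts > 0, rgb(0, 0, 1, 0.6), rgb(1, 0, 0, 0.6)),
     pch = 16, cex = abs(covparts)^0.33,
     xlab = "Age", ylab = "Weight")
# Reference lines at the means split the plot into four quadrants
abline(h = mean(weight), col = "black")
abline(v = mean(age), col = "black")
legend("bottomright",
       legend = c("Positive", "Negative"),
       pch = 16,
       col = c(rgb(0, 0, 1, 0.6), rgb(1, 0, 0, 0.6)))
# Label a few selected points with their (rounded) contributions
labs <- c(4, 16, 18, 22)
text(age[labs], weight[labs],
     round(covparts[labs], 1),
     adj = c(0.5, 0.25),
     cex = abs(covparts[labs])^0.1 / 1.5)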
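The per-point contributions are tied to the covariance itself: adding them up and dividing by n - 1 gives the value that cov() reports. The sketch below checks this on a small made-up data set. The names x_demo, y_demo, and parts are hypothetical and chosen so that they do not overwrite the chapter's own variables.

```r
# Hypothetical toy data (NOT the weight/age data used in this chapter)
x_demo <- c(2, 4, 5, 7, 9)
y_demo <- c(10, 14, 13, 20, 24)

# Per-point contributions, built the same way as covparts above
parts <- (x_demo - mean(x_demo)) * (y_demo - mean(y_demo))
n <- length(x_demo)

# Sample covariance by hand: sum of contributions divided by n - 1
cov_by_hand <- sum(parts) / (n - 1)

# Compare with the built-in function; the two values should agree
cov_by_hand
cov(x_demo, y_demo)
```

The same check should work on the chapter's data by comparing sum(covparts) / (length(age) - 1) with cov(weight, age).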

2.5.2 Correlation

We can calculate the sample correlation using the cor() function.

cor(weight, age)
## [1] 0.908391

Correlation scales the covariance by the standard deviations of the two variables. This gives a unit-free measure of how strongly the two variables vary together, relative to how much each one varies on its own.

cov(weight, age) / (sd(weight) * sd(age))
## [1] 0.908391

The correlation is always between -1 and 1. A value of 1 means the points fall exactly on an upward-sloping line, and a value of -1 means they fall exactly on a downward-sloping line. Values near 0 indicate little or no linear relationship.
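These properties can be checked with a short sketch on made-up data. The vectors x_demo and y_demo are hypothetical and are not part of the chapter's data set. The first two calls build exact linear relationships, and the last part shows that, unlike the covariance, the correlation is unaffected by changing the units or shifting a variable.

```r
# Hypothetical toy data (NOT the weight/age data used in this chapter)
x_demo <- c(1, 3, 4, 7, 10)
y_demo <- c(2, 3, 7, 8, 15)

# Exact linear relationships: correlation is 1 or -1 (up to rounding)
r_up   <- cor(x_demo, 2 * x_demo + 3)      # upward-sloping line
r_down <- cor(x_demo, -0.5 * x_demo + 8)   # downward-sloping line
r_up
r_down

# Rescaling or shifting the variables leaves the correlation unchanged
r_original <- cor(x_demo, y_demo)
r_rescaled <- cor(x_demo * 1000, y_demo + 50)
all.equal(r_original, r_rescaled)
```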