In many cases, we’re interested in whether one variable is correlated with another: for example, do drivers who finish well in the Daytona 500 go on to have a good year? In other words, could you predict whether a driver would have a good season based on his or her finish at Daytona?
I’m explaining correlations in terms of this question, but the general ideas are the same for any pair of variables you want to compare.
Types of Correlations
If Daytona were perfectly predictive, then the driver who finished first at Daytona would win the championship, the driver who finished second would finish the season in second place, and so on. For a perfect correlation, a plot of each driver’s season-ending rank vs. race finish would put every point on a straight diagonal line, like this:
If we actually saw this in the real data, the only interpretation would be that NASCAR is fixed. (And we know it’s not, so we expect some scatter.) If there were a strong correlation, we’d more likely see something like this:
There’s a clear overall trend, even though there’s some scatter in the data. The more scatter, the less correlation and the less predictive ability.
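To put a number on "more scatter, less correlation," here is a rough Python sketch using simulated data, not real NASCAR results. It ranks 40 hypothetical drivers by their season position plus some random noise, then computes the Spearman rank correlation between that race finish and the season rank. The field size, noise levels, random seed, and function names are all assumptions made for illustration.

```python
import random

def spearman_rho(race_finish, season_rank):
    """Spearman rank correlation for two rankings with no ties."""
    n = len(race_finish)
    d_squared = sum((r - s) ** 2 for r, s in zip(race_finish, season_rank))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def noisy_ranking(n, noise, rng):
    """Rank drivers 1..n by (season position + Gaussian noise)."""
    scores = [i + rng.gauss(0, noise) for i in range(1, n + 1)]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0] * n
    for position, driver in enumerate(order, start=1):
        ranks[driver] = position
    return ranks

rng = random.Random(500)          # arbitrary seed, only for repeatability
n_drivers = 40                    # assumed field size
season_rank = list(range(1, n_drivers + 1))

# Larger noise means more scatter around the diagonal line.
for noise in (0, 3, 10, 1000):
    race_finish = noisy_ranking(n_drivers, noise, rng)
    rho = spearman_rho(race_finish, season_rank)
    print(f"noise={noise}: rho = {rho:.2f}")
```

With zero noise the race finish matches the season rank exactly and rho is 1. As the noise grows, rho should drift toward 0, which is the numerical version of the points spreading away from the line.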
One more practice run. Here’s the same data as above, but I’ve introduced some luck — good and bad. In this simulation, some of the top season finishers wrecked out of the race and a few of the backmarkers got lucky.
The overall strong trend is still there, but a couple of data points (the ones highlighted) are way off the line. The data points below and to the right of the line are drivers who finished the race worse than their season-ending rank would predict. For example, the season champion gets caught up in an accident early and finishes dead last. The data points to the left and above the line are drivers who overperformed in the race.
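One simple way to pick out those off-the-line points is to look at the gap between race finish and season-ending rank. The sketch below uses made-up drivers and an arbitrary threshold of 10 positions; the names, numbers, and the `flag_outliers` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical data: (driver label, race finish, season-ending rank)
results = [
    ("Driver A", 40, 1),   # season champion who wrecked early
    ("Driver B", 2, 3),
    ("Driver C", 5, 4),
    ("Driver D", 3, 35),   # backmarker who got lucky
    ("Driver E", 12, 10),
]

def flag_outliers(results, threshold):
    """Return drivers whose race finish is far from their season rank."""
    flagged = []
    for name, race_finish, season_rank in results:
        gap = race_finish - season_rank  # positive: worse race than season
        if abs(gap) >= threshold:
            label = "underperformed" if gap > 0 else "overperformed"
            flagged.append((name, gap, label))
    return flagged

for name, gap, label in flag_outliers(results, threshold=10):
    print(f"{name}: {label} in the race by {abs(gap)} positions")
```

A positive gap corresponds to a point below and to the right of the line, and a negative gap to a point above and to the left.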
And then you might get a graph that’s pretty much random, like this:
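This tells you that there’s essentially no correlation: knowing where a driver finished the race tells you nothing useful about where he or she will finish the season.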