Aug 11, 2020 • Written by Rene-Jean Corneille
import pandas as pd
import seaborn as sns
import seaborn_qqplot as sq
import matplotlib.pyplot as plt
from scipy.stats import gamma, norm
Probability plots are a convenient way of comparing sample distributions graphically. They are not as popular as other methods of data visualisation, though perhaps they should be. In the context of data analysis, using them correctly can at least help the analyst discover some interesting facts about their dataset quite easily.
I have had difficulty finding an exhaustive resource on probability plots, and there often seems to be confusion between the different types of probability plots.
Probability plots can be either univariate or bivariate. They represent, in a 2D plane, scatter plots of a random variable's cumulative distribution function or of its inverse.
The simple probability plot of a random variable $X$ is the plot $(x_i, F_X(x_i))_{1 \le i \le n}$, where $(x_i)_{1 \le i \le n}$ is an independent and identically distributed (i.i.d.) sample from the same probability distribution as $X$ and $F_X$ is the cumulative distribution function of $X$.
The simple quantile plot of a random variable $X$ is the plot $(p_i, F_X^{-1}(p_i))_{1 \le i \le n}$, where $(p_i)_{1 \le i \le n}$ is a partition of the interval $[0,1]$ and $F_X^{-1}$ is the generalized inverse of the cumulative distribution function of $X$, also known as the quantile function.
The probability-probability plot of two random variables $X$ and $Y$ is the plot $(F_X(x_i), F_Y(y_i))_{1 \le i \le n}$, where $(x_i)_{1 \le i \le n}$ and $(y_i)_{1 \le i \le n}$ are i.i.d. samples drawn respectively from the same probability distribution as $X$ and the same probability distribution as $Y$, and $F_X$ and $F_Y$ are respectively the cumulative distribution functions of $X$ and $Y$.
The quantile-quantile plot of two random variables $X$ and $Y$ is the plot $(F_X^{-1}(p_i), F_Y^{-1}(p_i))_{1 \le i \le n}$, where $(p_i)_{1 \le i \le n}$ is a partition of the interval $[0,1]$, and $F_X^{-1}$ and $F_Y^{-1}$ are respectively the quantile functions of $X$ and $Y$.
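To make two of these definitions concrete, here is a minimal sketch that computes the coordinates of a simple probability plot and of a quantile-quantile plot when the distributions are known. The choice of a gamma distribution with shape 2 and a standard normal, the grid of probabilities, and the variable names are all illustrative assumptions of mine, not part of any library.

```python
import numpy as np
from scipy.stats import gamma, norm

# Partition of the open interval (0, 1). The endpoints are left out because
# the normal quantile function is infinite at 0 and 1.
p = np.linspace(0.01, 0.99, 99)

# Quantile-quantile coordinates: quantile function of X against quantile
# function of Y on the same probabilities (X ~ Gamma(shape=2), Y ~ N(0, 1)).
qq_x = gamma.ppf(p, a=2)
qq_y = norm.ppf(p)

# Simple probability plot coordinates: each sampled value against the
# c.d.f. evaluated at that value. Sorting only makes the points ordered.
rng = np.random.default_rng(0)
x = np.sort(gamma.rvs(a=2, size=50, random_state=rng))
prob_plot = np.column_stack([x, gamma.cdf(x, a=2)])
```

Plotting `qq_x` against `qq_y` gives a curve that bends away from a straight line, because the two distributions do not belong to the same location-scale family.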
The simple probability plot is what is usually referred to as the probability plot, but I have also seen quantile-quantile plots referred to as probability plots (I assume because they are the most popular of the four above). The probability-probability plot is less well known, and of all these, the only one that I studied thoroughly at university is the quantile-quantile plot.
In practice, the distributions of the variables are unknown. The only things we have at our disposal are the samples $(x_i)$ and $(y_i)$. Let's assume that both samples are of size $n$. Since the distribution is unknown, we estimate the c.d.f. using the empirical c.d.f. $\hat{F}_n$:
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{x_i \le x\}}$$
It can be proven that $\hat{F}_n(x)$ converges to $F_X(x)$ almost surely when $n \to \infty$. The empirical quantile function is then defined as the generalized inverse of the empirical c.d.f.:
$$\hat{F}_n^{-1}(p) = \inf \{ x : \hat{F}_n(x) \ge p \}$$
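The empirical c.d.f. and its generalized inverse can be sketched in a few lines of NumPy. The function names `ecdf` and `empirical_quantile` are my own for this illustration (they are not taken from seaborn-qqplot), and the small tolerance used before rounding up is an assumption to guard against floating-point noise.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical c.d.f.: fraction of sample points less than or equal to x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # With side="right", searchsorted counts the sample points <= x.
    return np.searchsorted(sample, x, side="right") / sample.size

def empirical_quantile(sample, p):
    """Generalized inverse of the empirical c.d.f.: smallest x with ecdf(x) >= p."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size
    # The smallest x whose count of points <= x reaches n * p is the
    # ceil(n * p)-th order statistic. The 1e-12 tolerance avoids rounding
    # up when n * p is an integer affected by floating-point error.
    k = np.ceil(np.asarray(p) * n - 1e-12).astype(int)
    # For p = 0 the infimum is minus infinity; clipping returns the sample
    # minimum instead, which is a convention chosen for this sketch.
    k = np.clip(k, 1, n)
    return sample[k - 1]

sample = [3.0, 1.0, 2.0, 4.0]
half = ecdf(sample, 2.0)                    # two of four points are <= 2
median = empirical_quantile(sample, 0.5)    # smallest x with ecdf(x) >= 0.5
```

Note that `np.quantile` interpolates linearly between order statistics by default, so it does not in general return the generalized inverse defined here.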
The normal distribution holds a central (pun intended) role in modern statistics thanks to the Central Limit Theorem: the centered and reduced mean of an i.i.d. sample with finite variance has, if the sample is large enough, a distribution that is close to the normal distribution. The "bivariate" probability plots can be used to compare an empirical probability distribution against the normal distribution. This is what makes the qq-plot so popular compared to its counterparts: it makes it possible to visualize the behaviour of a random variable in the tails of its distribution compared to the normal distribution (which is some sort of equilibrium).
The simple quantile plot is not really a thing: since the simple probability plot exists, I simply extended the definition to quantiles. A distribution that can easily be used to produce examples of fat-tailed distributions is the gamma distribution. The excess kurtosis of a gamma distribution with shape parameter $k$ and scale parameter 1 is $6/k$.
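As a quick sanity check of the fat-tail claim, the excess kurtosis of the gamma distribution can be read directly from SciPy and compared with that of the normal distribution, which is zero. The helper name `gamma_excess_kurtosis` is only a label for this sketch.

```python
from scipy.stats import gamma, norm

def gamma_excess_kurtosis(shape):
    """Excess kurtosis of a gamma distribution with the given shape parameter."""
    # moments="k" asks SciPy for the (Fisher, i.e. excess) kurtosis only.
    return float(gamma.stats(shape, moments="k"))

normal_excess_kurtosis = float(norm.stats(moments="k"))

# The smaller the shape parameter, the heavier the tail relative to the normal.
for shape in (0.5, 1.0, 2.0, 10.0):
    print(shape, gamma_excess_kurtosis(shape), normal_excess_kurtosis)
```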
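# Q-Q plot of 100 gamma samples (shape 1, loc 1, scale 3) against the normal distribution
sq.pplot(
    data=pd.DataFrame(data={"gamma_samples": gamma.rvs(1, loc=1, scale=3, size=100)}),
    x="gamma_samples",
    y=norm,
    kind='qq',
    height=4,
    aspect=2,
    display_kws={"identity": True}  # draw the identity line as a visual reference
)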
plt.show()
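# Q-Q plot of 100 standard normal samples against the normal distribution
sq.pplot(
    data=pd.DataFrame(data={"gaussian_samples": norm.rvs(loc=0, scale=1, size=100)}),
    x="gaussian_samples",
    y=norm,
    kind='qq',
    height=4,
    aspect=2,
    display_kws={"identity": True}  # draw the identity line as a visual reference
)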
plt.show()
Probability plots seem underutilized in the context of data analysis (I may be wrong). Compared to a purely statistical approach, in data analysis there is usually a target variable that can be used as "hue". This can help to derive insights by comparison.
I wrote a micro library named seaborn-qqplot, which is a probability plot add-on to seaborn. For this section I use the iris dataset as an example.
iris = sns.load_dataset('iris')
iris
| sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
Here are some cases for which I find probability plots useful:
sq.pplot(iris, x="petal_length", y=gamma, hue="species", kind='qq', height=4, aspect=2)
plt.show()
We notice a visible change in the underlying distribution of the petal length conditional on the species. Hence the petal length measurement is a good feature for telling the species apart (and it can be measured).
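What a Q-Q plot with a hue draws can be sketched by hand: one set of sample quantiles per class, each paired with the same reference quantiles. The sketch below uses a small synthetic DataFrame rather than iris, and the column names `feature` and `target`, the class labels and the normal reference are all hypothetical choices made for this illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Synthetic data: one numeric feature whose distribution depends on the class.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": np.concatenate([rng.normal(1.5, 0.2, 50), rng.normal(5.0, 0.5, 50)]),
    "target": ["a"] * 50 + ["b"] * 50,
})

# Common grid of probabilities and the matching reference quantiles.
p = np.linspace(0.05, 0.95, 19)
reference = norm.ppf(p)

# One curve per class: sample quantiles of the feature within that class.
# np.quantile interpolates linearly, which is fine for a visual comparison.
per_class = {
    name: np.quantile(group["feature"], p)
    for name, group in df.groupby("target")
}
```

Plotting `reference` against each entry of `per_class` gives one curve per class; when the curves sit far apart, as they do here, the feature separates the classes well.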
sq.pplot(iris, x="sepal_length", y=gamma, hue="species", kind='qq', height=4, aspect=2)
plt.show()
Outliers can be spotted easily using qqplots: they usually stand out at either edge of the plot (as an outlier is, for example, a minimum or a maximum of the sample).
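The same idea can be checked numerically. The sketch below uses `scipy.stats.probplot` (not seaborn-qqplot) on a synthetic normal sample with a single injected outlier, and looks at which point lies furthest from the fitted line. Measuring the distance to the least-squares line is a choice made for this illustration, not an established outlier test.

```python
import numpy as np
from scipy.stats import probplot

# 99 standard normal draws plus one injected outlier at 10.
rng = np.random.default_rng(0)
sample = np.append(rng.normal(size=99), 10.0)

# probplot returns the theoretical quantiles, the ordered sample values,
# and the least-squares line fitted through the Q-Q points.
(theoretical, ordered), (slope, intercept, r) = probplot(sample, dist="norm")

# Vertical distance of each Q-Q point from the fitted line.
residuals = ordered - (slope * theoretical + intercept)
worst = int(np.argmax(np.abs(residuals)))

print(worst, ordered[worst])
```

The point with the largest residual is the last one on the plot, that is, the sample maximum, which is the injected outlier.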
Probability plots are purely visual tools. From a statistics point of view, their usefulness is quite restricted (comparing distributions), but in the context of data analysis they can reveal interesting patterns, especially when a given variable is conditioned on the value of the target (which the data scientist ultimately wants to predict).
I am thinking about writing a more extensive post about outlier detection but I think that requires a bit more experimenting on a more complex dataset.