Probability plots for data analysis

Aug 11, 2020 • Written by Rene-Jean Corneille

import pandas as pd
import seaborn as sns
import seaborn_qqplot as sq
import matplotlib.pyplot as plt
from scipy.stats import gamma, norm

Probability plots are a convenient way of comparing sample distributions graphically. They are not as popular as other methods of data visualisation, though perhaps they should be. In the context of data analysis, using them correctly can help the analyst discover interesting facts about their dataset quite easily.

I have had difficulty finding an exhaustive resource on probability plots, and there often seems to be some confusion between the different types of probability plots.

Probability Plots

Probability plots can be either univariate or bivariate. They are scatter plots, in the 2D plane, of a random variable's cumulative distribution function or its inverse.

  • the simple probability plot of a random variable $X$ is the plot $(x_i, F_X(x_i))$, where the $x_i$ are an independent and identically distributed (i.i.d.) sample from the probability distribution of $X$ and $F_X$ is the cumulative distribution function of $X$.

  • the simple quantile plot of a random variable $X$ is the plot $(\alpha_i, Q_X(\alpha_i))$, where the $\alpha_i$ form a partition of the interval $[0,1]$ and $Q_X$ is the generalized inverse of the cumulative distribution function of $X$, also known as the quantile function.

  • the probability-probability plot of two random variables $(X,Y)$ is the plot $(F_X(x_i), F_Y(y_i))$, where the $x_i$ and the $y_i$ are i.i.d. samples from the probability distributions of $X$ and $Y$ respectively. $F_X$ and $F_Y$ are the cumulative distribution functions of $X$ and $Y$ respectively.

  • the quantile-quantile plot of two random variables $(X,Y)$ is the plot $(Q_X(\alpha_i), Q_Y(\alpha_i))$, where the $\alpha_i$ form a partition of the interval $[0,1]$. $Q_X$ and $Q_Y$ are the quantile functions of $X$ and $Y$ respectively.
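
The four definitions above can be sketched numerically when the distributions are known. This is only an illustration: the two distributions (a standard normal and a gamma with shape 2), the sample size, and the variable names are my own assumptions, and the levels $\alpha_i$ are taken strictly inside $(0,1)$ so that every quantile is finite.

```python
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(0)

# Two known distributions, chosen only for illustration.
dist_x = norm(loc=0, scale=1)
dist_y = gamma(a=2)

# i.i.d. samples of the same size from each distribution.
n = 200
x = dist_x.rvs(size=n, random_state=rng)
y = dist_y.rvs(size=n, random_state=rng)

# Probability levels strictly inside (0, 1), so the quantiles stay finite.
alpha = (np.arange(1, n + 1) - 0.5) / n

# Simple probability plot: (x_i, F_X(x_i)), sorted for a cleaner curve.
x_sorted = np.sort(x)
simple_pp = np.column_stack([x_sorted, dist_x.cdf(x_sorted)])

# Simple quantile plot: (alpha_i, Q_X(alpha_i)).
simple_qq = np.column_stack([alpha, dist_x.ppf(alpha)])

# Probability-probability plot: (F_X(x_i), F_Y(y_i)).
pp = np.column_stack([dist_x.cdf(x), dist_y.cdf(y)])

# Quantile-quantile plot: (Q_X(alpha_i), Q_Y(alpha_i)).
qq = np.column_stack([dist_x.ppf(alpha), dist_y.ppf(alpha)])
```

Each array has one row per plotted point, so any of them can be passed to a scatter plot as `plt.scatter(points[:, 0], points[:, 1])`.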

The simple probability plot is what is usually referred to as the probability plot, but I have also seen quantile-quantile plots referred to as probability plots (I assume because they are the most popular of the four above). The probability-probability plot is less well known. Of all of these, the only one that I studied thoroughly at university is the quantile-quantile plot.

In practice the distributions of the variables are unknown. All we have at our disposal are the samples $x_i$ and $y_i$. Let's assume that both samples are of size $N$. Since the distribution is unknown, we estimate the c.d.f. using the empirical c.d.f.:

$$F_X^N(x) = \dfrac{1}{N} \sum_{i \leq N} \mathbf{1}_{\{ x_i \leq x \}} \in [0,1], \text{ for } x \in \mathbb{R}$$

It can be proven that $F_X^N$ converges to $F_X$ almost surely when $N \rightarrow \infty$. The empirical quantile function is then defined as the generalized inverse of the empirical c.d.f.:

$$Q_X^N(\alpha) = \min \{ x \in \mathbb{R} : \alpha \leq F_X^N(x) \} \in \mathbb{R}, \text{ for } \alpha \in (0,1]$$
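
The two estimators can be written in a few lines of NumPy. This is a minimal sketch under my own naming (`empirical_cdf` and `empirical_quantile` are hypothetical helpers, not part of any library). The quantile is looked up among the sorted sample points because the empirical c.d.f. only jumps there.

```python
import numpy as np

def empirical_cdf(sample, x):
    """F_X^N(x): fraction of sample points that are <= x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts every sample point that is <= x.
    return np.searchsorted(sample, x, side="right") / sample.size

def empirical_quantile(sample, alpha):
    """Q_X^N(alpha): smallest sample point x with F_X^N(x) >= alpha."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size
    # The empirical c.d.f. equals k/n at the k-th smallest sample point.
    levels = np.arange(1, n + 1) / n
    # First index whose level is >= alpha (clipped so that alpha > 1 maps to the maximum).
    idx = np.searchsorted(levels, alpha, side="left")
    return sample[np.minimum(idx, n - 1)]

sample = [3.0, 1.0, 4.0, 1.0, 5.0]
print(empirical_cdf(sample, 1.0))       # 0.4
print(empirical_quantile(sample, 0.5))  # 3.0
```

In the example, two of the five points are at most 1.0, and 3.0 is the smallest point at which the empirical c.d.f. (0.6) reaches 0.5.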

Normal Probability Plots

The normal distribution holds a central (pun intended) role in modern statistics thanks to the Central Limit Theorem: the centered and scaled mean of an i.i.d. sample with finite variance has, if the sample is large enough, a distribution that is close to the normal distribution. The "bivariate" probability plots can be used to compare an empirical probability distribution against the normal distribution. This is what makes the QQ plot so popular compared to its counterparts. It lets us visualize the behaviour of a random variable in the tails of its distribution compared to the normal distribution (which is some sort of equilibrium).
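
A small simulation can make the Central Limit Theorem statement concrete. The sketch below is an assumption-laden illustration (gamma draws with shape 1, samples of size 500, 2000 repetitions, all chosen arbitrarily): it standardizes the sample means and compares a few of their empirical quantiles with the normal ones.

```python
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(0)

shape = 1.0             # gamma shape; with scale 1, mean = variance = shape
n, repeats = 500, 2000  # sample size and number of simulated samples

# Each row is one i.i.d. gamma sample of size n.
samples = gamma.rvs(a=shape, size=(repeats, n), random_state=rng)

# Center and scale the sample means: sqrt(n) * (mean - mu) / sigma.
z = np.sqrt(n) * (samples.mean(axis=1) - shape) / np.sqrt(shape)

# Compare a few empirical quantiles of z with the standard normal quantiles.
alpha = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
empirical = np.quantile(z, alpha)
theoretical = norm.ppf(alpha)
max_gap = np.max(np.abs(empirical - theoretical))
print(np.column_stack([alpha, empirical, theoretical]))
```

Plotting `empirical` against `theoretical` for a finer grid of levels gives exactly a normal QQ plot of the standardized means.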

The simple quantile plot is not really a thing. Since the simple probability plot exists, I simply extended the definition to quantiles. The gamma distribution is a convenient way to produce examples of fat-tailed distributions: the excess kurtosis of a gamma distribution with shape parameter $k$ and scale parameter 1 is $\dfrac{6}{k}$.
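
The kurtosis formula can be checked against SciPy, whose `stats` method reports the Fisher (excess) kurtosis when called with `moments="k"`. The shape values below are arbitrary choices for illustration.

```python
from scipy.stats import gamma, norm

# Excess kurtosis reported by SciPy for a few shape parameters k.
excess = {}
for k in (1, 3, 6):
    # moments="k" selects the Fisher (excess) kurtosis.
    excess[k] = float(gamma.stats(a=k, moments="k"))
    print(k, excess[k], 6 / k)

# The normal distribution is the reference point: its excess kurtosis is 0.
normal_excess = float(norm.stats(moments="k"))
```

The smaller the shape parameter, the larger the excess kurtosis, so a small shape gives tails that are visibly heavier than those of the normal distribution.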

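# QQ plot of 100 gamma samples (shape 1, loc 1, scale 3) against the normal distribution
sq.pplot(
    data=pd.DataFrame(data={"gamma_samples": gamma.rvs(a=1, loc=1, scale=3, size=100)}),
    x="gamma_samples",
    y=norm,
    kind='qq',
    height=4,
    aspect=2,
    display_kws={"identity": True}  # draw the identity line for reference
)
plt.show()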
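# QQ plot of 100 standard normal samples against the normal distribution
sq.pplot(
    data=pd.DataFrame(data={"gaussian_samples": norm.rvs(loc=0, scale=1, size=100)}),
    x="gaussian_samples",
    y=norm,
    kind='qq',
    height=4,
    aspect=2,
    display_kws={"identity": True}  # draw the identity line for reference
)
plt.show()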

Application to Data Analysis

Probability plots seem underutilized in data analysis (I may be wrong). Compared to a purely statistical approach, in data analysis there is usually a target variable that can be used as the "hue". This helps to derive insights by comparison.

I wrote a micro library named seaborn-qqplot, which is a probability plot add-on to seaborn. For this section I use the iris dataset as an example.

iris = sns.load_dataset('iris')
iris
     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

150 rows × 5 columns

Here are some cases for which I find probability plots useful:

  • visually assess a feature's importance:
sq.pplot(iris, x="petal_length", y=gamma, hue="species", kind='qq', height=4, aspect=2)
plt.show()

We notice a visible change in the underlying distribution of the petal length conditional on the species. Hence the petal length measurement is a good feature (as it can be measured).
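
The same comparison can be made numerically by computing empirical quantiles per species. The sketch below uses only the ten iris rows displayed earlier in this post, so it is a toy illustration rather than an analysis of the full dataset. The quantile levels and the `separated` check are my own assumptions.

```python
import pandas as pd

# The ten iris rows displayed earlier in this post (petal_length and species only).
rows = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1],
    "species": ["setosa"] * 5 + ["virginica"] * 5,
})

# Empirical quantiles of the feature, conditioned on the target.
levels = [0.25, 0.5, 0.75]
per_species = rows.groupby("species")["petal_length"].quantile(levels).unstack()
print(per_species)

# If the quantile ranges do not overlap, the feature separates the two classes well.
separated = per_species.loc["setosa"].max() < per_species.loc["virginica"].min()
print(separated)
```

On the full dataset, the same `groupby` call on `iris` would give one row of quantiles per species, which is what the coloured curves in the QQ plot encode.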

  • visually detect outliers:
sq.pplot(iris, x="sepal_length", y=gamma, hue="species", kind='qq', height=4, aspect=2)
plt.show()

Outliers can be spotted easily using QQ plots: they usually stand out at either edge of the plot (as an outlier is, for example, a minimum or a maximum of the sample).
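
One possible way to turn this visual check into numbers is to measure how far each point of a normal QQ plot lies from a straight line fitted through the plot. The sketch below relies on `scipy.stats.probplot` and on synthetic data with one planted outlier. The data, the planted value, and the residual-based criterion are assumptions for illustration, not a method used elsewhere in this post.

```python
import numpy as np
from scipy.stats import norm, probplot

rng = np.random.default_rng(0)

# Synthetic measurements plus one planted outlier (8 standard deviations above the mean).
sample = norm.rvs(loc=5.0, scale=0.5, size=50, random_state=rng)
sample = np.append(sample, 9.0)

# probplot returns the theoretical quantiles, the ordered sample values,
# and a least-squares line fitted through the QQ plot.
(theoretical, ordered), (slope, intercept, r) = probplot(sample, dist="norm")

# Vertical distance of each point from the fitted line.
residuals = ordered - (slope * theoretical + intercept)
worst = int(np.argmax(np.abs(residuals)))
print(worst, ordered[worst])
```

The point with the largest residual sits at the right edge of the plot, which matches what the eye picks up on the QQ plot.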

Conclusion

Probability plots are purely visual tools. From a statistical point of view, their usefulness is quite restricted (comparing distributions), but in the context of data analysis they can reveal interesting patterns, especially when a given variable is conditioned on the value of the target (which the data scientist ultimately wants to predict).

I am thinking about writing a more extensive post about outlier detection but I think that requires a bit more experimenting on a more complex dataset.

