import pandas as pd
import seaborn as sns
import seaborn_qqplot as sq
import matplotlib.pyplot as plt
from scipy.stats import gamma, norm

Probability plots are a convenient way of comparing sample distributions graphically. They are not as popular as other methods of data visualisation, though maybe they should be. In the context of data analysis, using them correctly can at least help the data analyst discover some interesting facts about their dataset quite easily.

I have had difficulties finding an exhaustive resource on probability plots. There also often seems to be confusion between the types of probability plots.

Probability Plots

Probability plots can be either univariate or bivariate. They represent, in a 2D plane, scatter plots built from the cumulative distribution function of a random variable or from its inverse.

The simple probability plot of a random variable $X$ is the plot $(x_i, F_X(x_i))$ where the $x_i$ are an independent and identically distributed sample from the probability distribution of $X$ and $F_X$ is the cumulative distribution function of $X$.
The simple quantile plot of a random variable $X$ is the plot $(\alpha_i, Q_X(\alpha_i))$ where the $\alpha_i$ form a partition of the interval $[0,1]$ and $Q_X$ is the generalized inverse of the cumulative distribution function of $X$, also known as the quantile function.
The probability-probability plot of two random variables $(X,Y)$ is the plot $(F_X(x_i), F_Y(y_i))$ where the $x_i$ and the $y_i$ are independent and identically distributed samples from the probability distributions of $X$ and $Y$ respectively. $F_X$ and $F_Y$ are respectively the cumulative distribution functions of $X$ and $Y$.
The quantile-quantile plot of two random variables $(X,Y)$ is the plot $(Q_X(\alpha_i), Q_Y(\alpha_i))$ where the $\alpha_i$ form a partition of the interval $[0,1]$. $Q_X$ and $Q_Y$ are respectively the quantile functions of $X$ and $Y$.

The simple probability plot is what is usually referred to as the probability plot, but I have also seen quantile-quantile plots referred to as probability plots (I assume because they are the most popular among the four above). The probability-probability plot is lesser known. Among all of these, the only one that I studied thoroughly at university is the quantile-quantile plot.

In practice the distributions of the variables are unknown. All we have at our disposal are the samples $x_i$ and $y_i$. Let's assume that both samples are of size $N$. Since the distributions are unknown, we estimate each c.d.f. using the empirical c.d.f.:

$F_X^N(x) = \dfrac{1}{N} \sum_{i \leq N} \mathbf{1}_{\{ x_i \leq x \}} \in [0,1], \text{ for } x \in \mathbb{R}$

It can be proven that $F_X^N$ converges to $F_X$ almost surely when $N \rightarrow \infty$. The empirical quantile function is then defined as the generalized inverse of the empirical c.d.f.:

$Q_X^N(\alpha) = \min \{ x \in \mathbb{R} : \alpha \leq F_X^N(x) \} \in \mathbb{R}, \text{ for } \alpha \in (0,1]$
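The two empirical estimators can be sketched in a few lines of NumPy. This is only an illustration of the definitions: the function names `empirical_cdf` and `empirical_quantile` are mine and are not part of any library, and the quantile is only defined here for $\alpha$ in $(0,1]$.

```python
import numpy as np


def empirical_cdf(sample, x):
    """F_X^N(x): fraction of sample points that are less than or equal to x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the points x_i such that x_i <= x
    return np.searchsorted(sample, x, side="right") / sample.size


def empirical_quantile(sample, alpha):
    """Q_X^N(alpha): smallest x such that alpha <= F_X^N(x), for alpha in (0, 1]."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size
    # values taken by the empirical c.d.f. at the sorted sample points
    steps = np.arange(1, n + 1) / n
    # first sorted point whose c.d.f. value reaches alpha
    idx = np.searchsorted(steps, alpha, side="left")
    return sample[idx]


sample = [3.0, 1.0, 2.0, 4.0]
print(empirical_cdf(sample, 2.5))       # 0.5
print(empirical_quantile(sample, 0.5))  # 2.0
```

Both functions also accept arrays for `x` and `alpha`, which is what a plot over a grid of points would need.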

Normal Probability Plots

The normal distribution holds a central (pun intended) role in modern statistics thanks to the Central Limit Theorem: the centered and scaled mean of an i.i.d. sample with finite variance has, for a large enough sample, a distribution close to the normal distribution. The "bivariate" probability plots can be used to compare an empirical probability distribution against the normal distribution. This is what makes the qq-plot so popular compared to its counterparts. It makes it possible to visualize the behaviour of a random variable in the tails of its distribution compared to the normal distribution (which is some sort of equilibrium).

The simple quantile plot is not really a thing. Since the simple probability plot exists I simply extended the definition to quantiles. A distribution that can easily be used to produce examples of fat-tailed distributions is the gamma distribution. The excess kurtosis of a gamma distribution with shape parameter $k$ is $\dfrac{6}{k}$, whatever the scale parameter.
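As a quick sanity check of the kurtosis formula, scipy can return the excess kurtosis of a gamma distribution directly. The shape values below are arbitrary choices of mine, used only for illustration.

```python
from scipy.stats import gamma

# excess kurtosis reported by scipy for a few arbitrary shape parameters
kurtosis_by_shape = {
    k: float(gamma.stats(k, moments="k")) for k in (0.5, 1, 2, 6)
}

for k, excess_kurtosis in kurtosis_by_shape.items():
    # the two numbers printed on each line should agree
    print(k, excess_kurtosis, 6 / k)
```

The smaller the shape parameter, the fatter the tail, which is why a small shape is convenient for producing examples.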

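# qq-plot of 100 gamma samples (shape 1, location 1, scale 3) against the normal distribution
sq.pplot(
    data=pd.DataFrame(data={"gamma_samples": gamma.rvs(a=1, loc=1, scale=3, size=100)}),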
    x="gamma_samples",
    y=norm,
    kind='qq',
    height=4,
    aspect=2,
    display_kws={"identity":True}
)
plt.show()

# qq-plot of 100 standard normal samples against the normal distribution
sq.pplot(
    data=pd.DataFrame(data={"gaussian_samples": norm.rvs(loc=0, scale=1, size=100)}),
    x="gaussian_samples",
    y=norm,
    kind='qq',
    height=4,
    aspect=2,
    display_kws={"identity":True}
)
plt.show()
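For readers who want to see what lies behind such a plot, here is a minimal sketch of how the points of a normal qq-plot can be computed by hand. It is not the code used by seaborn-qqplot: the helper name `normal_qq_points`, the plotting positions $(i - 0.5)/N$ and the standardisation of the sample are all assumptions of mine.

```python
import numpy as np
from scipy.stats import gamma, norm


def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal qq-plot."""
    observed = np.sort(np.asarray(sample, dtype=float))
    n = observed.size
    # plotting positions (i - 0.5) / n: one common convention, assumed here
    alpha = (np.arange(1, n + 1) - 0.5) / n
    theoretical = norm.ppf(alpha)
    # standardise so that a normal sample falls close to the identity line
    observed = (observed - observed.mean()) / observed.std(ddof=1)
    return theoretical, observed


rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
gamma_sample = gamma.rvs(a=1, loc=1, scale=3, size=100, random_state=rng)
normal_sample = norm.rvs(loc=0, scale=1, size=100, random_state=rng)

for name, sample in [("gamma", gamma_sample), ("normal", normal_sample)]:
    theoretical, observed = normal_qq_points(sample)
    # largest vertical distance between the points and the identity line
    print(name, np.max(np.abs(observed - theoretical)))
```

Plotting `observed` against `theoretical` with a scatter plot gives the same kind of picture as above, and the printed distance gives a rough idea of how far each sample strays from the identity line.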

Application to Data Analysis

Probability plots seem underutilized in the context of data analysis (I may be wrong). Compared to a purely statistical approach, in data analysis there is usually a target variable that can be used as "hue". This can help to derive insights by comparison.

I wrote a micro library named seaborn-qqplot, which is a probability plot add-on to seaborn. For this section I use the iris dataset as an example.

iris = sns.load_dataset('iris')

iris

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

Here are some cases for which I find probability plots useful:

visually assess a feature's importance:

sq.pplot(iris, x="petal_length", y=gamma, hue="species", kind='qq', height=4, aspect=2)
plt.show()

We notice a visible change in the underlying distribution of the petal length conditional on the species. Hence the petal length is a good feature (as it can be measured).
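The same comparison can be made numerically by computing a few quantiles per species. The sketch below only uses the ten rows of iris displayed earlier (five setosa, five virginica) so that it runs without downloading anything; the choice of quantile levels is mine.

```python
import pandas as pd

# petal lengths of the ten iris rows displayed above
rows = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1],
    "species": ["setosa"] * 5 + ["virginica"] * 5,
})

# one row per species, one column per quantile level
quantiles = (
    rows.groupby("species")["petal_length"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quantiles)
```

When the quantiles of one group sit entirely above those of another, as they do here, the conditional distributions barely overlap and the feature separates the groups well.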

visually detect outliers:

sq.pplot(iris, x="sepal_length", y=gamma, hue="species", kind='qq', height=4, aspect=2)
plt.show()

Outliers can be spotted easily using qq-plots: they usually stand out at either edge of the plot (as an outlier is, for example, a minimum or a maximum of the sample).
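To make the idea concrete, here is a rough sketch that flags the points lying far from the identity line of a normal qq-plot. It is an illustration under my own assumptions, not an established method: the helper `qq_outlier_candidates`, the median-based standardisation and the cut-off of 1.0 are all hypothetical choices.

```python
import numpy as np
from scipy.stats import norm


def qq_outlier_candidates(sample, threshold=1.0):
    """Return the values whose qq-plot point is further than `threshold`
    (a hypothetical cut-off) from the identity line."""
    values = np.sort(np.asarray(sample, dtype=float))
    n = values.size
    # plotting positions (i - 0.5) / n, assumed here
    theoretical = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    # median-based centre and scale, so that an outlier does not hide itself
    # (assumes the median absolute deviation is not zero)
    centre = np.median(values)
    scale = 1.4826 * np.median(np.abs(values - centre))
    observed = (values - centre) / scale
    gap = np.abs(observed - theoretical)
    return values[gap > threshold]


# a regular, normal-looking sample with one injected extreme value
base = norm.ppf(np.arange(1, 50) / 50)
sample = np.append(base, 10.0)
flagged = qq_outlier_candidates(sample)
print(flagged)
```

The flagged values are only candidates: a point can sit far from the line simply because the sample is not normal, which is exactly what the gamma example shows.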

Conclusion

Probability plots are purely visual tools. From a statistics point of view, their usefulness is quite restricted (comparing distributions), but in the context of data analysis they can reveal interesting patterns, especially when a given variable is conditioned on the value of the target (which the data scientist ultimately wants to predict).

I am thinking about writing a more extensive post about outlier detection but I think that requires a bit more experimenting on a more complex dataset.

Rene-Jean Corneille

Probability plots for data analysis
