Statistics utilities

This module contains helpers for various statistical calculations.

Analytic KL-divergence calculations

calculate_univariate_gaussian_kl(…) Calculate the (asymmetric) KL-divergence between the univariate Gaussian distributions \(p\) and \(q\)
calculate_symmetric_kl_divergence(p, q, …) Calculates the symmetric KL-divergence between distributions p and q
symmetric_entropy(p, q) Calculate the symmetric KL-divergence using scipy.stats.entropy()
symmetric_gaussian_kl(p, q) Calculate the symmetric KL-divergence using pyllars.stats_utils.calculate_univariate_gaussian_kl()

Sufficient statistics and parameter estimation

get_population_statistics(…) Calculate the population size, mean and variance based on subpopulation statistics
get_categorical_mle_estimates(observations, …) Calculate the MLE estimates for the categorical observations
fit_with_least_squares(x, y, w[, order]) Fit a polynomial relationship between x and y.

Bayesian hypothesis testing

bayesian_proportion_test(x, n, …) Perform a Bayesian test to identify significantly different proportions.
bayesian_means_test(x1, x2, …) Perform a Bayesian test to identify significantly different means.

Distribution sampling

normal_inverse_chi_square(m, k, r, s[, size]) Sample from a normal-inverse-chi-square distribution with parameters m, k, r, s.
sample_dirichlet_multinomial(…) Sample from a Dirichlet-multinomial distribution
sample_beta_binomial(alpha, beta, n_samples, …) Sample from a beta-binomial distribution
sample_gamma_poisson(mean, var, size, …) Sample from a gamma-Poisson distribution

Definitions

pyllars.stats_utils.bayesian_means_test(x1: numpy.ndarray, x2: numpy.ndarray, use_jeffreys_prior: bool = True, prior1: Optional[Tuple[float, float, float, float]] = None, prior2: Optional[Tuple[float, float, float, float]] = None, num_samples: int = 1000, seed: int = 8675309) → Tuple[float, float, float][source]

Perform a Bayesian test to identify significantly different means.

The test is based on a Gaussian conjugate model. (The normal-inverse-chi-square distribution is the prior.) It uses Monte Carlo simulation to estimate the posterior of the difference between the means of the populations, under the (probably dubious) assumption that the observations are Gaussian distributed. It also estimates the likelihood that \(\mu_1 > \mu_2\), where \(\mu_i\) is the mean of each sample.

Parameters:
  • x{1,2} (numpy.ndarray) – The observations of each sample
  • use_jeffreys_prior (bool) –

    Whether to use the Jeffreys prior. For more details, see:

    Murphy, K. Conjugate Bayesian analysis of the Gaussian distribution. Technical report, 2007.

    Briefly, the Jeffreys prior is \((\text{sample mean}, n, n-1, \text{sample variance})\), interpreted as the parameters of a pyllars.stats_utils.normal_inverse_chi_square() distribution.

  • prior{1,2} (typing.Optional[typing.Tuple[float,float,float,float]]) – If the Jeffreys prior is not used, then these parameters are used as the priors for the normal-inverse-chi-square. If only prior1 is given, then those values are also used for prior2, where prior_i is taken as the prior for x_i.
  • num_samples (int) – The number of simulations
  • seed (int) – The seed for the random number generator
Returns:

  • difference_{mean,var} (float) – The posterior mean and variance of the difference in the mean of the two samples. A negative difference_mean indicates that the mean of x2 is higher.
  • p_m1_greater (float) – The probability that \(\mu_1 > \mu_2\)
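
Example

A minimal usage sketch, assuming the returned tuple follows the documented order of difference_mean, difference_var, and p_m1_greater:

    import numpy as np
    from pyllars import stats_utils

    rng = np.random.default_rng(8675309)
    x1 = rng.normal(loc=0.0, scale=1.0, size=50)  # sample with true mean 0
    x2 = rng.normal(loc=0.5, scale=1.0, size=50)  # sample with true mean 0.5

    diff_mean, diff_var, p_m1_greater = stats_utils.bayesian_means_test(x1, x2)
    # diff_mean should be negative (the mean of x2 is higher), and
    # p_m1_greater estimates the probability that mu_1 > mu_2.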

pyllars.stats_utils.bayesian_proportion_test(x: Tuple[int, int], n: Tuple[int, int], prior: Tuple[float, float] = (0.5, 0.5), prior2: Optional[Tuple[float, float]] = None, num_samples: int = 1000, seed: int = 8675309) → Tuple[float, float, float][source]

Perform a Bayesian test to identify significantly different proportions.

This test is based on a beta-binomial conjugate model. It uses Monte Carlo simulations to estimate the posterior of the difference between the proportions, as well as the likelihood that \(\pi_1 > \pi_2\) (where \(\pi_i\) is the likelihood of success in sample \(i\)).

Parameters:
  • x (typing.Tuple[int,int]) – The number of successes in each sample
  • n (typing.Tuple[int,int]) – The number of trials in each sample
  • prior (typing.Tuple[float,float]) – The parameters of the beta distribution used as the prior in the conjugate model for the first sample.
  • prior2 (typing.Optional[typing.Tuple[float,float]]) – The parameters of the beta distribution used as the prior in the conjugate model for the second sample. If this is not specified, then prior is used.
  • num_samples (int) – The number of simulations
  • seed (int) – The seed for the random number generator
Returns:

  • difference_{mean,var} (float) – The posterior mean and variance of the difference in the likelihood of success in the two samples. A negative mean indicates that the likelihood in sample 2 is higher.
  • p_pi_1_greater (float) – The probability that \(\pi_1 > \pi_2\)
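
Example

A minimal usage sketch, assuming the returned tuple follows the documented order of difference_mean, difference_var, and p_pi_1_greater:

    from pyllars import stats_utils

    # 30 successes in 100 trials vs. 45 successes in 100 trials
    diff_mean, diff_var, p_pi_1_greater = stats_utils.bayesian_proportion_test(
        x=(30, 45), n=(100, 100)
    )
    # The negative diff_mean indicates the success rate in sample 2 is higher.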

pyllars.stats_utils.calculate_symmetric_kl_divergence(p: Any, q: Any, calculate_kl_divergence: Callable) → float[source]

Calculates the symmetric KL-divergence between distributions p and q

In particular, this function defines the symmetric KL-divergence to be:

\[D_{sym}(p||q) := \frac{D(p||q) + D(q||p)}{2}\]
Parameters:
  • {p,q} (typing.Any) – A representation of a distribution that can be used by the function calculate_kl_divergence
  • calculate_kl_divergence (typing.Callable) – A function that calculates the KL-divergence between \(p\) and \(q\)
Returns:

symmetric_kl – The symmetric KL-divergence between \(p\) and \(q\)

Return type:

float
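
Example

A minimal usage sketch pairing this function with pyllars.stats_utils.calculate_univariate_gaussian_kl(), which accepts its (mean, variance) parameters as pairs for exactly this purpose:

    from pyllars import stats_utils

    p = (0.0, 1.0)  # (mean, variance) of p
    q = (1.0, 2.0)  # (mean, variance) of q

    sym_kl = stats_utils.calculate_symmetric_kl_divergence(
        p, q, stats_utils.calculate_univariate_gaussian_kl
    )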

pyllars.stats_utils.calculate_univariate_gaussian_kl(mean_p_var_p: Tuple[float, float], mean_q_var_q: Tuple[float, float]) → float[source]

Calculate the (asymmetric) KL-divergence between the univariate Gaussian distributions \(p\) and \(q\)

That is, this calculates KL(p||q).

N.B. This function uses the variance!

N.B. The parameters for each distribution are passed as pairs for easy use with calculate_symmetric_kl_divergence.

See, for example, [1] for the formula.
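
In terms of the (mean, variance) parameters \((\mu_p, \sigma_p^2)\) and \((\mu_q, \sigma_q^2)\), the standard closed form is:

\[KL(p||q) = \frac{1}{2}\left(\log\frac{\sigma_q^2}{\sigma_p^2} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{\sigma_q^2} - 1\right)\]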

Parameters:
  • {mean_p_var_p,mean_q_var_q} (typing.Tuple[float,float]) – The (mean, variance) parameters of each distribution.
Returns:

kl_divergence – The KL divergence between the two distributions.

Return type:

float

References

[1] Penny, W. “KL-Divergences of Normal, Gamma, Dirichlet and Wishart densities.” Wellcome Department of Cognitive Neurology, University College London, 2001.

pyllars.stats_utils.fit_with_least_squares(x: numpy.ndarray, y: numpy.ndarray, w: Optional[numpy.ndarray] = None, order=<polynomial_order.linear: 1>) → Tuple[float, float, float, float][source]

Fit a polynomial relationship between x and y.

Optionally, the values can be weighted.

Parameters:
  • x (numpy.ndarray) – The independent values
  • y (numpy.ndarray) – The dependent values
  • w (typing.Optional[numpy.ndarray]) – Optional weights for each observation
  • order (polynomial_order) – The order of the polynomial fit; linear by default
Returns:

  • {intercept,slope,power} (float) – The coefficients of the fit. power is 0 if the order is linear.
  • r_sqr (float) – The coefficient of determination
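
Example

A minimal usage sketch, assuming the returned tuple follows the documented order of intercept, slope, power, and r_sqr:

    import numpy as np
    from pyllars import stats_utils

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 2.9, 5.2, 6.8])  # roughly y = 2x + 1

    intercept, slope, power, r_sqr = stats_utils.fit_with_least_squares(x, y)
    # With the default linear order, power is 0 and r_sqr is the
    # coefficient of determination of the linear fit.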

pyllars.stats_utils.get_categorical_mle_estimates(observations: Iterable[int], cardinality: Optional[int] = None, use_laplacian_smoothing: bool = False, base_1: bool = False) → numpy.ndarray[source]

Calculate the MLE estimates for the categorical observations

Parameters:
  • observations (typing.Iterable[int]) – The observed values. These are taken to already be “label encoded”, so they should be integers in [0,cardinality).
  • cardinality (typing.Optional[int]) – The cardinality of the categorical variable. If None, then this is taken as the number of unique values in observations.
  • use_laplacian_smoothing (bool) – Whether to use Laplacian (“add one”) smoothing for the estimates. This can also be interpreted as a symmetric Dirichlet prior with a concentration parameter of 1.
  • base_1 (bool) – Whether the observations are base 1. If so, then the range is taken as [1, cardinality].
Returns:

mle_estimates – The estimates. The size of the array is cardinality.

Return type:

numpy.ndarray
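
Example

A minimal usage sketch; the MLE estimates for a categorical variable are simply the relative frequencies of the observed values:

    from pyllars import stats_utils

    observations = [0, 1, 1, 2]
    mle = stats_utils.get_categorical_mle_estimates(observations)
    # The counts are (1, 2, 1) out of 4 observations,
    # so the estimates are [0.25, 0.5, 0.25].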

pyllars.stats_utils.get_population_statistics(subpopulation_sizes: numpy.ndarray, subpopulation_means: numpy.ndarray, subpopulation_variances: numpy.ndarray) → Tuple[float, float, float, float][source]

Calculate the population size, mean and variance based on subpopulation statistics

This code is based on "Chap"'s answer here: https://stats.stackexchange.com/questions/30495

This calculation seems to underestimate the variance relative to numpy.var() on the entire dataset (determined by simulation). This may somehow relate to “biased” vs. “unbiased” variance estimates (basically, whether to subtract 1 from the population size). Still, naive approaches to correct for that do not produce variance estimates which exactly match those from numpy.var().

Parameters:
  • subpopulation_{sizes,means,variances} (numpy.ndarray) – The subpopulation sizes, means, and variances, respectively. These should all be the same size.
Returns:

population_{size,mean,variance,std} – The respective statistics about the entire population

Return type:

float
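
Example

A minimal usage sketch comparing the combined statistics against numpy on the full dataset, assuming the returned tuple follows the documented order:

    import numpy as np
    from pyllars import stats_utils

    rng = np.random.default_rng(0)
    data = rng.normal(size=300)
    subpops = np.split(data, 3)  # three subpopulations of 100 points each

    sizes = np.array([len(s) for s in subpops])
    means = np.array([s.mean() for s in subpops])
    variances = np.array([s.var() for s in subpops])

    pop_size, pop_mean, pop_var, pop_std = stats_utils.get_population_statistics(
        sizes, means, variances
    )
    # pop_mean should match data.mean(); per the note above, pop_var
    # may slightly underestimate data.var().
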
pyllars.stats_utils.normal_inverse_chi_square(m, k, r, s, size=1)[source]

Sample from a normal-inverse-chi-square distribution with parameters m, k, r, s.

This distribution is of interest because it is a conjugate prior for Gaussian observations.

Sampling is described in: https://www2.stat.duke.edu/courses/Fall10/sta114/notes15.pdf

Parameters:
  • m, k – m is the mean of the sampled mean; k relates to the variance of the sampled mean.
  • r, s – r is the degrees of freedom in the chi-square distribution from which the variance is sampled; s is something like a scaling factor.
  • size (int or tuple of ints, or None) – Output shape. This has the same semantics as other numpy sampling functions.
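
Example

The two-step sampling scheme itself is straightforward; the following hand-rolled NumPy sketch (an illustration, not pyllars's implementation; sample_nix2 is a hypothetical helper) draws the variance from a scaled inverse-chi-square distribution, then draws the mean from a normal distribution conditioned on that variance:

    import numpy as np

    def sample_nix2(m, k, r, s, size=1, rng=None):
        """Illustrative normal-inverse-chi-square sampler."""
        rng = rng or np.random.default_rng()
        # variance ~ scaled inverse chi-square: r * s / chi2(r)
        var = r * s / rng.chisquare(df=r, size=size)
        # mean | variance ~ Normal(m, variance / k)
        mean = rng.normal(loc=m, scale=np.sqrt(var / k), size=size)
        return mean, var
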
pyllars.stats_utils.sample_beta_binomial(alpha: float, beta: float, n_samples: int = 1, size: Union[int, Tuple[int], None] = None) → Union[numpy.ndarray, int][source]

Sample from a beta-binomial distribution

Parameters:
  • alpha, beta (float) – The alpha and beta shape parameters of the beta distribution. They must be positive.
  • n_samples (int) – The number of samples to draw from this distribution
  • size (typing.Optional[typing.Union[int, typing.Tuple[int]]]) – The output shape of the beta distribution. Please see the numpy.random.beta documentation for details.
Returns:

samples – The samples from the beta-binomial distribution

Return type:

typing.Union[int, numpy.ndarray]

Note

In order to see the outcome of multiple Bernoulli samples, the size parameter should be set to the number of desired samples, while n_samples should be set to 1.
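
Example

A minimal usage sketch following the note above: n_samples is 1, and size controls how many draws are returned:

    from pyllars import stats_utils

    # 1000 beta-binomial draws, each a single Bernoulli-style trial
    samples = stats_utils.sample_beta_binomial(
        alpha=2.0, beta=5.0, n_samples=1, size=1000
    )
    # For n_samples=1, the marginal success probability is
    # alpha / (alpha + beta) = 2 / 7, so roughly 29% of samples are 1.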

pyllars.stats_utils.sample_dirichlet_multinomial(dirichlet_alphas: numpy.ndarray, num_samples: int) → numpy.ndarray[source]

Sample from a Dirichlet-multinomial distribution

Parameters:
  • dirichlet_alphas (numpy.ndarray) – The concentration parameters for the Dirichlet distribution
  • num_samples (int) – The number of samples to draw for the multinomial
Returns:

counts – The number of counts from each level in the multinomial. This will be the same size as dirichlet_alphas.

Return type:

numpy.ndarray
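
Example

A minimal usage sketch:

    import numpy as np
    from pyllars import stats_utils

    alphas = np.array([1.0, 1.0, 2.0])
    counts = stats_utils.sample_dirichlet_multinomial(alphas, num_samples=100)
    # counts has the same size as alphas and sums to 100; on average,
    # the third level receives about half of the counts.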

pyllars.stats_utils.sample_gamma_poisson(mean: float, var: float, size: Union[int, Tuple[int], None] = None) → Union[numpy.ndarray, int][source]

Sample from a gamma-Poisson distribution

Parameters:
  • mean, var (float) – The respective parameters of the gamma distribution. They must be non-zero, and the variance should be positive.

    N.B. The gamma is usually parameterized by a “shape” and “scale” (usually called \(k\) and \(\theta\), respectively) or a “shape” and “rate” (usually called \(\alpha\) and \(\beta\), respectively). This implementation uses simple arithmetic to derive the shape and scale from the given mean and variance. Specifically, we have the following relationships:

    \[k = \frac{\mu^2}{\sigma^2}\]

    \[\theta = \frac{\sigma^2}{\mu}\]

  • size (typing.Optional[typing.Union[int, typing.Tuple[int]]]) – The output shape of the gamma distribution. Please see the numpy.random.gamma documentation for details.
Returns:

samples – The samples from the gamma-Poisson distribution

Return type:

typing.Union[int, numpy.ndarray]

Note

In order to see the outcome of multiple Poisson samples, the size parameter should be set to the number of desired samples.
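
Example

A minimal usage sketch; the comment applies the documented shape and scale relationships:

    from pyllars import stats_utils

    # Gamma-Poisson draws where the underlying gamma has mean 10, variance 20
    samples = stats_utils.sample_gamma_poisson(mean=10.0, var=20.0, size=1000)
    # Internally, the gamma shape is k = 10**2 / 20 = 5 and the
    # scale is theta = 20 / 10 = 2, so the Poisson rates average 10.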

pyllars.stats_utils.symmetric_entropy(p, q) → float[source]

Calculate the symmetric KL-divergence using scipy.stats.entropy() to compute the divergence in each direction.

pyllars.stats_utils.symmetric_gaussian_kl(p, q) → float[source]

Calculate the symmetric KL-divergence using pyllars.stats_utils.calculate_univariate_gaussian_kl() to compute the divergence in each direction.
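
Example

A minimal usage sketch for both symmetric helpers, assuming each takes the same argument form as the function it wraps (discrete distributions for symmetric_entropy, (mean, variance) pairs for symmetric_gaussian_kl):

    from pyllars import stats_utils

    # symmetric KL between two discrete distributions
    sym_ent = stats_utils.symmetric_entropy([0.7, 0.3], [0.5, 0.5])

    # symmetric KL between two univariate Gaussians, as (mean, variance) pairs
    sym_kl = stats_utils.symmetric_gaussian_kl((0.0, 1.0), (1.0, 2.0))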

class pyllars.stats_utils.polynomial_order[source]

An enumeration of the polynomial orders supported by fit_with_least_squares() (e.g., polynomial_order.linear).