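Testing whether data follow a normal distribution

  • Visual methods

Box-whisker plots show whether the data are symmetric or skewed

A Q-Q plot compares the quantiles of the data against those of a specified distribution

Histogram

  • Hypothesis-testing methods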

Shapiro-Wilk test

D’Agostino-Pearson

Kolmogorov-Smirnov
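
As a rough sketch of how the three tests listed above can be called through scipy: `stats.shapiro`, `stats.normaltest` (scipy's name for the D'Agostino-Pearson test) and `stats.kstest`. The sample, its size and the seed are assumptions made only for illustration; they are not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

# Illustrative sample; the seed and size are arbitrary choices for this sketch.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

results = {}

# Shapiro-Wilk test
sw_stat, sw_p = stats.shapiro(sample)
results["Shapiro-Wilk"] = (sw_stat, sw_p)

# D'Agostino-Pearson test (combines skewness and kurtosis)
dp_stat, dp_p = stats.normaltest(sample)
results["D'Agostino-Pearson"] = (dp_stat, dp_p)

# Kolmogorov-Smirnov test against a standard normal distribution
ks_stat, ks_p = stats.kstest(sample, 'norm')
results["Kolmogorov-Smirnov"] = (ks_stat, ks_p)

for name, (stat, p) in results.items():
    print("{}: statistic={:.4f}, p-value={:.4f}".format(name, stat, p))
```

In each case the null hypothesis is that the sample comes from a normal distribution, so a small p-value is evidence against normality.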

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import quiz_tests

# Set plotting options
%matplotlib inline
plt.rc('figure', figsize=(16, 9))

Create one normally distributed sample and one non-normally distributed sample

In [2]:
# Sample A: Normal distribution
sample_a = stats.norm.rvs(loc=0.0, scale=1.0, size=(1000,))

# Sample B: Non-normal distribution
sample_b = stats.lognorm.rvs(s=0.5, loc=0.0, scale=1.0, size=(1000,))

Visualize with box plots and histograms

In [5]:
# Sample A: Normal distribution
fig, axes = plt.subplots(2, 1, figsize=(16, 9), sharex=True)
# Box plot
axes[0].boxplot(sample_a, vert=False)
# Histogram
axes[1].hist(sample_a, bins=50)
axes[0].set_title("Boxplot of a Normal Distribution");
In [6]:
# Sample B: Non-normal distribution
fig, axes = plt.subplots(2, 1, figsize=(16, 9), sharex=True)
axes[0].boxplot(sample_b, vert=False)
axes[1].hist(sample_b, bins=50)
axes[0].set_title("Boxplot of a Lognormal Distribution");
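
The asymmetry that the box plots and histograms show can also be summarized numerically. The following is a minimal sketch, assuming samples drawn with the same distribution parameters as above; the `random_state` values and variable names are hypothetical additions for reproducibility, not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

# Same distribution parameters as the notebook's samples; seeds are arbitrary.
normal_sample = stats.norm.rvs(loc=0.0, scale=1.0, size=1000, random_state=0)
lognormal_sample = stats.lognorm.rvs(s=0.5, loc=0.0, scale=1.0, size=1000,
                                     random_state=0)

# Sample skewness: near 0 for symmetric data, positive for a long right tail.
skew_normal = stats.skew(normal_sample)
skew_lognormal = stats.skew(lognormal_sample)

# For right-skewed data the mean is pulled above the median.
gap_normal = np.mean(normal_sample) - np.median(normal_sample)
gap_lognormal = np.mean(lognormal_sample) - np.median(lognormal_sample)

print("Skewness (normal):    {:.3f}".format(skew_normal))
print("Skewness (lognormal): {:.3f}".format(skew_lognormal))
print("Mean - median (normal):    {:.3f}".format(gap_normal))
print("Mean - median (lognormal): {:.3f}".format(gap_lognormal))
```

A skewness close to zero is consistent with a symmetric distribution, while a clearly positive value matches the long right tail seen in the lognormal histogram.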

Visualize with Q-Q plots

In [7]:
# Q-Q plot of normally-distributed sample
plt.figure(figsize=(10, 10)); plt.axis('equal')
stats.probplot(sample_a, dist='norm', plot=plt);
In [10]:
# Q-Q plot of non-normally-distributed sample
plt.figure(figsize=(10, 10)); plt.axis('equal')
stats.probplot(sample_b, dist='norm', plot=plt);
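
When called without a `plot` argument, `stats.probplot` also returns the least-squares fit of the Q-Q points, including the correlation coefficient `r`. The sketch below uses that value as a rough numeric companion to the plots; the samples are regenerated here with arbitrary seeds, which is an assumption of this example rather than something the notebook does.

```python
import scipy.stats as stats

# Regenerate illustrative samples with the notebook's parameters; seeds are arbitrary.
normal_sample = stats.norm.rvs(loc=0.0, scale=1.0, size=1000, random_state=1)
lognormal_sample = stats.lognorm.rvs(s=0.5, loc=0.0, scale=1.0, size=1000,
                                     random_state=1)

# probplot returns ((theoretical quantiles, ordered sample values),
#                   (slope, intercept, r)) when fit=True, which is the default.
(_, _), (slope_n, intercept_n, r_normal) = stats.probplot(normal_sample, dist='norm')
(_, _), (slope_l, intercept_l, r_lognormal) = stats.probplot(lognormal_sample, dist='norm')

print("Q-Q correlation r (normal sample):    {:.4f}".format(r_normal))
print("Q-Q correlation r (lognormal sample): {:.4f}".format(r_lognormal))
```

The closer `r` is to 1, the more closely the points follow the reference line; the curved lognormal Q-Q plot should yield a lower value than the normal one.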

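Hypothesis tests for normality

There are many hypothesis tests for normality. The link below documents scipy's Shapiro-Wilk test (`scipy.stats.shapiro`); the code in this section uses the Kolmogorov-Smirnov test (`scipy.stats.kstest`). In both tests the null hypothesis is that the sample comes from a normal distribution. If the p-value is greater than the chosen $\alpha$, we fail to reject the null hypothesis and treat the sample as plausibly normal; otherwise we reject the null hypothesis.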

https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.shapiro.html

In [11]:
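def is_normal_ks(sample, test=stats.kstest, p_level=0.05, **kwargs):
    """
    sample: a sample distribution
    test: a function that tests for normality (called with the kstest signature)
    p_level: if the test returns a p-value greater than p_level, assume normality

    return: True if the distribution is likely normal, False otherwise
    """
    # Compare against a normal distribution with the sample's own mean and std.
    # Note: estimating these parameters from the same sample tends to overstate the K-S p-value.
    normal_args = (np.mean(sample), np.std(sample))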
    
    t_stat, p_value = test(sample, 'norm', normal_args, **kwargs)
    print("Test statistic: {}, p-value: {}".format(t_stat, p_value))
    print("Is the distribution Likely Normal? {}".format(p_value > p_level))
    return p_value > p_level

quiz_tests.test_is_normal_ks(is_normal_ks)
Test statistic: 0.014762337753813415, p-value: 0.9813208156284505
Is the distribution Likely Normal? True
Test statistic: 0.11969044583541244, p-value: 6.108447081487611e-13
Is the distribution Likely Normal? False
Tests Passed
In [12]:
# Using Kolmogorov-Smirnov test
print("Sample A:-"); is_normal_ks(sample_a);
print("Sample B:-"); is_normal_ks(sample_b);
Sample A:-
Test statistic: 0.016918963065188558, p-value: 0.9370534969191464
Is the distribution Likely Normal? True
Sample B:-
Test statistic: 0.10934501078187586, p-value: 7.222222819791568e-11
Is the distribution Likely Normal? False