Machine Learning for Developers

Variance

As we saw in the first example, the mean isn't sufficient to describe non-homogeneous or very dispersed samples.

To describe with a single value how dispersed the sample set's values are, we need the concept of variance. It takes the mean of the sample set as a starting point and then averages the squared distances of the samples from that mean. The greater the variance, the more scattered the sample set.

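The canonical definition of variance, for a sample set of n values x_1, ..., x_n with mean \mu, is as follows:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$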

Let's write the following sample code snippet to illustrate this concept, adopting the previously used libraries. For the sake of clarity, we are repeating the declaration of the mean function:

    import math  # This library is needed for the power operation

    def mean(sampleset):  # Definition header for the mean function
        total = 0
        for element in sampleset:
            total = total + element
        return total / len(sampleset)

    def variance(sampleset):  # Definition header for the variance function
        total = 0
        setmean = mean(sampleset)
        for element in sampleset:
            total = total + math.pow(element - setmean, 2)  # Squared distance from the mean
        return total / len(sampleset)

    myset1 = [2., 10., 3., 6., 4., 6., 10.]  # We create the data sets
    myset2 = [1., -100., 15., -100., 21.]
    print("Variance of first set:" + str(variance(myset1)))
    print("Variance of second set:" + str(variance(myset2)))

The preceding code will generate the following output:

    Variance of first set:8.69387755102
    Variance of second set:3070.64
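
As a cross-check, these figures can be compared with Python's standard `statistics` module. The following is a minimal sketch that assumes Python 3.4 or later, where `statistics` is available; the `variance` function is a compact restatement of the one in this section, and the loop variable names are chosen only for illustration. Note that `statistics.pvariance` divides by n, as the code in this section does, while `statistics.variance` divides by n - 1.

```python
import statistics


def variance(sampleset):
    """Population variance: mean of the squared distances from the mean."""
    setmean = sum(sampleset) / len(sampleset)
    return sum((x - setmean) ** 2 for x in sampleset) / len(sampleset)


myset1 = [2., 10., 3., 6., 4., 6., 10.]
myset2 = [1., -100., 15., -100., 21.]

for name, data in (("first", myset1), ("second", myset2)):
    ours = variance(data)                    # hand-written version, divides by n
    population = statistics.pvariance(data)  # standard library, divides by n
    sample = statistics.variance(data)       # standard library, divides by n - 1
    print("%s set: ours=%.4f pvariance=%.4f variance=%.4f"
          % (name, ours, population, sample))
```

The hand-written result should agree with `pvariance` up to floating-point rounding, while `statistics.variance` comes out slightly larger because of its smaller divisor.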

As you can see, the variance of the second set is much higher, because its values are far more dispersed. Because we average the squared distances, a quadratic operation, samples far from the mean contribute disproportionately, which makes the difference between the two sets stand out.
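
To see the effect of the quadratic operation, one option is to compare the variance with the mean of the plain (unsquared) absolute distances. This is only an illustrative sketch: `mean_absolute_deviation` is a helper introduced here for comparison and is not part of the original code.

```python
def mean(sampleset):
    return sum(sampleset) / len(sampleset)


def mean_absolute_deviation(sampleset):
    """Average of the absolute (unsquared) distances from the mean."""
    setmean = mean(sampleset)
    return sum(abs(x - setmean) for x in sampleset) / len(sampleset)


def variance(sampleset):
    """Average of the squared distances from the mean."""
    setmean = mean(sampleset)
    return sum((x - setmean) ** 2 for x in sampleset) / len(sampleset)


myset1 = [2., 10., 3., 6., 4., 6., 10.]
myset2 = [1., -100., 15., -100., 21.]

# How many times more dispersed is the second set, under each measure?
mad_ratio = mean_absolute_deviation(myset2) / mean_absolute_deviation(myset1)
var_ratio = variance(myset2) / variance(myset1)

print("Ratio of mean absolute deviations: %.1f" % mad_ratio)
print("Ratio of variances: %.1f" % var_ratio)
```

With these two sets, the second is about 22 times more dispersed by absolute distance but about 353 times more dispersed by variance, which shows how squaring amplifies large distances.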