Python Data Analysis Cookbook
上QQ阅读APP看书,第一时间看更新

Sampling with probability weights

To create the nuclear bomb during the Second World War, physicists needed to perform pretty complicated calculations. Stanislaw Ulam got the idea to treat this challenge as a game of chance. Later, the method he came up with was given the code name Monte Carlo. Games of chance usually have very simple rules, but playing in an optimal way can be difficult. According to quantum mechanics, subatomic particles are also unpredictable. If we simulate many experiments with subatomic particles, we still can get an idea of how they are likely to behave. The Monte Carlo method is not deterministic, but it approaches the correct result for a complex computation for a sufficiently large number of simulations.

The statsmodels.distributions.empirical_distribution.ECDF class defines the cumulative distribution function of a data array. We can use its output to simulate a complex process. This simulation is not perfect, because we lose information in the process.

How to do it...

In this recipe, we will simulate weather processes. In particular, I am interested in annual temperature values. I am interested in finding out whether the simulated data sets also show an upward trend:

  1. The imports are as follows:
    from statsmodels.distributions.empirical_distribution import ECDF
    import dautil as dl
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.utils import check_random_state
    from IPython.html.widgets.interaction import interact
    from IPython.core.display import HTML
  2. Define the following function to calculate the slope:
    def slope(x, y):
        return np.polyfit(x, y, 1)[0]
  3. Define the following function to generate data for a single year:
    def simulate(x, years, rs, p):
        N = len(years)
        means = np.zeros(N)
    
        for i in range(N):
            sample = rs.choice(x, size=365, p=p)
            means[i] = sample.mean()
    
        return means, np.diff(means).mean(), slope(years, means)
  4. Define the following function to run multiple simulations:
    def run_multiple(times, x, years, p):
        sims = []
        rs = check_random_state(20)
    
        for i in range(times):
            sims.append(simulate(x, years, rs, p))
    
        return np.array(sims)
  5. Define the following function, which by default loads temperature values:
    def main(var='TEMP'):
        df = dl.data.Weather.load().dropna()[var]
        cdf = ECDF(df)
        x = cdf.x[1:]
        p = np.diff(cdf.y)
    
        df = df.resample('A')
        years = df.index.year
        sims = run_multiple(500, x, years, p)
    
        sp = dl.plotting.Subplotter(2, 1, context)
        plotter = dl.plotting.CyclePlotter(sp.ax)
        plotter.plot(years, df.values, label='Data')
        plotter.plot(years, sims[0][0], label='Sim 1')
        plotter.plot(years, sims[1][0], label='Sim 2')
        header = dl.data.Weather.get_header(var)
        sp.label(title_params=header, ylabel_params=header)
        sp.ax.legend(loc='best')
    
        sp.next_ax()
        sp.label()
        sp.ax.hist(sims.T[2], normed=True)
        plt.figtext(0.2, 0.3, 'Slope of the Data {:.3f}'.format(slope(years, df.values)))
        plt.tight_layout()

The notebook stored in the sampling_weights.ipynb file in this book's code bundle gives you the option to select other weather variables too. Refer to the following screenshot for the end result:

See also