
Sampling with probability weights
To create the nuclear bomb during the Second World War, physicists needed to perform pretty complicated calculations. Stanislaw Ulam got the idea to treat this challenge as a game of chance. Later, the method he came up with was given the code name Monte Carlo. Games of chance usually have very simple rules, but playing in an optimal way can be difficult. According to quantum mechanics, subatomic particles are also unpredictable. If we simulate many experiments with subatomic particles, we still can get an idea of how they are likely to behave. The Monte Carlo method is not deterministic, but it approaches the correct result for a complex computation for a sufficiently large number of simulations.
The statsmodels.distributions.empirical_distribution.ECDF
class defines the cumulative distribution function of a data array. We can use its output to simulate a complex process. This simulation is not perfect, because we lose information in the process.
How to do it...
In this recipe, we will simulate weather processes. In particular, I am interested in annual temperature values. I am interested in finding out whether the simulated data sets also show an upward trend:
- The imports are as follows:
from statsmodels.distributions.empirical_distribution import ECDF import dautil as dl import numpy as np import matplotlib.pyplot as plt from sklearn.utils import check_random_state from IPython.html.widgets.interaction import interact from IPython.core.display import HTML
- Define the following function to calculate the slope:
def slope(x, y): return np.polyfit(x, y, 1)[0]
- Define the following function to generate data for a single year:
def simulate(x, years, rs, p): N = len(years) means = np.zeros(N) for i in range(N): sample = rs.choice(x, size=365, p=p) means[i] = sample.mean() return means, np.diff(means).mean(), slope(years, means)
- Define the following function to run multiple simulations:
def run_multiple(times, x, years, p): sims = [] rs = check_random_state(20) for i in range(times): sims.append(simulate(x, years, rs, p)) return np.array(sims)
- Define the following function, which by default loads temperature values:
def main(var='TEMP'): df = dl.data.Weather.load().dropna()[var] cdf = ECDF(df) x = cdf.x[1:] p = np.diff(cdf.y) df = df.resample('A') years = df.index.year sims = run_multiple(500, x, years, p) sp = dl.plotting.Subplotter(2, 1, context) plotter = dl.plotting.CyclePlotter(sp.ax) plotter.plot(years, df.values, label='Data') plotter.plot(years, sims[0][0], label='Sim 1') plotter.plot(years, sims[1][0], label='Sim 2') header = dl.data.Weather.get_header(var) sp.label(title_params=header, ylabel_params=header) sp.ax.legend(loc='best') sp.next_ax() sp.label() sp.ax.hist(sims.T[2], normed=True) plt.figtext(0.2, 0.3, 'Slope of the Data {:.3f}'.format(slope(years, df.values))) plt.tight_layout()
The notebook stored in the sampling_weights.ipynb
file in this book's code bundle gives you the option to select other weather variables too. Refer to the following screenshot for the end result:

See also
- The Wikipedia page for the Monte Carlo method at https://en.wikipedia.org/wiki/Monte_Carlo_method (retrieved August 2015)
- The documentation for the
ECDF
class at http://statsmodels.sourceforge.net/0.6.0/generated/statsmodels.distributions.empirical_distribution.ECDF.html (retrieved August 2015)