Frank Kane's Taming Big Data with Apache Spark and Python


The friends by age example

Let's go through a key/value RDD example to illustrate these concepts. I've generated a fake dataset, completely at random, that represents a social network. Every line contains a user ID, a username, the age of that user, and the number of friends that user has.

So, for example, user ID 0 might be named Will, and he's 33 years old and has 385 friends. These ages and numbers of friends are all assigned completely at random, so don't associate any sort of deep meaning with them. You might notice that I'm a Star Trek fan here. So that's the source data we're going to work with, and our task is to figure out the average number of friends by age. For example, what's the average number of friends for a 33-year-old in our dataset? Well, let's figure that out.
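Before bringing Spark into it, it can help to see the shape of the computation in plain Python. The following is only a minimal sketch under a few assumptions: the lines are taken to be comma-separated, the names `SAMPLE_LINES`, `parse_line`, and `average_friends_by_age` are hypothetical, and only the first sample row comes from the text above (the other rows are invented for illustration). Each line becomes an (age, number of friends) key/value pair, and then a running total and a count are kept per age.

```python
from collections import defaultdict

# Hypothetical sample rows in an ASSUMED comma-separated layout:
# user ID, name, age, number of friends.
# Only the first row is from the text; the rest are invented for illustration.
SAMPLE_LINES = [
    "0,Will,33,385",
    "1,Jean-Luc,26,2",
    "2,Hugh,55,221",
    "3,Deanna,33,15",
]


def parse_line(line):
    """Turn one line into an (age, num_friends) key/value pair."""
    fields = line.split(",")
    return int(fields[2]), int(fields[3])


def average_friends_by_age(lines):
    """Return a dict mapping each age to its average number of friends."""
    # age -> [sum of friends, number of users seen with that age]
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        age, friends = parse_line(line)
        totals[age][0] += friends
        totals[age][1] += 1
    return {age: total / count for age, (total, count) in totals.items()}


averages = average_friends_by_age(SAMPLE_LINES)
for age in sorted(averages):
    print(age, averages[age])
```

With these made-up rows, the two 33-year-olds have 385 and 15 friends, so their average comes out to 200.0. On a key/value RDD, the same idea would typically be expressed with operations such as `mapValues` and `reduceByKey` instead of a dictionary.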