Using lists, dicts, and sets

A Python sequence object, such as a list, is iterable. However, it has some additional features. We'll think of it as a materialized iterable. We've used the tuple() function in several examples to collect the output of a generator expression or generator function into a single tuple object. We can also materialize a sequence to create a list object.

In Python, a list display, or list comprehension, offers simple syntax to materialize a generator: we just add the [] brackets. This is ubiquitous to the point where the distinction between generator expression and list comprehension is lost. We need to disentangle the idea of generator expression from a list display that uses a generator expression.

The following is an example to enumerate the cases:

>>> range(10)
range(0, 10)
>>> [range(10)]
[range(0, 10)]
>>> [x for x in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The first example is a range object. Strictly speaking, it is a lazy sequence rather than a generator function, but it behaves in a similar way: the range(10) object won't produce the 10 values until it is evaluated in a context that iterates through them.

The second example shows a list that contains a single range object. The [] syntax created a list literal around the range() object without consuming any of the values it can produce.

The third example shows a list comprehension that iterates over a range object. The expression x for x in range(10) consumes the values from range(10), and the resulting values are collected into a list object.

We can also use the list() function to build a list from an iterable or a generator expression. The same approach works for set() and tuple(), and for dict() when the iterable yields key-value pairs.

The list(range(10)) expression iterates over the range object and collects its values. The [range(10)] list literal does not iterate over the range object at all.

While there's shorthand syntax for list, dict, and set using [] and {}, there's no shorthand syntax for a tuple. To materialize a tuple, we must use the tuple() function. For this reason, it often seems most consistent to use the list(), tuple(), and set() functions as the preferred syntax.
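
The following sketch puts the display syntax and the constructor functions side by side. The variable names are made up for illustration; the point is only that each pair builds the same collection, and that parentheses on their own produce a lazy generator rather than a tuple:

```python
source = range(5)

# Display (comprehension) syntax versus the equivalent constructor call.
as_list_display = [x * x for x in source]
as_list_call = list(x * x for x in source)

as_set_display = {x * x for x in source}
as_set_call = set(x * x for x in source)

as_dict_display = {x: x * x for x in source}
as_dict_call = dict((x, x * x) for x in source)  # dict() needs key-value pairs

# There is no tuple comprehension, so tuple() is the only option.
as_tuple = tuple(x * x for x in source)

# Parentheses alone build a generator object, not a tuple.
not_a_tuple = (x * x for x in source)

print(as_list_call)   # [0, 1, 4, 9, 16]
print(as_tuple)       # (0, 1, 4, 9, 16)
```

Using the function forms throughout keeps the four cases visually parallel, which is the consistency argument made above.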

In the data-cleansing code, we used a composite function to create a sequence of rows, each holding eight string values (four x, y pairs). The code looked as follows:

with open("Anscombe.txt") as source:
    data = head_split_fixed(row_iter(source))
    print(list(data))

We assigned the results of the composite function to a name, data. The data looks as follows:

[['10.0', '8.04', '10.0', '9.14', '10.0', '7.46', '8.0', '6.58'], 
 ['8.0', '6.95', '8.0', '8.14', '8.0', '6.77', '8.0', '5.76'], 
 ...
 ['5.0', '5.68', '5.0', '4.74', '5.0', '5.73', '8.0', '6.89']]

We need to do a little more processing to make this useful. First, we need to pick pairs of columns from each eight-item row. We can select a pair of columns with a function, as shown in the following code snippet:

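from typing import Iterable, Iterator, List, Text, Tuple, cast

Pair = Tuple[str, str]

def series(
        n: int, row_iter: Iterable[List[Text]]
    ) -> Iterator[Pair]:
    """Yield columns 2n and 2n+1 of each row as an (x, y) pair."""
    for row in row_iter:
        # cast() only informs mypy; it performs no run-time check.
        yield cast(Pair, tuple(row[n*2:n*2+2]))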

This function picks two adjacent columns based on a number between 0 and 3. It creates a tuple object from those two columns. The cast() function is a type hint to inform the mypy tool that the result will be a two-tuple where both items are strings. This is required because it's difficult for the mypy tool to determine that the expression tuple(row[n*2:n*2+2]) will select exactly two elements from the row collection.
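
As a quick check of the column arithmetic, here is a self-contained sketch that applies series() to the two complete rows shown earlier. It repeats the function so that it runs on its own, and it uses str in place of Text, which is an alias for str:

```python
from typing import Iterable, Iterator, List, Tuple, cast

Pair = Tuple[str, str]

def series(n: int, row_iter: Iterable[List[str]]) -> Iterator[Pair]:
    """Yield columns 2n and 2n+1 of each row as an (x, y) pair."""
    for row in row_iter:
        yield cast(Pair, tuple(row[n * 2:n * 2 + 2]))

# Two rows copied from the data shown earlier.
rows = [
    ['10.0', '8.04', '10.0', '9.14', '10.0', '7.46', '8.0', '6.58'],
    ['8.0', '6.95', '8.0', '8.14', '8.0', '6.77', '8.0', '5.76'],
]

first_series = list(series(0, rows))   # columns 0 and 1
fourth_series = list(series(3, rows))  # columns 6 and 7

print(first_series)   # [('10.0', '8.04'), ('8.0', '6.95')]
print(fourth_series)  # [('8.0', '6.58'), ('8.0', '5.76')]
```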

We can now create a tuple-of-tuples collection, as follows:

with open("Anscombe.txt") as source:
    data = tuple(head_split_fixed(row_iter(source)))
    sample_I = tuple(series(0, data))
    sample_II = tuple(series(1, data))
    sample_III = tuple(series(2, data))
    sample_IV = tuple(series(3, data))

We applied the tuple() function to a composite function based on the head_split_fixed() and row_iter() functions. This creates an object that we can reuse in several other functions. If we don't materialize a tuple object, only the first sample will have any data. After that, the source iterator will be exhausted, and all other attempts to access it will yield empty sequences.
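
The exhaustion problem can be reproduced without the data file. In this sketch, pairs() is a hypothetical, simplified stand-in for series(), and iter(raw) stands in for the one-shot iterator produced by the composite function:

```python
from typing import Iterable, Iterator, List, Tuple

def pairs(n: int, rows: Iterable[List[str]]) -> Iterator[Tuple[str, ...]]:
    """Hypothetical stand-in for series(): pick two adjacent columns."""
    for row in rows:
        yield tuple(row[n * 2:n * 2 + 2])

raw = [['1', '2', '3', '4'], ['5', '6', '7', '8']]

# Case 1: a one-shot iterator is shared by both calls.
lazy_rows = iter(raw)
first = tuple(pairs(0, lazy_rows))   # consumes every row
second = tuple(pairs(1, lazy_rows))  # nothing left to consume

# Case 2: the rows are materialized once, then reused.
saved_rows = tuple(iter(raw))
first_ok = tuple(pairs(0, saved_rows))
second_ok = tuple(pairs(1, saved_rows))

print(first, second)        # (('1', '2'), ('5', '6')) ()
print(first_ok, second_ok)  # (('1', '2'), ('5', '6')) (('3', '4'), ('7', '8'))
```

Note that the second call in case 1 fails silently: it returns an empty tuple rather than raising an exception, which is why this kind of bug is easy to miss.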

The series() function will pick pairs of items to create the Pair objects. Again, we applied an overall tuple() function to materialize each resulting tuple-of-tuples sequence so that we can do further processing on each one.

The sample_I sequence looks as follows:

(('10.0', '8.04'), ('8.0', '6.95'), ('13.0', '7.58'), 
('9.0', '8.81'), ('11.0', '8.33'), ('14.0', '9.96'), 
('6.0', '7.24'), ('4.0', '4.26'), ('12.0', '10.84'), 
('7.0', '4.82'), ('5.0', '5.68'))

The other three sequences are similar in structure. The values, however, are quite different.

The final thing we'll need to do is create proper numeric values from the strings that we've accumulated so that we can compute some statistical summary values. We can apply the float() function conversion as the last step. There are many alternative places to apply the float() function, and we'll look at some choices in Chapter 5, Higher Order Functions.

Here is an example describing the usage of the float() function:

mean = (
    sum(float(pair[1]) for pair in sample_I) / len(sample_I)
)

This computes the mean of the y values across the two-tuples in sample_I. We can gather the same statistic for all four series as follows:

for subset in sample_I, sample_II, sample_III, sample_IV:
    mean = (
        sum(float(pair[1]) for pair in subset)/len(subset)
    )
    print(mean)

We computed a mean for the y values in each two-tuple built from the source dataset. We gave every series the same tuple-of-two-tuples structure so that one expression works for all four samples. Even so, using pair[1] can be an obscure way to reference a data item. In Chapter 7, Additional Tuple Techniques, we'll use named tuples to simplify references to items within a complex tuple.
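
As a rough sketch of the calculation, the conversion and the averaging can be wrapped in a small helper. The name mean_of() is hypothetical and is not part of the chapter's code; the data is the sample_I sequence shown earlier:

```python
from typing import Sequence, Tuple

Pair = Tuple[str, str]

sample_I: Tuple[Pair, ...] = (
    ('10.0', '8.04'), ('8.0', '6.95'), ('13.0', '7.58'),
    ('9.0', '8.81'), ('11.0', '8.33'), ('14.0', '9.96'),
    ('6.0', '7.24'), ('4.0', '4.26'), ('12.0', '10.84'),
    ('7.0', '4.82'), ('5.0', '5.68'),
)

def mean_of(index: int, pairs: Sequence[Pair]) -> float:
    """Hypothetical helper: mean of the item at `index` in each pair."""
    # float() is applied lazily, one string at a time, inside sum().
    return sum(float(pair[index]) for pair in pairs) / len(pairs)

x_mean = mean_of(0, sample_I)
y_mean = mean_of(1, sample_I)
print(round(x_mean, 2), round(y_mean, 2))  # 9.0 7.5
```

The helper requires a materialized sequence because it calls len(); a one-shot iterator would not work here.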

To reduce memory use and increase performance, we prefer to use generator expressions and functions as much as possible. These iterate through collections in a lazy (or non-strict) manner, computing values only when required. Since iterators can only be used once, we're sometimes forced to materialize a collection as a tuple (or list) object. Materializing a collection costs memory and time, so we do it reluctantly.
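
One way to observe this laziness is to record which values a generator function actually computes. The squares() function and the computed list below are illustrative assumptions, not code from the chapter:

```python
from typing import Iterator, List

computed: List[int] = []  # records each input the generator has processed

def squares(limit: int) -> Iterator[int]:
    """Yield n*n for n in range(limit), logging each n when it is computed."""
    for n in range(limit):
        computed.append(n)
        yield n * n

lazy = squares(1_000_000)  # creating the generator computes nothing
before = len(computed)

first_three = [next(lazy) for _ in range(3)]  # request only three values
after = len(computed)

print(before, first_three, after)  # 0 [0, 1, 4] 3
```

Only three of the million possible values were computed. Calling list(squares(1_000_000)) instead would compute and store all of them at once.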

Programmers familiar with Clojure can match Python's lazy generators with the lazy-seq and lazy-cat macros. The idea is that we can specify a potentially infinite sequence, but only take values from it as needed.
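
In Python, the same idea can be sketched with the standard itertools module. The evens() function is a made-up example of a conceptually infinite sequence; islice() takes only as many values as we ask for:

```python
from itertools import count, islice
from typing import Iterator

def evens() -> Iterator[int]:
    """A conceptually infinite sequence of even numbers."""
    for n in count(0):  # count() never terminates on its own
        yield 2 * n

# Take only the first five values; the rest are never computed.
first_five = list(islice(evens(), 5))
print(first_five)  # [0, 2, 4, 6, 8]
```

Calling list(evens()) directly would never finish, so an infinite generator must always be paired with something that limits how much of it is consumed.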