Parsing an XML file
We'll start by parsing an Extensible Markup Language (XML) file to get the raw latitude and longitude pairs. This will show you how we can encapsulate some not-quite-functional features of Python to create an iterable sequence of values.
We'll make use of the xml.etree.ElementTree module. After parsing, the resulting ElementTree object has a findall() method that returns the elements matching a given path, which we can then iterate through.
We'll be looking for constructs, such as the following XML example:
<Placemark><Point>
<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>
</Point></Placemark>
The file will have a number of <Placemark> tags, each of which has a point and coordinate structure within it. This is typical of Keyhole Markup Language (KML) files that contain geographic information.
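To make the findall() idea concrete, here is a minimal sketch that searches a small in-memory fragment. The SAMPLE string is a hypothetical, namespace-free stand-in modeled on the example above; a real KML file also carries XML namespaces, which need a namespace mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only; real KML tags are namespaced.
SAMPLE = (
    "<Folder>"
    "<Placemark><Point><coordinates>-76.33029518659048,37.54901619777347,0"
    "</coordinates></Point></Placemark>"
    "<Placemark><Point><coordinates>-76.27383399999999,37.840832,0"
    "</coordinates></Point></Placemark>"
    "</Folder>"
)

root = ET.fromstring(SAMPLE)

# findall() returns the matching Element objects in document order;
# each element's .text attribute holds the raw comma-separated string.
coordinate_texts = [
    element.text
    for element in root.findall("./Placemark/Point/coordinates")
]
print(coordinate_texts)
```

Each string still needs to be split into its components, which is the job of the lower-level processing.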
Parsing an XML file can be approached at two levels of abstraction. At the lower level, we need to locate the various tags, attribute values, and content within the XML file. At a higher level, we want to make useful objects out of the text and attribute values.
The lower-level processing can be approached in the following way:
import xml.etree.ElementTree as XML
from typing import Text, List, TextIO, Iterable

def row_iter_kml(file_obj: TextIO) -> Iterable[List[Text]]:
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2"}
    path_to_points = (
        "./ns0:Document/ns0:Folder/ns0:Placemark/"
        "ns0:Point/ns0:coordinates")
    doc = XML.parse(file_obj)
    return (
        comma_split(Text(coordinates.text))
        for coordinates in doc.findall(path_to_points, ns_map))
This function requires an open text file object, typically one opened via a with statement. The result is a generator that creates list objects from the coordinate values of each point. As a part of the XML processing, this function includes a simple static dict object, ns_map, that provides the namespace mapping information for the XML tags we'll be searching. This dictionary will be used by the ElementTree.findall() method.
The essence of the parsing is a generator expression that uses the sequence of tags located by doc.findall(). The text of each tag is then processed by a comma_split() function to tease the text value into its comma-separated components.
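The following sketch exercises this lower-level function end to end. It repeats the two functions from this section so that it runs on its own (using str in place of the Text alias), and feeds them a hypothetical two-point KML document held in an io.StringIO object rather than a real file. The SAMPLE_KML content is an assumption made for illustration.

```python
import io
import xml.etree.ElementTree as XML
from typing import Iterable, List, TextIO

def comma_split(text: str) -> List[str]:
    return text.split(",")

def row_iter_kml(file_obj: TextIO) -> Iterable[List[str]]:
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2"}
    path_to_points = (
        "./ns0:Document/ns0:Folder/ns0:Placemark/"
        "ns0:Point/ns0:coordinates")
    doc = XML.parse(file_obj)
    return (
        comma_split(str(coordinates.text))
        for coordinates in doc.findall(path_to_points, ns_map))

# Hypothetical sample document; the default namespace matches "ns0" above.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder>
<Placemark><Point><coordinates>-76.33029518659048,37.54901619777347,0</coordinates></Point></Placemark>
<Placemark><Point><coordinates>-76.27383399999999,37.840832,0</coordinates></Point></Placemark>
</Folder></Document>
</kml>"""

# StringIO stands in for a file opened via a with statement.
with io.StringIO(SAMPLE_KML) as source:
    rows = list(row_iter_kml(source))

for row in rows:
    print(row)
```

The prefix ns0 in the search path is only a local label. What has to match is the namespace URI, which is why the unprefixed tags in the sample document are still found.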
The comma_split() function is the functional version of the split() method of a string, which is as follows:
def comma_split(text: Text) -> List[Text]:
    return text.split(",")
We've used the functional wrapper to emphasize a slightly more uniform syntax. We've also added explicit type hints to make it clear that text is converted to a list of text values. Without the type hint, there are two potential definitions of split() that could be meant. The method applies to bytes as well as str. We've used the Text type name, which is an alias for str in Python 3.
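One practical consequence of the prefix form is that the function can be handed directly to map(). The sketch below assumes a small list of raw coordinate strings, named raw here purely for illustration, and uses str in place of the Text alias.

```python
from typing import List

def comma_split(text: str) -> List[str]:
    return text.split(",")

# Hypothetical raw coordinate strings, as they might appear in the XML text.
raw = ["-76.459503,38.331501,0", "-76.47350299999999,38.976334,0"]

# The wrapper already fixes the separator, so no lambda is needed
# to supply the "," argument that the split() method requires.
rows = list(map(comma_split, raw))
print(rows)
```

Using the method directly would need something like a lambda to bind the separator, which is the kind of syntactic noise the wrapper avoids.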
The result of the row_iter_kml() function is an iterable sequence of rows of data. Each row will be a list composed of three strings: the longitude, latitude, and altitude of a waypoint along this path. This isn't directly useful yet. We'll need to do some more processing to extract the latitude and longitude and convert these two numbers into useful floating-point values.
This idea of an iterable sequence of tuples (or lists) allows us to process some kinds of data files in a simple and uniform way. In Chapter 3, Functions, Iterators, and Generators, we looked at how Comma Separated Values (CSV) files are easily handled as rows of tuples. In Chapter 6, Recursions and Reductions, we'll revisit the parsing idea to compare these various examples.
The output from the preceding function looks like the following example:
[['-76.33029518659048', '37.54901619777347', '0'],
['-76.27383399999999', '37.840832', '0'],
['-76.459503', '38.331501', '0'],
etc.
['-76.47350299999999', '38.976334', '0']]
Each row is the source text of the <ns0:coordinates> tag split on the comma (,) that's part of the text content. The values are the east-west longitude, north-south latitude, and altitude. We'll apply some additional functions to the output of this function to create a usable subset of this data.
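As a rough sketch of what such additional processing might look like, the generator function below drops the altitude, reorders each row to latitude first, and converts the strings to floats. The name lat_lon_float is hypothetical, and this is only one possible arrangement of the steps, not necessarily the decomposition used later.

```python
from typing import Iterable, Iterator, List, Tuple

def lat_lon_float(rows: Iterable[List[str]]) -> Iterator[Tuple[float, float]]:
    """Yield (latitude, longitude) float pairs from (lon, lat, alt) text rows."""
    for lon, lat, _altitude in rows:
        yield float(lat), float(lon)

# Two rows copied from the sample output shown earlier.
sample_rows = [
    ["-76.33029518659048", "37.54901619777347", "0"],
    ["-76.27383399999999", "37.840832", "0"],
]

points = list(lat_lon_float(sample_rows))
print(points)
```

Because the function accepts any iterable of rows, it could consume the generator returned by row_iter_kml() directly, without building an intermediate list.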