Practical Data Analysis

Data scrubbing

Data scrubbing, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated.

The result of the data analysis process depends not only on the algorithms, but also on the quality of the data. That's why the next step after obtaining the data is data scrubbing. To avoid dirty data, our dataset should possess the following characteristics:

  • Correctness
  • Completeness
  • Accuracy
  • Consistency
  • Uniformity

Dirty data can be detected by applying simple statistical validations, by parsing text, or by removing duplicate values. Missing or sparse data can lead to highly misleading results.

Statistical methods

With this method, we need some context about the problem (domain knowledge) to find values that are unexpected and thus erroneous. Even if a value matches the expected data type, it may still fall outside the valid range; this can be resolved by setting the value to an average such as the mean. Statistical validations can also handle missing values, which can be replaced by one or more probable values using interpolation, or removed by reducing the dataset using decimation. The following concepts are involved (a short sketch in Python follows this list):

  • Mean: This is the value calculated by summing up all values and then dividing by the number of values.
  • Median: The median is the middle value in a sorted list of values.
  • Range Constraints: The numbers or dates should fall within a certain range. That is, they have minimum and/or maximum possible values.
  • Clustering: Usually, when we obtain data directly from users, some values include ambiguity or refer to the same value with a typo. For example, Buchanan Deluxe 750ml 12 x 01 and Buchanan Deluxe 750ml 12 x 01., which differ only by a dot, or Microsoft or MS instead of Microsoft Corporation, which all refer to the same company and are all valid values. In those cases, grouping can help us obtain accurate data and eliminate duplicates, enabling faster identification of unique values.
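
The following is a minimal sketch of these validations in plain Python; the price list, range limits, and product labels are hypothetical, introduced only for illustration:

from statistics import mean, median
import difflib

# Hypothetical prices; None marks a missing value, 9999.0 is out of range
prices = [12.5, 13.0, None, 12.8, 9999.0, 13.2]
MIN_PRICE, MAX_PRICE = 0.0, 100.0  # assumed range constraints

# Range constraint: keep only values inside the valid range
valid = [p for p in prices if p is not None and MIN_PRICE <= p <= MAX_PRICE]
print(mean(valid))    # 12.875
print(median(valid))  # 12.9

# Replace missing or out-of-range values with the mean
cleaned = [p if p is not None and MIN_PRICE <= p <= MAX_PRICE
           else mean(valid) for p in prices]
print(cleaned)

# Clustering near-duplicates: fuzzy matching catches typo-level variants
labels = ['Buchanan Deluxe 750ml 12 x 01', 'Buchanan Deluxe 750ml 12 x 01.']
print(difflib.get_close_matches(labels[0], labels[1:], cutoff=0.9))
# ['Buchanan Deluxe 750ml 12 x 01.']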

Text parsing

We perform parsing to validate whether a string of data is well formatted and to avoid syntax errors.

Text fields such as dates, e-mails, phone numbers, and IP addresses usually have to be validated with regular expression patterns. Regex is a common abbreviation for regular expression.

In Python, we will use the re module to work with regular expressions and perform text searches and pattern validations.

First, we need to import the re module:

import re

In the following examples, we will implement three of the most common validations (e-mail, IP address, and date format):

  • E-mail validation:
    myString = 'From: readers@packt.com (readers email)'
    result = re.search(r'([\w.-]+)@([\w.-]+)', myString)
    if result:
        print(result.group(0))  # whole match
        print(result.group(1))  # user name
        print(result.group(2))  # domain
    Output:
    >>> readers@packt.com
    >>> readers
    >>> packt.com
    

    The function search() scans through a string, looking for any location where the regex matches. The function group() returns the string matched by the regex. The pattern \w matches any alphanumeric character and is equivalent to the class [a-zA-Z0-9_].

  • IP address validation:
    isIP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
    myString = " Your IP is:  192.168.1.254  "
    result = re.findall(isIP, myString)
    print(result)
    Output:
    >>> ['192.168.1.254']
    

    The function findall() finds all the substrings where the regex matches and returns them as a list. The pattern \d matches any decimal digit and is equivalent to the class [0-9].
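
    Note that \d{1,3} also accepts out-of-range octets, such as 999.999.999.999. The following stricter pattern is a sketch we add here (not part of the original example) that limits each octet to 0-255:

    octet = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'  # one octet, 0 to 255
    isStrictIP = re.compile(r'\b' + octet + r'(?:\.' + octet + r'){3}\b')
    print(re.findall(isStrictIP, " Your IP is:  192.168.1.254  "))
    Output:
    >>> ['192.168.1.254']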

  • Date format:
    myString = "01/04/2001"
    isDate = re.match('[0-1][0-9]\/[0-3][0-9]\/[1-2][0-9]{3}',myString)
    if isDate:
        print("valid")
    else:
        print("invalid")
    Output:
    >>> 'valid' 
    

    The function match() checks whether the regex matches at the beginning of the string. The pattern uses character classes such as [0-1][0-9] in order to parse the MM/DD/YYYY date format.
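
    Character classes alone cannot reject impossible dates such as 02/30/2001. A common complement, sketched here assuming the same MM/DD/YYYY format, is to let the standard datetime module parse the string:

    from datetime import datetime

    def is_valid_date(text):
        # strptime raises ValueError for impossible dates such as 02/30/2001
        try:
            datetime.strptime(text, '%m/%d/%Y')
            return True
        except ValueError:
            return False

    print(is_valid_date('01/04/2001'))  # True
    print(is_valid_date('02/30/2001'))  # False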

    Tip

    For more information about regular expressions, visit the link http://docs.python.org/3.2/howto/regex.html#regex-howto.

Data transformation

Data transformation is usually related to databases and data warehouses, where values are extracted from a source format, transformed, and loaded into a destination format.

Extract, Transform, and Load (ETL) obtains data from data sources, performs transformation functions according to our data model, and loads the resulting data into the destination.

  • Data extraction allows us to obtain data from multiple data sources, such as relational databases, data streaming, text files (JSON, CSV, XML), and NoSQL databases.
  • Data transformation allows us to cleanse, convert, aggregate, merge, replace, validate, format, and split data.
  • Data loading allows us to load data into a destination format, such as relational databases, text files (JSON, CSV, XML), and NoSQL databases (a minimal sketch follows this list).
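
A minimal sketch of these three steps in Python, assuming a hypothetical products.csv source file with name and price fields:

import csv
import json

# Extract: read rows from a CSV data source (hypothetical file)
with open('products.csv', newline='') as src:
    rows = list(csv.DictReader(src))

# Transform: cleanse and convert values to fit the destination model
for row in rows:
    row['name'] = row['name'].strip().title()
    row['price'] = float(row['price'])

# Load: write the resulting data into a JSON destination
with open('products.json', 'w') as dst:
    json.dump(rows, dst, indent=2)
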
Tip

In statistics, data transformation refers to the application of a mathematical function to the dataset or time series points.