Using Google Refine to prepare data
Preparing data for CiviCRM import can be a time-consuming, frustrating, and traumatic experience. But it is a job that has to be done. CiviCRM does an enormous amount of error checking on data import and will not import records with errors it spots.
Consider this data:
Here you can see that there are data inconsistencies in the center column. The town Poulton-le-Fylde has five different ways of spelling and presenting the data. This is quite a common problem in legacy systems that were designed to hold addressing data for label printing rather than for searching or geocoding. Another common problem is having data in the wrong columns. Towns, cities, and postcodes are often spread across many columns. The result is you cannot guarantee accurate search results or do any geocoding.
How to do it…
Google Refine is an excellent tool for cleaning your data, which is free and easy to use. This recipe shows you some of the basics of Google Refine.
- Download and install Google Refine from http://code.google.com/p/google-refine/wiki/Downloads?tm=2.
- You can import data from CSV, Excel, or a variety of other files into Google Refine to see how the import mechanism works. It is incredibly easy. Here we have launched Google Refine and have imported this data:
Here you can see that the Area column is mostly empty. The County column contains a city, Nottingham, which should be in the city column. If you were importing thousands of addresses it would be very time-consuming to try and fix the data using a spreadsheet or a database. Google Refine makes it easy.
- At the top of the Town/City column we can choose Text Facet as shown in the following screenshot:
This creates a list in the left-hand pane of the Google Refine screen:
The list contains the unique values in the Town/City column and how many rows contain each unique value. So in this example we can see there are 1482 different town/city choices, and that for the choice Aberdeen there are 13 rows that have that value.
- Click on the Cluster button in the top-right of our list; we can group these values based on algorithms that Google Refine provides:
You can play around with the Method and Keying Function values to see which values in the cluster suit you best. In the preceding screenshot you can see that each cluster contains spelling and formatting errors in the address data.
For each cluster you can then enter a value in New Cell Value and then update the data. You can work through your data in this way and clear up errors very quickly.
- In the preceding data, some postal codes contain a "period", and both postal code and city data are scattered in different columns, so if you wanted to use this to geocode data it would not work.
- Select the Address2 column and pick up postal code data and store a copy in the Address6 column.
- From the drop-down menu on the Address4 column, select Text Filter. If you look at the postal codes, you can see that there is a pattern within them. The postal code prefix ends with a number, then there is a whitespace, then the postal code suffix begins with a number. So you can search for that pattern using a regular expression. You can find out more about regular expressions at http://www.regular-expressions.info.
The expression for any digit is
\d
, and for a space it is\s
. - Search on
\d\s\d
. That is the same as saying "find cells in this column that contain the pattern of a number followed by a space followed by another number." - Tick the regular expression box.
- As you type in the filter, Google Refine automatically refreshes the data to meet the filter criteria. You will see that this filters the postal code data properly.
- Copy your found postal code values to the Address6 column. Go to Edit Cells | Transform from the Address6 drop-down menu and copy the data from the Address4 column using Google Refine's syntax, as shown in the following screenshot:
- Repeat this procedure for every column you need to fix for the postal codes.
- Once you have them all in the Address6: we can rename the column. We go to the Address6 column menu and select Edit Column and rename the column to
Postal Code
. - Finally we can remove the period from the data. We will have to go to Edit Cells | Transform to do so. And the resultant page is shown in the following screenshot:
How it works…
Google Refine gives you a dynamic view of your data directly as you transform it.
Google Refine stores all of your transformations. This means they can be reapplied to data again. In situations where the same data needs to be imported several times during the course of a development project, this is a godsend.
See more
- Find more about Google Refine at http://code.google.com/p/google-refine/
- Find the Google Refine documentation at http://code.google.com/p/google-refine/wiki/DocumentationForUsers