1. Character encoding
1a) Character encoding
- Computers deal with everything in zeros and ones, including letters.
- Characters are one byte: a series of eight zeros and ones.
Each character had a value that could go from one to 255:
- (binary) 00000001 = (decimal) 1 /
- (binary) 11111111 = (decimal) 255 /
Each letter of the alphabet and other symbols fall on this range:
- Letter ‘a’ is 97 on this range, ‘b’ is 98, ‘c’ is 99, etc.
Encoding in spreadsheets like Google Spreadsheets (link):
- The function CHAR(number) converts a number to its character value
- =CHAR(97) becomes ‘a’
1b) Differences in encoding
Different countries have used different encoding for characteristics specific to their language:
- French character code 201 ‘É’
- Russian character code 201 is ‘Shcha’ or ‘Щ’
1c) The problem with differences in encoding
- West European encoding is called ISO-8859-1
- Cyrillic encoding is called ISO 8859-5
Tool: Sublime Text 2 (link)
Example: Richest French families (download file)
- Open the file with Sublime Text
You can copy the data directly onto a spreadsheet. - Go to ‘File’, then ‘Reopen with Encoding’ and click the encoding you need
With each data set: (4:55)
- Correct encoding needs to be set before opening a data set.
- The computer software does not recognize the proper data encoding by itself.
- Today the Unicode standard (UTF-8) is mostly used.
Example: Svenska Dagbladet
Svenska Dagbladet used crowdsourcing to investigate the interest rates on mortgages in Sweden. This data was placed in a Google spreadsheet.
Tool: OpenRefine (link)
Example: Swedish interest rates (download file)
- Open the file in OpenRefine
All the special Swedish characters are shown as black question marks. - Go to ‘Character encoding’ and click the encoding you need
This example requires the Unicode standard (UTF-8). - Click ‘Parse cell text into numbers’ to automatically convert values into numbers
2) Decimal Separator
2a) Converting the comma to a dot in your data
Name the project and click ‘Create Project’.
- Go to the column in your project which you wish to edit (9:30)
In the example this is the ‘Rate’ column - Go to ‘Edit cells’ and click ‘Transform’
- Under ‘Expression’ use the ‘Replace’ function
replace(original value, “replaceable text”, “new text”) or replace(value, “,”, “.”) - Click ‘OK’
2b) Indicate as number instead of text
- Go to the column in your project which you wish to edit
- Go to ‘Edit cells’, then to ‘Common transforms’ and click ‘To number’
- OpenRefine now recognizes that column as containing numbers
2c) Analysing the data
For example the distribution of rates:
- Go to the column of values you wish to analyse
- Go to ‘Facet’ and click ‘Numeric facet’
- A distribution of the numbers is shown
all data sets you collect.
Reading list
General use, must-reads: Guide to bulletproofing your data
In-depth reads: An enjoyable read on character encoding
Tutorials:
Open Refine
- Some Open Refine recipes
- Comprehensive Google Refine tutorial to clean the Swedish banks dataset
LibreOffice/Excel
- Tutorial on how to clean data with Libre Office or Excel
- Open a CSV file encoded with UTF-8 in Excel 2007