본문 바로가기

카테고리 없음

Character encoding

1. Character encoding 

1a) Character encoding

  • Computers deal with everything in zeros and ones, including letters.
  • Characters are one byte: a series of eight zeros and ones.

Each character had a value that could go from one to 255:

  • (binary) 00000001 = (decimal) 1 /
  • (binary) 11111111 = (decimal) 255 /

Each letter of the alphabet and other symbols fall on this range:

  • Letter ‘a’ is 97 on this range, ‘b’ is 98, ‘c’ is 99, etc.

Encoding in spreadsheets like Google Spreadsheets (link):

  • The function CHAR(number) converts a number to its character value
  • =CHAR(97) becomes ‘a’

1b) Differences in encoding 

Different countries have used different encoding for characteristics specific to their language:

  • French character code 201 ‘É’
  • Russian character code 201 is ‘Shcha’ or ‘Щ’

1c) The problem with differences in encoding 

  • West European encoding is called ISO-8859-1
  • Cyrillic encoding is called ISO 8859-5

Tool: Sublime Text 2 (link)

Example: Richest French families (download file)

  • Open the file with Sublime Text
    You can copy the data directly onto a spreadsheet.
  • Go to ‘File’, then ‘Reopen with Encoding’ and click the encoding you need

With each data set: (4:55)

  • Correct encoding needs to be set before opening a data set.
  • The computer software does not recognize the proper data encoding by itself.
  • Today the Unicode standard (UTF-8) is mostly used.

Example: Svenska Dagbladet 

Svenska Dagbladet used crowdsourcing to investigate the interest rates on mortgages in Sweden. This data was placed in a Google spreadsheet.

Tool: OpenRefine (link)

Example: Swedish interest rates (download file

  • Open the file in OpenRefine
    All the special Swedish characters are shown as black question marks.
  • Go to ‘Character encoding’ and click the encoding you need
    This example requires the Unicode standard (UTF-8). 
  • Click ‘Parse cell text into numbers’ to automatically convert values into numbers

2) Decimal Separator 

2a) Converting the comma to a dot in your data

Name the project and click ‘Create Project’.

  • Go to the column in your project which you wish to edit (9:30)
    In the example this is the ‘Rate’ column
  • Go to ‘Edit cells’ and click ‘Transform’
  • Under ‘Expression’ use the ‘Replace’ function 
    replace(original value, “replaceable text”, “new text”) or replace(value, “,”, “.”)
  • Click ‘OK’

2b) Indicate as number instead of text 

  • Go to the column in your project which you wish to edit
  • Go to ‘Edit cells’, then to ‘Common transforms’ and click ‘To number’
  • OpenRefine now recognizes that column as containing numbers

2c) Analysing the data 

For example the distribution of rates:

  • Go to the column of values you wish to analyse
  • Go to ‘Facet’ and click ‘Numeric facet’
  • A distribution of the numbers is shown

 all data sets you collect.


Reading list

General use, must-reads: Guide to bulletproofing your data

In-depth reads: An enjoyable read on character encoding

Tutorials:

Open Refine

LibreOffice/Excel