본문 바로가기

카테고리 없음

OpenRefine

1) Why OpenRefine 

  • It’s free of charge.
  • It’s open source, allowing for improvements by its users.
  • It’s more powerful and easier to use than Microsoft Excel.
  • It’s not localised, so the functions are always in English. 
    This makes work easier to share.

1a) Installing OpenRefine 

OpenRefine:

  • Download and install the version fitting to your computer.
  • No admin privileges are needed to run OpenRefine.
  • For issues you can ask OpenRefine on Twitter.

1b) Launching OpenRefine 

  • Launching the program will open a new page in your browser:
    • ‘localhost:3333’ or ‘127.0.0.1:333’
  • OpenRefine creates a server on your own computer and the browser is connecting to this.
  • Everything you do in OpenRefine remains on your own computer.
    • Important in case of sensitive information that should not be shared online.

Example - Height of 40 people.csv 

  • Browse for the file to open and click ‘Next’
  • OpenRefine will show a preview of the first 100 rows of data
  • In this overview you tell OpenRefine how your data is formatted:
    • What character encoding
    • What column separation
    • Parse cell text into numbers, dates, …
  • Give a name to the data set and click ‘Create Project’

You are now in the OpenRefine main screen.

  • You are faced with a subset of your data:
    • OpenRefine displays as little information as possible to save CPU power

2) Using OpenRefine

OpenRefine is useful for analysing data and transforming and cleaning data.

2a) Analysing data 

  • The analysing feature is ‘Facet’ which is similar to a pivot table.
    Takes all the values in a column and presents it as a filter. 

Example:

  • Click on the column you need – Place of birth – go to ‘Facet’ and click on ‘Text facet’
    • 4 different values for Place of birth, with 10 rows each
  • Click on Gender, go to ‘Facet’ and click on ‘Text facet’
    • 2 different values for Gender with 20 rows each
  • This allows you to quickly select to values you are interested in.
  • Click on ‘F’ and ‘Spain’ and you will see the 6 rows that correspond to these.

In a big database, you can use facets to quickly mine your data for what you need.

  • Click on ‘reset’ to go back to the complete dataset facet.

2b) Edit cells 

  • OpenRefine allows for editing of cells in batch.
  • From country to regions, edit Denmark and Sweden into ‘North’ and Italy and Spain into ‘South’
    • In the text facet click on ‘edit’ for a value and rename that value
    • 2 different values for Place of birth with 20 rows each
  • All the changes have been applied to the complete dataset
    • This can be undone with ‘Undo/Redo’ and click on the first step ‘Create Project’
    • These previous steps will remain even when you close OpenRefine and can be extracted to share

2c) Transform data

Similar to Microsoft Excel, OpenRefine works with a system of formulas.

  • Convert the Height from inches to centimeters:
    • Click on the column you need – Height – go to ‘Edit cells’ and click on ‘Transform’
    • Add value * 2.54, which will give the original value and the new value, then click ‘OK’
  • The transformation is applied to the entire dataset and now shows Height in centimetres.
  • A second transformation to minimise the number of decimals:
    • Click on the column you need – Height – go to ‘Edit cells’ and click on ‘Transform’
    • Add round(value) and click ‘OK’ to round the numbers with no decimals

You have learned to 

  • Transform data.
  • Use facets.