1) Why OpenRefine
- It’s free of charge.
- It’s open source, allowing for improvements by its users.
- It’s more powerful and easier to use than Microsoft Excel.
- It’s not localised, so the functions are always in English.
This makes work easier to share.
1a) Installing OpenRefine
- Download and install the version fitting to your computer.
- No admin privileges are needed to run OpenRefine.
- For issues you can ask OpenRefine on Twitter.
1b) Launching OpenRefine
- Launching the program will open a new page in your browser:
- ‘localhost:3333’ or ‘127.0.0.1:333’
- OpenRefine creates a server on your own computer and the browser is connecting to this.
- Everything you do in OpenRefine remains on your own computer.
- Important in case of sensitive information that should not be shared online.
Example - Height of 40 people.csv
- Browse for the file to open and click ‘Next’
- OpenRefine will show a preview of the first 100 rows of data
- In this overview you tell OpenRefine how your data is formatted:
- What character encoding
- What column separation
- Parse cell text into numbers, dates, …
- Give a name to the data set and click ‘Create Project’
You are now in the OpenRefine main screen.
- You are faced with a subset of your data:
- OpenRefine displays as little information as possible to save CPU power
2) Using OpenRefine
OpenRefine is useful for analysing data and transforming and cleaning data.
2a) Analysing data
- The analysing feature is ‘Facet’ which is similar to a pivot table.
Takes all the values in a column and presents it as a filter.
Example:
- Click on the column you need – Place of birth – go to ‘Facet’ and click on ‘Text facet’
- 4 different values for Place of birth, with 10 rows each
- Click on Gender, go to ‘Facet’ and click on ‘Text facet’
- 2 different values for Gender with 20 rows each
- This allows you to quickly select to values you are interested in.
- Click on ‘F’ and ‘Spain’ and you will see the 6 rows that correspond to these.
In a big database, you can use facets to quickly mine your data for what you need.
- Click on ‘reset’ to go back to the complete dataset facet.
2b) Edit cells
- OpenRefine allows for editing of cells in batch.
- From country to regions, edit Denmark and Sweden into ‘North’ and Italy and Spain into ‘South’
- In the text facet click on ‘edit’ for a value and rename that value
- 2 different values for Place of birth with 20 rows each
- All the changes have been applied to the complete dataset
- This can be undone with ‘Undo/Redo’ and click on the first step ‘Create Project’
- These previous steps will remain even when you close OpenRefine and can be extracted to share
2c) Transform data
Similar to Microsoft Excel, OpenRefine works with a system of formulas.
- Convert the Height from inches to centimeters:
- Click on the column you need – Height – go to ‘Edit cells’ and click on ‘Transform’
- Add value * 2.54, which will give the original value and the new value, then click ‘OK’
- The transformation is applied to the entire dataset and now shows Height in centimetres.
- A second transformation to minimise the number of decimals:
- Click on the column you need – Height – go to ‘Edit cells’ and click on ‘Transform’
- Add round(value) and click ‘OK’ to round the numbers with no decimals
You have learned to
- Transform data.
- Use facets.