Skip to content

Latest commit

 

History

History
612 lines (444 loc) · 23 KB

openrefine.md

File metadata and controls

612 lines (444 loc) · 23 KB

OpenRefine: clean up and augment data

UQ Library 2024-09-16

OpenRefine?

OpenRefine is an open source program self-described as “a power tool for dealing with messy data”.

It can help you:

  • Get an overview of a data set, and explore subsets of it
  • Resolve inconsistencies in a data set, for example standardising date formatting or fixing typos
  • Split data up into more granular parts, for example splitting up cells with multiple authors into separate cells
  • Match local data up to other data sets, for example in matching local subjects against the Library of Congress Subject Headings
  • Enhance a data set populating it with data from other sources

Some common scenarios might be:

  • Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
  • Where you want to know how values are distributed across your whole data set
  • Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format
  • Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts
  • Where you have several bits of data combined together in a single column, and you want to separate them out into new columns.

What you will learn today

  1. Import and export data
  2. What “facets” and “filters” are
  3. Edit cells
  4. Edit columns
  5. Group and transform data
  6. Reconcile a column with Wikidata

Installation

This lesson was designed using OpenRefine 3.8 but should be compatible with other supported versions.

OpenRefine requires to have a Java Runtime Environment (JRE) installed. Depending on your operating system, you might need to install one from the Java website, or use the Windows kit with embedded Java.

To install and run OpenRefine on your operating system, please follow the instructions on the official website.

Create a project

OpenRefine can deal with formats like TSV, CSV, ODS, XLS and XLSX, JSON, RDF, XML and more.

Exercise 1 – Create a project and import data

  1. In the “Create project” screen that opens when starting OpenRefine, select Web Addresses (URL)
  2. Enter the URL to our data: https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv
  3. Click “Next”
  4. Click in the Character encoding box and select UTF-8
  5. This screen is also where you can ignore rows, name the project, assign tags…
  6. Click on Create Project

You might not have to change the character encoding, but if you see inaccuracies in how special characters are displayed, you will have to find the right one for your data.

Layout of OpenRefine

How is data organised? How do I navigate it?

OpenRefine displays data in a tabular format: rows usually are records, and columns are a type of information.

You can decide how many rows you want to show at once, and navigate the pages with the controls above the table.

Drop-down menus next to the column headers allow you to do operations on the whole column.

Reorganising the data

You can reorder and remove columns by clicking the drop-down menu at the top of the first column (labelled ‘All’), and choosing Edit columns > Re-order / remove columns….

You can then drag and drop column names to reorder the columns, or remove columns completely if they are not required. For example, move the Publisher column closer to the beginning, and remove the ISSNs column.

You can also sort the rows by clicking on the drop-down menu for the column you want to sort on, and choosing Sort.The sorting is not permanent, and a new Sort drop-down menu appears at the top, which allows you to modify the current sort, remove it, or make the sort permanent.

Facetting and filtering

What are filters and facets in OpenRefine? How can they help me navigate data? How can I correct common data issues?

Facets and filters will allow you to gain insights on your data, zero-in onto issues you want to fix by hand, or only apply automated edits to one subset of your dataset.

Facet and filter information always appears in the left-hand panel in the OpenRefine interface, and they can be combined for more complex searches.

Filters

You can apply a text filter to your data to do a search and restrict the displayed items to it.

In the column drop-down menu, you can use the Text filter item and specify any string you want to filter for.

For example, try filtering the Subjects column with the term physic. The number of matching rows is displayed above the table.

Facets

A facet groups all the values that appear in a column, and allows you to subset the data and edit values across many records at the same time.

The simplest type of Facet is called a ‘Text facet’. It groups all the text values in a column and lists each value with the number of records it appears in.

Let’s create a Text Facet for a column:

  1. Click on the drop-down menu at the top of the Publisher column and choose Facet > Text Facet
  2. Filter the data by clicking on a value
  3. Add more values with include
  4. You can invert your selection
  5. Click reset to deselect all values.

Exercise 2 – Explore licenses

Which licences are used for articles in this dataset? Which one is the most common? How many articles don’t have a licence?

  1. In the License column drop-down menu, choose Facet > Text facet
  2. CC BY has the highest count
  3. 6 articles don’t have a License value, marked (blank)
  4. Notice that filtering the data with this facet will update the values in the Publisher facet. This makes it easy to see which publishers use which licences.
  5. Close both facets with the “Remove All” button

There are more facets available, like numeric and timeline facets that produce graphs, or even custom facets for more advanced operations, like facets that return a logical value (“true” or “false”).

Exercise 3 – Facet by blank

Find all the publications without a DOI (digital object identifier) with the Facet by blank. How many are there?

  1. Facet > Customized facet > Facet by blank

This is more efficient than a text facet, as it will only give two categories: “true” for blank cells, “false” for cells with a value. You could then focus on the blank cells to populate the missing data by hand (an “Edit” button appears when you hover over a cell).

When you filter your data with facets or text filters, remember that any operation that you carry out will only apply to the matches.

Amending data

Using a text facet, you can edit values for a whole subset in one action with the edit option. This is very useful for fixing small typos or variations in spelling.

Exercise 4 – Correct values

Use a text facet on the Language column to replace English values with the language code EN.

  1. Facet > Text facet
  2. Hover over the term English and click edit
  3. Replace it with EN and click Apply

All the occurrences of English are replaced by EN, and the facet auto-updates to only show the remaining values.

Split multi-valued cells

How do I deal with cells that contain multiple values?

Author names are currently separated with the pipe symbol |. We can use the Split multi-valued cells function to put them in their own cells.

Exercise 5 – Split multi-valued cells

  1. Open the drop-down menu for the Authors column
  2. Select Edit cells > Split multi-valued cells
  3. Use the pipe character | as a separator
  4. Click OK

Note that we now have a lot more rows, and that they are still numbered sequentially.

How would we join data from several cells together, for example after cleaning the data up? We can do the opposite operation, with Edit cells > Join multi-valued cells, and deciding which character we want to use to separate values.

Make sure you choose the right delimiter for the kind of data you deal with!

Exercise 6 – Practise splitting and joining

How would you split and then join again the data in the Subjects column?

View modes

OpenRefine has two view modes for viewing data: Rows and Records. In Records mode, OpenRefine can link together multiple rows as belonging to the same record.

Switch to the records mode. Note how the numbering has changed: several rows are related to the same record, because of the splitting we just did.

If you now filter for an author name, notice how the “rows” and “records” view will change how much is displayed? We can stick to “records” to have all the associated information.

Clustering

What can I use to identify and replace varying forms of the same data?

Fixing variations by hand like we did for the languages works fine for small numbers of values, but if there are hundreds or more, it becomes very impractical.

The Cluster function uses an algorithm to group together similar, but inconsistent values in a given column and lets you merge these inconsistent values into a single value you choose.

This is very effective where you have data with minor variations in values, e.g. names of people, organisations, places, classification terms…

Exercise 7 – clean up the author names

  1. Make sure your authors are already split into different rows
  2. Choose Edit cells > Cluster and edit from the Authors column
  3. Using the defaults (key collision Method and fingerprint Keying Function), click “Cluster” and work through the clusters of values, merging them into a single value where appropriate

By default, OpenRefine uses the most common value to merge the data, but we can choose which value we prefer by clicking on it, or we can write our own replacement value.

  1. You can also try changing the clustering method being used – which ones work well? Do they identify extra clusters?

The best clustering algorithm will depend on the kind of data your are processing. For more information on available algorithms, see the documentation page on clustering methods.

Transformations

Facets, filters and clustering already allow us to explore and clean up our data.

To do more advanced operations, like splitting data into several columns, standardising a format without losing the original values, or extracting a data type from a string, we need transformations.

Transformations are ways of manipulating data in columns. They are normally written in a special language called GREL (General Refine Expression Language).

Some common transformations are accessible directly in the menus, for example to change the case of the values, or to remove bounding whitespace characters.

Exercises 8 – Correct publisher data

  1. Create a text facet on the Publisher column
  2. Note that in the values there are two that look identical – why does this value appear twice?
  3. On the Publisher column, use the drop-down menu to select Edit cells > Common transforms > Collapse consecutive whitespace
  4. Look at the publisher facet now – has it changed? (if it hasn’t changed, try clicking the Refresh option to make sure it updates)

Writing transformations

The transformation screen is available through Edit cells > Transform..., in which you can write your custom GREL command with various functions and preview its effect.

GREL supports two types of syntax:

  • value.function(options)
  • function(value, options)

Either is valid, and which is used is completely down to personal preference. In these notes the first syntax is used, as it is easier to read when writing long commands.

The most simple “transformation” is value, which will keep the data as it is. We need to add extra functions to the command to actually modify the original value.

Exercise 9 – Put titles into Title Case

Use Facets and the GREL function toTitlecase() to put some titles in Title Case:

  1. Facet by Publisher
  2. Select “Akshantala Enterprises” and “Society of Pharmaceutical Technocrats” (To select multiple values in the facet, use the “include” link that appears to the right of the facet.)
  3. See that the Titles for these are all in uppercase
  4. Click the drop-down menu on the Title column
  5. Choose Edit cells > Transform...
  6. In the Expression box type value.toTitlecase()
  7. In the Preview tab, see the effect of running this will be
  8. Click “OK”

Note that this transformation is also available in the Common transforms.

Back in the Transform dialog, you can now see a history of transformations, and save your favourite ones by clicking on the star next to them. The Help tab is also a helpful reference.

Undo and redo actions

In the left-hand panel, you can open the Undo / Redo tab to access all the steps taken so far, and undo steps by clicking the last step you want to preserve.

The Extract... button allows you to save a set of steps as a JSON (JavaScript Object Notation) script, to re-apply them later on this or any other dataset. To execute an extracted set of steps, click the Apply button.

Transforming strings, numbers, dates and boolean

How do I transform different data types?

Data types and regular expressions will allow you to write more complex GREL transformations.

All data in OpenRefine has a “type”. The most common is “string”, which is a piece of text. Data types supported are:

  • String
  • Number
  • Date-time
  • Boolean (logical)
  • Array

Date and numbers: we currently have a Date column where the data is represented as a string. If we wanted to sort according to this data, it would not end up chronological. We therefore need to convert the values to a date data type for OpenRefine to interpret them properly.

Exercise 10 – Reformat the date

  1. Make sure you remove all Facets and Filters
  2. On the Date column, use the drop-down menu to select Edit cells > Common transforms > To date
  3. Note how the values are now displayed in green and follow a standard convention for their display format (ISO8601) - this indicates they are now stored as date-time data types in OpenRefine. We can now carry out functions that are specific to Dates.
  4. In the Date column drop-down, select Edit column > Add column based on this column. Using this dialog, you can create a new column, while preserving the old column.
  5. In the ‘New column name’ type “Formatted Date”
  6. In the ‘Expression’ box, type the GREL expression value.toString("dd MMMM yyyy")

Having a column with the proper data type will also allow us to use features like the timeline facet.

Booleans: a boolean is a binary value that can either be “true” or “false”. They are often used in GREL expressions, for example:

value.contains("test")

… generates a boolean value which depends on whether the cell value contains the string “test” anywhere.

That allows use to build more complex transformations, for example by carrying out a transformation only if a test is successful:

if(value.contains("test"),"replacement text",value)

… replaces a cell value with the string “replacement text” only if the value in the cell contains the string “test” anywhere.

Handling arrays

There is another inconsistency in our “Authors” column: a few names are using the order <last name>, <first name>, and all others use the <first name> <last name> order. We’d like to stick to the same order for all, but to do that, we need to talk about arrays.

How do I use arrays in a GREL expression?

An array is a list of values. Arrays can be sorted, de-duplicated and manipulated in other ways in GREL expressions. They usually are the result of a transformation, and cells can’t contain them.

For example, if a cell contains the following string:

“Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday”

… it can be transformed into an array with the split function:

value.split(",")

… which would result in the following array:

[“Monday”,”Tuesday”,”Wednesday”,”Thursday”,”Friday”,”Saturday”,”Sunday”]

Combining the split function with a sort function like so:

value.split(",").sort()

… would result in an array of days of the week sorted alphabetically:

[“Friday”,”Monday”,”Saturday”,”Sunday”,”Thursday”,”Tuesday”,”Wednesday”]

You can output a value from an array by specifying its position in the list. To extract the first value of the array created by the split function:

value.split(",")[0]

Many programming languages, like GREL, start indexing at 0, which is why we have to use the 0th position to return the first element of the array.

If you want to go back to one single string, in order to store the result in cells, you can use the join function:

value.split(",").sort().join(",")

Exercise 11 – Reverse author names

We can focus on the rows that have a comma in their author name:

  1. Using the “Authors” column drop-down menu, use a Text filter
  2. Filter for the comma character: “,”.

Now that we have narrowed down to the lines with a comma in the author name, we can use the split() function.

  1. On the Authors column use the drop-down menu and select Edit cells > Transform
  2. In the Expression box type value.split(", "). This will create an array of separate last name and first name, removing the comma and space.

To get the author name in the natural order, you can reverse the array and join it back together with a space to create the string you need:

  1. In the Expression box, add to the existing expression until it reads value.split(", ").reverse().join(" ")
  2. In the Preview section, you should be able see this has reversed the array, and joined it back into a string, without any comma
  3. Click OK to apply your transformation, and notice how your filter shows 0 results, because all commas have been removed in the “Authors” column.

For more elaborate array creation, we can use the match() function along with Regular Expressions. Regular Expressions, or “Regex”, use a syntax to match pretty much any pattern of characters.

Reconciliation

We can associate one of our columns to an external database by using OpenRefine’s reconciliation feature:

Exercise 12 – Reconcile publisher data against Wikidata

  • On the “Publisher” column, select Reconcile > Start reconciling.... The default service listed is Wikidata.
  • The tool will automatically try to match the column to a Wikidata type, but we can specify exactly the one we want to use.
  • After checking the existing matches, and completing by hand the missing ones, we can add extra data to our project by using Edit column > Add columns for reconciled values. For example, try adding the headquarters location of the publishers into a new column.

Exporting the data and saving a project

How do I export transformed data?

OpenRefine can export a transformed dataset into a variety of formats. You can access the menu with the top-right Export button. The Custom Tabular Exporter offers extra options, like selecting a subset of columns, reordering them, removing headers, selecting a format and uploading as a Google Spreadsheet.

Your project is automatically saved every 15 minutes and can be reopened with Open Project when you next start OpenRefine. However, if you need to share or save a snapshot of your project, you can use Export > Export project to download an archive.

When you quit OpenRefine, make sure you interrupt the OpenRefine process properly with Ctrl + C (for both Windows and Linux) or Command + Q (for MacOS) in the terminal, to guarantee that your project changes are saved. See the official documentation on starting and exiting OpenRefine.

Useful links

OpenRefine learning:

Regular Expressions:

Other data sources for reconciliation:

Examples of OpenRefine uses:

Another dataset to play around with:

Licence

This lesson is based on a Library Carpentry lesson, available here: https://librarycarpentry.github.io/lc-open-refine/ It is also released under a Creative Commons Attribution license. The following is a human-readable summary of (and not a substitute for) the full legal text of the CC BY 4.0 license.

You are free:

  • to Share – copy and redistribute the material in any medium or format
  • to Adapt – remix, transform, and build upon the material

for any purpose, even commercially.

Under the following terms:

  • Attribution – You must give appropriate credit (mentioning that your work is derived from work that is Copyright © Stéphane Guillou and, where practical, linking to https://gitlab.com/stragu/DSH), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.