Lexer, parser, and manipulator for messy date metadata values
Add this line to your application’s Gemfile:
gem 'emendate'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install emendate
Reference documentation for using and understanding the application: https://github.com/kspurgin/emendate/tree/main/docs
Code/API documentation: coming soon
You can play with Emendate after installation by doing the following (assuming you cloned the repo into a folder called emendate
):
cd emendate bin/console
The main command is:
Emendate.parse('early 19th c.')
The parsed result is an Emendate::Result
Ruby object:
#<Emendate::Result:0x00007fbbdcbd5bd8 @dates= [#<Emendate::ParsedDate:0x00007fbbdcbd5e58 @certainty=[], @date_end=nil, @date_end_full="1834-12-31", @date_start=nil, @date_start_full="1801-01-01", @inclusive_range=true, @index_dates=[], @original_string=nil>], @errors=[], @original_string="early 19th c.", @warnings=[]>
You may pass in the following to return a Ruby Hash:
Emendate.parse('mid 1990s').to_h => {:original_string=>"mid 1990s", :dates=> [{:original_string=>nil, :index_dates=>[], :date_start=>nil, :date_end=>nil, :date_start_full=>"1994-01-01", :date_end_full=>"1996-12-31", :inclusive_range=>true, :certainty=>[]}], :errors=>[], :warnings=>["Interpreting pluralized year as decade"]}
Or, the following to return JSON:
Emendate.parse('mid 1990s').to_json => "{\"original_string\":\"mid 1990s\",\"dates\":[{\"original_string\":null,\"index_dates\":[],\"date_start\":null,\"date_end\":null,\"date_start_full\":\"1994-01-01\",\"date_end_full\":\"1996-12-31\",\"inclusive_range\":true,\"certainty\":[]}],\"errors\":[],\"warnings\":[\"Interpreting pluralized year as decade\"]}"
For more details on the returned result, see docs/output.adoc.
For extended use details see docs/use.adoc. This includes information on:
-
passing in date parsing options
-
working with the example set to get an idea of what functionality is currently present
-
using a utility script to bulk run date values in a CSV through Emendate, getting back a CSV report of Emendate’s output
-
getting a more detailed result back to dig into how/why you are getting a given result (probably only useful for debugging/development)
Legacy metadata contains date data encoded using various standard and semi-standard date-recording conventions, or without any consistent conventions.
Emendate’s goal is to parse such date strings and return structured data objects that can be used by other applications to supply consistently formatted date values in metadata or indexes.
The test set of date formats Emendate intends to support can be viewed here. Note that formats with an unparseable
tag are not expected to be supported in any realistic timeframe.
I am responsible for migrating client data to new systems. The target systems vary in the way date data must be prepared:
-
Islandora 7 wants dates encoded in MODS XML
-
Islandora 8 wants EDTF strings
-
CollectionSpace wants XML like:
<fieldCollectionDateGroup> <scalarValuesComputed>true</scalarValuesComputed> <dateEarliestSingleCertainty/> <dateEarliestSingleQualifierUnit/> <dateDisplayDate>8-11-09</dateDisplayDate> <dateLatestScalarValue>2009-08-12T00:00:00.000Z</dateLatestScalarValue> <dateEarliestSingleQualifierValue/> <datePeriod/> <dateLatestEra/> <dateEarliestSingleDay>11</dateEarliestSingleDay> <dateEarliestSingleQualifier/> <dateEarliestSingleYear>2009</dateEarliestSingleYear> <dateLatestCertainty/> <dateAssociation/> <dateLatestDay/> <dateEarliestSingleMonth>8</dateEarliestSingleMonth> <dateEarliestSingleEra>urn:cspace:core.collectionspace.org:vocabularies:name(dateera):item:name(ce)'CE'</dateEarliestSingleEra> <dateLatestYear/> <dateLatestQualifierUnit/> <dateNote/> <dateLatestQualifierValue/> <dateLatestQualifier/> <dateEarliestScalarValue>2009-08-11T00:00:00.000Z</dateEarliestScalarValue> <dateLatestMonth/> </fieldCollectionDateGroup>
I kept running into the same patterns of messy date metadata in the data to be migrated, and found myself re-writing (or seeing that I would need to re-write) the same logic in the migration tooling for each tool I support. So I decided to encapsulate this in Emendate.
I was initially using Chronic in preparation of some date metadata, but it is not at all oriented to the kind of date formats typically found in cutural heritage institution data. Further, it returns just a Ruby Time
object, which does not support the complex structured information I needed such as: certainty (approximate, uncertain, supplied/inferred date), inclusive ranges/intervals, and dealing with values like "early 19th century" or "before 1672."
When I discussed the issues I was facing with my colleague Lora Woodford, she pointed me to Timetwister, developed by New York Public Library. This looked very promising, as it has been developed specifically for cultural heritage institution date data, and it returns a structured data object with the types of data we typically need to represent complex date data in our systems.
At that time, I had gathered 99 test examples of different date formats from client data. (At the time, the set did not include EDTF date patterns or some of the date conventions used in MARC records.
When I ran my examples through Timetwister, 45 of the 99 examples weren’t handled as expected (or at all).
At this point, I began to examine the Timetwister codebase, to see if I could contribute back to make it work for a wider range of date formats.
I was discouraged from this approach by finding that much of the parsing is handled by long, complex regular expressions. I immediately saw how some of the stuff in my example set couldn’t reasonably be handled that way. I saw there is an issue from 2016 to add EDTF support, which was still open as of 2021-02-12. There are many reasons why this could still be open, but if you have built up your regexp matching based on some set of initial assumptions, something like EDTF or some of my examples could make it nearly impossible to include them without adding byzantine logical loops and more complexity to already complex and opaque regexes ( really hard to maintain and debug over time), or starting from scratch.
Though the regex approach is common in tools trying to do things like this (I examined several), most of them seem to be attempting to handle a somewhat more standard universe of things than Emendate is.
Faced with trying to contribute back to Timetwister and possibly ending up rewriting much of it, I opted to continue work on Emendate.
I also have looked into the following libraries, none of which seemed to cover the entire problem I am trying to solve with Emendate, but all of which have informed the development of Emendate and helped me understand this problem space more fully.