Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADR to handle dataset information in TIMDEX #125

Merged
merged 6 commits into from
Feb 27, 2024
Merged
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
258 changes: 258 additions & 0 deletions docs/adrs/0002-field-for-data-type-form-information.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,258 @@
# 2. Data Model Fields to Support "Format" and "Data Type" UI Filters

Follows: https://mitlibraries.github.io/guides/misc/adr.html

Date: 2024-02-15

## Status

Proposed
ghukill marked this conversation as resolved.
Show resolved Hide resolved

## Context

### Goal

Identify TIMDEX fields to store information about GIS resources that will support "Format" and "Data Type" UI facet filters for the GDT project.

### Overview

With the addition of geospatial data to TIMDEX, and a new "geo" focused UI, the question was raised as to what fields should be used for two new UI facet filters:

* "Data Type"
* values like: `[Polygon, Point, Raster, Image, ...]`
* "Format"
* values like: `[Shapefile, TIFF, GeoTIFF, JPEG, ...]`

This would mirror functionality in the [current "Geoweb" application](https://geodata.mit.edu/) that is slated to be sunsetted.

The source records come from a new [GeoHarvester](https://github.com/MITLibraries/geo-harvester), which normalizes geospatial metadata from a variety of formats to the
geospatial, discovery scoped metadata format [Aardvark](https://opengeometadata.org/ogm-aardvark/). Transmogrifier then converts these Aardvark records into TIMDEX records.

Data for both UI filters will be controlled terms in the Aardvark records created by GeoHarvester:

* UI filter "Format" from Aardvark field `dct_format_s`:
* https://opengeometadata.org/ogm-aardvark/#format-values (e.g. "Shapefile", "GeoTIFF", "TIFF" etc.)
* UI filter "Data Type" from Aardvark field `gbl_resourceType_sm` (from either list):
* https://opengeometadata.org/ogm-aardvark/#resource-type-values-loc (e.g. `[Atlases, Cadastral maps, Quadrangle maps]`)
* https://opengeometadata.org/ogm-aardvark/#resource-type-values-ogm (e.g. `[Polygon data, Raster data, Line data, Point data]`)

The first iteration mapped in the following way:
* `TIMDEX.content_type` = "Geospatial data" (static value for all records)
* `TIMDEX.format` = values from `Aardvark.dct_format_s`
* `TIMDEX.subjects@kind="Subject scheme not provided"` = values from `Aardvark.gbl_resourceType_sm`

This resulted in a TIMDEX record like:
```json
{
"content_type": "Geospatial data",
"format": "Shapefile",
"subjects": [
{
"value": "Polygon data",
"kind": "Subject scheme not provided"
},
{
"value": "Vector data",
"kind": "Subject scheme not provided"
}
]
}
```

#### 1- The static value "Geospatial data" for `content_type` removes `content_type` as an option for these other data points

It was originally thought that `content_type` was a very high level bucket, allowing various TIMDEX sources to share a facet with values like,
[Geospatial data, Language materials, Musical sound recording, Projected medium, ...].

However, on further analysis it was observed that `content_type` _could_ be a good fit for values like `Polygon data` or `Raster data`, as the long tail of the current `content_type` data shows quite a few examples
of hierarchically and thematically similar values for other sources like [Thesis, Article, Diaries, Scrapbooks, ...]. These values are concerned with the conceptual or structural makeup of the resource,
not the container or external form which field `format` is more aligned with.

#### 2- DSpace is currently the only TIMDEX source that writes to `format` with a hardcoded value `electronic resource`

Analyzing the use of `format` revealed that _only_ the DSpace source was writing to this TIMDEX field. This obscured the possibility that some external/container oriented terms
currently found in `content_type` -- from various pre-existing TIMDEX sources -- might make more sense as `format` values. And, that other TIMDEX sources could begin writing to this field for
the first time.

Better understanding this, the external container values found in `Aardvark.dct_format_s` like "Shapefile" or "GeoTIFF" seems aligned with the `format` field. And, what should
not be ignored, this results in naming parity across all hops: `Aardvark.dct_format_s --> TIMDEX.format --> UI:Format`.

#### 3- The "resource type" data (from `Aardvark.gbl_resourceType_sm`) are arguably not subjects, and furthermore do not have a qualifier that allows cherry-picking them for aggregation

While it was initially convenient to place values from `Aardvark.gbl_resourceType_sm` in `subjects`, it was generally agreed upon that they are not subjects. Furthermore, even
with a `kind` qualifier, it would be awkward at best to access them, or at worst reduce the meaning and usefulness of `subjects` across all TIMDEX sources.

Still thinking that `content_type` was tied up with the static "Geospatial data" value, this was a primary blocker. `content_type` is ultimately a better fit for these values, is already
multivalued, and has precedence with other source mappings like source `datacite` where a source metadata field `ResourceType.resourceTypeGeneral` maps to `content_type` (note the metadata
field naming parity with Aaardvark).

#### 4- Where do we encode "Geospatial data" then?

Considering all above, where then _can_ we encode "Geospatial data" to differentiate 100k GIS records from 1m ebooks in Alma? How can a user quickly cull _all_ of TIMDEX down to just records that
likely have geospatial data (i.e. GIS layers) without needing to immediately jump to much more granular distinctions like "Polygon data" or "Raster data", thereby also removing relevant records from their
search/browsing?

This might indicate an area where the TIMDEX data model could grow: adding a higher level, highly controlled, "categorical" field that would group records at this level. An
example is Primo, which has a "Resource Type" filter which filters at this level (ignoring the naming collision of Aardvark's `gbl_resourceType_sm` field which has a narrower scope). While
TIMDEX is not Primo, it is also not _not_ Primo insofar as it aggregates metadata records from fairly disparate platforms and sources. This act of aggregation might require an additional level of
categorical hierarchy to filter records by.

A potentially important note: we might _not_ expect to find the values/terms for this high level, categorical field in the metadata records themselves, because their descriptive focus is not
concerned with that level of description.

Examples:
- a GIS FGDC XML file does not need to indicate "Geospatial data" anywhere in the record
- an ArchivesSpace EAD record does not need to indicate "Archival material" anywhere in the record
- Alma is trickier: MARC records often _do_ attempt to classify at this high level, and some of these have ended up in `content_type` and likely could be "pushed up" to this new field

Given the presence of `content_type` already, to model after other aggregation platforms and metadata schemas, this new field might be something along the lines of `content_class`. These values may
be static per TIMDEX source (e.g. ASpace = "Archival Materials", GIS = "Geospatial data") or the source's metadata records may contribute to the logic (e.g. MARC's `006` and `008` field relationship
that might provide `content_class` level terms like ["Books", "Music", "Visual Materials", "Maps", ...]).

It is likely beyond the scope of this ADR to propose the addition of that field, but introducing it as a thought experiment takes some conceptual pressure off of `content_type` which could operate
as intended, and to some degree already is, as more granular "type" or "form" information about the resource.

### Possible Solutions

#### Option 1- Use `subjects` with `kind="Data Type"`

In this approach, "Data Type" values would be stored as `subjects` with `kind="Data Type"`.

Example:
```json
{
"subjects": [
{
"value": "Polygon",
"kind": "Data Type"
},
{
"value": "Vector",
"kind": "Data Type"
}
]
}
```

Pros:
* does not require a change to TIMDEX data model anywhere

Cons:
* these "Data Type" values don't feel like subjects; they are not really _about_ the resource so much as describing its type/structure/form


#### Option 2- Create new, multivalued string field `form`

In this approach, "Data Type" value would be stored in a new, multivalued string field `form`:

Example:
```json
{
"form": ["Polygon", "Vector"]
}
```

Pros:
* purely additive change to data model
* simple, top level property makes aggregations very simple

Cons:
* still require, and sit along next to, `literary_form` field for describing text sources as "Fiction" or "Nonfiction"


#### Option 3- Create new, multivalued objects field `form`; collapse `literary_form` into this

In this approach, "Data Type" value would be stored in a new, multivalued object field `form`:

Example:
```json
{
"form": [
{
"value": "Polygon",
"kind": "Data Type"
},
{
"value": "Vector",
"kind": "Data Type"
}
]
}
```

Pros:
* allows collapsing of `literary_form` field; noting some shared sentiment that this field might be too source-specific for TIMDEX
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I don't know that I feel strongly enough to veto this option, I do have concerns.

While I agree that, in hindsight, literary_form field seems to over-fit the data model to one data source, I'm not sure how comfortable I am creating a vocabulary that encompasses all of these values: nonfiction, fiction, point, polygon, raster, image

Terms like point, polygon, and raster feel like they address the internal structure of the record, but nonfiction and fiction are more about the content (akin to genre). Equivalent literary terms feel like they might be prose, verse, or drama (which are values that I don't think appear in any current field?)

image is a value I also struggle with in this set, as I understand it to be closely related to (if not synonymous with) raster. I vaguely remember seeing raster data that is not an image, so maybe there's something here I'm missing though.

Copy link
Member

@JPrevost JPrevost Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matt-bernhardt would your concern be addressed by the different kind provided by this Object?

Example:

GIS record

{
    "form": [
        {
            "value": "Polygon",
            "kind": "Data Type"
        },
        {
            "value": "Vector",
            "kind": "Data Type"
        }
    ]
}

Literary form example

{
    "form": [
        {
            "value": "Fiction",
            "kind": "Literary Form"
        }
    ]
}

I remain unsure myself if kind addresses the "sameness" across fields. When I imagine using this data in a User Interface, I find myself agreeing with your concern that the geo values looks super weird along with the literary form values. We could address that by exposing aggregations that are filtered to the separate kind values but then it starts to feel slightly weird again of why are we combining things in OpenSearch that we would then split out again in GraphQL and thus nobody would ever use the combined data. It definitely starts to feel like if we are concerned we'll never use the data together -- because it is indeed "different" -- we need to consider if it ever should have been together in the first place. I'm still not sure.

* like other object fields, leaves the door open for adding a `uri` property at a later time

Cons:
* would require reworking the transformations + re-indexing any sources that use `literary_form`
* nested field type, a bit harder to query for aggregations

#### Option 4 - Use `file_formats` for current `format` values and `format` for Data Type values

In this approach, the current `MITAardvark.format` values would shift to the previously unused `MITAardvark.file_formats` property and the Data Type values would be stored in `MITAardvark.format`

Example:
```json

{
"content_type": "Geospatial data",
"format": ["Polygon", "Point", "Raster", "Image"],
"file_formats": ["Shapefile", "TIFF", "GeoTIFF", "JPEG"]
}
```

Pros:
* does not require TIMDEX data model changes

Cons:
* `file_formats` has previously only stored MIME type values, such as `application/pdf`
* may require explanation of the facet mapping in the UI documentation
* may require updates of other transform classes for consistency
Comment on lines +191 to +211
Copy link
Contributor Author

@ghukill ghukill Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finding Option 4 pretty appealing now.

One question: what if we had a record for a print map? If the Aardvark record had dct_format_s="Print map", are we okay with this value mapping to file_formats?

Maybe helpful for discussion, this spreadsheet has dct_format_s and gbl_resourceType_sm values across all MIT and OGM records.

In support of Option 4, I'm hard pressed to find a value that isn't a digital file format in essence (where even msot of the complex ones could be mapped to "Online resource" or something similar).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question: if we go this route, are we okay losing mimetype as a value in a TIMDEX record? My inclination is that it would be okay as I'd imagine this field is not widely used yet.

If we did want specifics about mimetypes, perhaps that could get added to the links section? Seems more meaningful for a mimetype to point precisely to an actual resource vs the record as an abstract whole.

Copy link
Contributor

@ehanson8 ehanson8 Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did want specifics about mimetypes, perhaps that could get added to the links section? Seems more meaningful for a mimetype to point precisely to an actual resource vs the record as an abstract whole.

Agree that including it with the actual download link would be more useful, Link.file_type attribute?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question: what if we had a record for a print map? If the Aardvark record had dct_format_s="Print map", are we okay with this value mapping to file_formats?

If we do go this route, we may want to change the name from file_formats so it's more broadly applicable but I don't have a suggestion yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question: what if we had a record for a print map? If the Aardvark record had dct_format_s="Print map", are we okay with this value mapping to file_formats?

If we do go this route, we may want to change the name from file_formats so it's more broadly applicable but I don't have a suggestion yet

I've had time to look more closely at the spreadsheet linked above, and I don't think we'll encounter "Print Maps", at least in the gismit or gisogm sources. So maybe this is moot? I think it's safe to say that the values we get back from dct_format_s in the MITAardvark files will uniformly be digital in nature. And therefore, file_formats could be applicable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did want specifics about mimetypes, perhaps that could get added to the links section? Seems more meaningful for a mimetype to point precisely to an actual resource vs the record as an abstract whole.

Agree that including it with the actual download link would be more useful, Link.file_type attribute?

If we did repurpose file_formats, and therefore needed to move those mimetype data points to another field, and we did like putting them on the links entries, I might propose mimetype explicitly; no ambiguity there.

If we go that route, I think we'd want to make sure the links made sense though. If we're just pointing someone back to an Alma record, or ASpace page, I'm not sure that an link.mimetype=application/pdf makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I can weigh in on what to do with the mimetype values at the moment, but as far as the rest of this choice is concerned, I think I'd prefer option 4 over the others - at least as I understand them now.

To be clear, this option would leave the current literary_form field untouched?


#### Option 5 - Map `format` to "Format" filter, map 'content_type' to "Data Type" filter

In this approach, there would be **no** immediate data model changes. As outlined above, both the pre-existing `format` and `content_type` fields would be sufficient
for mapping data from the Aardvark records in such a way to support "Format" and "Data Type" UI filters.

This option **does** implicitly propose a new higher level TIMDEX field, something along the lines of `content_class`, but this is not an immediate requirement, and it
might be helpful to decouple that from this decision at hand.

- GIS TIMDEX sources
- continue to map Aardvark `dct_format_s` to TIMDEX `format`, driving the new "Format" UI filter
- map Aardvark `gbl_resourceType_sm` to TIMDEX `content_type`, driving the new "Data Type" UI filter
- this "Data Type" UI filter, pulling from `content_type`, likely only makes sense in the Geo TIMDEX UI context
- a "Content Type" facet filter across all TIMDEX records would have values like ["Short stories", "Polygon data", "Thesis", ...] sitting next to each other
- accept that "Geospatial data" is not an available, high level value to facet on until a field like `content_class` is modeled and added

- Other TIMDEX sources
- DSpace
- remove blanket `format=electronic resource` and instead derive `format` values from available `file_formats` mimetypes for friendly forms
- e.g. `application/pdf` suggests "PDF", or `text/csv` suggests "CSV", to name a couple example
- there are python libraries that can handle 90% of these conversions, if a friendly form is not present in the library

Examples:
```json
{
"content_type": ["Polygon data"],
"format": "Shapefile"
}
```

```json
{
"content_type": ["Raster data", "Image"],
"format": "GeoTIFF"
}
```

## Decision

TBD
ghukill marked this conversation as resolved.
Show resolved Hide resolved

## Consequences

For GIS records, both the "Format" and "Data Type" UI filters will mirror those same filters in the legacy "Geoweb" system, with the same or highly similar values.

For other TIMDEX sources, given the decision outlined above, there should be little to no effect. However, as discussed, the inability to encode "Geospatial data" may suggest that a new, higher level
TIMDEX categorical field would be beneficial, and this would require work in Transmogrifier to provide a meaningful value for this new field for all sources.
Loading