schema.org markup for video objects #834

lwrubel · 2023-11-28T16:20:26Z

Resolves #824. Adds schema.org JSON for videos that are world-downloadable. Example (https://sul-purl-stage.stanford.edu/vx195sw6395):

Example 2: (https://sul-purl-uat.stanford.edu/bc169zr6817) checked on Google's Rich Results Test:
<script type="application/ld+json">{"@context":"http://schema.org","@type":"VideoObject","name":"SCRF session - 1 - PIP-II","thumbnailUrl":"https://stacks-uat.stanford.edu/file/druid:bc169zr6817/bc169zr6817_SRF_AWLC_Oct20-3_thumb.jp2","uploadDate":"2020-10-20","embedUrl":"https://embed-uat.stanford.edu/iframe/?url=https%3A%2F%2Fsul-purl-uat.stanford.edu%2Fbc169zr6817"}</script>
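
For reviewers who want to see how the fields above fit together, here is a minimal sketch in plain Ruby. It is not the code in this PR: `embed_url` and `video_object_json_ld` are hypothetical helper names, and the embed base URL is simply copied from the UAT example above.

```ruby
require 'json'
require 'uri'

# Hypothetical helper: URL-encodes the PURL and appends it to the embed base,
# mirroring the shape of the embedUrl in the example above.
def embed_url(purl_url, embed_base: 'https://embed-uat.stanford.edu/iframe/')
  "#{embed_base}?url=#{URI.encode_www_form_component(purl_url)}"
end

# Hypothetical helper: assembles the VideoObject properties shown in the
# example and serializes them as JSON for an application/ld+json script tag.
def video_object_json_ld(name:, thumbnail_url:, upload_date:, purl_url:)
  {
    '@context' => 'http://schema.org',
    '@type' => 'VideoObject',
    'name' => name,
    'thumbnailUrl' => thumbnail_url,
    'uploadDate' => upload_date,
    'embedUrl' => embed_url(purl_url)
  }.to_json
end
```

Calling `video_object_json_ld` with the title, thumbnail, date, and PURL from Example 2 should reproduce the same set of properties as the markup above.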

lwrubel · 2023-12-01T22:47:49Z

@arcadiafalcone would you take a look at the specs (especially spec/lib/metadata/schema_dot_org_spec.rb) to see if the Cocina is realistic? If there are other scenarios to include in the test, let me know.

arcadiafalcone · 2023-12-04T18:15:26Z

Is there a schema.org reason for joining title and subtitle with a new line rather than the more usual semicolon?

purl/spec/lib/metadata/schema_dot_org_spec.rb

Line 93 in a94ffa5

"name": 'My Dataset\nMore title'

The DOI might be in identifier.uri instead of identifier.value for some objects.

purl/spec/lib/metadata/schema_dot_org_spec.rb

Line 199 in a94ffa5

{ "value": "https://doi.org/10.25740/hj293cv5980",

And the ORCID may also be in either identifier.value or identifier.uri - MODS doesn't make the distinction.

purl/spec/lib/metadata/schema_dot_org_spec.rb

Line 389 in a94ffa5

"identifier": {"uri": "https://orcid.org/0000-0000-0000-0000"}}]

lwrubel · 2023-12-05T18:24:52Z

Thanks @arcadiafalcone. I'm fixing title and subtitle concatenation since I was erroneously using the same approach as for the description. Will use semi-colon to concatenate.

For the DOI in the identifier.uri, would this spec suffice?

purl/spec/lib/metadata/schema_dot_org_spec.rb

Line 198 in 6eb5412

context 'with DOI in identifier uri' do

For the ORCID, the current code / specs handle a contributor having identifier.uri or the contributor having identifier.value with type being "orcid". Is there also a case where it could be identifier.value without a type?

arcadiafalcone · 2023-12-05T19:21:12Z

My error, it should be title: subtitle (colon rather than semicolon).
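
To make the agreed convention concrete, here is a rough sketch of joining a title and subtitle with a colon. It assumes a Cocina-like title whose `structuredValue` parts are typed `main title` and `subtitle`; the method names are hypothetical and the PR's actual implementation may differ.

```ruby
# Hypothetical helper: returns the value of the first part with the given type.
def part_value(parts, type)
  part = parts.find { |p| p['type'] == type }
  part && part['value']
end

# Joins main title and subtitle as "title: subtitle".
# Falls back to the plain value when the title has no structured parts.
def display_title(title)
  parts = title['structuredValue'].to_a
  return title['value'] if parts.empty?

  main = part_value(parts, 'main title')
  subtitle = part_value(parts, 'subtitle')
  [main, subtitle].compact.join(': ')
end
```

With the spec data quoted earlier in this thread, this would give `My Dataset: More title` rather than the newline-joined form.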

arcadiafalcone · 2023-12-05T19:25:41Z

Re: DOI in identifier.uri, that works so long as the code confirms the value is on the right domain for a DOI (not all values in this field will be DOIs).

Re: ORCID, identifier.value should always have a type. (If it doesn't, that's a data problem.)
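
A sketch of the two lookups discussed above, under stated assumptions: identifiers are hashes with optional `uri`, `value`, and `type` keys, a DOI is recognized by the `doi.org` host, and an ORCID is accepted either as a `uri` on `orcid.org` or as a `value` whose type is `orcid`. The method names are hypothetical, not the ones in this PR.

```ruby
require 'uri'

# Returns the host of a URL string, or nil if the input is not a parseable string.
def host_of(candidate)
  return nil unless candidate.is_a?(String)

  URI.parse(candidate).host
rescue URI::InvalidURIError
  nil
end

# Looks in both identifier.uri and identifier.value, and only accepts
# candidates on the DOI domain, since not every value in these fields is a DOI.
def doi(identifiers)
  identifiers
    .flat_map { |id| [id['uri'], id['value']] }
    .find { |candidate| host_of(candidate) == 'doi.org' }
end

# Accepts an ORCID from identifier.uri, or from identifier.value when typed.
# An untyped identifier.value is treated as a data problem and ignored.
def orcid(identifiers)
  identifiers.each do |id|
    return id['uri'] if host_of(id['uri']) == 'orcid.org'
    return id['value'] if id['value'] && id['type'].to_s.downcase == 'orcid'
  end
  nil
end
```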

lwrubel · 2023-12-06T17:38:02Z

Thanks, @arcadiafalcone. I've adjusted the specs and code for these.

justinlittman · 2023-12-06T18:59:57Z

lib/metadata/schema_dot_org.rb

@@ -44,38 +42,58 @@ def dataset?
      false
    end

+    def video?
+      # Only return video metadata if world-downloadable.
+      video = JsonPath.new("$.description.form[?(@['value'] == 'moving image' && @['type'] == 'resource type')]").on(@cocina_json)


Why is this coming from description?

That was @arcadiafalcone's recommendation: #824 (comment). Is there a different field in the cocina we should consider?

type = "https://cocina.sul.stanford.edu/models/media"?

I would expect https://cocina.sul.stanford.edu/models/media to include audio as well, and we only want video. Am I not understanding the type correctly?

I'm fairly sure that I'm guessing here. Alternatively, resource's type = "https://cocina.sul.stanford.edu/models/resources/video"

@arcadiafalcone could you make a recommendation about using a fileset resource type vs the descriptive metadata for videos?

Descriptive metadata is dependent on the user to create it, so the fileset resource type is probably a bit more reliable (it's a core metadata field that's usually provided, but a user could make an error).

Thanks, that's helpful!

Is the schema.org expectation that a "video" = streaming video? Or would it include a video file that may be downloadable but isn't presented in a player? If the expectation is streaming, then the fileset resource type is the only guarantee that it will be presented that way.

We may also have a description that says something is a video when the video itself hasn't been digitized. I'm not aware of examples, but we do have "audio" where only the record label/album cover/liner notes have been digitized.

I've updated the code to use the fileset resource type to determine whether it's a video.
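
For illustration, a minimal sketch of detection by fileset resource type. It uses plain Ruby hash access instead of JsonPath to stay self-contained, and assumes the Cocina JSON has been parsed into a hash with filesets under `structural.contains`. `video_filesets` is a hypothetical name.

```ruby
# Fileset resource type suggested in this thread for identifying videos.
VIDEO_FILESET_TYPE = 'https://cocina.sul.stanford.edu/models/resources/video'

# Returns the filesets typed as video. Descriptive metadata is deliberately
# ignored, since it can say "moving image" without a video fileset being present.
def video_filesets(cocina)
  cocina.dig('structural', 'contains').to_a.select do |fileset|
    fileset['type'] == VIDEO_FILESET_TYPE
  end
end
```

An object whose description says "moving image" but whose only fileset is an image would return an empty array here, which is the case the descriptive-metadata approach could get wrong.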

justinlittman · 2023-12-06T19:01:30Z

lib/metadata/schema_dot_org.rb

    end

-    def access
+    def access?


Shouldn't this take into account the file permissions as well?

I think I need help understanding if it's possible to have an object with access.download == "world" and the video itself is not world-downloadable. Is that the scenario you're thinking of, @justinlittman?

It's possible. It's less likely than the inverse case, where the access.download is none but one or more files are world-downloadable.

I've adjusted the code to look at the access.download permissions for the first file where hasMimeType includes video. It also requires the object-level rights to be "world". If you think we need to allow those to be download == none, let me know @andrewjbtw
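
A rough sketch of the combined check described above, assuming files sit under each fileset's `structural.contains` and carry `hasMimeType` and `access.download`. The method names are hypothetical and the code is plain Ruby rather than the JsonPath used in the PR.

```ruby
# Object-level rights: true only when access.download is "world".
def object_world_downloadable?(cocina)
  cocina.dig('access', 'download') == 'world'
end

# First file, across all filesets, whose MIME type includes "video".
def first_video_file(cocina)
  cocina.dig('structural', 'contains').to_a
        .flat_map { |fileset| fileset.dig('structural', 'contains').to_a }
        .find { |file| file['hasMimeType'].to_s.include?('video') }
end

# File-level rights for that first video file.
def video_file_world_downloadable?(cocina)
  first_video_file(cocina)&.dig('access', 'download') == 'world'
end

# Both levels must be "world" for the video to count as world-downloadable.
def world_downloadable_video?(cocina)
  object_world_downloadable?(cocina) && video_file_world_downloadable?(cocina)
end
```

Under this sketch, the inverse case raised above (object-level download of none with a world-downloadable file) returns false.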

justinlittman · 2023-12-06T19:05:29Z

@lwrubel Can a video object have more than one video file?

lwrubel · 2023-12-07T18:24:16Z

@justinlittman Looking at the canonical examples, I see that it is possible to have multiple video files (e.g. https://purl.stanford.edu/yj807zw8315). I'm trying to figure out what makes the most sense for schema.org.

There is an ItemList with limited usage, but it's intended for video carousels more than multi-file objects. Honestly, I'm not sure if it's important to surface thumbnails or embedUrls of more than one video, if on the PURL we currently show the first one, with access to the rest. They share descriptive metadata. (Not sure if there is rights variability among a list of videos, but we could pull the embedURL and thumbnail for the first video that is world-downloadable.)

lwrubel · 2023-12-08T21:34:08Z

Noting here that there seem to be very few videos that meet our criterion for uploadDate, which is that the object has an event.date.type == 'publication'. Next week I will run a report to see how few there really are. I don't know if there are other date types that might be logical for this purpose, @arcadiafalcone. Here's Google's description of the field: https://developers.google.com/search/docs/appearance/structured-data/video#video-object.
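
For reference, the current criterion can be sketched as follows. It assumes events live under `description.event`, each with a `date` array of hashes carrying `value` and `type`; `upload_date` is a hypothetical name, and other date types are intentionally not considered here.

```ruby
# Returns the first event date typed "publication", or nil when there is none.
def upload_date(cocina)
  cocina.dig('description', 'event').to_a.each do |event|
    event['date'].to_a.each do |date|
      return date['value'] if date['type'] == 'publication' && date['value']
    end
  end
  nil
end
```

Objects that only carry, say, a creation date return nil under this rule, which is why so few videos currently qualify.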

lwrubel · 2023-12-08T21:36:23Z

I think I've addressed the comments, and this is otherwise ready for any further code review. We can tweak the date logic or broaden/narrow criteria further if needed, of course.

justinlittman

Sorry.

justinlittman · 2023-12-11T12:27:15Z

lib/metadata/schema_dot_org.rb

+    def format_specific_fields
+      if dataset?
+        return { "identifier": identifier,
+                 "isAccessibleForFree": access?,


What is the implication if this is "false"? Does it make sense to include the metadata in those cases?

Google Dataset Search includes datasets that are not free, or require a DUA. There's a "free" filter. So it's worth including non-free datasets.

But why publicize something that we can't provide?

@amyehodge I am making the assumption that there are SDR-deposited datasets that are not freely available but possible to use, once the user goes through some process (e.g. a DUA or contacting the author). Or that creators of Stanford-only datasets may still want some kind of visibility in Google Dataset Search. But maybe that is not a realistic scenario or something we support?

Yes, we do have datasets like this. One example is https://purl.stanford.edu/sg732vt3619, which has a PURL but in order to gain access you must 1) be a Stanford person and 2) agree to the Data Use Agreement, so the files aren't downloadable from the PURL. We would want this crawled.

Another example is https://purl.stanford.edu/gh587bx9720, where the data files are actually on the PURL.

justinlittman · 2023-12-11T12:30:57Z

lib/metadata/schema_dot_org.rb

+    def video?
+      # Only return video metadata if world-downloadable.
+      video = JsonPath.new("$.structural.contains[?(@['type'] == 'https://cocina.sul.stanford.edu/models/resources/video')]").on(@cocina_json)
+      return true if video.any? && access? && video_access?


Can you name the methods such that it is clearer how access? is different from video_access??

Also, the method name video? suggests that it answers the question whether this is a video. But this method does more than that. Perhaps "render_video_metadata?" or similar.

Changing access? to object_access?. Changing video? to render_video_metadata?

justinlittman · 2023-12-11T12:32:13Z

lib/metadata/schema_dot_org.rb

    end

    def schema_type?
-      dataset?
+      dataset? || video?


This method is poorly named since it is taking into account rights as well as the schema type.

Changed to render_video_metadata?

justinlittman · 2023-12-11T12:36:49Z

lib/metadata/schema_dot_org.rb

+      # need to find the file that is the one for the video (based on mime-type). Then get the access and download rights for that.
+      file_access = JsonPath.new('$[*].structural.contains[*][?(@.hasMimeType =~ /video/)].access.download').first(video)
+
+      return true if file_access == 'world'


L108-110 could just be file_access == 'world'

Oh, right. Done here and in dataset?
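
The suggestion amounts to returning the comparison directly. A small before/after sketch follows; the method names are hypothetical, and the "before" form assumes the original fell through to `false` after the guard clause.

```ruby
# Before: guard clause followed by an explicit false (assumed original shape).
def world_verbose?(file_access)
  return true if file_access == 'world'

  false
end

# After: the comparison already evaluates to true or false.
def world?(file_access)
  file_access == 'world'
end
```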

lwrubel force-pushed the t824-video-schema branch 4 times, most recently from e6ca5ce to 6c568b4 Compare December 1, 2023 22:12

lwrubel changed the title ~~[WIP] Initial schema.org markup for video objects~~ [DRAFT] Initial schema.org markup for video objects Dec 1, 2023

lwrubel changed the title ~~[DRAFT] Initial schema.org markup for video objects~~ [DRAFT] schema.org markup for video objects Dec 1, 2023

lwrubel force-pushed the t824-video-schema branch 2 times, most recently from db06769 to 6eb5412 Compare December 5, 2023 14:31

lwrubel force-pushed the t824-video-schema branch 3 times, most recently from d907b40 to 0f4d2f5 Compare December 6, 2023 17:21

lwrubel changed the title ~~[DRAFT] schema.org markup for video objects~~ schema.org markup for video objects Dec 6, 2023

lwrubel marked this pull request as ready for review December 6, 2023 17:38

justinlittman reviewed Dec 6, 2023

lwrubel force-pushed the t824-video-schema branch from fdd0892 to d2fea9c Compare December 7, 2023 14:37

lwrubel force-pushed the t824-video-schema branch 2 times, most recently from 08aa387 to 97e4daf Compare December 8, 2023 19:10

lwrubel changed the title ~~schema.org markup for video objects~~ [HOLD] schema.org markup for video objects Dec 8, 2023

lwrubel changed the title ~~[HOLD] schema.org markup for video objects~~ schema.org markup for video objects Dec 8, 2023

justinlittman requested changes Dec 11, 2023

schema.org markup for video objects

27a6fdb

lwrubel force-pushed the t824-video-schema branch from 97e4daf to 27a6fdb Compare December 11, 2023 14:00

justinlittman approved these changes Dec 11, 2023

lwrubel merged commit db48b44 into main Dec 11, 2023
1 check passed

lwrubel deleted the t824-video-schema branch December 11, 2023 20:18
