Structured Data 2021 #2174

rviscomi · 2021-04-27T19:32:06Z

Part I Chapter 4: Structured Data

If you're interested in contributing to the Structured Data chapter of the 2021 Web Almanac, please reply to this issue and indicate which role or roles best fit your interest and availability: author, reviewer, analyst, and/or editor.

Content team

Lead	Authors	Reviewers	Analysts	Editors	Coordinator
@jonoalderson	@jonoalderson @cyberandy	@kevinmarks @vdwijngaert @jvandriel @philbarker	@GregBrimble	@jvandriel @JasmineDWillson	@rviscomi

Expand for more information about each role

The content team lead is the chapter owner and responsible for setting the scope of the chapter and managing contributors' day-to-day progress.
Authors are subject matter experts and lead the content direction for each chapter. Chapters typically have one or two authors. Authors are responsible for planning the outline of the chapter, analyzing stats and trends, and writing the annual report.
Reviewers are also subject matter experts and assist authors with technical reviews during the planning, analyzing, and writing phases.
Analysts are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
Editors are technical writers who have a penchant for both technical and non-technical content correctness. Editors have a mastery of the English language and work closely with authors to help wordsmith content and ensure that everything fits together as a cohesive unit.
The section coordinator is the overall owner for all chapters within a section like "User Experience" or "Page Content" and helps to keep each chapter on schedule.

Note: The time commitment for each role varies by the chapter's scope and complexity as well as the number of contributors.

For an overview of how the roles work together at each phase of the project, see the Chapter Lifecycle doc.

Milestone checklist

0. Form the content team

May 31: The content team has at least one author, reviewer, and analyst

1. Plan content

June 15: The content team has completed the chapter outline in the draft doc

2. Gather data

June 30: Analysts have added all necessary custom metrics and drafted a PR (example) to track query progress

July 1 - 31: HTTP Archive runs the July crawl

3. Validate results

September 30: Analysts have queried all metrics and saved the output to the results sheet

4. Draft content

October 31: The content team has written, reviewed, and edited the chapter in the doc

5. Publication

November 15: The completed chapter and all required metadata and figures are converted to markdown and submitted to GitHub

December 1: Target launch date 🚀

Chapter resources

Refer to these 2021 Structured Data resources throughout the content creation process:

📄 Google Docs for outlining and drafting content
🔍 SQL files for committing the queries used during analysis
📊 Google Sheets for saving the results of queries
📝 Markdown file for publishing content and managing public metadata

jonoalderson · 2021-04-27T19:43:11Z

As discussed in Slack, I'd be very keen to author this. I'd also be happy to take my hat out of the 'Author' ring (and to play Reviewer instead) for #2148 so as to be able to resource this effectively (which I've updated accordingly).

rviscomi · 2021-05-04T20:02:29Z

@jono-alderson thanks for your interest in authoring this chapter! As the content team lead, you'll be responsible for the scope and direction of the chapter and keeping it on schedule. We automatically monitor the staffing and progress of each chapter based on the state of the initial comment so please keep that updated as you add new contributors and meet each milestone.

We've created a Google Doc for this chapter, which you're encouraged to use to collaborate with the content team on the initial outline, metrics, and ultimately the final draft.

Next steps for this chapter are:

Due May 31: Complete Milestone 0: Form the content team by finding contributors to be a peer reviewer and analyst
Due June 15: Get started on Milestone 1: Plan content by brainstorming some content ideas in the doc (you may need to request edit access)

There's not currently a section coordinator for this chapter, so I'll be periodically checking in with you directly to make sure the chapter is staying on schedule. Reach out here in this issue if you have any questions about the process.

More information about the content team lead and author roles and responsibilities is available for reference in the wiki if needed.

To anyone else interested in contributing to this chapter, please comment below to join the team!

GregBrimble · 2021-05-06T00:05:43Z

Hey @jono-alderson ,

If you'll have me, I'd love to help out with the analysis for this chapter, this year!

jonoalderson · 2021-05-06T07:31:32Z

> Hey @jono-alderson ,
> If you'll have me, I'd love to help out with the analysis for this chapter, this year!

That'd be wonderful, thanks! NB, I'm aiming to start outlining a plan and firing out some comms this weekend :)

rviscomi · 2021-05-11T04:08:30Z

Hi @jono-alderson just checking in. Here are some tips to help keep the chapter on track:

Request edit access to the doc and start brainstorming an outline for the chapter
Consider announcing to your professional networks that you're looking for co-contributors knowledgeable in structured data to join the chapter
Edit the top comment to keep the chapter metadata in sync with all reviewers and analysts and also any completed milestones (helpful for us to monitor progress at a glance in 2021 Chapter Progress #2179)

⚠️ Note that if we're unable to meet Milestone 0 by May 31 we may have to close this chapter and refocus our efforts on other chapters.

vdwijngaert · 2021-05-11T10:12:11Z

Happy to help if you guys need any more reviewers :)

kevinmarks · 2021-05-11T10:22:48Z

You asked about microformats - I'm happy to help review on that area, and help those running analyses make sense of them.

jonoalderson · 2021-05-11T10:24:56Z

> Happy to help if you guys need any more reviewers :)

Thanks - more reviewers are definitely welcome! I have a feeling that we're going to need lots of hands on deck for this!

jonoalderson · 2021-05-11T10:25:54Z

> You asked about microformats - I'm happy to help review on that area, and help those running analyses make sense of them.

Thanks, Kevin, that'd be amazing. I'm conscious that whilst schema.org and JSON-LD are very trendy at the moment, there's a lot of structured data out there in legacy formats that I'm keen for us not to overlook. I'll add you as a reviewer! Delightful to have your input.

jonoalderson · 2021-05-11T10:28:03Z

@rviscomi I don't appear to be able to edit the top comment; do I need some permissions?

GregBrimble · 2021-05-11T12:52:56Z

I've apparently got edit access, so I've added @kevinmarks and @vdwijngaert as reviewers, and myself as an analyst, @jono-alderson :)

I've also checked off that May 31st milestone since we now have at least one of each role. Do you want to remove the help wanted badges, or are you still looking for more people to help out?

jonoalderson · 2021-05-11T12:59:03Z

Thanks! Still happy to invite more folks. It's a big topic, so I'm happy to cat-herd involvement from a wider pool potentially; unless there are good reasons not to?

Could you also add @jvandriel as a reviewer and editor, please? :)

GregBrimble · 2021-05-11T13:03:14Z

Nope, I'm sure that's fine to leave the badges up if we're still looking for people :)

Added, and also put everyone in the frontmatter of the Google doc as well.

cyberandy · 2021-05-11T13:25:25Z

Hi all 👋 happy to contribute on this one - either as author or editor, whatever feels more necessary.

jvandriel · 2021-05-11T13:55:51Z

I'm happy to join and help out as well - also very curious to see the outcome

rviscomi · 2021-05-11T14:04:55Z

> @rviscomi I don't appear to be able to edit the top comment; do I need some permissions?

@jono-alderson you'll need to accept our invitation to join the HTTP Archive team in order to get edit access on GitHub. Check your email or visit https://github.com/HTTPArchive/ to accept.

Happy to see the increased interest in this chapter!

rviscomi · 2021-05-13T19:02:47Z

Here's the sharable link for anyone to join the Slack channel: https://join.slack.com/t/httparchive/shared_invite/zt-45sgwmnb-eDEatOhqssqNAKxxOSLAaA

jonoalderson · 2021-05-16T13:27:28Z

Looks like we have everybody in Slack except for @vdwijngaert; are you able to join us, Koen? :)

philbarker · 2021-05-17T16:06:10Z

@jono-alderson I'm here because @jvandriel asked, then I saw your tweet asking for involvement from people with expertise in Dublin Core / other metadata. I might be able to help as reviewer, if you still need such help.

jonoalderson · 2021-05-17T17:15:00Z

Hi @philbarker, thanks for reaching out! That'd be amazing; I'll add you to the team list! I know I'm personally weak on knowledge around DC, so keen to have an expert involved!

Please feel free to jump into the Slack channel, and contribute any ideas/direction, etc!

rviscomi · 2021-05-27T19:09:55Z

All, the outline in the chapter doc is looking great. Nice work! 🚀

@jono-alderson is the outline complete, or are you still adding to it?

jonoalderson · 2021-05-28T08:56:45Z

Getting there!
Hoping for a bit more feedback from the crew, as I feel there's more we could do without being too over-ambitious. Any ideas, folks?

GregBrimble · 2021-05-28T10:34:54Z

One thing I might suggest would be a deeper integration with knowledge graphs like Wikidata. If I've got this structured data on a page:

{
  "@type": "Person",
  "name": "Greg Brimble",
  "nationality": {
    "@type": "Country",
    "name": "United Kingdom"
  },
  "sameAs": ["https://www.wikidata.org/wiki/Q52444075"]
}

and Wikidata has this:

"instance of" → "human"
(P31 → Q5)

"country of citizenship" → "United Kingdom"
(P27 → Q145)

How much overlap is there? Does the structured data provide information not found in Wikidata, or vice-versa?
Are there any inconsistent claims?
Do the types of the entity match?
etc.

This is getting dangerously close to what my undergraduate dissertation was on 😅 The difficulty is in doing the ontology matching (finding equivalent properties and entities), which might be a bit out-of-scope for this analysis (e.g. Schema.org's "Person" ≠ Q5, but Schema.org's "nationality" === P27).
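To make the overlap question above concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: `PROPERTY_MAP` and `WIKIDATA_LABELS` are hypothetical, hand-written lookups (building them at scale is exactly the ontology-matching problem described above), and the Wikidata claims are hard-coded from the example rather than fetched from Wikidata.

```python
# Hypothetical, hand-written mapping from Schema.org properties to Wikidata
# properties. Producing this mapping at scale is the hard ontology-matching step.
PROPERTY_MAP = {"nationality": "P27"}

# Hypothetical label lookup for the Wikidata items used in the example.
WIKIDATA_LABELS = {"Q5": "human", "Q145": "United Kingdom"}


def compare_claims(jsonld, wikidata_claims):
    """Classify each mapped JSON-LD property as a match, a conflict, or missing."""
    report = {"match": [], "conflict": [], "missing_in_wikidata": []}
    for prop, wd_prop in PROPERTY_MAP.items():
        if prop not in jsonld:
            continue
        value = jsonld[prop]
        # Nested entities carry their label in "name"; plain strings are used as-is.
        label = value.get("name") if isinstance(value, dict) else value
        wd_item = wikidata_claims.get(wd_prop)
        if wd_item is None:
            report["missing_in_wikidata"].append(prop)
        elif WIKIDATA_LABELS.get(wd_item) == label:
            report["match"].append(prop)
        else:
            report["conflict"].append(prop)
    return report


# The structured data and Wikidata statements from the example above.
page_data = {
    "@type": "Person",
    "name": "Greg Brimble",
    "nationality": {"@type": "Country", "name": "United Kingdom"},
    "sameAs": ["https://www.wikidata.org/wiki/Q52444075"],
}
wikidata_claims = {"P31": "Q5", "P27": "Q145"}

print(compare_claims(page_data, wikidata_claims))
```

Comparing by label is a deliberate simplification; a real comparison would need to resolve entities to identifiers on both sides, which is where the scope concern comes in.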

jonoalderson · 2021-05-28T10:49:07Z

That'd be pretty awesome, but I think that comparing to external sources at scale is going to be waaayyy out of scope.
We should definitely put more attention on sameAs declarations, though; there are bound to be some interesting findings in detecting common hostnames and patterns in there.
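A rough sketch of what tallying hostnames in sameAs declarations could look like, in Python. The function name `same_as_hostnames` and the sample entities are hypothetical and only for illustration; the chapter's actual analysis was done with SQL queries, not this code.

```python
from collections import Counter
from urllib.parse import urlsplit


def same_as_hostnames(entities):
    """Count hostnames across the sameAs values of a list of JSON-LD entities."""
    counts = Counter()
    for entity in entities:
        values = entity.get("sameAs", [])
        # sameAs may be a single URL string or a list of URLs.
        if isinstance(values, str):
            values = [values]
        for url in values:
            if not isinstance(url, str):
                continue
            try:
                host = urlsplit(url).hostname
            except ValueError:
                # Skip values that cannot be parsed as URLs at all.
                continue
            if host:
                # Treat "www.example.com" and "example.com" as the same host.
                if host.startswith("www."):
                    host = host[4:]
                counts[host] += 1
    return counts


# Hypothetical sample entities; only the Wikidata URL comes from this thread.
sample = [
    {"sameAs": ["https://www.wikidata.org/wiki/Q52444075", "https://example.com/greg"]},
    {"sameAs": "https://www.example.com/about"},
    {"name": "no sameAs here"},
]

print(same_as_hostnames(sample).most_common())
```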

JasmineDWillson · 2021-05-28T21:18:02Z

Might be interesting to touch on the use of sameAs:

In terms of its limitations

e.g. When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web

Or ways that others have tried to navigate mapping to terms that are not entirely equivalent

e.g. the CWRC ontology's hasFunctionalRelation predicate which "Relates...to external terms that are semantically incommensurate but that may be pragmatically related for processing purposes such as search and retrieval"

without diving too deeply into the mire of ontology mapping...

rviscomi · 2021-06-16T02:49:11Z

Hey @jono-alderson, could you give an update on the chapter outline? I see some new topics added today, but not sure if it's still being worked on. If it's finalized you could check off Milestone 1 above, otherwise let us know when you think it'll be ready. Thanks!

@GregBrimble please take a close look at the outline to see whether we need any custom metrics to extract structured data info from the DOM at runtime. Those would need to be written and merged no later than the end of the month to be added to the test pipeline in time.

jonoalderson · 2021-06-16T07:57:46Z

Hello hello! I'm happy with the chapter outline, and will check off the milestone now.

@GregBrimble, I think we need to explore your message in Slack (https://httparchive.slack.com/archives/C021GGN9W4D/p1623610269059000) ASAP, as that might influence our next steps.

GregBrimble · 2021-08-04T12:13:13Z

And we've got the run's results! July's data is up so we can now play around in BigQuery.

I've started the queries in #2293, and have requested edit access to the results sheet so I can start putting stuff down there. Checked the error log as a first priority, and so far, it looks pretty good. We have our structured data custom metrics on 13,775,158 of the 13,778,213 pages we've run against. We captured 508 error logs, and I'm assuming the rest (2,547) failed so hard that we couldn't even capture the exception. 99.98% success is good enough for me.
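The coverage figures quoted above can be checked with a few lines of Python. The variable names are ad hoc; the numbers are the ones from the comment.

```python
total_pages = 13_778_213         # pages the crawl ran against
pages_with_metrics = 13_775_158  # pages with structured data custom metrics
logged_errors = 508              # captured error logs

# Pages without metrics, and how many of those left no error log behind.
missing = total_pages - pages_with_metrics
unlogged_failures = missing - logged_errors
success_rate = pages_with_metrics / total_pages

print(missing)                # 3055
print(unlogged_failures)      # 2547
print(f"{success_rate:.2%}")  # 99.98%
```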

This analysis is due September 30, but I can't imagine it takes nearly that long. The hardest bit will be the JSON-LD parsing, which in all likelihood I'm going to do locally. I'll do bits and pieces over the next few days, so keep an eye on that linked PR to follow along :)
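As a hedged illustration of why the JSON-LD parsing is the fiddly part, the Python sketch below tallies `@type` values from raw JSON-LD blocks. It is not the parsing that was actually used for the chapter: the function names and sample blocks are hypothetical, and it walks plain JSON rather than doing full JSON-LD processing (no `@context` expansion).

```python
import json
from collections import Counter


def collect_types(node, counts):
    """Recursively tally every @type value found in a parsed JSON-LD node."""
    if isinstance(node, list):
        for item in node:
            collect_types(item, counts)
    elif isinstance(node, dict):
        types = node.get("@type")
        # @type may be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        if isinstance(types, list):
            for type_name in types:
                if isinstance(type_name, str):
                    counts[type_name] += 1
        # Descend into every value, which covers @graph and nested entities.
        for value in node.values():
            collect_types(value, counts)


def count_types(raw_blocks):
    """Parse each raw JSON-LD string and return a Counter of @type values."""
    counts = Counter()
    for raw in raw_blocks:
        try:
            collect_types(json.loads(raw), counts)
        except json.JSONDecodeError:
            # Malformed blocks are tallied separately instead of aborting the run.
            counts["(invalid JSON)"] += 1
    return counts


# Hypothetical sample blocks, loosely based on the example earlier in the thread.
blocks = [
    '{"@type": "Person", "nationality": {"@type": "Country", "name": "United Kingdom"}}',
    '{"@graph": [{"@type": "WebSite"}, {"@type": ["Person", "Organization"]}]}',
    '{not valid json',
]

print(count_types(blocks))
```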

jonoalderson · 2021-08-04T13:26:18Z

This is monumentally exciting!
I'll make a start on some of the 'generic' content/write-up (fleshed out introductions, etc) in the meantime!

rviscomi · 2021-09-13T15:12:53Z

👋 Hi @jono-alderson @cyberandy @GregBrimble, just checking in on the chapter progress. How is the analysis coming along?

GregBrimble · 2021-09-13T17:21:19Z

Hey, made decent progress the weekend before last, but it's a busy week at work, this week, so haven't had a chance to get back to it. I'll get this completed next weekend ☺️

jonoalderson · 2021-09-30T07:07:51Z

Any updates from your side, @GregBrimble ?

rviscomi · 2021-11-29T22:25:12Z

@jonoalderson @cyberandy @kevinmarks @vdwijngaert @jvandriel @philbarker @GregBrimble @jvandriel @JasmineDWillson

Thank you all for your hard work getting this chapter over the finish line in time for the pre-release. Structured Data has been the most-read (English-version) chapter in the past couple of weeks! Congratulations on finishing the chapter, and I'm excited to see us launch the rest of the chapters alongside it on Wednesday 🎉

When you get 5 minutes, I'd really appreciate if you could fill out our contributor survey to tell us (the project leads) about your experience. It's super helpful to hear what went well or what could be improved for next time. 🙏

rviscomi added the 2021 chapter and help wanted labels Apr 27, 2021

rviscomi mentioned this issue Apr 27, 2021

📣 Contribute to the 2021 Web Almanac #2167

Closed

rviscomi added this to the 2021 Content Team Assignments milestone Apr 27, 2021

jonoalderson mentioned this issue Apr 27, 2021

Performance 2021 #2148

Closed

6 tasks

github-actions bot mentioned this issue Apr 30, 2021

2021 Chapter Progress #2179

Closed

rviscomi added the help wanted: analysts and help wanted: reviewers labels May 4, 2021

rviscomi assigned jonoalderson May 4, 2021

rviscomi mentioned this issue May 11, 2021

Accessibility 2021 #2147

Closed

6 tasks

rviscomi removed the help wanted, help wanted: analysts, and help wanted: reviewers labels May 11, 2021

jonoalderson added the help wanted: analysts label May 16, 2021

rviscomi removed the help wanted: analysts label May 17, 2021

GregBrimble mentioned this issue Jun 28, 2021

Structured Data 2021 HTTPArchive/legacy.httparchive.org#218

Merged

GregBrimble mentioned this issue Aug 4, 2021

Structured Data 2021 queries #2293

Merged

21 tasks

rviscomi modified the milestones: 2021 Content Planning, 2021 Analysis Oct 1, 2021

rviscomi modified the milestones: 2021 Analysis, 2021 Content Writing Oct 14, 2021

This was referenced Nov 6, 2021

Structured Data 2021 SQL update - add page counts #2442

Merged

Structured Data 2021 Markdown #2466

Merged

rviscomi closed this as completed in #2466 Nov 13, 2021

tunetheweb mentioned this issue Nov 18, 2021

Add Sankey diagram to Structured Data chapter #2560

Merged
