[FCL-492] ADR 21: XML format for documents where source text cannot be extracted by parser #183

dragon-dxw · 2024-11-20T16:27:06Z

FCL-492

Part of FCL-454

doc/adr/0021-xml-with-no-docx.md

dragon-dxw

People seem broadly happy; work in Jim's notes, and maybe Nick's ADR, for the final version.

jacksonj04 · 2024-11-27T09:55:07Z

doc/adr/0021-xml-with-no-docx.md

+
+Metadata is provided to the parser via spreadsheets. We will want to preserve some of this data.
+
+The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment. Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.


I'm not immediately clear on what this MUST NOT is talking about - is it that we shouldn't be copying them into the body unless the body text already contains them and we can semantically extract them?

My gut feeling is "tags should appear in one place or the other, not both, and we search on a combination of akn:party and uk:party"

The alternative is "we reparse everything and just search for uk:party".

The first made more sense when I thought we could just drop an akn:party into the proprietary namespace; we wouldn't need to update our search at all.
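
As a rough illustration of the first option (tags live in one place or the other, and search looks at both `akn:party` and `uk:party`), here is a minimal Python sketch. The namespace URIs, the sample document and the `find_parties` helper are assumptions made for illustration; this is not the real search query.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs, for illustration only.
NS = {
    "akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0",
    "uk": "https://caselaw.nationalarchives.gov.uk/akn",
}

# Hypothetical document: one party tagged in the running body (akn:party)
# and one supplied only as proprietary metadata (uk:party).
SAMPLE = """
<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"
            xmlns:uk="https://caselaw.nationalarchives.gov.uk/akn">
  <judgment>
    <meta>
      <proprietary>
        <uk:party>Jones</uk:party>
      </proprietary>
    </meta>
    <header>
      <p><party>Smith</party></p>
    </header>
  </judgment>
</akomaNtoso>
"""


def find_parties(xml_text):
    """Collect party names from both namespaces, without duplicates."""
    root = ET.fromstring(xml_text)
    names = []
    for path in (".//akn:party", ".//uk:party"):
        for element in root.findall(path, NS):
            text = (element.text or "").strip()
            if text and text not in names:
                names.append(text)
    return names


print(find_parties(SAMPLE))
```

The deduplication step matters only if a value ends up in both places, which the MAY clause in the quoted text allows.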

jacksonj04 · 2024-11-27T09:56:13Z

doc/adr/0021-xml-with-no-docx.md

+
+Metadata is provided to the parser via spreadsheets. We will want to preserve some of this data.
+
+The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment. Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.


We haven't used external-metadata in @jurisdatum's prototype, but I think we can probably ditch it given that we're using things like uk:party across the board now.

Hmmm. This does potentially mean allowing editors easy editing access to a set of fields is more complicated (we can't just expose those fields as an arbitrary table).

jacksonj04 · 2024-11-27T10:11:45Z

doc/adr/0021-xml-with-no-docx.md

+
+TODO: I think this part wants tearing apart carefully.
+
+Each of these metrics defaults to `1`: we have no reason to doubt the quality of the document. Values of `0` imply a complete lack of quality, or an absence of that feature. Values below `0.5` should be considered poor, and mitigations applied. Values above `1` might be signs of additional verification.


I wonder if we should add at least one boolean here:

uk:markup-human-reviewed: An editor has actually eyeballed this document and given it the all-clear, as opposed to it being handled entirely by the computer (so we can distinguish "this has been spot checked" from "this has been autopublished").

I'm also wondering if this is the point where we could indicate what level of parsing has even been attempted, maybe one of none (i.e. we've literally got metadata and that's it), text-only (we've had a stab at the words, but no guarantee they're in the right order) or structured (the parser has tried to extract the structure of the document as well).

Added markup-human-reviewed.
The latter feels like it's what the scores already express: the first is 0 markup, 0 text; the second is 0 markup, 0.3 text; the third is 0.8 markup, 0.8 text?
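
To make that mapping concrete, here is a small Python sketch that derives the three proposed parse levels from two scores. The function names, the parameter names and the rule that any non-zero score counts as an attempt are assumptions; only the `0.5` "poor" threshold and the three example score pairs come from the discussion above.

```python
def parse_level(markup_score, text_score):
    """Infer which level of parsing was attempted from two quality scores.

    Assumption: a score of 0 means the feature is absent, so any score
    above 0 means the parser at least attempted that feature.
    """
    if markup_score > 0:
        return "structured"
    if text_score > 0:
        return "text-only"
    return "none"


def needs_mitigation(score):
    """Values below 0.5 are considered poor and call for mitigations."""
    return score < 0.5


# The three example score pairs from the comment: (markup, text).
for markup, text in [(0, 0), (0, 0.3), (0.8, 0.8)]:
    print(parse_level(markup, text), needs_mitigation(text))
```

If the levels really can be read off the scores like this, a separate parse-level field would be redundant; if they cannot, that is an argument for storing the level explicitly.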

jacksonj04 · 2024-11-27T10:12:18Z

doc/adr/0021-xml-with-no-docx.md

+
+---
+
+## Appendix: Current Search XPATH


This feels like something we should document elsewhere and link to, possibly alongside the actual query xpaths in the ML repo.

jacksonj04 · 2024-11-27T10:16:57Z

doc/adr/0021-xml-with-no-docx.md

+
+#### Rejected for Impossible Tasks
+
+Documents with no DOCX cannot be reparsed, so SHOULD be excluded from the list of reparse candidates, SHOULD not be reparsable in the UI and MUST not be sent for reparsing.


This needs to be a bit more complex I think:

Documents which do not have a parseable source document MUST NOT be sent for reparsing at any point, as we can't do anything with them. We are still doing some kind of parsing on the PDFs; it's just not trying to extract any kind of meaning from the document text itself. The list of source documents we consider viable for parsing might change over time.

Documents which do not have structured markup MUST NOT be sent for enrichment, because down that road madness lies.

Enrichment will actually work just about fine without markup; the existence of ref-tags may be useful.
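
A minimal sketch of the reparse rule, assuming a hypothetical `Document` record with a `source_type` field and assuming DOCX is the only source type currently viable for reparsing. The set of viable types is passed as a parameter because, as noted above, it may change over time. Enrichment is deliberately left out, since the thread has not settled whether it needs markup.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only DOCX sources can currently be reparsed.
PARSEABLE_SOURCE_TYPES = frozenset({"docx"})


@dataclass
class Document:
    # Hypothetical record; the field names are illustrative only.
    uri: str
    source_type: Optional[str]  # e.g. "docx", "pdf", or None if no source


def can_reparse(doc, parseable=PARSEABLE_SOURCE_TYPES):
    """A document MUST NOT be sent for reparsing without a parseable source."""
    return doc.source_type in parseable


def reparse_candidates(docs, parseable=PARSEABLE_SOURCE_TYPES):
    """Filter a list of documents down to those that may be reparsed."""
    return [doc for doc in docs if can_reparse(doc, parseable)]


docs = [
    Document("example/1", "docx"),
    Document("example/2", "pdf"),
    Document("example/3", None),
]
print([doc.uri for doc in reparse_candidates(docs)])
```

The same check could back both the candidate list and the UI, so the SHOULD and MUST rules cannot drift apart.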

ADR 21: XML with no docx

4c7492f

dragon-dxw force-pushed the xml-with-no-docx branch from e508c69 to 4c7492f · November 20, 2024 16:29

dragon-dxw commented Nov 26, 2024

jacksonj04 requested changes Nov 27, 2024

jacksonj04 changed the title ~~ADR 21: XML with no docx~~ ADR 21: XML format for documents where source text cannot be extracted by parser Nov 27, 2024

jacksonj04 changed the title ~~ADR 21: XML format for documents where source text cannot be extracted by parser~~ [FCL-492] ADR 21: XML format for documents where source text cannot be extracted by parser Nov 27, 2024

Addressing issues

4e44342
