[FCL-492] ADR 21: XML format for documents where source text cannot be extracted by parser #183

Draft: wants to merge 2 commits into base: main

@dragon-dxw (Collaborator) commented Nov 20, 2024

FCL-492

Part of FCL-454

@dragon-dxw (Collaborator, Author) left a comment

People seem broadly happy; work in Jim's notes, and maybe Nick's ADR, for the final version.

doc/adr/0021-xml-with-no-docx.md

Metadata is provided to the parser via spreadsheets. We will want to preserve some of this data.

The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment. Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.
Collaborator

I'm not immediately clear on what this MUST NOT is talking about - is it that we shouldn't be copying them into the body unless the body text already contains them and we can semantically extract them?

Collaborator Author

My gut feeling is "tags should appear in one place or the other, not both, and we search on a combination of akn:party and uk:party"

The alternative is "we reparse everything and just search for uk:party".

The first made more sense when I thought we could just drop an akn:party into the proprietary namespace; we wouldn't need to update our search at all.
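
As a rough illustration of the "search on a combination of akn:party and uk:party" option, here is a minimal Python sketch using only the standard library. The sample document, the `find_parties` helper, and the exact placement of `uk:party` inside `uk:external-metadata` are assumptions made for illustration, and the `uk` namespace URI is assumed rather than taken from this PR. It is not the project's actual search code.

```python
import xml.etree.ElementTree as ET

# Namespace map. The akn URI is the Akoma Ntoso 3.0 namespace; the uk URI
# is an assumption for this sketch.
NS = {
    "akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0",
    "uk": "https://caselaw.nationalarchives.gov.uk/akn",
}

# Hypothetical document: one party comes from externally supplied metadata,
# the other is tagged in the running body of the judgment.
SAMPLE = """\
<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"
            xmlns:uk="https://caselaw.nationalarchives.gov.uk/akn">
  <judgment name="judgment">
    <meta>
      <proprietary source="#">
        <uk:external-metadata>
          <uk:party>Example Claimant Ltd</uk:party>
        </uk:external-metadata>
      </proprietary>
    </meta>
    <header>
      <p><party>Example Defendant plc</party></p>
    </header>
  </judgment>
</akomaNtoso>
"""


def find_parties(root):
    """Collect party names from body tags and external metadata, without duplicates."""
    names = []
    for path in (".//akn:party", ".//uk:external-metadata/uk:party"):
        for element in root.findall(path, NS):
            text = "".join(element.itertext()).strip()
            if text and text not in names:
                names.append(text)
    return names


if __name__ == "__main__":
    print(find_parties(ET.fromstring(SAMPLE)))
```

De-duplicating in the helper matters because the quoted text allows a value to appear in both places when the body parses successfully and the same string was also supplied externally.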


Metadata is provided to the parser via spreadsheets. We will want to preserve some of this data.

The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment. Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.
Collaborator

We haven't used external-metadata in @jurisdatum's prototype, but I think we can probably ditch it given that we're using things like uk:party across the board now.

Collaborator Author

Hmmm. This does potentially mean that giving editors easy editing access to a set of fields is more complicated (we can't just expose those fields as an arbitrary table).

TODO: I think this part wants tearing apart carefully.

Each of these metrics defaults to `1`: we have no reason to doubt the quality of the document. Values of `0` imply a complete lack of quality, or an absence of that feature. Values below `0.5` should be considered poor, and mitigations applied. Values above `1` might be signs of additional verification.
Collaborator

I wonder if we should add at least one boolean here:

  • uk:markup-human-reviewed: An editor has actually eyeballed this document and given it the all-clear vs it being handled entirely by the computer (so we can distinguish "this has been spot checked" from "this has been autopublished")

I'm also wondering if this is the point where we could indicate what level of parsing has even been attempted, maybe one of none (i.e. we've literally got metadata and that's it), text-only (we've had a stab at the words, but no guarantee they're in the right order) or structured (the parser has tried to extract the structure of the document as well).

Collaborator Author

Added markup-human-reviewed.
The latter feels like what the scores already capture: the first is 0 markup, 0 text; the second is 0 markup, 0.3 text; the third is 0.8 markup, 0.8 text?
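
To make the scoring discussion concrete, here is a small Python sketch of how the thresholds in the quoted paragraph could be interpreted. The `QualityScores` class, its field names, and the band labels are hypothetical; only the default of `1`, the meaning of `0`, the `0.5` threshold, the "above `1`" case, and the three example score pairs come from the conversation above.

```python
from dataclasses import dataclass

POOR_THRESHOLD = 0.5  # below this, the ADR text says mitigations should apply


@dataclass
class QualityScores:
    """Hypothetical container for the quality metrics discussed in this thread."""
    markup: float = 1.0  # default 1: no reason to doubt the document
    text: float = 1.0
    markup_human_reviewed: bool = False  # the boolean suggested in review


def describe(score: float) -> str:
    """Map a single score onto the bands described in the ADR draft."""
    if score == 0:
        return "absent"
    if score < POOR_THRESHOLD:
        return "poor"
    if score > 1:
        return "additionally verified"
    return "acceptable"


def needs_mitigation(scores: QualityScores) -> bool:
    """True if either score falls below the poor threshold (including 0)."""
    return scores.markup < POOR_THRESHOLD or scores.text < POOR_THRESHOLD


if __name__ == "__main__":
    # The three parse levels from the comment, expressed as score pairs.
    examples = {
        "none": QualityScores(markup=0, text=0),
        "text-only": QualityScores(markup=0, text=0.3),
        "structured": QualityScores(markup=0.8, text=0.8),
    }
    for level, scores in examples.items():
        print(level, describe(scores.markup), describe(scores.text),
              needs_mitigation(scores))
```

Whether the parse level should be derived from scores like this or stored as its own field is exactly the open question in this thread; the sketch only shows that the derivation is mechanically possible.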


---

## Appendix: Current Search XPATH
Collaborator

This feels like something we should document elsewhere and link to, possibly alongside the actual query xpaths in the ML repo.


#### Rejected for Impossible Tasks

Documents with no DOCX cannot be reparsed, so SHOULD be excluded from the list of reparse candidates, SHOULD NOT be reparsable in the UI and MUST NOT be sent for reparsing.
Collaborator

This needs to be a bit more complex I think:

  • Documents which do not have a parseable source document MUST NOT be sent for reparsing at any point, as we can't do anything with them. We are doing some kind of parsing on the PDFs, but it isn't trying to extract any kind of meaning from the document text itself. The list of source documents we consider viable for parsing might change over time.
  • Documents which do not have structured markup MUST NOT be sent for enrichment, because down that road madness lies.

Collaborator Author

Enrichment will actually work just about fine without markup; the existence of ref-tags may be useful.
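
The reparse rule from this thread could be enforced with a guard along the following lines. This is a sketch under stated assumptions: the `Document` class, the `source_type` field, the `PARSEABLE_SOURCE_TYPES` set, and the function names are all hypothetical and not taken from the codebase. Enrichment is deliberately left out, since the thread does not settle whether it should be blocked.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed set of source formats the parser can extract text from. The thread
# notes this list may change over time, so it is kept as configuration.
PARSEABLE_SOURCE_TYPES = frozenset({"docx"})


@dataclass(frozen=True)
class Document:
    """Hypothetical minimal view of a stored document."""
    uri: str
    source_type: Optional[str]  # e.g. "docx", "pdf", or None if no source


def can_reparse(doc: Document, parseable=PARSEABLE_SOURCE_TYPES) -> bool:
    """A document is reparsable only if it has a parseable source document."""
    return doc.source_type is not None and doc.source_type.lower() in parseable


def reparse_candidates(docs: Iterable[Document]) -> List[Document]:
    """SHOULD: exclude unparseable documents from the candidate list."""
    return [doc for doc in docs if can_reparse(doc)]


def send_for_reparse(doc: Document) -> str:
    """MUST NOT: refuse outright rather than relying on callers to filter."""
    if not can_reparse(doc):
        raise ValueError(f"{doc.uri} has no parseable source; refusing to reparse")
    return f"queued {doc.uri}"  # placeholder for the real queueing call


if __name__ == "__main__":
    docs = [Document("a/2024/1", "docx"), Document("b/2024/2", "pdf")]
    print([doc.uri for doc in reparse_candidates(docs)])
```

Putting the hard check inside `send_for_reparse` reflects the difference between the SHOULD and MUST NOT clauses: the candidate list and UI are conveniences, while the send path is the one place the rule cannot be bypassed.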

@jacksonj04 changed the title from "ADR 21: XML with no docx" to "ADR 21: XML format for documents where source text cannot be extracted by parser" on Nov 27, 2024
@jacksonj04 changed the title from "ADR 21: XML format for documents where source text cannot be extracted by parser" to "[FCL-492] ADR 21: XML format for documents where source text cannot be extracted by parser" on Nov 27, 2024
2 participants