[FCL-492] ADR 21: XML format for documents where source text cannot be extracted by parser #183
People seem broadly happy; work in Jim's notes, and maybe Nick's ADR, for the final version.
doc/adr/0021-xml-with-no-docx.md
Outdated
Metadata is provided to the parser via spreadsheets. We will want to preserve some of this data.

The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment. Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.
I'm not immediately clear on what this MUST NOT is talking about - is it that we shouldn't be copying them into the body unless the body text already contains them and we can semantically extract them?
My gut feeling is "tags should appear in one place or the other, not both, and we search on a combination of akn:party and uk:party"
The alternative is "we reparse everything and just search for uk:party".
The first made more sense when I thought we could just drop an akn:party into the proprietary namespace; we wouldn't need to update our search at all.
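As a rough illustration of the first option (searching on a combination of `akn:party` and `uk:party`), here is a minimal Python sketch. The namespace URIs, the sample document, and the `party_names` helper are all assumptions made for the example; the sample is hand-written, is not schema-valid Akoma Ntoso, and does not reflect the real search XPaths.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check these against real documents before relying on them.
NS = {
    "akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0",
    "uk": "https://caselaw.nationalarchives.gov.uk/akn",
}

# Hypothetical, hand-written sample: one party only in the proprietary
# metadata (uk:party) and one only in the running text (akn:party).
SAMPLE = """
<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"
            xmlns:uk="https://caselaw.nationalarchives.gov.uk/akn">
  <judgment>
    <meta>
      <proprietary>
        <uk:party>Smith</uk:party>
      </proprietary>
    </meta>
    <header>
      <p><party>Jones</party></p>
    </header>
  </judgment>
</akomaNtoso>
""".strip()


def party_names(root):
    """Collect party names from both akn:party and uk:party, without duplicates."""
    names = []
    # ElementTree's XPath subset has no union operator, so query each tag in turn.
    for tag in ("akn:party", "uk:party"):
        for element in root.iterfind(f".//{tag}", NS):
            text = "".join(element.itertext()).strip()
            if text and text not in names:
                names.append(text)
    return names


if __name__ == "__main__":
    print(party_names(ET.fromstring(SAMPLE)))
```

The de-duplication step reflects the "one place or the other, not both" rule: if a value does end up replicated, the combined search still reports it once.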
doc/adr/0021-xml-with-no-docx.md
Outdated
Metadata is provided to the parser via spreadsheets. We will want to preserve some of this data.

The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment. Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.
We haven't used `external-metadata` in @jurisdatum's prototype, but I think we can probably ditch it given that we're using things like `uk:party` across the board now.
Hmmm. This does potentially mean that giving editors easy editing access to a set of fields is more complicated (we can't just expose those fields as an arbitrary table).
doc/adr/0021-xml-with-no-docx.md
Outdated
TODO: I think this part wants tearing apart carefully.

Each of these metrics defaults to `1`: we have no reason to doubt the quality of the document. Values of `0` imply a complete lack of quality, or an absence of that feature. Values below `0.5` should be considered poor, and mitigations applied. Values above `1` might be signs of additional verification.
I wonder if we should add at least one boolean here:

`uk:markup-human-reviewed`: An editor has actually eyeballed this document and given it the all-clear, vs it being handled entirely by the computer (so we can distinguish "this has been spot checked" from "this has been autopublished").

I'm also wondering if this is the point where we could indicate what level of parsing has even been attempted, maybe one of `none` (i.e. we've literally got metadata and that's it), `text-only` (we've had a stab at the words, but no guarantee they're in the right order) or `structured` (the parser has tried to extract the structure of the document as well).
Added `markup-human-reviewed`.

The latter feels like what the scores are for: the first is 0 markup, 0 text; the second is 0 markup, 0.3 text; the third is 0.8 markup, 0.8 words?
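To make the mapping from parsing levels to scores concrete, here is a minimal Python sketch. The `QualityScores` class, its field names (`markup`, `text`) and the three example constants are assumptions for illustration only; the default of `1`, the `0.5` threshold and the example values come from this thread.

```python
from dataclasses import dataclass

# "Values below 0.5 should be considered poor, and mitigations applied."
POOR_THRESHOLD = 0.5


@dataclass
class QualityScores:
    # Each metric defaults to 1: no reason to doubt the document.
    markup: float = 1.0
    text: float = 1.0
    # Boolean suggested in review: has an editor actually checked the markup?
    markup_human_reviewed: bool = False

    def poor_metrics(self):
        """Return the names of metrics that fall below the poor threshold."""
        return [
            name
            for name in ("markup", "text")
            if getattr(self, name) < POOR_THRESHOLD
        ]


# The three parsing levels from the thread, expressed as scores.
METADATA_ONLY = QualityScores(markup=0.0, text=0.0)  # "none"
TEXT_ONLY = QualityScores(markup=0.0, text=0.3)      # "text-only"
STRUCTURED = QualityScores(markup=0.8, text=0.8)     # "structured"

if __name__ == "__main__":
    for label, scores in [
        ("none", METADATA_ONLY),
        ("text-only", TEXT_ONLY),
        ("structured", STRUCTURED),
    ]:
        print(label, scores.poor_metrics())
```

One consequence worth noting: with these example values, `none` and `text-only` both flag the same two metrics as poor, so anything that needs to tell them apart has to look at the score values rather than the threshold alone.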
---

## Appendix: Current Search XPATH
This feels like something we should document elsewhere and link to, possibly alongside the actual query xpaths in the ML repo.
doc/adr/0021-xml-with-no-docx.md
Outdated
#### Rejected for Impossible Tasks

Documents with no DOCX cannot be reparsed, so SHOULD be excluded from the list of reparse candidates, SHOULD not be reparsable in the UI and MUST not be sent for reparsing.
This needs to be a bit more complex I think:
- Documents which do not have a parseable source document MUST NOT be sent for reparsing at any point, as we can't do anything with them. We are still doing some kind of parsing on the PDFs; it's just not trying to extract any kind of meaning from the document text itself. The list of source documents we consider viable for parsing might change over time.
- Documents which do not have structured markup MUST NOT be sent for enrichment, because down that road madness lies.
Enrichment will actually work just about fine without markup; the existence of ref-tags may be useful.
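The reparse rule discussed above could be sketched roughly as follows. This is only an illustration: the `Document` class, the `PARSEABLE_SOURCE_TYPES` set and the function names are hypothetical, and the set of viable source types is an assumption that the thread notes may change over time. Enrichment is left out because the thread has not settled whether it should be blocked.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only DOCX sources are currently viable for reparsing.
PARSEABLE_SOURCE_TYPES = {"docx"}


@dataclass
class Document:
    uri: str
    source_type: Optional[str]  # e.g. "docx", "pdf", or None if no source is held


def can_reparse(doc):
    """A document is reparsable only if it has a parseable source document."""
    return doc.source_type in PARSEABLE_SOURCE_TYPES


def reparse_candidates(docs):
    """Exclude documents without a parseable source from the candidate list."""
    return [doc for doc in docs if can_reparse(doc)]


def send_for_reparse(doc):
    """Refuse outright (MUST NOT) rather than silently skipping."""
    if not can_reparse(doc):
        raise ValueError(f"{doc.uri} has no parseable source document")
    return f"queued {doc.uri}"  # placeholder for the real queueing call


if __name__ == "__main__":
    docs = [Document("doc-a", "docx"), Document("doc-b", "pdf")]
    print([doc.uri for doc in reparse_candidates(docs)])
```

Keeping the check in one function means the candidate list, the UI and the sending step can all share it, so a later change to the viable source types only needs to happen in one place.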
FCL-492
Part of FCL-454