TEI translator full rewrite #3245
base: master
Tested with malformed HTML in Zotero fields.
…into tei_export
TEI.js
Outdated
/* 2024, Frédéric Glorieux.
// item produced by Zotero
I wish we had test cases for export translators, but this is a lot to put at the top of the translator. We can keep it in the commit history but should remove from the translator before merging IMO.
What I was trying to provide was a kind of documentation for end users. TEI is always an interpretation; there is more than one way to encode a piece of information. To understand how the old translator worked, I had to test various kinds of records and trace their fields into the TEI output, so I thought it would be a good idea to explain the code a bit more. You are right that a source-code comment is not the right place, and test cases are a good idea, but a proper home for the documentation still has to be found.
* Imitated from zotero source code
* https://github.com/zotero/zotero/blob/main/chrome/content/zotero/itemTree.jsx#L2472
FWIW, Zotero is imitating the behavior of Citeproc.js here, so let's link straight to the source too.
Link added. Beyond the spec, @dstillman's remark in #3054 (comment) made me rewrite the parser to handle tricky cases like:
Not closing <b> or<strong>unknown tag</strong>, kept as text nodes; <i>italic</i>, is an element
Reading the code helped me make the right choice in each case.
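The behaviour described for that tricky string can be sketched roughly as follows. This is not the parser from TEI.js; it is a minimal illustration under stated assumptions: `parseRichText` and `KNOWN_TAGS` are hypothetical names, only attribute-less `<i>`, `<b>`, `<sub>` and `<sup>` are treated as markup, and any known tag that is never closed is demoted back to literal text.

```javascript
// Hypothetical sketch, not the TEI.js implementation.
// Assumption: only these attribute-less tags count as rich-text markup.
const KNOWN_TAGS = new Set(["i", "b", "sub", "sup"]);

function parseRichText(text) {
  const root = { tag: null, children: [] };
  // Each stack entry remembers the opening token so it can be restored as text.
  const stack = [{ node: root, token: "" }];
  const top = () => stack[stack.length - 1].node;
  // Append text, merging with a preceding text node when there is one.
  const appendText = (node, s) => {
    if (!s) return;
    const kids = node.children;
    if (typeof kids[kids.length - 1] === "string") kids[kids.length - 1] += s;
    else kids.push(s);
  };

  const tagRE = /<(\/?)([a-zA-Z]+)>/g;
  let last = 0;
  let m;
  while ((m = tagRE.exec(text)) !== null) {
    appendText(top(), text.slice(last, m.index));
    last = tagRE.lastIndex;
    const [token, slash, rawName] = m;
    const name = rawName.toLowerCase();
    if (!KNOWN_TAGS.has(name)) {
      appendText(top(), token); // unknown tag: keep as literal text
    } else if (!slash) {
      stack.push({ node: { tag: name, children: [] }, token }); // open element
    } else if (stack.length > 1 && top().tag === name) {
      const closed = stack.pop().node; // properly closed element
      top().children.push(closed);
    } else {
      appendText(top(), token); // stray closing tag: keep as literal text
    }
  }
  appendText(top(), text.slice(last));

  // Demote elements that were never closed: restore the opening token as text
  // and hand their children to the parent.
  while (stack.length > 1) {
    const { node, token } = stack.pop();
    const parent = top();
    appendText(parent, token);
    for (const child of node.children) {
      if (typeof child === "string") appendText(parent, child);
      else parent.children.push(child);
    }
  }
  return root.children;
}
```

On the example above, this sketch keeps `<b>` and the `<strong>` pair as text and yields a single `i` element, which is the outcome described in the comment. Tags with attributes are left as text here, which is a simplification of this sketch.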
TEI.js
Outdated
let discardedNode = nodeStack.pop();
nodeStack[0].append(discardedMarkup.token, ...discardedNode.childNodes);
}
// return textContent; // lint see it’s not used
?
Old code used for debugging. Deleted.
TEI.js
Outdated
if (!html) return;
// import html as dom
let dom = xmlParser.parseFromString(html, "text/html");
let body = dom.getElementsByTagName("body").item(0);
- let body = dom.getElementsByTagName("body").item(0);
+ let body = dom.body;
OK.
* @param {*} item
* @returns
*/
function parseExtraFields(item) {
Why do we need this? Won't it happen upon saving the item anyway?
I’m sorry, I don’t understand your question. In the context of a translator, do you mean that I can get the item in a state where the Extra field has already been parsed to CSL? I would be glad if extraToCSL() were available through ZU, but it is not.
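To make the discussion concrete, here is a rough sketch of what parsing "Key: value" lines out of the Extra field could look like. It is not the `parseExtraFields` from this PR and not Zotero's `extraToCSL()`; the name `parseExtraLines`, the key normalisation, and the rule that a colon must be followed by whitespace are all assumptions made for illustration.

```javascript
// Hypothetical sketch; the real parseExtraFields in TEI.js may differ.
// Assumption: a field line looks like "Some Key: value", everything else is a note.
function parseExtraLines(extra) {
  const fields = {};
  const rest = [];
  for (const line of String(extra || "").split(/\r?\n/)) {
    if (line.trim() === "") continue; // skip blank lines
    const m = /^([A-Za-z][A-Za-z -]*):\s+(\S.*)$/.exec(line);
    if (!m) {
      rest.push(line); // not a "Key: value" line, keep it as free text
      continue;
    }
    // Normalise "Original Date" to "original-date" (an assumed convention).
    const key = m[1].trim().toLowerCase().replace(/\s+/g, "-");
    // Keep the first occurrence of a key; later duplicates are ignored.
    if (!(key in fields)) fields[key] = m[2].trim();
  }
  return { fields, rest };
}
```

For example, `parseExtraLines("Original Date: 1923\nfree note")` separates the `original-date` field from the untouched note line.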
TEI.js
Outdated
// unicode classes seems not supported
// xmlid = xmlid.normalize("NFD").replace(/\p{M}+/u, '');
Use XRegExp:
Line 138 in edf44a1
var nameFormat1RE = new ZU.XRegExp("^\\p{Letter}+\\s\\p{Letter}+\\s\\p{Letter}+$");
This will be supported natively in Z7+.
Thanks for the tip.
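For reference, the diacritic-stripping idea from the commented-out line can be sketched with native Unicode property escapes, which only works where the engine supports `\p{M}` (per the comment above, Zotero 7 and later; otherwise `ZU.XRegExp` is the suggested route). The function name `toXmlId` and the rules for replacing other characters are assumptions for illustration, not the PR's code. Note the `g` flag: without it, `replace` would strip only the first run of combining marks.

```javascript
// Hypothetical sketch: build an ASCII-only identifier usable as an xml:id.
// Assumption: the engine supports Unicode property escapes (\p{M}).
function toXmlId(label) {
  return String(label)
    .normalize("NFD")                  // split accented letters into base + combining marks
    .replace(/\p{M}+/gu, "")           // drop all combining marks (g flag: every run)
    .replace(/[^A-Za-z0-9_.-]+/g, "_") // anything else outside the safe set becomes "_"
    .replace(/^_+|_+$/g, "")           // trim underscores at both ends
    .replace(/^(?=[0-9.-])/, "_");     // an XML name cannot start with a digit, "." or "-"
}
```

With this sketch, `toXmlId("Frédéric Glorieux 2024")` gives `Frederic_Glorieux_2024`.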
TEI.js
Outdated
let iso = null;
let year = Number(date.year);
if (isNaN(year)) return iso;
iso = String(date.year).padStart(4, '0');
let month = Number(date.month);
if (isNaN(month)) return iso;
// january = 0
iso += '-' + String(date.month + 1).padStart(2, '0');
if (!date.day) return iso;
iso += '-' + String(date.day).padStart(2, '0');
return iso;
Why not strToISO?
Sorry, I missed ZU.strToISO(item.date). Deleted.
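For readers outside the translator sandbox, where `ZU.strToISO` is not at hand, the deleted helper's logic can be sketched as a standalone function. This is an illustration only, under assumptions: `datePartsToISO` and `toInt` are hypothetical names, the input is an object with `year`, `month` (zero-based, as the "january = 0" comment in the diff says) and `day`, and no range validation is done. Converting each part to an integer before adding 1 avoids string concatenation if a month ever arrives as a string.

```javascript
// Hypothetical standalone sketch; inside a translator, prefer ZU.strToISO(item.date).
function toInt(value) {
  if (value === undefined || value === null || value === "") return null;
  const n = Number(value);
  return Number.isInteger(n) ? n : null;
}

// Assumption: date = { year, month, day } with a zero-based month.
function datePartsToISO(date) {
  const year = toInt(date.year);
  if (year === null) return null;
  let iso = String(year).padStart(4, "0");
  const month = toInt(date.month);
  if (month === null) return iso; // year only
  iso += "-" + String(month + 1).padStart(2, "0"); // zero-based to 01..12
  const day = toInt(date.day);
  if (day === null) return iso; // year and month only
  return iso + "-" + String(day).padStart(2, "0");
}
```

A partial date simply yields a shorter string, for example a year alone or a year and month.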
The rights field is now handled. Example: item.rights = "© Gallimard";
<biblStruct type="book" corresp="http://zotero.org/groups/5048422/items/RPRC4566">
<monogr xml:lang="fr">
<!-- ... -->
<availability status="restricted">
<p>© Gallimard</p>
</availability>
</monogr>
</biblStruct>
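The mapping shown in this output can be sketched as a small string-building helper. This is not how TEI.js does it (the PR may well build DOM nodes instead); `availabilityFromRights` and `escapeXml` are hypothetical names, and the `status="restricted"` value is copied from the example above without knowing whether the translator always uses it.

```javascript
// Hypothetical sketch of the rights-to-<availability> mapping shown above.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Assumption: status="restricted" is used for any non-empty rights statement.
function availabilityFromRights(rights) {
  const text = String(rights || "").trim();
  if (!text) return ""; // no rights field, no <availability> element
  return [
    '<availability status="restricted">',
    "  <p>" + escapeXml(text) + "</p>",
    "</availability>",
  ].join("\n");
}
```

Calling `availabilityFromRights("© Gallimard")` produces the three-line element from the example, and an empty rights field produces nothing.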
The current TEI exporter in Zotero is about 15 years old and has not been improved since. The new exporter is not yet perfect, but it is much better than the old one. All reviewer comments have been addressed.
New features
The best way to see them is to look at how a fairly complex item is handled.
What Zotero does with this item in APA 7:
Piaget, J. (1929). Not closing <b> or<strong>unknown tag</strong>, kept as text nodes; italic, is an element [Review of Rich text >> italic, par Eugène Minkowski; Paris, <Payot>, 1927]. Archives de psychologie, 22(85), 117‑118.
This TEI export:
The previous TEI export (the original is not well indented):