We have different entries for the surface ids #204
Comments
Can it be no extension and then it will allow whatever extension it finds?
…On Mon, 4 Sept 2023 at 14:50, Marco La Rosa wrote:
Assigned #204 to @nthieberger.
|
I vote for "no extension" too. I'm guessing these are all OCR-sourced content? I think the files that are imported via the XSLT route all end up with extension-less page identifiers. |
ok - no extension it is. I will need to rewrite a bunch of tei files and that's really slow going. |
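A rough sketch of the id check such a rewrite would need, assuming the only difference between the variants is a trailing image extension. The extension list and the example ids are hypothetical, not taken from the actual data:

```python
import re

# Assumed set of image extensions; only .tiff and .jpeg are mentioned in this ticket.
IMAGE_EXTENSIONS = ("tif", "tiff", "jpg", "jpeg", "png")

_EXT_RE = re.compile(r"\.(?:%s)$" % "|".join(IMAGE_EXTENSIONS), re.IGNORECASE)


def normalise_surface_id(surface_id: str) -> str:
    """Return the id with a trailing image extension removed, if there is one."""
    return _EXT_RE.sub("", surface_id.strip())


def needs_rewrite(surface_id: str) -> bool:
    """True when the stored id differs from its extension-less form."""
    return normalise_surface_id(surface_id) != surface_id
```

Running `needs_rewrite` over the ids first would show how many stub files actually have to be touched before starting the slow rewrite.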
@Conal-Tuohy Surface files on their own are not valid xml files. Since you can fix the surface entries on the way out by using the file name, can we just make them all
That is, dispense with the namespace declaration and the id altogether? |
I think we could dispense with the xml:id, since it's going to be created from the page file name when it's exported. I don't know if there's any other purpose for it? But I think these files are potentially valid: they can be validated as individual pages with a schema that accepts that as a root element. I wonder if the XML editor (CodeMirror) would be happy editing an XML "fragment", i.e. without a single root element? https://www.w3.org/TR/xml-fragment/ That would require changing the code which processes those files so that it wraps them with a root element and a TEI namespace declaration before doing the XML processing, but that would be doable. It would be nice to simplify things here for the users. |
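For illustration only, the wrapping step could look something like the sketch below. It uses Python's standard `xml.etree.ElementTree` rather than the project's real processing code, and `wrap_fragment` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# The TEI namespace URI.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def wrap_fragment(fragment: str) -> ET.Element:
    """Wrap a root-less XML fragment in a TEI <surface> element and parse it.

    Because the wrapper carries the default namespace declaration, every
    un-prefixed element in the fragment is parsed into the TEI namespace.
    """
    wrapped = f'<surface xmlns="{TEI_NS}">{fragment}</surface>'
    return ET.fromstring(wrapped)


root = wrap_fragment("first line<lb/>second line")
```

The fragment itself never needs a namespace declaration in this scheme; it inherits the one on the wrapper at parse time.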
I think we should keep the root element. I'm just wondering if we can ditch all the properties on it to simplify it from the user's perspective and depend on your codepath to always ensure that what comes out is correct. |
I just did a test with A1688_Milson_Kamilaroi. I removed the xmlns and xml:id properties from the surface element of the first page, generated the complete tei and found that page was missing. I put those properties back, regen'ed the tei and that page is now back in. I'm guessing something in your code is not liking one or both of those properties not being there but if we didn't need them there it simplifies the surface files a lot. |
The crucial property of that root element is the namespace declaration. NB although superficially the syntax of that |
Namespaces in XML are a notoriously awkward feature that was tacked on after XML 1.0 was already in place, but they're now unavoidable. The key thing to understand about namespaces in XML is that the full name ("expanded name") of any XML element consists of two distinct parts. There's the so-called "local name" which is e.g. In the TEI world, all the elements share the same namespace URI Given a document like this:
... the XPath expression
... the two XPath expressions will return In general, TEI-aware software (including OxGarage, etc) will address TEI elements by their expanded names (not their local names) and hence it will fail to recognise elements in a TEI file if they lack a namespace declaration. |
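The effect of the namespace declaration on element names can be seen in a small sketch like this one (Python's standard `xml.etree.ElementTree`, not the project's own code; the sample documents are made up for illustration):

```python
import xml.etree.ElementTree as ET

# The TEI namespace URI.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Two documents with the same local names; only the first declares the namespace.
with_ns = ET.fromstring(f'<surface xmlns="{TEI_NS}"><line>text</line></surface>')
without_ns = ET.fromstring("<surface><line>text</line></surface>")


def find_tei_lines(root):
    """Look up child <line> elements by expanded name (namespace URI + local name)."""
    return root.findall(f"{{{TEI_NS}}}line")


# ElementTree exposes the expanded name as "{namespace-uri}local-name".
expanded = with_ns.tag
plain = without_ns.tag
```

A lookup by expanded name finds the `line` element only in the first document, which is consistent with a page dropping out of the generated TEI when its `xmlns` declaration is removed.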
Another thing about namespace declarations is that they belong purely to the serialized form of XML. If you parse an XML document to produce an in-memory DOM Document, for instance, or in XSLT an instance of the XML Data Model, the namespaces in effect are "flattened" at the time of parsing: each element is assigned its own namespace URI, which no longer depends on the ancestor element where it was declared. You can use the DOM API to modify the namespace URI of that ancestor root element, but that will not affect the namespace URIs of the descendant elements. So if you want to rename the elements in a DOM document so that they all have a new namespace URI, you have to do it to each element individually. |
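A minimal sketch of that per-element renaming, using Python's `xml.etree.ElementTree` as a stand-in for the DOM (there the namespace is part of each element's tag string, which makes the "flattening" easy to see). `put_in_namespace` is a hypothetical helper, not part of the project:

```python
import xml.etree.ElementTree as ET

# The TEI namespace URI.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def put_in_namespace(root, ns=TEI_NS):
    """Give every un-namespaced element in the tree the namespace `ns`."""
    for el in root.iter():
        # Skip anything that is not a plain element or is already namespaced.
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = f"{{{ns}}}{el.tag}"
    return root


root = ET.fromstring("<surface><line>text</line></surface>")

# Renaming only the root leaves the child's name untouched.
root.tag = f"{{{TEI_NS}}}surface"
child_after_root_only = root[0].tag

# Walking the whole tree renames the descendants as well.
put_in_namespace(root)
child_after_walk = root[0].tag
```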
Yep - I know all that. I'm just wondering if your code can add all of that when it receives a simple So to phrase the question as clearly as I can: is your code able to process a surface stub file without namespace and id and just add all of that as the first step in the processing chain when generating a preview or assembling the TEI file? |
To add some more context to my question. I need to run a migration on the backend that downloads every single tei stub file and checks that the xml:id is in the expected form. The consensus decision is that it should be as follows:
That is, no file extension, which is fine and doable. The follow-up question is whether we want to go further and rewrite all of the stub files as:
That is, a plain old surface element. It makes no difference to me either way, other than the fact that the migration will take hours to run and I need to watch it carefully. So, I want to do it once only. Hence the question above: can we go with the plain old surface definition and delegate all of the XML fixing to your code so we never have to do this again? CodeMirror doesn't care if it's not valid XML: it still treats it like XML and handles it as such. |
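To make the question concrete, the "fix it up as the first processing step" idea could be sketched as below. This is an assumption about how it might work, not the existing code: `complete_stub` is a hypothetical function, and the page file name format (`Bates35-001.xml`) is invented for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# The TEI namespace URI and the reserved namespace behind the xml: prefix.
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize TEI elements with a default namespace rather than an ns0: prefix.
ET.register_namespace("", TEI_NS)


def complete_stub(stub_xml: str, page_file_name: str) -> str:
    """Add the TEI namespace and an xml:id to a plain <surface> stub."""
    root = ET.fromstring(stub_xml)
    # Namespaces are per element after parsing, so rename every element.
    for el in root.iter():
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = f"{{{TEI_NS}}}{el.tag}"
    # Assumed convention: the id is the page file name without its extension.
    root.set(f"{{{XML_NS}}}id", PurePosixPath(page_file_name).stem)
    return ET.tostring(root, encoding="unicode")


completed = complete_stub("<surface><line>text</line></surface>", "Bates35-001.xml")
reparsed = ET.fromstring(completed)
```

With a step like this at the start of preview and export, the stored stub files would only need the bare `surface` element.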
@Conal-Tuohy The path forward is for you to decide. You are the XML expert here. What do you think? |
I would like to think it over a bit, actually. See #192 for a related concern. I don't think there's any urgency in making this bulk data update, since any TEI files that are exported will be automatically fixed up on export, and in the meantime I think the I do think it would be good to simplify things for the user, and along those lines I think the simplest thing we could do for them would be to hide (or remove) the root element altogether, in the UI. But if we did, my preference would be to continue to store the page files as well-formed XML documents (i.e. with a single root element), rather than as XML "fragments" (consisting of a sequence of text nodes and elements without a single enclosing root element), so we'd be unwrapping and re-wrapping the root element purely in the UI. If we did that, then the existing |
OK. Happy to leave this for now. I don't like the idea of the UI manipulating the data in and out to simplify the view while the underlying file keeps the root element. If we keep it in the data, then the UI will display it; unwrapping and re-wrapping is just something that can go wrong for no good reason. I will need to make a pass over the data at some stage to fix the varying identifiers as shown at the start of this ticket, so please do get back to me! |
from the team - rewrite all the pages and remove |
In Bates 35 - no file extension:
In ATSIQLdHolmer1988 - file extension is .tiff
In A1688_Milson_Kamilaroi - file extension is .jpeg
Surely we should settle on one of them? If so - which one? (And if it's .jpg then I think it should be jpg not jpeg).
@nthieberger
@sophlew
@Conal-Tuohy
Vote now!