We have different entries for the surface id's #204

Open
marcolarosa opened this issue Sep 4, 2023 · 16 comments
@marcolarosa
Contributor

marcolarosa commented Sep 4, 2023

In Bates 35 - no file extension:

<surface xmlns="http://www.tei-c.org/ns/1.0" xml:id="Bates35-001">

In ATSIQLdHolmer1988 - file extension is .tiff

<surface xmlns="http://www.tei-c.org/ns/1.0" xml:id="ATSIQLdHolmer1988-001.tiff">

In A1688_Milson_Kamilaroi - file extension is .jpeg

<surface xmlns="http://www.tei-c.org/ns/1.0" xml:id="A1688_Milson_Kamilaroi-001.jpeg">

Surely we should settle on one of them? If so - which one? (And if it's .jpg then I think it should be jpg not jpeg).

@nthieberger
@sophlew
@Conal-Tuohy

Vote now!

* [ ] no extension
* [ ] tiff extension
* [ ] jpg (not jpeg) extension
* [ ] jpeg (not jpg) extension
@nthieberger

nthieberger commented Sep 4, 2023 via email

@Conal-Tuohy
Collaborator

I vote for "no extension" too. I'm guessing these are all OCR-sourced content? I think the files that are imported via the XSLT route all end up with extension-less page identifiers.

@marcolarosa
Contributor Author

ok - no extension it is. I will need to rewrite a bunch of tei files and that's really slow going.

@marcolarosa
Contributor Author

@Conal-Tuohy Surface files on their own are not valid xml files. Since you can fix the surface entries on the way out by using the file name, can we just make them all

<surface>
... content in here.
</surface>

That is, dispense with the namespace declaration and the id altogether?

@Conal-Tuohy
Collaborator

I think we could dispense with the xml:id since it's going to be created from the page file name when it's exported. I don't know if there's any other purpose for it?

But I think these files are potentially valid (they can be validated as individual pages with a schema that accepts that as a root element)

I wonder if the XML editor (code mirror) would be happy editing an XML "fragment" i.e. without a single root element? https://www.w3.org/TR/xml-fragment/

That would require changing the code which processes those files so that it wraps them with a root element and a TEI namespace declaration before doing the XML processing, but that would be doable. It would be nice to simplify things here for the users.
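For illustration, a minimal Python sketch of that wrapping step. This is not the project's actual code: `wrap_fragment` is a hypothetical helper, and it assumes the fragment carries no XML declaration and no default namespace declaration of its own.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def wrap_fragment(fragment: str) -> ET.Element:
    """Wrap a rootless XML fragment in a TEI <surface> root, then parse it.

    Hypothetical helper: the child elements inherit the TEI namespace
    from the xmlns declaration on the wrapper.
    """
    wrapped = f'<surface xmlns="{TEI_NS}">{fragment}</surface>'
    return ET.fromstring(wrapped)

# A fragment with two sibling elements and loose text, i.e. no single root.
root = wrap_fragment("<p>Hello</p> some text <p>world</p>")
print(root.tag)
print([child.tag for child in root])
```

Because the wrapper declares the default namespace, every element in the fragment is parsed as a TEI element without the user having typed any namespace declaration.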

@marcolarosa
Contributor Author

I think we should keep the root element. I'm just wondering if we can ditch all the properties on it to simplify it from the user's perspective and depend on your codepath to always ensure that what comes out is correct.

@marcolarosa
Contributor Author

I just did a test with A1688_Milson_Kamilaroi. I removed the xmlns and xml:id properties from the surface element of the first page, generated the complete tei and found that page was missing. I put those properties back, regen'ed the tei and that page is now back in.

I'm guessing something in your code is not liking one or both of those properties not being there but if we didn't need them there it simplifies the surface files a lot.

@Conal-Tuohy
Collaborator

Conal-Tuohy commented Sep 4, 2023

The crucial property of that root element is the namespace declaration. NB although superficially the syntax of that xmlns is the same as an attribute such as the xml:id or the rend attributes on other elements, it is actually considered to be not an attribute, and it functions quite differently from other attributes in that it defines a scope which is inherited by child elements. An xmlns on an element modifies/qualifies the name of that element and the names of all the descendant elements of that element (unless overridden by other xmlns declarations). Without that xmlns, none of the elements in that document will be recognised as TEI elements at all.

@Conal-Tuohy
Collaborator

Conal-Tuohy commented Sep 4, 2023

Namespaces in XML are a notoriously awkward feature that was tacked on after XML 1.0 was already in place, but they're now unavoidable.

The key thing to understand about namespaces in XML is that the full name ("expanded name") of any XML element consists of two distinct parts.

There's the so-called "local name" which is e.g. surface or p or whatever, and the other part of the name is the "namespace name" or "namespace URI", which is the formal identifier for a logical "scope" within which the various "local names" each have a single well-defined meaning. Different elements can have the same "local name" and are considered totally unrelated if their namespace URIs are different.

In the TEI world, all the elements share the same namespace URI http://www.tei-c.org/ns/1.0, which is, in effect, the name of the TEI vocabulary.

Given a document like this:

<surface xmlns="http://www.tei-c.org/ns/1.0" xml:id="Bates35-001">
   <p>Hello world</p>
</surface>

... the XPath expression local-name(/*/*) will return the string "p", and the expression namespace-uri(/*/*) will return the string "http://www.tei-c.org/ns/1.0", whereas for this document:

<surface xml:id="Bates35-001">
   <p>Hello world</p>
</surface>

... the two XPath expressions will return "p" and "" (the empty string which indicates the element's name is not qualified by a namespace).

In general, TEI-aware software (including OxGarage, etc) will address TEI elements by their expanded names (not their local names) and hence it will fail to recognise elements in a TEI file if they lack a namespace declaration.
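The same effect can be seen outside XPath. Below is a small sketch using Python's standard `xml.etree.ElementTree`, which stores each element's expanded name as `{namespace-uri}local-name`; `split_name` is a hypothetical helper written for this example only.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

with_ns = ET.fromstring(
    '<surface xmlns="http://www.tei-c.org/ns/1.0" xml:id="Bates35-001">'
    "<p>Hello world</p></surface>"
)
without_ns = ET.fromstring(
    '<surface xml:id="Bates35-001"><p>Hello world</p></surface>'
)

def split_name(tag: str) -> tuple:
    """Split ElementTree's '{uri}local' tag into (namespace URI, local name)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

print(split_name(with_ns[0].tag))     # ('http://www.tei-c.org/ns/1.0', 'p')
print(split_name(without_ns[0].tag))  # ('', 'p')

# A lookup by expanded name only matches the namespaced document.
print(with_ns.find(f"{{{TEI_NS}}}p") is not None)     # True
print(without_ns.find(f"{{{TEI_NS}}}p") is not None)  # False
```

The last two lines mirror what namespace-aware software does: it asks for the TEI `p`, so the unqualified `p` in the second document is simply not found.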

@Conal-Tuohy
Collaborator

Another thing about namespace declarations is that they belong purely to the serialized form of XML. If you parse an XML document to produce an in-memory DOM Document (or, in XSLT, an instance of the XML Data Model), the namespaces in effect are "flattened" at parse time: each element is assigned its own namespace URI, which no longer depends on the ancestor element where it was declared. You can use the DOM API to modify the namespace URI of that ancestor root element, but that will not affect the namespace URIs of the descendant elements.

That means if you want to rename the elements in a DOM document so that they all have a new namespace URI, you have to do it to each element individually.
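As a rough illustration of renaming each element individually, here is a sketch using Python's `xml.etree.ElementTree` rather than a W3C DOM (an assumption made for brevity; the principle of per-element expanded names is the same). `qualify_all` is a hypothetical helper, not part of any existing codebase.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def qualify_all(root: ET.Element, ns: str = TEI_NS) -> None:
    """Move every unqualified element in the tree into namespace `ns`, in place."""
    for elem in root.iter():
        # Leave alone anything already namespaced (tag starts with '{'),
        # and non-element nodes whose tag is not a string.
        if isinstance(elem.tag, str) and not elem.tag.startswith("{"):
            elem.tag = f"{{{ns}}}{elem.tag}"

root = ET.fromstring("<surface><p>Hello <hi>world</hi></p></surface>")

# Renaming only the root leaves the descendants unqualified...
root.tag = f"{{{TEI_NS}}}surface"
print([e.tag for e in root.iter()])

# ...so each element has to be visited and renamed individually.
qualify_all(root)
print([e.tag for e in root.iter()])
```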

@marcolarosa
Contributor Author

Yep - I know all that.

I'm just wondering if your code can add all of that when it receives a simple <surface> element before it does all its other things. I don't think codemirror cares if it's there or not.

So to phrase the question as clearly as I can: is your code able to process a surface stub file without namespace and id and just add all of that as the first step in the processing chain when generating a preview or assembling the TEI file?

@marcolarosa
Contributor Author

marcolarosa commented Sep 4, 2023

To add some more context to my question.

I need to run a migration on the backend that downloads every single tei stub file and checks that the xml:id is in the expected form. The consensus decision is that it should be as follows:

<surface xmlns="http://www.tei-c.org/ns/1.0" xml:id="Bates35-001">

That is, no file extension, and that's fine and doable.
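A minimal sketch of the check itself, in Python. The list of extensions is an assumption based only on the three examples at the top of this ticket, and `normalize_surface_id` is a hypothetical name, not the real migration code.

```python
import re

# Assumed set of image extensions, taken from the examples in this ticket.
_EXTENSION = re.compile(r"\.(tiff?|jpe?g)$", re.IGNORECASE)

def normalize_surface_id(xml_id: str) -> str:
    """Return the xml:id with any trailing image file extension removed."""
    return _EXTENSION.sub("", xml_id)

for xml_id in (
    "Bates35-001",
    "ATSIQLdHolmer1988-001.tiff",
    "A1688_Milson_Kamilaroi-001.jpeg",
):
    print(xml_id, "->", normalize_surface_id(xml_id))
```

An id that already has the expected form passes through unchanged, so the same function can be used both to detect stubs that need rewriting and to compute the new value.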

The follow up question is whether we want to go further and rewrite all of the stub files as:

<surface>

That is - a plain old surface element.

It makes no difference to me either way other than the fact that the migration will take hours to run and I need to watch it carefully. So, I want to do it once only. Hence the question above: can we go with the plain old surface definition and delegate all of the XML fixing to your code so we never have to do this again?

Codemirror doesn't care if it's not valid XML:

Screenshot 2023-09-05 at 9 25 55 am

It still treats it like XML and handles it as such.

@marcolarosa
Contributor Author

@Conal-Tuohy The path forward is for you to decide. You are the XML expert here. What do you think?

@Conal-Tuohy
Collaborator

I would like to think it over a bit, actually. See #192 for a related concern. I don't think there's any urgency in making this bulk data update, since any TEI files that are exported will be automatically fixed up on export, and in the meantime I think the surface element and its xml:id attribute are easily-ignorable bits of cruft for most users.

I do think it would be good to simplify things for the user, and along those lines I think the simplest thing we could do for them would be to hide (or remove) the root element altogether, in the UI. But if we did, my preference would be to continue to store the page files as well-formed XML documents (i.e. with a single root element), rather than as XML "fragments" (consisting of a sequence of text nodes and elements without a single enclosing root element), so we'd be unwrapping and re-wrapping the root element purely in the UI. If we did that, then the existing surface elements with their namespace declarations and their sometimes unconventional xml:id attributes would be entirely hidden from the users and there'd be no need to do a bulk update at all, I think.

@marcolarosa
Contributor Author

marcolarosa commented Sep 5, 2023

ok. Happy to leave this for now.

I don't like the idea of the UI manipulating the data on the way in and out just to simplify the view while the underlying file keeps the root element. If we keep it in the data, then the UI will display it; hiding it is just something that can go wrong for no good reason.

I will need to make a pass over the data at some stage to fix the varying identifiers as shown at the start of this ticket so please do get back to me!

@marcolarosa
Contributor Author

from the team - rewrite all the pages and remove xml:id - keep the namespace declaration. And whilst I'm here - build the tei dataset for @Conal-Tuohy
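For reference, a rough Python sketch of what that rewrite could look like for a single page. It is an illustration under assumptions, not the migration that was run: `strip_xml_id` is a hypothetical helper, and round-tripping through `xml.etree.ElementTree` drops comments and can change attribute quoting and whitespace, which may or may not be acceptable for the real data.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree's expanded name for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Serialize TEI as the default namespace (xmlns="...") instead of an ns0: prefix.
# Note that this registration is global to the ElementTree module.
ET.register_namespace("", TEI_NS)

def strip_xml_id(page_xml: str) -> str:
    """Remove xml:id from the root element, keeping the namespace declaration."""
    root = ET.fromstring(page_xml)
    root.attrib.pop(XML_ID, None)  # no error if the attribute is already absent
    return ET.tostring(root, encoding="unicode")

page = (
    '<surface xmlns="http://www.tei-c.org/ns/1.0" xml:id="Bates35-001">'
    "<p>Hello world</p></surface>"
)
result = strip_xml_id(page)
print(result)
```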
