XSL: <def/> inserts too many spaces #13
Comments
Is this still a live issue? I try to avoid XSLT 1 at all costs, but I can have a look, especially if there's a real example that I can test the solution against.
The file
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="freedict-dictionary.css"?>
<?oxygen RNGSchema="freedict-P5.rng" type="xml"?>
<!DOCTYPE TEI SYSTEM "freedict-P5.dtd">
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:wikdict="http://www.wikdict.com/ns/1.0">
<text>
<body xml:lang="de">
<entry>
<form>
<orth>abel</orth>
</form>
<gramGrp>
<pos>suffix</pos>
</gramGrp>
<sense>
<cit type="trans" xml:lang="sv">
<quote>bar</quote>
</cit>
<sense>
<def>def1</def>
</sense>
<sense>
<def>def2</def>
</sense>
</sense>
</entry>
</body>
</text>
</TEI>
gives the following result when processed with
where I would have expected something like
I usually touch neither the XSL nor the c5 files, so I'm not sure this is the correct example for this problem.
Thanks Karl, this is an interesting example, even if a bit unusual. What is the structure of the information here, please? The conversion script assumes a uniform sequence of
and wrapping the `<cit>` into its own `<sense>` would help a lot (probably an `@n` attribute on `<sense>` would help as well, because that takes precedence over just counting the elements).
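To make the suggestion concrete, here is a minimal sketch of how the nested senses from the example entry might look with the `<cit>` wrapped in its own `<sense>`. This is only one reading of the proposal, not a confirmed encoding: the fragment reuses the element names and values from the example file above, and the exact nesting is an assumption.

```xml
<!-- Sketch only: the outer <sense> groups the shared translation with the senses it applies to. -->
<sense>
  <!-- Assumption: the translation gets its own <sense> wrapper instead of sitting bare in the group. -->
  <sense>
    <cit type="trans" xml:lang="sv">
      <quote>bar</quote>
    </cit>
  </sense>
  <sense>
    <def>def1</def>
  </sense>
  <sense>
    <def>def2</def>
  </sense>
</sense>
```

With this shape every child of the outer `<sense>` is itself a `<sense>`, so a stylesheet that counts elements sees a uniform sequence.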
I don't really like the n attribute, because counting is what machines are
better at than humans. Wrapping the cit in a sense feels more natural to me.
Is the TEI standard concerned about this? If not, I would still prefer a stricter
interpretation on our side; it makes things much easier to handle.
Agreed about And yes, Sebastian, you're right: we need to constrain stuff ourselves, the TEI is just a toolkit. Fortunately, we have some emerging standards (and our practice) to guide us.
Yes. The entry has two senses, but both translate to the same Swedish word. This is a very common thing in the WikDict dictionaries, and this kind of grouping has made the output a lot more readable on www.wikdict.com, so I replicated it in the TEI files. I think I asked for suggestions on how to encode it when I first did this, and this was the result. For important words, there are multiple such groups where one translation applies to multiple senses. I tried to keep the example minimal, so I included only one. See https://www.wikdict.com/de-en/haus for the HTML version of such a case.
Oh, that page looks really nice!
You were absolutely right about the shorter example showing things better, but I lacked some context; now I have it.
I can see now that it does make sense for the stylesheets to handle such structures better.
It would be great to have your solution documented in the HOWTO at
https://github.com/freedict/fd-dictionaries/wiki/FreeDict-HOWTO-%E2%80%93-Writing-Text-Encoding-Initiative-XML-Files
Thanks
Those refer to specific sense numbers on the original Wiktionary pages. That's mostly used on complicated pages in the German Wiktionary and hard to match, due to the unstructured approach of Wiktionary (everything is just wiki markup). I might be able to do that at some point, but it is not an easy task, and it might require changes in the dbnary project I am building on.
My approach is to preserve things as they are in Wiktionary, assuming that the page authors know better than I do. Many Wiktionaries prefer one version (slashes or brackets) over the other and use that on most pages. I would expect correct entries to be enclosed by slashes on both sides or brackets on both sides. Leaving one off or mixing them for a single pronunciation is wrong, as far as I know. When such cases happen, the cause could be an error in the Wiktionary page or an error during the extraction (parsing sensible content from mostly presentational markup is messy). I have to investigate on a case-by-case basis to find out. Feel free to open issues on https://github.com/karlb/wikdict-gen/ if you see anything suspicious.
A TEI element like this:
Leads to:
This should be fixed.