XSL: <def/> inserts too many spaces #13

Open
humenda opened this issue Aug 6, 2018 · 9 comments

humenda commented Aug 6, 2018

A TEI element like this:

<sense>
  <def>my trans</def>
</sense>

Leads to:

1.
  my trans

This should be fixed.
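
A minimal sketch of the symptom, written in Python rather than in the project's XSLT and not taken from the actual stylesheet. It assumes the extra line break and indentation come from the whitespace-only text nodes around `<def>` being copied to the output; that cause is not confirmed anywhere in this thread. The helper names `normalize_space` and `render_sense` are hypothetical, and `normalize_space` only approximates the XPath 1.0 function of the same name.

```python
import xml.etree.ElementTree as ET


def normalize_space(text):
    """Roughly mimic XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join((text or "").split())


def render_sense(sense, number):
    """Return (naive, fixed) renderings of one <sense> element."""
    raw = "".join(sense.itertext())
    # Naive: keeps the pretty-printing whitespace from the source file.
    naive = f"{number}." + raw
    # Fixed: collapses that whitespace before emitting the definition.
    fixed = f"{number}. " + normalize_space(raw)
    return naive, fixed


sense = ET.fromstring("<sense>\n  <def>my trans</def>\n</sense>")
naive, fixed = render_sense(sense, 1)
print(repr(naive))  # '1.\n  my trans\n'
print(repr(fixed))  # '1. my trans'
```

In XSLT 1.0 the comparable tools would be `normalize-space()` and `xsl:strip-space`; whether either is the right fix for this stylesheet is an open question here.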

bansp commented Jan 6, 2021

Is this still a live issue? I try to avoid XSLT 1 at all costs, but I can have a look, especially if there's a real example that I can test the solution against.

karlb commented Jan 6, 2021

The file

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="freedict-dictionary.css"?>
<?oxygen RNGSchema="freedict-P5.rng" type="xml"?>
<!DOCTYPE TEI SYSTEM "freedict-P5.dtd">
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:wikdict="http://www.wikdict.com/ns/1.0">
	<text>
		<body xml:lang="de">
			<entry>
				<form>
					<orth>abel</orth>
				</form>
				<gramGrp>
					<pos>suffix</pos>
				</gramGrp>
				<sense>
					<cit type="trans" xml:lang="sv">
						<quote>bar</quote>
					</cit>
					<sense>
						<def>def1</def>
					</sense>
					<sense>
						<def>def2</def>
					</sense>
				</sense>
			</entry>
		</body>
	</text>
</TEI>

gives the following result when processed with xsltproc tei2c5.xsl def-example.tei:

abel
abel <suffix>
bar 2.
def1
 3.
def2

where I would have expected something like

abel
abel <suffix>
bar
  1. def1
  2. def2

I usually touch neither the XSL nor the c5 files, so I'm not sure this is the correct example for this problem.
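
The expected layout above can be pinned down with a small Python sketch. It illustrates the target output only and is not a port of `tei2c5.xsl`. The entry is a trimmed copy of the example file, and the names `text_of` and `render_entry` are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"t": "http://www.tei-c.org/ns/1.0"}

# Trimmed copy of the example entry (header and wrapper elements omitted).
ENTRY = ET.fromstring("""
<entry xmlns="http://www.tei-c.org/ns/1.0">
  <form><orth>abel</orth></form>
  <gramGrp><pos>suffix</pos></gramGrp>
  <sense>
    <cit type="trans" xml:lang="sv"><quote>bar</quote></cit>
    <sense><def>def1</def></sense>
    <sense><def>def2</def></sense>
  </sense>
</entry>
""")


def text_of(element):
    """Return all text below an element with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def render_entry(entry):
    """Render headword, part of speech, translations, then numbered sub-senses."""
    headword = text_of(entry.find("t:form/t:orth", NS))
    pos = text_of(entry.find("t:gramGrp/t:pos", NS))
    lines = [headword, f"{headword} <{pos}>"]
    for sense in entry.findall("t:sense", NS):
        for quote in sense.findall("t:cit/t:quote", NS):
            lines.append(text_of(quote))
        # Number only the nested <sense> elements, starting at 1.
        for number, sub in enumerate(sense.findall("t:sense", NS), start=1):
            lines.append(f"  {number}. {text_of(sub)}")
    return "\n".join(lines)


print(render_entry(ENTRY))
```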

bansp commented Jan 7, 2021

Thanks Karl, this is an interesting example, even if a bit unusual. What is the structure of the information here, please?
I understand the first <cit> that provides a translation equivalent in Swedish. Do the further <sense> elements define -abel in German?

The conversion script assumes a uniform sequence of <sense> elements, so I'm guessing it gets confused with

cit
sense
sense

and wrapping the <cit> into its own <sense> would help a lot (probably an @n attribute on <sense> would help as well, because that takes precedence over just counting the elements). But I don't want to speculate further, at this point.
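
A hedged Python sketch of that guess, not the stylesheet's actual logic. It assumes the observed "2." and "3." come from numbering each sub-sense by its position among all child elements, so the leading `<cit>` takes slot 1. It then contrasts that with counting `<sense>` siblings only and letting an explicit `@n` win. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# The outer <sense> from Karl's example: one <cit> followed by two <sense>.
SENSE = ET.fromstring("""
<sense xmlns="http://www.tei-c.org/ns/1.0">
  <cit type="trans" xml:lang="sv"><quote>bar</quote></cit>
  <sense><def>def1</def></sense>
  <sense><def>def2</def></sense>
</sense>
""")


def number_by_child_position(parent):
    """Assumed faulty behaviour: index among ALL child elements."""
    return [str(index) for index, child in enumerate(parent, start=1)
            if child.tag == TEI + "sense"]


def number_by_sense_count(parent):
    """Count <sense> siblings only; an explicit @n takes precedence."""
    subsenses = parent.findall(TEI + "sense")
    return [sub.get("n") or str(index)
            for index, sub in enumerate(subsenses, start=1)]


print(number_by_child_position(SENSE))  # ['2', '3']
print(number_by_sense_count(SENSE))     # ['1', '2']
```

Under this assumption, wrapping the `<cit>` in its own `<sense>` would make the children uniform, while `@n` would bypass the counting altogether.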

humenda commented Jan 7, 2021 via email

bansp commented Jan 7, 2021

Agreed about @n, it's for when you want to override the machine.
I threw together some entries at https://github.com/freedict/fd-dictionaries/tree/master/shared/testing, see https://github.com/freedict/fd-dictionaries/blob/master/shared/testing/test_1.xml .
Going to add some more, and tinker inside that file. But for now, it can serve to illustrate why I asked about how the info in Karl's example was structured.

And yes, Sebastian, you're right: we need to constrain stuff ourselves, the TEI is just a toolkit. Fortunately, we have some emerging standards (and our practice) to guide us.

karlb commented Jan 7, 2021

I understand the first <cit> that provides a translation equivalent in Swedish. Do the further <sense> elements define -abel in German?

Yes. The entry has two senses, but both translate to the same Swedish word. This is a very common thing in the WikDict dictionaries, and this kind of grouping has made the output a lot more readable on www.wikdict.com, so I replicated it in the TEI files. I think I asked for suggestions on how to encode it when I first did this, and this was the result.

For important words, there are multiple of these groups where one translation applies to multiple senses. I tried to keep the example minimal, so I included only one. See https://www.wikdict.com/de-en/haus for the HTML version of such a case.

bansp commented Jan 7, 2021

Oh, that page looks really nice!
Two questions:

  • there are numerical references there (like [7]) that don't seem to point to an easily identifiable spot, are you going to make them work at some further step? I initially thought that it's just a matter of changing bullets into numbers, but that won't work
  • (off-topic in this ticket, but you were expecting this ;-) ): what's the rule for the absence of slashes on the left vs. on the right, and are there some examples of a mix of slanted and square brackets, as you mentioned elsewhere?

You were absolutely right about the shorter example showing things better, but I lacked some context, now I have it.
I probably wasn't part of that discussion that you mentioned, through my fault alone. A quick thought is that the example does make sense indeed, especially if you were to provide some more details for each of the (sub)senses, like PoS, pronunciation, etc. I can see now that it does make sense for the stylesheets to handle such structures better.

humenda commented Jan 8, 2021 via email

karlb commented Jan 9, 2021

there are numerical references there (like [7]) that don't seem to point to an easily identifiable spot, are you going to make them work at some further step? I initially thought that it's just a matter of changing bullets into numbers, but that won't work

Those refer to specific sense numbers on the original Wiktionary pages. That's mostly used on complicated pages in the German Wiktionary and hard to match, due to the unstructured approach of Wiktionary (everything is just wiki markup). I might be able to do that at some point, but it is not an easy task, and it might require changes in the dbnary project I am building on.

(off-topic in this ticket, but you were expecting this ;-) ): what's the rule for the absence of slashes on the left vs. on the right, and are there some examples of a mix of slanted and square brackets, as you mentioned elsewhere?

My approach is to preserve things as they are in Wiktionary, assuming that the page authors know better than I do. Many Wiktionaries prefer one version (slashes or brackets) over the other and use that on most pages.

I would expect correct entries to be enclosed by slashes on both sides or brackets on both sides. Leaving one off or mixing them for a single pronunciation is wrong, as far as I know. When such cases happen, it could be wrong in the Wiktionary page or an error during the extraction (parsing sensible content from mostly presentational markup is messy). I have to investigate on a case-by-case basis to find out.

Feel free to open issues on https://github.com/karlb/wikdict-gen/ if you see anything suspicious.
