Rework FormattedText model to better support USX3/USFM3 import #93

Open
schierlm opened this issue Aug 3, 2024 · 4 comments

@schierlm
Owner

schierlm commented Aug 3, 2024

The current FormattedText model, which is used as the intermediate format for every conversion (except conversions between two Paratext formats), has been there since the beginning of BibleMultiConverter. Yet, other Bible formats have evolved. Therefore, rework the internal model.

Some ideas:

  1. FormattingInstructionKind: Add new constants

    • PSALM_TITLE (titles of Psalms which sometimes are part of verse 1, sometimes before it)
    • ADDED_TEXT (text added by the translator which is not linked to original source, often conjunctions)

    When exporting those to a format that does not support them, treat both as ITALIC.

  2. Add Speaker markup to mark text spoken by a person other than Jesus. Speakers can be identified
    by labels (e.g. "Moses") or Strongs numbers (e.g. "H4872").

  3. Rework LineBreakKind based on ExtendedLineBreakKind used for Paratext export

  4. GrammarInformation: Add suffix letters for Strongs numbers (optional), also add a way to add
    arbitrary key-value pairs (like in OSIS or Paratext). Values need not be ASCII only (e.g. Greek Lemma).

  5. Links: Support

    • Anchors in the text (by id)
    • Links to those anchors
    • Links to external hyperlinks
    • Links to external images (which may be displayed inline if supported by the format)
  6. Footnotes: Add a flag indicating whether a footnote contains text or cross references. For now, this is done by adding XREF_MARKER to the beginning of the footnote text, but many new formats have this distinction and parsing for magic strings gets cumbersome.

  7. Cross References: Support cross references that span more than one book; also support cross references that do not reference individual verses, but whole chapters or books.
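
To make ideas 1 and 6 a bit more concrete, here is a rough, self-contained sketch. All names in it (`NewModelSketch`, `FormattingKind`, `FootnoteKind`, `fallbackFor`, `kindFromLegacyText`) are hypothetical stand-ins, not the actual BibleMultiConverter classes, and the `XREF_MARKER` value is only a placeholder. It shows the proposed ITALIC fallback for formats that lack the new constants, and how an explicit footnote flag could be derived from legacy text that still carries the marker prefix.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-ins for illustration; not the real BibleMultiConverter API.
public class NewModelSketch {

    /** A small subset of formatting kinds plus the two proposed constants. */
    enum FormattingKind {
        BOLD, ITALIC, PSALM_TITLE, ADDED_TEXT;

        /** Returns this kind if the target format supports it, otherwise the proposed fallback. */
        FormattingKind fallbackFor(Set<FormattingKind> supported) {
            if (supported.contains(this)) {
                return this;
            }
            // Proposal from idea 1: treat both new kinds as ITALIC when unsupported.
            if (this == PSALM_TITLE || this == ADDED_TEXT) {
                return ITALIC;
            }
            return this;
        }
    }

    /** Explicit flag proposed in idea 6, replacing the magic-string check. */
    enum FootnoteKind { TEXT, CROSS_REFERENCES }

    // Placeholder value; the marker string used by the real model may differ.
    static final String XREF_MARKER = "[xref] ";

    /** Derives the flag from legacy footnote text that still uses the marker prefix. */
    static FootnoteKind kindFromLegacyText(String footnoteText) {
        return footnoteText.startsWith(XREF_MARKER)
                ? FootnoteKind.CROSS_REFERENCES
                : FootnoteKind.TEXT;
    }

    public static void main(String[] args) {
        Set<FormattingKind> legacyTarget = EnumSet.of(FormattingKind.BOLD, FormattingKind.ITALIC);
        System.out.println(FormattingKind.PSALM_TITLE.fallbackFor(legacyTarget)); // ITALIC
        System.out.println(FormattingKind.ADDED_TEXT.fallbackFor(
                EnumSet.allOf(FormattingKind.class)));                            // ADDED_TEXT
        System.out.println(kindFromLegacyText(XREF_MARKER + "Gen 1:1"));          // CROSS_REFERENCES
    }
}
```

With something along these lines, exporters would only need to declare which kinds they support, and importers of older data could set the footnote flag once at read time instead of every consumer checking for the prefix.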

As this is a major task (it needs to touch most of the modules), my plan is to first update only the roundtrip formats and make the other formats "just" work again (using fallbacks or ignoring the new options). I will keep a list of the status of each module (e.g. compiles again, tested, compared against format spec), trying not to make any format worse than before anywhere in the process.

When exporting other features from USFM to FormattedText, use ExtraAttributes wherever possible. This should also include custom tags and custom milestones. There should be an option to convert UBXF alignment milestones (for a single alignment source) to GrammarInformation instead of extra attributes.

Did I miss anything? A feature should be present in both USFM3/USX3 and in more than one other format.

// cc @Rolf-Smit @Michahel @shadow-light @paul1149

@Rolf-Smit
Contributor

Lately I have not been actively working with this tool; I mostly use it to convert from USX (2/3) to USFM (3) (which my application, surprisingly, can parse faster than XML).

From my perspective I don't have any remarks about your plans to rework FormattedText. I think it would be good if the intermediate format supports as many features as possible, in a sensible and generic way.

@schierlm
Owner Author

Updated the issue to not forget to add support for UBXF alignment milestones.

@schierlm
Owner Author

@Rolf-Smit just a heads up: in a553d4b I changed the intermediate format used by Paratext formats by moving Figure, VerseStart and VerseEnd to be BookContent instead of CharacterContent (all Paratext formats supported so far do not support those nested in character tags or footnotes anyway). This makes some parsing easier and removes some ugly workarounds that made extending the format harder.

Not sure if that affects your use cases.

schierlm added a commit that referenced this issue Sep 25, 2024
@schierlm
Owner Author

UBXF alignment milestones are now implemented, see 93eeed0 (part of main branch)

A very early alpha of the new FormattedText model is available in the newmodel branch (supported formats).
Nightly build: https://nightly.link/schierlm/BibleMultiConverter/workflows/main.yaml/newmodel/BibleMultiConverter-AllInOneEdition-Release.zip

schierlm added this to the v0.1 milestone Oct 20, 2024