Pendo markdown conversion support (#3)
* Initial markdown conversion: basic syntax and strikethrough

* Reimplement micromark gfm strikethrough syntax plugin for pendo underline

* Add strikethrough mdast extension reimplementation for underline

* Update tests to use tree nodes instead of snapshots

* Add stringify tests for implemented extensions
* Bugfix for stringify strikethrough with pluses
* Add type for underline mdast node

* Fix const reference

* Make parser types local to the plugin

* Disallow single-tilde strikethrough

* Remove position info from parsed md

* Convert color syntax to XML nodes before parsing to AST

* Add ast conversion from mdast html nodes to custom color node

* Update color nodes test to use actual parser

* Fix invalid source string in test

* Disable markdown syntax unsupported by Pendo

* Fix type definitions placement

* Make markdown plugins stand out better

* Fix test for code indented disable

* * Backconvert color nodes
* Enable color ast transform in main convert interface
* More tests

* Force delete position from ast to reduce diff noise

* Clean up tree visitor returns

* Fix tree visitors again

* Implement escaping markdown syntax with components

* * Refactor component ast nodes to hold metadata internally
* Move component factories to where the definitions are
* Support self-closing components

* Clean up folder structure

* Assemble all steps for producing escaped strings

* Implement backconverting of escaped strings

* Add todo for color nodes

* Add test for backconverting a shuffled string

* * Add tests for underline markdown extension
* Change default plugin setting to enable single-plus underline

* Fix underline plugin node type extension

* Move strikethrough plugin typedef

* Fix micromark plugin test names

* Clean up string transformer color exports

* Clean up ast transformer color tests and exports

* More tests for the component ast transformer (and some bugfixes)

* Fix private types

* Add test for backconverting from shuffled string

* Make tree type generic and ensure to not mutate original

* Sad path tests for escaped string backconversion

* Bugfix for double closing tag backconversion

* Create TS definitions for a loctool plugin interface

* Implement interface for pendo xliff filetype

* Add ilib-xliff dependency and define its types

* Add type definitions for Resource subclasses

* Update typedefs folder structure

* Infer concrete created resource type

* Constrain resource factory props based on resType

* Clean up translation unit typedef

* Fix optional overrides type in resource clone method

* Allow constraining resource type on a TranslationSet

* Implement parsing of the pendo xliff

* Relative import sugar

* Attach resource fields definitions directly to respective classes

* Add missing xliff serialize method declaration

* Implement writing out localized files

* Revert "Attach resource fields definitions directly to respective classes"

This reverts commit 0449675.

This currently deviates too much from the original loctool documentation and introduces additional confusion about other missing fields for which interfaces define getters.

* Add e2e test loctool project

* Fix plugin entrypoint export

* Add missing resource types mapping

* Fix relative path and buggy locale replacement in name

* Fix reference comments

* Add todos

* Separate plugin responsibilities like paths and locales from file processing

* Add support for path template mapping

* Add locale mapping

* Add test readme

* Eslint no-unused-vars ignore underscores

* Don't modify trans unit when there are no components

* Add filtering of trans units based on their datatype

* Example comment

* Create unit tests for pendo file processing

* Workaround for trans-unit ID vs resname conflicts

* Commit e2e translations

* Npmrc registry override

* Switch to directly using XML tree for transforming pendo xliff files

* Clean up repo setup

* Update lockfile

* Clean up repo setup

* Add missing Apache 2.0 headers

* Typo fix

* Skip backconversion entirely when there are no components in localized string

* Fix path mapping templates

* Add readme

* Fix E2E by adding a devdependency on itself

* Add test for xliff 2.0 parsing
wadimw authored Sep 27, 2024
1 parent c5860e6 commit 58845f4
Showing 53 changed files with 8,220 additions and 37 deletions.
1 change: 1 addition & 0 deletions .npmrc
@@ -0,0 +1 @@
registry=https://registry.npmjs.org/
225 changes: 225 additions & 0 deletions README.md
@@ -1 +1,226 @@
# ilib-loctool-pendo-md

[Loctool](https://github.com/iLib-js/loctool) plugin to handle translation strings exported from [Pendo](https://www.pendo.io/).

This plugin accepts an XLIFF file exported from Pendo (`<xliff version="1.2"><file datatype="pendoguide">`) and extracts the existing translation units from it, mapping them 1:1 to loctool Resources with **escaped Markdown syntax**.

## Extraction

### Markdown Syntax

As per [Pendo documentation](https://support.pendo.io/hc/en-us/articles/360031866552-Use-markdown-syntax-for-guide-text-styling), the app supports a subset of classic Markdown syntax:

```md
*italics* or _italics_
**bold**
[links](example.com)

1. ordered lists

- unordered lists

* unordered lists

+ unordered lists
```

And some custom extensions:

```md
~~Strikethrough~~
++Underline++
{color: #000000}colored text{/color}
```

Translators can easily break this syntax, so the main task of this plugin is to **escape** it using XML-like component tags such as `<c0></c0>`.

### Escaping

Given a Pendo markdown string like

```markdown
String with _emphasis_, ++underline++, {color: #FF0000}colored text{/color} and [a link](https://example.com)
```

the plugin transforms it into an escaped string

```text
String with <c0>emphasis</c0>, <c1>underline</c1>, <c2>colored text</c2> and <c3>a link</c3>
```
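
The escaping step can be sketched roughly as below. This is only an illustration under simplifying assumptions: the plugin itself parses Markdown into an AST (see the micromark note), while this sketch uses a single regular expression and handles only the four non-nested constructs from the example. The names `escapeSketch` and `EscapedComponent` are hypothetical and not part of the plugin's API.

```typescript
// Illustrative sketch only: the plugin parses Markdown into an AST,
// while this hypothetical helper relies on one regular expression.
type EscapedComponent = { type: string; detail?: string };

function escapeSketch(source: string): { escaped: string; components: EscapedComponent[] } {
    // A single alternation, so components are numbered in order of appearance.
    const pattern =
        /_([^_]+)_|\+\+(.+?)\+\+|\{color: (#[0-9A-Fa-f]{6})\}(.*?)\{\/color\}|\[([^\]]+)\]\(([^)]+)\)/g;
    const components: EscapedComponent[] = [];
    const escaped = source.replace(
        pattern,
        (
            _match: string,
            emphasis?: string,
            underline?: string,
            color?: string,
            colored?: string,
            linkText?: string,
            url?: string,
        ) => {
            const index = components.length;
            let text: string;
            if (emphasis !== undefined) {
                components.push({ type: "emphasis" });
                text = emphasis;
            } else if (underline !== undefined) {
                components.push({ type: "underline" });
                text = underline;
            } else if (color !== undefined) {
                // Keep the color value so it can be restored later.
                components.push({ type: "color", detail: color });
                text = colored ?? "";
            } else {
                // Keep the link target so it can be restored later.
                components.push({ type: "link", detail: url });
                text = linkText ?? "";
            }
            return `<c${index}>${text}</c${index}>`;
        },
    );
    return { escaped, components };
}
```

Calling `escapeSketch` on the Pendo string above yields the escaped string with components `c0` to `c3`, together with a component list that records the color value and the link URL.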

### Unescaping (backconversion)

After parsing a source string, the plugin keeps track of the escaped components:

```text
- c0: emphasis
- c1: underline
- c2: color #FF0000
- c3: link https://example.com
```

Thanks to that, during localization the plugin is able to **unescape** (backconvert) these components. Given a translated string

```text
Translated string, <c3>translated link</c3> <c1>translated underline</c1>, <c0>translated emphasis</c0> <c2>translated color</c2>
```

the plugin transforms it back to Markdown syntax:

```markdown
Translated string, [translated link](https://example.com) ++translated underline++, _translated emphasis_ {color: #FF0000}translated color{/color}
```

Note that it supports shuffled order of components, since this is often required in different languages.
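
A minimal sketch of the backconversion, assuming the component list recorded during escaping is available and that components are not nested. The names `TrackedComponent`, `wrapComponent` and `unescapeSketch` are hypothetical; the plugin's real implementation works on an AST rather than on a regular expression.

```typescript
// Illustrative sketch only; not the plugin's actual implementation.
type TrackedComponent =
    | { type: "emphasis" }
    | { type: "underline" }
    | { type: "color"; value: string }
    | { type: "link"; url: string };

// Re-create the Pendo Markdown syntax for a single component.
function wrapComponent(component: TrackedComponent, text: string): string {
    switch (component.type) {
        case "emphasis":
            return `_${text}_`;
        case "underline":
            return `++${text}++`;
        case "color":
            return `{color: ${component.value}}${text}{/color}`;
        case "link":
            return `[${text}](${component.url})`;
    }
}

function unescapeSketch(translated: string, components: TrackedComponent[]): string {
    // The \1 backreference requires the closing tag to carry the same index
    // as the opening tag. Components are looked up by index, so the order in
    // which they appear in the translation does not matter.
    return translated.replace(/<c(\d+)>(.*?)<\/c\1>/g, (match: string, index: string, text: string) => {
        const component = components[Number(index)];
        // Unknown indices are left untouched in this sketch.
        return component ? wrapComponent(component, text) : match;
    });
}
```

Because the lookup is driven by the index inside each tag, a translation that reorders `<c3>`, `<c1>`, `<c0>` and `<c2>` is still restored with the right syntax for each component.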

## Translation

During the _localize_ step, this plugin outputs a copy of the original Pendo XLIFF for each locale defined in the loctool `project.json` settings. For each source string that has a translation in loctool (i.e. provided via loctool's XLIFF files), the translation is unescaped as described above when it contains components and is then inserted into the corresponding `<target>` element in the output file.

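Additionally, this plugin supports output locale mapping through the `localeMap` setting (for example, `pl-PL` to `pl`), which affects both the output file name and the `target-language` attribute.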

## Example localization process

The step-by-step process below shows how the plugin is meant to be used.

Given a source Pendo XLIFF file `$PROJECT/guides/A000A00Aaa0aaa-AaaaAaa00A0a_en.xliff`

```xml
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="A000A00Aaa0aaa-AaaaAaa00A0a" datatype="pendoguide" source-language="en-US" target-language="">
<body>
<group id="Aaaaaaaa0aAaaAAA0AAA0A0aAaa">
<trans-unit id="8de49842-c1fd-4536-905e-8817673b4c24|md">
<source><![CDATA[**Callout!**]]></source>
<target></target>
<note>TextView</note>
</trans-unit>
</group>
</body>
</file>
</xliff>
```

and the following loctool configuration

```json
{
"name": "ilib-loctool-pendo-md-test",
"id": "ilib-loctool-pendo-md-test",
"description": "translate strings exported from Pendo",
"projectType": "custom",
"sourceLocale": "en",
"includes": ["guides/*.xliff"],
"settings": {
"xliffsDir": "translations",
"locales": ["pl-PL"],
"localeMap": {
"pl-PL": "pl"
},
"pendo": {
"mappings": {
"guides/*.xliff": {
"template": "[dir]/[basename]_[locale].[extension]"
}
}
}
},
"plugins": ["ilib-loctool-pendo-md"]
}
```

invoking

```sh
loctool localize "$PROJECT"
```

will first run the _extract_ step and produce a loctool XLIFF with extracted **escaped** strings `$PROJECT/ilib-loctool-pendo-md-test-extracted.xliff`:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xliff version="1.2">
<file original="" source-language="en" product-name="ilib-loctool-pendo-md-test">
<body>
<trans-unit id="1" resname="8de49842-c1fd-4536-905e-8817673b4c24|md" restype="string" datatype="plaintext">
<source>&lt;c0&gt;Callout!&lt;/c0&gt;</source>
<note>TextView [c0: strong]</note>
</trans-unit>
</body>
</file>
</xliff>
```

Notice that:

1. markdown strong `** **` in the source string is now escaped as components `<c0> </c0>`
2. trans-unit comment is updated to include description of the escaped components: _[c0: strong]_
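
The comment update from point 2 could look roughly like the sketch below. The helper name `appendComponentNote` is hypothetical, and the comma-separated format for more than one component is an assumption; the example above only shows the single-component form `[c0: strong]`.

```typescript
// Illustrative sketch only; the multi-component format is an assumption.
type ComponentInfo = { type: string };

function appendComponentNote(note: string, components: ComponentInfo[]): string {
    // Leave the comment untouched when nothing was escaped.
    if (components.length === 0) {
        return note;
    }
    const description = components.map((component, index) => `c${index}: ${component.type}`).join(", ");
    return `${note} [${description}]`;
}
```

With the original comment `TextView` and a single `strong` component, this produces `TextView [c0: strong]`, matching the extracted file above.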

Then, loctool will immediately run the _localize_ step and produce a (not really) localized copy of the source file `$PROJECT/guides/A000A00Aaa0aaa-AaaaAaa00A0a_en_pl.xliff`:

```xml
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="A000A00Aaa0aaa-AaaaAaa00A0a" datatype="pendoguide" source-language="en-US" target-language="pl">
<body>
<group id="Aaaaaaaa0aAaaAAA0AAA0A0aAaa">
<trans-unit id="8de49842-c1fd-4536-905e-8817673b4c24|md">
<source><![CDATA[**Callout!**]]></source>
<target/>
<note>TextView</note>
</trans-unit>
</group>
</body>
</file>
</xliff>
```

Notice that:

1. target tag stays empty because there is no translation available yet
2. file name includes mapped output locale `pl` rather than the translation locale `pl-PL`
3. `target-language` attribute is also filled using the mapped output locale
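
How the output path follows from the `template` and `localeMap` settings can be sketched as below. The placeholder semantics (`[dir]`, `[basename]`, `[locale]`, `[extension]`) are inferred from the example in this README rather than taken from the plugin source, and `formatLocalizedPath` is a hypothetical name.

```typescript
// Illustrative sketch only; placeholder semantics are inferred from the example above.
function formatLocalizedPath(
    sourcePath: string,
    template: string,
    locale: string,
    localeMap: { [locale: string]: string | undefined } = {},
): string {
    const slash = sourcePath.lastIndexOf("/");
    const dir = slash === -1 ? "." : sourcePath.slice(0, slash);
    const file = sourcePath.slice(slash + 1);
    const dot = file.lastIndexOf(".");
    const values: Record<string, string> = {
        dir,
        basename: dot === -1 ? file : file.slice(0, dot),
        extension: dot === -1 ? "" : file.slice(dot + 1),
        // Fall back to the translation locale when no mapping is configured.
        locale: localeMap[locale] ?? locale,
    };
    // A replacer function avoids special handling of "$" in replacement strings.
    return template.replace(/\[(dir|basename|locale|extension)\]/g, (_match: string, key: string) => values[key]);
}
```

For the source file `guides/A000A00Aaa0aaa-AaaaAaa00A0a_en.xliff`, the template `[dir]/[basename]_[locale].[extension]` and the mapping `pl-PL` to `pl`, this gives `guides/A000A00Aaa0aaa-AaaaAaa00A0a_en_pl.xliff`, which is the file name used in this walkthrough.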

Now you need to obtain translations. Assume you've sent the loctool XLIFF file `$PROJECT/ilib-loctool-pendo-md-test-extracted.xliff` to a linguist and received translations for locale `pl-PL`. Following your project's config, you put it in `$PROJECT/translations/ilib-loctool-pendo-md-test-pl-PL.xliff`:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xliff version="1.2">
<file original="" source-language="en" target-language="pl-PL" product-name="ilib-loctool-pendo-md-test">
<body>
<trans-unit id="1" resname="8de49842-c1fd-4536-905e-8817673b4c24|md" restype="string" datatype="plaintext">
<source>&lt;c0&gt;Callout!&lt;/c0&gt;</source>
<target>&lt;c0&gt;Wywołanie!&lt;/c0&gt;</target>
<note>TextView [c0: strong]</note>
</trans-unit>
</body>
</file>
</xliff>
```

Note that the target also has `<c0> </c0>` tags in it, since your linguist knew how to handle XML-like tags properly.

Running loctool again

```sh
loctool localize "$PROJECT"
```

This time, loctool loads the _pl-PL_ translations from the folder specified in `xliffsDir`, and the _localize_ step backconverts and inserts those translations while regenerating the (now actually) localized file `$PROJECT/guides/A000A00Aaa0aaa-AaaaAaa00A0a_en_pl.xliff`:

```xml
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="A000A00Aaa0aaa-AaaaAaa00A0a" datatype="pendoguide" source-language="en-US" target-language="pl">
<body>
<group id="Aaaaaaaa0aAaaAAA0AAA0A0aAaa">
<trans-unit id="8de49842-c1fd-4536-905e-8817673b4c24|md">
<source><![CDATA[**Callout!**]]></source>
<target state="translated">**Wywołanie!**</target>
<note>TextView</note>
</trans-unit>
</group>
</body>
</file>
</xliff>
```

which you can safely import back to Pendo.
14 changes: 14 additions & 0 deletions docs/micromark.md
@@ -0,0 +1,14 @@
# Note on unstable `micromark` versions

Currently, this plugin uses the following versions of `micromark`-related dependencies:

```json
"mdast-util-from-markdown": "^0",
"mdast-util-to-markdown": "^0",
"mdast-util-gfm-strikethrough": "^0",
"micromark-extension-gfm-strikethrough": "^0"
```

This is because all of these packages became ESM-only with their stable `1.0.0` releases, while at the time of writing `loctool` is written in CommonJS (so it can't use a static `import`) and loads plugins synchronously (so it can't use `import()` either). See the plugin loader source at https://github.com/iLib-js/loctool/blob/v2.25.1/lib/CustomProject.js#L116.

This should be _mostly fine_, since these versions have been used publicly in `remark v13` (i.e. `remark-parse@9`). In addition, Pendo strings are expected to have low complexity due to being pre-segmented prior to export from Pendo.
21 changes: 16 additions & 5 deletions eslint.config.mjs
@@ -5,11 +5,6 @@ import eslint from "@eslint/js";
import tseslint from "typescript-eslint";
import prettier from "eslint-config-prettier";

// // unnecessary async to test linting
// export async function unnecessarilyAsync() {
// return true;
// }

export default tseslint.config(
{ ignores: ["node_modules", "dist"] },
eslint.configs.recommended,
@@ -23,4 +18,20 @@ export default tseslint.config(
},
},
prettier,
{
rules: {
"@typescript-eslint/no-unused-vars": [
"error",
{
args: "all",
argsIgnorePattern: "^_",
caughtErrors: "all",
caughtErrorsIgnorePattern: "^_",
destructuredArrayIgnorePattern: "^_",
varsIgnorePattern: "^_",
ignoreRestSiblings: true,
},
],
},
},
);
22 changes: 21 additions & 1 deletion package.json
@@ -42,16 +42,36 @@
"@types/eslint-config-prettier": "^6.11.3",
"@types/eslint__js": "^8.42.3",
"@types/jest": "^29.5.12",
"@types/micromatch": "^4.0.9",
"@types/node": "18",
"@types/ungap__structured-clone": "^1.2.0",
"eslint": "^9.9.1",
"eslint-config-prettier": "^9.1.0",
"globals": "^15.9.0",
"husky": "^9.1.5",
"jest": "^29.7.0",
"lint-staged": "^15.2.10",
"loctool": "^2.25.2",
"prettier": "^3.3.3",
"ts-jest": "^29.2.5",
"typescript": "^5.5.4",
"typescript-eslint": "^8.4.0"
"typescript-eslint": "^8.4.0",
"unist-builder": "2",
"unist-util-visit": "^2",
"ilib-loctool-pendo-md": "file:."
},
"dependencies": {
"@ungap/structured-clone": "^1.2.0",
"ilib-xml-js": "^1.7.0",
"mdast-util-from-markdown": "^0",
"mdast-util-gfm-strikethrough": "^0",
"mdast-util-to-markdown": "^0",
"micromark": "~2.11.0",
"micromark-extension-gfm-strikethrough": "^0",
"micromatch": "^4.0.8",
"unist-util-remove-position": "3"
},
"peerDependencies": {
"loctool": "^2.25.2"
}
}
9 changes: 0 additions & 9 deletions src/__tests__/addNumbers.test.ts

This file was deleted.

15 changes: 0 additions & 15 deletions src/addNumbers.ts

This file was deleted.

25 changes: 24 additions & 1 deletion src/index.ts
@@ -1 +1,24 @@
export * from "./addNumbers";
/**
* Copyright © 2024, Box, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
*
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import type { Plugin } from "loctool";
import PendoXliffFileType from "./loctool/PendoXliffFileType";

// loctool plugin entrypoint
const plugin: Plugin = PendoXliffFileType;

export = plugin;