Improve fromBodyXML to convert Spark's kitchen sink. #58
Conversation
The test fixture we were using came from Spark - however, C&M apply various transformations, which means the bodyXML we publish from Spark isn't necessarily the same as the bodyXML a user would read from the API. For the purposes of this transformer, we are interested in the latter, so this change updates the test fixture file to be the published version of the bodyXML. I have also taken the expected JSON from the bodyTree produced by cp-content-pipeline-api (removing the "data" blocks that contain referenceIds). There may be other places where the CP-generated tree is not actually valid, but we will deal with those as we go along.
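Stripping those blocks out of the expected JSON amounts to a recursive walk, roughly like this (a throwaway sketch, not code in this PR; the `referenceId` key name is an assumption, not checked against the fixture):

```ts
// Throwaway sketch for cleaning the expected-output fixture:
// walk the JSON and drop "data" blocks that carry a referenceId.
// NOTE: the "referenceId" key name is an assumption.
function stripReferenceData(node: unknown): unknown {
  if (Array.isArray(node)) {
    return node.map(stripReferenceData);
  }
  if (node !== null && typeof node === 'object') {
    const result: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(node)) {
      const isReferenceData =
        key === 'data' && value !== null && typeof value === 'object' && 'referenceId' in value;
      if (!isReferenceData) {
        result[key] = stripReferenceData(value);
      }
    }
    return result;
  }
  return node;
}
```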
This introduces a new internal type `__LIFT_CHILDREN__`, which is used to indicate that we don't want this node, but do want its children. Examples: `<experimental>`, `<div class="n-content-layout__container">`.
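Roughly how the lifting works, as a sketch (the types and function names here are illustrative, not the library's actual API): a transformer returns a node of type `__LIFT_CHILDREN__`, and the tree builder splices that node's transformed children into the parent's child list instead of keeping the node.

```ts
// Sketch only: the types and names here are illustrative, not the library's actual API.
type XmlNode = { tag: string; attributes: Record<string, string>; children: XmlNode[] };
type TreeNode = { type: string; children?: TreeNode[]; [key: string]: unknown };

const LIFT_CHILDREN = '__LIFT_CHILDREN__';

// Transformer for wrapper elements we don't want in the tree, e.g. <experimental>
// or <div class="n-content-layout__container">: drop the node, keep the children.
function transformWrapper(_node: XmlNode): TreeNode {
  return { type: LIFT_CHILDREN };
}

// When assembling the tree, a __LIFT_CHILDREN__ result is replaced by the
// transformed children of the original XML node, spliced into the parent's list.
function buildTree(node: XmlNode, transform: (n: XmlNode) => TreeNode): TreeNode[] {
  const result = transform(node);
  const children = node.children.flatMap((child) => buildTree(child, transform));
  if (result.type === LIFT_CHILDREN) {
    return children;
  }
  return [{ ...result, children }];
}
```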
thought (non-blocking): I have just a couple of thoughts, if useful; nothing closely related to the code of the PR.
I am also thinking that the starting format for this transformer should be the `bodyXML` format returned by the Internal Content API. That way the CP implementation and this one will use the same starting point.
The biggest semantic difference between the `bodyXML` format that Spark publishes and the one returned from the Internal Content API is the opaque and translucent namespaces. However, while I think they will play a role in the complexity of the other transformer (content tree -> external bodyXML), they shouldn't be an issue for this one.
I really don't have enough knowledge to weigh in on the CP implementation. Is it possible to build a testing strategy that makes sure this implementation builds exactly the same Content Tree as the CP implementation (do we actually want this)? For example, we could iterate over lots of (or all) uuids and compare.
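A rough sketch of what that comparison could look like (the fetch/build helpers here are hypothetical placeholders, not existing APIs):

```ts
// Sketch of a periodic comparison job; the declared helpers are hypothetical
// stand-ins for the CP pipeline output and this library's transformer.
import assert from 'node:assert';

declare function fetchCpContentTree(uuid: string): Promise<unknown>; // hypothetical
declare function fetchPublishedBodyXml(uuid: string): Promise<string>; // hypothetical
declare function fromBodyXml(bodyXml: string): unknown; // stand-in for this library's entry point

async function findMismatches(uuids: string[]): Promise<string[]> {
  const mismatches: string[] = [];
  for (const uuid of uuids) {
    const expected = await fetchCpContentTree(uuid);
    const actual = fromBodyXml(await fetchPublishedBodyXml(uuid));
    try {
      assert.deepStrictEqual(actual, expected);
    } catch {
      mismatches.push(uuid);
    }
  }
  return mismatches;
}
```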
I like that idea! Maybe a test that runs in Circle periodically that:
Spark has a kitchen sink article, which has most of the types of content produced by Spark
The `from-bodyxml` library in this repository was originally written to work with simple Sustainable Views articles, and so did not have the transformers required to convert this kitchen sink; there was a failing test in the test suite. The failing test was actually using XML from Spark, which represents what Spark publishes, not what is available in the public C&M APIs. I think when we use this library, we would be using it with the public API representation (🤔), so I updated the fixtures to use that. I also updated the expected output to be exactly what is returned by FT.com's API, as this is what we're currently serving users.
Then I systematically went through and fixed things until the tests passed.
**Question:** should we just use the existing implementation of this from CP?
Pro: it already exists, and has been battle-tested in the real world, so we can be confident it works on all real articles.
Con: it might need some decoupling from CP stuff. They also have a bunch of workarounds - but presumably we need a strategy for those anyway for the migration?!
There are a few things that it does not have: