-
Our team has been asked by various organizations within our company to provide versions of our documentation content in Markdown format for LLM ingestion, since Markdown ingestion is supported by default in many models. HTML input is also often supported, but the tags and attributes are usually stripped out and only the text nodes are ingested.

The DITA-OT provides a few Markdown output formats, but we'd like to explore the option of overriding various XSL templates in the org.lwdita plugin for customization. When I look at dita2markdown.xsl, it appears to generate Pandoc AST XML. The comments in that stylesheet indicate that certain features are intended for eventual XHTML output, which makes sense if you wanted to embed information in the AST that could be leveraged in HTML output.

One modification we'd like to experiment with is injecting some form of DITA semantics into the Markdown text, and dita2markdown.xsl seems like the place to do it, so the injected text could appear in any output format supported by lwdita. For example, whenever a … It seems that we'd need to set the …

If there are better places to insert such logic, then I'm all ears. In particular, I'd be interested in thoughts from @jelovirt, @raducoravu, and @chrispy-snps, but all comers are welcome who may have insight into the implementation or similar needs.
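To make the idea concrete, here is a minimal sketch of the kind of override I'm imagining, layered on top of dita2markdown.xsl. This is purely illustrative: the matched class is standard DITA, but whether a bare text node injected at this point survives into the serialized Markdown (or needs to be wrapped in the stylesheet's Pandoc AST vocabulary) is an assumption that would need to be verified against the actual stylesheet.

```xml
<!-- custom-markdown.xsl: a hypothetical customization imported
     on top of org.lwdita's dita2markdown.xsl -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Prefix every note with a textual marker of its DITA semantics,
       then let the base templates produce their normal output.
       Whether plain text is legal here in the generated Pandoc AST
       is an open question. -->
  <xsl:template match="*[contains(@class, ' topic/note ')]">
    <xsl:text>[DITA note</xsl:text>
    <xsl:if test="@type">
      <xsl:value-of select="concat(': ', @type)"/>
    </xsl:if>
    <xsl:text>] </xsl:text>
    <xsl:next-match/>
  </xsl:template>

</xsl:stylesheet>
```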
-
Interesting discussion, John. I'm not sure I have anything useful to contribute yet, but I will start a discussion with our Oxygen Feedback team, which recently added support for AI-based responses on our user's manual web site — support which we also plan to make commercially available.
About your comments on filtering:
I think it is more accurate to do what you have done until now: create separate filtered outputs for each product's user guide and have separate AI-based searching in each of them. The end user would first choose the product themselves, open the user guide web site for that product (which is already filtered), and search only inside its contents. Otherwise, if the AI ingests the unfiltered contents for all products, who knows what kind of mixed answers it would give, attributing capabilities of one product to another.
-
If the only thing that's needed is to add additional text based on e.g. profiling attributes, the best place to add this is a DITA-OT preprocessing step. The LwDITA Markdown output is not intended to be extended at this point, and it's way easier to add the LLM optimization content during preprocessing.
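As a sketch, a plugin could hook an Ant target into preprocessing along these lines. The plugin id and target name here are made up, and the `depend.*` extension-point id should be verified against the extension points reference for your DITA-OT version:

```xml
<!-- plugin.xml for a hypothetical com.example.llm-prep plugin -->
<plugin id="com.example.llm-prep">
  <!-- Make the plugin's Ant targets available to the build. -->
  <feature extension="dita.conductor.target.relative" file="build.xml"/>
  <!-- Run our target as part of preprocessing; check the exact
       extension-point id against the DITA-OT docs. -->
  <feature extension="depend.preprocess.pre" value="llm-prep.inject-text"/>
</plugin>
```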
-
@kirkilj - we're going a bit of a different way. Our common input format to the LLM content processing pipeline is HTML because (1) it is a ubiquitous format and tooling is plentiful, and (2) it is structured and attributed, enabling rich information storage and structural processing.

For (1), all our inputs get converted to HTML for input to the content processor - DITA-OT output, Salesforce knowledge articles, Word documents, even Markdown documents written by product teams. (For DITA-OT output, we have an "LLM" plugin that simplifies the HTML a bit for efficiency.)

The content processor processes and chunks the HTML structurally and hierarchically using Beautiful Soup. This allows for structural processing, recognition, and manipulation, such as inserting "helper text" as you described in your post. We also make use of HTML attributes, such as using …

We actually have the opposite problem from the one you describe. When we give Markdown to the content processor, I need a convention for product teams to specify user-defined attributes and directives in their Markdown source, and I need a way to convert the Markdown to HTML that allows these attributes to survive. I haven't looked too deeply into Markdown-to-HTML conversion paths yet. The DITA-OT would be a nice solution, as I can customize its output. Do you know if it supports translating any kind of Markdown attributes into …
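For illustration, the kind of convention I have in mind is Pandoc-style attribute blocks on headers. (The LwDITA Markdown reader understands a subset of this syntax on headers, mapping `#id` to `@id` and `.class` to `@outputclass`; whether arbitrary key=value pairs survive is converter-dependent and would need to be tested.)

```markdown
## Installing the widget {#install .task data-product="widget-pro"}
```

With a converter that preserves attributes, this would come out roughly as:

```html
<h2 id="install" class="task" data-product="widget-pro">Installing the widget</h2>
```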
Any preprocessing extension point will work depending on what you want to do, but the post-preprocessing extension point is likely the best, because at that point you have all the information available to you.