-
Our team has been asked by various organizations within our company to provide versions of our documentation content in Markdown format for LLM ingestion, since Markdown ingestion is supported by default in many models. HTML input is also often supported, but the tags and attributes are usually stripped out and only the text nodes are ingested.

The DITA-OT provides a few Markdown output formats, but we'd like to explore the option of overriding various XSL templates in the org.lwdita plugin for customization. When I look at dita2markdown.xsl, it appears to generate Pandoc AST XML. The comments in that stylesheet indicate that certain features are intended for eventual XHTML output, which makes sense if you wanted to embed information in the AST that could be leveraged in HTML output.

One modification we'd like to experiment with is injecting some form of DITA semantics into the Markdown text, and dita2markdown.xsl seems like the place to do it, so the injected text could appear in any output format supported by lwdita. For example, whenever a … It seems that we'd need to set the …

If there are better places to insert such logic, then I'm all ears. In particular, I'd be interested in thoughts from @jelovirt, @raducoravu, and @chrispy-snps, but all comers are welcome who may have insight into the implementation or similar needs.
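To make the idea concrete, here is a minimal sketch of the kind of override I'm imagining, layered on top of dita2markdown.xsl. This is purely illustrative: the matched class is standard DITA, but whether a bare text node injected at this point survives into the serialized Markdown (or needs to be wrapped in the stylesheet's Pandoc AST vocabulary) is an assumption that would need to be verified against the actual stylesheet.

```xml
<!-- custom-markdown.xsl: a hypothetical customization imported
     on top of org.lwdita's dita2markdown.xsl -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Prefix every note with a textual marker of its DITA semantics,
       then let the base templates produce their normal output.
       Whether plain text is legal here in the generated Pandoc AST
       is an open question. -->
  <xsl:template match="*[contains(@class, ' topic/note ')]">
    <xsl:text>[DITA note</xsl:text>
    <xsl:if test="@type">
      <xsl:value-of select="concat(': ', @type)"/>
    </xsl:if>
    <xsl:text>] </xsl:text>
    <xsl:next-match/>
  </xsl:template>

</xsl:stylesheet>
```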
-
Interesting discussion, John. I'm not sure I have anything useful to contribute yet, but I will start a discussion with our Oxygen Feedback team, which recently added support for AI-based responses on our user's manual web site — support which we also plan to make commercially available.
About your comments on filtering:
I think it is more accurate to do what you have done until now: create separate filtered outputs for each product's user guide and have separate AI-based searching in each of them. The end user would first choose the product themselves, open the user guide web site for that product (which is already filtered), and search only inside its contents. Otherwise, if the AI ingests the unfiltered contents for all products, who knows what kind of mixed answers it would give, attributing capabilities of one product to another.
-
If the only thing that's needed is to add additional text based on e.g. profiling attributes, the best place to add this is a DITA-OT preprocessing step. The LwDITA Markdown output is not intended to be extended at this point, and it's way easier to add the LLM optimization content during preprocessing.
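As a sketch, a plugin could hook an Ant target into preprocessing along these lines. The plugin id and target name here are made up, and the `depend.*` extension-point id should be verified against the extension points reference for your DITA-OT version:

```xml
<!-- plugin.xml for a hypothetical com.example.llm-prep plugin -->
<plugin id="com.example.llm-prep">
  <!-- Make the plugin's Ant targets available to the build. -->
  <feature extension="dita.conductor.target.relative" file="build.xml"/>
  <!-- Run our target as part of preprocessing; check the exact
       extension-point id against the DITA-OT docs. -->
  <feature extension="depend.preprocess.pre" value="llm-prep.inject-text"/>
</plugin>
```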
-
@kirkilj - we're going a bit of a different way. Our common input format to the LLM content processing pipeline is HTML because (1) it is a ubiquitous format and tooling is plentiful, and (2) it is structured and attributed, enabling rich information storage and structural processing.

For (1), all our inputs get converted to HTML for input to the content processor - DITA-OT output, Salesforce knowledge articles, Word documents, even Markdown documents written by product teams. (For DITA-OT output, we have an "LLM" plugin that simplifies the HTML a bit for efficiency.)

The content processor processes and chunks the HTML structurally and hierarchically using Beautiful Soup. This allows for structural processing, recognition, and manipulation, such as inserting "helper text" as you described in your post. We also make use of HTML attributes, such as using …

We actually have the opposite problem from the one you describe. When we give Markdown to the content processor, I need a convention for product teams to specify user-defined attributes and directives in their Markdown source, and I need a way to convert the Markdown to HTML that allows these attributes to survive. I haven't looked too deeply into Markdown-to-HTML conversion paths yet. The DITA-OT would be a nice solution, as I can customize its output. Do you know if it supports translating any kind of Markdown attributes into …
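For illustration, the kind of convention I have in mind is Pandoc-style attribute blocks on headers. (The LwDITA Markdown reader understands a subset of this syntax on headers, mapping `#id` to `@id` and `.class` to `@outputclass`; whether arbitrary key=value pairs survive is converter-dependent and would need to be tested.)

```markdown
## Installing the widget {#install .task data-product="widget-pro"}
```

With a converter that preserves attributes, this would come out roughly as:

```html
<h2 id="install" class="task" data-product="widget-pro">Installing the widget</h2>
```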
Any preprocessing extension point will work depending on what you want to do, but the post-preprocessing extension point is likely the best, because at that point you have all the information available to you.