XMLOutputParser does not parse CDATA #6666

Open
5 tasks done
Aman-14 opened this issue Aug 30, 2024 · 2 comments
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

Aman-14 commented Aug 30, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import { XMLOutputParser } from "./output_parsers/xml.js";

const xml = `<?xml version="1.0" encoding="UTF-8"?>
<userProfile>
  <userID>12345</userID>
  <email>[email protected]</email>
  <bio><![CDATA[John is a senior developer with <10 years> of experience. He uses <Typescript> at work.]]></bio>
</userProfile>`;

const parser = new XMLOutputParser();
parser.parse(xml).then(console.log).catch(console.error);

Output:

{
  userProfile: [
    { userID: '12345' },
    { email: '[email protected]' },
    { bio: '' }
  ]
}

Expected Output:

{
  userProfile: [
    { userID: '12345' },
    { email: '[email protected]' },
    {
      bio: 'John is a senior developer with <10 years> of experience. He uses <Typescript> at work.'
    }
  ]
}

As the output shows, XMLOutputParser does not parse the CDATA text: bio comes back as an empty string instead of the CDATA content.

Error Message and Stack Trace (if applicable)

No response

Description

XMLOutputParser in LangChain.js does not correctly parse XML that contains CDATA sections when run in a Node.js environment. The parser appears to ignore the CDATA content, so the enclosing element is returned with an empty string.
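
Until the parser handles CDATA itself, one possible workaround is to rewrite each CDATA section as escaped character data before calling parse. The sketch below is not taken from the LangChain.js codebase: escapeXmlText and inlineCdata are hypothetical helper names, and it assumes the parser decodes the standard entities (&lt;, &gt;, &amp;) in text nodes, which is not confirmed in this thread.

```typescript
// Escape the characters that are significant in XML character data.
// "&" must be handled first so the entities added below are not re-escaped.
function escapeXmlText(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Replace every <![CDATA[...]]> section with equivalent escaped text.
// The lazy match stops at the first "]]>", which is where a CDATA section ends.
function inlineCdata(xml: string): string {
  return xml.replace(
    /<!\[CDATA\[([\s\S]*?)\]\]>/g,
    (_match: string, body: string) => escapeXmlText(body)
  );
}
```

With these helpers, the call in the example code would become parser.parse(inlineCdata(xml)).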

System Info

OS: macOS
Node version: 20.14.0
Yarn version: 1.22.22

dosubot bot commented Aug 30, 2024

XMLOutputParser does not parse CDATA sections because the sax.parser instance used in the parseXMLMarkdown function is not set up to handle them: the current implementation registers no event handler for CDATA sections.

To fix this, you need to add an event handler for oncdata in the sax.parser configuration. Here is how you can modify the code:

const parser = sax.parser(true);

// Event handlers
parser.onopentag = (node: any) => { /* ... */ };
parser.onclosetag = () => { /* ... */ };
parser.ontext = (text: any) => { /* ... */ };
parser.onattribute = (attr: any) => { /* ... */ };

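// Add a handler for CDATA sections. elementStack is the stack of open
// elements that the handlers above maintain.
parser.oncdata = (cdata: string) => {
  if (elementStack.length > 0) {
    const currentElement = elementStack[elementStack.length - 1];
    // Append rather than assign so CDATA mixes with any surrounding text.
    currentElement.text += cdata;
  }
};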

This addition will allow the parser to correctly handle and include CDATA sections in the parsed result [1][2].
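
To see why the missing handler yields an empty string, here is a dependency-free sketch of the same stack-based pattern. It does not use sax and is not the LangChain.js implementation: XmlElement, Handlers, createTreeBuilder, and replayBio are hypothetical names, and the events are replayed by hand on the assumption that an event-based parser reports CDATA through its own callback rather than through ontext.

```typescript
// Hypothetical element shape, modeled on the snippet above.
interface XmlElement {
  name: string;
  text: string;
  children: XmlElement[];
}

// Hypothetical stand-in for the subset of sax-style callbacks used here.
interface Handlers {
  onopentag: (name: string) => void;
  ontext: (text: string) => void;
  oncdata: (cdata: string) => void;
  onclosetag: () => void;
}

function createTreeBuilder(handleCdata: boolean): {
  handlers: Handlers;
  getRoot: () => XmlElement | undefined;
} {
  const elementStack: XmlElement[] = [];
  let root: XmlElement | undefined;

  // Append a chunk to the innermost open element, if there is one.
  const appendText = (chunk: string): void => {
    if (elementStack.length > 0) {
      elementStack[elementStack.length - 1].text += chunk;
    }
  };

  const handlers: Handlers = {
    onopentag: (name) => {
      const element: XmlElement = { name, text: "", children: [] };
      if (elementStack.length > 0) {
        elementStack[elementStack.length - 1].children.push(element);
      } else {
        root = element;
      }
      elementStack.push(element);
    },
    ontext: appendText,
    // Without a real handler, CDATA chunks are silently discarded.
    oncdata: handleCdata ? appendText : () => {},
    onclosetag: () => {
      elementStack.pop();
    },
  };

  return { handlers, getRoot: () => root };
}

// Replay the events assumed for <bio><![CDATA[...]]></bio>.
function replayBio(handleCdata: boolean): string {
  const { handlers, getRoot } = createTreeBuilder(handleCdata);
  handlers.onopentag("bio");
  handlers.oncdata("He uses <Typescript> at work.");
  handlers.onclosetag();
  const root = getRoot();
  return root ? root.text : "";
}

console.log(JSON.stringify(replayBio(false))); // ""
console.log(JSON.stringify(replayBio(true))); // "He uses <Typescript> at work."
```

Routing both callbacks through the same appendText helper keeps CDATA and ordinary text in document order when an element contains both.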

dosubot bot added the auto:bug label on Aug 30, 2024

Aman-14 commented Aug 30, 2024

Correct, I can create a PR to fix this. Please ping me if any human is reading this.
