This repository has been archived by the owner on May 30, 2022. It is now read-only.

How to avoid deserializing certain XML elements (like HTML tags) and return as String. #6

Open
sdttttt opened this issue May 5, 2022 · 6 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@sdttttt

sdttttt commented May 5, 2022

Hi Mingun, thank you for your work.

I want to skip deserializing some child nodes in fast_xml and get them back as strings instead.
I couldn't find a way to do this in fast_xml. What do I need to do?

<item>
  <description>
    <p style="text-indent:2em;"> Text </p>
    <img src="../1.jpg">
  </description>
</item>

For example, the content of description would be returned as a single text event:

Event::Text(e) => e == "
    <p style="text-indent:2em;"> Text </p>
    <img src="../1.jpg">
    "

Similar questions: tafia/quick-xml#241

@Mingun
Owner

Mingun commented May 5, 2022

If you want to use serde (I assume so, because your original question used serde), this seems to be impossible at the moment. It would probably need a new attribute (actually a special field name, because we cannot pass attributes from the struct to the deserializer) that instructs fast_xml::de::Deserializer to call visitor.visit_string(...) with the subtree content.

If you use the low-level events API, you can record the reader's buffer_position(), skip the XML subtree, call buffer_position() again, and then convert that range to a String. I have also implemented a Span API and will propose a PR soon. Then you can just get the span of the subtree instead of doing that yourself.

@sdttttt
Author

sdttttt commented May 6, 2022

Thank you for taking the time to answer my question. 👍 You can close this issue after the Span API PR is merged.

@sdttttt
Author

sdttttt commented May 6, 2022

The original XML can be retrieved from the original string slice using reader.buffer_position(). That solved my problem.

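"item" => {
    let mut buf = Vec::new();

    // Byte offset at this point, i.e. just after the event that led into this arm.
    let start_position = reader.buffer_position();

    // The content may contain non-standard markup (e.g. unclosed HTML tags),
    // so disable end-tag name checking while skipping the subtree.
    reader.check_end_names(false);
    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::End(ref e)) => match reader.decode(e.name()).unwrap() {
                "item" => break,
                _ => {}
            },
            Ok(Event::Eof) => break,
            _ => {}
        }
    }
    reader.check_end_names(true);

    // Byte offset just after the closing `</item>` tag.
    let end_position = reader.buffer_position();

    let text_string = text.to_string();

    // 7 is the length of `</item>`; subtracting it excludes the closing tag.
    // Note: if the loop ended on Eof instead, this offset is not meaningful.
    let item_slice = &text_string.as_bytes()[start_position..end_position - 7];

    buf.clear();

    console_log!("start: {}, end: {}", start_position, end_position - 7);
    console_log!("{}", reader.decode(item_slice).unwrap());
}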
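
The hard-coded `7` in the snippet above only works for the exact spelling `</item>`. As a minimal sketch, assuming the two positions are byte offsets into the same source string and that the skip loop stopped on the closing tag (not on Eof), the slice can be computed by searching backwards for the start of the closing tag instead. `inner_markup` is a hypothetical helper written for illustration; it is not part of fast_xml and uses only the standard library.

```rust
/// Returns the raw markup between `start` (the offset right after the opening
/// tag) and the closing tag that ends at `end`.
///
/// Hypothetical helper, not part of fast_xml. `start` and `end` are assumed to
/// be byte offsets into `text`, such as the values recorded in the snippet above.
fn inner_markup(text: &str, start: usize, end: usize) -> Option<&str> {
    // `get` returns None instead of panicking when the range is out of
    // bounds or does not fall on character boundaries.
    let region = text.get(start..end)?;
    // The closing tag is the last `</...>` in the region, so search backwards
    // for its `</` rather than assuming a fixed tag length.
    let close_at = region.rfind("</")?;
    Some(&region[..close_at])
}
```

Called with the source text, the start offset, and the end offset, this returns `None` when no closing tag is found in the range, which also covers the case where the input was truncated.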

@Mingun
Owner

Mingun commented May 9, 2022

I think the correct way to handle this would be to add a special method to the Reader that reads the content of a tag while ignoring markup. That content would not be well-formed XML, though. This approach still leaves an open question: should we track possible XML tags inside the ignored markup? The result can differ depending on the surrounding tag:

  1. XML tag is p:

    <p>
      <p style="text-indent:2em;"> Text </p>
      <img src="../1.jpg">
    </p>

    In that case should we return

    r#"
      <p style="text-indent:2em;"> Text "#

    or

    r#"
      <p style="text-indent:2em;"> Text </p>
      <img src="../1.jpg">
    "#

    ?
    In other words, should we track the nesting of tags that match the surrounding tag? It is possible to implement both strategies, but which should be the default?

  2. XML tag is img. If we select the second approach, then this XML will be invalid:

    <img>
      <p style="text-indent:2em;"> Text </p>
      <img src="../1.jpg">
    </img>
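
The two strategies discussed above can be compared with a small, std-only sketch. `content_until_close` and `opens` are hypothetical helpers, not fast_xml APIs, and they make simplifying assumptions: closing tags are written exactly as `</tag>`, attribute values contain no `<` or `>`, and comments and CDATA are not handled.

```rust
/// True if `tail` begins with an opening tag whose prefix is `open`
/// (e.g. "<p") and which is not self-closing.
fn opens(tail: &str, open: &str) -> bool {
    match tail.strip_prefix(open) {
        Some(after) => {
            // The name must end here, so that `<pre>` does not match `<p`.
            let boundary = after.starts_with(|c: char| c == '>' || c.is_whitespace());
            let self_closing = after.find('>').map_or(false, |i| after[..i].ends_with('/'));
            boundary && !self_closing
        }
        None => false,
    }
}

/// Returns the content that starts at byte offset `from` (just after the
/// opening `<tag>`) and runs up to a closing `</tag>`.
///
/// With `track_nesting == false` the first `</tag>` ends the content.
/// With `track_nesting == true` nested `<tag ...>` openings are counted and
/// the content ends at the balancing `</tag>`; None means it was never found.
fn content_until_close<'a>(
    text: &'a str,
    from: usize,
    tag: &str,
    track_nesting: bool,
) -> Option<&'a str> {
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut depth = 0usize;
    let mut pos = from;
    loop {
        let rest = text.get(pos..)?;
        let at = pos + rest.find('<')?;
        let tail = &text[at..];
        if tail.starts_with(&close) {
            if depth == 0 {
                return Some(&text[from..at]);
            }
            depth -= 1;
            pos = at + close.len();
        } else if track_nesting && opens(tail, &open) {
            depth += 1;
            pos = at + open.len();
        } else {
            // Some other tag: skip past its `<` and keep scanning.
            pos = at + 1;
        }
    }
}
```

On the `p` example, the non-tracking mode yields the shorter string and the tracking mode yields the full inner markup. On the `img` example, the tracking mode finds no balancing `</img>` and returns `None`, which corresponds to the "invalid" outcome described in point 2.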

@Mingun
Owner

Mingun commented May 9, 2022

The third variant -- add a method like:

impl Reader {
  // (signature approximate)
  pub fn read_html(&self, end: &str, not_closed_tags: &[&str]) -> Result<BytesText> {
  }
}

It should consume all events until Event::End(end_tag), tracking inner markup, but allow explicitly listed tags to remain unclosed. In general, the implementation will be similar to Reader::read_text. Feel free to submit a PR implementing it.
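
To make the intended behaviour concrete, here is a rough, std-only sketch of the tracking logic. It is not the proposed `Reader::read_html` and does not use fast_xml: the free function below operates on a plain `&str` that starts right after the opening tag, and its name, signature, and error type are assumptions for illustration only. It also assumes attribute values contain no `<` or `>` and that comments do not contain `>`.

```rust
/// Returns the raw content preceding the `</end>` tag that closes the
/// surrounding element. Inner tags are tracked on a stack; tags named in
/// `not_closed_tags` may be left unclosed (e.g. HTML `img`).
///
/// Hypothetical stand-in for the proposed method, written against `&str`.
fn read_html<'a>(
    text: &'a str,
    end: &str,
    not_closed_tags: &[&str],
) -> Result<&'a str, String> {
    let mut open_tags: Vec<&str> = Vec::new();
    let mut pos = 0;
    while let Some(lt) = text[pos..].find('<') {
        let at = pos + lt;
        let gt = at + text[at..].find('>').ok_or("unterminated tag")?;
        let inside = &text[at + 1..gt]; // everything between `<` and `>`
        pos = gt + 1;

        if let Some(name) = inside.strip_prefix('/') {
            let name = name.trim();
            // Drop open tags that are allowed to stay unclosed.
            while let Some(top) = open_tags.last().copied() {
                if top != name && not_closed_tags.iter().any(|t| *t == top) {
                    open_tags.pop();
                } else {
                    break;
                }
            }
            match open_tags.pop() {
                Some(top) if top == name => {}
                Some(top) => return Err(format!("expected </{}>, found </{}>", top, name)),
                None if name == end => return Ok(&text[..at]),
                None => return Err(format!("unexpected </{}>", name)),
            }
        } else if !inside.ends_with('/')
            && !inside.starts_with(|c: char| c == '!' || c == '?')
        {
            // Opening tag: remember its name (text up to whitespace or `/`).
            let name = inside
                .split(|c: char| c.is_whitespace() || c == '/')
                .next()
                .unwrap_or("");
            if !name.is_empty() {
                open_tags.push(name);
            }
        }
    }
    Err(format!("missing </{}>", end))
}
```

For the `description` example from the first comment, passing `&["img"]` as `not_closed_tags` returns the inner HTML, while passing an empty list reports the unclosed `img` as an error.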

@sdttttt
Author

sdttttt commented May 9, 2022

Hmm, if I'm reading something in a non-standard XML format (such as HTML), I usually know that in advance and turn off the end-tag name check: reader.check_end_names(false);

Mingun added the enhancement and good first issue labels on May 9, 2022

2 participants