How to avoid deserializing certain XML elements (like HTML tags) and return as String. #6

sdttttt · 2022-05-05T08:09:49Z

Hi Mingun, thank you for your work.

I want to skip the deserialization of some child nodes in fast_xml and get their raw content back as a string.
I couldn't find a way to do this in fast_xml. What do I need to do?

<item>
  <description>
    <p style="text-indent:2em;"> Text </p>
    <img src="../1.jpg">
  </description>
</item>

For example, I would like the content of description to be returned like this:

Event::Text(e) => e == "
    <p style="text-indent:2em;"> Text </p>
    <img src="../1.jpg">
    "

Similar questions: tafia/quick-xml#241


Mingun · 2022-05-05T14:20:59Z

If you want to use serde (I assume so, because your original question used serde), this seems to be impossible at the moment. It would probably need a new attribute (actually a special field name, because we cannot pass any attributes from the struct to the deserializer) that instructs fast_xml::de::Deserializer to call visitor.visit_string(...) with the subtree content.

If you use the low-level events API, you can record the reader's buffer_position(), skip the XML subtree, take buffer_position() again, and then convert that range to a String. Also, I have implemented a Span API and will propose a PR soon. Then you can just get the span of the subtree instead of doing that yourself.

sdttttt · 2022-05-06T00:19:07Z

Thank you for taking the time to answer my question. 👍 You can close this issue after the Span API PR is merged.

sdttttt · 2022-05-06T06:53:38Z

The original XML can be retrieved from the original string slice using reader.buffer_position. That solved my problem.

"item" => {
    let mut buf = Vec::new();

    let start_position = reader.buffer_position();

    // Non-standard tags may exist, so end-name checking is disabled while skipping.
    reader.check_end_names(false);
    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::End(ref e)) => match reader.decode(e.name()).unwrap() {
                "item" => break,
                _ => {}
            },
            Ok(Event::Eof) => break,
            _ => {}
        }
    }
    reader.check_end_names(true);

    let end_position = reader.buffer_position();

    let text_string = text.to_string();

    // 7 is the length of `</item>`.
    let item_slice = &text_string.as_bytes()[start_position..end_position - 7];

    buf.clear();

    console_log!("start: {}, end: {}", start_position, end_position - 7);
    console_log!("{}", reader.decode(item_slice).unwrap());
}
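
The snippet above subtracts a hard-coded 7 and would underflow if the loop ended on Eof before any `</item>` was read. A minimal, std-only sketch of a more defensive slicing step follows. `inner_slice` is a hypothetical helper, not part of fast_xml. It assumes, as the snippet does, that `end` is the byte offset just past the closing tag, and that the closing tag is written exactly as `</name>` with no extra whitespace.

```rust
/// Returns the content between a start offset and the beginning of the
/// closing tag `</name>`, or `None` if the offsets do not make sense.
///
/// Hypothetical helper for illustration only (not a fast_xml API).
/// `start`: byte offset just after the opening tag.
/// `end`: byte offset just after the closing tag.
fn inner_slice<'a>(source: &'a str, start: usize, end: usize, name: &str) -> Option<&'a str> {
    // Length of `</name>` derived from the tag name instead of a magic number.
    let close_len = "</".len() + name.len() + ">".len();
    // `checked_sub` avoids an underflow panic when `end` is too small,
    // for example when the reader hit Eof before the closing tag.
    let inner_end = end.checked_sub(close_len)?;
    if inner_end < start {
        return None;
    }
    // `str::get` returns `None` for out-of-range or non-char-boundary offsets.
    source.get(start..inner_end)
}
```

With this shape the caller can handle the `None` case explicitly instead of panicking on a malformed or truncated document.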

Mingun · 2022-05-09T05:43:44Z

I think the correct way to handle this would be to add a special method to the Reader that reads the content of a tag while ignoring markup. The content would not be well-formed XML, though. That approach still leaves an open question: should we track possible XML tags in the ignored markup? The results can differ depending on the surrounding tag:

XML tag is p:
```
<p>
 Text 
 <img src="../1.jpg">
</p>
```
In that case should we return
```
r#"
 Text "#
```
or
```
r#"
 Text 
 <img src="../1.jpg">
"#
```
?
In other words, should we track the nesting of tags that match the surrounding tag? It is possible to implement both strategies, but which should be the default?
XML tag is img. If we select the second approach, then this XML will be invalid:
```
<img>
 Text 
 <img src="../1.jpg">
</img>
```
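
The difference between the two strategies can be illustrated with a small std-only sketch. Both functions are hypothetical and are not fast_xml APIs. They take the text that follows the opening tag and look for the closing tag by plain string scanning, which ignores comments, CDATA and self-closing tags.

```rust
/// Strategy 1: stop at the first `</name>`, without tracking nesting.
fn until_first_end<'a>(content: &'a str, name: &str) -> Option<&'a str> {
    let close = format!("</{}>", name);
    content.find(&close).map(|end| &content[..end])
}

/// Strategy 2: track nested `<name ...>` start tags, so that only the
/// end tag at depth 0 closes the surrounding element.
fn until_matching_end<'a>(content: &'a str, name: &str) -> Option<&'a str> {
    let open = format!("<{}", name);
    let close = format!("</{}>", name);
    let mut depth = 0usize;
    let mut pos = 0;
    while pos < content.len() {
        let rest = &content[pos..];
        if rest.starts_with(&close) {
            if depth == 0 {
                return Some(&content[..pos]);
            }
            depth -= 1;
            pos += close.len();
        } else if rest.starts_with(&open)
            // Require a delimiter after the name so `<p` does not match `<pre`.
            && matches!(rest[open.len()..].chars().next(), Some(c) if c == '>' || c.is_whitespace())
        {
            depth += 1;
            pos += open.len();
        } else {
            // Advance by one whole character to stay on a char boundary.
            pos += rest.chars().next().map_or(1, |c| c.len_utf8());
        }
    }
    // The matching end tag was never found.
    None
}
```

For the `img` example, the first strategy returns the text up to `</img>`, while the second treats the inner `<img src="../1.jpg">` as an unclosed nested element and finds no matching end tag at all.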

Mingun · 2022-05-09T06:33:40Z

The third variant -- add a method like:

impl Reader {
  // (signature approximate)
  pub fn read_html(&self, end: &str, not_closed_tags: &[&str]) -> Result<BytesText> {
  }
}

It should consume all events until Event::End(end_tag), tracking inner markup, but allow the explicitly listed tags to remain unclosed. In general, the implementation will be similar to Reader::read_text. Feel free to submit a PR implementing it.
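
To make the proposed behaviour concrete, here is a rough std-only sketch. It is not the library's implementation: `read_html_sketch` is a hypothetical free function that works on a `&str` holding the text after the opening tag, and it returns a string slice rather than `BytesText`. It assumes that `>` never appears inside attribute values or comments.

```rust
/// Returns everything before the end tag `</end>` that closes the
/// surrounding element. Inner tags are tracked on a stack, except for
/// the names in `not_closed_tags`, which may appear without an end tag.
///
/// Hypothetical sketch for illustration only (not a fast_xml API).
fn read_html_sketch<'a>(
    content: &'a str,
    end: &str,
    not_closed_tags: &[&str],
) -> Result<&'a str, String> {
    let mut stack: Vec<&str> = Vec::new();
    let mut pos = 0;
    while let Some(lt) = content[pos..].find('<') {
        let tag_start = pos + lt;
        let tag_end = tag_start + content[tag_start..].find('>').ok_or("unterminated tag")?;
        // Text between `<` and `>`, for example `p style="..."` or `/p`.
        let tag = &content[tag_start + 1..tag_end];
        pos = tag_end + 1;
        if let Some(name) = tag.strip_prefix('/') {
            let name = name.trim();
            match stack.pop() {
                // End tag of the innermost open element.
                Some(open) if open == name => {}
                Some(open) => return Err(format!("expected </{}>, found </{}>", open, name)),
                // Nothing is open, so this must be the surrounding end tag.
                None if name == end => return Ok(&content[..tag_start]),
                None => return Err(format!("unexpected </{}>", name)),
            }
        } else if tag.starts_with('!') || tag.starts_with('?') || tag.ends_with('/') {
            // Comments, declarations and self-closing tags open nothing.
        } else {
            let name = tag.split_whitespace().next().unwrap_or("");
            if !not_closed_tags.contains(&name) {
                stack.push(name);
            }
        }
    }
    Err(format!("missing </{}>", end))
}
```

Called on the `description` content from the original question with `not_closed_tags` set to `&["img"]`, this returns the raw markup up to `</description>`. With an empty list, the unclosed `<img>` is reported as a mismatch.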

sdttttt · 2022-05-09T07:46:07Z

Hmm, if I'm reading something in a non-standard XML format (such as HTML), I usually know about it in advance and turn off the end-tag name check with reader.check_end_names(false);

Mingun added enhancement New feature or request good first issue Good for newcomers labels May 9, 2022
