Fails to remove some img elements #475

RichardoC · 2024-05-05T18:41:26Z

Hello,

Below is some example code where removing elements (specifically img) doesn't work on a BBC news webpage.

I'm trying to remove everything that isn't actually text, such as images, CSS, scripts, etc., to get a cleaner source for my example webpage summariser.

This code seems to work for `<script>` and `<style>` but not for `<img>`.
If you run it, you'll see some `<img>` tags in the printed text.

Thanks!

```golang
url := "https://www.bbc.co.uk/news/uk-england-london-68552817"
resp, err := http.Get(url)
if err != nil {
	slog.Error("Failed to get site", "site", url, "error", err)
	return "", err
}
defer resp.Body.Close()

doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	slog.Error("Failed to get site body", "site", url, "error", err)
	return "", err
}

// doc.Find("img").Remove()

doc.Find("img").Each(func(i int, el *goquery.Selection) {
	slog.Error(fmt.Sprintf("Found img %+v", el))
	fmt.Println()
	fmt.Println()
	fmt.Println()
	el.Remove()
})
// We don't care about images, we only want the text on the site
doc.Find("picture").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about scripts, we only want the text on the site
doc.Find("script").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about css, we only want the text on the site
doc.Find("style").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about videos, we only want the text on the site
doc.Find("video").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})

// remove excess newlines etc
bdyText := standardizeSpaces(doc.Text())
slog.Info(bdyText)
```
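
The snippet calls `standardizeSpaces`, which is not shown in the post. To make the snippet easier to try out, here is a minimal sketch of such a helper, assuming it only collapses runs of whitespace (spaces, tabs, newlines) into single spaces and trims both ends. The real implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// standardizeSpaces is a hypothetical stand-in for the helper used in the
// snippet above; the original implementation is not shown in the issue.
// strings.Fields splits on any run of whitespace, so joining the fields with
// a single space collapses newlines, tabs and repeated spaces and trims the
// leading and trailing whitespace.
func standardizeSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Println(standardizeSpaces("  Headline \n\n  Body\ttext  "))
}
```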

mna · 2024-05-05T19:01:04Z

Hello Richard,

It seems like all the images that remain are inside `<noscript>` tags, which explains why they are not removed: the HTML parser behaves as if scripts were enabled, so it does not parse the content of `<noscript>` into elements and keeps it as plain text instead. Removing the `<noscript>` elements seems to get rid of all the leftover `img` markup:

	doc.Find("noscript").Each(func(i int, el *goquery.Selection) {
		el.Remove()
	})

And you may want to look at #139 for further information about the noscript behaviour.

Hope this helps,
Martin

RichardoC · 2024-05-05T19:42:10Z

Hello Martin, that solves my issue. Thank you for the explanation as well!

RichardoC changed the title ~~Failes to remove some img elements~~ Fails to remove some img elements May 5, 2024

RichardoC closed this as completed May 5, 2024
