Fails to remove some img elements #475

Closed
RichardoC opened this issue May 5, 2024 · 2 comments

RichardoC commented May 5, 2024

Hello,

Below is some example code where removing elements (specifically img) doesn't work on a BBC news webpage.

I'm trying to remove everything that isn't actually text (images, CSS, scripts, etc.) to get a cleaner source for my example webpage summariser.

This code seems to work for <script> and <style>, but not for <img>.
If you run it, you'll see some img tags in the printed text.

Thanks!

```golang
// Fragment from a function that returns (string, error).
// standardizeSpaces is the author's own helper and is not shown here.
url := "https://www.bbc.co.uk/news/uk-england-london-68552817"
resp, err := http.Get(url)
if err != nil {
	slog.Error("Failed to get site", "site", url, "error", err)
	return "", err
}
defer resp.Body.Close()

doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	slog.Error("Failed to get site body", "site", url, "error", err)
	return "", err
}

// doc.Find("img").Remove()

// Log every img element the selector matches, then remove it
doc.Find("img").Each(func(i int, el *goquery.Selection) {
	slog.Error(fmt.Sprintf("Found img %+v", el))
	fmt.Println()
	fmt.Println()
	fmt.Println()
	el.Remove()
})
// We don't care about images, we only want the text on the site
doc.Find("picture").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about scripts, we only want the text on the site
doc.Find("script").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about css, we only want the text on the site
doc.Find("style").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about videos, we only want the text on the site
doc.Find("video").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})

// remove excess newlines etc
bdyText := standardizeSpaces(doc.Text())
slog.Info(bdyText)
```

RichardoC changed the title from "Failes to remove some img elements" to "Fails to remove some img elements" on May 5, 2024

mna (Member) commented May 5, 2024

Hello Richard,

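It seems that all the images that remain are inside a <noscript> tag, which explains why they are not removed: the HTML parser behaves as if scripts were enabled, so the content of noscript is kept as plain text instead of being parsed into elements. The img selector never matches that text, but it still shows up in the printed output. This seems to get rid of all img elements: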

	doc.Find("noscript").Each(func(i int, el *goquery.Selection) {
		el.Remove()
	})

And you may want to look at #139 for further information about the noscript behaviour.

Hope this helps,
Martin

RichardoC (Author) commented

Hello Martin, that solves my issue. Thank you for the explanation as well!
