All the images that remain appear to be inside a <noscript> tag, which explains why they are not removed: the HTML parser parses the document as if scripts were enabled, so the content of <noscript> is not parsed into elements. This seems to get rid of all img elements:
doc.Find("noscript").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
And you may want to look at #139 for further information about the noscript behaviour.
Hello,
Below is some example code where removing elements (specifically img) doesn't work on a BBC news webpage.
I'm trying to remove everything that isn't actually text, such as images, CSS, scripts etc., to get a cleaner source for my example webpage summariser.
This code seems to work for <script> and <style> but not for <img>.
If you run it, you'll see some <img> tags in the printed text.
Thanks!
url := "https://www.bbc.co.uk/news/uk-england-london-68552817"
resp, err := http.Get(url)
if err != nil {
	slog.Error("Failed to get site", "site", url, "error", err)
	return "", err
}
defer resp.Body.Close()
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	slog.Error("Failed to get site body", "site", url, "error", err)
	return "", err
}
// doc.Find("img").Remove()
doc.Find("img").Each(func(i int, el *goquery.Selection) {
	slog.Error(fmt.Sprintf("Found img %+v", el))
	fmt.Println()
	fmt.Println()
	fmt.Println()
	el.Remove()
})
// We don't care about images, we only want the text on the site
doc.Find("picture").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about scripts, we only want the text on the site
doc.Find("script").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about css, we only want the text on the site
doc.Find("style").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// We don't care about videos, we only want the text on the site
doc.Find("video").Each(func(i int, el *goquery.Selection) {
	el.Remove()
})
// remove excess newlines etc
bdyText := standardizeSpaces(doc.Text())
slog.Info(bdyText)
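The snippet above calls standardizeSpaces, which is not shown in the post. For anyone trying to run it, here is one possible stand-in, assuming the helper only needs to collapse runs of whitespace (including newlines) into single spaces; the original implementation may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// standardizeSpaces is an assumed stand-in for the helper used in the post.
// strings.Fields splits on any run of whitespace and drops empty pieces,
// so joining with a single space also trims both ends.
func standardizeSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", standardizeSpaces("  BBC \n\n News\t today  "))
	// prints "BBC News today"
}
```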