TrawlerGo 🐛 is a basic HTTP crawler written in Go, designed to efficiently discover all URLs within a specified domain while capturing related HTTP request information.

joaooliveirapro/trawlergo

Trawlergo 🐛

Basic HTTP crawler in Go. Use it to discover all the URLs for a given domain, along with related information from each HTTP request.

Features

  • Regex match to include/exclude paths
  • Concurrency safe
  • HTTP Request information includes:
    • Response status code
    • Added count (how many new links were found on the page)
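
The include/exclude matching listed above could look roughly like the sketch below. It is not taken from trawlergo's source: `shouldVisit` and `compileAll` are hypothetical helpers, and the precedence rule (an exclude match always wins, and an empty include list accepts everything) is an assumption. The library may resolve conflicts between the two lists differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// shouldVisit is a hypothetical helper, not part of trawlergo's API.
// Assumed semantics: any exclude match rejects the path; otherwise an
// empty include list accepts everything, and a non-empty one requires a match.
func shouldVisit(path string, include, exclude []*regexp.Regexp) bool {
	for _, re := range exclude {
		if re.MatchString(path) {
			return false
		}
	}
	if len(include) == 0 {
		return true
	}
	for _, re := range include {
		if re.MatchString(path) {
			return true
		}
	}
	return false
}

// compileAll compiles every pattern and panics on an invalid expression.
func compileAll(patterns []string) []*regexp.Regexp {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		out = append(out, regexp.MustCompile(p))
	}
	return out
}

func main() {
	include := compileAll([]string{"/blog"})
	exclude := compileAll([]string{"/no-go"})

	fmt.Println(shouldVisit("/blog/post", include, exclude))  // matches include only
	fmt.Println(shouldVisit("/blog/no-go", include, exclude)) // matches exclude
	fmt.Println(shouldVisit("/about", include, exclude))      // matches neither list
}
```

Compiling the patterns once up front avoids recompiling them for every discovered link.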

Install

$ go get github.com/joaooliveirapro/trawlergo # install
$ go mod tidy                                 # clean up dependencies

How to use

package main

import "github.com/joaooliveirapro/trawlergo"

func main() {
	tg := trawlergo.App{
		Workers:      2,                                     // Number of goroutines
		MaxDepth:     1000,                                  // Max HTTP requests (safe stop)
		Domain:       "www.mysite.com",                      // Used to standardize relative URLs; don't include the protocol
		StartingURLs: []string{"https://www.mysite.com/"},   // Starting URLs
		ExcludeRegex: []string{"/no-go", `[\d]`},            // Don't include these paths
		IncludeRegex: []string{"/some-path-001"},            // Include these paths
	}
	tg.Run()
	tg.SaveToJSON("data.json") // Write the crawl results to data.json
}

Note: the App must have as many StartingURLs as Workers; otherwise, Workers may exit prematurely.
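
One way to catch the Workers/StartingURLs mismatch before starting a crawl is a small pre-flight check. The sketch below is only illustrative: `checkSeeds` is a hypothetical function, not something trawlergo provides, and it reads the requirement as "at least as many starting URLs as workers", which is an assumption about what the note means.

```go
package main

import (
	"errors"
	"fmt"
)

// checkSeeds is a hypothetical pre-flight check, not a trawlergo function.
// It assumes each worker needs its own starting URL to begin with.
func checkSeeds(workers int, startingURLs []string) error {
	if workers < 1 {
		return errors.New("workers must be at least 1")
	}
	if len(startingURLs) < workers {
		return fmt.Errorf("need at least %d starting URLs, got %d",
			workers, len(startingURLs))
	}
	return nil
}

func main() {
	// Two workers but a single starting URL: the check reports an error.
	if err := checkSeeds(2, []string{"https://www.mysite.com/"}); err != nil {
		fmt.Println("config problem:", err)
	}
}
```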

// data.json
[
 {
  "addedCount": 3,
  "statusCode": 200,
  "url": "https://crawler-test.com/mobile/separate_desktop_with_different_h1"
 },
 {
  "addedCount": 0,
  "statusCode": 200,
  "url": "https://crawler-test.com/mobile/separate_desktop_with_different_links_in"
 },
 ...
]
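
To post-process the saved file, the records can be decoded with the standard library. This is a minimal sketch that assumes the output contains only the three keys shown in the sample above; `Record` and `parseRecords` are names made up for illustration and are not part of trawlergo.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record mirrors the three keys visible in the sample output.
// The type itself is hypothetical, not exported by trawlergo.
type Record struct {
	AddedCount int    `json:"addedCount"`
	StatusCode int    `json:"statusCode"`
	URL        string `json:"url"`
}

// parseRecords decodes a JSON array of crawl records.
func parseRecords(data []byte) ([]Record, error) {
	var records []Record
	if err := json.Unmarshal(data, &records); err != nil {
		return nil, err
	}
	return records, nil
}

func main() {
	// Inline sample; in practice the bytes would come from os.ReadFile("data.json").
	sample := []byte(`[{"addedCount": 3, "statusCode": 200,
		"url": "https://crawler-test.com/mobile/separate_desktop_with_different_h1"}]`)

	records, err := parseRecords(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range records {
		fmt.Printf("%d %s (%d new links)\n", r.StatusCode, r.URL, r.AddedCount)
	}
}
```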

License

The MIT License (MIT)
