Releases: crwlrsoft/crawler

v0.4.1 (Pre-release)

10 May 15:32

Fixed

  • The Json step now also works with Http responses as input.
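
A minimal sketch of what this enables, piping an Http step's response straight into the Json step; the Json::get() usage follows the library's docs and the extracted keys are made up:

```php
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->addStep(Http::get())                      // yields an HTTP response object
    ->addStep(Json::get(['title', 'author']));  // now accepts that response directly as input
```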

v0.4.0 (Pre-release)

06 May 14:27

Added

  • The BaseStep class now has where() and orWhere() methods to filter step outputs. You can set multiple filters, and all of them are applied to every output. A filter added via orWhere() is linked to the previously added filter with "OR", so outputs matching neither filter are not yielded. The available filters are created via static methods on the new Filter class. Currently available are comparison filters (equal, greater/less than, ...), a few string filters (contains, starts/ends with) and URL filters (scheme, domain, host, ...). See the first sketch after this list.
  • The GetLink and GetLinks steps now have the methods onSameDomain(), notOnSameDomain(), onDomain(), onSameHost(), notOnSameHost() and onHost() to restrict which links to find (also shown in the first sketch below).
  • The crawler's logger is now automatically added to the Store, so you can also log messages from there. This can be a breaking change, as the StoreInterface now also requires an addLogger() method. The new abstract Store class already implements it, so you can just extend it (see the second sketch below).
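
A sketch combining the new output filters with the link restriction methods; the exact Filter method names (Filter::urlScheme(), Filter::stringContains()) are assumptions based on the description above:

```php
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

// Find only links on the same host, and of those yield only the ones
// that use https OR contain "/blog/" in the URL.
$step = Html::getLinks()
    ->onSameHost()
    ->where(Filter::urlScheme('https'))
    ->orWhere(Filter::stringContains('/blog/'));
```

And a sketch of a custom store extending the new abstract Store class, which already implements addLogger(); the protected $logger property name is an assumption:

```php
use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class MyStore extends Store
{
    public function store(Result $result): void
    {
        // The crawler's logger was added automatically via addLogger(),
        // so stores can log messages too.
        $this->logger->info('Storing a result');

        // ...persist $result somewhere...
    }
}
```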

Changed

  • The Csv step can now also be used without defining a column mapping. In that case it uses the values from the first line as output array keys, which makes sense when that line contains column headers.
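
A sketch of both usages; the parseString() factory name is taken from the library's docs and may have differed at this version:

```php
use Crwlr\Crawler\Steps\Csv;

// With an explicit column mapping, the mapping defines the output array keys:
$step = Csv::parseString(['id', 'firstname', 'surname']);

// Without a mapping, the values from the first (header) line become the keys:
$step = Csv::parseString();
```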

v0.3.0 (Pre-release)

26 Apr 23:10

Added

  • By calling monitorMemoryUsage() you can tell the Crawler to add a log message with the current memory usage after every step invocation. You can also set a limit in bytes; memory usage is only logged once it exceeds that limit.
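
A short sketch; the optional byte-limit argument is an assumption based on the description above:

```php
$crawler = new MyCrawler();

// Log memory usage after every step invocation, but only once usage
// exceeds about 256 MB.
$crawler->monitorMemoryUsage(256 * 1024 * 1024);
```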

Fixed

  • Previously the use of Generators didn't actually make a lot of sense, because the outputs of one step were only iterated and passed on to the next step after the current step had been invoked with all its inputs. That made steps with a lot of inputs a bottleneck and caused bigger memory consumption. The crawler now immediately passes on each output of one step to the next step, if there is one.
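
This is not the library's actual code, but a sketch of the difference in how step outputs flow:

```php
// Before: step A was invoked with all its inputs, buffering every output,
// before step B saw anything.
function runBuffered(iterable $inputs, callable $stepA, callable $stepB): \Generator
{
    $buffer = [];

    foreach ($inputs as $input) {
        foreach ($stepA($input) as $output) {
            $buffer[] = $output; // the whole intermediate set is held in memory
        }
    }

    foreach ($buffer as $output) {
        yield from $stepB($output);
    }
}

// After: each output of step A is handed to step B immediately.
function runStreaming(iterable $inputs, callable $stepA, callable $stepB): \Generator
{
    foreach ($inputs as $input) {
        foreach ($stepA($input) as $output) {
            yield from $stepB($output); // no buffering between steps
        }
    }
}
```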

v0.2.0 (Pre-release)

25 Apr 10:29

Added

  • uniqueOutputs() method on steps to get only unique output values. If outputs are arrays or objects, you can provide a key that is used as the identifier to check for uniqueness. Otherwise, the arrays or objects are serialized for comparison, which will probably be slower. See the sketch after this list.
  • runAndTraverse() method on the Crawler, so you don't need to manually traverse the Generator if you don't need the results where you're calling the crawler.
  • Implemented the previously missing behavior for when a Group step should add something to the Result using setResultKey() or addKeysToResult(). For groups this only works when using combineToSingleOutput() (also shown in the sketch below).
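
A sketch of these additions; the CSS selectors are made up, and the group construction via Crawler::group() is an assumption based on the library's docs:

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->addStep(Http::get())
    // uniqueOutputs(): yield each found link only once.
    ->addStep(Html::getLinks()->uniqueOutputs())
    ->addStep(Http::get())
    ->addStep(
        // A group whose steps' outputs are combined into a single output
        // that is added to the Result under the key 'links'.
        Crawler::group()
            ->addStep(Html::getLink('a.next'))
            ->addStep(Html::getLinks('a.article'))
            ->combineToSingleOutput()
            ->setResultKey('links')
    );

// Run without manually traversing the result Generator; stores still
// receive every result.
$crawler->runAndTraverse();
```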

v0.1.0 (Pre-release)

18 Apr 11:58

Initial version containing:

  • The Crawler class: the main unit that executes all the steps you add to it, handling the steps' inputs and outputs.
  • The HttpCrawler class, which uses the PoliteHttpLoader (a version of HttpLoader that sticks to robots.txt rules), works with any PSR-18 HTTP client under the hood and has its own cookie jar implementation.
  • Some ready-to-use steps for HTTP, HTML, XML, JSON and CSV.
  • Loops and Groups.
  • The Crawler takes a PSR-3 LoggerInterface and passes it on to all steps. The included steps log some messages about what they're doing, and the package includes a simple CliLogger.
  • The Crawler requires a user agent, and the included BotUserAgent class provides an easy interface for bot user-agent strings.
  • Stores to save the final results can be added to the Crawler; a simple CSV file store ships with the package. A full usage sketch follows this list.
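
Putting these pieces together, a minimal usage sketch; class and namespace names follow the library's documentation and may have differed slightly in this early version:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    // The required user agent, built with the included BotUserAgent class.
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

$crawler = new MyCrawler();

$crawler->input('https://www.example.com');

$crawler->addStep(Http::get())      // loaded via the robots.txt-aware PoliteHttpLoader
    ->addStep(Html::getLinks());    // one of the ready-to-use HTML steps

foreach ($crawler->run() as $result) {
    // work with each Result here, or add a Store and let it persist them
}
```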