Releases: crwlrsoft/crawler

v0.4.1 (Pre-release)

10 May 15:32

Fixed

  • The Json step now also works with Http responses as input.
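
A minimal sketch of what this enables, piping an Http step's response straight into the Json step; the Json::get() usage follows the library's docs and the extracted keys are made up:

```php
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->addStep(Http::get())                      // yields an HTTP response object
    ->addStep(Json::get(['title', 'author']));  // now accepts that response directly as input
```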

v0.4.0 (Pre-release)

06 May 14:27

Added

  • The BaseStep class now has where() and orWhere() methods to filter step outputs. You can set multiple filters, and all of them are applied to every output. A filter added via orWhere() is linked to the previously added filter with "OR", so outputs matching neither filter are not yielded. The available filters are created via static methods on the new Filter class. Currently available are comparison filters (equal, greater/less than, ...), a few string filters (contains, starts/ends with) and URL filters (scheme, domain, host, ...). See the first sketch after this list.
  • The GetLink and GetLinks steps now have the methods onSameDomain(), notOnSameDomain(), onDomain(), onSameHost(), notOnSameHost() and onHost() to restrict which links to find (also shown in the first sketch below).
  • The crawler's logger is now automatically added to the Store, so you can also log messages from there. This can be a breaking change, as the StoreInterface now also requires an addLogger() method. The new abstract Store class already implements it, so you can just extend it (see the second sketch below).
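
A sketch combining the new output filters with the link restriction methods; the exact Filter method names (Filter::urlScheme(), Filter::stringContains()) are assumptions based on the description above:

```php
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

// Find only links on the same host, and of those yield only the ones
// that use https OR contain "/blog/" in the URL.
$step = Html::getLinks()
    ->onSameHost()
    ->where(Filter::urlScheme('https'))
    ->orWhere(Filter::stringContains('/blog/'));
```

And a sketch of a custom store extending the new abstract Store class, which already implements addLogger(); the protected $logger property name is an assumption:

```php
use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class MyStore extends Store
{
    public function store(Result $result): void
    {
        // The crawler's logger was added automatically via addLogger(),
        // so stores can log messages too.
        $this->logger->info('Storing a result');

        // ...persist $result somewhere...
    }
}
```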

Changed

  • The Csv step can now also be used without defining a column mapping. In that case it uses the values from the first line as output array keys, which makes sense when that line contains column headers.
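
A sketch of both usages; the parseString() factory name is taken from the library's docs and may have differed at this version:

```php
use Crwlr\Crawler\Steps\Csv;

// With an explicit column mapping, the mapping defines the output array keys:
$step = Csv::parseString(['id', 'firstname', 'surname']);

// Without a mapping, the values from the first (header) line become the keys:
$step = Csv::parseString();
```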

v0.3.0 (Pre-release)

26 Apr 23:10

Added

  • By calling monitorMemoryUsage() you can tell the Crawler to add a log message with the current memory usage after every step invocation. You can also set a limit in bytes; memory usage is only logged once it exceeds that limit.
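
A short sketch; the optional byte-limit argument is an assumption based on the description above:

```php
$crawler = new MyCrawler();

// Log memory usage after every step invocation, but only once usage
// exceeds about 256 MB.
$crawler->monitorMemoryUsage(256 * 1024 * 1024);
```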

Fixed

  • Previously the use of Generators didn't actually make a lot of sense, because the outputs of one step were only iterated and passed on to the next step after the current step had been invoked with all its inputs. That made steps with a lot of inputs a bottleneck and caused bigger memory consumption. The crawler now immediately passes on each output of one step to the next step, if there is one.
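
This is not the library's actual code, but a sketch of the difference in how step outputs flow:

```php
// Before: step A was invoked with all its inputs, buffering every output,
// before step B saw anything.
function runBuffered(iterable $inputs, callable $stepA, callable $stepB): \Generator
{
    $buffer = [];

    foreach ($inputs as $input) {
        foreach ($stepA($input) as $output) {
            $buffer[] = $output; // the whole intermediate set is held in memory
        }
    }

    foreach ($buffer as $output) {
        yield from $stepB($output);
    }
}

// After: each output of step A is handed to step B immediately.
function runStreaming(iterable $inputs, callable $stepA, callable $stepB): \Generator
{
    foreach ($inputs as $input) {
        foreach ($stepA($input) as $output) {
            yield from $stepB($output); // no buffering between steps
        }
    }
}
```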

v0.2.0 (Pre-release)

25 Apr 10:29

Added

  • uniqueOutputs() method on steps to get only unique output values. If outputs are arrays or objects, you can provide a key that is used as the identifier to check for uniqueness. Otherwise, the arrays or objects are serialized for comparison, which will probably be slower. See the sketch after this list.
  • runAndTraverse() method on the Crawler, so you don't need to manually traverse the Generator if you don't need the results where you're calling the crawler.
  • Implemented the previously missing behavior for when a Group step should add something to the Result using setResultKey() or addKeysToResult(). For groups this only works when using combineToSingleOutput() (also shown in the sketch below).
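
A sketch of these additions; the CSS selectors are made up, and the group construction via Crawler::group() is an assumption based on the library's docs:

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->addStep(Http::get())
    // uniqueOutputs(): yield each found link only once.
    ->addStep(Html::getLinks()->uniqueOutputs())
    ->addStep(Http::get())
    ->addStep(
        // A group whose steps' outputs are combined into a single output
        // that is added to the Result under the key 'links'.
        Crawler::group()
            ->addStep(Html::getLink('a.next'))
            ->addStep(Html::getLinks('a.article'))
            ->combineToSingleOutput()
            ->setResultKey('links')
    );

// Run without manually traversing the result Generator; stores still
// receive every result.
$crawler->runAndTraverse();
```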

v0.1.0 (Pre-release)

18 Apr 11:58

Initial version containing:

  • The Crawler class: the main unit that executes all the steps you add to it, handling the steps' inputs and outputs.
  • The HttpCrawler class, which uses the PoliteHttpLoader (a version of HttpLoader that sticks to robots.txt rules), works with any PSR-18 HTTP client under the hood and has its own cookie jar implementation.
  • Some ready-to-use steps for HTTP, HTML, XML, JSON and CSV.
  • Loops and Groups.
  • The Crawler takes a PSR-3 LoggerInterface and passes it on to all steps. The included steps log some messages about what they're doing, and the package includes a simple CliLogger.
  • The Crawler requires a user agent, and the included BotUserAgent class provides an easy interface for bot user-agent strings.
  • Stores to save the final results can be added to the Crawler; a simple CSV file store ships with the package. A full usage sketch follows this list.
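
Putting these pieces together, a minimal usage sketch; class and namespace names follow the library's documentation and may have differed slightly in this early version:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    // The required user agent, built with the included BotUserAgent class.
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

$crawler = new MyCrawler();

$crawler->input('https://www.example.com');

$crawler->addStep(Http::get())      // loaded via the robots.txt-aware PoliteHttpLoader
    ->addStep(Html::getLinks());    // one of the ready-to-use HTML steps

foreach ($crawler->run() as $result) {
    // work with each Result here, or add a Store and let it persist them
}
```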