Releases: crwlrsoft/crawler
v0.4.1
v0.4.0
Added
- The `BaseStep` class now has `where()` and `orWhere()` methods to filter step outputs. You can set multiple filters that will be applied to all outputs. A filter added via `orWhere()` is linked to the previously added filter with "OR". Outputs matching none of the filters are not yielded. The available filters can be accessed through static methods on the new `Filter` class. Currently available filters are comparison filters (equal, greater/less than, ...), a few string filters (contains, starts/ends with) and URL filters (scheme, domain, host, ...).
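As a rough sketch of how such filters could be combined — the exact `Filter` factory method names used here (`urlHost()`, `stringContains()`) are assumptions based on the filter categories listed above and may differ from the actual API:

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Filters\Filter;

// Keep only link outputs on a certain host, OR links whose URL
// contains a certain path segment. orWhere() links its filter to
// the previous one with "OR"; non-matching outputs aren't yielded.
$step = Html::getLinks()
    ->where(Filter::urlHost('www.example.com'))
    ->orWhere(Filter::stringContains('/blog/'));
```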
- The `GetLink` and `GetLinks` steps now have the methods `onSameDomain()`, `notOnSameDomain()`, `onDomain()`, `onSameHost()`, `notOnSameHost()` and `onHost()` to restrict which links to find.
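For example, restricting link discovery could look like this (a sketch assuming the `Html::getLinks()` step factory; argument shapes are not confirmed by these notes):

```php
use Crwlr\Crawler\Steps\Html;

// Find only links pointing to the same domain as the page
// they were found on.
$sameDomainLinks = Html::getLinks()->onSameDomain();

// Or restrict found links to one explicitly given host.
$hostLinks = Html::getLinks()->onHost('blog.example.com');
```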
- The crawler's logger is now automatically added to the `Store`, so you can also log messages from there. This can be a breaking change, as the `StoreInterface` now also requires an `addLogger()` method. The new abstract `Store` class already implements it, so you can just extend it.
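A minimal sketch of a custom store extending the new abstract `Store` class — the `Result` type, its `toArray()` method, and the namespaces are assumptions here; only `store()` needs to be implemented because `addLogger()` is inherited:

```php
use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class EchoStore extends Store
{
    public function store(Result $result): void
    {
        // $this->logger is available because the crawler
        // added its logger via the inherited addLogger().
        $this->logger?->info('Storing a result');

        echo json_encode($result->toArray()) . PHP_EOL;
    }
}
```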
Changed
- The `Csv` step can now also be used without defining a column mapping. In that case it uses the values from the first line (so this makes sense when there are column headlines) as output array keys.
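Sketched usage of the two modes — the factory method name `Csv::parseString()` is an assumption and may differ in the package:

```php
use Crwlr\Crawler\Steps\Csv;

// With an explicit column mapping, the mapping defines
// the output array keys.
$mapped = Csv::parseString(['firstColumn', 'secondColumn']);

// Without a mapping, the values from the first CSV line
// (the headline row) are used as output array keys.
$fromHeadlines = Csv::parseString();
```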
v0.3.0
Added
- By calling `monitorMemoryUsage()` you can tell the crawler to add log messages with the current memory usage after every step invocation. You can also set a limit in bytes at which monitoring starts; below that limit, memory usage isn't logged.
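For instance (a sketch; `MyCrawler` is a hypothetical `Crawler` subclass, and passing the byte limit as an argument to `monitorMemoryUsage()` is an assumption based on the description above):

```php
$crawler = new MyCrawler();

// Log memory usage after every step invocation...
$crawler->monitorMemoryUsage();

// ...or only once usage exceeds roughly 128 MB (limit in bytes).
$crawler->monitorMemoryUsage(128 * 1024 * 1024);
```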
Fixed
- Previously, the use of Generators didn't actually make a lot of sense, because the outputs of one step were only iterated and passed on to the next step after the current step had been invoked with all of its inputs. That makes steps with a lot of inputs bottlenecks and increases memory consumption. So the crawler now immediately passes on each output of one step to the next step, if there is one.
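The effect of this change can be illustrated with plain PHP generators (this is not the crawler's API, just a demonstration of the streaming behavior): each output can flow to the next stage as soon as it is produced, instead of being buffered until the previous stage has consumed all of its inputs.

```php
// Stage one: "load" pages one at a time.
function loadPages(array $urls): \Generator
{
    foreach ($urls as $url) {
        yield "response for {$url}"; // imagine an HTTP response here
    }
}

// Stage two: process each response as soon as it arrives.
function extractTitles(iterable $responses): \Generator
{
    foreach ($responses as $response) {
        yield "title from {$response}";
    }
}

// Each title is produced right after its page is "loaded";
// nothing accumulates in memory between the stages.
foreach (extractTitles(loadPages(['/a', '/b'])) as $title) {
    echo $title . PHP_EOL;
}
```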
v0.2.0
[0.2.0] - 2022-04-25
Added
- `uniqueOutputs()` method on steps to get only unique output values. If the outputs are arrays or objects, you can provide a key that is used as the identifier to check for uniqueness. Otherwise, the arrays or objects are serialized for comparison, which will probably be slower.
- `runAndTraverse()` method on the `Crawler`, so you don't need to manually traverse the Generator if you don't need the results where you're calling the crawler.
- Implemented the behavior for when a `Group` step should add something to the result using `setResultKey()` or `addKeysToResult()`, which was still missing. For groups this only works when using `combineToSingleOutput()`.
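A brief sketch of the two new methods in use — `MyCrawler` is a hypothetical `Crawler` subclass, and the chaining shown is an assumption:

```php
use Crwlr\Crawler\Steps\Html;

$crawler = new MyCrawler();

// Drop duplicate link outputs from this step.
$crawler->addStep(Html::getLinks()->uniqueOutputs());

// Run the crawler without manually iterating the result Generator,
// e.g. when results only go to a Store anyway.
$crawler->runAndTraverse();
```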
v0.1.0
Initial version containing:
- `Crawler` class, the main unit that executes all the steps you add to it, handling the steps' input and output.
- `HttpCrawler` class using the `PoliteHttpLoader` (a version of `HttpLoader` that sticks to `robots.txt` rules), working with any PSR-18 HTTP client under the hood and shipping its own cookie jar implementation.
- Some ready-to-use steps for HTTP, HTML, XML, JSON and CSV.
- Loops and groups.
- The crawler takes a PSR-3 `LoggerInterface` and passes it on to all the steps. The included steps log some messages about what they're doing. The package includes a simple `CliLogger`.
- The crawler requires a user agent, and the included `BotUserAgent` class provides an easy interface for bot user agent strings.
- Stores to save the final results can be added to the crawler. A simple CSV file store is shipped with the package.
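Putting the initial pieces together might look roughly like this — class names, namespaces, and the step factories used are assumptions based on the descriptions above, not a confirmed API:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

// A crawler must define its (bot) user agent.
class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot');
    }
}

$crawler = new MyCrawler();

// Load a page via the PSR-18 client, then extract its links.
$crawler->addStep(Http::get());
$crawler->addStep(Html::getLinks());
```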