A spider crawler extension for Pimcore Dynamic Search.
Caution
This connector has reached its end of life and only receives compatibility updates. It will not be developed further. Use the Trinity Data Provider instead!
Release | Supported Pimcore Versions | Supported Symfony Versions | Release Date | Maintained | Branch |
---|---|---|---|---|---|
3.x | 11.0 | ^6.4 | 28.09.2023 | Feature Branch | master |
2.x | 10.0 - 10.6 | ^5.4 | 19.12.2021 | No | 2.x |
1.x | 6.6 - 6.9 | ^4.4 | 18.04.2021 | No | 1.x |
"require" : {
    "dachcom-digital/dynamic-search" : "~3.0.0",
    "dachcom-digital/dynamic-search-data-provider-crawler" : "~3.0.0"
}
You need to install and enable the Dynamic Search Bundle first. Read more about it here. After that, proceed as follows:
Add the bundle to bundles.php:
<?php
return [
    \DsWebCrawlerBundle\DsWebCrawlerBundle::class => ['all' => true],
];
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        own_host_only: true
                    full_dispatch:
                        seed: 'http://your-domain.test'
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        user_invalid_links:
                            - '@^http://your-domain.test\/members.*@i'
                    single_dispatch:
                        host: 'http://your-domain.test'
            normalizer:
                service: 'web_crawler_localized_resource_normalizer'
Options (always):
Name | Default Value | Description |
---|---|---|
own_host_only | false | |
allow_subdomains | false | |
allow_query_in_url | false | |
allow_hash_in_url | false | |
allowed_mime_types | ['text/html', 'application/pdf'] | |
allowed_schemes | ['http'] | |
content_max_size | 0 | |
Options (full_dispatch):
Name | Default Value | Description |
---|---|---|
seed | null | |
valid_links | [] | |
user_invalid_links | [] | |
max_link_depth | 15 | |
max_crawl_limit | 0 | |
Options (single_dispatch):
Name | Default Value | Description |
---|---|---|
host | null | |
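As a reference, the three option groups can be written out in one place. The following is a minimal sketch that repeats the nesting of the configuration above and fills in the default values from the tables. The context name `default` is carried over from that example, and the inline comments are a reading of the option names, not documented behaviour.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    # applied to every dispatch type
                    always:
                        own_host_only: false
                        allow_subdomains: false
                        allow_query_in_url: false
                        allow_hash_in_url: false
                        allowed_mime_types: ['text/html', 'application/pdf']
                        allowed_schemes: ['http']
                        content_max_size: 0
                    # applied when the whole site is crawled
                    full_dispatch:
                        seed: ~                  # default null; the example above uses the site URL
                        valid_links: []
                        user_invalid_links: []
                        max_link_depth: 15
                        max_crawl_limit: 0
                    # applied when a single resource is processed
                    single_dispatch:
                        host: ~                  # default null; the example above uses the site URL
```

Since `allowed_schemes` defaults to `['http']`, a site served over HTTPS presumably needs `'https'` added to that list; verify this against your own setup.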
Identifier: web_crawler_default_resource_normalizer
Normalize simple documents
Options: none
Identifier: web_crawler_localized_resource_normalizer
Scaffold localized documents
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
locales | all pimcore enabled languages | array | |
skip_not_localized_documents | true | bool | if false, an exception is thrown if a document/object has no valid locale |
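The two options could be set as in the sketch below. It assumes that normalizer options live in an `options` node next to `service`, mirroring how `data_provider` is configured; this placement is an assumption and is not confirmed by this README. The locale codes `en` and `de` are placeholder values.

```yaml
dynamic_search:
    context:
        default:
            normalizer:
                service: 'web_crawler_localized_resource_normalizer'
                # assumption: normalizer options are nested like data_provider options
                options:
                    locales: ['en', 'de']                 # placeholder locales
                    skip_not_localized_documents: true    # documented default
```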
Identifier: http_response_html_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type text/html.
Identifier: http_response_pdf_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type application/pdf.
Identifier: pimcore_element_scaffolder
Simple object scaffolder.
Supported types: Asset, Document, DataObject\Concrete.
Identifier: resource_uri_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
Identifier: resource_language_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
Identifier: resource_meta_extractor
Supported Scaffolder: http_response_html_scaffolder
Return Type: string|null
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
name | null | string | The name of the meta tag to fetch the value from |
Identifier: resource_html_tag_content_extractor
Supported Scaffolder: http_response_html_scaffolder
Return Type: string|null
Options: none
Identifier: resource_text_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
content_start_indicator | <!-- main-content --> | string | Marks the beginning of the indexable page content |
content_end_indicator | <!-- /main-content --> | string | Marks the end of the indexable page content |
content_exclude_start_indicator | null | null|string | Marks the beginning of the text to be excluded from indexing |
content_exclude_end_indicator | null | null|string | Marks the end of the text to be excluded from indexing |
Identifier: resource_title_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
Copyright: DACHCOM.DIGITAL
For licensing details please visit LICENSE.md
Before updating, please check our upgrade notes!