Fix getting non HTML documents via browser
In production, getting the raw response body in the listener for
Network.responseReceived sometimes failed, and waiting a little helped.
It seems the body is sometimes just not ready yet at that point, so move
getting the raw response body to a later point, and do it only when the
responseIsHtmlDocument() method returns false.
otsch committed Jan 10, 2025
1 parent 38084ae commit 42e0e70
Showing 2 changed files with 43 additions and 14 deletions.
14 changes: 9 additions & 5 deletions CHANGELOG.md
@@ -6,24 +6,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

-### [3.1.2] - 2025-01-08
+## [3.1.3] - 2025-01-10
+### Fixed
+* Further improve getting the raw response body from non-HTML documents via Chrome browser.
+
+## [3.1.2] - 2025-01-08
 ### Fixed
 * When loading a non-HTML document (e.g., XML) via the Chrome browser, the library now retrieves the original source. Previously, it returned the outerHTML of the rendered document, which wrapped the content in an HTML structure.

-### [3.1.1] - 2025-01-07
+## [3.1.1] - 2025-01-07
 ### Fixed
 * When the `validateAndSanitize()` method of a step throws an `InvalidArgumentException`, the exception is now caught, logged and the step is not invoked with the invalid input. This improves fault tolerance. Feeding a step with one invalid input shouldn't cause the whole crawler run to fail. Exceptions other than `InvalidArgumentException` remain uncaught.

-### [3.1.0] - 2025-01-03
+## [3.1.0] - 2025-01-03
 ### Added
 * New method `HeadlessBrowserLoaderHelper::setPageInitScript()` (`$crawler->getLoader()->browser()->setPageInitScript()`) to provide javascript code that is executed on every new browser page before navigating anywhere.
 * New method `HeadlessBrowserLoaderHelper::useNativeUserAgent()` (`$crawler->getLoader()->browser()->useNativeUserAgent()`) to allow using the native `User-Agent` that your Chrome browser sends by default.

-### [3.0.4] - 2024-12-18
+## [3.0.4] - 2024-12-18
 ### Fixed
 * Minor improvement for the `DomQuery` (base for `Dom::cssSelector()` and `Dom::xPath()`): enable providing an empty string as selector, to simply get the node that the selector is applied to.

-### [3.0.3] - 2024-12-11
+## [3.0.3] - 2024-12-11
 ### Fixed
 * Improved fix for non UTF-8 characters in HTML documents declared as UTF-8.

43 changes: 34 additions & 9 deletions src/Loader/Http/HeadlessBrowserLoaderHelper.php
@@ -91,26 +91,22 @@ public function navigateToPageAndGetRespondedRequest(
         ?string $proxy = null,
         ?CookieJar $cookieJar = null,
     ): RespondedRequest {
-        $browser = $this->getBrowser($request, $proxy);
-
-        $this->page = $browser->createPage();
+        $this->page = $this->getBrowser($request, $proxy)->createPage();

         $statusCode = 200;

         $responseHeaders = [];

-        $responseBody = '';
+        $requestId = null;

         $this->page->getSession()->once(
             "method:Network.responseReceived",
-            function ($params) use (&$statusCode, &$responseHeaders, &$responseBody) {
+            function ($params) use (&$statusCode, &$responseHeaders, &$requestId) {
                 $statusCode = $params['response']['status'];

                 $responseHeaders = $this->sanitizeResponseHeaders($params['response']['headers']);

-                $responseBody = $this->page?->getSession()->sendMessageSync(new Message('Network.getResponseBody', [
-                    'requestId' => $params['requestId'],
-                ]))->getData()['result']['body'] ?? '';
+                $requestId = $params['requestId'] ?? null;
             },
         );

@@ -122,7 +118,11 @@ function ($params) use (&$statusCode, &$responseHeaders, &$responseBody) {

         $this->callPostNavigateHooks();

-        $html = $this->responseIsHtmlDocument($this->page) ? $this->page?->getHtml() : $responseBody;
+        if (is_string($requestId) && $this->page && !$this->responseIsHtmlDocument($this->page)) {
+            $html = $this->tryToGetRawResponseBody($this->page, $requestId) ?? $this->page->getHtml();
+        } else {
+            $html = $this->page?->getHtml();
+        }

         $this->addCookiesToJar($cookieJar, $request->getUri());

@@ -387,4 +387,29 @@ protected function responseIsHtmlDocument(?Page $page = null): bool
             return true;
         }
     }
+
+    /**
+     * In production, retrieving the raw response body using the Network.getResponseBody message sometimes failed.
+     * Waiting briefly before sending the message appeared to resolve the issue.
+     * So, this method tries up to three times with a brief wait between each attempt.
+     */
+    protected function tryToGetRawResponseBody(Page $page, string $requestId): ?string
+    {
+        for ($i = 1; $i <= 3; $i++) {
+            try {
+                $message = $page->getSession()->sendMessageSync(new Message('Network.getResponseBody', [
+                    'requestId' => $requestId,
+                ]));
+
+                if ($message->isSuccessful() && $message->getData()['result']['body']) {
+                    return $message->getData()['result']['body'];
+                }
+            } catch (Throwable) {
+            }
+
+            usleep($i * 100000);
+        }
+
+        return null;
+    }
 }
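
The retry logic in `tryToGetRawResponseBody()` can be tried out in isolation with a small stand-alone sketch. The function name `retryRawBody()` and its `$fetch` and `$sleep` parameters are hypothetical and not part of the library; `$fetch` stands in for the `sendMessageSync()` call and is assumed to return the `result` array of a `Network.getResponseBody` reply. The sketch deviates from the commit in three places, which are suggestions rather than a description of what the commit does: it honours the `base64Encoded` flag that the DevTools protocol returns next to `body`, it compares the body against an empty string instead of relying on truthiness (so a body of `"0"` is not discarded), and it skips the wait after the final attempt.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the library: calls $fetch up to
 * $maxAttempts times and returns the first non-empty body, or null.
 *
 * $fetch is assumed to return the `result` array of a Network.getResponseBody
 * reply (keys `body` and `base64Encoded`) or to throw on failure.
 * $sleep receives the wait in microseconds; it is injectable for testing.
 */
function retryRawBody(callable $fetch, int $maxAttempts = 3, ?callable $sleep = null): ?string
{
    $sleep ??= static function (int $microseconds): void {
        usleep($microseconds);
    };

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $result = $fetch();
            $body = $result['body'] ?? '';

            if (is_string($body) && $body !== '') {
                if (!empty($result['base64Encoded'])) {
                    // Binary bodies arrive base64 encoded; decode strictly.
                    $decoded = base64_decode($body, true);

                    return $decoded === false ? null : $decoded;
                }

                return $body;
            }
        } catch (Throwable) {
            // Treat any failure as "body not ready yet" and try again.
        }

        if ($attempt < $maxAttempts) {
            // Linear back-off: 100 ms after the first attempt, 200 ms after the second.
            $sleep($attempt * 100000);
        }
    }

    return null;
}
```

Injecting `$sleep` keeps the back-off observable in a test without actually waiting; in the helper class itself the direct `usleep()` call is the simpler choice.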
