Fix getting non-HTML documents via browser

In production, getting the raw response body in the listener for Network.responseReceived sometimes failed, and waiting briefly helped. The body is apparently not always ready at that point, so retrieve it later, and only when responseIsHtmlDocument() returns false.
crwlrsoft · Jan 10, 2025 · 42e0e70
1 parent 38084ae

Showing 2 changed files with 43 additions and 14 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,24 +6,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
-### [3.1.2] - 2025-01-08
+## [3.1.3] - 2025-01-10
+### Fixed
+* Further improve getting the raw response body from non-HTML documents via Chrome browser.
+
+## [3.1.2] - 2025-01-08
 ### Fixed
 * When loading a non-HTML document (e.g., XML) via the Chrome browser, the library now retrieves the original source. Previously, it returned the outerHTML of the rendered document, which wrapped the content in an HTML structure.
 
-### [3.1.1] - 2025-01-07
+## [3.1.1] - 2025-01-07
 ### Fixed
 * When the `validateAndSanitize()` method of a step throws an `InvalidArgumentException`, the exception is now caught, logged and the step is not invoked with the invalid input. This improves fault tolerance. Feeding a step with one invalid input shouldn't cause the whole crawler run to fail. Exceptions other than `InvalidArgumentException` remain uncaught.
 
-### [3.1.0] - 2025-01-03
+## [3.1.0] - 2025-01-03
 ### Added
 * New method `HeadlessBrowserLoaderHelper::setPageInitScript()` (`$crawler->getLoader()->browser()->setPageInitScript()`) to provide javascript code that is executed on every new browser page before navigating anywhere.
 * New method `HeadlessBrowserLoaderHelper::useNativeUserAgent()` (`$crawler->getLoader()->browser()->useNativeUserAgent()`) to allow using the native `User-Agent` that your Chrome browser sends by default.
 
-### [3.0.4] - 2024-12-18
+## [3.0.4] - 2024-12-18
 ### Fixed
 * Minor improvement for the `DomQuery` (base for `Dom::cssSelector()` and `Dom::xPath()`): enable providing an empty string as selector, to simply get the node that the selector is applied to.
 
-### [3.0.3] - 2024-12-11
+## [3.0.3] - 2024-12-11
 ### Fixed
 * Improved fix for non UTF-8 characters in HTML documents declared as UTF-8.
 

diff --git a/src/Loader/Http/HeadlessBrowserLoaderHelper.php b/src/Loader/Http/HeadlessBrowserLoaderHelper.php
@@ -91,26 +91,22 @@ public function navigateToPageAndGetRespondedRequest(
         ?string $proxy = null,
         ?CookieJar $cookieJar = null,
     ): RespondedRequest {
-        $browser = $this->getBrowser($request, $proxy);
-
-        $this->page = $browser->createPage();
+        $this->page = $this->getBrowser($request, $proxy)->createPage();
 
         $statusCode = 200;
 
         $responseHeaders = [];
 
-        $responseBody = '';
+        $requestId = null;
 
         $this->page->getSession()->once(
             "method:Network.responseReceived",
-            function ($params) use (&$statusCode, &$responseHeaders, &$responseBody) {
+            function ($params) use (&$statusCode, &$responseHeaders, &$requestId) {
                 $statusCode = $params['response']['status'];
 
                 $responseHeaders = $this->sanitizeResponseHeaders($params['response']['headers']);
 
-                $responseBody = $this->page?->getSession()->sendMessageSync(new Message('Network.getResponseBody', [
-                    'requestId' => $params['requestId'],
-                ]))->getData()['result']['body'] ?? '';
+                $requestId = $params['requestId'] ?? null;
             },
         );
 
@@ -122,7 +118,11 @@ function ($params) use (&$statusCode, &$responseHeaders, &$responseBody) {
 
         $this->callPostNavigateHooks();
 
-        $html = $this->responseIsHtmlDocument($this->page) ? $this->page?->getHtml() : $responseBody;
+        if (is_string($requestId) && $this->page && !$this->responseIsHtmlDocument($this->page)) {
+            $html = $this->tryToGetRawResponseBody($this->page, $requestId) ?? $this->page->getHtml();
+        } else {
+            $html = $this->page?->getHtml();
+        }
 
         $this->addCookiesToJar($cookieJar, $request->getUri());
 
@@ -387,4 +387,29 @@ protected function responseIsHtmlDocument(?Page $page = null): bool
             return true;
         }
     }
+
+    /**
+     * In production, retrieving the raw response body using the Network.getResponseBody message sometimes failed.
+     * Waiting briefly before sending the message appeared to resolve the issue.
+     * So, this method tries up to three times with a brief wait between each attempt.
+     */
+    protected function tryToGetRawResponseBody(Page $page, string $requestId): ?string
+    {
+        for ($i = 1; $i <= 3; $i++) {
+            try {
+                $message = $page->getSession()->sendMessageSync(new Message('Network.getResponseBody', [
+                    'requestId' => $requestId,
+                ]));
+
+                if ($message->isSuccessful() && $message->getData()['result']['body']) {
+                    return $message->getData()['result']['body'];
+                }
+            } catch (Throwable) {
+            }
+
+            usleep($i * 100000);
+        }
+
+        return null;
+    }
 }
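
The retry logic in `tryToGetRawResponseBody()` can be illustrated in isolation. The following is a minimal sketch, not code from this commit: the function `retryUntilNonEmpty()` and its parameters are hypothetical names introduced for illustration, and a plain callback stands in for the `Network.getResponseBody` call. It assumes PHP 8.0 or later. Unlike the committed method, this variant skips the wait after the final attempt and treats only the empty string, rather than any falsy value, as "no body".

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): calls $fetch up to $maxAttempts
 * times and returns the first non-empty string it yields, or null if every
 * attempt fails or yields nothing.
 *
 * @param callable(): ?string $fetch
 */
function retryUntilNonEmpty(
    callable $fetch,
    int $maxAttempts = 3,
    int $baseDelayMicroseconds = 100000,
): ?string {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $result = $fetch();

            if (is_string($result) && $result !== '') {
                return $result;
            }
        } catch (Throwable) {
            // Treat an exception like an empty result and try again.
        }

        // Linearly growing wait (100 ms, 200 ms, ...), skipped after the last attempt.
        if ($attempt < $maxAttempts) {
            usleep($attempt * $baseDelayMicroseconds);
        }
    }

    return null;
}

// Usage: a stand-in callback that fails twice before a body becomes available.
$calls = 0;

$body = retryUntilNonEmpty(function () use (&$calls): ?string {
    $calls++;

    if ($calls < 3) {
        throw new RuntimeException('Body not available yet');
    }

    return '<feed></feed>';
});
```

When every attempt fails, the helper returns `null`, which leaves the caller free to fall back to another source, as the commit does with `$this->page->getHtml()`.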