Add har recorder #181

Merged
merged 7 commits into from
Dec 17, 2024
4 changes: 1 addition & 3 deletions .github/workflows/pr.yml
@@ -27,7 +27,5 @@ jobs:
run: |
shopt -s inherit_errexit
set -xeEo pipefail

sed -i 's,pagegraph:.*,pagegraph: "/opt/brave.com/brave-nightly/brave-browser-nightly",' test/config.js
npm install
npm run test
DEBUG=1 PAGEGRAPH_CRAWL_TEST_BINARY_PATH=/opt/brave.com/brave-nightly/brave-browser-nightly npm run test
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,4 +1,6 @@
node_modules/
.DS_Store
.swp
.vscode/
.vscode/
test/debug_output/
test/output/
6 changes: 3 additions & 3 deletions README.md
@@ -5,7 +5,7 @@ Command line tool for crawling web pages with PageGraph.

Install
---
For building the tool, you need to have `tsc` (TypeScript Compiler) package installed.
Requires a recent version of node (current testing is done on `v23.4.0`).

```bash
npm install
@@ -32,11 +32,11 @@ npm run crawl -- \
--debug debug
```

The `-t` specifies how many seconds to crawl the URL provided in `-u` using the PageGraph binary in `-b`.
The `-t` specifies how many seconds to crawl the URL provided in `-u` using the PageGraph binary in `-b`.

You can see all supported options:
```bash
npm run crawl -- -h
```

**NOTE:** PageGraph currently does not track puppeteer / automation scripts, and so modifying or interacting with the document through [devtools/puppeteer](https://pptr.dev/) while recording a PageGraph file will fail.
**NOTE:** PageGraph currently does not track puppeteer / automation scripts, and so modifying or interacting with the document through [devtools/puppeteer](https://pptr.dev/) while recording a PageGraph file will likely fail.
73 changes: 70 additions & 3 deletions built/brave/crawl.js
@@ -2,11 +2,12 @@ import * as osLib from 'os';
import Xvbf from 'xvfb';
import { isTopLevelPageNavigation, isTimeoutError } from './checks.js';
import { asHTTPUrl } from './checks.js';
import { createScreenshotPath, writeGraphML, deleteAtPath } from './files.js';
import { createScreenshotPath, writeGraphML, writeHAR, deleteAtPath } from './files.js';
import { getLogger } from './logging.js';
import { makeNavigationTracker } from './navigation_tracker.js';
import { selectRandomChildUrl } from './page.js';
import { puppeteerConfigForArgs, launchWithRetry } from './puppeteer.js';
import { harFromMessages } from 'chrome-har';
const xvfbPlatforms = new Set(['linux', 'openbsd']);
const setupEnv = (args) => {
const logger = getLogger(args);
@@ -53,6 +54,46 @@ const waitUntilUnless = (secs, unlessFunc, intervalMs = 500) => {
}, intervalMs);
});
};
const prepareHARGenerator = async (client, networkEvents, pageEvents, storeHarBody, responseBodies, logger) => {
await client.send('Page.enable');
await client.send('Network.enable');
const networkMethods = [
'Network.requestWillBeSent',
'Network.requestServedFromCache',
'Network.dataReceived',
'Network.responseReceived',
'Network.resourceChangedPriority',
'Network.loadingFinished',
'Network.loadingFailed',
];
const pageMethods = [
'Page.loadEventFired',
'Page.domContentEventFired',
'Page.frameStartedLoading',
'Page.frameAttached',
'Page.frameScheduledNavigation',
];
networkMethods.forEach((method) => {
client.on(method, (params) => {
networkEvents.push({ method, params });
if (storeHarBody && method == 'Network.loadingFinished') {
const responseParams = params;
const requestId = responseParams.requestId;
client.send('Network.getResponseBody', { requestId: requestId })
.then((responseBody) => {
responseBodies.set(requestId.toString(), responseBody);
}, (reason) => {
logger.error('LoadingFinishedError: ' + reason);
});
}
});
});
pageMethods.forEach((method) => {
client.on(method, (params) => {
pageEvents.push({ method, params });
});
});
};
const generatePageGraph = async (seconds, page, client, waitFunc,
// eslint-disable-next-line max-len
logger) => {
@@ -75,7 +116,7 @@ export const doCrawl = async (args, previouslySeenUrls) => {
const depth = Math.max(args.recursiveDepth, 1);
let randomChildUrl;
let shouldRedirectToUrl;
const puppeteerConfig = puppeteerConfigForArgs(args);
const puppeteerConfig = await puppeteerConfigForArgs(args);
const { launchOptions } = puppeteerConfig;
const envHandle = setupEnv(args);
let shouldStopWaitingFlag = false;
@@ -101,6 +142,12 @@ export const doCrawl = async (args, previouslySeenUrls) => {
// and wait for idle time.
const page = await browser.newPage();
const client = await page.target().createCDPSession();
const networkEvents = [];
const pageEvents = [];
const responseBodies = new Map();
if (args.storeHar) {
await prepareHARGenerator(client, networkEvents, pageEvents, args.storeHarBody, responseBodies, logger);
}
client.on('Target.targetCrashed', (event) => {
const logMsg = {
targetId: event.targetId,
@@ -147,7 +194,7 @@ export const doCrawl = async (args, previouslySeenUrls) => {
}
// Otherwise, we're in a redirect loop, so stop recording
// the pagegraph, but continue.
logger.error('Quitting bc we\'re in a redirect loop');
logger.info('Quitting bc we\'re in a redirect loop');
shouldStopWaitingFlag = true;
const client = await page.createCDPSession();
await client.send('Page.stopLoading');
@@ -169,6 +216,26 @@ export const doCrawl = async (args, previouslySeenUrls) => {
logger.info('Loaded ', String(urlToCrawl));
const response = await generatePageGraph(args.seconds, page, client, shouldStopWaitingFunc, logger);
await writeGraphML(args, urlToCrawl, response, logger);
// Store HAR
if (args.storeHar) {
// ensure that all bodies are loaded
await Promise.all(responseBodies);
// merge responses and bodies
networkEvents.forEach((event) => {
if (args.storeHarBody && event.method == 'Network.responseReceived') {
const requestId = event.params.requestId;
const responseBody = responseBodies.get(requestId.toString());
const responseParams = event.params;
responseParams.response.body = Buffer.from(responseBody.body, responseBody.base64Encoded ? 'base64' : undefined).toString();
}
});
const allEvents = pageEvents
.concat(networkEvents);
const har = harFromMessages(allEvents, {
includeTextFromResponseBody: args.storeHarBody,
});
await writeHAR(args, urlToCrawl, har, logger);
}
if (depth > 1) {
randomChildUrl = await selectRandomChildUrl(page, logger);
}
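The body-merge step above can be sketched in isolation: buffered CDP events are plain `{ method, params }` pairs, and each stored body is decoded and attached to its matching `Network.responseReceived` event before the events are handed to `chrome-har`. The event data below is hypothetical, and `mergeBodies` is our name for the inline loop; `harFromMessages` itself is not invoked here.

```javascript
// Sketch of the body-merge step from doCrawl, using hypothetical event data.
// Bodies fetched via Network.getResponseBody are keyed by requestId.
const responseBodies = new Map([
  ['1000.1', { body: 'aGVsbG8=', base64Encoded: true }],
]);

const networkEvents = [
  { method: 'Network.requestWillBeSent', params: { requestId: '1000.1' } },
  { method: 'Network.responseReceived', params: { requestId: '1000.1', response: {} } },
];

// Decode each stored body and attach it to the matching responseReceived
// event, mirroring what the crawler does before calling harFromMessages.
const mergeBodies = (events, bodies) => {
  for (const event of events) {
    if (event.method !== 'Network.responseReceived') {
      continue;
    }
    const stored = bodies.get(String(event.params.requestId));
    if (stored === undefined) {
      continue;
    }
    event.params.response.body = Buffer.from(
      stored.body, stored.base64Encoded ? 'base64' : 'utf8').toString();
  }
  return events;
};

mergeBodies(networkEvents, responseBodies);
```

Keeping the merge as a pure pass over buffered events means no CDP traffic happens at HAR-build time; all network I/O finished during the crawl.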
23 changes: 20 additions & 3 deletions built/brave/files.js
@@ -1,6 +1,5 @@
import { writeFile } from 'node:fs/promises';
import { rm, writeFile } from 'node:fs/promises';
import { join, parse } from 'node:path';
import { remove } from 'fs-extra';
import { isDir } from './checks.js';
const createFilename = (url) => {
const fileSafeUrl = String(url).replace(/[^\w]/g, '_');
@@ -12,6 +11,11 @@ const createGraphMLPath = (args, url) => {
? join(args.outputPath, createFilename(url))
: args.outputPath;
};
const createHARPath = (args, url) => {
const outputPath = createGraphMLPath(args, url);
const pathParts = parse(outputPath);
return pathParts.dir + '/' + pathParts.name + '.har';
};
export const createScreenshotPath = (args, url) => {
const outputPath = createGraphMLPath(args, url);
const pathParts = parse(outputPath);
@@ -27,6 +31,19 @@ export const writeGraphML = async (args, url, response, logger) => {
logger.error('saving Page.generatePageGraph output: ', String(err));
}
};
export const writeHAR = async (args, url, har, logger) => {
try {
const outputFilename = createHARPath(args, url);
await writeFile(outputFilename, JSON.stringify(har, null, 4));
logger.info('Writing HAR file to: ', outputFilename);
}
catch (err) {
logger.error('saving HAR file: ', String(err));
}
};
export const deleteAtPath = async (path) => {
await remove(path);
await rm(path, {
recursive: true,
force: true,
});
};
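`createHARPath` derives the HAR filename by swapping the GraphML output path's extension via `node:path`'s `parse`. A minimal standalone sketch of the same derivation (the helper name `harPathFor` is ours):

```javascript
import { parse } from 'node:path';

// Derive a sibling .har filename from an output path, as createHARPath does:
// keep the directory and base name, replace the extension with .har.
const harPathFor = (outputPath) => {
  const pathParts = parse(outputPath);
  return pathParts.dir + '/' + pathParts.name + '.har';
};
```

For example, `harPathFor('/tmp/out/example_com.graphml')` yields `/tmp/out/example_com.har`, so the HAR always lands next to the GraphML file it accompanies.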
29 changes: 14 additions & 15 deletions built/brave/puppeteer.js
@@ -1,7 +1,9 @@
import { cp } from 'node:fs/promises';
import * as pathLib from 'path';
import fsExtraLib from 'fs-extra';
import tmpLib from 'tmp';
import puppeteerLib from 'puppeteer-core';
import { isDir } from './checks.js';
import { deleteAtPath } from './files.js';
import { getLogger } from './logging.js';
const disabledBraveFeatures = [
'Speedreader',
@@ -27,7 +29,7 @@ const disabledChromeFeatures = [
'IPH_SidePanelGenericMenuFeature',
];
const disabledFeatures = disabledBraveFeatures.concat(disabledChromeFeatures);
const profilePathForArgs = (args) => {
const profilePathForArgs = async (args) => {
const logger = getLogger(args);
// The easiest case is if we've been told to use an existing profile.
// In this case, just return the given path.
@@ -47,12 +49,18 @@ const profilePathForArgs = (args) => {
? args.persistProfilePath
: tmpLib.dirSync({ prefix: 'pagegraph-profile-' }).name;
const shouldClean = args.persistProfilePath === undefined;
fsExtraLib.copySync(templateProfile, destProfilePath);
if (isDir(destProfilePath)) {
logger.info(`Profile exists at ${String(destProfilePath)}, so deleting.`);
await deleteAtPath(destProfilePath);
}
await cp(templateProfile, destProfilePath, {
recursive: true,
});
logger.verbose(`Crawling with profile at ${String(destProfilePath)}.`);
return { profilePath: destProfilePath, shouldClean };
};
export const puppeteerConfigForArgs = (args) => {
const { profilePath, shouldClean } = profilePathForArgs(args);
const makePuppeteerConf = async (args) => {
const { profilePath, shouldClean } = await profilePathForArgs(args);
process.env.PAGEGRAPH_OUT_DIR = args.outputPath;
const chromeArgs = [
'--ash-no-nudges',
Expand Down Expand Up @@ -106,22 +114,13 @@ export const puppeteerConfigForArgs = (args) => {
shouldClean,
};
};
export const puppeteerConfigForArgs = makePuppeteerConf;
const asyncSleep = async (millis) => {
return await new Promise(resolve => setTimeout(resolve, millis));
};
const defaultComputeTimeout = (tryIndex) => {
return Math.pow(2, tryIndex - 1) * 1000;
};
// const makeLaunchPuppeteerFunc = (shouldStealth: boolean,
// logger: Logger): VanillaPuppeteer => {
// if (shouldStealth === true) {
// logger.info('Running with puppeteer-extra-plugin-stealth')
// const puppeteerExtra = new PuppeteerExtra(puppeteerLib, undefined)
// puppeteerExtra.use(stealthPluginLib())
// return puppeteerExtra
// }
// return puppeteerLib
// }
export const launchWithRetry = async (launchOptions, stealthMode, logger,
// eslint-disable-next-line max-len
retryOptions) => {
16 changes: 11 additions & 5 deletions built/brave/validate.js
@@ -1,7 +1,7 @@
import * as fsLib from 'fs';
import * as osLib from 'os';
import * as pathLib from 'path';
import * as hasBinLib from 'hasbin';
import which from 'which';
import { asHTTPUrl, isDir, isExecFile } from './checks.js';
import { getLoggerForLevel } from './logging.js';
const possibleBraveBinaryPaths = [
@@ -26,11 +26,13 @@ const guessBinary = () => {
'brave-browser-stable',
'brave-browser',
];
const firstBraveBinary = hasBinLib.first.sync(possibleBraveBinaryNames);
if (firstBraveBinary === false) {
return false;
for (const aBinaryName of possibleBraveBinaryNames) {
const binaryPath = which.sync(aBinaryName, { nothrow: true });
if (binaryPath) {
return binaryPath;
}
}
return firstBraveBinary;
return false;
};
export const validate = (rawArgs) => {
const logger = getLoggerForLevel(rawArgs.logging);
Expand Down Expand Up @@ -80,6 +82,8 @@ export const validate = (rawArgs) => {
const userAgent = rawArgs.user_agent;
const crawlDuplicates = rawArgs.crawl_duplicates;
const screenshot = rawArgs.screenshot;
const storeHar = rawArgs.store_har;
const storeHarBody = rawArgs.store_har_body;
const validatedArgs = {
executablePath: String(executablePath),
outputPath,
@@ -96,6 +100,8 @@ export const validate = (rawArgs) => {
userAgent,
crawlDuplicates,
screenshot,
storeHar,
storeHarBody,
};
if (rawArgs.proxy_server !== undefined) {
try {
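The reworked `guessBinary` walks a list of candidate names and returns the first one that `which.sync(name, { nothrow: true })` resolves on `PATH`, or `false` if none resolve. The same pattern can be sketched with the lookup injected, so it runs without the `which` package (the candidate names and stub paths below are illustrative):

```javascript
// Return the first candidate the lookup resolves, or false if none do —
// the same shape as guessBinary, with which.sync swapped for an injected lookup.
const firstResolvable = (candidates, lookup) => {
  for (const aBinaryName of candidates) {
    const binaryPath = lookup(aBinaryName);
    if (binaryPath) {
      return binaryPath;
    }
  }
  return false;
};

// Stub lookup standing in for which.sync(name, { nothrow: true }).
const fakePath = { 'brave-browser': '/usr/bin/brave-browser' };
const found = firstResolvable(
  ['brave-browser-nightly', 'brave-browser'],
  (name) => fakePath[name] ?? null);
```

The `nothrow` option is what makes the loop clean: a missing binary yields `null` rather than a thrown error, so each miss simply falls through to the next candidate.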
12 changes: 12 additions & 0 deletions built/run.js
@@ -93,6 +93,18 @@ parser.add_argument('--screenshot', {
dest: 'screenshot',
default: false,
});
parser.add_argument('--har', {
help: 'Generate a HAR file at the end of the crawl.',
action: 'store_true',
dest: 'store_har',
default: false,
});
parser.add_argument('--har-body', {
help: 'Store the response bodies in the HAR file. Only works in combination with --har.',
action: 'store_true',
dest: 'store_har_body',
default: false,
});
// parser.add_argument('--no-stealth', {
// help: 'Do not enable the "puppeteer-extra-plugin-stealth" extension.',
// default: false,
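The `--har-body` help text says it only works in combination with `--har`; the diff does not show that dependency being enforced at validation time. A guard of the kind `validate` could apply would be small (this helper is hypothetical, not part of the PR):

```javascript
// Hypothetical guard: --har-body is meaningless without --har, since no
// network events are recorded unless HAR capture is enabled.
const checkHarFlags = (args) => {
  if (args.store_har_body && !args.store_har) {
    return { ok: false, msg: '--har-body requires --har' };
  }
  return { ok: true };
};
```

Checking this up front would fail fast instead of silently producing a HAR-less crawl with an ignored flag.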
33 changes: 26 additions & 7 deletions changelog.md
@@ -1,17 +1,36 @@
Version 1.1.2
1.1.3
---

Add ability to record a HAR of the crawled page
(PR [#180](https://github.com/brave/pagegraph-crawl/pull/180)).

Removed the `fs-extra` and `chai` dependencies in favor of standard-library
approaches.

Specify and enforce a minimum node version, v20.0.0.

Cleaned up and reworked the test runner, including removing the hardcoded
config file for the tests.

Added [standardjs](https://standardjs.com/) linting for test code.

Fixed a race in tests that caused random-seeming failures.


1.1.2
---

Update eslint to 8.11.0, which resolves a non-useful warning when linting.

Minor version bumps in other depenencies.
Minor version bumps in other dependencies.


Version 1.1.1
1.1.1
---

Minor version bumps in depenencies.
Minor version bumps in dependencies.

Version 1.1.0
1.1.0
---

Also pass `--disable-first-run-ui`, to suppress some additional, unneeded and
@@ -20,12 +39,12 @@ for the same reason.

Remove some no longer needed dependencies.

Version 1.0.2
1.0.2
---

Fix issue with some landing pages not loading.

Version 1.0.1
1.0.1
---

Add this `changelog.md` file, and start tagging releases.
3 changes: 3 additions & 0 deletions eslint.config.js
@@ -18,6 +18,9 @@ export default tseslint.config(
'@typescript-eslint/no-empty-function': 'off',
'@typescript-eslint/no-unused-vars': ['error', {
"caughtErrorsIgnorePattern": "ignore"
}],
'camelcase': ['error', {
'properties': 'never'
}]
}
},
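With `properties: 'never'`, ESLint's `camelcase` rule skips property names entirely, so the snake_case keys that argparse produces from `dest` values (`store_har`, `store_har_body`) pass lint while ordinary variables must stay camelCased. A small illustration of what the configured rule permits:

```javascript
// Allowed under camelcase with properties: 'never': snake_case property
// names (matching the argparse dest values) read into camelCase variables.
const rawArgs = {
  store_har: true,
  store_har_body: false,
};
const storeHar = rawArgs.store_har;
const storeHarBody = rawArgs.store_har_body;
```

This keeps the CLI-facing snake_case names and the internal camelCase names both lint-clean without per-line disables.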