VK video not archived #272

Closed
edsu opened this issue Jan 25, 2024 · 5 comments

Comments

@edsu
edsu commented Jan 25, 2024

I happened to notice that the video in https://vk.com/wall-1113595_548588 is not archived?

$ scoop -o vk-video.wacz https://vk.com/wall-1113595_548588
[09:44:06] INFO Capture-specific temporary folder /Users/edsummers/.nodenv/versions/18.11.0/lib/node_modules/@harvard-lil/scoop/tmp/bDBxDm/ created.
[09:44:06] INFO TCP-Proxy-Server started {"address":"::1","family":"IPv6","port":9000}
[09:44:06] INFO User Agent used for capture: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.57 Safari/537.36
[09:44:07] INFO Scoop 0.6.5 was initialized with the following options:
[09:44:07] INFO {
  logLevel: 'info',
  screenshot: true,
  pdfSnapshot: false,
  domSnapshot: false,
  captureVideoAsAttachment: true,
  captureCertificatesAsAttachment: true,
  provenanceSummary: true,
  attachmentsBypassLimits: true,
  captureTimeout: 60000,
  loadTimeout: 20000,
  networkIdleTimeout: 20000,
  behaviorsTimeout: 20000,
  captureVideoAsAttachmentTimeout: 30000,
  captureCertificatesAsAttachmentTimeout: 10000,
  captureWindowX: 1600,
  captureWindowY: 900,
  maxCaptureSize: 209715200,
  autoScroll: true,
  autoPlayMedia: true,
  grabSecondaryResources: true,
  runSiteSpecificBehaviors: true,
  headless: true,
  userAgentSuffix: '',
  blocklist: [
    '/https?://localhost/', '0.0.0.0/8',
    '10.0.0.0/8',           '100.64.0.0/10',
    '127.0.0.0/8',          '169.254.0.0/16',
    '172.16.0.0/12',        '192.0.0.0/29',
    '192.0.2.0/24',         '192.88.99.0/24',
    '192.168.0.0/16',       '198.18.0.0/15',
    '198.51.100.0/24',      '203.0.113.0/24',
    '224.0.0.0/4',          '240.0.0.0/4',
    '255.255.255.255/32',   '::/128',
    '::1/128',              '::ffff:0:0/96',
    '100::/64',             '64:ff9b::/96',
    '2001::/32',            '2001:10::/28',
    '2001:db8::/32',        '2002::/16',
    'fc00::/7',             'fe80::/10',
    'ff00::/8'
  ],
  intercepter: 'ScoopProxy',
  proxyHost: 'localhost',
  proxyPort: 9000,
  proxyVerbose: false,
  publicIpResolverEndpoint: 'https://icanhazip.com',
  ytDlpPath: '/Users/edsummers/.nodenv/versions/18.11.0/lib/node_modules/@harvard-lil/scoop/executables/yt-dlp',
  cripPath: '/Users/edsummers/.nodenv/versions/18.11.0/lib/node_modules/@harvard-lil/scoop/executables/crip'
}
[09:44:07] INFO 🍨 Starting capture of https://vk.com/wall-1113595_548588.
[09:44:07] INFO STEP [1/10]: Out-of-browser detection and capture of non-web resource
(node:11512) ExperimentalWarning: The Fetch API is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
[09:44:07] INFO Requested URL is assumed to be a web page (no content-type found)
[09:44:07] INFO STEP [2/10]: Wait for initial page load
[09:44:31] WARN STEP [2/10]: Wait for initial page load - failed
[09:44:31] INFO STEP [3/10]: Capture page info
[09:44:55] WARN Could not fetch favicon at url https://vk.com/images/icons/favicons/fav_logo.ico?7.
[09:44:55] INFO STEP [4/10]: Browser scripts
[09:45:09] INFO captureTimeout of 60000ms reached. Ending further capture.
[09:45:14] WARN STEP [5/10]: Wait for network idle (skipped)
[09:45:14] INFO STEP [6/10]: Scroll-up
[09:45:18] INFO STEP [7/10]: Screenshot
[09:45:29] WARN STEP [7/10]: Screenshot - ended due to max time or size reached.
[09:45:29] INFO STEP [8/10]: Out-of-browser capture of video as attachment (if any)
[09:46:11] WARN STEP [8/10]: Out-of-browser capture of video as attachment (if any) - ended due to max time or size reached.
[09:46:11] INFO STEP [9/10]: Capturing certificates info
[09:46:17] INFO STEP [10/10]: Provenance summary
[09:46:17] INFO Closing browser and intercepter
[09:46:17] INFO TCP-Proxy-Server closed
[09:46:17] INFO Clearing capture-specific temporary folder /Users/edsummers/.nodenv/versions/18.11.0/lib/node_modules/@harvard-lil/scoop/tmp/bDBxDm/
[09:46:17] INFO Exporting capture to WACZ
[09:46:20] INFO 1 WARC(s) to process
[09:46:20] INFO Initializing output stream at: /Users/edsummers/.nodenv/versions/18.11.0/lib/node_modules/@harvard-lil/scoop/tmp/Q5LzYR/data.wacz
[09:46:20] INFO Initializing indexer
[09:46:20] INFO Indexing WARCS
[09:46:21] INFO Harvesting sorted indexes from trees
[09:46:21] INFO Writing CDX to WACZ
[09:46:21] INFO Writing pages.jsonl to WACZ
[09:46:21] INFO Writing WARCs to WACZ
[09:46:21] INFO Writing datapackage.json to WACZ
[09:46:21] INFO Writing datapackage-digest.json to WACZ
[09:46:21] INFO Finalizing WACZ
[09:46:21] INFO WACZ was finalized
[09:46:22] INFO vk-video.wacz saved to disk.

I loaded the vk-video.wacz that was created into ReplayWeb.Page

Screenshot 2024-01-25 at 9 46 35 AM
Screenshot 2024-01-25 at 9 46 44 AM

And then clicked on the video:

Screenshot 2024-01-25 at 9 46 47 AM

It doesn't seem to work with browsertrix-crawler either, so maybe this is an issue with the video behavior? Or maybe the problems are unrelated. I thought it was worth reporting though...

@matteocargnelutti
Collaborator

Thanks for the detailed report @edsu 😄

I haven't been able to reproduce that issue. The video should be under the "Extracted video data" summary page, which is not present in your screenshots.

I see in the logs you shared that the capture timeout was hit during the browser scripts phase, which doesn't match what I have observed during my tests. Would you mind trying again and letting us know if you see variability here? That might help us deal with edge cases such as this one more efficiently in the future.
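
The timestamps in the first log show where the 60-second capture budget went: roughly 48 seconds had elapsed (09:44:07 to 09:44:55) before the browser scripts step started. As a rough sketch, the helper below turns a saved log into per-step durations. The function name `step_durations` and the file name `capture.log` are made up for illustration, and the script assumes the log format stays exactly as pasted above (`[HH:MM:SS] INFO STEP [n/10]: ...`) and that a capture does not cross midnight.

```shell
#!/bin/sh
# Hypothetical helper (not part of Scoop): reads a Scoop log on stdin and
# prints how many seconds passed between consecutive "INFO STEP" lines.
# The final step is not printed because the log has no following STEP line.
step_durations() {
  awk '
    /^\[[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\] INFO STEP \[/ {
      # $1 looks like "[09:44:07]"; drop the brackets and split on ":"
      split(substr($1, 2, 8), t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (label != "") printf "%4ds  %s\n", now - start, label
      start = now
      label = $0
      sub(/^.*INFO STEP /, "", label)   # keep only "[n/10]: step name"
    }
  '
}

# Possible usage (capture.log is an assumed file name):
#   scoop -o vk-video.wacz https://vk.com/wall-1113595_548588 2>&1 | tee capture.log
#   step_durations < capture.log
```

Comparing the output of two runs is one way to see whether the slow steps vary between captures.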

Alternatively, for that capture specifically, you might want to:

  • Run Scoop with some combination of the following to see which behavior is problematic: --auto-scroll=false --auto-play-media=false --grab-secondary-resources=false --run-site-specific-behaviors=false.
  • Run Scoop with a longer browser behaviors timeout using --behaviors-timeout=100000
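
One way to work through the first suggestion is to disable a single behavior per run and compare the results. The sketch below only prints the commands (a dry run) and does not call Scoop itself. The flags are the ones listed above; the function name and the output file names (`vk-no-<behavior>.wacz`) are hypothetical.

```shell
#!/bin/sh
# Dry run: print one scoop command per behavior, each with that behavior disabled.
URL="https://vk.com/wall-1113595_548588"

print_scoop_commands() {
  for flag in --auto-scroll --auto-play-media --grab-secondary-resources --run-site-specific-behaviors; do
    # ${flag#--} strips the leading dashes so the flag name can be reused in the file name
    printf 'scoop %s=false -o vk-no-%s.wacz %s\n' "$flag" "${flag#--}" "$URL"
  done
}

print_scoop_commands
```

Piping the output to `sh` (`print_scoop_commands | sh`) would run the four captures in sequence. If one of them finishes well inside the timeout, the disabled behavior is a likely suspect.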

An aside -- I see in your logs that you are using Node 18.x. While we are still testing on Node 18 and it's unlikely to have an effect, our focus is currently on Node 20 and 21 and I'd recommend upgrading if possible.

@edsu
Author

edsu commented Jan 25, 2024

Thanks for the quick response @matteocargnelutti -- in this case I wasn't specifically interested in the attached video, but was hoping that the embedded video would be part of the WARC data and would play when viewing the web page:

scoop -o vk-video.wacz https://vk.com/wall-1113595_548588

Does the video play in the web page replay for you?

I upgraded to Node v21.5.0 and ran with a longer --behaviors-timeout (which required upping --capture-timeout too), which resulted in the video attachment working. But the video still does not play in replay of the web page.

scoop --capture-timeout=3600000 --behaviors-timeout=3600000 -o vk-video.wacz https://vk.com/wall-1113595_548588
[12:46:07] INFO Capture-specific temporary folder /Users/edsummers/.nodenv/versions/21.5.0/lib/node_modules/@harvard-lil/scoop/tmp/ab4nwI/ created.
[12:46:07] INFO TCP-Proxy-Server started {"address":"::1","family":"IPv6","port":9000}
[12:46:07] INFO User Agent used for capture: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.57 Safari/537.36
[12:46:07] INFO Scoop 0.6.5 was initialized with the following options:
[12:46:07] INFO {
  logLevel: 'info',
  screenshot: true,
  pdfSnapshot: false,
  domSnapshot: false,
  captureVideoAsAttachment: true,
  captureCertificatesAsAttachment: true,
  provenanceSummary: true,
  attachmentsBypassLimits: true,
  captureTimeout: 3600000,
  loadTimeout: 20000,
  networkIdleTimeout: 20000,
  behaviorsTimeout: 3600000,
  captureVideoAsAttachmentTimeout: 30000,
  captureCertificatesAsAttachmentTimeout: 10000,
  captureWindowX: 1600,
  captureWindowY: 900,
  maxCaptureSize: 209715200,
  autoScroll: true,
  autoPlayMedia: true,
  grabSecondaryResources: true,
  runSiteSpecificBehaviors: true,
  headless: true,
  userAgentSuffix: '',
  blocklist: [
    '/https?://localhost/', '0.0.0.0/8',
    '10.0.0.0/8',           '100.64.0.0/10',
    '127.0.0.0/8',          '169.254.0.0/16',
    '172.16.0.0/12',        '192.0.0.0/29',
    '192.0.2.0/24',         '192.88.99.0/24',
    '192.168.0.0/16',       '198.18.0.0/15',
    '198.51.100.0/24',      '203.0.113.0/24',
    '224.0.0.0/4',          '240.0.0.0/4',
    '255.255.255.255/32',   '::/128',
    '::1/128',              '::ffff:0:0/96',
    '100::/64',             '64:ff9b::/96',
    '2001::/32',            '2001:10::/28',
    '2001:db8::/32',        '2002::/16',
    'fc00::/7',             'fe80::/10',
    'ff00::/8'
  ],
  intercepter: 'ScoopProxy',
  proxyHost: 'localhost',
  proxyPort: 9000,
  proxyVerbose: false,
  publicIpResolverEndpoint: 'https://icanhazip.com',
  ytDlpPath: '/Users/edsummers/.nodenv/versions/21.5.0/lib/node_modules/@harvard-lil/scoop/executables/yt-dlp',
  cripPath: '/Users/edsummers/.nodenv/versions/21.5.0/lib/node_modules/@harvard-lil/scoop/executables/crip'
}
[12:46:07] INFO 🍨 Starting capture of https://vk.com/wall-1113595_548588.
[12:46:07] INFO STEP [1/10]: Out-of-browser detection and capture of non-web resource
[12:46:08] INFO Requested URL is assumed to be a web page (no content-type found)
[12:46:08] INFO STEP [2/10]: Wait for initial page load
[12:46:30] INFO STEP [3/10]: Capture page info
[12:46:51] WARN Could not fetch favicon at url https://vk.com/images/icons/favicons/fav_logo.ico?7.
[12:46:51] INFO STEP [4/10]: Browser scripts
[12:47:22] INFO STEP [5/10]: Wait for network idle
[12:47:22] INFO STEP [6/10]: Scroll-up
[12:47:23] INFO STEP [7/10]: Screenshot
[12:47:23] INFO STEP [8/10]: Out-of-browser capture of video as attachment (if any)
[12:47:46] INFO STEP [9/10]: Capturing certificates info
[12:47:53] INFO STEP [10/10]: Provenance summary
[12:47:53] INFO Closing browser and intercepter
[12:47:53] INFO TCP-Proxy-Server closed
[12:47:53] INFO Clearing capture-specific temporary folder /Users/edsummers/.nodenv/versions/21.5.0/lib/node_modules/@harvard-lil/scoop/tmp/ab4nwI/
[12:47:53] INFO Exporting capture to WACZ
[12:47:58] INFO 1 WARC(s) to process
[12:47:58] INFO Initializing output stream at: /Users/edsummers/.nodenv/versions/21.5.0/lib/node_modules/@harvard-lil/scoop/tmp/HAMH75/data.wacz
[12:47:58] INFO Initializing indexer
[12:47:58] INFO Indexing WARCS
[12:47:59] INFO Harvesting sorted indexes from trees
[12:47:59] INFO Writing CDX to WACZ
[12:47:59] INFO Writing pages.jsonl to WACZ
[12:47:59] INFO Writing WARCs to WACZ
[12:48:00] INFO Writing datapackage.json to WACZ
[12:48:00] INFO Writing datapackage-digest.json to WACZ
[12:48:00] INFO Finalizing WACZ
[12:48:00] INFO WACZ was finalized
[12:48:00] INFO vk-video.wacz saved to disk.
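
A WACZ file is a ZIP archive, so its index can be inspected to see whether any video responses made it into the WARC data at all. The helper below is a minimal sketch: it filters CDXJ index lines for records whose `mime` field looks like video or a streaming manifest. The function name is made up, the list of MIME types is a guess at what a video player might request, and the index path in the usage comment is an assumption (some WACZ files store a gzipped `index.cdx.gz` instead).

```shell
#!/bin/sh
# Hypothetical helper: reads CDXJ lines on stdin and keeps those whose "mime"
# field is a video type or a common HLS/DASH manifest type.
list_media_records() {
  grep -i -E '"mime": *"(video/|application/(vnd\.apple\.mpegurl|x-mpegurl|dash\+xml))'
}

# Possible usage, assuming the index is stored uncompressed at this path:
#   unzip -l vk-video.wacz                                  # list the archive contents
#   unzip -p vk-video.wacz indexes/index.cdx | list_media_records
```

Matching records would only show that the media was captured. They do not by themselves mean the embedded player will work in replay.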

@matteocargnelutti
Collaborator

@edsu

I upgraded to Node v21.5.0 and ran with a longer --behaviors-timeout (which required upping --capture-timeout too), which resulted in the video attachment working.

Nice 😄 !


in this case I wasn't specifically interested in the attached video, but was hoping that the embedded video would be part of the WARC data and would play when viewing the web page
[...]
Does the video play in the web page replay for you?

It doesn't. Based on what I've seen so far, I think making this possible would require either a custom browser behavior or some ad-hoc HTTP request/response rewriting.

  • On principle, we don't do the latter (altering HTTP exchanges, for example to enhance playback).
  • For the former, we use browsertrix-behaviors and let users decide if they want to activate them or not. In that specific case, I don't know if a site-specific behavior would be helpful.

Cheers 👋

@edsu
Author

edsu commented Jan 29, 2024

Ok I'll look into adding a behavior. Browsertrix-crawler has an option to run custom behaviors now. Have you considered adding anything like that to scoop?

edsu closed this as completed Jan 29, 2024
@matteocargnelutti
Collaborator

Yes indeed, #109 is in that spirit. Please don't hesitate to add a comment there with additional thoughts. Thanks for all the feedback, @edsu
