On further investigation, this looks more like a crawling problem. The live page displays the images with `<source srcset="...">` rather than `<img src="...">`, and the URLs in the `srcset` attribute do not seem to be included in the archive.
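To make the `src` versus `srcset` difference concrete, here is a minimal sketch of how the check could be scripted. It collects the URLs from `srcset` attributes on `<img>` and `<source>` tags and reports the ones missing from a set of archived URLs. The function names and the `cdn.example` host in the usage note are made up for illustration, and the `srcset` parser is a simplified reading of the HTML candidate-list rules (it keeps commas that sit inside a URL, but does not handle every edge case).

```python
from html.parser import HTMLParser


def parse_srcset(value):
    """Return the URLs of a srcset value, ignoring width/density descriptors.

    Simplified: a URL is a run of non-whitespace characters, so commas
    inside a URL are preserved. Only a trailing comma ends a candidate.
    """
    urls = []
    pos, n = 0, len(value)
    while pos < n:
        # Skip separators between candidates.
        while pos < n and (value[pos].isspace() or value[pos] == ","):
            pos += 1
        if pos >= n:
            break
        start = pos
        while pos < n and not value[pos].isspace():
            pos += 1
        url = value[start:pos]
        if url.endswith(","):
            # Candidate without a descriptor, e.g. "a.png, b.png".
            url = url.rstrip(",")
        else:
            # Skip the descriptor (e.g. "424w") up to the next comma.
            while pos < n and value[pos] != ",":
                pos += 1
        if url:
            urls.append(url)
    return urls


class ImageURLCollector(HTMLParser):
    """Collect src and srcset URLs from <img> and <source> tags."""

    def __init__(self):
        super().__init__()
        self.src_urls = set()
        self.srcset_urls = set()

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "source"):
            return
        for name, value in attrs:
            if not value:
                continue
            if name == "src":
                self.src_urls.add(value)
            elif name == "srcset":
                self.srcset_urls.update(parse_srcset(value))


def missing_srcset_urls(html, archived_urls):
    """Return the sorted srcset URLs in `html` that are not in `archived_urls`."""
    collector = ImageURLCollector()
    collector.feed(html)
    return sorted(collector.srcset_urls - set(archived_urls))
```

For example, calling `missing_srcset_urls(page_html, archived)` with the live page's HTML and the set of URLs found in the WARC would list the image variants that the browser requests through `srcset` but that the crawl never fetched.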
I'm now raising the issue with grab-site here and closing this one.
ReplayWeb.page Version
v2.2.4
What did you expect to happen? What happened instead?
There are several images on the page that directly get displayed when opening the live site. However, archiving the page with grab-site and replaying with ReplayWeb.page, the images do not load directly, appearing as broken images or blank spaces. Some of them may appear when expanded by clicking on the broken images.
Archived:
![image](https://private-user-images.githubusercontent.com/12495508/394243682-80c57f63-2d86-4791-ba22-3003b1a306c9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzOTYyNzcsIm5iZiI6MTczOTM5NTk3NywicGF0aCI6Ii8xMjQ5NTUwOC8zOTQyNDM2ODItODBjNTdmNjMtMmQ4Ni00NzkxLWJhMjItMzAwM2IxYTMwNmM5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEyVDIxMzI1N1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWU3MWNhNjVhZDA1MzFjYTRiNmUxZjkzOTM1OTExMWJlNzM4NDFmMTBjZDMxZjlmZjczZDczNzNlNGVmM2M2NDImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.1XmGm-f78ttawfA5BYuVVyhSj9g1MQap547x9XT60DI)
Live site:
![image](https://private-user-images.githubusercontent.com/12495508/394244226-4c8363ba-17d2-46f3-8d83-edbad293e6a8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzOTYyNzcsIm5iZiI6MTczOTM5NTk3NywicGF0aCI6Ii8xMjQ5NTUwOC8zOTQyNDQyMjYtNGM4MzYzYmEtMTdkMi00NmYzLThkODMtZWRiYWQyOTNlNmE4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEyVDIxMzI1N1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJmYmM0NDk2MDRmNmExZTliZjg2N2VmOTRiMmEzZWU4NjZkN2QwM2NlNjEzMTljYmEyM2VlZjMwNTBhNjY1ODYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.f8eQ4g74_lt4RHt0GDhzwIyxq0LMj7B2Eyk3o12madA)
Archived:
![image](https://private-user-images.githubusercontent.com/12495508/394243948-a328cc89-fae2-4012-947c-86a23fa6b59c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzOTYyNzcsIm5iZiI6MTczOTM5NTk3NywicGF0aCI6Ii8xMjQ5NTUwOC8zOTQyNDM5NDgtYTMyOGNjODktZmFlMi00MDEyLTk0N2MtODZhMjNmYTZiNTljLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEyVDIxMzI1N1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTI0YjBjMDcxNzdiNWFkODAwNzRmZDhiMmNiYWU1YTZkZThkYzM4NjA3M2I2ZGZmZmI2MDhiM2FlZmY4YTdhZjcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.5Tuplxr3KEOjZdF1lO0aIWE_uXTrimY69sv2vx1YiVo)
Live site:
![image](https://private-user-images.githubusercontent.com/12495508/394244400-149c462b-0ef9-4abc-b4d5-46e05f651ee0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzOTYyNzcsIm5iZiI6MTczOTM5NTk3NywicGF0aCI6Ii8xMjQ5NTUwOC8zOTQyNDQ0MDAtMTQ5YzQ2MmItMGVmOS00YWJjLWI0ZDUtNDZlMDVmNjUxZWUwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEyVDIxMzI1N1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTYwMTI0Yzk3ZTdmMzc2MzhkYzlkZTc3MmVkOGE4Njg3OGRjZTIzNWM4YWQ1MThiNGQ1ZWM3MzhlN2FiY2U1Y2UmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.XRANBE_xOjbb29GkAhjrJAHqSggKUCJqtAaj6dJYbZg)
I have verified that the resource behind the broken images has indeed been archived by looking for the `src` attribute on the live page and looking for the same URL inside the archive, so I don't think the issue is in the crawling. For example, this thumbnail image appears on the live site and is part of the archive, but does not display in the replay as seen in the first screenshot.

In addition, some scripts don't work properly. When navigating to the previous or next blog page, ReplayWeb.page will first display a page saying "Post not found". Refreshing the page will make it load properly (but still with the missing images).
My belief is that both the missing images and the script errors are replay issues.
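The "is this URL in the archive?" check described above can be approximated with a short script. The following is a rough sketch using only the Python standard library: it scans a WARC file (optionally gzip-compressed) for `WARC-Target-URI` header lines and returns the set of URLs. The function name is hypothetical, and because it does not parse record boundaries, a payload line that happens to start with the same text would be counted too; a dedicated WARC library such as warcio would be more robust.

```python
import gzip

_PREFIX = b"warc-target-uri:"


def warc_target_uris(path):
    """Return the set of WARC-Target-URI values found in a WARC file.

    Assumption: a plain line scan is good enough for a quick check.
    It does not validate records or distinguish headers from payloads.
    """
    opener = gzip.open if path.endswith(".gz") else open
    uris = set()
    with opener(path, "rb") as fh:
        for line in fh:
            # WARC header names are case-insensitive.
            if line[: len(_PREFIX)].lower() == _PREFIX:
                value = line[len(_PREFIX):].strip().decode("utf-8", "replace")
                # Some writers wrap the URI in angle brackets.
                uris.add(value.strip("<>"))
    return uris
```

With this, `url in warc_target_uris("archive.warc.gz")` answers whether a given image URL was captured at all, independently of how the replay tool renders it.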
Step-by-step reproduction instructions
First I run:
I include these other two URLs so that their domain names aren't considered "offsite".
The contents of the ignores file are:
Then I open the archive using ReplayWeb.page-2.2.4.AppImage, and navigate to the page:
https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting
You can download the WARC here: https://drive.google.com/file/d/1fJuWwgSTVfh9IdD47RC2lw67tWSryG4S/view?usp=sharing
Additional details
I run Ubuntu 20.04 LTS.
In case it's still because some files didn't get crawled, I also made a 3.8GB version of the archive where I set no upper bound on `--level` and set `--page-requisites-level=20`. The problem persists with the bigger archive, but it's too big to upload here, so I provide a smaller one for repro.