Get list of domain hosts that have data in Internet Archive #2

Open
phikoehn opened this issue Jul 26, 2019 · 0 comments

With regard to this:

As a reminder, the Internet Archive has a variety of collections. What
you currently have is https://archive.org/details/survey_00002?tab=about .

As Jefferson Bailey from the Internet Archive described it: The above is
a "survey" crawl from a few years ago -- these crawls aim to archive at
least the landing page of every host ever seen. It's about 100TB total
and looks pretty rich in text captures (~3.2B or so) and since its aim
is breadth over depth, would maybe have decent ccTLD / language
representation.

The collection is 100 TB.

It would be great to get a list of all hosts that have a responsive web (HTTP) port.

We promised to crawl all *.ee domains once, so this would give us
such a list. It may have other uses as well.
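
A minimal sketch of how such a host list could be derived, assuming the captures have already been reduced to (URL, HTTP status) pairs, for example from an index of the crawl. The input format, the function name `responsive_hosts`, and the choice to count any 2xx or 3xx status as "responsive" are all assumptions for illustration, not something the collection itself defines.

```python
from urllib.parse import urlsplit


def responsive_hosts(captures, tld=None):
    """Return a sorted list of hosts with at least one responsive capture.

    captures: iterable of (url, http_status) pairs (hypothetical input format).
    tld: optional top-level domain such as "ee" to restrict the result.
    """
    hosts = set()
    for url, status in captures:
        try:
            code = int(status)
            host = urlsplit(url).hostname  # lowercased, port stripped
        except (TypeError, ValueError):
            continue  # skip records with a missing status or malformed URL
        # Assumption: any 2xx or 3xx response counts as "responsive".
        if not 200 <= code < 400 or not host:
            continue
        host = host.rstrip(".")  # normalize a trailing root dot
        if tld is None or host.endswith("." + tld.lower()):
            hosts.add(host)
    return sorted(hosts)
```

Calling `responsive_hosts(captures, tld="ee")` would then yield the *.ee hosts only, which is the list needed for the promised crawl.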
