how does worker pick a site after crash? #231

Open
mishranitin2003 opened this issue Oct 6, 2021 · 3 comments

Comments

@mishranitin2003

Scenario: I have warcprox and a brozzler worker running on my local machine. In the middle of archiving a website, the brozzler worker process is killed, for example with 'kill -9 <process_id>' or by closing the console session.
After both the warcprox and brozzler worker instances are restarted (on the same ports as before), the site is not picked up for crawling. This is because the site's record in db('Brozzler').table('sites') still has claimed = true.

Query:

  • Is there a configuration property that can be set up so that the site can be picked by any single brozzler worker even if claimed=true?
@nlevitt
Contributor

nlevitt commented Oct 7, 2021

If you wait an hour, it should start crawling again. See https://github.com/internetarchive/brozzler/blob/e23fa68d6/brozzler/frontier.py#L117. If you can't wait, you could set claimed=false in rethinkdb.
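
For readers following along, the one-hour rule can be sketched roughly as below. This is a minimal illustration, not the code at the linked line of frontier.py: the function name `is_claimable`, the constant `CLAIM_TIMEOUT`, and the `last_claimed` timestamp field are assumptions made for the example. Only the `claimed` flag and the one-hour wait come from this thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: one hour, matching the "wait an hour" advice in this thread.
CLAIM_TIMEOUT = timedelta(hours=1)


def is_claimable(site, now, timeout=CLAIM_TIMEOUT):
    """Return True if a worker may claim this site record.

    `site` is a plain dict standing in for a row of the sites table.
    `last_claimed` is a hypothetical field holding the time of the last claim.
    """
    if not site.get("claimed"):
        # Nobody holds the site, so any worker can take it.
        return True
    # The site is claimed. Treat the claim as stale (for example, the
    # claiming worker was killed) once it is older than the timeout.
    return now - site["last_claimed"] > timeout


now = datetime(2021, 10, 6, 12, 0, tzinfo=timezone.utc)
crashed = {"claimed": True, "last_claimed": now - timedelta(minutes=10)}

print(is_claimable(crashed, now))                       # claim is only 10 minutes old
print(is_claimable(crashed, now + timedelta(hours=1)))  # claim is now 70 minutes old
```

Under these assumptions, a site left with claimed=true by a killed worker becomes eligible again once the claim ages past the timeout, without anyone editing the record.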

@mishranitin2003
Author

Thanks @nlevitt for your quick reply. The problem is deciding when to set claimed=false. Is there a specific reason for choosing 60 minutes, or is it arbitrary?
Would it be acceptable to make this 60-minute limit configurable? If so, please let me know and I can raise a PR. Should the branch be named after issue #231, or something else?

mishranitin2003 pushed a commit to mishranitin2003/brozzler-issue-231-claim-limit that referenced this issue Oct 11, 2021
- Configurable claimed limit as it was hard coded to 60. The nodes in case of crash can come back in fairly quick time.
@nlevitt
Contributor

nlevitt commented Oct 12, 2021

@mishranitin2003 It's not random. It has to be high enough that you will never have one worker claim a site when another is legitimately working on it. The value should not be configurable.
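
The constraint described above can be illustrated with a small simulation. It is a sketch under stated assumptions, not brozzler's behavior: the function `double_claim_time`, the polling interval, and the 15-minute working period are all hypothetical. It models one healthy worker that holds a claim for `work_duration` without refreshing it, and a second worker that polls for sites whose claim is older than `claim_timeout`.

```python
from datetime import timedelta


def double_claim_time(claim_timeout, work_duration,
                      poll_interval=timedelta(minutes=1)):
    """Return how long after the first claim a second worker would take over
    a site that is still being crawled, or None if that never happens.

    All parameters are hypothetical values chosen for illustration.
    """
    elapsed = timedelta(0)
    while elapsed < work_duration:
        # The first worker is still legitimately working at this point.
        if elapsed > claim_timeout:
            # The claim looks stale to the second worker, so it claims too.
            return elapsed
        elapsed += poll_interval
    return None


busy_for = timedelta(minutes=15)  # assumed time a live worker holds a site

print(double_claim_time(timedelta(minutes=5), busy_for))   # short timeout
print(double_claim_time(timedelta(minutes=60), busy_for))  # one-hour timeout
```

In this model a 5-minute timeout lets the second worker take the site while the first is still crawling it, whereas a 60-minute timeout does not. A shorter timeout speeds up recovery after a crash only by risking two workers on the same site.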


2 participants