How link scores affect the scheduling #20

Open
chris-zen opened this issue Nov 12, 2015 · 2 comments

Comments

@chris-zen

Hi, I couldn't find any documentation on how link scores influence scheduling. It would be nice to understand the relation between:

  • the spider-defined score for the links
  • the link analysis scorer (PR/HITS)
  • the scheduler prioritization (BFS, Freq)

Thanks

@plafl
Contributor

plafl commented Nov 15, 2015

Until the documentation is updated: in the case of PageRank, the algorithm used is Personalized PageRank, as originally described here. This means that the higher the spider-defined score, the higher the probability of a random jump to the page.
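To make the random-jump idea concrete, here is a minimal sketch of Personalized PageRank by power iteration. It is not the project's implementation: the function name `personalized_pagerank`, the `links` and `spider_scores` dictionaries, and the damping factor 0.85 are all assumptions chosen for illustration. The only point it shows is that the jump distribution is the normalized spider-defined scores.

```python
def personalized_pagerank(links, spider_scores, damping=0.85, iterations=50):
    """Illustrative Personalized PageRank (hypothetical helper, not the library API).

    links: dict mapping page -> list of pages it links to.
    spider_scores: dict mapping every page -> non-negative spider-defined score.
    """
    pages = sorted(spider_scores)
    total = sum(spider_scores.values())
    if total > 0:
        # Random jumps land on a page with probability proportional to its score.
        teleport = {p: spider_scores[p] / total for p in pages}
    else:
        # No scores at all: fall back to uniform jumps (plain PageRank).
        teleport = {p: 1.0 / len(pages) for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            out = links.get(page, [])
            if out:
                # Follow an outgoing link with probability `damping`.
                share = damping * rank[page] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                # Dangling page: send its mass along the jump distribution.
                for p in pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank


# Three pages linking to each other symmetrically; only the spider scores differ.
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
ranks = personalized_pagerank(links, {"a": 3.0, "b": 1.0, "c": 1.0})
```

With a symmetric link graph like this one, any difference in the resulting ranks comes only from the spider scores, so page `a` ends up ranked above `b` and `c`.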

The job of the scheduler is to decide what to do with the final scores (spider score + PageRank/HITS). In the case of BFS, it simply picks the highest-scored web page that is still uncrawled. In the case of the FreqScheduler, this information is ignored and it simply tries to (re)crawl the pages with the desired frequency: if one page has frequency 8 and another has frequency 2, then the first one is crawled 4 times as often.
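A rough sketch of the two behaviours, under stated assumptions: `best_first_order` and `frequency_schedule` are hypothetical helpers, not the real scheduler classes, and the final scores are taken as already computed, since how the spider score and PageRank/HITS are combined is not spelled out here.

```python
import heapq


def best_first_order(final_scores, crawled=()):
    """Return the uncrawled pages, highest final score first (illustrative only)."""
    done = set(crawled)
    # heapq is a min-heap, so negate the scores to pop the largest first.
    heap = [(-score, page) for page, score in final_scores.items() if page not in done]
    heapq.heapify(heap)
    order = []
    while heap:
        _, page = heapq.heappop(heap)
        order.append(page)
    return order


def frequency_schedule(frequencies, horizon):
    """Return the crawl sequence up to `horizon` time units.

    A page with frequency f is due every 1/f time units. Scores are ignored,
    which is the assumption this sketch makes about a frequency-based scheduler.
    """
    heap = [(1.0 / f, page) for page, f in frequencies.items() if f > 0]
    heapq.heapify(heap)
    schedule = []
    while heap and heap[0][0] <= horizon:
        due, page = heapq.heappop(heap)
        schedule.append(page)
        # Re-queue the page for its next crawl.
        heapq.heappush(heap, (due + 1.0 / frequencies[page], page))
    return schedule


order = best_first_order({"a": 0.5, "b": 0.9, "c": 0.1}, crawled=["b"])
visits = frequency_schedule({"x": 8, "y": 2}, horizon=1.0)
```

Here `order` lists `a` before `c` because `b` is already crawled, and in `visits` page `x` appears 8 times against 2 for `y`, matching the 4-to-1 ratio in the example above.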

@chris-zen
Author

Thanks @plafl, very informative.
