I have done some work on implementing ActiveMQ for the message bus and would like to contribute #350

Open

wants to merge 332 commits into base: exception-dbw
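This PR proposes an ActiveMQ transport for Frontera's message bus abstraction (the README diff below lists ZeroMQ and Kafka as the transports available out of the box). For orientation only, a transport like this would implement the stream producer/consumer interfaces from `frontera.core.messagebus`. The sketch below is not the PR's code: it assumes those Frontera interfaces, the third-party `stomp.py` client (version 8+, where listeners receive a `Frame`), and illustrative destination names.

```python
# Hedged sketch of an ActiveMQ-backed producer/consumer pair for Frontera.
# Assumptions: frontera.core.messagebus interfaces, stomp.py >= 8.
from collections import deque

import stomp  # pip install stomp.py

from frontera.core.messagebus import BaseStreamConsumer, BaseStreamProducer


class ActiveMQProducer(BaseStreamProducer):
    def __init__(self, connection, destination):
        self._conn = connection
        self._destination = destination

    def send(self, key, *messages):
        # ActiveMQ has no Kafka-style partitions; carry the partitioning
        # key in a header so a consumer could shard on it if needed.
        for msg in messages:
            self._conn.send(destination=self._destination, body=msg,
                            headers={'frontera-key': key or ''})

    def flush(self):
        pass  # stomp.py writes frames synchronously


class ActiveMQConsumer(BaseStreamConsumer, stomp.ConnectionListener):
    def __init__(self, connection, destination, partition_id):
        self._buffer = deque()
        connection.set_listener('frontera-%s' % partition_id, self)
        connection.subscribe(destination=destination,
                             id='frontera-%s' % partition_id, ack='auto')

    def on_message(self, frame):
        # buffer incoming frames; get_messages() drains them on demand
        self._buffer.append(frame.body)

    def get_messages(self, timeout=0.1, count=1):
        while self._buffer and count > 0:
            count -= 1
            yield self._buffer.popleft()
```

A complete message bus would additionally expose the spider log, spider feed and scoring log streams (the `BaseMessageBus` contract) and be enabled through the `MESSAGE_BUS` setting.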
Commits (332)
e57c763
adding zmq broker output
sibiryakov Nov 1, 2017
baed69e
more work
sibiryakov Nov 1, 2017
8c5f638
adding cityhash to test reqs
sibiryakov Nov 1, 2017
b050066
enabling all tests
sibiryakov Nov 1, 2017
e8dcd48
converting row prefix to bytes
sibiryakov Nov 2, 2017
90b3530
style
sibiryakov Nov 2, 2017
5e685c7
Merge pull request #299 from scrapinghub/headers-etc
sibiryakov Nov 2, 2017
679c8c0
add message type logging and time for batch
Preetwinder Mar 15, 2017
312dd58
smaller poll intervals
sibiryakov Mar 23, 2017
da3d0c7
more logging in offset-fetcher
sibiryakov Mar 27, 2017
87f9e53
logging contents of e
sibiryakov Mar 27, 2017
bec1877
change err message and raising kafka log level to INFO
sibiryakov Mar 28, 2017
c2e57a6
fix of 'Future' object is not iterable
sibiryakov Mar 30, 2017
1ccbd42
simplifying logging configuration
sibiryakov May 29, 2017
a271206
Update hbase.py
clarksun Dec 27, 2017
b69e318
maybe miss b?
clarksun Dec 27, 2017
fbc6fc2
update BCPerHostLimit in cluster example
clarksun Dec 27, 2017
4536702
Merge pull request #305 from scrapinghub/logging-improvement
sibiryakov Jan 12, 2018
6a7f061
Merge pull request #302 from scrapinghub/per-type-worker-stats
sibiryakov Jan 12, 2018
3c07c5d
Merge pull request #304 from scrapinghub/logging-async-fetcher
sibiryakov Jan 12, 2018
6a3e824
Merge pull request #303 from scrapinghub/smaller-kafka-poll-intervals
sibiryakov Jan 12, 2018
1fe7b4c
Merge pull request #306 from clarksun/patch-1
sibiryakov Jan 24, 2018
90f685a
Merge pull request #307 from clarksun/patch-2
sibiryakov Jan 24, 2018
3cbfd67
Merge pull request #309 from clarksun/patch-4
sibiryakov Jan 24, 2018
cbeee88
hbase states table option
sibiryakov Mar 15, 2017
9d90f6b
reducing logging verbosity
sibiryakov Jan 24, 2018
db17e76
Merge pull request #301 from scrapinghub/states-table
sibiryakov Jan 24, 2018
285e03b
Send crawl stats to Kafka message bus
vshlapakov Jun 27, 2017
3c16f22
Add Mesos task id as a tag for DB worker metrics
vshlapakov Jul 10, 2017
5aaff5f
Send backend stats to message bus
vshlapakov Jul 10, 2017
c296a86
No need to setup stats log partitions
vshlapakov Jul 26, 2017
ea8fb18
Decrease log-level for stats producer
vshlapakov Aug 2, 2017
7589208
lru cache for HBaseState
vshlapakov Jul 12, 2017
457b8ee
codecs tests fixed
sibiryakov Jan 30, 2018
5d4d98a
other tests fix
sibiryakov Jan 30, 2018
e5b946c
moving stats to worker submodule
sibiryakov Jan 30, 2018
852bef5
base stats log, zmq stats log
sibiryakov Jan 30, 2018
a0fe937
Merge pull request #313 from scrapinghub/lru-cache-states
sibiryakov Jan 30, 2018
e44f827
Merge pull request #312 from scrapinghub/stats-log
sibiryakov Jan 30, 2018
500055f
Split DB worker to components based on threads
vshlapakov Jul 14, 2017
ff6c7dd
Fix attribute error related with job_id
vshlapakov Jul 24, 2017
469c64a
Run only batchgen via thread
vshlapakov Jul 24, 2017
5d7b1e6
Polish changes and fix minor issues
vshlapakov Jul 24, 2017
198e3b9
Minor code simplification
vshlapakov Jul 24, 2017
5923e21
Initial job_id must be 0 for db worker
vshlapakov Jul 24, 2017
d14c4c8
Smarted backoff for batchgen component
vshlapakov Jul 26, 2017
f6ed24f
Do back-off if no requests to schedule
vshlapakov Jul 26, 2017
deb3617
max_request_size is set to 4Mb
sibiryakov Jul 24, 2017
d5b1cc7
Update kafkabus producers settings
vshlapakov Aug 3, 2017
1b7951a
Move values to constants
vshlapakov Aug 3, 2017
b0e039e
more precise scoring, encoding from scrapy, hash for request
vshlapakov Jul 26, 2017
66941c9
changing scrapy test website
sibiryakov Apr 6, 2018
257ee93
ov buffer test fix
sibiryakov Apr 6, 2018
eed086e
Merge pull request #316 from scrapinghub/misc
sibiryakov Apr 6, 2018
637012b
making Kafka mb to support latest kafka-python
sibiryakov Apr 9, 2018
ccd554e
Merge pull request #315 from scrapinghub/kafka-params-up
sibiryakov Apr 9, 2018
3a8e2d6
string formatting corrected in the connection information being logged
akshayphilar Apr 12, 2018
0908e03
db worker test fix
sibiryakov Apr 23, 2018
90d4616
Merge pull request #314 from scrapinghub/dbw-components-teardown
sibiryakov Apr 23, 2018
786c2cd
Setting for splitting partitions b/w batchgens
vshlapakov Aug 2, 2017
4093cb3
conflicts resolved
vshlapakov Aug 2, 2017
343fe31
Config option to pass partitions
vshlapakov Aug 2, 2017
b4547c7
No need new setting
vshlapakov Aug 2, 2017
c44fe05
Minor renaming
vshlapakov Aug 3, 2017
7a1dee9
Wrap sending message logic
vshlapakov Aug 3, 2017
eaee5a4
Catch and print shutdown errors
vshlapakov Sep 6, 2017
426c1bc
Non-exhausted generators must be closed manually
vshlapakov Aug 11, 2017
e902ad3
conflicts resolved
vshlapakov Aug 21, 2017
7cb12c0
Provide SW states cache hits ratio
vshlapakov Aug 24, 2017
40a2b78
Provide flushed batches count for SW states
vshlapakov Aug 24, 2017
a3762f2
Simplify state stats using with a variable
vshlapakov Aug 24, 2017
28139a2
Provide stats for SW state cache evictions
vshlapakov Sep 6, 2017
a4d65bb
Skip links-extracted when reached limit for domain
vshlapakov Sep 1, 2017
c70b0ad
no need to flush on every batch
sibiryakov Sep 13, 2017
c3ce7cc
create_request method added to manager, etc
sibiryakov Oct 5, 2017
01b3b73
removed domain info logic from hbase.queue, domain mw enabled by default
sibiryakov Oct 5, 2017
2060f0c
enabling domain fingerprint mw by default
sibiryakov Oct 5, 2017
bea71e8
strategy tests fixed
sibiryakov May 8, 2018
c6e074f
using unittest
sibiryakov May 8, 2018
0bd0255
tests fix
sibiryakov May 8, 2018
389396a
speeding up serialization
sibiryakov May 8, 2018
8e4ffc3
Merge pull request #327 from scrapinghub/create-req-manager
sibiryakov May 8, 2018
6773404
test case
sibiryakov May 8, 2018
95170ca
Merge pull request #324 from scrapinghub/hbase-state-cache
sibiryakov May 8, 2018
117f341
Merge pull request #326 from scrapinghub/no-flush-scoring-log
sibiryakov May 8, 2018
c0c6920
Merge pull request #323 from scrapinghub/generators-pypy
sibiryakov May 8, 2018
47004eb
Merge pull request #322 from scrapinghub/log-improve-batchgen-sw
sibiryakov May 8, 2018
d402216
Merge pull request #321 from scrapinghub/part-split-btw-batchgen
sibiryakov May 8, 2018
bbfcfba
Merge pull request #320 from scrapinghub/log-connected-host
sibiryakov May 8, 2018
184a719
Merge branch 'master' into pre-filter-of-links-ext
sibiryakov May 8, 2018
930a96f
Merge pull request #325 from scrapinghub/pre-filter-of-links-ext
sibiryakov May 8, 2018
9d6ad05
removed ambiguous redirect logic
sibiryakov Nov 24, 2017
a821652
retries budget added in offsets fetch
sibiryakov Dec 25, 2017
a6057e9
fix of partitioning in batch-gen
sibiryakov Jan 30, 2018
8d6e7b1
domain stats in batch gen
sibiryakov Feb 7, 2018
88cdd46
fixes
sibiryakov Feb 7, 2018
a868aeb
hbase refactor, first DomainCache with second gen draft
sibiryakov Feb 12, 2018
6590a1c
converting keys to native str
sibiryakov Feb 13, 2018
5e4f086
set serialization fix
sibiryakov Feb 13, 2018
0542494
flush fix and dosctring
sibiryakov Feb 13, 2018
c12f021
crash fix for py3
sibiryakov Feb 16, 2018
1ed0be6
improved error output
sibiryakov Feb 16, 2018
1a627d1
add seeds run mode
sibiryakov Mar 16, 2018
47a0012
overused buffer settings
sibiryakov Mar 22, 2018
48fffd4
passing args to run()
sibiryakov Mar 23, 2018
514e5d3
removed seeds addition through spider log
sibiryakov Mar 23, 2018
825fee0
codecs test fixed
sibiryakov Mar 23, 2018
92def37
renaming add_seeds to read_seeds(file)
sibiryakov Mar 23, 2018
5146876
fixed reading of seeds
sibiryakov Mar 23, 2018
033fcc5
s3 dependency
sibiryakov Mar 23, 2018
27985a0
docs
sibiryakov Apr 10, 2018
66347c2
slot for hbase queue partitioning
sibiryakov Apr 20, 2018
fc4dee3
support of non-iterable argument
sibiryakov Apr 26, 2018
b809725
documented None values
sibiryakov May 8, 2018
1e59bf5
DEFAULT_HBASE_THRIFT_FRAME_SIZE constant
sibiryakov May 10, 2018
2b1ee27
check for a key presence before deletion
sibiryakov May 10, 2018
2211ede
disabling scrapy test
sibiryakov May 10, 2018
7afb27e
codecs tests are fixed
sibiryakov May 10, 2018
f1b6e01
overusedbuffer test cases are fixed
sibiryakov May 10, 2018
b0be022
more work to codecs fix
sibiryakov May 11, 2018
017b70e
batch_size as argument and bug fix + new test case
sibiryakov May 11, 2018
4e81956
table creation
sibiryakov May 11, 2018
167ba3e
removed add_seeds
sibiryakov May 11, 2018
775dfbb
crawling strategy test fix
sibiryakov May 11, 2018
5f5d29f
table presence fix
sibiryakov May 11, 2018
662faad
add seeds run mode test, sw test fix
sibiryakov May 11, 2018
934b2f8
memory backends tests fix
sibiryakov May 11, 2018
95c46a3
don't use hbase namespace for testing
sibiryakov May 11, 2018
d9464c5
strategy test case fix
sibiryakov May 11, 2018
92f3f54
removed add_seeds
sibiryakov May 14, 2018
1a4f8c8
syntax
sibiryakov May 14, 2018
da87cbd
Merge pull request #330 from scrapinghub/test-fix2
sibiryakov May 14, 2018
0ffb2ab
first draft version
sibiryakov May 15, 2018
d701f55
batched workflow extracted
sibiryakov May 23, 2018
4a27212
more work
sibiryakov May 23, 2018
c6c159b
FrontierManager refactoring, more to come
sibiryakov May 23, 2018
fde25fc
proper init hierarchy
sibiryakov May 25, 2018
ec47475
import, frontier_manager and strategy fixes
sibiryakov May 28, 2018
d6c2e5d
SW test case fixes
sibiryakov May 28, 2018
3affb12
tests fixed
sibiryakov May 30, 2018
32069fe
tests fixed
sibiryakov May 30, 2018
214d345
backend tests fixes
sibiryakov May 30, 2018
38de8de
default crawling strategy
sibiryakov May 31, 2018
bcc3d75
scrapy mws test fix
sibiryakov May 31, 2018
ac0da26
dbw test cases fix
sibiryakov May 31, 2018
0ac5f9c
memory backends test fixes
sibiryakov May 31, 2018
d9b1d64
Redis test case fix
sibiryakov May 31, 2018
1de3c4b
syntax
sibiryakov May 31, 2018
90484e4
more work on style
sibiryakov May 31, 2018
986fad4
fix of add seeds utility
sibiryakov May 31, 2018
2e04254
fixing distributed run mode
sibiryakov Jun 1, 2018
b90275b
auto_start/test_mode attrs refactor
sibiryakov Jun 4, 2018
7b1d3fb
move of BaseCrawlingStrategy, BFS/DFS strategy
sibiryakov Jun 18, 2018
42722d1
seeds have to be injected using crawling strategy
sibiryakov Jun 18, 2018
cd8d95e
new CS guide, and single mode quick start
sibiryakov Jun 18, 2018
df214c6
removed obsolete test
sibiryakov Jul 12, 2018
ebd8532
new path for a crawling strategy
sibiryakov Jul 12, 2018
b37e7d2
style
sibiryakov Jul 12, 2018
bca36a8
DomainMetadata interface
sibiryakov Jul 13, 2018
524dd33
using DomainMetadata in hbase DomainCache
sibiryakov Jul 13, 2018
943231b
SQLA DomainMetadata storage
sibiryakov Jul 13, 2018
7c1272a
moved domain cache test
sibiryakov Jul 13, 2018
f7044ac
DomainCache is instantiated in HBaseBackend now
sibiryakov Jul 16, 2018
1f1f4a1
removing old memory backends
sibiryakov Jul 16, 2018
276da48
removing old sqlalchemy backends
sibiryakov Jul 16, 2018
f2ba99c
pointing to existing memory backend
sibiryakov Jul 16, 2018
8865daf
backends docs update
sibiryakov Jul 20, 2018
f5da2ec
local mode instantiation
sibiryakov Jul 20, 2018
4028a36
error fix
sibiryakov Jul 20, 2018
61efc5e
style
sibiryakov Jul 20, 2018
97fe2ab
fix of references in docs
sibiryakov Jul 23, 2018
cb999cc
fixes of single process sqlalchemy tutorial
sibiryakov Jul 23, 2018
b8c1cac
Discovery crawling strategy
sibiryakov Jul 23, 2018
e339e49
removed distributed spiders run mode
sibiryakov Jul 23, 2018
b3247fc
crawling strategy docs
sibiryakov Jul 23, 2018
ca53383
style mainly
sibiryakov Jul 23, 2018
d8c9c3a
some mistakes corrected
sibiryakov Jul 24, 2018
ae138ed
putting the code under name=main
sibiryakov Jul 24, 2018
2faf280
is -> ==
sibiryakov Jul 24, 2018
fc161f6
docstring
sibiryakov Jul 24, 2018
8e13eb1
page_error -> request_error
sibiryakov Jul 24, 2018
a6d1c37
states tests extraction
sibiryakov Jul 24, 2018
2c982d2
new hbase states test
sibiryakov Jul 24, 2018
2dffcab
fix
sibiryakov Jul 24, 2018
17c4489
moving states tester
sibiryakov Jul 24, 2018
26eff5e
style
sibiryakov Jul 24, 2018
ed5c4af
using fixtures
sibiryakov Jul 24, 2018
a94ee78
sqlalchemy states
sibiryakov Jul 24, 2018
5631001
test fix
sibiryakov Jul 24, 2018
4205ea2
table creation
sibiryakov Jul 24, 2018
b46600b
updated tests
sibiryakov Jul 25, 2018
dffa5e7
style
sibiryakov Jul 25, 2018
19025cd
test fix
sibiryakov Jul 25, 2018
b31cbf1
frontera-scrapy test is enabled
sibiryakov Jul 25, 2018
95c34f2
fix warning
sibiryakov Jul 25, 2018
dbb51d7
skipping scrapy spider for now
sibiryakov Jul 25, 2018
6ad1a4c
renaming modules to support py3.7
sibiryakov Jul 25, 2018
03a5f91
to the previous commit
sibiryakov Jul 25, 2018
8ac2316
Merge pull request #331 from scrapinghub/strategy-in-manager
sibiryakov Jul 25, 2018
f91dd7a
updated with latest changes
sibiryakov Jul 25, 2018
a26a0a9
version bump
sibiryakov Jul 25, 2018
dbf15b0
Update README.md
sibiryakov Jul 25, 2018
fb05073
some refactor + devnull producer for stats
sibiryakov Jul 30, 2018
4a8f0c1
docs + example config update
sibiryakov Jul 30, 2018
264ad76
Merge pull request #341 from scrapinghub/fix-general-spider
sibiryakov Jul 30, 2018
51526c1
local_mode option
sibiryakov Jul 30, 2018
8830ce2
Merge pull request #342 from scrapinghub/fix-docs
sibiryakov Jul 30, 2018
7b6717d
version bump
sibiryakov Jul 30, 2018
fff3617
Remove invalid keyword argument 'sleep' from kafka pull
icapurro Aug 21, 2018
4fb16c9
Removed offset unit test
icapurro Aug 28, 2018
bd3211b
Fix the redis backend
vfcosta Sep 11, 2018
bd781e5
Merge pull request #347 from vfcosta/fix-redis
sibiryakov Sep 11, 2018
0b245ef
Merge pull request #345 from icapurro/fix-kafka-pull
sibiryakov Sep 12, 2018
246575a
Merge branch 'master' of https://github.com/scrapinghub/crawl-frontier
sibiryakov Nov 2, 2018
0c8c58b
scrapy role, overview update
sibiryakov Nov 2, 2018
037c3b4
Update cluster-setup docs
guillermoap Nov 14, 2018
e1a4ca9
Merge pull request #351 from guillermoap/update_docs
sibiryakov Nov 15, 2018
f5ef109
Docs: BaseCrawlingStrategy._refresh_states() also takes a non iterab…
IAlwaysBeCoding Nov 29, 2018
0c81482
Update frontera/strategy/__init__.py
sibiryakov Nov 30, 2018
94886b3
Merge pull request #355 from IAlwaysBeCoding/IAlwaysBeCoding-_refresh…
sibiryakov Dec 3, 2018
dfdc396
removing call to absent API in happybase
sibiryakov Jan 14, 2019
424f569
Update after redis update
Jan 21, 2019
e418807
fixed DomainCache crash on init
sibiryakov Apr 3, 2019
2e23f27
Merge pull request #365 from scrapinghub/domain-cache-crash-fix
sibiryakov Apr 3, 2019
f20de81
Merge pull request #361 from khellan/redis_fix
sibiryakov Apr 5, 2019
45576af
fixing Cache.__init__ signature changes
sibiryakov Apr 5, 2019
22d3bc5
more work
sibiryakov Apr 5, 2019
fd294e0
Merge pull request #366 from scrapinghub/test-fix3
sibiryakov Apr 5, 2019
9bb47c9
explicit cluster metadata bootstrap in Kafka client
sibiryakov Apr 12, 2019
4e6f547
syntax fixes
sibiryakov Apr 12, 2019
1e951c6
line breaks
sibiryakov Apr 12, 2019
5762a26
notimplemented->error
sibiryakov Apr 12, 2019
befc866
update models.py
Jul 3, 2019
2457357
Update models.py
Jul 3, 2019
13efd27
Update models.py
Jul 3, 2019
02326d2
Merge pull request #375 from liho00/liho00-patch-1
sibiryakov Jul 5, 2019
59050bf
Bump to Python 3.6
Oct 23, 2019
f05104e
Use thriftpy2
Oct 23, 2019
e4885cb
Add Python 3.7
Oct 23, 2019
4a4d8b8
Make test_zmq_message_bus() flaky
Oct 23, 2019
4e49f76
Add .env to gitignore
Oct 23, 2019
5c5972b
Add badges
Oct 24, 2019
482b4d5
Add pyversions badge
Oct 24, 2019
e6bb8da
Rename to "python versions"
Oct 24, 2019
068f169
Merge pull request #381 from suzil/cleanup/add-badges
sibiryakov Oct 25, 2019
c5a7001
Move thriftpy2 to test_requires
Oct 25, 2019
ebb4d32
Also include thriftpy2 as an hbase requirement
Oct 25, 2019
b98accd
Merge pull request #379 from suzil/bump/py36
sibiryakov Oct 25, 2019
d5f5905
add assertion error for easier debugging
a-shkarupin Nov 29, 2019
84f9e10
Merge pull request #389 from a-shkarupin/assertion_message
sibiryakov Nov 29, 2019

Files changed

4 changes: 4 additions & 0 deletions .gitignore
@@ -21,6 +21,7 @@ var/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
@@ -52,3 +53,6 @@ docs/_build/

# PyBuilder
target/

# Virtualenv
.env/
19 changes: 19 additions & 0 deletions .travis.yml
@@ -10,6 +10,7 @@ services:
- docker
- mysql
- postgresql
- redis-server

env:
global:
@@ -26,6 +27,21 @@ matrix:
- docker
- mysql
- postgresql
- redis-server
- python: 3.6
env: TOXENV=py36
services:
- docker
- mysql
- postgresql
- redis-server
- python: 3.7
env: TOXENV=py37
services:
- docker
- mysql
- postgresql
- redis-server

install:
- pip install -U tox wheel codecov
@@ -57,6 +73,9 @@ script: tox
after_success:
- codecov

after_script:
- cat broker.log

deploy:
provider: pypi
distributions: sdist bdist_wheel
20 changes: 12 additions & 8 deletions README.md
@@ -1,22 +1,28 @@
# Frontera

[![pypi](https://img.shields.io/pypi/v/frontera)](https://pypi.org/project/frontera/)
[![python versions](https://img.shields.io/pypi/pyversions/frontera.svg)](https://pypi.org/project/frontera/)
[![Build Status](https://travis-ci.org/scrapinghub/frontera.svg?branch=master)](https://travis-ci.org/scrapinghub/frontera)
[![codecov](https://codecov.io/gh/scrapinghub/frontera/branch/master/graph/badge.svg)](https://codecov.io/gh/scrapinghub/frontera)

## Overview

Frontera is a web crawling framework consisting of [crawl frontier](http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html),
and distribution/scaling primitives, allowing to build a large scale online web crawler.
Frontera is a web crawling framework consisting of [crawl frontier](http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html), and distribution/scaling primitives, allowing to build a large scale online web crawler.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by
the crawler to decide which pages to visit next, and capable of doing it in distributed manner.

## Main features

- Online operation: small requests batches, with parsing done right after fetch.
- Pluggable backend architecture: low-level storage logic is separated from crawling policy.
- Three run modes: single process, distributed spiders, distributed backend and spiders.
- Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
- Two run modes: single process and distributed.
- Built-in SqlAlchemy, Redis and HBase backends.
- Built-in Apache Kafka and ZeroMQ message buses.
- Built-in crawling strategies: breadth-first, depth-first, Discovery (with support of robots.txt and sitemaps).
- Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M of documents daily for 45 days, without downtime,
- Transparent data flow, allowing to integrate custom components easily using Kafka.
- Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
- RDBMS and HBase backends.
- Revisiting logic with RDBMS.
- Optional use of Scrapy for fetching and parsing.
- 3-clause BSD license, allowing to use in any commercial product.
- Python 3 support.
@@ -37,5 +43,3 @@ $ pip install frontera

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and
pull requests.


6 changes: 3 additions & 3 deletions docs/source/conf.py
@@ -48,16 +48,16 @@

# General information about the project.
project = u'Frontera'
copyright = u'2014-2016, Frontera authors'
copyright = u'2014-2018, Frontera authors'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.6'
version = '0.8'
# The full version, including alpha/beta/rc tags.
release = '0.6.0'
release = '0.8.0.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
29 changes: 12 additions & 17 deletions docs/source/index.rst
@@ -4,16 +4,14 @@
Frontera |version| documentation
================================

`Frontera`_ is a web crawling tool box, allowing to build crawlers of any scale and purpose.
`Frontera`_ is a web crawling tool box, allowing to build crawlers of any scale and purpose. It includes:

`Frontera`_ provides :ref:`crawl frontier <crawl-frontier>` framework by managing *when* and *what* to crawl next,
and checking for *crawling goal* accomplishment.
* :ref:`crawl frontier <crawl-frontier>` framework managing *when* and *what* to crawl and checking for *crawling goal* accomplishment,

Frontera also provides replication, sharding and isolation of all crawler components to scale and distribute it.
* workers, Scrapy wrappers, and data bus components to scale and distribute the crawler.

Frontera contain components to allow creation of fully-operational web crawler with `Scrapy`_. Even though it was
originally designed for Scrapy, it can also be used with any other crawling framework/system as the framework offers
a generic tool box.
originally designed for Scrapy, it can also be used with any other crawling framework/system.


Introduction
@@ -53,18 +51,22 @@ Using Frontera
:hidden:

topics/installation
topics/strategies
topics/frontier-objects
topics/frontier-middlewares
topics/frontier-canonicalsolvers
topics/frontier-backends
topics/message_bus
topics/own_crawling_strategy
topics/custom_crawling_strategy
topics/scrapy-integration
topics/frontera-settings

:doc:`topics/installation`
HOWTO and Dependencies options.

:doc:`topics/strategies`
A list of built-in crawling strategies.

:doc:`topics/frontier-objects`
Understand the classes used to represent requests and responses.

@@ -75,13 +77,13 @@ Using Frontera
Identify and make use of canonical url of document.

:doc:`topics/frontier-backends`
Define your own crawling policy and custom storage.
Built-in backends, and tips on implementing your own.

:doc:`topics/message_bus`
Built-in message bus reference.

:doc:`topics/own_crawling_strategy`
Implementing own crawling strategy for distributed backend.
:doc:`topics/custom_crawling_strategy`
Implementing your own crawling strategy.

:doc:`topics/scrapy-integration`
Learn how to use Frontera with Scrapy.
@@ -130,13 +132,10 @@ Developer documentation
topics/tests
topics/loggers
topics/frontier-tester
topics/faq
topics/contributing
topics/glossary




:doc:`topics/architecture`
See how Frontera works and its different components.

@@ -158,13 +157,9 @@
:doc:`topics/frontier-tester`
Test your frontier in an easy way.

:doc:`topics/faq`
Frequently asked questions.

:doc:`topics/contributing`
HOWTO contribute.


:doc:`topics/glossary`
Glossary of terms.

Binary file modified docs/source/topics/_images/high-level-arc.png
2 changes: 1 addition & 1 deletion docs/source/topics/architecture.rst
@@ -102,7 +102,7 @@ Where *sharded* means component consumes messages of assigned partition only, e.
stream, and *replicated* is when components consume stream regardless of partitioning.

Such design allows to operate online. Crawling strategy can be changed without having to stop the crawl. Also
:doc:`crawling strategy <own_crawling_strategy>` can be implemented as a separate module; containing logic
:doc:`crawling strategy <custom_crawling_strategy>` can be implemented as a separate module; containing logic
for checking the crawling stopping condition, URL ordering, and scoring model.

Frontera is polite to web hosts by design and each host is downloaded by no more than one spider process.
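The hunk above distinguishes *sharded* from *replicated* consumption. The sharding that enforces the one-host-one-spider guarantee is simply a stable hash from hostname to spider feed partition; Frontera ships a comparable `Crc32NamePartitioner`, and the following is only a minimal sketch of that idea:

```python
# Minimal illustration (not Frontera's code) of host-to-partition pinning:
# a stable hash maps every hostname to exactly one spider feed partition,
# so each host is downloaded by a single spider process.
from zlib import crc32


def partition_for_host(hostname, partitions):
    if not hostname:
        return partitions[0]
    return partitions[crc32(hostname.encode('utf-8')) % len(partitions)]


# deterministic assignment is what makes the politeness guarantee hold
assert partition_for_host('example.com', [0, 1, 2]) == \
    partition_for_host('example.com', [0, 1, 2])
```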
30 changes: 17 additions & 13 deletions docs/source/topics/cluster-setup.rst
@@ -20,7 +20,7 @@ Things to setup before you start

Things to implement before you start
====================================
* :doc:`Crawling strategy <own_crawling_strategy>`
* :doc:`Crawling strategy <custom_crawling_strategy>` or :doc:`pick one from Frontera package <strategies>`
* Spider code

Configuring Kafka
@@ -43,7 +43,7 @@ Configuring HBase

Configuring Frontera
====================
Every Frontera component requires it's own configuration module, but some options are shared, so we recommend to create
Every Frontera component requires its own configuration module, but some options are shared, so we recommend to create
a common modules and import settings from it in component's modules.

1. Create a common module and add there: ::
@@ -88,7 +88,7 @@ a common modules and import settings from it in component's modules.
from __future__ import absolute_import
from .worker import *

CRAWLING_STRATEGY = '' # path to the crawling strategy class
STRATEGY = '' # path to the crawling strategy class
LOGGING_CONFIG='logging-sw.conf' # if needed

The logging can be configured according to https://docs.python.org/2/library/logging.config.html see the
@@ -101,6 +101,7 @@ The logging can be configured according to https://docs.python.org/2/library/log

BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
KAFKA_GET_TIMEOUT = 0.5
LOCAL_MODE = False # by default Frontera is prepared for single process mode


6. Configure Scrapy settings module. It's located in Scrapy project folder and referenced in scrapy.cfg. Let's add
@@ -125,9 +126,11 @@ Starting the cluster
First, let's start storage worker: ::

# start DB worker only for batch generation
$ python -m frontera.worker.db --config [db worker config module] --no-incoming
...
# Then start next one dedicated to spider log processing
# use single instance for every 10 partitions
$ python -m frontera.worker.db --config [db worker config module] --no-incoming --partitions 0 1


# Optionally, start next one dedicated to spider log processing.
$ python -m frontera.worker.db --no-batches --config [db worker config module]


@@ -141,17 +144,18 @@ Next, let's start strategy workers, one process per spider log partition: ::
You should notice that all processes are writing messages to the log. It's ok if nothing is written in streams,
because of absence of seed URLs in the system.

Let's put our seeds in text file, one URL per line and start spiders. A single spider per spider feed partition: ::
Let's put our seeds in text file, one URL per line and run::

$ python -m frontera.utils.add_seeds --config [your_frontera_config] --seeds-file [path to your seeds file]

Finally, a single spider per spider feed partition: ::

$ scrapy crawl [spider] -L INFO -s SEEDS_SOURCE = 'seeds.txt' -s SPIDER_PARTITION_ID=0
...
$ scrapy crawl [spider] -L INFO -s SPIDER_PARTITION_ID=1
$ scrapy crawl [spider] -L INFO -s SPIDER_PARTITION_ID=2
...
$ scrapy crawl [spider] -L INFO -s SPIDER_PARTITION_ID=N

You should end up with N spider processes running. Usually it's enough for a single instance to read seeds from
``SEEDS_SOURCE`` variable to pass seeds to Frontera cluster. Seeds are only read if spider queue is empty.
:setting:`SPIDER_PARTITION_ID` can be read from config file also.
You should end up with N spider processes running. Also :setting:`SPIDER_PARTITION_ID` can be read from config file.

After some time seeds will pass the streams and will be scheduled for downloading by workers. Crawler is bootstrapped.
You're done, crawler should start crawling. Any component can be restarted any time, without major data loss. However,
for pausing its enough to stop batch gen only.
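Pulling this section's fragments together, one possible layout for the configuration modules is sketched below. The setting names mirror the snippets above, including the `STRATEGY` and `LOCAL_MODE` options these docs introduce; hosts, class paths and partition counts are placeholders, not values from the PR:

```python
# config/common.py -- shared options imported by every component's module
MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
KAFKA_LOCATION = 'localhost:9092'   # placeholder broker address
SPIDER_LOG_PARTITIONS = 2           # one strategy worker per partition
SPIDER_FEED_PARTITIONS = 2          # one spider process per partition

# config/sw.py -- strategy worker: runs the crawling strategy
# from .common import *
# BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
# STRATEGY = 'myproject.strategy.MyCrawlingStrategy'

# config/spider.py -- spider side: talks to the cluster via the message bus
# from .common import *
# BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
# LOCAL_MODE = False
```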