I have done some work on implementing ActiveMQ for the message bus and would like to contribute #350

Open

wants to merge 332 commits into base: exception-dbw
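This PR proposes an ActiveMQ transport for Frontera's message bus abstraction (the README diff below lists ZeroMQ and Kafka as the transports available out of the box). For orientation only, a transport like this would implement the stream producer/consumer interfaces from `frontera.core.messagebus`. The sketch below is not the PR's code: it assumes those Frontera interfaces, the third-party `stomp.py` client (version 8+, where listeners receive a `Frame`), and illustrative destination names.

```python
# Hedged sketch of an ActiveMQ-backed producer/consumer pair for Frontera.
# Assumptions: frontera.core.messagebus interfaces, stomp.py >= 8.
from collections import deque

import stomp  # pip install stomp.py

from frontera.core.messagebus import BaseStreamConsumer, BaseStreamProducer


class ActiveMQProducer(BaseStreamProducer):
    def __init__(self, connection, destination):
        self._conn = connection
        self._destination = destination

    def send(self, key, *messages):
        # ActiveMQ has no Kafka-style partitions; carry the partitioning
        # key in a header so a consumer could shard on it if needed.
        for msg in messages:
            self._conn.send(destination=self._destination, body=msg,
                            headers={'frontera-key': key or ''})

    def flush(self):
        pass  # stomp.py writes frames synchronously


class ActiveMQConsumer(BaseStreamConsumer, stomp.ConnectionListener):
    def __init__(self, connection, destination, partition_id):
        self._buffer = deque()
        connection.set_listener('frontera-%s' % partition_id, self)
        connection.subscribe(destination=destination,
                             id='frontera-%s' % partition_id, ack='auto')

    def on_message(self, frame):
        # buffer incoming frames; get_messages() drains them on demand
        self._buffer.append(frame.body)

    def get_messages(self, timeout=0.1, count=1):
        while self._buffer and count > 0:
            count -= 1
            yield self._buffer.popleft()
```

A complete message bus would additionally expose the spider log, spider feed and scoring log streams (the `BaseMessageBus` contract) and be enabled through the `MESSAGE_BUS` setting.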
Commits (332)
e57c763
adding zmq broker output
sibiryakov Nov 1, 2017
baed69e
more work
sibiryakov Nov 1, 2017
8c5f638
adding cityhash to test reqs
sibiryakov Nov 1, 2017
b050066
enabling all tests
sibiryakov Nov 1, 2017
e8dcd48
converting row prefix to bytes
sibiryakov Nov 2, 2017
90b3530
style
sibiryakov Nov 2, 2017
5e685c7
Merge pull request #299 from scrapinghub/headers-etc
sibiryakov Nov 2, 2017
679c8c0
add message type logging and time for batch
Preetwinder Mar 15, 2017
312dd58
smaller poll intervals
sibiryakov Mar 23, 2017
da3d0c7
more logging in offset-fetcher
sibiryakov Mar 27, 2017
87f9e53
logging contents of e
sibiryakov Mar 27, 2017
bec1877
change err message and raising kafka log level to INFO
sibiryakov Mar 28, 2017
c2e57a6
fix of 'Future' object is not iterable
sibiryakov Mar 30, 2017
1ccbd42
simplifying logging configuration
sibiryakov May 29, 2017
a271206
Update hbase.py
clarksun Dec 27, 2017
b69e318
maybe miss b?
clarksun Dec 27, 2017
fbc6fc2
update BCPerHostLimit in cluster example
clarksun Dec 27, 2017
4536702
Merge pull request #305 from scrapinghub/logging-improvement
sibiryakov Jan 12, 2018
6a7f061
Merge pull request #302 from scrapinghub/per-type-worker-stats
sibiryakov Jan 12, 2018
3c07c5d
Merge pull request #304 from scrapinghub/logging-async-fetcher
sibiryakov Jan 12, 2018
6a3e824
Merge pull request #303 from scrapinghub/smaller-kafka-poll-intervals
sibiryakov Jan 12, 2018
1fe7b4c
Merge pull request #306 from clarksun/patch-1
sibiryakov Jan 24, 2018
90f685a
Merge pull request #307 from clarksun/patch-2
sibiryakov Jan 24, 2018
3cbfd67
Merge pull request #309 from clarksun/patch-4
sibiryakov Jan 24, 2018
cbeee88
hbase states table option
sibiryakov Mar 15, 2017
9d90f6b
reducing logging verbosity
sibiryakov Jan 24, 2018
db17e76
Merge pull request #301 from scrapinghub/states-table
sibiryakov Jan 24, 2018
285e03b
Send crawl stats to Kafka message bus
vshlapakov Jun 27, 2017
3c16f22
Add Mesos task id as a tag for DB worker metrics
vshlapakov Jul 10, 2017
5aaff5f
Send backend stats to message bus
vshlapakov Jul 10, 2017
c296a86
No need to setup stats log partitions
vshlapakov Jul 26, 2017
ea8fb18
Decrease log-level for stats producer
vshlapakov Aug 2, 2017
7589208
lru cache for HBaseState
vshlapakov Jul 12, 2017
457b8ee
codecs tests fixed
sibiryakov Jan 30, 2018
5d4d98a
other tests fix
sibiryakov Jan 30, 2018
e5b946c
moving stats to worker submodule
sibiryakov Jan 30, 2018
852bef5
base stats log, zmq stats log
sibiryakov Jan 30, 2018
a0fe937
Merge pull request #313 from scrapinghub/lru-cache-states
sibiryakov Jan 30, 2018
e44f827
Merge pull request #312 from scrapinghub/stats-log
sibiryakov Jan 30, 2018
500055f
Split DB worker to components based on threads
vshlapakov Jul 14, 2017
ff6c7dd
Fix attribute error related with job_id
vshlapakov Jul 24, 2017
469c64a
Run only batchgen via thread
vshlapakov Jul 24, 2017
5d7b1e6
Polish changes and fix minor issues
vshlapakov Jul 24, 2017
198e3b9
Minor code simplification
vshlapakov Jul 24, 2017
5923e21
Initial job_id must be 0 for db worker
vshlapakov Jul 24, 2017
d14c4c8
Smarted backoff for batchgen component
vshlapakov Jul 26, 2017
f6ed24f
Do back-off if no requests to schedule
vshlapakov Jul 26, 2017
deb3617
max_request_size is set to 4Mb
sibiryakov Jul 24, 2017
d5b1cc7
Update kafkabus producers settings
vshlapakov Aug 3, 2017
1b7951a
Move values to constants
vshlapakov Aug 3, 2017
b0e039e
more precise scoring, encoding from scrapy, hash for request
vshlapakov Jul 26, 2017
66941c9
changing scrapy test website
sibiryakov Apr 6, 2018
257ee93
ov buffer test fix
sibiryakov Apr 6, 2018
eed086e
Merge pull request #316 from scrapinghub/misc
sibiryakov Apr 6, 2018
637012b
making Kafka mb to support latest kafka-python
sibiryakov Apr 9, 2018
ccd554e
Merge pull request #315 from scrapinghub/kafka-params-up
sibiryakov Apr 9, 2018
3a8e2d6
string formatting corrected in the connection information being logged
akshayphilar Apr 12, 2018
0908e03
db worker test fix
sibiryakov Apr 23, 2018
90d4616
Merge pull request #314 from scrapinghub/dbw-components-teardown
sibiryakov Apr 23, 2018
786c2cd
Setting for splitting partitions b/w batchgens
vshlapakov Aug 2, 2017
4093cb3
conflicts resolved
vshlapakov Aug 2, 2017
343fe31
Config option to pass partitions
vshlapakov Aug 2, 2017
b4547c7
No need new setting
vshlapakov Aug 2, 2017
c44fe05
Minor renaming
vshlapakov Aug 3, 2017
7a1dee9
Wrap sending message logic
vshlapakov Aug 3, 2017
eaee5a4
Catch and print shutdown errors
vshlapakov Sep 6, 2017
426c1bc
Non-exhausted generators must be closed manually
vshlapakov Aug 11, 2017
e902ad3
conflicts resolved
vshlapakov Aug 21, 2017
7cb12c0
Provide SW states cache hits ratio
vshlapakov Aug 24, 2017
40a2b78
Provide flushed batches count for SW states
vshlapakov Aug 24, 2017
a3762f2
Simplify state stats using with a variable
vshlapakov Aug 24, 2017
28139a2
Provide stats for SW state cache evictions
vshlapakov Sep 6, 2017
a4d65bb
Skip links-extracted when reached limit for domain
vshlapakov Sep 1, 2017
c70b0ad
no need to flush on every batch
sibiryakov Sep 13, 2017
c3ce7cc
create_request method added to manager, etc
sibiryakov Oct 5, 2017
01b3b73
removed domain info logic from hbase.queue, domain mw enabled by default
sibiryakov Oct 5, 2017
2060f0c
enabling domain fingerprint mw by default
sibiryakov Oct 5, 2017
bea71e8
strategy tests fixed
sibiryakov May 8, 2018
c6e074f
using unittest
sibiryakov May 8, 2018
0bd0255
tests fix
sibiryakov May 8, 2018
389396a
speeding up serialization
sibiryakov May 8, 2018
8e4ffc3
Merge pull request #327 from scrapinghub/create-req-manager
sibiryakov May 8, 2018
6773404
test case
sibiryakov May 8, 2018
95170ca
Merge pull request #324 from scrapinghub/hbase-state-cache
sibiryakov May 8, 2018
117f341
Merge pull request #326 from scrapinghub/no-flush-scoring-log
sibiryakov May 8, 2018
c0c6920
Merge pull request #323 from scrapinghub/generators-pypy
sibiryakov May 8, 2018
47004eb
Merge pull request #322 from scrapinghub/log-improve-batchgen-sw
sibiryakov May 8, 2018
d402216
Merge pull request #321 from scrapinghub/part-split-btw-batchgen
sibiryakov May 8, 2018
bbfcfba
Merge pull request #320 from scrapinghub/log-connected-host
sibiryakov May 8, 2018
184a719
Merge branch 'master' into pre-filter-of-links-ext
sibiryakov May 8, 2018
930a96f
Merge pull request #325 from scrapinghub/pre-filter-of-links-ext
sibiryakov May 8, 2018
9d6ad05
removed ambiguous redirect logic
sibiryakov Nov 24, 2017
a821652
retries budget added in offsets fetch
sibiryakov Dec 25, 2017
a6057e9
fix of partitioning in batch-gen
sibiryakov Jan 30, 2018
8d6e7b1
domain stats in batch gen
sibiryakov Feb 7, 2018
88cdd46
fixes
sibiryakov Feb 7, 2018
a868aeb
hbase refactor, first DomainCache with second gen draft
sibiryakov Feb 12, 2018
6590a1c
converting keys to native str
sibiryakov Feb 13, 2018
5e4f086
set serialization fix
sibiryakov Feb 13, 2018
0542494
flush fix and dosctring
sibiryakov Feb 13, 2018
c12f021
crash fix for py3
sibiryakov Feb 16, 2018
1ed0be6
improved error output
sibiryakov Feb 16, 2018
1a627d1
add seeds run mode
sibiryakov Mar 16, 2018
47a0012
overused buffer settings
sibiryakov Mar 22, 2018
48fffd4
passing args to run()
sibiryakov Mar 23, 2018
514e5d3
removed seeds addition through spider log
sibiryakov Mar 23, 2018
825fee0
codecs test fixed
sibiryakov Mar 23, 2018
92def37
renaming add_seeds to read_seeds(file)
sibiryakov Mar 23, 2018
5146876
fixed reading of seeds
sibiryakov Mar 23, 2018
033fcc5
s3 dependency
sibiryakov Mar 23, 2018
27985a0
docs
sibiryakov Apr 10, 2018
66347c2
slot for hbase queue partitioning
sibiryakov Apr 20, 2018
fc4dee3
support of non-iterable argument
sibiryakov Apr 26, 2018
b809725
documented None values
sibiryakov May 8, 2018
1e59bf5
DEFAULT_HBASE_THRIFT_FRAME_SIZE constant
sibiryakov May 10, 2018
2b1ee27
check for a key presence before deletion
sibiryakov May 10, 2018
2211ede
disabling scrapy test
sibiryakov May 10, 2018
7afb27e
codecs tests are fixed
sibiryakov May 10, 2018
f1b6e01
overusedbuffer test cases are fixed
sibiryakov May 10, 2018
b0be022
more work to codecs fix
sibiryakov May 11, 2018
017b70e
batch_size as argument and bug fix + new test case
sibiryakov May 11, 2018
4e81956
table creation
sibiryakov May 11, 2018
167ba3e
removed add_seeds
sibiryakov May 11, 2018
775dfbb
crawling strategy test fix
sibiryakov May 11, 2018
5f5d29f
table presence fix
sibiryakov May 11, 2018
662faad
add seeds run mode test, sw test fix
sibiryakov May 11, 2018
934b2f8
memory backends tests fix
sibiryakov May 11, 2018
95c46a3
don't use hbase namespace for testing
sibiryakov May 11, 2018
d9464c5
strategy test case fix
sibiryakov May 11, 2018
92f3f54
removed add_seeds
sibiryakov May 14, 2018
1a4f8c8
syntax
sibiryakov May 14, 2018
da87cbd
Merge pull request #330 from scrapinghub/test-fix2
sibiryakov May 14, 2018
0ffb2ab
first draft version
sibiryakov May 15, 2018
d701f55
batched workflow extracted
sibiryakov May 23, 2018
4a27212
more work
sibiryakov May 23, 2018
c6c159b
FrontierManager refactoring, more to come
sibiryakov May 23, 2018
fde25fc
proper init hierarchy
sibiryakov May 25, 2018
ec47475
import, frontier_manager and strategy fixes
sibiryakov May 28, 2018
d6c2e5d
SW test case fixes
sibiryakov May 28, 2018
3affb12
tests fixed
sibiryakov May 30, 2018
32069fe
tests fixed
sibiryakov May 30, 2018
214d345
backend tests fixes
sibiryakov May 30, 2018
38de8de
default crawling strategy
sibiryakov May 31, 2018
bcc3d75
scrapy mws test fix
sibiryakov May 31, 2018
ac0da26
dbw test cases fix
sibiryakov May 31, 2018
0ac5f9c
memory backends test fixes
sibiryakov May 31, 2018
d9b1d64
Redis test case fix
sibiryakov May 31, 2018
1de3c4b
syntax
sibiryakov May 31, 2018
90484e4
more work on style
sibiryakov May 31, 2018
986fad4
fix of add seeds utility
sibiryakov May 31, 2018
2e04254
fixing distributed run mode
sibiryakov Jun 1, 2018
b90275b
auto_start/test_mode attrs refactor
sibiryakov Jun 4, 2018
7b1d3fb
move of BaseCrawlingStrategy, BFS/DFS strategy
sibiryakov Jun 18, 2018
42722d1
seeds have to be injected using crawling strategy
sibiryakov Jun 18, 2018
cd8d95e
new CS guide, and single mode quick start
sibiryakov Jun 18, 2018
df214c6
removed obsolete test
sibiryakov Jul 12, 2018
ebd8532
new path for a crawling strategy
sibiryakov Jul 12, 2018
b37e7d2
style
sibiryakov Jul 12, 2018
bca36a8
DomainMetadata interface
sibiryakov Jul 13, 2018
524dd33
using DomainMetadata in hbase DomainCache
sibiryakov Jul 13, 2018
943231b
SQLA DomainMetadata storage
sibiryakov Jul 13, 2018
7c1272a
moved domain cache test
sibiryakov Jul 13, 2018
f7044ac
DomainCache is instantiated in HBaseBackend now
sibiryakov Jul 16, 2018
1f1f4a1
removing old memory backends
sibiryakov Jul 16, 2018
276da48
removing old sqlalchemy backends
sibiryakov Jul 16, 2018
f2ba99c
pointing to existing memory backend
sibiryakov Jul 16, 2018
8865daf
backends docs update
sibiryakov Jul 20, 2018
f5da2ec
local mode instantiation
sibiryakov Jul 20, 2018
4028a36
error fix
sibiryakov Jul 20, 2018
61efc5e
style
sibiryakov Jul 20, 2018
97fe2ab
fix of references in docs
sibiryakov Jul 23, 2018
cb999cc
fixes of single process sqlalchemy tutorial
sibiryakov Jul 23, 2018
b8c1cac
Discovery crawling strategy
sibiryakov Jul 23, 2018
e339e49
removed distributed spiders run mode
sibiryakov Jul 23, 2018
b3247fc
crawling strategy docs
sibiryakov Jul 23, 2018
ca53383
style mainly
sibiryakov Jul 23, 2018
d8c9c3a
some mistakes corrected
sibiryakov Jul 24, 2018
ae138ed
putting the code under name=main
sibiryakov Jul 24, 2018
2faf280
is -> ==
sibiryakov Jul 24, 2018
fc161f6
docstring
sibiryakov Jul 24, 2018
8e13eb1
page_error -> request_error
sibiryakov Jul 24, 2018
a6d1c37
states tests extraction
sibiryakov Jul 24, 2018
2c982d2
new hbase states test
sibiryakov Jul 24, 2018
2dffcab
fix
sibiryakov Jul 24, 2018
17c4489
moving states tester
sibiryakov Jul 24, 2018
26eff5e
style
sibiryakov Jul 24, 2018
ed5c4af
using fixtures
sibiryakov Jul 24, 2018
a94ee78
sqlalchemy states
sibiryakov Jul 24, 2018
5631001
test fix
sibiryakov Jul 24, 2018
4205ea2
table creation
sibiryakov Jul 24, 2018
b46600b
updated tests
sibiryakov Jul 25, 2018
dffa5e7
style
sibiryakov Jul 25, 2018
19025cd
test fix
sibiryakov Jul 25, 2018
b31cbf1
frontera-scrapy test is enabled
sibiryakov Jul 25, 2018
95c34f2
fix warning
sibiryakov Jul 25, 2018
dbb51d7
skipping scrapy spider for now
sibiryakov Jul 25, 2018
6ad1a4c
renaming modules to support py3.7
sibiryakov Jul 25, 2018
03a5f91
to the previous commit
sibiryakov Jul 25, 2018
8ac2316
Merge pull request #331 from scrapinghub/strategy-in-manager
sibiryakov Jul 25, 2018
f91dd7a
updated with latest changes
sibiryakov Jul 25, 2018
a26a0a9
version bump
sibiryakov Jul 25, 2018
dbf15b0
Update README.md
sibiryakov Jul 25, 2018
fb05073
some refactor + devnull producer for stats
sibiryakov Jul 30, 2018
4a8f0c1
docs + example config update
sibiryakov Jul 30, 2018
264ad76
Merge pull request #341 from scrapinghub/fix-general-spider
sibiryakov Jul 30, 2018
51526c1
local_mode option
sibiryakov Jul 30, 2018
8830ce2
Merge pull request #342 from scrapinghub/fix-docs
sibiryakov Jul 30, 2018
7b6717d
version bump
sibiryakov Jul 30, 2018
fff3617
Remove invalid keyword argument 'sleep' from kafka pull
icapurro Aug 21, 2018
4fb16c9
Removed offset unit test
icapurro Aug 28, 2018
bd3211b
Fix the redis backend
vfcosta Sep 11, 2018
bd781e5
Merge pull request #347 from vfcosta/fix-redis
sibiryakov Sep 11, 2018
0b245ef
Merge pull request #345 from icapurro/fix-kafka-pull
sibiryakov Sep 12, 2018
246575a
Merge branch 'master' of https://github.com/scrapinghub/crawl-frontier
sibiryakov Nov 2, 2018
0c8c58b
scrapy role, overview update
sibiryakov Nov 2, 2018
037c3b4
Update cluster-setup docs
guillermoap Nov 14, 2018
e1a4ca9
Merge pull request #351 from guillermoap/update_docs
sibiryakov Nov 15, 2018
f5ef109
Docs: BaseCrawlingStrategy._refresh_states() also takes a non iterab…
IAlwaysBeCoding Nov 29, 2018
0c81482
Update frontera/strategy/__init__.py
sibiryakov Nov 30, 2018
94886b3
Merge pull request #355 from IAlwaysBeCoding/IAlwaysBeCoding-_refresh…
sibiryakov Dec 3, 2018
dfdc396
removing call to absent API in happybase
sibiryakov Jan 14, 2019
424f569
Update after redis update
Jan 21, 2019
e418807
fixed DomainCache crash on init
sibiryakov Apr 3, 2019
2e23f27
Merge pull request #365 from scrapinghub/domain-cache-crash-fix
sibiryakov Apr 3, 2019
f20de81
Merge pull request #361 from khellan/redis_fix
sibiryakov Apr 5, 2019
45576af
fixing Cache.__init__ signature changes
sibiryakov Apr 5, 2019
22d3bc5
more work
sibiryakov Apr 5, 2019
fd294e0
Merge pull request #366 from scrapinghub/test-fix3
sibiryakov Apr 5, 2019
9bb47c9
explicit cluster metadata bootstrap in Kafka client
sibiryakov Apr 12, 2019
4e6f547
syntax fixes
sibiryakov Apr 12, 2019
1e951c6
line breaks
sibiryakov Apr 12, 2019
5762a26
notimplemented->error
sibiryakov Apr 12, 2019
befc866
update models.py
Jul 3, 2019
2457357
Update models.py
Jul 3, 2019
13efd27
Update models.py
Jul 3, 2019
02326d2
Merge pull request #375 from liho00/liho00-patch-1
sibiryakov Jul 5, 2019
59050bf
Bump to Python 3.6
Oct 23, 2019
f05104e
Use thriftpy2
Oct 23, 2019
e4885cb
Add Python 3.7
Oct 23, 2019
4a4d8b8
Make test_zmq_message_bus() flaky
Oct 23, 2019
4e49f76
Add .env to gitignore
Oct 23, 2019
5c5972b
Add badges
Oct 24, 2019
482b4d5
Add pyversions badge
Oct 24, 2019
e6bb8da
Rename to "python versions"
Oct 24, 2019
068f169
Merge pull request #381 from suzil/cleanup/add-badges
sibiryakov Oct 25, 2019
c5a7001
Move thriftpy2 to test_requires
Oct 25, 2019
ebb4d32
Also include thriftpy2 as an hbase requirement
Oct 25, 2019
b98accd
Merge pull request #379 from suzil/bump/py36
sibiryakov Oct 25, 2019
d5f5905
add assertion error for easier debugging
a-shkarupin Nov 29, 2019
84f9e10
Merge pull request #389 from a-shkarupin/assertion_message
sibiryakov Nov 29, 2019

Files changed

4 changes: 4 additions & 0 deletions .gitignore
@@ -21,6 +21,7 @@ var/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
@@ -52,3 +53,6 @@ docs/_build/

# PyBuilder
target/

# Virtualenv
.env/
19 changes: 19 additions & 0 deletions .travis.yml
@@ -10,6 +10,7 @@ services:
- docker
- mysql
- postgresql
- redis-server

env:
global:
@@ -26,6 +27,21 @@ matrix:
- docker
- mysql
- postgresql
- redis-server
- python: 3.6
env: TOXENV=py36
services:
- docker
- mysql
- postgresql
- redis-server
- python: 3.7
env: TOXENV=py37
services:
- docker
- mysql
- postgresql
- redis-server

install:
- pip install -U tox wheel codecov
@@ -57,6 +73,9 @@ script: tox
after_success:
- codecov

after_script:
- cat broker.log

deploy:
provider: pypi
distributions: sdist bdist_wheel
20 changes: 12 additions & 8 deletions README.md
@@ -1,22 +1,28 @@
# Frontera

[![pypi](https://img.shields.io/pypi/v/frontera)](https://pypi.org/project/frontera/)
[![python versions](https://img.shields.io/pypi/pyversions/frontera.svg)](https://pypi.org/project/frontera/)
[![Build Status](https://travis-ci.org/scrapinghub/frontera.svg?branch=master)](https://travis-ci.org/scrapinghub/frontera)
[![codecov](https://codecov.io/gh/scrapinghub/frontera/branch/master/graph/badge.svg)](https://codecov.io/gh/scrapinghub/frontera)

## Overview

Frontera is a web crawling framework consisting of [crawl frontier](http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html),
and distribution/scaling primitives, allowing to build a large scale online web crawler.
Frontera is a web crawling framework consisting of [crawl frontier](http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html), and distribution/scaling primitives, allowing to build a large scale online web crawler.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by
the crawler to decide which pages to visit next, and capable of doing it in distributed manner.

## Main features

- Online operation: small requests batches, with parsing done right after fetch.
- Pluggable backend architecture: low-level storage logic is separated from crawling policy.
- Three run modes: single process, distributed spiders, distributed backend and spiders.
- Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
- Two run modes: single process and distributed.
- Built-in SqlAlchemy, Redis and HBase backends.
- Built-in Apache Kafka and ZeroMQ message buses.
- Built-in crawling strategies: breadth-first, depth-first, Discovery (with support of robots.txt and sitemaps).
- Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M of documents daily for 45 days, without downtime,
- Transparent data flow, allowing to integrate custom components easily using Kafka.
- Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
- RDBMS and HBase backends.
- Revisiting logic with RDBMS.
- Optional use of Scrapy for fetching and parsing.
- 3-clause BSD license, allowing to use in any commercial product.
- Python 3 support.
@@ -37,5 +43,3 @@ $ pip install frontera

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and
pull requests.


6 changes: 3 additions & 3 deletions docs/source/conf.py
@@ -48,16 +48,16 @@

# General information about the project.
project = u'Frontera'
copyright = u'2014-2016, Frontera authors'
copyright = u'2014-2018, Frontera authors'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.6'
version = '0.8'
# The full version, including alpha/beta/rc tags.
release = '0.6.0'
release = '0.8.0.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
29 changes: 12 additions & 17 deletions docs/source/index.rst
@@ -4,16 +4,14 @@
Frontera |version| documentation
================================

`Frontera`_ is a web crawling tool box, allowing to build crawlers of any scale and purpose.
`Frontera`_ is a web crawling tool box, allowing to build crawlers of any scale and purpose. It includes:

`Frontera`_ provides :ref:`crawl frontier <crawl-frontier>` framework by managing *when* and *what* to crawl next,
and checking for *crawling goal* accomplishment.
* :ref:`crawl frontier <crawl-frontier>` framework managing *when* and *what* to crawl and checking for *crawling goal* accomplishment,

Frontera also provides replication, sharding and isolation of all crawler components to scale and distribute it.
* workers, Scrapy wrappers, and data bus components to scale and distribute the crawler.

Frontera contain components to allow creation of fully-operational web crawler with `Scrapy`_. Even though it was
originally designed for Scrapy, it can also be used with any other crawling framework/system as the framework offers
a generic tool box.
originally designed for Scrapy, it can also be used with any other crawling framework/system.


Introduction
@@ -53,18 +51,22 @@ Using Frontera
:hidden:

topics/installation
topics/strategies
topics/frontier-objects
topics/frontier-middlewares
topics/frontier-canonicalsolvers
topics/frontier-backends
topics/message_bus
topics/own_crawling_strategy
topics/custom_crawling_strategy
topics/scrapy-integration
topics/frontera-settings

:doc:`topics/installation`
HOWTO and Dependencies options.

:doc:`topics/strategies`
A list of built-in crawling strategies.

:doc:`topics/frontier-objects`
Understand the classes used to represent requests and responses.

@@ -75,13 +77,13 @@ Using Frontera
Identify and make use of canonical url of document.

:doc:`topics/frontier-backends`
Define your own crawling policy and custom storage.
Built-in backends, and tips on implementing your own.

:doc:`topics/message_bus`
Built-in message bus reference.

:doc:`topics/own_crawling_strategy`
Implementing own crawling strategy for distributed backend.
:doc:`topics/custom_crawling_strategy`
Implementing your own crawling strategy.

:doc:`topics/scrapy-integration`
Learn how to use Frontera with Scrapy.
@@ -130,13 +132,10 @@ Developer documentation
topics/tests
topics/loggers
topics/frontier-tester
topics/faq
topics/contributing
topics/glossary




:doc:`topics/architecture`
See how Frontera works and its different components.

@@ -158,13 +157,9 @@
:doc:`topics/frontier-tester`
Test your frontier in an easy way.

:doc:`topics/faq`
Frequently asked questions.

:doc:`topics/contributing`
HOWTO contribute.


:doc:`topics/glossary`
Glossary of terms.

Binary file modified docs/source/topics/_images/high-level-arc.png
2 changes: 1 addition & 1 deletion docs/source/topics/architecture.rst
@@ -102,7 +102,7 @@ Where *sharded* means component consumes messages of assigned partition only, e.
stream, and *replicated* is when components consume stream regardless of partitioning.

Such design allows to operate online. Crawling strategy can be changed without having to stop the crawl. Also
:doc:`crawling strategy <own_crawling_strategy>` can be implemented as a separate module; containing logic
:doc:`crawling strategy <custom_crawling_strategy>` can be implemented as a separate module; containing logic
for checking the crawling stopping condition, URL ordering, and scoring model.

Frontera is polite to web hosts by design and each host is downloaded by no more than one spider process.
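The hunk above distinguishes *sharded* from *replicated* consumption. The sharding that enforces the one-host-one-spider guarantee is simply a stable hash from hostname to spider feed partition; Frontera ships a comparable `Crc32NamePartitioner`, and the following is only a minimal sketch of that idea:

```python
# Minimal illustration (not Frontera's code) of host-to-partition pinning:
# a stable hash maps every hostname to exactly one spider feed partition,
# so each host is downloaded by a single spider process.
from zlib import crc32


def partition_for_host(hostname, partitions):
    if not hostname:
        return partitions[0]
    return partitions[crc32(hostname.encode('utf-8')) % len(partitions)]


# deterministic assignment is what makes the politeness guarantee hold
assert partition_for_host('example.com', [0, 1, 2]) == \
    partition_for_host('example.com', [0, 1, 2])
```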
30 changes: 17 additions & 13 deletions docs/source/topics/cluster-setup.rst
@@ -20,7 +20,7 @@ Things to setup before you start

Things to implement before you start
====================================
* :doc:`Crawling strategy <own_crawling_strategy>`
* :doc:`Crawling strategy <custom_crawling_strategy>` or :doc:`pick one from Frontera package <strategies>`
* Spider code

Configuring Kafka
@@ -43,7 +43,7 @@ Configuring HBase

Configuring Frontera
====================
Every Frontera component requires it's own configuration module, but some options are shared, so we recommend to create
Every Frontera component requires its own configuration module, but some options are shared, so we recommend to create
a common modules and import settings from it in component's modules.

1. Create a common module and add there: ::
@@ -88,7 +88,7 @@ a common modules and import settings from it in component's modules.
from __future__ import absolute_import
from .worker import *

CRAWLING_STRATEGY = '' # path to the crawling strategy class
STRATEGY = '' # path to the crawling strategy class
LOGGING_CONFIG='logging-sw.conf' # if needed

The logging can be configured according to https://docs.python.org/2/library/logging.config.html see the
@@ -101,6 +101,7 @@ The logging can be configured according to https://docs.python.org/2/library/log

BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
KAFKA_GET_TIMEOUT = 0.5
LOCAL_MODE = False # by default Frontera is prepared for single process mode


6. Configure Scrapy settings module. It's located in Scrapy project folder and referenced in scrapy.cfg. Let's add
@@ -125,9 +126,11 @@ Starting the cluster
First, let's start storage worker: ::

# start DB worker only for batch generation
$ python -m frontera.worker.db --config [db worker config module] --no-incoming
...
# Then start next one dedicated to spider log processing
# use single instance for every 10 partitions
$ python -m frontera.worker.db --config [db worker config module] --no-incoming --partitions 0 1


# Optionally, start next one dedicated to spider log processing.
$ python -m frontera.worker.db --no-batches --config [db worker config module]


@@ -141,17 +144,18 @@ Next, let's start strategy workers, one process per spider log partition: ::
You should notice that all processes are writing messages to the log. It's ok if nothing is written in streams,
because of absence of seed URLs in the system.

Let's put our seeds in text file, one URL per line and start spiders. A single spider per spider feed partition: ::
Let's put our seeds in text file, one URL per line and run::

$ python -m frontera.utils.add_seeds --config [your_frontera_config] --seeds-file [path to your seeds file]

Finally, a single spider per spider feed partition: ::

$ scrapy crawl [spider] -L INFO -s SEEDS_SOURCE = 'seeds.txt' -s SPIDER_PARTITION_ID=0
...
$ scrapy crawl [spider] -L INFO -s SPIDER_PARTITION_ID=1
$ scrapy crawl [spider] -L INFO -s SPIDER_PARTITION_ID=2
...
$ scrapy crawl [spider] -L INFO -s SPIDER_PARTITION_ID=N

You should end up with N spider processes running. Usually it's enough for a single instance to read seeds from
``SEEDS_SOURCE`` variable to pass seeds to Frontera cluster. Seeds are only read if spider queue is empty.
:setting:`SPIDER_PARTITION_ID` can be read from config file also.
You should end up with N spider processes running. Also :setting:`SPIDER_PARTITION_ID` can be read from config file.

After some time seeds will pass the streams and will be scheduled for downloading by workers. Crawler is bootstrapped.
You're done, crawler should start crawling. Any component can be restarted any time, without major data loss. However,
for pausing its enough to stop batch gen only.
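Pulling this section's fragments together, one possible layout for the configuration modules is sketched below. The setting names mirror the snippets above, including the `STRATEGY` and `LOCAL_MODE` options these docs introduce; hosts, class paths and partition counts are placeholders, not values from the PR:

```python
# config/common.py -- shared options imported by every component's module
MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
KAFKA_LOCATION = 'localhost:9092'   # placeholder broker address
SPIDER_LOG_PARTITIONS = 2           # one strategy worker per partition
SPIDER_FEED_PARTITIONS = 2          # one spider process per partition

# config/sw.py -- strategy worker: runs the crawling strategy
# from .common import *
# BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
# STRATEGY = 'myproject.strategy.MyCrawlingStrategy'

# config/spider.py -- spider side: talks to the cluster via the message bus
# from .common import *
# BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
# LOCAL_MODE = False
```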