Protocol Documentation

urlfrontier.proto

AckMessage

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| ID | string | | ID that was specified by the client |
| status | AckMessage.Status | | |

Active

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| state | bool | | |
| local | bool | | |

AnyCrawlID

BlockQueueParams

Parameter message for BlockQueueUntil

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | ID for the queue |
| time | uint64 | | Expressed in seconds of UTC time since Unix epoch 1970-01-01T00:00:00Z. The default value of 0 will unblock the queue. |
| crawlID | string | | crawl ID |
| local | bool | | only for this instance |
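
The `time` field is an absolute timestamp, not a duration, so a client reacting to a `Retry-After` header has to convert it first. The following is a minimal sketch of that conversion using only the Python standard library; the function name `block_until_from_retry_after` is hypothetical and not part of the URL Frontier API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def block_until_from_retry_after(retry_after: str, now: Optional[int] = None) -> int:
    """Convert a Retry-After header value into epoch seconds for `time`.

    Hypothetical helper: `now` is the current time in epoch seconds and
    defaults to the system clock in UTC.
    """
    if now is None:
        now = int(datetime.now(timezone.utc).timestamp())
    value = retry_after.strip()
    if value.isdigit():
        # Delta form: a number of seconds to wait, counted from now.
        return now + int(value)
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    return int(parsedate_to_datetime(value).timestamp())
```

The returned value would be placed in `BlockQueueParams.time`; sending 0 instead unblocks the queue, as described above.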

Boolean

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| state | bool | | |

CrawlLimitParams

Parameter message for SetCrawlLimit

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | ID for the queue |
| limit | uint32 | | |
| crawlID | string | | crawl ID |

DeleteCrawlMessage

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| value | string | | |
| local | bool | | |

DiscoveredURLItem

URL discovered during the crawl; it may or may not already be known to the URL Frontier.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| info | URLInfo | | |

Empty

GetParams

Parameter message for GetURLs

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| max_urls_per_queue | uint32 | | maximum number of URLs per queue, the default value of 0 means no limit |
| max_queues | uint32 | | maximum number of queues to get URLs from, the default value of 0 means no limit |
| key | string | | queue ID if restricting to a specific queue |
| delay_requestable | uint32 | | delay in seconds before a URL can be unlocked and sent again for fetching |
| anyCrawlID | AnyCrawlID | | |
| crawlID | string | | |
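
To make the "0 means no limit" convention concrete, here is a small in-memory sketch of how `max_urls_per_queue`, `max_queues` and `key` narrow a selection. It is an illustration only: the function `select_urls` and the dictionary of queues are hypothetical, and the real service also takes scheduling, delays and in-flight URLs into account.

```python
from typing import Dict, List


def select_urls(queues: Dict[str, List[str]], max_urls_per_queue: int = 0,
                max_queues: int = 0, key: str = "") -> List[str]:
    """Apply the GetParams limits to an in-memory map of queue key -> URLs."""
    # Restrict to one queue when a key is given, otherwise consider all queues.
    names = [key] if key else list(queues)
    if max_queues > 0:  # 0 means no limit on the number of queues
        names = names[:max_queues]
    selected: List[str] = []
    for name in names:
        urls = queues.get(name, [])
        if max_urls_per_queue > 0:  # 0 means no limit per queue
            urls = urls[:max_urls_per_queue]
        selected.extend(urls)
    return selected
```

With three queues holding 3, 1 and 2 URLs, `select_urls(queues, max_urls_per_queue=2, max_queues=2)` returns at most two URLs from each of the first two queues.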

KnownURLItem

URL which was already known in the frontier, was returned by GetURLs() and processed by the crawler. Used for updating the information about it in the frontier. If refetchable_from_date is not set, the URL is considered done and will not be resubmitted for fetching; otherwise it becomes eligible for fetching once that date is reached.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| info | URLInfo | | |
| refetchable_from_date | uint64 | | Expressed in seconds of UTC time since Unix epoch 1970-01-01T00:00:00Z. Optional, the default value of 0 indicates that a URL should not be refetched. |
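
A crawler typically knows when it fetched a page and how long it wants to wait before revisiting it, while `refetchable_from_date` expects an absolute epoch value. The sketch below shows one way to compute it; the helper name and its parameters are hypothetical and not part of the protocol.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def refetchable_from_date(fetched_at: datetime, refetch_after: Optional[timedelta]) -> int:
    """Return the epoch-seconds value for KnownURLItem.refetchable_from_date.

    `refetch_after=None` means the URL is done, which maps to the default 0.
    """
    if refetch_after is None:
        return 0
    if fetched_at.tzinfo is None:
        # A naive datetime would be interpreted in local time; insist on an explicit zone.
        raise ValueError("fetched_at must be timezone-aware")
    return int((fetched_at + refetch_after).timestamp())
```

For instance, a page fetched at `datetime(2024, 1, 1, tzinfo=timezone.utc)` with `timedelta(days=1)` becomes refetchable at the epoch value of 2024-01-02T00:00:00Z.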

ListUrlParams

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| start | uint32 | | position of the first result in the list; defaults to 0 |
| size | uint32 | | max number of values; defaults to 100 |
| key | string | | ID for the queue |
| crawlID | string | | crawl ID |
| local | bool | | only for the current local instance |

Local

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| local | bool | | |

LogLevelParams

Configuration of the log level for a particular package, e.g. crawlercommons.urlfrontier.service.rocksdb DEBUG

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| package | string | | |
| level | LogLevelParams.Level | | |
| local | bool | | only for this instance |

Long

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| value | uint64 | | |

Pagination

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| start | uint32 | | position of the first result in the list; defaults to 0 |
| size | uint32 | | max number of values; defaults to 100 |
| include_inactive | bool | | include inactive queues; defaults to false |
| crawlID | string | | crawl ID |
| local | bool | | only for the current local instance |

QueueDelayParams

Parameter message for SetDelay

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | ID for the queue - an empty value sets the default for all the queues |
| delay_requestable | uint32 | | delay in seconds before a queue can provide new URLs |
| crawlID | string | | crawl ID - empty string for default |
| local | bool | | only for this instance |

QueueList

Returned by ListQueues

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| values | string | repeated | |
| total | uint64 | | total number of queues |
| start | uint32 | | position of the first result in the list |
| size | uint32 | | number of values returned |
| crawlID | string | | crawl ID - empty string for default |
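
Because `QueueList` reports `total` alongside the returned page, a client can walk all queues by advancing `start` until it reaches `total`. The loop below is a sketch of that pattern; `list_queues` stands in for whatever client call returns a page, and `fake_list_queues` is a hypothetical in-memory backend used only to make the example runnable.

```python
from typing import Callable, Dict, Iterator


def iter_all_queues(list_queues: Callable[[int, int], Dict],
                    page_size: int = 100) -> Iterator[str]:
    """Yield every queue key by paging with the start/size/total fields."""
    start = 0
    while True:
        page = list_queues(start, page_size)
        yield from page["values"]
        start += len(page["values"])
        # Stop on an empty page or once every reported queue has been seen.
        if not page["values"] or start >= page["total"]:
            break


def fake_list_queues(start: int, size: int) -> Dict:
    """Hypothetical backend holding 250 queues, shaped like a QueueList."""
    names = [f"host{i}.example" for i in range(250)]
    chunk = names[start:start + size]
    return {"values": chunk, "total": len(names), "start": start, "size": len(chunk)}
```

Calling `list(iter_all_queues(fake_list_queues))` issues three page requests (100, 100 and 50 results) and collects all 250 keys.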

QueueWithinCrawlParams

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | ID for the queue |
| crawlID | string | | crawl ID - empty string for default |
| local | bool | | only for this instance |

Stats

Message returned by the GetStats method

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| size | uint64 | | number of active URLs in queues |
| inProcess | uint32 | | number of URLs currently in flight |
| counts | Stats.CountsEntry | repeated | custom counts |
| numberOfQueues | uint64 | | number of active queues in the frontier |
| crawlID | string | | crawl ID |

Stats.CountsEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | uint64 | | |

StringList

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| values | string | repeated | |

URLInfo

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| url | string | | URL |
| key | string | | The key is used to put the URLs into queues. The value can be anything set by the client but would typically be the hostname, domain name or IP of the URL. If not set, the service will use a sensible default like the hostname. |
| metadata | URLInfo.MetadataEntry | repeated | Arbitrary key / values stored alongside the URL. Can be anything needed by the crawler, such as HTTP status, source URL, etc. |
| crawlID | string | | crawl ID |
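
The `key` decides which queue a URL lands in. As an illustration of the hostname default mentioned above, the sketch below derives a key on the client side with the Python standard library. The function `queue_key` is hypothetical; the exact default applied by a given service implementation is an assumption here and may differ.

```python
from urllib.parse import urlsplit


def queue_key(url: str, key: str = "") -> str:
    """Return the explicit key if set, otherwise fall back to the URL's hostname."""
    if key:
        return key
    host = urlsplit(url).hostname  # lower-cased, without port or credentials
    if host is None:
        raise ValueError(f"cannot derive a queue key from {url!r}")
    return host
```

A client that prefers one queue per registered domain rather than per hostname would pass its own `key` explicitly.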

URLInfo.MetadataEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | StringList | | |

URLItem

Wrapper for a KnownURLItem or DiscoveredURLItem

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| discovered | DiscoveredURLItem | | |
| known | KnownURLItem | | |
| ID | string | | Identifier specified by the client, if missing, the URL is returned |

URLStatusRequest

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| url | string | | URL for which we request info |
| key | string | | ID for the queue |
| crawlID | string | | crawl ID - empty string for default |

AckMessage.Status

| Name | Number | Description |
| --- | --- | --- |
| OK | 0 | |
| SKIPPED | 1 | |
| FAIL | 2 | |
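
Each acknowledgement carries the ID of the item it refers to, and that ID falls back to the URL when the client did not set one. The sketch below shows how a client might match acknowledgements to the items it sent and collect the failures. The `Status` enum mirrors the values in the table above; `ack_id` and `failed_urls` are hypothetical helpers, and acknowledgements are modelled as plain `(ID, status)` tuples rather than generated message classes.

```python
from enum import IntEnum
from typing import Dict, Iterable, List, Tuple


class Status(IntEnum):
    """Mirror of AckMessage.Status."""
    OK = 0
    SKIPPED = 1
    FAIL = 2


def ack_id(url: str, item_id: str = "") -> str:
    """ID expected in the acknowledgement: the client ID, or the URL if none was set."""
    return item_id or url


def failed_urls(sent: Dict[str, str], acks: Iterable[Tuple[str, int]]) -> List[str]:
    """Return the URLs whose acknowledgement status is FAIL.

    `sent` maps the expected ack ID to the URL that was pushed.
    """
    return [sent[ident] for ident, status in acks
            if ident in sent and Status(status) is Status.FAIL]
```

A client could retry or log the URLs returned by `failed_urls` once the acknowledgement stream has been consumed.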

LogLevelParams.Level

| Name | Number | Description |
| --- | --- | --- |
| TRACE | 0 | |
| DEBUG | 1 | |
| INFO | 2 | |
| WARN | 3 | |
| ERROR | 4 | |

URLFrontier

| Method Name | Request Type | Response Type | Description |
| --- | --- | --- | --- |
| ListNodes | Empty | StringList | Return the list of nodes forming the cluster the current node belongs to |
| ListCrawls | Local | StringList | Return the list of crawls handled by the frontier(s) |
| DeleteCrawl | DeleteCrawlMessage | Long | Delete an entire crawl; returns the number of URLs removed this way |
| ListQueues | Pagination | QueueList | Return a list of queues for a specific crawl. Can choose whether to include inactive queues (a queue is active if it has URLs due for fetching); by default the service will return up to 100 results from offset 0 and exclude inactive queues. |
| GetURLs | GetParams | URLInfo stream | Stream URLs due for fetching from M queues with up to N items per queue |
| PutURLs | URLItem stream | AckMessage stream | Push URL items to the server; they get created (if they don't already exist) in case of DiscoveredURLItems or updated if KnownURLItems |
| GetStats | QueueWithinCrawlParams | Stats | Return stats for a specific queue or an entire crawl. Does not aggregate the stats across different crawl IDs. |
| DeleteQueue | QueueWithinCrawlParams | Long | Delete the queue based on the key in parameter; returns the number of URLs removed this way |
| BlockQueueUntil | BlockQueueParams | Empty | Block a queue from sending URLs; the argument is the number of seconds of UTC time since Unix epoch 1970-01-01T00:00:00Z. The default value of 0 will unblock the queue. The block will get removed once the time indicated in argument is reached. This is useful for cases where a server returns a Retry-After, for instance. |
| SetActive | Active | Empty | De/activate the crawl. GetURLs will not return anything until SetActive is set to true. PutURLs will still take incoming data. |
| GetActive | Local | Boolean | Returns true if the crawl is active, false if it has been deactivated with SetActive(Boolean) |
| SetDelay | QueueDelayParams | Empty | Set a delay for a given queue. No URLs will be obtained via GetURLs for this queue until the number of seconds specified has elapsed since the last time URLs were retrieved. Usually informed by the delay setting of robots.txt. |
| SetLogLevel | LogLevelParams | Empty | Overrides the log level for a given package |
| SetCrawlLimit | CrawlLimitParams | Empty | Sets crawl limit for domain |
| GetURLStatus | URLStatusRequest | URLItem | Get the status of a particular URL. This does not take into account URL scheduling. Used to check the current status of a URL within the frontier. |
| ListURLs | ListUrlParams | URLItem stream | List all URLs currently in the frontier. This does not take into account URL scheduling. Used to check the current status of all URLs within the frontier. |
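
SetActive, BlockQueueUntil and SetDelay each gate what GetURLs can return, and it can help to see the three conditions side by side. The following is a minimal in-memory sketch of how they might combine for a single queue. It is not the service's implementation: `QueueState`, its `last_retrieved` bookkeeping field and `can_provide_urls` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    delay_requestable: int = 0  # seconds between retrievals, as set via SetDelay
    blocked_until: int = 0      # epoch seconds set via BlockQueueUntil (0 = not blocked)
    last_retrieved: int = 0     # epoch seconds of the last retrieval (hypothetical bookkeeping)


def can_provide_urls(queue: QueueState, crawl_active: bool, now: int) -> bool:
    """Check whether a queue may hand out URLs at time `now` (epoch seconds)."""
    if not crawl_active:
        # A deactivated crawl returns nothing from GetURLs.
        return False
    if now < queue.blocked_until:
        # The block is lifted once the indicated time is reached.
        return False
    # The delay counts from the last time URLs were retrieved from this queue.
    return now - queue.last_retrieved >= queue.delay_requestable
```

For example, a queue with `delay_requestable=5`, `blocked_until=1000` and `last_retrieved=990` cannot provide URLs at time 995 but can at time 1000.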

Scalar Value Types

| .proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| double | | double | double | float | float64 | double | float | Float |
| float | | float | float | float | float32 | float | float | Float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
| sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |