Field | Type | Label | Description |
state | bool | | |
local | bool | | |
Parameter message for BlockQueueUntil
Field | Type | Label | Description |
key | string | | ID for the queue |
time | uint64 | | Expressed in seconds of UTC time since Unix epoch 1970-01-01T00:00:00Z. The default value of 0 will unblock the queue. |
crawlID | string | | crawl ID |
local | bool | | only for this instance |
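The `time` field is an absolute timestamp, not a duration, so a client reacting to something like a Retry-After delay has to convert it first. A minimal sketch of that conversion follows; the helper name `block_until_timestamp` is hypothetical and not part of the API, and mapping non-positive delays to 0 (unblock) is an assumption made for illustration.

```python
import time


def block_until_timestamp(retry_after_seconds, now=None):
    """Return an absolute epoch timestamp (seconds) for the `time` field.

    `retry_after_seconds` is a relative delay, e.g. taken from a Retry-After
    header. A result of 0 unblocks the queue, so non-positive delays map to 0
    (an assumption of this sketch, not documented behaviour).
    """
    if retry_after_seconds <= 0:
        return 0
    if now is None:
        now = int(time.time())  # seconds since the Unix epoch, UTC
    return now + int(retry_after_seconds)
```

For example, with `now=1700000000` and a delay of 120 seconds the value to send would be 1700000120.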
Field | Type | Label | Description |
state | bool | | |
Parameter message for SetCrawlLimit
Field | Type | Label | Description |
key | string | | ID for the queue |
limit | uint32 | | |
crawlID | string | | crawl ID |
Field | Type | Label | Description |
value | string | | |
local | bool | | |
URL discovered during the crawl, might already be known in the URL Frontier or not.
Field | Type | Label | Description |
info | URLInfo | | |
Parameter message for GetURLs
Field | Type | Label | Description |
max_urls_per_queue | uint32 | | maximum number of URLs per queue, the default value of 0 means no limit |
max_queues | uint32 | | maximum number of queues to get URLs from, the default value of 0 means no limit |
key | string | | queue id if restricting to a specific queue |
delay_requestable | uint32 | | delay in seconds before a URL can be unlocked and sent again for fetching |
anyCrawlID | AnyCrawlID | | |
crawlID | string | | |
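The interaction of `max_urls_per_queue` and `max_queues`, with 0 meaning "no limit" for both, can be modelled in a few lines. This is only an in-memory illustration of the documented limits, not the service's implementation; the function `select_urls` and the use of a plain dict for the queues are assumptions, and the real service decides for itself which queues are visited first.

```python
from itertools import islice


def select_urls(queues, max_urls_per_queue=0, max_queues=0):
    """Model how the two limits bound a GetURLs response.

    `queues` maps a queue key to the list of URLs due for fetching.
    A limit of 0 means no limit, mirroring the documented defaults.
    """
    queue_limit = max_queues if max_queues > 0 else None
    url_limit = max_urls_per_queue if max_urls_per_queue > 0 else None
    selected = []
    # Visit at most `queue_limit` queues, in dict insertion order.
    for key, urls in islice(queues.items(), queue_limit):
        # Take at most `url_limit` URLs from each visited queue.
        selected.extend((key, url) for url in islice(urls, url_limit))
    return selected
```

With M = `max_queues` and N = `max_urls_per_queue`, a response therefore holds at most M x N URLs when both limits are set.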
URL which was already known in the frontier, was returned by GetURLs() and processed by the crawler. Used for updating the information
about it in the frontier. If the date is not set, the URL will be considered done and won't be resubmitted for fetching; otherwise
it will be eligible for fetching after the delay has elapsed.
Field | Type | Label | Description |
info | URLInfo | | |
refetchable_from_date | uint64 | | Expressed in seconds of UTC time since Unix epoch 1970-01-01T00:00:00Z. Optional, the default value of 0 indicates that a URL should not be refetched. |
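The two meanings of `refetchable_from_date` (0 for "done", a timestamp for "refetch no earlier than") can be sketched as below. Both helper names are hypothetical, and the comparison is a plain reading of the description above rather than the frontier's actual scheduling code.

```python
def refetch_date(now, refetch_interval_seconds=None):
    """Value for `refetchable_from_date` after a URL has been processed.

    None means the URL is done, which is expressed as the default value 0.
    """
    if refetch_interval_seconds is None:
        return 0
    return now + refetch_interval_seconds


def is_due_for_refetch(refetchable_from_date, now):
    """A date of 0 means never refetch; otherwise due once `now` reaches it."""
    if refetchable_from_date == 0:
        return False
    return now >= refetchable_from_date
```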
Field | Type | Label | Description |
start | uint32 | | position of the first result in the list; defaults to 0 |
size | uint32 | | max number of values; defaults to 100 |
key | string | | ID for the queue |
crawlID | string | | crawl ID |
local | bool | | only for the current local instance |
Field | Type | Label | Description |
local | bool | | |
Configuration of the log level for a particular package, e.g.
crawlercommons.urlfrontier.service.rocksdb DEBUG
Field | Type | Label | Description |
value | uint64 | | |
Pagination
Field | Type | Label | Description |
start | uint32 | | position of the first result in the list; defaults to 0 |
size | uint32 | | max number of values; defaults to 100 |
include_inactive | bool | | include inactive queues; defaults to false |
crawlID | string | | crawl ID |
local | bool | | only for the current local instance |
Parameter message for SetDelay
Field | Type | Label | Description |
key | string | | ID for the queue - an empty value sets the default for all the queues |
delay_requestable | uint32 | | delay in seconds before a queue can provide new URLs |
crawlID | string | | crawl ID - empty string for default |
local | bool | | only for this instance |
Returned by ListQueues
Field | Type | Label | Description |
values | string | repeated | |
total | uint64 | | total number of queues |
start | uint32 | | position of the first result in the list |
size | uint32 | | number of values returned |
crawlID | string | | crawl ID - empty string for default |
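The `start`, `size` and `total` fields are enough for a client to page through every queue. The sketch below models the response as a plain dict with the field names from the table above; `list_queues` stands in for the remote call and `iter_all_queues` is a hypothetical client-side loop, so neither is the real API.

```python
def list_queues(all_queues, start=0, size=100):
    """Stand-in for ListQueues: return one page shaped like the response."""
    values = all_queues[start:start + size]
    return {
        "values": values,
        "total": len(all_queues),  # total number of queues
        "start": start,            # position of the first result
        "size": len(values),       # number of values actually returned
    }


def iter_all_queues(all_queues, page_size=100):
    """Request successive pages until `total` has been reached."""
    start = 0
    while True:
        page = list_queues(all_queues, start, page_size)
        yield from page["values"]
        start += page["size"]
        # Stop on an empty page or once every queue has been seen.
        if page["size"] == 0 or start >= page["total"]:
            break
```

Advancing `start` by the returned `size`, rather than by the requested one, keeps the loop correct when the last page is shorter than the others.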
Field | Type | Label | Description |
key | string | | ID for the queue |
crawlID | string | | crawl ID - empty string for default |
local | bool | | only for this instance |
Message returned by the GetStats method
Field | Type | Label | Description |
size | uint64 | | number of active URLs in queues |
inProcess | uint32 | | number of URLs currently in flight |
counts | Stats.CountsEntry | repeated | custom counts |
numberOfQueues | uint64 | | number of active queues in the frontier |
crawlID | string | | crawl ID |
Field | Type | Label | Description |
values | string | repeated | |
Field | Type | Label | Description |
url | string | | URL |
key | string | | The key is used to put the URLs into queues; the value can be anything set by the client but would typically be the hostname, domain name or IP of the URL. If not set, the service will use a sensible default like hostname. |
metadata | URLInfo.MetadataEntry | repeated | Arbitrary key/value pairs stored alongside the URL. Can be anything needed by the crawler, such as HTTP status, source URL, etc. |
crawlID | string | | crawl ID |
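The description of `key` only says the service falls back to "a sensible default like hostname". A client that prefers to set the key itself could derive it as sketched below; `queue_key` is a hypothetical helper and the hostname fallback is an assumption modelled on that description, not the service's exact rule.

```python
from urllib.parse import urlsplit


def queue_key(url, key=""):
    """Return the explicit key if set, else fall back to the URL's hostname."""
    if key:
        return key
    # `hostname` is lower-cased and excludes port and credentials;
    # it is None when the URL has no network location.
    return urlsplit(url).hostname or ""
```

Passing a registered domain as `key` instead of relying on the hostname would group all subdomains of a site into a single queue.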
Wrapper for a KnownURLItem or DiscoveredURLItem
Field | Type | Label | Description |
url | string | | URL for which we request info |
key | string | | ID for the queue |
crawlID | string | | crawl ID - empty string for default |
Name | Number | Description |
OK | 0 | |
SKIPPED | 1 | |
FAIL | 2 | |
Name | Number | Description |
TRACE | 0 | |
DEBUG | 1 | |
INFO | 2 | |
WARN | 3 | |
ERROR | 4 | |
Method Name | Request Type | Response Type | Description |
ListNodes | Empty | StringList | Return the list of nodes forming the cluster the current node belongs to |
ListCrawls | Local | StringList | Return the list of crawls handled by the frontier(s) |
DeleteCrawl | DeleteCrawlMessage | Long | Delete an entire crawl, returns the number of URLs removed this way |
ListQueues | Pagination | QueueList | Return a list of queues for a specific crawl. Can choose whether to include inactive queues (a queue is active if it has URLs due for fetching); by default the service will return up to 100 results from offset 0 and exclude inactive queues. |
GetURLs | GetParams | URLInfo stream | Stream URLs due for fetching from M queues with up to N items per queue |
PutURLs | URLItem stream | AckMessage stream | Push URL items to the server; they get created (if they don't already exist) in case of DiscoveredURLItems or updated if KnownURLItems |
GetStats | QueueWithinCrawlParams | Stats | Return stats for a specific queue or an entire crawl. Does not aggregate the stats across different crawl IDs. |
DeleteQueue | QueueWithinCrawlParams | Long | Delete the queue based on the key in parameter, returns the number of URLs removed this way |
BlockQueueUntil | BlockQueueParams | Empty | Block a queue from sending URLs; the argument is the number of seconds of UTC time since Unix epoch 1970-01-01T00:00:00Z. The default value of 0 will unblock the queue. The block will get removed once the time indicated in argument is reached. This is useful for cases where a server returns a Retry-After for instance. |
SetActive | Active | Empty | De/activate the crawl. GetURLs will not return anything until SetActive is set to true. PutURLs will still take incoming data. |
GetActive | Local | Boolean | Returns true if the crawl is active, false if it has been deactivated with SetActive(Boolean) |
SetDelay | QueueDelayParams | Empty | Set a delay for a given queue. No URLs will be obtained via GetURLs for this queue until the number of seconds specified has elapsed since the last time URLs were retrieved. Usually informed by the delay setting of robots.txt. |
SetLogLevel | LogLevelParams | Empty | Overrides the log level for a given package |
SetCrawlLimit | CrawlLimitParams | Empty | Sets crawl limit for domain |
GetURLStatus | URLStatusRequest | URLItem | Get the status of a particular URL. This does not take into account URL scheduling. Used to check the current status of a URL within the frontier. |
ListURLs | ListUrlParams | URLItem stream | List all URLs currently in the frontier. This does not take into account URL scheduling. Used to check the current status of all URLs within the frontier. |
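SetActive, BlockQueueUntil and SetDelay each restrict what GetURLs may return. Read together, the descriptions suggest a per-queue check along the following lines. This is an illustrative model only: the function and its parameter names are hypothetical, and the service may combine or order these conditions differently.

```python
def queue_is_eligible(now, crawl_active, blocked_until, last_retrieved, delay):
    """Model whether GetURLs could return URLs from a queue at time `now`.

    All times are seconds since the Unix epoch; `delay` is in seconds.
    """
    if not crawl_active:
        # SetActive(false): GetURLs returns nothing.
        return False
    if blocked_until > now:
        # BlockQueueUntil: the block is removed once the time is reached.
        # A value of 0 never blocks, matching "0 will unblock the queue".
        return False
    # SetDelay: the delay must have elapsed since URLs were last retrieved.
    return now - last_retrieved >= delay
```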
.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
double | | double | double | float | float64 | double | float | Float |
float | | float | float | float | float32 | float | float | Float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |
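The notes on variable-length encoding become concrete once the encoded sizes are compared. The sketch below is a small stand-alone encoder for base-128 varints and the ZigZag mapping used by sint32, written for illustration only; the function names are hypothetical and real code should rely on a protobuf library.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int32(value):
    """int32: negative values are sign-extended to 64 bits before encoding."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def encode_sint32(value):
    """sint32: ZigZag maps small magnitudes of either sign to small values."""
    return encode_varint(((value << 1) ^ (value >> 31)) & 0xFFFFFFFF)


print(len(encode_int32(-1)))        # 10 bytes as a plain int32
print(len(encode_sint32(-1)))       # 1 byte with ZigZag
print(len(encode_varint(2**28)))    # 5 bytes, versus 4 for fixed32
```

A varint carries 7 payload bits per byte, so values below 2^28 fit in four bytes and anything larger needs five, which is why fixed32 wins for consistently large values.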