Specifying CrawlDB in Config

Kevin Yan edited this page Mar 22, 2021 · 6 revisions

General note on config file modifications:

The config file in question is sparkler-default.yaml. However, three copies of sparkler-default.yaml are in use, so any change should be made to all three files for consistency.

Specifying which crawldb to use

The section of the config file pertaining to crawldb is set up as follows (subject to change):

  crawldb.backend: solr

  solr.uri: http://localhost:8983/solr/crawldb
  elasticsearch.uri: http://localhost:9200

The 'crawldb.backend' field specifies which crawldb to use. Its value must match the name that precedes '.uri' in one of the '*.uri' fields, so 'solr' selects 'solr.uri' and 'elasticsearch' selects 'elasticsearch.uri'. For example, the following specifies elasticsearch as the crawldb to use:

  crawldb.backend: elasticsearch

  solr.uri: http://localhost:8983/solr/crawldb
  elasticsearch.uri: http://localhost:9200

Adding a crawldb to the config file

To add a crawldb to this config file, add a '.uri' field for it and set 'crawldb.backend' to its name. The following example uses a hypothetical crawldb called 'testdb'.

  crawldb.backend: testdb

  solr.uri: http://localhost:8983/solr/crawldb
  elasticsearch.uri: http://localhost:9200
  testdb.uri: http://localhost:9999  # replace http://localhost:9999 with the appropriate URI
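
The naming rule above can be checked mechanically. The following is a minimal sketch, not Sparkler code: the class CrawlDbBackends and its methods are hypothetical, the config is modelled as a flat Map rather than parsed from YAML, and it assumes that every key ending in '.uri' names a crawldb backend, which may not hold for the real file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Sparkler.
final class CrawlDbBackends {
    private static final String URI_SUFFIX = ".uri";

    // Collects every key of the form "<name>.uri" and returns the <name> parts, sorted.
    static Set<String> available(Map<String, String> conf) {
        Set<String> names = new TreeSet<>();
        for (String key : conf.keySet()) {
            if (key.endsWith(URI_SUFFIX) && key.length() > URI_SUFFIX.length()) {
                names.add(key.substring(0, key.length() - URI_SUFFIX.length()));
            }
        }
        return names;
    }

    // True when crawldb.backend names one of the configured "<name>.uri" entries.
    static boolean isValid(Map<String, String> conf) {
        String backend = conf.get("crawldb.backend");
        return backend != null && available(conf).contains(backend);
    }

    public static void main(String[] args) {
        // Mirrors the 'testdb' example above as a flat map.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("crawldb.backend", "testdb");
        conf.put("solr.uri", "http://localhost:8983/solr/crawldb");
        conf.put("elasticsearch.uri", "http://localhost:9200");
        conf.put("testdb.uri", "http://localhost:9999");

        System.out.println("available backends: " + available(conf));
        System.out.println("crawldb.backend is valid: " + isValid(conf));
    }
}
```

With the 'testdb.uri' line removed from the map, isValid returns false, which is the situation the matching rule is meant to prevent.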

Accessing the crawldb and crawldb.uri in code

Constants.java holds an interface through which the config file values can be accessed. In code, this will look like:

import edu.usc.irds.sparkler.Constants;
Constants.key.CRAWLDB_BACKEND  // for example, this may equal 'solr' or 'elasticsearch'

To get the crawldb URI, use SparklerConfiguration.java's getDatabaseURI() method. This uses Constants.key.CRAWLDB_BACKEND to determine the appropriate backend URI to return. In code, this might look like:

import edu.usc.irds.sparkler.SparklerConfiguration;
config.getDatabaseURI()  // where config is a SparklerConfiguration instance
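
To make the lookup concrete, here is a rough sketch of the behaviour described above: read the backend name, then return the value stored under the matching '.uri' key. This is an assumption about how the method behaves, not the actual SparklerConfiguration source. ConfigSketch is a hypothetical stand-in class, the key string "crawldb.backend" is taken from the YAML shown earlier, and the error handling is invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; the real SparklerConfiguration may differ.
final class ConfigSketch {
    // Assumed key name, copied from the YAML examples on this page.
    static final String CRAWLDB_BACKEND = "crawldb.backend";

    private final Map<String, String> values;

    ConfigSketch(Map<String, String> values) {
        this.values = new HashMap<>(values); // defensive copy
    }

    // Reads the backend name, then returns the value stored under "<backend>.uri".
    String getDatabaseURI() {
        String backend = values.get(CRAWLDB_BACKEND);
        if (backend == null) {
            throw new IllegalStateException("Missing '" + CRAWLDB_BACKEND + "' in config");
        }
        String uri = values.get(backend + ".uri");
        if (uri == null) {
            throw new IllegalStateException("No '" + backend + ".uri' entry in config");
        }
        return uri;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("crawldb.backend", "elasticsearch");
        conf.put("solr.uri", "http://localhost:8983/solr/crawldb");
        conf.put("elasticsearch.uri", "http://localhost:9200");

        // Selects the elasticsearch.uri value because crawldb.backend is 'elasticsearch'.
        System.out.println(new ConfigSketch(conf).getDatabaseURI());
    }
}
```

Failing loudly when the '.uri' entry is missing is a design choice of this sketch; it surfaces a mismatch between 'crawldb.backend' and the '*.uri' fields at startup rather than at first use.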