Home
The dor_indexing_app is the primary API for indexing DOR objects into the DOR Index in the Solr cloud. Its purpose is to keep data consistent between DOR objects and the Solr index via an automated pipeline. Its challenges are to ensure that all DOR objects can be indexed (i.e., handling unexpected data problems) and to keep the latency down and throughput up.
- All incoming requests are synchronous (i.e., each request blocks until indexing has finished).
- All DOR objects are indexed in parallel (i.e., not batched).
- All DOR objects are read from Fedora and the Workflow Service.
- All Solr documents are written to the Solr cloud.
The /dor/reindex/:pid route (see API documentation) basically does the following:
obj = Dor.load_instance pid # initializes an ActiveFedora object, e.g., Dor::Item, Dor::Collection, Dor::AdminPolicyObject
solr_doc = obj.to_solr # loads all datastreams and related objects via ActiveFedora
Dor::SearchService.solr.add(solr_doc, options) # updates the Solr index via RSolr
We have a detailed description of the indexing itself. Note that it covers "bulk indexing" in great detail; bulk indexing is outside the normal reindexing pipeline and is used for building an index "from scratch."
We log the time taken by each API call to /dor/reindex/:pid. There is some instrumentation (e.g., benchmarking and metrics) at the sub-route level. The logic has these 3 main parts:
- (a) reading the object from Fedora and the Workflow Service,
- (b) converting the object into a Solr document, and
- (c) updating the Solr index.
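As a rough illustration of sub-route timing, the sketch below wraps each of the three parts in Benchmark.realtime from the Ruby standard library. The method name timed_reindex and the three callables are hypothetical stand-ins for the real calls (Dor.load_instance, to_solr, and the RSolr add); this is not the app's actual instrumentation.

```ruby
require 'benchmark'

# Times each phase of a reindex and returns the elapsed seconds per phase.
# The three callables are hypothetical stand-ins for the real calls
# (Dor.load_instance, obj.to_solr, and the RSolr add).
def timed_reindex(pid, load:, convert:, update:)
  timings = {}
  obj = nil
  doc = nil
  timings[:load]    = Benchmark.realtime { obj = load.call(pid) }
  timings[:convert] = Benchmark.realtime { doc = convert.call(obj) }
  timings[:update]  = Benchmark.realtime { update.call(doc) }
  timings
end

# Stub phases so the sketch runs without Fedora or Solr.
index = []
timings = timed_reindex(
  'druid:bb021tj7970',
  load:    ->(pid) { { id: pid } },
  convert: ->(obj) { { 'id' => obj[:id] } },
  update:  ->(doc) { index << doc }
)
timings.each { |phase, secs| puts format('%-8s %.6fs', phase, secs) }
```

With real objects, the numbers for the first two phases should be read together, since time spent fetching datastreams may be attributed to the conversion phase rather than the load phase.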
Notably, the distinction between (a) and (b) is blurred due to the "lazy loading" approach that ActiveFedora uses to load all the datastreams for a given object. You can view a sample Solr document for an object by adding ".json" to the end of the Argo object view page, e.g., https://argo.stanford.edu/catalog/druid:bb021tj7970.json
The dor_indexing_app's incoming traffic comes solely from an ActiveMQ consumer. Fedora sends an ActiveMQ message to the fedora.apim.update topic on every update to or deletion of an object, as does the Workflow Service on every change in status.
There is a single consumer of those messages that translates them into GET requests on /dor/reindex/:pid.
You can see in the ActiveMQ configuration that the fedora.apim.update topic is consumed by making a reindexing API call to dor-indexing-app. The messages are aggregated into batches by this ActiveMQ consumer, but the effectiveness of that batching and its ability to de-duplicate messages are unknown (as of 11/4/16).
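To make the de-duplication question concrete, here is a minimal sketch of what collapsing a batch to one reindex per PID could look like. It does not describe what the ActiveMQ consumer actually does; the method name pids_to_reindex, the message shape (a Hash with a :pid key), and the second druid are assumptions for illustration.

```ruby
# Collapses a batch of update messages to one reindex per PID, keeping
# the order in which each PID was first seen.
# The message shape (a Hash with a :pid key) is an assumption.
def pids_to_reindex(messages)
  messages.map { |message| message[:pid] }.uniq
end

# A hypothetical batch in which one object was updated twice.
batch = [
  { pid: 'druid:bb021tj7970' },
  { pid: 'druid:aa111bb2222' }, # made-up druid for illustration
  { pid: 'druid:bb021tj7970' }
]
deduped = pids_to_reindex(batch)
puts deduped.inspect
```

Because each reindex reads the current state of the object, reindexing a PID once per batch would produce the same Solr document as reindexing it once per message.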
The fedora.apim.update topic also receives delete-object messages (called purgeObject). When these messages are consumed, they trigger a GET request on the /dor/delete_from_index/:pid API.
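The routing described above can be sketched as a small mapping from a message's method name to the route to call. The base URL and the method name indexing_uri are hypothetical, and treating every method other than purgeObject as a reindex is an assumption, since only purgeObject is named here.

```ruby
require 'uri'

# Hypothetical base URL; the real host is not given in this page.
BASE_URL = 'http://localhost:3000'

# Maps a message's method name to the indexing route to call.
# Only purgeObject is named in this page; every other method is assumed
# to be an update that triggers a reindex.
def indexing_uri(method_name, pid)
  route = method_name == 'purgeObject' ? 'delete_from_index' : 'reindex'
  URI("#{BASE_URL}/dor/#{route}/#{pid}")
end

puts indexing_uri('purgeObject', 'druid:bb021tj7970')
puts indexing_uri('modifyObject', 'druid:bb021tj7970') # example update method
```

A consumer would then issue a GET on the returned URI, for example with Net::HTTP.get_response from the standard library.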
Note that we have two ActiveMQ brokers running (an "a" node and a "b" node). They are not actively load balanced but are configured for failover. That is, if sending a message to the "a" node fails, then the message is sent to the "b" node. This failover method is configured in the Fedora and Workflow Service ActiveMQ configuration files:
failover:(tcp://mqhost1,tcp://mqhost2)?timeout=5000
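The following sketch models the failover behavior described above: it pulls the broker list out of the failover URI and tries each broker in turn until one accepts the message. It illustrates the idea only and is not the ActiveMQ failover transport itself; failover_brokers, send_with_failover, and the send_to callable are hypothetical names.

```ruby
# Extracts the broker URIs from an ActiveMQ failover URI, in listed order.
def failover_brokers(failover_uri)
  match = failover_uri.match(/\Afailover:\(([^)]*)\)/)
  raise ArgumentError, "not a failover URI: #{failover_uri}" unless match

  match[1].split(',').map(&:strip)
end

# Tries each broker in order and returns the first successful result.
# `send_to` is a hypothetical callable standing in for a real client send.
def send_with_failover(brokers, message, send_to)
  last_error = nil
  brokers.each do |broker|
    begin
      return send_to.call(broker, message)
    rescue StandardError => e
      last_error = e # remember the failure and move on to the next broker
    end
  end
  raise last_error || ArgumentError.new('no brokers configured')
end

brokers = failover_brokers('failover:(tcp://mqhost1,tcp://mqhost2)?timeout=5000')

# Simulate the "a" node being down so the "b" node receives the message.
flaky = lambda do |broker, msg|
  raise IOError, 'connection refused' if broker == 'tcp://mqhost1'

  "#{msg} sent via #{broker}"
end
result = send_with_failover(brokers, 'druid:bb021tj7970', flaky)
puts result
```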
There are ActiveMQ dashboards available at /activemqweb/ and /hawtio/, and dor-indexing-app has a /dor/queue_size route that returns the current incoming queue size.
These are some of the notable gems in the stack (versions are in Gemfile.lock):
- dor-services:
  - This holds all the application logic for converting a DOR object into a Solr document
- dor-workflow-service:
  - Used by dor-services to get information about the workflows datastream
  - This is the API to our Workflow Service HTTP API
- ActiveFedora (on the 8.x branch, the latest for the Fedora 3 releases):
  - Used by dor-services to do CRUD operations on Fedora objects
  - This is the ActiveRecord-like API to objects stored in Fedora
- rubydora:
  - Used by ActiveFedora
  - This is the API to Fedora's HTTP API -- Fedora v3 only
- RSolr:
  - Used by dor-services and ActiveFedora to query and index Solr documents
  - This is the API to Solr's HTTP API
- rest-client:
  - Used by several gems for HTTP request/response processing
  - This is an HTTP client gem
- rails:
  - The webapp platform
- Fedora (for reading)
- Workflow Service (for reading)
- Solr (for writing, although it apparently uses a read/query too)
- ActiveMQ (generates incoming traffic)