Datashare Server Mode
Datashare server mode is used by ICIJ to share projects (or document corpuses) between several users (journalists). Users are authenticated with OAuth2 or HTTP Basic authentication, and should have in their backend session a list of granted projects.
No external services nor cloud data exchanges are made, except for
- datashare docker image downloaded from docker hub
- NER models that are downloaded from ICIJ S3 service
Once your container and models are downloaded you can run datashare in an isolated local network.
Datashare is launched with --mode SERVER and you have to provide:
- the external elasticsearch index address (elasticsearchAddress)
- a Redis store address (redisAddress)
- a Redis data bus address (messageBusAddress)
- a database JDBC URL (dataSourceUrl)
- the host of datashare, used for batch search results URL generation (rootHost)
- an authentication mechanism and its parameters
docker run -ti icij/datashare:version --mode SERVER \
--redisAddress redis://my.redis-server.org:6379 \
--elasticsearchAddress https://my.elastic-server.org:9200 \
--messageBusAddress my.redis-server.org \
--dataSourceUrl 'jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password' \
--rootHost https://my.datashare-server.org
# ... +auth parameters (see below)
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the Authorization header, with user:password base64 encoded:
Authorization: Basic dXNlcjpwYXNzd29yZA==
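As a quick check of the encoding described above, here is a minimal sketch that rebuilds the header value from a user name and a password. The function name basic_auth_header is hypothetical and not part of Datashare; it only relies on the standard base64 tool.

```shell
# basic_auth_header is a hypothetical helper, not part of Datashare.
# Usage: basic_auth_header <user> <password>
basic_auth_header() {
  local user="$1" password="$2"
  # printf (unlike echo) does not add a trailing newline before encoding
  printf 'Basic %s\n' "$(printf '%s:%s' "$user" "$password" | base64)"
}

# Prints the value to send in the Authorization header
basic_auth_header user password
```

With the user:password pair from the example above, this prints the same value as the header shown.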
It is secure as long as the communication to the server is encrypted (with SSL for example).
On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store.
So you have to provision users. The passwords are sha256 hex encoded (for example with bash):
$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 -
Then insert the user like this in Redis:
$ redis-cli -h my.redis-server.org
redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}'
If you use other indices, you'll have to include them in groups_by_applications, but local-datashare should remain. For example, if you use myindex:
$ redis-cli -h my.redis-server.org
redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["myindex","local-datashare"]}}'
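The two provisioning steps above (hashing the password, then writing the JSON record) can be combined in a small script. This is only a sketch: datashare_user_json is a hypothetical helper name, it assumes sha256sum is available, and it does no JSON escaping, so it is only suitable for simple user and index names.

```shell
# datashare_user_json is a hypothetical helper, not part of Datashare.
# Usage: datashare_user_json <uid> <clear-text password> [extra index ...]
# No JSON escaping is done: keep to simple user and index names.
datashare_user_json() {
  local uid="$1" password="$2" hash indices index
  shift 2
  # SHA-256 hex digest of the password (printf avoids hashing a trailing newline)
  hash="$(printf '%s' "$password" | sha256sum | cut -d' ' -f1)"
  # Extra indices come first; local-datashare always stays in the list
  indices=""
  for index in "$@"; do
    indices="$indices\"$index\","
  done
  indices="$indices\"local-datashare\""
  printf '{"uid":"%s", "password":"%s", "groups_by_applications":{"datashare":[%s]}}\n' \
    "$uid" "$hash" "$indices"
}

# Prints the record for user foo with password bar and the extra index myindex
datashare_user_json foo bar myindex
```

The printed record is the value to pass to the Redis set command, as in the examples above.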
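Then, when you open Datashare in a browser, you should see the Basic authentication login popup. Datashare is launched with the Basic authentication filter like this: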
docker run -ti icij/datashare:version --mode SERVER \
--redisAddress redis://my.redis-server.org:6379 \
--elasticsearchAddress https://my.elastic-server.org:9200 \
--messageBusAddress my.redis-server.org \
--dataSourceUrl 'jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password' \
--rootHost https://my.datashare-server.org \
--authFilter org.icij.datashare.session.BasicAuthAdaptorFilter
With OAuth2 you will need an authorization service. Datashare is then launched with the OAuth2 client parameters:
docker run -ti icij/datashare:version --mode SERVER \
--oauthClientId 30045255030c6740ce4c95c \
--oauthClientSecret 10af3d46399a8143179271e6b726aaf63f20604092106 \
--oauthAuthorizeUrl https://my.oauth-server.org/oauth/authorize \
--oauthTokenUrl https://my.oauth-server.org/oauth/token \
--oauthApiUrl https://my.oauth-server.org/api/v1/me.json \
--oauthCallbackPath /auth/callback
Now you have a datashare server, but no data in index/database. We will see here how to index files into Elasticsearch in CLI mode.
You'll have to provide the addresses of Redis (used for queuing files), the index, the index name and you can also play with other indexing parameters (queue name, threading pools size, etc.) for your use cases and performance optimization.
First you have to fill the queue that will be used for indexing on several threads/machines. The scanning process is not parallelized because the bottleneck is the filesystem read, and we have observed empirically that this stage does not take long to execute, even for millions of documents.
docker run -ti -v host/data/path:datashare/container/path icij/datashare:version --mode CLI \
--stages SCAN \
-d datashare/container/path \
--redisAddress {{ ds_reddis_url }} \
--queueName {{ ds_queue_name }}
Let's review some parameters:
- you have to give the container access to the document folder (-v host/data/path:datashare/container/path) and tell datashare to use this folder inside docker (-d datashare/container/path)
- queueName is the name of the queue used by datashare/extract in Redis
- other parameters are the addresses of ES/Redis bus/Database
Once the data is in the Redis queue queueName, we can launch the indexing on several threads and machines (we use ansible to run this task on up to 30 nodes with 32 threads each).
docker run -ti icij/datashare:version --mode CLI \
--stages INDEX \
--ocr true \
--parserParallelism {{ processor_count_cmd.stdout }} \
--defaultProject {{ es_index_name }} \
--redisAddress {{ ds_reddis_url }} \
--queueName {{ ds_queue_name }} \
--reportName {{ ds_report_name }} \
--elasticsearchAddress {{ datashare_elasticsearch_url }} \
--messageBusAddress {{ ds_bus_url }} \
--dataSourceUrl {{ datashare_datasource_url }}
Additional parameters in the index stage are the following:
- ocr tells datashare/extract/Tika to do Optical Character Recognition (OCR). OCR will detect text in images, but it slows processing down by a factor of 5 to 10
- parserParallelism is the number of threads used for parsing documents
- defaultProject is the project name; it will be used as the index name for elasticsearch
- reportName is the name of the map used by datashare/extract to store the results of text extraction. It makes this stage idempotent: if all files have been indexed successfully and you launch this stage a second time with the same reportName parameter, it won't index any file
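To illustrate the idempotency that reportName provides, here is a conceptual sketch. It is not how datashare/extract is implemented: a temporary file stands in for the report map kept in Redis, a counter stands in for the real text extraction, and the names index_document and run_index_stage are hypothetical.

```shell
# Conceptual sketch only: NOT how datashare/extract is implemented.
# A temporary file stands in for the report map that Datashare keeps in Redis.
report_file="$(mktemp)"
trap 'rm -f "$report_file"' EXIT
indexed_count=0   # number of documents actually processed

index_document() {
  local doc="$1"
  # Skip any path already recorded in the report as successfully indexed
  if grep -Fxq -- "$doc" "$report_file"; then
    return 0
  fi
  indexed_count=$((indexed_count + 1))   # stand-in for the real text extraction
  printf '%s\n' "$doc" >> "$report_file"
}

run_index_stage() {
  local doc
  for doc in "$@"; do
    index_document "$doc"
  done
}

run_index_stage /data/a.pdf /data/b.pdf   # first run processes both files
run_index_stage /data/a.pdf /data/b.pdf   # second run finds them in the report and skips them
echo "documents processed: $indexed_count"
```

The second run does no work because every path is already in the report, which is the behaviour described for a second launch with the same reportName.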
Sometimes you have an existing index and want to index additional documents without processing every document again. It can be done in two steps:
- Scan the index and gather the document paths to store them in a report queue
- Scan and index the documents in the directory; thanks to the previous report queue, the paths it contains will be skipped
docker run -ti icij/datashare:version --mode CLI \
--stages SCANIDX \
--defaultProject {{ es_index_name }} \
--redisAddress {{ ds_reddis_url }} \
--reportName {{ ds_report_name }} \
--elasticsearchAddress {{ datashare_elasticsearch_url }} \
--messageBusAddress {{ ds_bus_url }}
docker run -ti -v host/data/path:datashare/container/path icij/datashare:version --mode CLI \
--stages "SCAN,INDEX" \
-d datashare/container/path \
--ocr true \
--parserParallelism {{ processor_count_cmd.stdout }} \
--defaultProject {{ es_index_name }} \
--redisAddress {{ ds_reddis_url }} \
--queueName {{ ds_queue_name }} \
--reportName {{ ds_report_name }} \
--elasticsearchAddress {{ datashare_elasticsearch_url }} \
--messageBusAddress {{ ds_bus_url }}
To find named entities, we resume processing of the documents that have not yet been processed by a given pipeline.
docker run -ti icij/datashare:version --mode CLI \
--stages NER \
--nlpp {{ ds_nlpp_pipelines }} \
--resume \
--nlpParallelism {{ processor_count_cmd.stdout }} \
--defaultProject {{ es_index_name }} \
--redisAddress {{ ds_reddis_url }} \
--elasticsearchAddress {{ datashare_elasticsearch_url }} \
--messageBusAddress {{ ds_bus_url }} \
--dataSourceUrl {{ datashare_datasource_url }}
The NER parameters are:
- nlpp is the pipeline used, which could be CORENLP, OPENNLP or MITIE
- nlpParallelism is the number of threads used for named entity finding
- defaultProject is the project name; it will be used as the index name for elasticsearch
- resume also brings idempotency by first searching for the documents not yet processed by the pipeline
To run user batch searches, you can run this command:
docker run --rm icij/datashare:version \
-m BATCH_SEARCH \
--dataSourceUrl '{{ datashare_datasource_url }}' \
--elasticsearchAddress '{{ datashare_elasticsearch_url }}' \
--batchQueueType org.icij.datashare.extract.RedisBlockingQueue
The batch search parameters are:
- batchQueueType is the queue class for the batch search queue
- batchSearchMaxTimeSeconds is the max time for a batch search, in seconds
- batchThrottleMilliseconds is the throttle for batch search, in milliseconds
You can use it in a crontab job for example.
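As a sketch of the crontab usage mentioned above, the helper below prints a crontab line for the batch search command. The function name batch_search_cron_line and the 10-minute schedule are assumptions for illustration; the URLs reuse the example values from the server launch command.

```shell
# batch_search_cron_line is a hypothetical helper, not part of Datashare.
# It prints one crontab line that runs the batch search command on a schedule.
# Usage: batch_search_cron_line <cron schedule> <dataSourceUrl> <elasticsearchAddress>
batch_search_cron_line() {
  local schedule="$1" datasource_url="$2" elasticsearch_url="$3"
  # The URLs are double-quoted in the generated line because they may contain "&".
  # cron treats an unescaped "%" in a command as a newline, so a URL containing
  # "%" would need it written as "\%" in the crontab.
  printf '%s docker run --rm icij/datashare:version -m BATCH_SEARCH --dataSourceUrl "%s" --elasticsearchAddress "%s" --batchQueueType org.icij.datashare.extract.RedisBlockingQueue\n' \
    "$schedule" "$datasource_url" "$elasticsearch_url"
}

# Example: every 10 minutes (the schedule is an arbitrary choice)
batch_search_cron_line '*/10 * * * *' \
  'jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password' \
  'https://my.elastic-server.org:9200'
```

The printed line can then be pasted into the crontab of the user that runs docker (crontab -e).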
To run user batch download, you can run this command:
docker run --rm -v host/downloads:/home/datashare/app/tmp icij/datashare:version \
-m BATCH_DOWNLOAD \
--dataSourceUrl '{{ datashare_datasource_url }}' \
--elasticsearchAddress '{{ datashare_elasticsearch_url }}' \
--batchQueueType org.icij.datashare.extract.RedisBlockingQueue
The batch download parameters are:
- batchQueueType is the queue class for the batch download queue
- batchDownloadMaxNbFiles is the maximum number of files that can be archived in a zip; the default is 10,000
- batchDownloadMaxSize is the maximum total size of the files that can be zipped, with a human-readable suffix K/M/G for KB/MB/GB; the default is 100M
- batchDownloadTimeToLive is the time to live, in hours, of batch download zip files; the default is 24h
- smtpUrl is the SMTP URL that allows datashare to send emails (ex: smtp://localhost:25)
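To get a feel for what a batchDownloadMaxSize value such as 100M represents, here is a small sketch that converts the K/M/G notation to bytes. It assumes the suffixes are powers of 1024, which may not match how Datashare interprets them, and size_to_bytes is a hypothetical helper name.

```shell
# size_to_bytes is a hypothetical helper, not part of Datashare.
# ASSUMPTION: K/M/G are powers of 1024; Datashare may interpret them differently.
# Usage: size_to_bytes <number>[K|M|G]
size_to_bytes() {
  local value="$1" number suffix factor
  number="${value%[KMG]}"        # strip one trailing K, M or G if present
  suffix="${value#"$number"}"    # what was stripped (empty when no suffix)
  case "$number" in
    ''|*[!0-9]*) echo "invalid size: $value" >&2; return 1 ;;
  esac
  case "$suffix" in
    K) factor=1024 ;;
    M) factor=$((1024 * 1024)) ;;
    G) factor=$((1024 * 1024 * 1024)) ;;
    *) factor=1 ;;
  esac
  echo $((number * factor))
}

# The documented default for batchDownloadMaxSize
size_to_bytes 100M
```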
You can use it in a crontab job for example.
Zip archives can be encrypted; Datashare then sends the passwords to users through SMTP. To enable this, the datashare backend instance needs the batchDownloadEncrypt parameter set to true, and the batch download instance needs an smtpUrl to be defined.
Datashare is open source and easily extensible. You can implement your own components for your architecture.
For now, the database in datashare is used to store:
- starred documents
- tags
- batch searches
- batch results
- display banners according to documents path
- access rights for downloading sources of documents
It is implemented for PostgreSQL and SQLite with jOOQ (a fairly low-level, SQL-like Java API). Normally it should work "as is" for other databases (MySQL...) supported by jOOQ, but the Repositories integration tests are only run on CI for PostgreSQL/SQLite.
You can try changing the DB URL parameter, dataSourceUrl. You can find useful documentation about the available options in the PostgreSQL JDBC driver documentation. Some handy examples:
- To connect to the server named postgresserver, to the database named datashare, using myuser as username and mypassword as password, you may pass this connection string to the dataSourceUrl option: jdbc:postgresql://postgresserver/datashare?user=myuser&password=mypassword
- To connect in the same way, but to a PostgreSQL server using SSL with a self-signed certificate whose CA certificate is stored in /etc/ssl/certs/ca-cert.crt, and listening on port 25061: jdbc:postgresql://postgresserver:25061/datashare?sslmode=require&user=myuser&password=mypassword&sslrootcert=/etc/ssl/certs/ca-cert.crt
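The connection strings above follow a regular pattern, so they can be assembled by a small script. This is a sketch only: postgres_jdbc_url is a hypothetical helper name, and it does no URL encoding, so user names or passwords containing special characters would need to be percent-encoded first.

```shell
# postgres_jdbc_url is a hypothetical helper, not part of Datashare.
# Usage: postgres_jdbc_url <host[:port]> <database> <user> <password> [extra options]
# No URL encoding is done: special characters must be percent-encoded beforehand.
postgres_jdbc_url() {
  local host="$1" database="$2" user="$3" password="$4" extra="${5:-}"
  local url="jdbc:postgresql://$host/$database?user=$user&password=$password"
  # Extra driver options (for example SSL settings) are appended as-is
  if [ -n "$extra" ]; then
    url="$url&$extra"
  fi
  printf '%s\n' "$url"
}

# Plain connection, then SSL on port 25061 with a CA certificate
postgres_jdbc_url postgresserver datashare myuser mypassword
postgres_jdbc_url postgresserver:25061 datashare myuser mypassword \
  'sslmode=require&sslrootcert=/etc/ssl/certs/ca-cert.crt'
```

When passing the result to --dataSourceUrl on a shell command line, keep it quoted, because an unquoted & would send the command to the background.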
Datashare is based on fluent-http. It needs two classes to handle sessions:
- Users is the list of referenced users
- SessionIdStore that is the list of session ids
We have implemented this with RedisUsers and RedisSessionIdStore.
So sessions/users are stored in Redis, but they could be implemented with another persistence backend component.
Here also we used Redis for our needs, but in extract there is a MySQL implementation for the Queue/Report components. If it is more convenient for you, you can try to wire this into Datashare and add options for them.
We also implemented memory queues/maps so that datashare can be run without dependencies on Redis (this only works when running on one machine).
We are using a small Redis databus for:
- the progress of indexing
- launching NER finding after indexing
But we already implemented a memory databus for the same reason as above.
That could also be implemented with RabbitMQ or other data buses; take a look at DataBus.