A Flask RESTful API for querying the Oracc database using Elasticsearch.
The accompanying frontend project for accessing this backend can be found in its own repository.
The guide below is sufficient for setting up the entire project. For additional technical and supplementary information please refer to the ORACC Server wiki.
This codebase has been written and tested in Python 3.
This is the directory structure in the project root:
.
├── api # flask API code
├── ingest # scripts for processing and uploading data into elasticsearch
├── tests # custom tests
├── app.py # entrypoint for running the flask api
├── oracc-rest.wsgi # config file for serving the flask api via apache
└── requirements.txt # list of python modules used by the project
The Oracc project has been dockerised to run Flask and Elasticsearch. This means you only need Docker installed to run the backend of the search site (i.e. this repo) locally.
It is best practice to work within a Python virtual environment for both development and production. This keeps any packages you install isolated from system-wide installations. You can use any virtual environment manager of your choice, but make sure not to commit your virtual environment folder to this repo. Here is an example using Python's built-in virtual environment manager:
# run the following from the top-level directory of your python project
python3 -m venv venv # creates the environment
source venv/bin/activate # activates the environment
deactivate # deactivates the environment
Once you have created and activated your virtual environment, you can install pip packages and do all other tasks as normal.
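For example, once the environment is active you can install the project's Python dependencies listed in requirements.txt:
# install the pinned dependencies into the active environment
pip install -r requirements.txt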
Ensure you have Docker installed on your local machine first.
Then you can get Elasticsearch and the API server up and running with a single command from the top-level directory of this repo.
Please note that if you are on a Mac, you may need to export ORACC_INGEST_DIRECTORY as the absolute path to the sample glossaries folder within the ingest directory. A good indication that you need to do this is an error saying the ingest directory could not be mounted when trying to build and start the Docker containers. This happens because one of the containers requires this environment variable to complete the ingest, but on a Mac the relative path given in docker-compose.yml is not recognised as a location Docker is allowed to read from.
As this would need to be done for every terminal session, it is recommended you add it to your ~/.zshrc or ~/.bashrc:
export ORACC_INGEST_DIRECTORY="<absolute path to>/oracc-rest/ingest/assets/dev/sample-glossaries"
You can then proceed to run:
docker compose up --build -d
(If you are running on an older OS you might need docker-compose rather than docker compose, i.e. a dash instead of a space.)
This will expose the API server on localhost:8000. The Elasticsearch server will be populated with the glossaries, and the API server will be connected to Elasticsearch. Elasticsearch will not be accessible from outside the Docker network.
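A quick way to verify that both containers are up is to list the compose services:
# show the state of the services defined in docker-compose.yml
docker compose ps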
To stop the Docker containers, run docker compose down.
Just as before, the production environment is easiest to deploy with a copy of the source code:
git clone https://github.com/oracc/oracc-rest.git
cd oracc-rest
git switch development
The docker compose deployment uses gunicorn and is production-ready. However, you need to set the port that the backend should listen on, and the ingest directory to the path from which the search backend should ingest glossaries (this directory must be readable by the user oracc). You can do this by creating a file /etc/profile.d/oracc.sh with the following contents:
export ORACC_INGEST_DIRECTORY=/path/to/ingest
export ORACC_PORT=5000
Then source it manually (just this once; it will run automatically whenever you log in from now on):
source /etc/profile.d/oracc.sh
Then we can run docker compose up --build -d as before. This will ingest data from the ORACC_INGEST_DIRECTORY on startup.
To ingest from the same directory again:
docker restart oracc-ingest
To get logs from the ingest process:
docker logs --tail=30 -t oracc-ingest
These can be run by any user with Docker privileges from any directory.
If you change the ingest directory (by editing /etc/profile.d/oracc.sh), you must remove the old volume and restart. The easiest way is:
source /etc/profile.d/oracc.sh
docker compose down -v
docker compose up --build -d
This will destroy all the existing Elasticsearch data and recreate it. If you would rather not recreate all the data, do this:
source /etc/profile.d/oracc.sh
docker compose down
docker rm oracc-ingest
docker volume rm oracc-rest_ingest
docker compose up --build -d
The Elasticsearch data persists across restarts. To remove it and ingest again from scratch (again from the source directory):
docker compose down -v
docker compose up --build -d
To upgrade to new code without re-ingesting:
cd oracc-rest
git pull
docker compose down
docker compose up --build -d
The search can be accessed at the /search endpoint of a server running Elasticsearch and the Oracc web server in this repo, e.g.:
# during production
curl -k https://localhost:8000/search/water-skin
# during development
curl http://localhost:8000/search/water-skin
This searches multiple fields for the given query word and returns all results. The fields currently searched are: gw (guideword), cf (cuneiform), senses.mng (meaning), and forms.n and norms.n (lemmatisations).
The matching is not exact: an entry is considered to match a query word if it contains any term starting with the query in the relevant fields. For example, searching for "cat" would return words with either "cat" or "catch" in their meanings (among others).
The query can also be a phrase of words separated by spaces. In this case, it will return results matching all of the words in the phrase.
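For example, a phrase query could be issued as follows (spaces URL-encoded as %20; the phrase itself is purely illustrative):
curl "http://localhost:8000/search/water%20skin"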
A second endpoint at /search_all can be used to retrieve all indexed entries.
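For instance, during development:
curl http://localhost:8000/search_all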
In both cases, the result is a JSON array with the full contents of each hit. If no matches are found, a 204 (No Content) status code is returned.
An older, simpler search mode can also be accessed at the /search endpoint:
curl -XGET localhost:8000/search -d 'gw=water'
This mode supports searching a single field (e.g. guideword) for the given value. If more than one field is specified (or if none is), an error is returned. This mode does not accept the extra parameters described below and should be considered deprecated.
You can customise the search by optionally specifying additional parameters. These are:
- sort_by: the field on which to sort (gw, cf or icount)
- dir: the sorting order, ascending (asc) or descending (desc)
- count: the maximum number of results to retrieve
For example, if you want to retrieve the 20 entries that appear most frequently in the indexed corpus, you can request this at:
localhost:8000/search_all?sort_by=icount&dir=desc&count=20
If you don't want to retrieve all results at once, you can use a combination of the count parameter described above and the after parameter. The latter takes a "sorting threshold" and only returns entries whose sorting score is greater or lesser (for ascending or descending search, respectively) than this threshold.
Two other endpoints, /suggest and /completion, can be accessed on a server running Elasticsearch and the Oracc web server in this repo.
In the case of /suggest, e.g.:
curl -XGET localhost:8000/suggest/yam
This searches both the gw (guideword) and cf (cuneiform) fields for words which are within a distance of 2 changes from the query word, e.g. yam returns ym and ya (cf).
In the case of /completion, e.g.:
curl -XGET localhost:8000/completion/go
This searches both the gw (guideword) and cf (cuneiform) fields for words which begin with the query. This works for single letters or fragments of words, e.g. go returns god and goddess.
Important note: the sorting score depends on the field being sorted on, but it is not equal to the value of that field! Instead, you can retrieve an entry's score by looking at the sort field returned with each hit. You can then use this value as the threshold when requesting the next batch of results.
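Putting count, after and the sort field together, a hypothetical paginated fetch might look like this (the after value of 42 is purely illustrative; in practice use the sort value returned with the last hit of the previous response):
# first page: the 20 most frequent entries
curl "localhost:8000/search_all?sort_by=icount&dir=desc&count=20"
# next page: entries whose sorting score is below the last hit's score
curl "localhost:8000/search_all?sort_by=icount&dir=desc&count=20&after=42"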
The code is accompanied by tests written for the pytest library (installed with the requirements), which can help ensure that important functionality is not broken.
To run the tests after making changes, restart the docker compose stack:
docker compose down
docker compose -f docker-compose.yml -f docker-compose.test.yml up -d --build
then wait for the Elasticsearch container to come up (use docker compose logs -f to watch it if you like), then enter the virtual environment and run pytest:
. .venv/bin/activate
pytest
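If you only want to run a subset of the tests, pytest's standard -k filter works as usual (the name pattern here is just an example):
pytest -k search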
To check Elasticsearch's JVM memory usage, you can open a shell in the Elasticsearch container and query the node stats:
$ docker-compose exec elasticsearch bash
elasticsearch@7b6eb4ff0455:~$ curl "localhost:9200/_nodes/stats?filter_path=nodes.*.jvm.mem.pools.old"
{"nodes":{"w9Hm2YGAT7iorZM3JTYlUA":{"jvm":{"mem":{"pools":{"old":{"used_in_bytes":29128704,"max_in_bytes":8589934592,"peak_used_in_bytes":41711616,"peak_max_in_bytes":8589934592}}}}}}}
elasticsearch@7b6eb4ff0455:~$ exit
This shows us roughly 29M used, 8.6G max.