Oracc REST

A Flask RESTful API for querying the Oracc database using Elasticsearch.

The accompanying frontend project for accessing this backend can be found in its own repository.

The guide below is sufficient for setting up the entire project. For additional technical and supplementary information, please refer to the ORACC Server wiki.

You can also see these useful snippets.

This codebase has been written and tested in Python 3.


Project structure

This is the directory structure in the project root:

.
├── api # Flask API code
├── ingest # scripts for processing and uploading data into Elasticsearch
├── tests # custom tests
├── app.py # entry point for running the Flask API
├── oracc-rest.wsgi # config file for serving the Flask API via Apache
└── requirements.txt # list of Python modules used by the project

Setting up the project

The Oracc project has been dockerised to run Flask and Elasticsearch. This means you only need Docker installed to run the backend of the search site (i.e. this repo) locally.

Using a python virtual environment

It is best practice to work within a Python virtual environment for both development and production. This keeps any packages you install isolated from system-wide installations. You can use any virtual environment manager of your choice, but make sure not to commit your virtual environment folder to this repo. Here is an example using the built-in Python virtual environment manager:

# run the following from the top-level directory of the project
python3 -m venv venv # creates the environment
source venv/bin/activate # activates the environment
deactivate # deactivates the environment

Once you have created and activated your virtual environment, you can install pip packages and do all other tasks as normal.
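For example, with the environment active you can install the project's Python dependencies from the requirements.txt file listed in the project structure above:

# install the project's dependencies into the active virtual environment
pip install -r requirements.txt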


Running the API locally

Ensure you have Docker installed on your local machine first.

Then you can simply get Elasticsearch and the API server up and running with a single command, run from the top-level directory of this repo.

Please note that if you are on a Mac, you may need to export ORACC_INGEST_DIRECTORY as the absolute path to the sample glossaries folder within the ingest directory. A good indication that you need to do this is an error saying the ingest directory could not be mounted when you try to build and start the Docker containers.

This is because one of the Docker containers requires this environment variable to complete the ingest, but on a Mac the relative path described in docker-compose.yml is not recognised as a location Docker is allowed to read from.

As this would need to be done for every terminal session, it is recommended that you add the export to your ~/.zshrc or ~/.bashrc:

export ORACC_INGEST_DIRECTORY="<absolute path to>/oracc-rest/ingest/assets/dev/sample-glossaries"

You can then proceed to run:

docker compose up --build -d

(If you are running on an older OS you might need docker-compose rather than docker compose, i.e. with a dash instead of a space.)

This will expose the API server on localhost:8000. The Elasticsearch server will be populated with the glossaries, and the API server will be connected to Elasticsearch. Elasticsearch will not be accessible from outside the Docker network.

To stop the Docker containers, run docker compose down.
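Once the containers are up, you can check that the ingest completed and the API is responding by querying one of the search endpoints described below (water-skin is just an example term from the sample glossaries):

# should return a JSON array of matching entries (or a 204 status if nothing matches)
curl http://localhost:8000/search/water-skin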

Docker in the production environment

Just as before, the production environment is easiest to deploy with a copy of the source code:

git clone https://github.com/oracc/oracc-rest.git
cd oracc-rest
git switch development

The docker compose deployment uses gunicorn and is production-ready. However, you need to set the port the backend should listen on, and the ingest directory to the path the search backend should ingest glossaries from (this directory must be readable by the user oracc). You can do this by creating a file /etc/profile.d/oracc.sh with the following contents:

export ORACC_INGEST_DIRECTORY=/path/to/ingest
export ORACC_PORT=5000

Then source it (just this once; it will be sourced automatically whenever you log in from now on):

source /etc/profile.d/oracc.sh

Then we can run docker compose up --build -d as before. This will ingest data from the ORACC_INGEST_DIRECTORY on startup.

To ingest from the same directory again:

docker restart oracc-ingest

To get logs from the ingest process:

docker logs --tail=30 -t oracc-ingest

These can be run by any user with Docker privileges from any directory.

If you change the ingest directory (by editing /etc/profile.d/oracc.sh), you must remove the old volume and restart. The easiest way would be:

source /etc/profile.d/oracc.sh
docker compose down -v
docker compose up --build -d

This will destroy all the existing Elasticsearch data and recreate it. If you would rather not recreate all the data, do this instead:

source /etc/profile.d/oracc.sh
docker compose down
docker rm oracc-ingest
docker volume rm oracc-rest_ingest
docker compose up --build -d

The Elasticsearch data is persistent across restarts. To remove it and ingest again from scratch (again from the source directory):

docker compose down -v
docker compose up --build -d

To upgrade to new code without re-ingesting:

cd oracc-rest
git pull
docker compose down
docker compose up --build -d

Additional info for querying the Flask API

Calling the Flask API endpoints to retrieve data from Elasticsearch

The search can be accessed at the /search endpoint of a server running Elasticsearch and the Oracc web server in this repo, e.g.:

# during production
curl -k https://localhost:8000/search/water-skin

# during development
curl http://localhost:8000/search/water-skin

This searches multiple fields for the given query word and returns all results. The list of fields currently searched is: gw (guideword), cf (cuneiform), senses.mng (meaning), forms.n and norms.n (lemmatisations).

The matching is not exact: an entry is considered to match a query word if it contains any term starting with the query in the relevant fields. For example, searching for "cat" would return words with either "cat" or "catch" in their meanings (among others).

The query can also be a phrase of words separated by spaces. In this case, it will return results matching all of the words in the phrase.
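For example (the two-word phrase below is illustrative; the %20 simply URL-encodes the space in the path):

# single-word query
curl http://localhost:8000/search/water

# multi-word phrase: only entries matching all of the words are returned
curl "http://localhost:8000/search/water%20skin"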

A second endpoint at /search_all can be used to retrieve all indexed entries.

In both cases, the result is a JSON array with the full contents of each hit. If no matches are found, a 204 (No Content) status code is returned.
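You can ask curl to print the status line and headers with -i; the query word here is a made-up term assumed to have no matches:

# a match returns 200 with a JSON array; no match returns 204 (No Content)
curl -i http://localhost:8000/search/xyzzy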

An older, simpler search mode can also be accessed at the /search endpoint:

curl -XGET localhost:8000/search -d 'gw=water'

This mode supports searching a single field (e.g. guideword) for the given value. If more than one field is specified (or if none is), an error is returned. This mode does not accept the extra parameters described below and should be considered deprecated.

Customising the search

You can customise the search by optionally specifying additional parameters.

These are:

  • sort_by: the field on which to sort (gw, cf or icount)
  • dir: the sorting direction, ascending (asc) or descending (desc)
  • count: the maximum number of results to return

For example, if you want to retrieve the 20 entries that appear most frequently in the indexed corpus, you can request this at:

localhost:8000/search_all?sort_by=icount&dir=desc&count=20
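When calling this from a shell, quote the URL so that the & separators are not interpreted by the shell:

# the 20 entries that appear most frequently in the indexed corpus
curl "localhost:8000/search_all?sort_by=icount&dir=desc&count=20"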

Paginating the results

If you don't want to retrieve all results at once, you can use a combination of the count parameter described above and the after parameter. The latter takes a "sorting threshold" and only returns entries whose sorting score is greater (for ascending searches) or lesser (for descending searches) than this threshold.

Important note: the sorting score depends on the field being sorted on, but it is not equal to the value of that field! Instead, you can retrieve an entry's score by looking at the sort field returned with each hit. You can then use this value as the threshold when requesting the next batch of results.
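For example, a sketch of fetching two consecutive pages (the threshold value 42 is made up; in practice you would copy it from the sort field of the last hit on the previous page):

# first page: the 10 most frequent entries
curl "localhost:8000/search_all?sort_by=icount&dir=desc&count=10"

# next page: pass the last hit's sort value as the threshold
curl "localhost:8000/search_all?sort_by=icount&dir=desc&count=10&after=42"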

Suggesters

Two other endpoints can be accessed at /suggest and /completion on a server running Elasticsearch and the Oracc web server in this repo.

In the case of /suggest e.g.:

curl -XGET localhost:8000/suggest/yam

This searches both the gw (guideword) and cf (cuneiform) fields for words within an edit distance of 2 from the query word, e.g. yam returns ym and ya (cf).

In the case of /completion e.g.:

curl -XGET localhost:8000/completion/go

This searches both the gw (guideword) and cf (cuneiform) fields for words which begin with the query. This works for single letters or fragments of words, e.g. go returns god and goddess.



Running the tests

The code is accompanied by tests written for the pytest library (installed with the requirements), which can help ensure that important functionality is not broken.

To run the tests after making changes, restart the Docker Compose stack with the test configuration:

docker compose down
docker compose -f docker-compose.yml -f docker-compose.test.yml up -d --build

then wait for the Elasticsearch container to come up (use docker compose logs -f to watch its progress if you like), then activate the virtual environment and run pytest:

. .venv/bin/activate
pytest

Examining memory usage in Elasticsearch

$ docker-compose exec elasticsearch bash
elasticsearch@7b6eb4ff0455:~$ curl "localhost:9200/_nodes/stats?filter_path=nodes.*.jvm.mem.pools.old"
{"nodes":{"w9Hm2YGAT7iorZM3JTYlUA":{"jvm":{"mem":{"pools":{"old":{"used_in_bytes":29128704,"max_in_bytes":8589934592,"peak_used_in_bytes":41711616,"peak_max_in_bytes":8589934592}}}}}}}
elasticsearch@7b6eb4ff0455:~$ exit

This shows roughly 29 MB used (used_in_bytes) out of an 8 GiB maximum (max_in_bytes).
