This repository contains code and configuration for processing and analysing images of plankton samples. It's experimental, serving as much as a proposed template for new projects as a project in itself.
It's a companion project to an R Shiny-based image annotation app, not yet released, that was written by researchers and data scientists at the UK Centre for Ecology and Hydrology in the early stages of a collaboration that was placed on hold.
Create a fresh virtual environment in the repository root using Python >=3.12 and (e.g.) venv:
python -m venv venv
Next, install the package using pip:
python -m pip install .
Most likely you are interested in developing and/or experimenting, so you will probably want to install the package in 'editable' mode (-e), along with dev tools and jupyter notebook functionality:
python -m pip install -e ".[all]"
Alternatively, use Anaconda or Miniconda to create a Python environment using the included environment.yml:
conda env create -f environment.yml
conda activate cyto_ml
Next install this package without dependencies:
python -m pip install --no-deps -e .
We use exiftool to write basic metadata (latitude/longitude of observation, plus timestamp) into individual plankton images extracted from the larger "collage" format that the FlowCam microscope exports them in.
Guidance for installing exiftool
Ubuntu: sudo apt install libimage-exiftool-perl
CentOS: sudo yum install perl-Image-ExifTool (provided by the EPEL repository)
Or in an environment without root access:
git clone https://github.com/exiftool/exiftool.git
export PATH="$PATH:$PWD/exiftool"
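As a rough illustration of the metadata-writing step described above, the sketch below builds an exiftool command line for one image. It is not necessarily how this package calls exiftool: the function name `build_exiftool_command`, the file name, and the coordinates are hypothetical, and the choice of the `GPSLatitude`, `GPSLongitude` and `DateTimeOriginal` tags is an assumption about which fields are written.

```python
from datetime import datetime, timezone


def build_exiftool_command(image_path, latitude, longitude, observed_at):
    """Return the argv list for an exiftool call that tags one image."""
    # EXIF stores coordinates as unsigned values plus a hemisphere reference.
    lat_ref = "N" if latitude >= 0 else "S"
    lon_ref = "E" if longitude >= 0 else "W"
    # EXIF timestamps use colons in the date part: YYYY:MM:DD HH:MM:SS.
    timestamp = observed_at.strftime("%Y:%m:%d %H:%M:%S")
    return [
        "exiftool",
        "-overwrite_original",  # do not keep a *_original backup copy
        f"-GPSLatitude={abs(latitude)}",
        f"-GPSLatitudeRef={lat_ref}",
        f"-GPSLongitude={abs(longitude)}",
        f"-GPSLongitudeRef={lon_ref}",
        f"-DateTimeOriginal={timestamp}",
        image_path,
    ]


command = build_exiftool_command(
    "plankton_0001.tif",  # hypothetical file name
    54.01,
    -2.77,
    datetime(2023, 6, 1, 12, 30, tzinfo=timezone.utc),
)
print(command)
# To actually write the tags (requires exiftool on PATH):
# import subprocess; subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with the space inside the timestamp value.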
.env contains environment variable names for S3 connection details for the JASMIN object store. Fill these in with your own credentials. If you're not sure what the AWS_URL_ENDPOINT should be, please reach out to one of the project contributors listed below.
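A minimal sketch of how code might check these settings before connecting, assuming the variables have already been exported into the process environment. Only `AWS_URL_ENDPOINT` is named in this README; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are assumed names, and `load_s3_settings` is a hypothetical helper, not part of this package.

```python
import os


def load_s3_settings(environ=os.environ):
    """Collect S3 connection settings, failing early if any are missing."""
    # Only AWS_URL_ENDPOINT appears in this README; the two key names are assumed.
    required = ["AWS_URL_ENDPOINT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in required}


# Example with placeholder values instead of the real environment.
example_environ = {
    "AWS_URL_ENDPOINT": "https://object-store.example",
    "AWS_ACCESS_KEY_ID": "placeholder-key",
    "AWS_SECRET_ACCESS_KEY": "placeholder-secret",
}
settings = load_s3_settings(example_environ)
print(sorted(settings))
```

Failing early with the list of missing names tends to be easier to debug than a connection error raised later by the S3 client.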
Run the tests with pytest or py.test.
scripts/intake_metadata.py is a proof of concept that creates a configuration file for an intake catalogue - a utility to make reading analytical datasets into analysis workflows more reproducible and less effortful.
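To give a sense of what such a configuration file can look like, the sketch below renders a one-source catalogue in the YAML layout used by classic (pre-2.0) intake. It is an illustration only and may differ from what scripts/intake_metadata.py produces: the helper `render_catalogue`, the source name, the bucket path, and the choice of the `csv` driver are all assumptions.

```python
def render_catalogue(source_name, urlpath, description):
    """Return a minimal intake catalogue (YAML text) with one CSV source."""
    return (
        "sources:\n"
        f"  {source_name}:\n"
        f'    description: "{description}"\n'
        "    driver: csv\n"
        "    args:\n"
        f'      urlpath: "{urlpath}"\n'
    )


catalogue = render_catalogue(
    "plankton",  # hypothetical source name
    "s3://example-bucket/metadata.csv",  # hypothetical location
    "Example plankton image metadata",
)
print(catalogue)
```

With intake installed, a file containing this text could typically be opened with `intake.open_catalog("catalogue.yml")`, after which the source is read by name rather than by hard-coded path.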
Experimental workflows test this plankton model from SciVision by using it to extract features from images for use in similarity search, clustering, etc.
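Similarity search over extracted features can be sketched with plain cosine similarity. The example below is a toy illustration, not this project's implementation: the three-dimensional vectors and image names are made up, whereas features from a real model would have many more dimensions and would usually be held in a vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, embeddings, top_k=2):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up feature vectors keyed by hypothetical image names.
embeddings = {
    "copepod_a": [0.9, 0.1, 0.0],
    "copepod_b": [0.8, 0.2, 0.1],
    "diatom_a": [0.0, 0.1, 0.9],
}
results = most_similar([1.0, 0.0, 0.0], embeddings)
print([name for name, _ in results])
```

Cosine similarity ignores vector length and compares direction only, which is a common default for image embeddings; clustering would operate on the same vectors.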
The notebooks/ directory contains Markdown (.md) representations of the notebooks.
To create Jupyter notebooks (.ipynb), run the following command with the conda environment activated:
jupytext --sync notebooks/*
If you modify the contents of a notebook, run the command after closing the notebook to re-sync the .ipynb and .md representations before committing.
For more information see the Jupytext docs.
A Streamlit app, based on the one built for text embeddings of EIDC catalogue metadata:
streamlit run src/cyto_ml/visualisation/app.py
The demo should automatically open in your browser when you run streamlit. If it does not, connect using: http://localhost:8501.
See the Object Store API project - a RESTful interface to manage a data collection held in S3 object storage.
- DVC with s3 condensed walkthrough as part of the LLM evaluation project - complete this up to dvc remote modify... to set up the s3 connection.
- Importing external data: Avoiding duplication - is it this pattern?
DAG / pipeline elements
Jo Walsh, Alba Gomez Segura, Ezra Kitson
Please see CONTRIBUTING.md