OpenMined/syft-heart-disease-tutorial

Study Heart Disease using PySyft

Welcome!

In this tutorial we will use PySyft to study heart disease, and by doing so we will try to answer the following question:

Can we run Machine Learning experiments on multiple and distributed medical datasets, without seeing the data?

We are going to learn how! All you need to get started is PySyft and a Jupyter notebook! 🚀

Read more in this post on the OpenMined blog.

Getting Started

1. Download the code locally

Using the git command from the terminal:

$ git clone https://github.com/openmined/syft-heart-disease-tutorial

or by clicking on Code >> Local >> Download ZIP on the repository main page.

2. Install PySyft and ML Packages

The repository includes a requirements.txt file with the list of all the Python packages required to work with the notebooks. You can install all these dependencies using pip:

$ pip install -r requirements.txt

Please refer to the Quick Install guide to learn how to install PySyft.

Note: It is recommended to install PySyft and all its dependencies in a dedicated Python virtual environment, using the environment manager of your choice (e.g. Miniconda, pyenv).

3. Launch the Datasites

Set up and launch the PySyft Datasites using the launch_datasites.py script included in the repository. From the command line:

$ python launch_datasites.py

Note: Please keep the terminal open, as this keeps all the servers running in the background. You can stop all the servers and terminate the program by pressing Ctrl+C.

Table of Contents

  • 🧭 (Intro) Setup Datasites: Familiarise yourself with the data and the Datasites.
  • 📊 1. Compare Demographics: Study the distribution of the demographics in the data, using PySyft.
  • 🤖 2. ML Model Training Experiment: Use PySyft to train a machine learning classifier on data across the four distributed datasites, without seeing the data! (🌟)
  • 📝 3. ML Model Evaluation Experiment: Assess the performance of the trained classifiers on each remote datasite. (🌟🌟)
  • 🗳️ 4. Ensemble Learning Experiment: Create an ensemble from all the models trained remotely and independently on each dataset. We will test this strategy to obtain an ML predictive model that has seen 4x more medical data in training. (🌟🌟🌟)
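
To give a feel for the ensemble idea in step 4, here is a minimal majority-voting sketch in plain Python. It is only an illustration, not the tutorial's implementation: the function name `majority_vote`, the tie-breaking rule, and the toy predictions are all hypothetical, and the notebooks may combine the per-datasite models differently (for example, by averaging predicted probabilities).

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine binary predictions from several models by majority vote.

    predictions_per_model: a list of equal-length lists, one per model,
    each containing 0/1 predictions for the same samples.
    Ties are broken in favour of class 1 (an arbitrary choice for this sketch).
    """
    if not predictions_per_model:
        raise ValueError("at least one model's predictions are required")
    n_samples = len(predictions_per_model[0])
    if any(len(preds) != n_samples for preds in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    # zip(*...) groups the votes of all models for each sample
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        combined.append(1 if counts[1] >= counts[0] else 0)
    return combined


# Hypothetical predictions from four models (one per datasite) on three patients
votes = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
ensemble = majority_vote(votes)
print(ensemble)
```

With four models an even split is possible, which is why the sketch needs an explicit tie-breaking rule; any real experiment should choose that rule deliberately.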

Data Description

We will use the full version of the Heart Disease dataset, available from the UCI Machine Learning Repository.

This database is the result of a study for the diagnosis of coronary artery disease, as presented in this paper.

The full dataset contains data collected from patients at four different hospitals in 1988:

  • Cleveland Clinic in Cleveland, Ohio (303 patients);
  • Hungarian Institute of Cardiology in Budapest, Hungary (425 patients);
  • Veterans Administration Medical Center in Long Beach, California (200 patients);
  • University Hospitals in Zurich and Basel, Switzerland (143 patients).

Each hospital corresponds to a single PySyft Datasite, which hosts that hospital's portion of the Heart Study data.

Notes

This dataset is popular and well known in the data science and machine learning community. However, the Cleveland database is the only one that has actually been used by ML researchers to date [1]. The "target" field refers to the presence of heart disease in the patient. It is integer-valued, from 0 (no presence) to 4. In our machine learning experiments we will treat this as a binary classification problem (presence vs. absence).
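
As an illustration of the binary framing, the sketch below maps the 0 to 4 target score to a presence/absence label. It assumes that any value greater than 0 counts as "presence", which is consistent with 0 meaning "no presence" above; the helper name `binarize_target` and the sample values are hypothetical and not taken from the notebooks.

```python
def binarize_target(target):
    """Map the 0-4 heart disease score to a binary label.

    Returns 0 for absence (score 0) and 1 for presence (scores 1-4).
    Assumption for this sketch: every non-zero score counts as presence.
    """
    if target not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected target value: {target!r}")
    return int(target > 0)


# Hypothetical raw target values for six patients
raw_targets = [0, 2, 1, 0, 4, 3]
binary_targets = [binarize_target(t) for t in raw_targets]
print(binary_targets)
```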

Acknowledgments

The authors of the dataset have requested that any use of the data include the names of the principal investigators responsible for the data collection at each institution. They are:

  • Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
  • University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
  • University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
  • V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Feedback and Support

If you spot any error or mistake, please feel free to reach out directly to me via email, or to open an Issue on the repository.

Any feedback will be very much appreciated! Thank you! 🙏

Any questions about PySyft?

For any technical question, clarification, or request for assistance with PySyft, please consider joining the OpenMined Slack and posting your question in the #support channel.

Colophon

Author: Valerio Maggio (@leriomaggio), Researcher, SSI Fellow, and Education Team @ OpenMined.

All the code is distributed under the terms of the Apache License. See the LICENSE file for additional details.

All the instructional materials in this repository are free to use, and made available under the Creative Commons Attribution license. The following is a human-readable summary of (and not a substitute for) the full legal text of the CC BY 4.0 license.

You are free:

  • to Share: copy and redistribute the material in any medium or format
  • to Adapt: remix, transform, and build upon the material

for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
