Merge pull request #46 from yangchen2/master

remove output of logistic regression plots, move UpSet plots, update README.md, merge tutorial.md
biocore · Aug 1, 2023 · 7c596fc
2 parents 3a65990 + 7a88dbf
commit 7c596fc
Showing 29 changed files with 2,306 additions and 191 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -31,7 +31,7 @@ jobs:
           mamba-version: "*"
           channels: conda-forge,defaults,bioconda
           channel-priority: true
-          python-version: "3.8"
+          python-version: "3.9"
 
       - name: Install conda packages
         shell: bash -l {0}

diff --git a/FAQs.md b/FAQs.md
diff --git a/README.md b/README.md
@@ -1,20 +1,21 @@
 ![Main CI](https://github.com/gibsramen/qadabra/actions/workflows/main.yml/badge.svg)
 
-# Qadabra
+# Qadabra: **Q**uantitative **A**nalysis of **D**ifferential **Ab**undance **Ra**nks
 
-**Q**uantitative **A**nalysis of **D**ifferential **Ab**undance **Ra**nks
+##### (Pronounced *ka-da-bra*)
 
-Qadabra is a Snakemake workflow for comparing the results of differential abundance tools.
-Importantly, Qadabra focuses on feature *ranks* rather than FDR corrected p-values.
+Qadabra is a Snakemake workflow for running and comparing several differential abundance (DA) methods on the same microbiome dataset.
 
-## Installation
+Importantly, Qadabra focuses on both FDR corrected p-values *and* [feature ranks](https://www.nature.com/articles/s41467-019-10656-5) and generates visualizations of differential abundance results.
+
+![Schematic](images/Qadabra_schematic.svg)
 
+## Installation
 ```
 pip install qadabra
 ```
 
 Qadabra requires the following dependencies:
-
 * snakemake
 * click
 * biom-format
@@ -25,122 +26,81 @@ Qadabra requires the following dependencies:
 
 ## Usage
 
-### Creating the workflow structure
+### 1. Creating the workflow directory
 
 Qadabra can be used on multiple datasets at once.
-First, we want to create the worfklow structure to perfrom differential abundance with all tools.
+First, we want to create the workflow directory to perform differential abundance with all methods:
 
 ```
-qadabra create-workflow --workflow-dest my_qadabra
+qadabra create-workflow --workflow-dest <directory_name>
 ```
 
-This command will initialize the workflow but we still need to point to our dataset(s) of interest.
+This command will initialize the workflow, but we still need to point to our dataset(s) of interest.
 
-### Adding a dataset
+### 2. Adding a dataset
 
-We can add datasets one-by-one with the `add-dataset` command.
+We can add datasets one-by-one with the `add-dataset` command:
 
 ```
 qadabra add-dataset \
-    --workflow-dest my_qadabra \
-    --table data/table.biom \
-    --metadata data/metadata.tsv \
-    --name my_dataset_1 \
+    --workflow-dest <directory_name> \
+    --table <directory_name>/data/table.biom \
+    --metadata <directory_name>/data/metadata.tsv \
+    --tree <directory_name>/data/my_tree.nwk \
+    --name my_dataset \
     --factor-name case_control \
     --target-level case \
     --reference-level control \
+    --confounder <confounding_var> \
     --verbose
 ```
 
-Let's walkthrough the arguments provided here:
+Let's walk through the arguments provided here, which represent the inputs to Qadabra:
 
 * `workflow-dest`: The location of the workflow that we created earlier
-* `table`: Feature table (features by samples) in BIOM format
+* `table`: Feature table (features by samples) in [BIOM](https://biom-format.org/) format
 * `metadata`: Sample metadata in TSV format
+* `tree`: Phylogenetic tree in .nwk or other tree format (optional)
 * `name`: Name to give this dataset
 * `factor-name`: Metadata column to use for differential abundance
 * `target-level`: The value in the chosen factor to use as the target
 * `reference-level`: The reference level to which we want to compare our target
+* `confounder`: Any confounding variable metadata columns (optional)
 * `verbose`: Flag to show all preprocessing performed by Qadabra
 
-You can use `qadabra add-dataset --help` for more details.
+Your dataset should now be added as a line in `<directory_name>/config/datasets.tsv`.
+
+You can use `qadabra add-dataset --help` for more details. 
 To add another dataset, just run this command again with the new dataset information.
 
-### Running the workflow
+### 3. Running the workflow
 
 The previous commands will create a subdirectory, `my_qadabra` in which the workflow structure is contained.
-Navigate into this directory; you should see two folders: `config` and `workflow`.
-If you open the `config/config.yaml` file, you can see a number of options with which to run Qadabra.
-You can modify these as you like.
-For example, if you want to only run DESeq2, ANCOM-BC, and Songbird, you can delete the other entries in the `tools` heading.
-
-From the command line, execute `snakemake --use-conda <other options>` to start the workflow.
+From the command line, execute the following to start the workflow:
+```
+snakemake --use-conda --cores <number of cores preferred> <other options>
+```
 Please read the [Snakemake documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html) for how to run Snakemake best on your system.
 
 When this process is completed, you should have directories `figures`, `results`, and `log`.
 Each of these directories will have a separate folder for each dataset you added.
 
-### Generating a report
+### 4. Generating a report
 
-You can also generate a report of the workflow with the following command:
+After Qadabra has finished running, you can generate a Snakemake report of the workflow with the following command:
 
 ```
 snakemake --report report.zip
 ```
 
 This will create a zipped directory containing the report.
-Unzip this file and open the `report.html` file to view the report in your browser.
-
-## Additional workflow options
-
-### Worfklow subset
-
-In some cases you may not want to run the full workflow and may only be interested in just running the different tools.
-You can use `snakemake all_differentials --use-conda <other options>` to eschew the machine learning and visualization parts of the workflow.
-
-### Phylogenetic visualization
-
-Qadabra allows users to visualize the differentials on a phylogenetic tree using [EMPress](https://journals.asm.org/doi/10.1128/mSystems.01216-20).
-With EMPress, you can annotate the tree with the differentials as barplots.
-This can be useful for determining phylogenetic signal in differential abundance.
-
-### Incorporating confounders
-
-You can also specify additional confounders to incorporate into your DA model.
-When adding a dataset, use `--confounder <column name>` to add a confounder into your model.
-You can add multiple confounders by adding more `--confounder <column name>` arguments to `add-dataset`.
-
-## Workflow Overview
-
-Qadabra runs several differential abundance tools on the same dataset.
-The features are ranked according to their association with the given metadata covariate.
-The top and bottom features are then used to create log-ratios according to [Morton 2019](https://doi.org/10.1038/s41467-019-10656-5) and [Fedarko 2020](https://github.com/biocore/qurro).
-These log-ratios are used as predictors in logistic regression models to predict the class given the log-ratio.
-
-### Output
-
-Qadabra generates many results files including many intermediate files that can be explored further.
-
-#### Results
-
-Each tool's output is stored in a separate subdirectory.
-For the R tools, an RDS object with the tool's R data is saved.
-The raw outputs are processed and concatenated into a file called `concatenated_differentials.tsv`.
-A Qurro visualization of all the tool ranks is generated at `results/<dataset>/qurro/index.html`.
-An interactive table with all the tool outputs is at `results/<dataset>/differentials_table.html`.
+Unzip this file and open the `report.html` file to view the report containing results and visualizations in your browser.
 
-For each tool, the ranked features are used for machine learning models.
-The `config.yaml` file enumerates the percentile of feats to use for log-ratios.
-For example, at the 5% percentile, the top 5% of features and the bottom 5% of features associated with `covariate` are used to compute a log-ratio for each sample.
-This log-ratio is used in repeated K-fold cross-validation to determine how well this log-ratio can predict class membership using logistic regression.
-The `ml` subdirectory of each tool contains the features used, sample log-ratios, and compressed model objects.
+## Tutorial
+See the [tutorial](tutorial.md) page for a walkthrough on using the Qadabra workflow with a microbiome dataset.
 
-#### Figures
+## FAQs
+Coming soon: An [FAQs](FAQs.md) page of commonly asked questions on the statistics and code pertaining to Qadabra.
 
-The differential rank plots of each tool are plotted as `<tool_name>_differentials.svg`.
-A heatmap of the pairwise Kendall rank correlation among all pairs of tools is available as well.
-We also generated interactive plots to help compare the ranks of different features from the tools.
-`figures/pca.svg` generates a PCA plot of all the features, showing the concordance and discordance of results as well as the contribution of the tools.
-You can use the `figures/rank_comparisons.html` webpage to dynamically explore the relationship between pairs of tools.
-The `upset` subdirectory contains [UpSet](https://doi.org/10.1109%2FTVCG.2014.2346248) plots comparing the features from each tool.
-Finally, the `roc` and `pr` subdirectories contain ROC and PR (respectively) plots of all tools at each percentile of features.
+## Citation
+The manuscript for Qadabra is currently in progress. Please cite this GitHub page if Qadabra is used for your analysis. This project is licensed under the MIT License. See the [license](LICENSE) file for details.
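
Note on the `add-dataset` step in the README diff above: the removed "Incorporating confounders" section stated that multiple confounders are passed by repeating `--confounder <column name>`, and the new example no longer spells that out. The following is a minimal sketch, not part of Qadabra itself, of how the command line could be assembled so that each confounder gets its own flag. The helper name `build_add_dataset_command` and the example column names `age` and `sex` are hypothetical; only the flag names are taken from the README.

```python
import shlex


def build_add_dataset_command(
    workflow_dest,
    table,
    metadata,
    name,
    factor_name,
    target_level,
    reference_level,
    tree=None,
    confounders=(),
    verbose=False,
):
    """Assemble the argv for `qadabra add-dataset` (hypothetical helper).

    Only flags shown in the README are used. Nothing is executed here;
    the function just returns the list of arguments.
    """
    argv = [
        "qadabra", "add-dataset",
        "--workflow-dest", workflow_dest,
        "--table", table,
        "--metadata", metadata,
        "--name", name,
        "--factor-name", factor_name,
        "--target-level", target_level,
        "--reference-level", reference_level,
    ]
    # The tree is optional according to the README.
    if tree is not None:
        argv += ["--tree", tree]
    # One `--confounder <column name>` pair per confounding column.
    for column in confounders:
        argv += ["--confounder", column]
    if verbose:
        argv.append("--verbose")
    return argv


if __name__ == "__main__":
    # Example values; "age" and "sex" are made-up metadata columns.
    command = build_add_dataset_command(
        workflow_dest="my_qadabra",
        table="my_qadabra/data/table.biom",
        metadata="my_qadabra/data/metadata.tsv",
        name="my_dataset",
        factor_name="case_control",
        target_level="case",
        reference_level="control",
        confounders=["age", "sex"],
        verbose=True,
    )
    # Print a shell-quoted command that can be pasted into a terminal.
    print(shlex.join(command))
```

Printing the command instead of running it keeps the sketch independent of a Qadabra installation; the resulting line can be checked by eye before it is pasted into a shell.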
diff --git a/images/PCA.jpg b/images/PCA.jpg
diff --git a/images/Qadabra_schematic.svg b/images/Qadabra_schematic.svg
diff --git a/images/differential_rank_comparison.jpg b/images/differential_rank_comparison.jpg
diff --git a/images/differentials_table.jpg b/images/differentials_table.jpg
diff --git a/images/gurobi.log b/images/gurobi.log
@@ -0,0 +1,6 @@
+
+Gurobi 8.1.0 (mac64) logging started Tue Jul 11 13:22:37 2023
+
+
+ERROR 10009: No Gurobi license found (user yac027, host knightlab.local, hostid 9fe21529, cores 4)
+