Review add more covariates #32

hansvancalster · 2024-09-11T15:23:47Z

Some things need further consideration:

- Combining metadata with diversity data creates zeroes (zero observed species), but it is unclear from the metadata whether the eDNA pipeline succeeded for all primerset x group combinations. If not, then some zeroes will need to be changed to NA.
- Currently, physicochemical data are added to models containing all land use categories, but maybe it makes more sense to look at these covariates within a single land use category.

slambrechts · 2024-11-26T17:19:41Z

Some things need further consideration:

Combining metadata with diversity data creates zeroes (zero observed species), but it is unclear from the metadata whether the eDNA pipeline succeeded for all primerset x group combinations. If not, then some zeroes will need to be changed to NA.

The eDNA pipeline succeeded for all primerset x group combinations that are in mbag_combined_dataframe. I don't see any samples with observed equal to 0 in that dataframe, but that makes me realize that, for the data from the universal 18S primers, such samples are automatically removed when subsetting for the different groups that this primerset recovers. For example, physeq <- subset_taxa(physeq, phylum == "annelida") removes all samples that do not contain annelida, and the resulting subset was converted to a tidytacos object and used as input for the 18S annelida rows in this combined dataframe. So indeed, I think these should be zeros and not NA, unless you mean something different?
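As a minimal sketch of how samples dropped by the taxon subset could be restored as zeros: the table below is invented for illustration, and the column names (sample_id, group, observed) and sample labels are assumptions, not the actual layout of mbag_combined_dataframe. It uses base R only.

```r
# Hypothetical list of all samples for which the 18S pipeline succeeded.
all_samples <- c("S1", "S2", "S3", "S4")

# Hypothetical richness table after subsetting to annelida:
# sample S3 contained no annelida and was dropped by the subset.
richness <- data.frame(
  sample_id = c("S1", "S2", "S4"),
  group = "annelida",
  observed = c(5L, 2L, 7L),
  stringsAsFactors = FALSE
)

# Build the full sample x group grid and left-join the richness onto it.
grid <- expand.grid(
  sample_id = all_samples,
  group = unique(richness$group),
  stringsAsFactors = FALSE
)
completed <- merge(grid, richness, by = c("sample_id", "group"), all.x = TRUE)

# Sequenced samples without any taxa of the group get richness 0, not NA.
completed$observed[is.na(completed$observed)] <- 0L
completed
```

This only makes sense when the pipeline is known to have run for every sample in all_samples; otherwise the missing rows should stay NA.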

currently physicochemical data are added to models containing all land use categories, but maybe it makes more sense to look at these covariates within a single land use category

@hansvancalster is this because potential collinearity between land-use category and the physicochemical variables could mask the effects of these variables in the model? I assume we can test this in #28? Or should we not model this across all land-use categories, regardless of potential collinearity, because relationships between physicochemical variables and biodiversity may vary across land-use categories, making stratified analysis more ecologically meaningful?

Updated combined dataframe: v2

Silke updated the combined dataframe with data from the new primers (inseKP) we tested, which includes Arthropoda, and also the Bacteria, Fungi and Nematoda data from ILVO:

file.path(
    mbag_bodem_folder,
    "data", "statistiek", "dataframe_overkoepelend",
    "mbag_combined_dataframe_v2.csv")

Updated metadata: v2_cleaned_13

Also, we updated the metadata to MBAG_stratfile_v2_cleaned_13.csv, since we discovered some Heide and Moeras plots which should be removed from the MBAG analysis:

file.path(
    mbag_bodem_folder,
    "data", 
    "Stratificatie_MBAG_plots",
    "MBAG_stratfile_v2_cleaned_13.csv")

- exclude Moeras

hansvancalster · 2024-11-27T10:15:30Z

@hansvancalster is this because potential collinearity between land-use category and the physicochemical variables could mask the effects of these variables in the model?

I see the role of these variables more as "controlling for their effects". There is in most cases substantial overlap of ranges of observed values across land-use types, but the mean / bulk of the distribution may differ. For instance, the "natuurgraslanden" are mainly at low pH whereas other types are at somewhat higher pH. When adding pH, we can make predictions for land-use type and depth conditional on a specific value of pH, making the comparison between land use types and depth categories more reliable in the sense of being closer to the analogue of a designed experiment where we could have excluded such not-of-interest factors by design.
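The idea of predicting for each land-use type conditional on a fixed pH could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data are simulated, the land-use labels and effect sizes are invented, and a plain Poisson GLM stands in for whatever model the Rmd actually fits.

```r
set.seed(42)

# Simulated data: three illustrative land-use types whose pH distributions
# overlap but have different means (lowest for natuurgrasland).
n <- 120
landuse <- factor(rep(c("cropland", "grassland", "natuurgrasland"), each = n / 3))
ph <- rnorm(n, mean = c(6.5, 6.0, 5.0)[as.integer(landuse)], sd = 0.6)
observed <- rpois(
  n,
  lambda = exp(2 + 0.2 * (ph - 6) + 0.3 * (landuse == "natuurgrasland"))
)
sim <- data.frame(landuse, ph, observed)

# Land use is the effect of interest; pH is added to control for its effect.
fit <- glm(observed ~ landuse + ph, family = poisson, data = sim)

# Predict expected richness per land-use type at one common pH value,
# so the comparison is not confounded by differences in pH.
newdata <- data.frame(
  landuse = factor(levels(sim$landuse), levels = levels(sim$landuse)),
  ph = 6
)
newdata$predicted <- predict(fit, newdata = newdata, type = "response")
newdata
```

The reference value ph = 6 is arbitrary here; in practice one would pick a value that lies within the observed range of every land-use type.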

I assume we can test this in #28?

It certainly should be explored whether there could be important collinearity issues, but also the output of the models can signal this through diagnostic checks.
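One simple way to explore this, sketched under assumptions, is to regress the covariate on land use and turn the resulting R-squared into a variance inflation factor, alongside a look at the per-category ranges. The data are simulated and the land-use labels are invented; this is not the diagnostic used in the analysis.

```r
set.seed(1)

# Simulated pH values for three illustrative land-use types.
landuse <- factor(rep(c("cropland", "grassland", "natuurgrasland"), each = 40))
ph <- rnorm(120, mean = c(6.5, 6.0, 5.0)[as.integer(landuse)], sd = 0.6)

# Auxiliary regression: how much of the variation in pH is explained by land use?
aux <- lm(ph ~ landuse)
r2 <- summary(aux)$r.squared

# Variance inflation factor for pH given land use: 1 / (1 - R^2).
# Values close to 1 indicate little collinearity.
vif_ph <- 1 / (1 - r2)

# Range of pH per land-use type, to check how much the ranges overlap.
ranges <- tapply(ph, landuse, range)

vif_ph
ranges
```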

Or should we not model this across all land-use categories, regardless of potential collinearity, because relationships between physicochemical variables and biodiversity may vary across land-use categories, making stratified analysis more ecologically meaningful?

That was indeed the line of thought I had when proposing this. But on second thought, I think this is of low priority and I prefer sticking to the current model formulation for which I wrote the rationale in my first answer in this comment.

hansvancalster · 2024-11-27T10:20:36Z

I will knit the Rmd file first to check if everything works as expected before I merge.

slambrechts · 2024-11-27T14:24:19Z

I will knit the Rmd file first to check if everything works as expected before I merge.

OK, thanks. Feel free to let us know if it is too heavy to run on a laptop; then we can run it on the HPC.

slambrechts · 2024-12-05T10:59:27Z

The eDNA pipeline succeeded for all primerset x group combinations that are in mbag_combined_dataframe. I don't see any samples with observed equal to 0 in that dataframe, but that makes me realize that, for the data from the universal 18S primers, such samples are automatically removed when subsetting for the different groups that this primerset recovers. For example, physeq <- subset_taxa(physeq, phylum == "annelida") removes all samples that do not contain annelida, and the resulting subset was converted to a tidytacos object and used as input for the 18S annelida rows in this combined dataframe. So indeed, I think these should be zeros and not NA.

This is not only the case for the 18S primers, but also for inseKP (which covers mainly annelida, collembola and arthropoda). There, zero values for, e.g., the inseKP arthropoda subset indicate that arthropoda were proportionally less represented in that sample compared to other samples.

slambrechts · 2024-12-05T11:55:10Z

The eDNA pipeline succeeded for all primerset x group combinations that are in mbag_combined_dataframe. I don't see any samples with observed equal to 0 in that dataframe, but that makes me realize that, for the data from the universal 18S primers, such samples are automatically removed when subsetting for the different groups that this primerset recovers. For example, physeq <- subset_taxa(physeq, phylum == "annelida") removes all samples that do not contain annelida, and the resulting subset was converted to a tidytacos object and used as input for the 18S annelida rows in this combined dataframe. So indeed, I think these should be zeros and not NA.

This is not only the case for the 18S primers, but also for inseKP (which covers mainly annelida, collembola and arthropoda). There, zero values for, e.g., the inseKP arthropoda subset indicate that arthropoda were proportionally less represented in that sample compared to other samples.

But this is not the case for Nematoda - 18s - asv. For these data, ILVO selected 80 samples and compared two different methods (eDNA and nema-extract). So here the metabarcoding pipeline was not run for all samples, and the zeros should be NA. We need to change this in the script. In addition, the eDNA and nema-extract datasets should be analysed separately, which we need to change in the combined (overkoepelend) dataframe.
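A minimal sketch of that correction, assuming a long-format table and a known list of samples that ILVO processed: the column names, dataset labels and sample IDs below are hypothetical and do not come from mbag_combined_dataframe_v2.csv.

```r
# Hypothetical excerpt of a combined dataframe in long format.
combined <- data.frame(
  sample_id = rep(c("S1", "S2", "S3"), times = 2),
  dataset = rep(c("annelida_18S", "nematoda_18S_asv"), each = 3),
  observed = c(4L, 0L, 6L, 12L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Hypothetical subset of samples for which the Nematoda pipeline was run.
processed_nematoda <- c("S1", "S2")

# Rows of the Nematoda dataset whose sample was never processed.
not_run <- combined$dataset == "nematoda_18S_asv" &
  !(combined$sample_id %in% processed_nematoda)

# Those zeros carry no information and become NA; zeros for processed
# samples and for the other datasets are kept as true zeros.
combined$observed[not_run] <- NA_integer_
combined
```

Separating the eDNA and nema-extract results would need an extra method column in the same table, which this sketch does not cover.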

slambrechts · 2024-12-05T16:40:56Z

source/rmarkdown/analyses_diversity/analyses_diversity.Rmd

+sum(diversiteit$observed == 0)
+```
+
+We need to add these zero observations back (observed then becomes 0, but Shannon and Simpson are then undefined).
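To illustrate why observed can be 0 while Shannon and Simpson are undefined, here is a small sketch with a hypothetical helper, diversity_indices(). It assumes the natural-log Shannon index and the Gini-Simpson form (1 minus the sum of squared proportions), which may differ from the definitions used in the Rmd.

```r
# Hypothetical helper: richness, Shannon and Gini-Simpson for one sample.
diversity_indices <- function(counts) {
  total <- sum(counts)
  observed <- sum(counts > 0)
  if (total == 0) {
    # No reads at all: proportions cannot be computed, so only richness is defined.
    return(c(observed = 0, shannon = NA_real_, simpson = NA_real_))
  }
  p <- counts[counts > 0] / total
  c(
    observed = observed,
    shannon = -sum(p * log(p)),
    simpson = 1 - sum(p^2)
  )
}

diversity_indices(c(0, 0, 0))   # richness 0, both indices NA
diversity_indices(c(10, 10, 0)) # richness 2, Shannon = log(2), Simpson = 0.5
```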


@hansvancalster this is not the case for the Nematoda - 18s - asv data from ILVO, since that is an unusual case; see #32 (comment).

In general, I realize this can differ per primerset (primers that target more than one group versus group-specific primers) and per sample (for some samples the total read count across everything a primer captures is zero, in which case I think we should assume the eDNA pipeline in the lab failed). I will open an issue for this and investigate per primerset.

hansvancalster · 2024-12-06

Thanks for documenting this in an issue. One quick question: could the total read count for a sample be zero because it is completely denuded of everything the primer targets? I'm thinking of pesticide misuse and other forms of pollution. If that could be the case, we should be careful in attributing it to a failed eDNA pipeline. Are there ways to know this?

slambrechts · 2024-12-06

Good question. I know that many people (e.g. ILVO and the LUCAS project) resequence a sample if it has fewer than 50K total reads across everything a primer captures, but I have often wondered whether total read count is really random in general. I will continue the discussion in https://github.com/slambrechts/INBO_eDNA_metabarcoding_BODEM/issues/238
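A sketch of how samples could be flagged by total read count per primerset, using the 50K resequencing threshold mentioned above. The table, its column names and the status labels are invented for illustration. The flag only marks samples for inspection and does not decide whether a zero reflects a failed pipeline or a genuine absence.

```r
# Hypothetical total read counts per sample for one primerset.
reads <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  total_reads = c(120000, 0, 30000),
  stringsAsFactors = FALSE
)

# Resequencing threshold mentioned in the thread (50K reads).
min_reads <- 50000

# Flag samples without any reads and samples below the threshold.
reads$status <- ifelse(
  reads$total_reads == 0,
  "zero_reads",
  ifelse(reads$total_reads < min_reads, "low_depth", "ok")
)
reads
```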

hansvancalster added 4 commits September 3, 2024 09:59

rename variables

a9e7c78

add missing library

9bcafe2

add samples with total_count zero (no observed taxa)!

f4fd889

various improvements

a3467ce

slambrechts approved these changes Nov 26, 2024


hansvancalster added 2 commits November 27, 2024 10:48

update version combined dataframe and metadata

e14963f

- exclude Moeras

split plot by diepte

4352f60

hansvancalster added 2 commits November 27, 2024 18:28

improve selection of family

3a16ab4

avoid NA propagation

fed38c3

hansvancalster merged commit fd07eb5 into add_SWCvol_Cdensity_etc_to_model_observed_richness Nov 28, 2024
1 check failed

hansvancalster deleted the review_add_more_covariates branch November 28, 2024 09:46

slambrechts mentioned this pull request Dec 5, 2024

Update fyschem #41

Merged

slambrechts reviewed Dec 5, 2024


