Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review add more covariates #32

Conversation

hansvancalster
Copy link
Collaborator

Some things need further consideration:

  • combining metadata with diversity data creates zeroes (zero observed species), but it is unclear from metadata whether are eDNA pipeline succeeded for all primersets, group combinations. If not, than some zeroes will need to be changed to NA
  • currently physicochemical data are added to models containing all land use categories, but maybe it makes more sense to look at these covariates within a single land use category

@slambrechts
Copy link
Collaborator

slambrechts commented Nov 26, 2024

Some things need further consideration:

  • combining metadata with diversity data creates zeroes (zero observed species), but it is unclear from metadata whether are eDNA pipeline succeeded for all primersets, group combinations. If not, than some zeroes will need to be changed to NA

The eDNA pipeline succeeded for all primersets x group combinations that are in mbag_combined_dataframe. I don't see any samples with observed is 0 in that dataframe, but that makes me realize that for the data from the universal 18S primers, these are automatically removed when subsetting for the different groups that this primerset recovers. For example physeq <- subset_taxa(physeq, phylum == "annelida" removes all samples that do not contain annelida, and the resulting subset was converted to a tidytacos object and used as input for the 18S annelida rows in this combined dataframe. So indeed, I think these should be zeros and not NA, unless you mean something different?

  • currently physicochemical data are added to models containing all land use categories, but maybe it makes more sense to look at these covariates within a single land use category

@hansvancalster is this because potential collinearity between land-use category and the physicochemical variables could mask the effects of these variables in the model? I assume we can test this in #28? Or should we not model this across all land-use categories, regardless of potential collinearity, because relationships between physicochemical variables and biodiversity may vary across land-use categories, making stratified analysis more ecologically meaningful?

Updated combined dataframe: v2

Silke updated the combined dataframe with data from the new primers (inseKP) we tested, which includes Arthropoda, and also the Bacteria, Fungi and Nematoda data from ILVO:

file.path(
    mbag_bodem_folder,
    "data", "statistiek", "dataframe_overkoepelend",
    "mbag_combined_dataframe_v2.csv")

Updated metadata: v2_cleaned_13

Also, we updated the metadata to MBAG_stratfile_v2_cleaned_13.csv, since we discovered some Heide and Moeras plots which should be removed from the MBAG analysis:

file.path(
    mbag_bodem_folder,
    "data", 
    "Stratificatie_MBAG_plots",
    "MBAG_stratfile_v2_cleaned_13.csv")

@hansvancalster
Copy link
Collaborator Author

@hansvancalster is this because potential collinearity between land-use category and the physicochemical variables could mask the effects of these variables in the model?

I see the role of these variables more as "controlling for their effects". There is in most cases substantial overlap of ranges of observed values across land-use types, but the mean / bulk of the distribution may differ. For instance, the "natuurgraslanden" are mainly at low pH whereas other types are at somewhat higher pH. When adding pH, we can make predictions for land-use type and depth conditional on a specific value of pH, making the comparison between land use types and depth categories more reliable in the sense of being closer to the analogue of a designed experiment where we could have excluded such not-of-interest factors by design.

I assume we can test this in #28?

It certainly should be explored whether there could be important collinearity issues, but also the output of the models can signal this through diagnostic checks.

Or should we not model this across all land-use categories, regardless of potential collinearity, because relationships between physicochemical variables and biodiversity may vary across land-use categories, making stratified analysis more ecologically meaningful?

That was indeed the line of thought I had when proposing this. But on second thought, I think this is of low priority and I prefer sticking to the current model formulation for which I wrote the rationale in my first answer in this comment.

@hansvancalster
Copy link
Collaborator Author

I will knit the Rmd file first to check if everything works as expected before I merge.

@slambrechts
Copy link
Collaborator

I will knit the Rmd file first to check if everything works as expected before I merge.

Ok merci, laat gerust weten indien te zwaar voor op een laptop, dan kunnen we het op de HPC lopen

@hansvancalster hansvancalster merged commit fd07eb5 into add_SWCvol_Cdensity_etc_to_model_observed_richness Nov 28, 2024
1 check failed
@hansvancalster hansvancalster deleted the review_add_more_covariates branch November 28, 2024 09:46
@slambrechts
Copy link
Collaborator

slambrechts commented Dec 5, 2024

The eDNA pipeline succeeded for all primersets x group combinations that are in mbag_combined_dataframe. I don't see any samples with observed is 0 in that dataframe, but that makes me realize that for the data from the universal 18S primers, these are automatically removed when subsetting for the different groups that this primerset recovers. For example physeq <- subset_taxa(physeq, phylum == "annelida" removes all samples that do not contain annelida, and the resulting subset was converted to a tidytacos object and used as input for the 18S annelida rows in this combined dataframe. So indeed, I think these should be zeros and not NA

This is not only the case for the 18S primers, but also for inseKP (that cover mainly annelida, collembola, arthropoda), where e.g. zero values for e.g. the inseKP arthropoda subset indicate that arthropoda were proportionally less represented for that sample compared to other samples

@slambrechts
Copy link
Collaborator

The eDNA pipeline succeeded for all primersets x group combinations that are in mbag_combined_dataframe. I don't see any samples with observed is 0 in that dataframe, but that makes me realize that for the data from the universal 18S primers, these are automatically removed when subsetting for the different groups that this primerset recovers. For example physeq <- subset_taxa(physeq, phylum == "annelida" removes all samples that do not contain annelida, and the resulting subset was converted to a tidytacos object and used as input for the 18S annelida rows in this combined dataframe. So indeed, I think these should be zeros and not NA

This is not only the case for the 18S primers, but also for inseKP (that cover mainly annelida, collembola, arthropoda), where e.g. zero values for e.g. the inseKP arthropoda subset indicate that arthropoda were proportionally less represented for that sample compared to other samples

But this is not the case for Nematoda - 18s - asv. For this data, ILVO selected 80 samples and compared two different methods (eDNA and nema-extract). So here the metabarcoding pipeline was not run for all samples, and zeros should be NA. This we need to edit in the script. In addition, the eDNA and nema-extract datasets should be analysed separately. This we need to edit in the overkoepelend dataframe

@slambrechts slambrechts mentioned this pull request Dec 5, 2024
sum(diversiteit$observed == 0)
```

Deze nulwaarnemingen moeten we terug toevoegen (observed wordt dan 0, maar Shannon en Simpson zijn dan niet gedefinieerd).
Copy link
Collaborator

@slambrechts slambrechts Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hansvancalster this is not the case for the Nematoda - 18s - asv data from ILVO, since that is an unusual case see #32 (comment)

and in general, I realize this can differ per primerset (primers that target more than one group vs group-specific), and sample (for some samples total read count across everything a primer captures is zero, in which case I think we should assume the eDNA pipeline in the lab failed). I will open an issue for this and investigate per primerset

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for documenting this in an issue. One quick question: could a total read count for a sample be zero because it is completely denuded of everything the primer targets? I'm thinking of pesticide misuse and other forms of pollutions. If that could be the case, we should be careful in attributing it to a failed eDNA pipeline. Are there ways to know this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I know that many people (e.g. ILVO and the LUCAS project) resequence a sample if it has less than 50K total reads across everything a primer captures, but I have often wondered whether total read count is really random in general? I will continue the discussion in https://github.com/slambrechts/INBO_eDNA_metabarcoding_BODEM/issues/238

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants