microscopy_IdRemark_2021-03-23.Rmd

---
title: "QU39 Phytoplankton Microscopy OBIS submission"
output: html_notebook
---

This notebook uploads phytoplankton microscopy data from excel based analyst results, which include taxonomic names and abundance counts for each data collection event. In this notebook, only data collected at QU39 are considered: At this station, phytoplankton microscopy samples are collected at a single depth on a weekly to bi-weekly basis. Data collection is ongoing, but here, only data between 2016 and the end of 2018 have been uploaded.

After initial data transformation, the data are converted to tidy (long) format and taxonomic information is matched to taxonomic names within the WoRMs database (using both the obisTools and Worrms R packages). This matching corrects for spelling errors, allows for selection of the appropriate taxonomic information where multiple matches are returned, provides a WoRMs aphiaID and outputs taxonomic hierarchies. Considerable consultation was done with the analyst to ensure that the taxonomic matches corresponded to the original analyst designations. As a result, the final scientificNames have been updated to the most recent accepted names within the WoRMs database.

Care was taken to ensure that matches were done to the lowest taxonomic level possible. At times, taxonomic designations were downgraded to higher taxonomic levels due to analyst uncertainties. These downgrades are documented within the code below. Additionally, identificationRemarks were used to explain decisions and analyst notes.


```{r}
#Load packages - Ran script with some packages turned off and it work, but need to try with more
library(tidyverse)
library(janitor) #Where do I use this? Test running without
library(hutils) #Used to drop empty columns 
library(obistools) #Used for initial species matches to WoRMs - good for cleaning taxonomic names
library(worrms) #Used to derive accepted names and taxonomic hierarchies
library(curl) #??? Test running without.
library(stringr) #Used to remove white-space after 
library(here)

#library('taxizesoap') #not working. 
#library(fuzzyjoin) #I don't know if I use this anymore
#library(devtools) #This was to try to get taxizesoap to work, but think it is unused.
#library(remotes) #I think this was to try to get taxizesoap to work, but think it is unused.
#library(XML) #I think this was to try to get taxizesoap to work, but think it is unused.


```

```{r}

#Uploading and wrangling

#This step needs to be updated to more efficiently upload the raw data from analyst when new data is received. I am still doing quite a bit of manual wrangling in Excel to standardize formatting, but some of this could likely be automated. It is difficult as there are always slight chances in every new sheet sent by the analyst.

#I need to update the upload to use "here"

#Uploads data - the way I have it formatted ensures the data columns are kept as numeric.
micro_qu39 <- read.csv(here("QU39_v1.2.csv"), skip = 2)


#Removes phytoplankton group designations by the analyst. Removed because when kept in, it makes the columns character rather then numeric after the transpose performed in the next step.
micro_qu39 <- micro_qu39[,-1]

#Saving the analyst groupings as they may be useful for analysis.
analyst_groups <- micro_qu39[,1]

#Transposes and makes species names column headers and Hakai IDs row names - allows data columns to remain as numeric
micro_qu39 <- setNames(data.frame(t(micro_qu39[ , -1])), micro_qu39 [ , 1])

#Pushes the hakaiID rownames into a column with a header.
micro_qu39 <- rownames_to_column(micro_qu39, "hakai_id")

#Converts data table into a tibble.
micro_qu39 <- as_tibble(micro_qu39)

#Saving columns with no counts/observations. These exist because I manually standardized all the data in the uploaded excel data-sheet so that I could easily combine them. As a result, taxonomic designations exist where occurrences existed in other data-sets not used here.  
dropped_col_list <- micro_qu39 %>%
  select_if(~(all(is.na(.)) | all(. == ""))) 
  
#dropping columns with no counts/observations. 
micro_qu39 <- drop_empty_cols(micro_qu39)
```


```{r}
#Here, I work with/manipulate some of the taxonomic designations while they are still in column format (each taxonomic designation is a name of a column). It's easier to rename columns and sum abundance counts in this format when compared to tidy format

#The below step sums taxonomic columns with size designations. As per the analyst, these size separations were qualitative/descriptive and are not useful/descriptive for OBIS upload.

#What about chaetoceros cinctus/radicans restings stages? Sum with chaetoceros spp. (resting). I think can keep and place in identificationRemarks.

micro_qu39 <- micro_qu39 %>% 
  rowwise() %>% 
  mutate(Biddulphiales = sum(`Unidentified centric forms`, 
                   `Unidentified very small centric forms`, na.rm = TRUE),
         Pennales = sum(`Unidentified pennate forms`,
                        `Unidentified very small pennate forms`, na.rm = TRUE),
         `Skeletonema marinoi` = sum(`S. marinoi (small cells)`,
                                    `S. marinoi (large cells)`, na.rm = TRUE),
         `Chaetoceros tenuissimus` = sum(`Chaetoceros tenuissimus (large)`,
                                         `Chaetoceros tenuissimus (small)`, 
                                         na.rm = TRUE),
         `Chaetoceros spp_sum.` = sum(`Chaetoceros spp.`,
                                  `Chaetoeros spp. very small`, na.rm = TRUE ),
         Oligotrichea = sum(`Elongate form`, `Oligotricious forms (large)`,
                            `Oligotricious forms (medium)`, 
                            `Oligotricious forms (small)`, na.rm = TRUE))
         
#Checking that the sums worked.
sum_check <- micro_qu39 %>% 
  select(Biddulphiales, `Unidentified centric forms`, 
         `Unidentified very small centric forms`, Pennales, 
         `Unidentified pennate forms`, `Unidentified very small pennate forms`,
         `Skeletonema marinoi`, `S. marinoi (small cells)`,
         `S. marinoi (large cells)`, `Chaetoceros tenuissimus`, 
         `Chaetoceros tenuissimus (large)`, `Chaetoceros tenuissimus (small)`,
         `Chaetoceros spp_sum.`, `Chaetoceros spp.`, `Chaetoeros spp. very small`,
         Oligotrichea, `Elongate form`, `Oligotricious forms (large)`,
         `Oligotricious forms (medium)`, `Oligotricious forms (small)`)  
  

#Removing size separated columns that were summed into a single column.
micro_qu39 <- micro_qu39 %>% 
  select(-`Unidentified centric forms`, -`Unidentified very small centric forms`,
         -`Unidentified pennate forms`, -`Unidentified very small pennate forms`,
         -`S. marinoi (small cells)`, -`S. marinoi (large cells)`,
         -`Chaetoceros tenuissimus (large)`, -`Chaetoceros tenuissimus (small)`,
         -`Chaetoceros spp.`, -`Chaetoeros spp. very small`, 
         -`Elongate form`, -`Oligotricious forms (large)`, 
         -`Oligotricious forms (medium)`, -`Oligotricious forms (small)`)

#Sum step turned NA's into zeros. Once tidy, I remove NA's as it indicates no observation. This step converts the zeros in the summed columns back into NA's.
micro_qu39 <- micro_qu39 %>% 
  mutate(Biddulphiales = na_if(Biddulphiales, 0),
         Pennales = na_if(Pennales, 0),
         `Skeletonema marinoi` = na_if(`Skeletonema marinoi`, 0),
         `Chaetoceros tenuissimus` = na_if(`Chaetoceros tenuissimus`, 0),
         `Chaetoceros spp_sum.` = na_if(`Chaetoceros spp_sum.`, 0),
         Oligotrichea = na_if(Oligotrichea, 0)) #Had the wrong name here - `Chaetoceros spp_sum.` C/P error

#Below, I rename "unknown" taxonomic designations to the lowest taxonomic level possible. In the original excel sheet, these had a group level classification in the neighboring column, but I removed this as it was interfering with formatting. The work below relies on column position to separate which unknown species is which (i.e. Unknown species vs. Unknown species.1 etc.). It works here, but it could cause errors when more data is included. I wonder if there is anyway to keep the analyst group that they were associated with? Really though, these should stay in the same order unless the analyst adds a new "Unknown species".

#Do I just combine all of the protozoa or do I keep qualifiers?

micro_qu39 <- micro_qu39 %>% 
  rename(Bacillariophyceae  = `Unknown forms`, #Class
         `Protozoa (alpha-starch, Chlorophyta?)` = `Unknown species`, #Biota 
         `Chaetoceros spp.` = `Chaetoceros spp_sum.`, #Genus
         Choanozoa = `Unknown species.1`, #Phylum
         Chrysophyceae = `Unknown species.2`, #Class
         `Dinophyceae (resting stages)` = `Resting stages`, #Class, NVS S1134 resting "spores"
         Dinophyceae = `Unknown species.3`, #Class 
         Euglenozoa = `Unknown species.4`,  #Phylum
         Kinetoplastea = `Unknown species.5`, #Class.
         Protozoa = `Unknown flagellae-bearing cells`, #Biota
         `Protozoa (equatorial groove, Dinophyceae?)` = `Unknown very small species (dino??)`,
         `copepoda (copepodites)` = Copepodids, #Subclass, NVS lifestage S115
         `copepoda (nauplii/ekdysis)` = `Copepod nauplii/ekdysis`, #Subclass, NVS S1130
         `Thalassiosira nordenskioeldii` = `Thalassiosira nordenskioldii`,
         Tintinnina = `Tintinnid forms`) #suborder

#Finding remaining unknown species classifications that I couldn't resolve. Unknown cells cannot be included. Total cells is just a summation of all diatoms and should be excluded.
unknown_Col_list <- micro_qu39 %>% 
  select(starts_with("v") |
           starts_with("unk") |
           starts_with("uni") |
           starts_with("tot"))  
  
#Removes all of the above columns that don't have taxonomic information. 
micro_qu39 <- micro_qu39 %>% 
  select(!(starts_with("v") |
             starts_with("unk") |
             starts_with("uni") |
             starts_with("tot")))
```

```{r}
#Merging data with database, cross checking merge. Making data tidy.

#Upload portal metadata to get location and times. Changed to the newer read_csv. Not sure why I was using the base R version, but kept the line here as it is what I developed everything from. Checked and doesn't seem to be any differences, but need to cross check final product. Also, want to move the file into a files folder OR upload directly from portal using the API, but haven't been able to make this work yet.
#metadata <- read.csv("metadata_phyto.csv")
metadata <- read_csv(here("metadata_phyto.csv"))


#Join with portal metadata based on hakai_id - Caution required as the analyst doesn't provide Hakai ID in data sheets (only station and date), so I need to investigate data sheet errors and ensure I assign them the correct Hakai ID.  
micro_qu39 <- micro_qu39 %>% 
  left_join(metadata)

#reorganizing and eliminating unused columns
micro_qu39 <- micro_qu39 %>% 
  select(date, collected, hakai_id, depth = Depth, line_out_depth, work_area, site_id,
         lat, long, volume, lugols, collected,
         `Achnanthes spp.`:Oligotrichea)

#Find rows where there was no metadata match - This step is a QC check to make sure that all analyst results were matched to metadata.
meta_missing <- micro_qu39 %>%
  filter(is.na(collected))

#Test to see if meta line-out-depth matches data sheet depth.
depth_mismatch <- micro_qu39 %>% 
  mutate(depth_check = if_else(line_out_depth == depth, 1, 0)) %>% 
  select(depth_check) %>% 
  filter(depth_check == 0)

#Making data Tidy 
micro_qu39_tidy <- micro_qu39 %>% 
  pivot_longer(c(`Achnanthes spp.`: Oligotrichea),
               names_to = "species", values_to = "counts")

#With this data, NA's represent zeros, which means species were not observed. Removing as only want observed species.
micro_qu39_tidy <- micro_qu39_tidy %>%
  drop_na(counts)
```

```{r}
#Cleaning up the data

#In this chunk I start working with the taxonomic designations to allow for matches within the WoRMs database. Specifically, non-taxonomic data are removed and qualifiers are moved into separate columns. The latter step involves considerable regular expression work.

#Create a list of designations/names do not have taxonomic information and that I remove in the next step. 
removed_row_list <- micro_qu39_tidy %>%
  filter(str_detect(species, "Eggs") |
         str_detect(species, "pieces") |
         str_detect(species, "Fecal") |
         str_detect(species, "ejecta"))


#Removes designations/names with no taxonomic information from datasheet.
micro_qu39_tidy <- micro_qu39_tidy %>%
  filter(!(str_detect(species, "Eggs") |
           str_detect(species, "pieces") |
           str_detect(species, "Fecal") |
           str_detect(species, "ejecta")))

#This creates regex list that will be used to add these terms to lifeStage column.
lifeStage_regex <- c("resting stages", "resting stage", "nauplii/ekdysis",
             "\\bspore", "Auxospore", "copepodites") %>% 
  str_c(collapse = "|")

#This creates regex list that will be used to add these terms to the identificationRemark column. 
identificationRemark_regex <- c("alpha-starch, Chlorophyta\\?",
                                "equatorial groove, Dinophyceae\\?",
                                "cinctus/radicans") %>% 
  str_c(collapse = "|")

#This creates regex list that will be used to add these terms to the identificationQualifier column. I could likely make the search easier and eliminate [:blank:]spp$ by using stringr::str_trim to remove whitespace before or after.
identificationQ_regex <- c("[:blank:]spp$", "spp\\.", "sp\\.") %>% 
  str_c(collapse = "|")

#This step uses the regular expression lists defined above to find matches within the taxonomic column and for these matches, writes the specified terms to either lifeStage, indentificationQualifier or identificationRemark columns.
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(lifeStage = str_match(species, lifeStage_regex)) %>% 
  mutate(identificationQualifier = str_match(species, identificationQ_regex)) %>% 
  mutate(identificationRemark = str_match(species, identificationRemark_regex))

#Now that I have taken these terms that were included with the taxonomic information and put them in the correct columns, I now fix rename the lifeStages to match accepted vocabulary and fix a weird identificationQualifier spp that has white space making regex work difficult. I could have fixed these earlier when I was working with the data in the column format. 
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(lifeStage = str_replace(lifeStage, "spore", "spores")) %>%
  mutate(lifeStage = str_replace(lifeStage, "resting stages", "resting spores")) %>%
  mutate(lifeStage = str_replace(lifeStage, "resting stage", "resting spores")) %>%
  mutate(lifeStage = str_replace(lifeStage, "nauplii/ekdysis", "nauplii")) %>% 
  mutate(identificationQualifier = str_replace(identificationQualifier,
                                               "[:blank:]spp$", "spp."))
  
#This step/sheet is a quality check to ensure that all of the lifeStage-taxonomic combinations are present 
lifeStage_check <- micro_qu39_tidy %>% 
  distinct(across(c(species, lifeStage))) %>% 
  filter(!is.na(lifeStage)) 

#Check that all the identificationQualifier combinations are present
identificationQ_check <- micro_qu39_tidy %>% 
  distinct(across(c(species, identificationQualifier))) %>% 
  filter(!is.na(identificationQualifier)) 

#Check that all of the identificationRemark combinations are present
identificationRemark_check <- micro_qu39_tidy %>% 
  distinct(across(c(species, identificationRemark))) %>% 
  filter(!is.na(identificationRemark)) 

#Adding spp. where I moved cinctus/radicans to identificationRemark. The last term in the case_when function makes sure that the other identificationQualifiers are not overwritten.
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(identificationQualifier = case_when(identificationRemark == 
                                               "cinctus/radicans" ~ "spp.",
                                             TRUE ~ as.character
                                             (identificationQualifier)))

#Remove all qualifier information from taxonomic/species name column so that only taxonomic information remains.
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(species = str_replace(species, "\\(resting stages\\)", "")) %>% 
  mutate(species = str_replace(species, "resting stage", "")) %>%
  mutate(species = str_replace(species, "resting stages", "")) %>%
  mutate(species = str_replace(species, "Auxospore", "")) %>%
  mutate(species = str_replace(species, "\\(copepodites\\)", "")) %>%
  mutate(species = str_replace(species, "\\(nauplii/ekdysis\\)", "")) %>%
  mutate(species = str_replace(species, "spp\\.", "")) %>%
  mutate(species = str_replace(species, "\\bspp", "")) %>%
  mutate(species = str_replace(species, "sp\\.", "")) %>%
  mutate(species = str_replace(species, "spore forms", "")) %>% 
  mutate(species = str_replace(species, "cinctus/radicans", "")) %>%
  mutate(species = str_replace(species, "\\(elongate form\\)", "")) %>% 
  mutate(species = str_replace(species, "Pterosperma  s", "Pterosperma")) %>% 
  mutate(species = str_replace(species, "\\(equatorial groove, Dinophyceae\\?\\)","")) %>%
  mutate(species = str_replace(species, "\\(alpha-starch, Chlorophyta\\?\\)", "")) %>% 
  mutate(species = str_replace(species, "[:blank:]spp$", ""))

#Adding more information to my identificationRemarks. The current remarks were easy to manipulte in earlier steps, but here at the end of the process, I can make them more encompassing.

#Unknown protozoa that could be Chlorophytes
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(identificationRemark = str_replace(identificationRemark,
                                            "alpha-starch, Chlorophyta\\?", 
"Cells too small to identify. Stained blue indicating alpha starch and could be Chlorophyta"))

#Unknown protozoa that could be Dinophyceae
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(identificationRemark = str_replace(identificationRemark,
                                            "equatorial groove, Dinophyceae\\?", 
"Cells too small to identify. Showed equatorial groove and could be Dinophyceae"))

#Couldn't tell between resting stage cinctus and radicans
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(identificationRemark = str_replace(identificationRemark,
                                            "cinctus/radicans", 
"Unable to distinguish between Chaetoceros cinctus and radicans"))

#Adding reason why classification was stopped at sp., spp according to Horten et al. In theory, shouldn't this Qualifier be put on any occurrence that stops before species level? Based on figure 1 in Horten et al., I think that the qualifier should be added even when there is no sp. or spp. I will add this after the taxonomic hierarchy has been applied so that I can filter for it.
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(identificationQualifier = case_when(identificationQualifier == 
                                               "spp." ~ "spp. Indet.",
                                             identificationQualifier ==
                                               "sp." ~ "sp. Indet.",
                                             TRUE ~ as.character
                                             (identificationQualifier)))

#This step is time consuming to run, but removes any white-space before/after strings created through regex work.
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(species = stringr::str_trim(species, side = "right"))

```

```{r}
#This chunk of code runs the obistools and worrms tools on a truncated distinct species list rather than the entire data-sheet. This step is important because it helps to isolate species match issues on a small data-set. Furthermore, the worrms::wm_record_names provides the accepted scientificName, aphiaID and taxonomic hierarchy - these are not provided by the obistools:match_taxa. 

#For error checking, limit species to unique species names. For whatever reason, this is throwing duplicates - must be something from my above filtering/renaming (i.e. a space after the name?). Not certain yet if it is necessary to fix this as it will depend on if it affects the data when I apply functions to the full dataset, which is not a unique subset. For unique subset, I deal with this later after the match_taxa has been performed - apply distinct to the scientificName output.

unique_list <- micro_qu39_tidy %>%
  distinct(species) %>%
  arrange(species) %>%
  mutate(id = row_number())

#Take unique list of species and search for them within the WoRMs database. Add a unique row number for matching with the original species name list, so that species with poor or no matches can be identified

worm_match <- unique_list$species %>%
  match_taxa() %>%
  mutate(id = row_number())

#Testing to see if only using  worrms:wm_match_taxa provides the same results as first using obistools::match_taxa. It didn't pick up names where there was spelling errors or small differences and I think this was because the fuzzy matching default is FALSE. When I tried turning it on, it threw: Error: (500) Internal Server Error - AphiaRecordsByNames. Conflicted here because what I have works (even if it is more convoluted) and adds additional failsafe in that species names are more thoroughly checked.

# test <- unique_list$species %>%
#   wm_records_names(fuzzy = TRUE) %>%
#   bind_rows() %>%
#   rename(scientificName = scientificname)
# 
# test2 <- worm_match %>%
#   left_join(test, by = "scientificName")


#Join WoRMS species scientificName and scientificID with original species names from datasheet

worm_match <- worm_match %>%
  full_join(unique_list) %>%
  rename(orig_name = species)

#Filter for species where there was a poor match. Cross check matches to see if correct. After cross checking, the near_3 match is incorrect - Pleurotoma gracile, which was matched to Pleurostomum gracile. The near_2 matches are correct.

worm_poor_match <- worm_match %>%
  filter(match_type %in% c("near_2", "near_3", na.rm = "FALSE"))

#Filter for species with no WoRMs matches. See next note.

worm_na_match <- worm_match %>%
  filter(is.na(match_type))

# This step uses the WoRMs scientificName to create a list of distinct species and then it removes species with incorrect or no matches as these complicate the following database search using the worrms::wm_records_name tool. I can't find a match for Meringosphaerae tenerrima, but I could go to genus level for this species. For the other two species, I can't even find a genus level match. The analyst said that these species occur so infrequently and at such low numbers that he suggests just removing them instead of going through the work of adding them. Pleurostomum gracile, Phyllomitus yorkeensis, Meringosphaerae tenerrima
worm_match_clean <- worm_match %>% 
  filter(!(str_detect(orig_name, "Pleurostomum gracile") |  
           str_detect(orig_name, "Phyllomitus yorkeensis") |  
           str_detect(orig_name, "Meringosphaerae tenerrima"))) %>%   
  distinct(scientificName, .keep_all = TRUE) 

#This tool uses the worrms::wm_records_names tool to attach taxonomic hierarchies and accepted scientificNames to the scientificNames derived above from the obistools::match_taxa (provides direct matches, not accepted). Unfortunately, it re-adds duplicates (Appendicularia and Euglenozoa) where I was able to make selections in obistools::match_taxa and I need to manually remove these later. The worrms::wm_record_names is described here: https://cran.r-project.org/web/packages/worrms/worrms.pdf
worm_hierarchy <- worm_match_clean$scientificName %>% 
  wm_records_names() %>% 
  bind_rows() %>% 
  rename(scientificName = scientificname)

#Joins the worrms::wm_records_names output with the obistools::taxon_match output so that unaccepted names can be found and traced back to the analysts original name and other issues can be identified. 
worm_join <- worm_hierarchy %>% 
  full_join(worm_match_clean, by = "scientificName")

#finds duplicate species names created using wormms::wm_name_records. The worrms::wm_name_records tool reintroduces potential duplicate taxonomic designations that were selected and removed in obistools::taxon_match. This list is used to remove/rename duplicated entries when working with the full data-set in the next chunk.
worm_join_dup <- worm_join %>% 
     group_by(scientificName) %>% 
     mutate(num_dups = n(), 
            dup_id = row_number()) %>% 
     ungroup() %>% 
     mutate(is_duplicated = dup_id > 1) %>% 
     filter(num_dups > 1)

#this finds all of the taxonomic designations that are not accepted. After consultation with the analyst, I decided to go forward with the accepted scientificName from the worrms::wm_records_names output. It is suggested that the analyst original name, even if it is a synonym, is retained and used, but I since I cross checked with the analyst,  There is nothing I can do to corrected for uncertain names. 
worm_join_unaccept <- worm_join %>% 
  filter(!status == "accepted")

```

```{r}
#This chunk adds taxonomic information to the full dataset, rather than just a distinct list. The distinct list is still important because the worrms::wm_records_names tool will only run on a small number of records. Therefore, I needed to use the distinct list from above to join to the larger dataset here. I don't really like my naming conventions here.

#Performs obistools::match_taxa on full dataset. When prompted in console, select y, 2, 2. As the output is a single scientificName column, this step also gives an id so that this can be merged back with the metadata in the next step.
complete <- micro_qu39_tidy$species %>% 
  match_taxa() %>% 
  mutate(id = row_number())

#Gives the full data sheet an id for merging with 
micro_qu39_tidy <- micro_qu39_tidy %>% 
  mutate(id = row_number())


#Removes species names that I decided to remove from the data based on the analysts suggestion. I could also do this earlier before the matching is done.
complete <- complete %>% 
  full_join(micro_qu39_tidy) %>% 
  rename(orig_name = species) %>% 
  filter(!(str_detect(orig_name, "Pleurostomum gracile") | 
           str_detect(orig_name, "Phyllomitus yorkeensis") | 
           str_detect(orig_name, "Meringosphaerae tenerrima"))) 

#This removes the double hierarchy matches (as seen in worm_join_dup data-sheet). Consultation was done to ensure the removal of the incorrect match and to leave the correct match. It might be better to do this in the last chunk where I develop this sheet - would probably flow better, rather than jumping back here.
worm_hierarchy_clean <- worm_hierarchy %>% 
  filter(!((scientificName == "Appendicularia" & is.na(authority)) |
          (scientificName == "Dinophysis rotundata" & authority == "Levander, 1894") |
          (scientificName == "Euglenozoa" & is.na(authority))))

#Joins the worrms:wm_records_names derived taxonomic hierarchies and accepted ScientificName (and other information) to the full datasheet with all occurrences. 
complete_join <- complete %>% 
  left_join(worm_hierarchy_clean, by = "scientificName")

#Writing output for cross-check. Do I still needs this?
#write_csv(complete_join, "out_QU39_V1.2_NA_remove.csv")

```

```{r}
#Adding Indet. to all classification that did not go to species level and do not have sp. or spp.

complete_join <- complete_join %>% 
  mutate(identificationQualifier = case_when(!rank == "Species" & 
                                               is.na(identificationQualifier) ~
                                               "Indet.",
                                             TRUE ~ as.character
                                             (identificationQualifier)))
```


```{r}
#In this chunk I work to get the data into the OBIS-DWC format and split it into the event-core and the occurrence and extended measurement of fact (EMoF) extensions. 

#Setting up a large flat table that I will split the Event, Occurrence and EMoF extensions from afterwards.

#Switched from portal date to collected, which includes hours/minutes/seconds in UTC. The portal format looks to be in the correct format for OBIS-DwC. I kept the date (format Year-month-day) for the creation of the eventID, because I thought it was too complex to add the hours/minutes/seconds to the ID
format_obis <- complete_join %>% 
  select(date, eventDate = collected, minimumDepthInMeters = line_out_depth,
         maximumDepthInMeters = line_out_depth, lat, long,
         scientificName = valid_name, scientificNameID, valid_AphiaID, 
         taxonRank = rank, lifeStage, identificationRemark, 
         identificationQualifier, measurementValue = counts)

#The worrms::wm_records_names tool does not output the correct scientificNameID format (not a url), but it does give the correct aphiaID number. The obistools::match_taxa format is correct, but with the unaccepted species aphiaIDs at the end. In this step, I remove the ID numbers at the end of the obistools scientificNameID url and then concatenate the correct ID number from the worrms tool at the end of the url. 
format_obis <- format_obis %>% 
  mutate(scientificNameID_2 = str_replace_all(scientificNameID, "[:digit:]", "")) %>% 
  unite("ScientificNameID_corr", scientificNameID_2, valid_AphiaID, 
        sep = "", remove = FALSE) %>% 
  select(-scientificNameID_2, - scientificNameID, -valid_AphiaID,
         scientificnameID = ScientificNameID_corr)

#Adding required columns (these pertain to the Occurrence table)
format_obis <- format_obis %>% 
  mutate(occuranceStatus = "Present",
         basisOfRecord = "HumanObservation",
         identifiedBy = "Louis Hobson",
         method = "Morphology")

##https://tools.gbif.org/dwca-validator/extension.do?id=dwc:Event
#Creating eventID - An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.
format_obis <- format_obis %>% 
  mutate(eventID = "Hakai_phyto_QU39") %>% 
  mutate(depth_unit = "m") %>% # creating depth unit to added to depth measurement 
  unite("depth_unit", minimumDepthInMeters, depth_unit, #merging depth with depth unit
        sep = "", remove = FALSE) %>%
  unite("eventID", eventID, date, depth_unit, #merging event event ID prefix, date and depth
        sep = "_", remove = FALSE) %>% 
  select(!depth_unit) #removing depth unit column created for ID

#Creating and adding occurrence ID
format_obis <- format_obis %>% 
  group_by(eventID) %>% 
  mutate(occurrence_num = row_number()) %>% #creating sequential # that restarts each new date
  ungroup() %>% 
  unite("occurrenceID", eventID, occurrence_num, #merging event ID with new number 
        sep = "_", remove = FALSE) %>% 
  relocate(occurrenceID, .after = eventID) %>% 
  select(!occurrence_num) #removing sequential # used to build occurrenceID

#Removing cyanobacteria because observation/count highly speculative - could do earlier in process, but nice to have for other work.
format_obis <- format_obis %>% 
  filter(!scientificName == "Cyanobacteria")

#Creating eventCore - pulling columns from flat format_obis table
event <- format_obis %>% 
  select(eventID, eventDate, minimumDepthInMeters,maximumDepthInMeters,
         decimalLongitude = long, decimalLatitude = lat) %>% 
  distinct(eventID, .keep_all = TRUE)
  
#Creating occurrence table pulling columns from flat format_obis table 
occurrence <- format_obis %>%
  select(eventID, occurrenceID, scientificName, 
         scientificnameID, taxonRank, lifeStage, identificationRemark,
         identificationQualifier, occuranceStatus, basisOfRecord,
         identifiedBy, 
         method)

#Creating extended measurement of fact sheet without lifeStage.
emof_no_lifeStage <- format_obis %>%
  select(occurrenceID, measurementValue, lifeStage) %>%
  mutate(measurementType = "Abundance of phytoplankton",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/PU00M00Z/",
         measurementUnit = "cells per litre",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UCPL/")

#Creating EMoF with lifestage included in long (tidy) format.
emof_lifeStage <- format_obis %>% 
  select(occurrenceID, measurementValue, lifeStage) %>% 
  mutate(measurementValue = as.character(measurementValue)) %>% 
  pivot_longer(c(measurementValue:lifeStage),
               names_to = "measurementType", values_to = "measurementValue")

#Removing rows where there is no lifeStage - Don't need NAs
emof_lifeStage <- emof_lifeStage %>% 
  filter(!is.na(measurementValue))

#Creating measurement ID since lifeStage addition creates duplicate occurrenceIDs
emof_lifeStage <- emof_lifeStage %>% 
  group_by(occurrenceID) %>% 
  mutate(meas_num = row_number()) %>% #creating sequential # that restarts each new occID
  ungroup() %>% 
  unite("measurementID", occurrenceID, meas_num, #merging  occID with new numbers 
        sep = "_", remove = FALSE) %>% 
  relocate(measurementID, .after = occurrenceID) %>% 
  select(!meas_num) #removing sequential # used to build measID

#Adding measurementType and MeasurementTypeID to emof_lifestage. 
emof_lifeStage <- emof_lifeStage %>% 
  mutate(measurementType = str_replace(measurementType,
                                 "measurementValue", 
                                 "Abundance of phytoplankton")) %>%
  mutate(measurementTypeID = 
           case_when(measurementType == "Abundance of phytoplankton" ~ 
                       "http://vocab.nerc.ac.uk/collection/P01/current/PU00M00Z/",
                     measurementType == "lifeStage" ~
                       "http://vocab.nerc.ac.uk/collection/P01/current/LSTAGE01/")) %>% 
  mutate(measurementValueID = 
           case_when(measurementValue == "Auxospores" ~
                       "http://vocab.nerc.ac.uk/collection/S11/current/S1117/",
                     measurementValue == "copepodites" ~
                       "http://vocab.nerc.ac.uk/collection/S11/current/S115/",
                     measurementValue == "nauplii" ~
                       "http://vocab.nerc.ac.uk/collection/S11/current/S1130/",
                     measurementValue == "resting spores" ~ #update to "spores"
                       "http://vocab.nerc.ac.uk/collection/S11/current/S1134/",
                     measurementValue == "spores" ~
                       "http://vocab.nerc.ac.uk/collection/S11/current/S1135/")) %>% 
  select(occurrenceID, measurementID, measurementType, measurementTypeID,
         measurementValue, measurementValueID)

#Adding measurement unit for abundance counts and measurementUnitID for measurementUnit
emof_lifeStage <- emof_lifeStage %>% 
  mutate(measurementUnit = 
           case_when(measurementType == "Abundance of phytoplankton" ~
                       "Number per litre")) %>% 
  mutate(measurementUnitID = 
           case_when(measurementType == "Abundance of phytoplankton" ~
                       "http://vocab.nerc.ac.uk/collection/P06/current/UCPL/"))

```

```{r}
#Assigning Louis' original classifications to species list.

#Creating final distinct taxonomic list including all original analyst classifications and worms additions to ensure that all classifications/qualifiers are correct and to send to analyst for cross-check.

occurrence_unique <- complete_join %>%
  select(orig_name, status, unacceptreason, 
         scientificName_accepted = valid_name, rank, kingdom:genus, 
         lifeStage, identificationRemark, identificationQualifier) %>% 
  distinct(orig_name, across(kingdom:identificationQualifier), 
           .keep_all = TRUE) 

occurrence_unique <- occurrence_unique %>% 
  mutate(Louis_class = case_when(class == "Bacillariophyceae" ~ "Bacillariophyta",
                                 phylum == "Chlorophyta" ~ "Chlorophyta",
                                 phylum == "Choanozoa" ~ "Choanoflagellata",
                                 class == "Chrysophyceae" |
                                   class == "Xanthophyceae" ~ "Chrysophyta",
                                 phylum == "Ciliophora" ~ "Ciliophora",
                                 class == "Cryptophyceae" ~ "Cryptophyta",
                                 phylum == "Cyanobacteria" ~ "Cyanobacteria",
                                 class == "Dictyochophyceae" ~ "Dictyophyta",
                                 class == "Dinophyceae" ~ "Dinoflagellata",
                                 class == "Ebriophyceae" ~ "Ebriidea",
                                 phylum == "Euglenozoa" & 
                                   (class == "Euglenoidea" | is.na(class)) ~
                                   "Euglenophyta",
                                 phylum == "Haptophyta" ~ "Haptophyta",
                                 class == "Kinetoplastea" | 
                                   orig_name == "Metromonas simplex" | 
                                   orig_name == "Pseudobodo tremulans" | 
                                   orig_name == "Telonema subtilis" ~ 
                                   "Kinetoplastidea",
                                 phylum == "Arthropoda" | 
                                   phylum == "Chordata" ~ "Metazoa",
                                 class == "Raphidophyceae" ~ "Raphidiophyta",
                                 kingdom == "Protozoa" & is.na(phylum) &
                                   is.na(class) ~ "Unknown")) %>% 
  select(Louis_class, Louis_name = orig_name, status:identificationQualifier) %>% 
  arrange(Louis_class, Louis_name)

#Testing stringr::stg_trim to remove whitespace following species name. Worked - should probably do this somewhere earlier (maybe after regex work)
occurrence_unique <- occurrence_unique %>% 
  mutate(Louis_name = stringr::str_trim(Louis_name, side = "right"))
```

```{r}
#Comparison of portal samples with data with current OBIS data - Trying to see what samples we have missed or are awaiting results for.
data_incorp <- complete_join %>% 
  select(hakai_id) %>% 
  distinct() %>% 
  mutate(obis_status = "uploaded")


#This didn't work very well - too many samples that haven't been analyzed and tough to filter them out - would be better to look at my sent sample inventory and cross compare with that.
# data_assess <- metadata %>% 
#   left_join(data_incorp) %>% 
#   distinct(hakai_id, .keep_all = TRUE) %>% 
#   filter(!is.na(hakai_id) & date > "2016-01-01" & 
#            (line_out_depth == 0 | line_out_depth == 5) &
#            is.na(obis_status))

```

```{r}
#Writing sheets to csv.

#out_path <- "C:\\Users\\justin.belluz\\Documents\\R\\microscopy_OBIS\\output_files"

#Write sheets to file.
write_csv(format_obis, here("output_files", "OBIS_format_flat.csv"))
write_csv(event, here("output_files", "Hakai_phytoplankton_event.csv"))
write_csv(occurrence, here("output_files", "Hakai_phytoplankton_occurrence_idRemark.csv"))
write_csv(emof_no_lifeStage, here("output_files", "Hakai_phytoplankton_emof_no_lifeStage.csv"))
write_csv(emof_lifeStage, here("output_files", "Hakai_phytoplankton_emof_lifeStage.csv"))
write_csv(occurrence_unique, here("output_files", "occurrence_distinct_check_4.csv"))
write_csv(data_incorp, here("output_files", "incorp_hakID.csv"))
```

```{r}
#things to do
#fix time (done)
#add qualifier information (done, but needs cross checks)
```