-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathqu39_orig_full_process.Rmd
679 lines (536 loc) · 38.7 KB
/
qu39_orig_full_process.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
---
title: "QU39 Phytoplankton Microscopy OBIS submission"
output: html_notebook
---
This notebook uploads phytoplankton microscopy data from excel based analyst results, which include taxonomic names and abundance counts for each data collection event. In this notebook, only data collected at QU39 are considered: At this station, phytoplankton microscopy samples are collected at a single depth on a weekly to bi-weekly basis. Data collection is ongoing, but here, only data between 2016 and the end of 2018 have been uploaded.
After initial data transformation, the data are converted to tidy (long) format and taxonomic information is matched to taxonomic names within the WoRMs database (using both the obisTools and Worrms R packages). This matching corrects for spelling errors, allows for selection of the appropriate taxonomic information where multiple matches are returned, provides a WoRMs aphiaID and outputs taxonomic hierarchies. Considerable consultation was done with the analyst to ensure that the taxonomic matches corresponded to the original analyst designations. As a result, the final scientificNames have been updated to the most recent accepted names within the WoRMs database.
Care was taken to ensure that matches were done to the lowest taxonomic level possible. At times, taxonomic designations were downgraded to higher taxonomic levels due to analyst uncertainties. These downgrades are documented within the code below. Additionally, identificationRemarks were used to explain decisions and analyst notes.
```{r}
#Load packages - Ran script with some packages turned off and it work, but need to try with more
library(tidyverse)
library(janitor) #Where do I use this? Test running without
library(hutils) #Used to drop empty columns
library(obistools) #Used for initial species matches to WoRMs - good for cleaning taxonomic names
library(worrms) #Used to derive accepted names and taxonomic hierarchies
library(curl) #??? Test running without.
library(stringr) #Used to remove white-space after
library(here)
#library('taxizesoap') #not working.
#library(fuzzyjoin) #I don't know if I use this anymore
#library(devtools) #This was to try to get taxizesoap to work, but think it is unused.
#library(remotes) #I think this was to try to get taxizesoap to work, but think it is unused.
#library(XML) #I think this was to try to get taxizesoap to work, but think it is unused.
```
```{r}
#Uploading and wrangling
#This step needs to be updated to more efficiently upload the raw data from analyst when new data is received. I am still doing quite a bit of manual wrangling in Excel to standardize formatting, but some of this could likely be automated. It is difficult as there are always slight chances in every new sheet sent by the analyst.
#I need to update the upload to use "here"
#Uploads data - the way I have it formatted ensures the data columns are kept as numeric.
micro_qu39 <- read.csv(here("files", "upload_qu39_2016_half2019.csv"), skip = 2)
#Removes phytoplankton group designations by the analyst. Removed because when kept in, it makes the columns character rather then numeric after the transpose performed in the next step.
micro_qu39 <- micro_qu39[,-1]
#Saving the analyst groupings as they may be useful for analysis.
analyst_groups <- micro_qu39[,1]
#Transposes and makes species names column headers and Hakai IDs row names - allows data columns to remain as numeric
micro_qu39 <- setNames(data.frame(t(micro_qu39[ , -1])), micro_qu39 [ , 1])
#Pushes the hakaiID rownames into a column with a header.
micro_qu39 <- rownames_to_column(micro_qu39, "hakai_id")
#Converts data table into a tibble.
micro_qu39 <- as_tibble(micro_qu39)
#Saving columns with no counts/observations. These exist because I manually standardized all the data in the uploaded excel data-sheet so that I could easily combine them. As a result, taxonomic designations exist where occurrences existed in other data-sets not used here.
dropped_col_list <- micro_qu39 %>%
select_if(~(all(is.na(.)) | all(. == "")))
#dropping columns with no counts/observations.
micro_qu39 <- drop_empty_cols(micro_qu39)
```
```{r}
#Here, I work with/manipulate some of the taxonomic designations while they are still in column format (each taxonomic designation is a name of a column). It's easier to rename columns and sum abundance counts in this format when compared to tidy format
#The below step sums taxonomic columns with size designations. As per the analyst, these size separations were qualitative/descriptive and are not useful/descriptive for OBIS upload.
#What about chaetoceros cinctus/radicans restings stages? Sum with chaetoceros spp. (resting). I think can keep and place in identificationRemarks.
micro_qu39 <- micro_qu39 %>%
rowwise() %>%
mutate(Biddulphiales = sum(`Unidentified centric forms`,
`Unidentified very small centric forms`, na.rm = TRUE),
Pennales = sum(`Unidentified pennate forms`,
`Unidentified very small pennate forms`, na.rm = TRUE),
`Skeletonema marinoi` = sum(`S. marinoi (small cells)`,
`S. marinoi (large cells)`, na.rm = TRUE),
`Chaetoceros tenuissimus` = sum(`Chaetoceros tenuissimus (large)`,
`Chaetoceros tenuissimus (small)`,
na.rm = TRUE),
`Chaetoceros spp_sum.` = sum(`Chaetoceros spp.`,
`Chaetoeros spp. very small`, na.rm = TRUE ),
Oligotrichea = sum(`Elongate form`, `Oligotricious forms (large)`,
`Oligotricious forms (medium)`,
`Oligotricious forms (small)`, na.rm = TRUE))
#Checking that the sums worked.
sum_check <- micro_qu39 %>%
select(Biddulphiales, `Unidentified centric forms`,
`Unidentified very small centric forms`, Pennales,
`Unidentified pennate forms`, `Unidentified very small pennate forms`,
`Skeletonema marinoi`, `S. marinoi (small cells)`,
`S. marinoi (large cells)`, `Chaetoceros tenuissimus`,
`Chaetoceros tenuissimus (large)`, `Chaetoceros tenuissimus (small)`,
`Chaetoceros spp_sum.`, `Chaetoceros spp.`, `Chaetoeros spp. very small`,
Oligotrichea, `Elongate form`, `Oligotricious forms (large)`,
`Oligotricious forms (medium)`, `Oligotricious forms (small)`)
#Removing size separated columns that were summed into a single column.
micro_qu39 <- micro_qu39 %>%
select(-`Unidentified centric forms`, -`Unidentified very small centric forms`,
-`Unidentified pennate forms`, -`Unidentified very small pennate forms`,
-`S. marinoi (small cells)`, -`S. marinoi (large cells)`,
-`Chaetoceros tenuissimus (large)`, -`Chaetoceros tenuissimus (small)`,
-`Chaetoceros spp.`, -`Chaetoeros spp. very small`,
-`Elongate form`, -`Oligotricious forms (large)`,
-`Oligotricious forms (medium)`, -`Oligotricious forms (small)`)
#Sum step turned NA's into zeros. Once tidy, I remove NA's as it indicates no observation. This step converts the zeros in the summed columns back into NA's.
micro_qu39 <- micro_qu39 %>%
mutate(Biddulphiales = na_if(Biddulphiales, 0),
Pennales = na_if(Pennales, 0),
`Skeletonema marinoi` = na_if(`Skeletonema marinoi`, 0),
`Chaetoceros tenuissimus` = na_if(`Chaetoceros tenuissimus`, 0),
`Chaetoceros spp_sum.` = na_if(`Chaetoceros spp_sum.`, 0),
Oligotrichea = na_if(Oligotrichea, 0)) #Had the wrong name here - `Chaetoceros spp_sum.` C/P error
#Below, I rename "unknown" taxonomic designations to the lowest taxonomic level possible. In the original excel sheet, these had a group level classification in the neighboring column, but I removed this as it was interfering with formatting. The work below relies on column position to separate which unknown species is which (i.e. Unknown species vs. Unknown species.1 etc.). It works here, but it could cause errors when more data is included. I wonder if there is anyway to keep the analyst group that they were associated with? Really though, these should stay in the same order unless the analyst adds a new "Unknown species".
#Do I just combine all of the protozoa or do I keep qualifiers?
micro_qu39 <- micro_qu39 %>%
rename(Bacillariophyceae = `Unknown forms`, #Class
`Protozoa (alpha-starch, Chlorophyta?)` = `Unknown species`, #Biota
`Chaetoceros spp.` = `Chaetoceros spp_sum.`, #Genus
Choanozoa = `Unknown species.1`, #Phylum
Chrysophyceae = `Unknown species.2`, #Class
`Dinophyceae (resting stages)` = `Resting stages`, #Class, NVS S1134 resting "spores"
Dinophyceae = `Unknown species.3`, #Class
Euglenozoa = `Unknown species.4`, #Phylum
Kinetoplastea = `Unknown species.5`, #Class.
Protozoa = `Unknown flagellae-bearing cells`, #Biota
`Protozoa (equatorial groove, Dinophyceae?)` = `Unknown very small species (dino??)`,
`copepoda (copepodites)` = Copepodids, #Subclass, NVS lifestage S115
`copepoda (nauplii/ekdysis)` = `Copepod nauplii/ekdysis`, #Subclass, NVS S1130
`Thalassiosira nordenskioeldii` = `Thalassiosira nordenskioldii`,
Tintinnina = `Tintinnid forms`) #suborder
#Finding remaining unknown species classifications that I couldn't resolve. Unknown cells cannot be included. Total cells is just a summation of all diatoms and should be excluded.
unknown_Col_list <- micro_qu39 %>%
select(starts_with("v") |
starts_with("unk") |
starts_with("uni") |
starts_with("tot"))
#Removes all of the above columns that don't have taxonomic information.
micro_qu39 <- micro_qu39 %>%
select(!(starts_with("v") |
starts_with("unk") |
starts_with("uni") |
starts_with("tot")))
```
```{r}
#Merging data with database, cross checking merge. Making data tidy.
#Upload portal metadata to get location and times. Changed to the newer read_csv. Not sure why I was using the base R version, but kept the line here as it is what I developed everything from. Checked and doesn't seem to be any differences, but need to cross check final product. Also, want to move the file into a files folder OR upload directly from portal using the API, but haven't been able to make this work yet.
#metadata <- read.csv("metadata_phyto.csv")
metadata <- read_csv(here("metadata_phyto.csv"))
#Join with portal metadata based on hakai_id - Caution required as the analyst doesn't provide Hakai ID in data sheets (only station and date), so I need to investigate data sheet errors and ensure I assign them the correct Hakai ID.
micro_qu39 <- micro_qu39 %>%
left_join(metadata)
#reorganizing and eliminating unused columns
micro_qu39 <- micro_qu39 %>%
select(date, collected, hakai_id, depth = Depth, line_out_depth, work_area, site_id,
lat, long, volume, lugols, collected,
`Achnanthes spp.`:Oligotrichea)
#Find rows where there was no metadata match - This step is a QC check to make sure that all analyst results were matched to metadata.
meta_missing <- micro_qu39 %>%
filter(is.na(collected))
#Test to see if meta line-out-depth matches data sheet depth.
depth_mismatch <- micro_qu39 %>%
mutate(depth_check = if_else(line_out_depth == depth, 1, 0)) %>%
select(depth_check) %>%
filter(depth_check == 0)
#Making data Tidy
micro_qu39_tidy <- micro_qu39 %>%
pivot_longer(c(`Achnanthes spp.`: Oligotrichea),
names_to = "species", values_to = "counts")
#With this data, NA's represent zeros, which means species were not observed. Removing as only want observed species.
micro_qu39_tidy <- micro_qu39_tidy %>%
drop_na(counts)
```
```{r}
#Cleaning up the data
#In this chunk I start working with the taxonomic designations to allow for matches within the WoRMs database. Specifically, non-taxonomic data are removed and qualifiers are moved into separate columns. The latter step involves considerable regular expression work.
#Create a list of designations/names do not have taxonomic information and that I remove in the next step.
removed_row_list <- micro_qu39_tidy %>%
filter(str_detect(species, "Eggs") |
str_detect(species, "pieces") |
str_detect(species, "Fecal") |
str_detect(species, "ejecta"))
#Removes designations/names with no taxonomic information from datasheet.
micro_qu39_tidy <- micro_qu39_tidy %>%
filter(!(str_detect(species, "Eggs") |
str_detect(species, "pieces") |
str_detect(species, "Fecal") |
str_detect(species, "ejecta")))
#This creates regex list that will be used to add these terms to lifeStage column.
lifeStage_regex <- c("resting stages", "resting stage", "nauplii/ekdysis",
"\\bspore", "Auxospore", "copepodites") %>%
str_c(collapse = "|")
#This creates regex list that will be used to add these terms to the identificationRemark column.
identificationRemark_regex <- c("alpha-starch, Chlorophyta\\?",
"equatorial groove, Dinophyceae\\?",
"cinctus/radicans") %>%
str_c(collapse = "|")
#This creates regex list that will be used to add these terms to the identificationQualifier column. I could likely make the search easier and eliminate [:blank:]spp$ by using stringr::str_trim to remove whitespace before or after.
identificationQ_regex <- c("[:blank:]spp$", "spp\\.", "sp\\.") %>%
str_c(collapse = "|")
#This step uses the regular expression lists defined above to find matches within the taxonomic column and for these matches, writes the specified terms to either lifeStage, indentificationQualifier or identificationRemark columns.
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(lifeStage = str_match(species, lifeStage_regex)) %>%
mutate(identificationQualifier = str_match(species, identificationQ_regex)) %>%
mutate(identificationRemark = str_match(species, identificationRemark_regex))
#Now that I have taken these terms that were included with the taxonomic information and put them in the correct columns, I now fix rename the lifeStages to match accepted vocabulary and fix a weird identificationQualifier spp that has white space making regex work difficult. I could have fixed these earlier when I was working with the data in the column format.
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(lifeStage = str_replace(lifeStage, "spore", "spores")) %>%
mutate(lifeStage = str_replace(lifeStage, "resting stages", "resting spores")) %>%
mutate(lifeStage = str_replace(lifeStage, "resting stage", "resting spores")) %>%
mutate(lifeStage = str_replace(lifeStage, "nauplii/ekdysis", "nauplii")) %>%
mutate(identificationQualifier = str_replace(identificationQualifier,
"[:blank:]spp$", "spp."))
#This step/sheet is a quality check to ensure that all of the lifeStage-taxonomic combinations are present
lifeStage_check <- micro_qu39_tidy %>%
distinct(across(c(species, lifeStage))) %>%
filter(!is.na(lifeStage))
#Check that all the identificationQualifier combinations are present
identificationQ_check <- micro_qu39_tidy %>%
distinct(across(c(species, identificationQualifier))) %>%
filter(!is.na(identificationQualifier))
#Check that all of the identificationRemark combinations are present
identificationRemark_check <- micro_qu39_tidy %>%
distinct(across(c(species, identificationRemark))) %>%
filter(!is.na(identificationRemark))
#Adding spp. where I moved cinctus/radicans to identificationRemark. The last term in the case_when function makes sure that the other identificationQualifiers are not overwritten.
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(identificationQualifier = case_when(identificationRemark ==
"cinctus/radicans" ~ "spp.",
TRUE ~ as.character
(identificationQualifier)))
#Remove all qualifier information from taxonomic/species name column so that only taxonomic information remains.
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(species = str_replace(species, "\\(resting stages\\)", "")) %>%
mutate(species = str_replace(species, "resting stage", "")) %>%
mutate(species = str_replace(species, "resting stages", "")) %>%
mutate(species = str_replace(species, "Auxospore", "")) %>%
mutate(species = str_replace(species, "\\(copepodites\\)", "")) %>%
mutate(species = str_replace(species, "\\(nauplii/ekdysis\\)", "")) %>%
mutate(species = str_replace(species, "spp\\.", "")) %>%
mutate(species = str_replace(species, "\\bspp", "")) %>%
mutate(species = str_replace(species, "sp\\.", "")) %>%
mutate(species = str_replace(species, "spore forms", "")) %>%
mutate(species = str_replace(species, "cinctus/radicans", "")) %>%
mutate(species = str_replace(species, "\\(elongate form\\)", "")) %>%
mutate(species = str_replace(species, "Pterosperma s", "Pterosperma")) %>%
mutate(species = str_replace(species, "\\(equatorial groove, Dinophyceae\\?\\)","")) %>%
mutate(species = str_replace(species, "\\(alpha-starch, Chlorophyta\\?\\)", "")) %>%
mutate(species = str_replace(species, "[:blank:]spp$", ""))
#Adding more information to my identificationRemarks. The current remarks were easy to manipulte in earlier steps, but here at the end of the process, I can make them more encompassing.
#Unknown protozoa that could be Chlorophytes
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(identificationRemark = str_replace(identificationRemark,
"alpha-starch, Chlorophyta\\?",
"Cells too small to identify. Stained blue indicating alpha starch and could be Chlorophyta"))
#Unknown protozoa that could be Dinophyceae
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(identificationRemark = str_replace(identificationRemark,
"equatorial groove, Dinophyceae\\?",
"Cells too small to identify. Showed equatorial groove and could be Dinophyceae"))
#Couldn't tell between resting stage cinctus and radicans
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(identificationRemark = str_replace(identificationRemark,
"cinctus/radicans",
"Unable to distinguish between Chaetoceros cinctus and radicans"))
#Adding reason why classification was stopped at sp., spp according to Horten et al. In theory, shouldn't this Qualifier be put on any occurrence that stops before species level? Based on figure 1 in Horten et al., I think that the qualifier should be added even when there is no sp. or spp. I will add this after the taxonomic hierarchy has been applied so that I can filter for it.
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(identificationQualifier = case_when(identificationQualifier ==
"spp." ~ "spp. Indet.",
identificationQualifier ==
"sp." ~ "sp. Indet.",
TRUE ~ as.character
(identificationQualifier)))
#This step is time consuming to run, but removes any white-space before/after strings created through regex work.
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(species = stringr::str_trim(species, side = "right"))
```
```{r}
#This chunk of code runs the obistools and worrms tools on a truncated distinct species list rather than the entire data-sheet. This step is important because it helps to isolate species match issues on a small data-set. Furthermore, the worrms::wm_record_names provides the accepted scientificName, aphiaID and taxonomic hierarchy - these are not provided by the obistools:match_taxa.
#For error checking, limit species to unique species names. For whatever reason, this is throwing duplicates - must be something from my above filtering/renaming (i.e. a space after the name?). Not certain yet if it is necessary to fix this as it will depend on if it affects the data when I apply functions to the full dataset, which is not a unique subset. For unique subset, I deal with this later after the match_taxa has been performed - apply distinct to the scientificName output.
unique_list <- micro_qu39_tidy %>%
distinct(species) %>%
arrange(species) %>%
mutate(id = row_number())
#Take unique list of species and search for them within the WoRMs database. Add a unique row number for matching with the original species name list, so that species with poor or no matches can be identified
worm_match <- unique_list$species %>%
match_taxa() %>%
mutate(id = row_number())
#Testing to see if only using worrms:wm_match_taxa provides the same results as first using obistools::match_taxa. It didn't pick up names where there was spelling errors or small differences and I think this was because the fuzzy matching default is FALSE. When I tried turning it on, it threw: Error: (500) Internal Server Error - AphiaRecordsByNames. Conflicted here because what I have works (even if it is more convoluted) and adds additional failsafe in that species names are more thoroughly checked.
# test <- unique_list$species %>%
# wm_records_names(fuzzy = TRUE) %>%
# bind_rows() %>%
# rename(scientificName = scientificname)
#
# test2 <- worm_match %>%
# left_join(test, by = "scientificName")
#Join WoRMS species scientificName and scientificID with original species names from datasheet
worm_match <- worm_match %>%
full_join(unique_list) %>%
rename(orig_name = species)
#Filter for species where there was a poor match. Cross check matches to see if correct. After cross checking, the near_3 match is incorrect - Pleurotoma gracile, which was matched to Pleurostomum gracile. The near_2 matches are correct.
worm_poor_match <- worm_match %>%
filter(match_type %in% c("near_2", "near_3", na.rm = "FALSE"))
#Filter for species with no WoRMs matches. See next note.
worm_na_match <- worm_match %>%
filter(is.na(match_type))
# This step uses the WoRMs scientificName to create a list of distinct species and then it removes species with incorrect or no matches as these complicate the following database search using the worrms::wm_records_name tool. I can't find a match for Meringosphaerae tenerrima, but I could go to genus level for this species. For the other two species, I can't even find a genus level match. The analyst said that these species occur so infrequently and at such low numbers that he suggests just removing them instead of going through the work of adding them. Pleurostomum gracile, Phyllomitus yorkeensis, Meringosphaerae tenerrima
worm_match_clean <- worm_match %>%
filter(!(str_detect(orig_name, "Pleurostomum gracile") |
str_detect(orig_name, "Phyllomitus yorkeensis") |
str_detect(orig_name, "Meringosphaerae tenerrima"))) %>%
distinct(scientificName, .keep_all = TRUE)
#This tool uses the worrms::wm_records_names tool to attach taxonomic hierarchies and accepted scientificNames to the scientificNames derived above from the obistools::match_taxa (provides direct matches, not accepted). Unfortunately, it re-adds duplicates (Appendicularia and Euglenozoa) where I was able to make selections in obistools::match_taxa and I need to manually remove these later. The worrms::wm_record_names is described here: https://cran.r-project.org/web/packages/worrms/worrms.pdf
worm_hierarchy <- worm_match_clean$scientificName %>%
wm_records_names() %>%
bind_rows() %>%
rename(scientificName = scientificname)
#Joins the worrms::wm_records_names output with the obistools::taxon_match output so that unaccepted names can be found and traced back to the analysts original name and other issues can be identified.
worm_join <- worm_hierarchy %>%
full_join(worm_match_clean, by = "scientificName")
#finds duplicate species names created using wormms::wm_name_records. The worrms::wm_name_records tool reintroduces potential duplicate taxonomic designations that were selected and removed in obistools::taxon_match. This list is used to remove/rename duplicated entries when working with the full data-set in the next chunk.
worm_join_dup <- worm_join %>%
group_by(scientificName) %>%
mutate(num_dups = n(),
dup_id = row_number()) %>%
ungroup() %>%
mutate(is_duplicated = dup_id > 1) %>%
filter(num_dups > 1)
#this finds all of the taxonomic designations that are not accepted. After consultation with the analyst, I decided to go forward with the accepted scientificName from the worrms::wm_records_names output. It is suggested that the analyst original name, even if it is a synonym, is retained and used, but I since I cross checked with the analyst, There is nothing I can do to corrected for uncertain names.
worm_join_unaccept <- worm_join %>%
filter(!status == "accepted")
```
```{r}
#This chunk adds taxonomic information to the full dataset, rather than just a distinct list. The distinct list is still important because the worrms::wm_records_names tool will only run on a small number of records. Therefore, I needed to use the distinct list from above to join to the larger dataset here. I don't really like my naming conventions here.
#Performs obistools::match_taxa on full dataset. When prompted in console, select y, 2, 2. As the output is a single scientificName column, this step also gives an id so that this can be merged back with the metadata in the next step.
complete <- micro_qu39_tidy$species %>%
match_taxa() %>%
mutate(id = row_number())
#Gives the full data sheet an id for merging with
micro_qu39_tidy <- micro_qu39_tidy %>%
mutate(id = row_number())
#Removes species names that I decided to remove from the data based on the analysts suggestion. I could also do this earlier before the matching is done.
complete <- complete %>%
full_join(micro_qu39_tidy) %>%
rename(orig_name = species) %>%
filter(!(str_detect(orig_name, "Pleurostomum gracile") |
str_detect(orig_name, "Phyllomitus yorkeensis") |
str_detect(orig_name, "Meringosphaerae tenerrima")))
#This removes the double hierarchy matches (as seen in worm_join_dup data-sheet). Consultation was done to ensure the removal of the incorrect match and to leave the correct match. It might be better to do this in the last chunk where I develop this sheet - would probably flow better, rather than jumping back here.
worm_hierarchy_clean <- worm_hierarchy %>%
filter(!((scientificName == "Appendicularia" & is.na(authority)) |
(scientificName == "Dinophysis rotundata" & authority == "Levander, 1894") |
(scientificName == "Euglenozoa" & is.na(authority))))
#Joins the worrms:wm_records_names derived taxonomic hierarchies and accepted ScientificName (and other information) to the full datasheet with all occurrences.
complete_join <- complete %>%
left_join(worm_hierarchy_clean, by = "scientificName")
```
```{r}
#Adding Indet. to all classification that did not go to species level and do not have sp. or spp.
complete_join <- complete_join %>%
mutate(identificationQualifier = case_when(!rank == "Species" &
is.na(identificationQualifier) ~
"Indet.",
TRUE ~ as.character
(identificationQualifier)))
```
```{r}
#Writing flat file with all information
write_csv(complete_join, here("output_worms_matching", "qu39_2016_half2019.csv"))
```
```{r}
#In this chunk I work to get the data into the OBIS-DWC format and split it into the event-core and the occurrence and extended measurement of fact (EMoF) extensions.
#Setting up a large flat table that I will split the Event, Occurrence and EMoF extensions from afterwards.
#Switched from portal date to collected, which includes hours/minutes/seconds in UTC. The portal format looks to be in the correct format for OBIS-DwC. I kept the date (format Year-month-day) for the creation of the eventID, because I thought it was too complex to add the hours/minutes/seconds to the ID
format_obis <- complete_join %>%
select(date, eventDate = collected, minimumDepthInMeters = line_out_depth,
maximumDepthInMeters = line_out_depth, lat, long,
scientificName = valid_name, scientificNameID, valid_AphiaID,
taxonRank = rank, lifeStage, identificationRemark,
identificationQualifier, measurementValue = counts)
#The worrms::wm_records_names tool does not output the correct scientificNameID format (not a url), but it does give the correct aphiaID number. The obistools::match_taxa format is correct, but with the unaccepted species aphiaIDs at the end. In this step, I remove the ID numbers at the end of the obistools scientificNameID url and then concatenate the correct ID number from the worrms tool at the end of the url.
format_obis <- format_obis %>%
mutate(scientificNameID_2 = str_replace_all(scientificNameID, "[:digit:]", "")) %>%
unite("ScientificNameID_corr", scientificNameID_2, valid_AphiaID,
sep = "", remove = FALSE) %>%
select(-scientificNameID_2, - scientificNameID, -valid_AphiaID,
scientificnameID = ScientificNameID_corr)
#Adding required columns (these pertain to the Occurrence table)
format_obis <- format_obis %>%
mutate(occuranceStatus = "Present",
basisOfRecord = "HumanObservation",
identifiedBy = "Louis Hobson",
method = "Morphology")
##https://tools.gbif.org/dwca-validator/extension.do?id=dwc:Event
#Creating eventID - An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.
format_obis <- format_obis %>%
mutate(eventID = "Hakai_phyto_QU39") %>%
mutate(depth_unit = "m") %>% # creating depth unit to added to depth measurement
unite("depth_unit", minimumDepthInMeters, depth_unit, #merging depth with depth unit
sep = "", remove = FALSE) %>%
unite("eventID", eventID, date, depth_unit, #merging event event ID prefix, date and depth
sep = "_", remove = FALSE) %>%
select(!depth_unit) #removing depth unit column created for ID
#Creating and adding occurrence ID
format_obis <- format_obis %>%
group_by(eventID) %>%
mutate(occurrence_num = row_number()) %>% #creating sequential # that restarts each new date
ungroup() %>%
unite("occurrenceID", eventID, occurrence_num, #merging event ID with new number
sep = "_", remove = FALSE) %>%
relocate(occurrenceID, .after = eventID) %>%
select(!occurrence_num) #removing sequential # used to build occurrenceID
#Removing cyanobacteria because observation/count highly speculative - could do earlier in process, but nice to have for other work.
format_obis <- format_obis %>%
filter(!scientificName == "Cyanobacteria")
#Creating eventCore - pulling columns from flat format_obis table
event <- format_obis %>%
select(eventID, eventDate, minimumDepthInMeters,maximumDepthInMeters,
decimalLongitude = long, decimalLatitude = lat) %>%
distinct(eventID, .keep_all = TRUE)
#Creating occurrence table pulling columns from flat format_obis table
occurrence <- format_obis %>%
select(eventID, occurrenceID, scientificName,
scientificnameID, taxonRank, lifeStage, identificationRemark,
identificationQualifier, occuranceStatus, basisOfRecord,
identifiedBy,
method)
#Creating extended measurement of fact sheet without lifeStage.
emof_no_lifeStage <- format_obis %>%
select(occurrenceID, measurementValue, lifeStage) %>%
mutate(measurementType = "Abundance of phytoplankton",
measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/PU00M00Z/",
measurementUnit = "cells per litre",
measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UCPL/")
#Creating EMoF with lifestage included in long (tidy) format.
emof_lifeStage <- format_obis %>%
select(occurrenceID, measurementValue, lifeStage) %>%
mutate(measurementValue = as.character(measurementValue)) %>%
pivot_longer(c(measurementValue:lifeStage),
names_to = "measurementType", values_to = "measurementValue")
#Removing rows where there is no lifeStage - Don't need NAs
emof_lifeStage <- emof_lifeStage %>%
filter(!is.na(measurementValue))
#Creating measurement ID since lifeStage addition creates duplicate occurrenceIDs
emof_lifeStage <- emof_lifeStage %>%
group_by(occurrenceID) %>%
mutate(meas_num = row_number()) %>% #creating sequential # that restarts each new occID
ungroup() %>%
unite("measurementID", occurrenceID, meas_num, #merging occID with new numbers
sep = "_", remove = FALSE) %>%
relocate(measurementID, .after = occurrenceID) %>%
select(!meas_num) #removing sequential # used to build measID
#Adding measurementType and MeasurementTypeID to emof_lifestage.
emof_lifeStage <- emof_lifeStage %>%
mutate(measurementType = str_replace(measurementType,
"measurementValue",
"Abundance of phytoplankton")) %>%
mutate(measurementTypeID =
case_when(measurementType == "Abundance of phytoplankton" ~
"http://vocab.nerc.ac.uk/collection/P01/current/PU00M00Z/",
measurementType == "lifeStage" ~
"http://vocab.nerc.ac.uk/collection/P01/current/LSTAGE01/")) %>%
mutate(measurementValueID =
case_when(measurementValue == "Auxospores" ~
"http://vocab.nerc.ac.uk/collection/S11/current/S1117/",
measurementValue == "copepodites" ~
"http://vocab.nerc.ac.uk/collection/S11/current/S115/",
measurementValue == "nauplii" ~
"http://vocab.nerc.ac.uk/collection/S11/current/S1130/",
measurementValue == "resting spores" ~ #update to "spores"
"http://vocab.nerc.ac.uk/collection/S11/current/S1134/",
measurementValue == "spores" ~
"http://vocab.nerc.ac.uk/collection/S11/current/S1135/")) %>%
select(occurrenceID, measurementID, measurementType, measurementTypeID,
measurementValue, measurementValueID)
#Adding measurement unit for abundance counts and measurementUnitID for measurementUnit
emof_lifeStage <- emof_lifeStage %>%
mutate(measurementUnit =
case_when(measurementType == "Abundance of phytoplankton" ~
"Number per litre")) %>%
mutate(measurementUnitID =
case_when(measurementType == "Abundance of phytoplankton" ~
"http://vocab.nerc.ac.uk/collection/P06/current/UCPL/"))
```
```{r}
#Writing sheets to csv. Blanked off because I think it's worth merging with other cleaned and matched data and then formatting to obis together.
#out_path <- "C:\\Users\\justin.belluz\\Documents\\R\\microscopy_OBIS\\output_files"
#Write sheets to file.
# write_csv(format_obis, here(output_files, "OBIS_format_flat.csv"))
# write_csv(event, here(output_files, "Hakai_phytoplankton_event.csv"))
# write_csv(occurrence, here(output_files, "Hakai_phytoplankton_occurrence_idRemark.csv"))
# write_csv(emof_no_lifeStage, here(output_files, "Hakai_phytoplankton_emof_no_lifeStage.csv"))
# write_csv(emof_lifeStage, here(output_files, "Hakai_phytoplankton_emof_lifeStage.csv"))
# write_csv(occurrence_unique, here(output_files, "occurrence_distinct_check_4.csv"))
# write_csv(data_incorp, here(output_files, "incorp_hakID.csv"))
```
```{r}
#Assigning Louis' original classifications to species list.
#Creating final distinct taxonomic list including all original analyst classifications and worms additions to ensure that all classifications/qualifiers are correct and to send to analyst for cross-check.
occurrence_unique <- complete_join %>%
select(orig_name, status, unacceptreason,
scientificName_accepted = valid_name, rank, kingdom:genus,
lifeStage, identificationRemark, identificationQualifier) %>%
distinct(orig_name, across(kingdom:identificationQualifier),
.keep_all = TRUE)
occurrence_unique <- occurrence_unique %>%
mutate(Louis_class = case_when(class == "Bacillariophyceae" ~ "Bacillariophyta",
phylum == "Chlorophyta" ~ "Chlorophyta",
phylum == "Choanozoa" ~ "Choanoflagellata",
class == "Chrysophyceae" |
class == "Xanthophyceae" ~ "Chrysophyta",
phylum == "Ciliophora" ~ "Ciliophora",
class == "Cryptophyceae" ~ "Cryptophyta",
phylum == "Cyanobacteria" ~ "Cyanobacteria",
class == "Dictyochophyceae" ~ "Dictyophyta",
class == "Dinophyceae" ~ "Dinoflagellata",
class == "Ebriophyceae" ~ "Ebriidea",
phylum == "Euglenozoa" &
(class == "Euglenoidea" | is.na(class)) ~
"Euglenophyta",
phylum == "Haptophyta" ~ "Haptophyta",
class == "Kinetoplastea" |
orig_name == "Metromonas simplex" |
orig_name == "Pseudobodo tremulans" |
orig_name == "Telonema subtilis" ~
"Kinetoplastidea",
phylum == "Arthropoda" |
phylum == "Chordata" ~ "Metazoa",
class == "Raphidophyceae" ~ "Raphidiophyta",
kingdom == "Protozoa" & is.na(phylum) &
is.na(class) ~ "Unknown")) %>%
select(Louis_class, Louis_name = orig_name, status:identificationQualifier) %>%
arrange(Louis_class, Louis_name)
#Testing stringr::stg_trim to remove whitespace following species name. Worked - should probably do this somewhere earlier (maybe after regex work)
occurrence_unique <- occurrence_unique %>%
mutate(Louis_name = stringr::str_trim(Louis_name, side = "right"))
```
```{r}
#Comparison of portal samples with data with current OBIS data - Trying to see what samples we have missed or are awaiting results for.
data_incorp <- complete_join %>%
select(hakai_id) %>%
distinct() %>%
mutate(obis_status = "uploaded")
#This didn't work very well - too many samples that haven't been analyzed and tough to filter them out - would be better to look at my sent sample inventory and cross compare with that.
# data_assess <- metadata %>%
# left_join(data_incorp) %>%
# distinct(hakai_id, .keep_all = TRUE) %>%
# filter(!is.na(hakai_id) & date > "2016-01-01" &
# (line_out_depth == 0 | line_out_depth == 5) &
# is.na(obis_status))
```
```{r}
#things to do
#fix time (done)
#add qualifier information (done, but needs cross checks)
```