Parenthesis inside the string: plantR::prepName #132

Open
ggrittz opened this issue Feb 18, 2025 · 0 comments

plantR::prepName cannot handle strings with parentheses in the middle or at only one end, such as "Sobrinho, J. de P.L. (no. 1441)":

> plantR::prepName('Sobrinho, J. de P.L. (no. 1441)')
Error in gsub(x, "", y, perl = TRUE) : 
  invalid regular expression ')|Sobrinho'
In addition: Warning message:
In gsub(x, "", y, perl = TRUE) : PCRE pattern compilation error
	'unmatched closing parenthesis'
	at ')|Sobrinho'

This happens because the function only handles parentheses (or brackets) when they wrap the whole string, i.e., one at the beginning and one at the end, as in "(João Silva)".

Below are lines 11 to 18 of prepName:

if (any(bracks)) 
    x[bracks] <- gsub("^\\[|\\]$|^\\(|\\)$", "", x[bracks], 
                      perl = TRUE)
  parent <- grepl("^\\(", x, perl = TRUE) & grepl("\\)$", x, 
                                                  perl = TRUE)
  if (any(parent)) 
    x[parent] <- gsub("^\\[|\\]$|^\\(|\\)$", "", x[parent], 
                      perl = TRUE)
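
To illustrate the gap, here is a minimal sketch that runs the same anchored check on two strings from this issue and then compiles the pattern quoted in the error message. How prepName builds that pattern internally is not shown here, so the last call only reproduces the PCRE failure, not the function's actual code path; `res` is a name used for illustration only.

```r
# Example strings taken from this issue
x <- c("(João Silva)", "Sobrinho, J. de P.L. (no. 1441)")

# Same check as in the excerpt: TRUE only when the whole string is wrapped
parent <- grepl("^\\(", x, perl = TRUE) & grepl("\\)$", x, perl = TRUE)
parent

# Stripping therefore only touches the first element;
# the within-string "(no. 1441)" is left untouched
x[parent] <- gsub("^\\[|\\]$|^\\(|\\)$", "", x[parent], perl = TRUE)
x

# The pattern from the error message has a ")" with no matching "(",
# which PCRE rejects; tryCatch() captures the error instead of stopping
res <- tryCatch(
  suppressWarnings(gsub(")|Sobrinho", "", x[2], perl = TRUE)),
  error = function(e) "regex error"
)
res
```
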

Cases such as "Sobrinho, J. de P.L. (no. 1441)" are not accounted for, so an error is returned. I've been thinking about how to solve this. At the end of the function the leading and trailing brackets and parentheses are returned to the string, but doing the same for within-string parentheses becomes too complicated. So I looked at thousands of cases like this, and in nearly all of them the parentheses contain one of the following:

  1. a location, "(Parc National de Port-Cros)"
  2. an institute name, "(INFLOVAR (Association))"
  3. another name, "Franklin, M.A. (Ben)", or
  4. a (potential) collector number, "Luetzelburg, P. von (no. 23045)"

Collector numbers are not extracted from these columns (prepName is used on $recordedBy and $identifiedBy), so I think that in cases such as "Sobrinho, J. de P.L. (no. 1441)" the parentheses and everything inside them could be removed. The function only preps a name, and the content of within-string parentheses is not used for anything else. Also, if only the parentheses themselves are removed, i.e., "Sobrinho, J. de P.L. no. 1441", the output gets messy and "No." is treated as the surname.

If this is wanted, the within-string parentheses and their contents can be removed with:

x <- trimws(ifelse(grepl("(?<!^)\\(", x, perl = TRUE) | grepl("\\)(?!$)", x, perl = TRUE), gsub("\\([^)]*\\)", "", x), x))

This still keeps cases such as "(João Silva)" as is.

The line above could be added just before this step (line 14) in the prepName function:
parent <- grepl("^\\(", x, perl = TRUE) & grepl("\\)$", x, perl = TRUE)
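
As a rough check of the proposal, the sketch below wraps the suggested expression in a helper and applies it to strings quoted in this issue. `strip_inner_parens` and `collectors` are hypothetical names used for illustration; they are not part of plantR.

```r
# Hypothetical helper (not part of plantR) wrapping the expression proposed above
strip_inner_parens <- function(x) {
  # TRUE when a "(" is not the first character or a ")" is not the last one
  inner <- grepl("(?<!^)\\(", x, perl = TRUE) |
    grepl("\\)(?!$)", x, perl = TRUE)
  # For flagged strings, drop each "(...)" group including the parentheses,
  # then trim the whitespace left behind
  trimws(ifelse(inner, gsub("\\([^)]*\\)", "", x), x))
}

collectors <- c(
  "Sobrinho, J. de P.L. (no. 1441)",
  "Luetzelburg, P. von (no. 23045)",
  "Franklin, M.A. (Ben)",
  "(João Silva)"
)

cleaned <- strip_inner_parens(collectors)
cleaned
```

The first three strings lose their parenthesised part, while the fully wrapped "(João Silva)" is passed through unchanged for the existing `parent` step to handle. One caveat of this sketch: because `[^)]*` stops at the first closing parenthesis, nested cases such as "(INFLOVAR (Association))" would be left with a stray ")" and may need separate treatment.
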
