tools::toTitleCase() (specific case from #24) #32
From the bugzilla report:
Documenting existing behavior, as existing unit tests are minimal.
|
Suggested tests to add with their current behavior
|
From @shannonpileggi's test cases, we noticed the unexpected behaviour of single letter words (including 'a') not being capitalized, even at the beginning of a sentence. This is at odds with the comments in the function source code:
## These should be lower case except at the beginning (and after :)
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"
There is indeed an instruction to capitalize the first word of the sentence:
l[1L] <- FALSE
But then a later instruction explicitly excludes single-letter words from the transformation:
keep <- havecaps | l | (nchar(xx) == 1L) | alone
A potential solution is to special-case the first word of the sentence later in the logic, so it's not overwritten by following exclusions. We still want it to happen before
This leads to the following patch:
Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2688,7 +2688,6 @@
alone <- alone | grepl("^'.*'$", xx)
havecaps <- grepl("^[[:alpha:]].*[[:upper:]]+", xx)
l <- grepl(lpat, xx, ignore.case = TRUE)
- l[1L] <- FALSE
## do not remove capitalization immediately after ": " or "- "
ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
@@ -2697,7 +2696,9 @@
ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]
l[ind + 1L] <- FALSE
xx[l] <- tolower(xx[l])
- keep <- havecaps | l | (nchar(xx) == 1L) | alone
+ keep <- havecaps | l | (nchar(xx) == 1L)
+ keep[1L] <- FALSE
+ keep <- keep | alone
xx[!keep] <- sapply(xx[!keep], do1)
paste(xx, collapse = "") |
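To see why the single-letter exclusion wins over `l[1L] <- FALSE`, the masks can be rebuilt by hand. This is only a sketch under assumptions: `xx` is tokenized manually (the real function splits the string internally), `alone` is assumed to be all `FALSE`, and `keep_old` / `keep_new` are illustrative names that do not exist in the source.

```r
# Hand-written tokens for "a new bug" (assumption: the real function
# produces a similar split, with separators kept as their own elements)
xx <- c("a", " ", "new", " ", "bug")
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

havecaps <- grepl("^[[:alpha:]].*[[:upper:]]+", xx)
alone <- rep(FALSE, length(xx))  # simplification for this sketch
l <- grepl(lpat, xx, ignore.case = TRUE)
# The patch drops this line; it is kept here because keep_new[1L]
# is overwritten below anyway.
l[1L] <- FALSE

# Current logic: nchar(xx) == 1L puts "a" back into the keep mask
keep_old <- havecaps | l | (nchar(xx) == 1L) | alone

# Patched logic: the first element is released before `alone` is applied
keep_new <- havecaps | l | (nchar(xx) == 1L)
keep_new[1L] <- FALSE
keep_new <- keep_new | alone

keep_old[1L]  # TRUE: "a" is skipped by do1
keep_new[1L]  # FALSE: "a" is passed to do1
```

Because `keep_new[1L]` is reset unconditionally, any single-character first token would be passed to `do1`, not just 'a'.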
The patch mentioned above doesn't completely harmonize the function behaviour with the comment. 'a' after ':' is still not capitalized:
tools::toTitleCase("Salzburg: a city of music")
#> [1] "Salzburg: a City of Music"
Created on 2024-07-12 with reprex v2.1.1 |
Patch from @sarahzeller for the original issue:
Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2692,6 +2692,8 @@
## do not remove capitalization immediately after ": " or "- "
ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
+ # don't capitalize lpat words after hyphenation
+ ind <- ind[!(xx[ind] == "-" & grepl(lpat, xx[ind + 2L]))]
l[ind + 2L] <- FALSE
## Also after " (e.g. "A Book Title")
ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)] |
What is R CMD check using to calculate the correct title case? Perhaps we can simply use the same code in toTitleCase... |
R CMD check is using:
## Check Title field.
title <- trimws(as.vector(meta["Title"]))
title <- gsub("[\n\t]", " ", title)
package <- meta["Package"]
if (tolower(title) == tolower(package)) {
out$title_is_name <- TRUE
} else {
if(grepl(paste0("^",
gsub(".", "[.]", package, fixed = TRUE),
"[ :]"), title, ignore.case = TRUE))
out$title_includes_name <- TRUE
language <- meta["Language"]
if(is.na(language) ||
(language == "en") ||
startsWith(language, "en-")) {
title2 <- toTitleCase(title)
## Keep single quoted elements unchanged.
p <- "(^|(?<=[ \t[:punct:]]))'[^']*'($|(?=[ \t[:punct:]]))"
m <- gregexpr(p, title, perl = TRUE)
regmatches(title2, m) <- regmatches(title, m)
if(title != title2)
out$title_case <- c(title, title2)
}
} |
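The quote-preserving step in that check can be tried in isolation. A minimal sketch, assuming a hypothetical `title2` that stands in for the result of `toTitleCase(title)` (it is hard-coded here so the example does not depend on the R version in use):

```r
title  <- "Import 'SPSS' files"
# Hypothetical stand-in for toTitleCase(title); hard-coded on purpose
title2 <- "Import 'Spss' Files"

# Same pattern as in the check: single-quoted elements delimited by
# start/end of string, whitespace, or punctuation
p <- "(^|(?<=[ \t[:punct:]]))'[^']*'($|(?=[ \t[:punct:]]))"
m <- gregexpr(p, title, perl = TRUE)

# Copy the quoted pieces of the original title back into title2
regmatches(title2, m) <- regmatches(title, m)
title2
#> [1] "Import 'SPSS' Files"
```

So the check compares `title` against a version of the title-cased string in which quoted elements were restored from the original, which is why the check and a direct call to `toTitleCase()` can disagree on quoted words.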
Didn't the bug report indicate that after using title case R CMD check complained that it wasn't in title case? |
We're aiming to solve the following issue: after a hyphen, words that should not be capitalized are being capitalized. Examples include and, or, to, as specified in
In the code, there are already lines checking for words following hyphens, to make sure they are capitalized. We add a check in this part, where we exclude those specific cases of “forbidden” words following a hyphen. This prevents such words from being incorrectly capitalized.
Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2692,6 +2692,8 @@
## do not remove capitalization immediately after ": " or "- "
ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
+ # don't capitalize lpat words after hyphenation
+ ind <- ind[!(xx[ind] == "-" & grepl(lpat, xx[ind + 2L]))]
l[ind + 2L] <- FALSE
## Also after " (e.g. "A Book Title")
ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)] |
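The effect of the added filter can be checked on a hand-built token vector. This is a sketch under assumptions: the tokens for "Small- and Large-Scale" are written out manually (with the hyphen as its own element, as the `xx[ind] == "-"` test in the patch implies), and `l_old` / `l_new` / `ind_new` are illustrative names.

```r
# Hand-written tokens for "Small- and Large-Scale" (assumed split)
xx <- c("Small", "-", " ", "and", " ", "Large", "-", "Scale")
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

l <- grepl(lpat, xx, ignore.case = TRUE)
l[1L] <- FALSE

# Existing lines: find a token ending in "-" or ":" followed by " " and a word
ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]

# Without the patch, "and" loses its lower-case flag
l_old <- l
l_old[ind + 2L] <- FALSE

# With the patch, lpat words after a hyphen are dropped from ind first
ind_new <- ind[!(xx[ind] == "-" & grepl(lpat, xx[ind + 2L]))]
l_new <- l
l_new[ind_new + 2L] <- FALSE

l_old[4L]  # FALSE: "and" would later be capitalized
l_new[4L]  # TRUE: "and" stays flagged as lower case
```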
Proposal for a new bug report / discussion on bugzilla:
We noticed some edge cases in
Mismatch between code comment and actual behaviour
'A' is not capitalized at the beginning of a sentence or after a colon. This is at odds with the comments in the function source code:
## These should be lower case except at the beginning (and after :)
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"
'A' is not capitalized at the beginning of a sentence
tools::toTitleCase("a new bug")
#> [1] "a New Bug"
There is indeed an instruction to capitalize the first word of the sentence:
l[1L] <- FALSE
But then a later instruction explicitly excludes single-letter words from the transformation:
keep <- havecaps | l | (nchar(xx) == 1L) | alone
A potential solution is to handle the first word of the sentence later in the logic, so it's not overwritten by following exclusions. We still want it to happen before
This leads to the following patch:
Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2688,7 +2688,6 @@
alone <- alone | grepl("^'.*'$", xx)
havecaps <- grepl("^[[:alpha:]].*[[:upper:]]+", xx)
l <- grepl(lpat, xx, ignore.case = TRUE)
- l[1L] <- FALSE
## do not remove capitalization immediately after ": " or "- "
ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
@@ -2697,7 +2696,9 @@
ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]
l[ind + 1L] <- FALSE
xx[l] <- tolower(xx[l])
- keep <- havecaps | l | (nchar(xx) == 1L) | alone
+ keep <- havecaps | l | (nchar(xx) == 1L)
+ keep[1L] <- FALSE
+ keep <- keep | alone
xx[!keep] <- sapply(xx[!keep], do1)
paste(xx, collapse = "")
Note that this patch would capitalize all single-letter words at the beginning of a sentence, not just 'A'. We are not completely sure if this is the desired behaviour or not.
'A' is not capitalized after a colon
tools::toTitleCase("Salzburg: a city of music")
#> [1] "Salzburg: a City of Music"
This happens for the same reason, where single-letter words get a special treatment late in the logic. The only difference with the previous section is where the custom logic to capitalize the first word after a colon is applied:
## do not remove capitalization immediately after ": " or "- "
ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
l[ind + 2L] <- FALSE Special
|
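The colon case can be traced the same way. A minimal sketch, assuming a hand-written tokenization of "Salzburg: a city of music" (with the colon attached to the preceding word, as the `"[-:]$"` pattern suggests) and `alone` simplified to all `FALSE`:

```r
# Hand-written tokens (assumed split; separators are their own elements)
xx <- c("Salzburg:", " ", "a", " ", "city", " ", "of", " ", "music")
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

havecaps <- grepl("^[[:alpha:]].*[[:upper:]]+", xx)
alone <- rep(FALSE, length(xx))  # simplification for this sketch
l <- grepl(lpat, xx, ignore.case = TRUE)
l[1L] <- FALSE

# The colon rule clears the lower-case flag of the word after ": "
ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
l[ind + 2L] <- FALSE

# ...but the single-letter test still shields "a" from do1
keep <- havecaps | l | (nchar(xx) == 1L) | alone

l[3L]     # FALSE: "a" is no longer forced to lower case
keep[3L]  # TRUE: "a" is nevertheless left untouched
```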
As mentioned in previous posts, the code has a vector named |
Quoted words are not protected from capitalization change when followed by a comma:
tools::toTitleCase("Import and Export 'SPSS', 'Stata' and 'SAS' Files")
#> [1] "Import and Export 'Spss', 'Stata' and 'SAS' Files"
tools::toTitleCase("Import and Export 'SPSS' 'Stata' and 'SAS' Files")
#> [1] "Import and Export 'SPSS' 'Stata' and 'SAS' Files"
Created on 2024-07-12 with reprex v2.1.1 |
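One plausible explanation, sketched here under the assumption that the comma stays attached to the quoted token when the string is split: the pattern used to protect quoted words in the source (`"^'.*'$"`) requires the closing quote to be the last character, so a trailing comma defeats it. The `tokens` vector is a hand-written illustration.

```r
# Assumed tokens: with and without a trailing comma
tokens <- c("'SPSS',", "'SPSS'")

# Pattern taken from the line `alone <- alone | grepl("^'.*'$", xx)`
protected <- grepl("^'.*'$", tokens)
protected
#> [1] FALSE  TRUE
```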
A PR has been made to r-svn that
|
Hi! Martin has committed changes here: r-devel/r-svn@cd556f5 Does anyone want to take the lead on summarizing other potential bugs and adding that to the bugzilla report for consideration? |
We still have r-devel/r-svn#174 on hold on bugzilla and I'm happy to post this comment about capitalization at the start of a string as a new report: #32 (comment) @shannonpileggi, please let me know if the phrasing in #32 (comment) seems good to you or if you'd tweak anything. If it sounds good, a 👍 reaction on the message to confirm your agreement would be great! |
Yes, it is very thorough and looks great!! Please post @Bisaloo 🙏 |
Thanks, it is now submitted as https://bugs.r-project.org/show_bug.cgi?id=18767. |
Breaking this out from issue #24
tools::toTitleCase() incorrectly capitalizes conjunctions (e.g. 'and') when using suspensive hyphenation
https://bugs.r-project.org/show_bug.cgi?id=18674