Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tools::toTitleCase() (specific case from #24) #32

Open
shannonpileggi opened this issue Jul 12, 2024 · 17 comments
Open

tools::toTitleCase() (specific case from #24) #32

shannonpileggi opened this issue Jul 12, 2024 · 17 comments

Comments

@shannonpileggi
Copy link

Breaking this out from issue #24

tools::toTitleCase() incorrectly capitalizes conjunctions (e.g. 'and') when using suspensive hyphenation
https://bugs.r-project.org/show_bug.cgi?id=18674

@shannonpileggi
Copy link
Author

shannonpileggi commented Jul 12, 2024

From the bugzilla report:

# this behavior is incorrect
tools::toTitleCase("small- and large-scale analysis")
"Small- And Large-Scale Analysis"

# this behavior is correct
tools::toTitleCase("small-scale and large-scale analysis")
"Small-Scale and Large-Scale Analysis"

# this behavior is correct
tools::toTitleCase("small and large-scale analysis")
"Small and Large-Scale Analysis"

Documenting existing behavior, as existing unit tests are minimal.

tools::toTitleCase("bayesian network modeling and analysis")
"Bayesian Network Modeling and Analysis"

tools::toTitleCase("ensemble tool for predictions from species distribution models")
"Ensemble Tool for Predictions from Species Distribution Models"

tools::toTitleCase("a 'shiny' app for weibull analysis from 'weibullr'")
"a 'shiny' App for Weibull Analysis from 'weibullr'"

tools::toTitleCase("post-estimation functions for generalized linear mixed models")
"Post-Estimation Functions for Generalized Linear Mixed Models"

tools::toTitleCase("post- and post-estimation functions for generalized linear mixed models")
"Post- And Post-Estimation Functions for Generalized Linear Mixed Models"

tools::toTitleCase("general network (http/ftp/...) client interface for r")
"General Network (Http/Ftp/...) Client Interface for r"

tools::toTitleCase("i love r")
"i Love r"

tools::toTitleCase("r and i")
"r and i"

tools::toTitleCase("above the bar")
"above the Bar"

tools::toTitleCase("the bar is above")
"The Bar is above"

@shannonpileggi
Copy link
Author

shannonpileggi commented Jul 12, 2024

Suggested tests to add with their current behavior

# incorrect
tools::toTitleCase("i love r")
"i Love r"

# incorrect
tools::toTitleCase("r and i")
"r and i"

# incorrect
tools::toTitleCase("a new bug")
"a New Bug"

# correct
tools::toTitleCase("in a minute")
"In a Minute"

# correct
tools::toTitleCase("pre and post estimation")
"Pre and Post Estimation"

# incorrect
tools::toTitleCase("pre- and post estimation")
"Pre- And Post Estimation"

@Bisaloo
Copy link

Bisaloo commented Jul 12, 2024

From @shannonpileggi's test cases, we noticed the unexpected behaviour of single letter words (including 'a') not being capitalized, even at the beginning of a sentence.

This is at odds with the comments in the function source code:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2662-L2663

    ## These should be lower case except at the beginning (and after :)
    lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

There is indeed an instruction to capitalize the first word of the sentence:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2687

        l[1L] <- FALSE

But then a later instruction explicitly excludes single-letter words from the transformation:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2696

        keep <- havecaps | l | (nchar(xx) == 1L) | alone

A potential solution is to special-case the first word of the sentence later in the logic, so it's not overwritten by following exclusions. We still want it to happen before alone exclusions are applied, to ensure alone words are always left alone, even if they are in first position.

This leads to the following patch:

Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2688,7 +2688,6 @@
         alone <- alone | grepl("^'.*'$", xx)
         havecaps <- grepl("^[[:alpha:]].*[[:upper:]]+", xx)
         l <- grepl(lpat, xx, ignore.case = TRUE)
-        l[1L] <- FALSE
         ## do not remove capitalization immediately after ": " or "- "
         ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
         ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
@@ -2697,7 +2696,9 @@
         ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]
         l[ind + 1L] <- FALSE
         xx[l] <- tolower(xx[l])
-        keep <- havecaps | l | (nchar(xx) == 1L) | alone
+        keep <- havecaps | l | (nchar(xx) == 1L)
+        keep[1L] <- FALSE
+        keep <- keep | alone
         xx[!keep] <- sapply(xx[!keep], do1)
         paste(xx, collapse = "")

@Bisaloo
Copy link

Bisaloo commented Jul 12, 2024

The patch mentioned above doesn't completely harmonize the function behaviour with the comment. 'a' after ':' is still not capitalized:

tools::toTitleCase("Salzburg: a city of music")
#> [1] "Salzburg: a City of Music"

Created on 2024-07-12 with reprex v2.1.1

@Bisaloo
Copy link

Bisaloo commented Jul 12, 2024

Patch from @sarahzeller for the original issue:

Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2692,6 +2692,8 @@
         ## do not remove capitalization immediately after ": " or "- "
         ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
         ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
+        # don't capitalize lpat words after hyphenation
+        ind <- ind[!(xx[ind] == "-" & grepl(lpat, xx[ind + 2L]))]
         l[ind + 2L] <- FALSE
         ## Also after " (e.g. "A Book Title")
         ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]

@gmbecker
Copy link

What is R CMD check using to calculate the correct title case. Perhaps we can simply use the same code there in toTitleCase...

@Bisaloo
Copy link

Bisaloo commented Jul 12, 2024

R CMD check is using toTitleCase():

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/QC.R#L7848-L7871

    ## Check Title field.
    title <- trimws(as.vector(meta["Title"]))
    title <- gsub("[\n\t]", " ", title)
    package <- meta["Package"]
    if (tolower(title) == tolower(package)) {
        out$title_is_name <- TRUE
    } else {
        if(grepl(paste0("^",
                        gsub(".", "[.]", package, fixed = TRUE),
                        "[ :]"), title, ignore.case = TRUE))
            out$title_includes_name <- TRUE
        language <- meta["Language"]
        if(is.na(language) ||
           (language == "en") ||
           startsWith(language, "en-")) {
            title2 <- toTitleCase(title)
            ## Keep single quoted elements unchanged.
            p <- "(^|(?<=[ \t[:punct:]]))'[^']*'($|(?=[ \t[:punct:]]))"
            m <- gregexpr(p, title, perl = TRUE)
            regmatches(title2, m) <- regmatches(title, m)
            if(title != title2)
                out$title_case <- c(title, title2)
        }
    }

@gmbecker
Copy link

Didn't the bug report indicate that after using title case R CMD check complained that it wasn't in title case?

@sarahzeller
Copy link

sarahzeller commented Jul 12, 2024

The issue we're aiming to solve the following issue: After a hyphen, words that should not be capitalized are being capitalized. Examples include and, or, to, as specified in lpat earlier in the toTitleCase().

In the code, there are already lines checking for words following hyphens, to make sure they are capitalized. We add a check in this part, where we exclude those specific cases of “forbidden” words following a hyphen. This prevents such words from being incorrectly capitalized.

Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2692,6 +2692,8 @@
         ## do not remove capitalization immediately after ": " or "- "
         ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
         ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
+        # don't capitalize lpat words after hyphenation
+        ind <- ind[!(xx[ind] == "-" & grepl(lpat, xx[ind + 2L]))]
         l[ind + 2L] <- FALSE
         ## Also after " (e.g. "A Book Title")
         ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]

@Bisaloo
Copy link

Bisaloo commented Jul 12, 2024

Proposal for a new bug report / discussion on bugzilla:


We noticed some edge cases in toTitleCase() that produced unexpected (to us!) results.

Mismatch between code comment and actual behaviour

'A' is not capitalized at the beginning of a sentence of after a colon.

This is at odds with the comments in the function source code:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2662-L2663

    ## These should be lower case except at the beginning (and after :)
    lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

'A' is not capitalized at the beginning of a sentence

tools::toTitleCase("a new bug")
#> [1] "a New Bug"

There is indeed an instruction to capitalize the first word of the sentence:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2687

        l[1L] <- FALSE

But then a later instruction explicitly excludes single-letter words from the transformation:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2696

        keep <- havecaps | l | (nchar(xx) == 1L) | alone

A potential solution is to handle the first word of the sentence later in the logic, so it's not overwritten by following exclusions. We still want it to happen before alone exclusions are applied, to ensure alone words are always left alone, even if they are in first position.

This leads to the following patch:

Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2688,7 +2688,6 @@
         alone <- alone | grepl("^'.*'$", xx)
         havecaps <- grepl("^[[:alpha:]].*[[:upper:]]+", xx)
         l <- grepl(lpat, xx, ignore.case = TRUE)
-        l[1L] <- FALSE
         ## do not remove capitalization immediately after ": " or "- "
         ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
         ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
@@ -2697,7 +2696,9 @@
         ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]
         l[ind + 1L] <- FALSE
         xx[l] <- tolower(xx[l])
-        keep <- havecaps | l | (nchar(xx) == 1L) | alone
+        keep <- havecaps | l | (nchar(xx) == 1L)
+        keep[1L] <- FALSE
+        keep <- keep | alone
         xx[!keep] <- sapply(xx[!keep], do1)
         paste(xx, collapse = "")

Note that this patch would capitalize all single-letter words at the beginning of a sentence, not just 'A'. We are not completely sure if this is the desired behaviour or not.

'A' is not capitalized after a colon

tools::toTitleCase("Salzburg: a city of music")
#> [1] "Salzburg: a City of Music"

This happens for the same reason, where single-letter words get a special treatment late in the logic. The only difference with the previous section is where the custom logic to capitalize the first word after a colon is applied:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2689-L2691

        ## do not remove capitalization immediately after ": " or "- "
        ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
        ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
        l[ind + 2L] <- FALSE

Special either exclusion list words are not capitalized at the beginning of a sentence or after a colon

At the moment, words in alone:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2656-L2661

    ## leave these alone: the internal caps rule would do that
    ## in some cases.  We could insist on this exact capitalization.
    alone <- c("2D", "3D", "AIC", "BayesX", "GoF", "HTML", "LaTeX",
               "MonetDB", "OpenBUGS", "TeX", "U.S.", "U.S.A.", "WinBUGS",
               "aka", "et", "al.", "ggplot2", "i.e.", "jar", "jars",
               "ncdf", "netCDF", "rgl", "rpart", "xls", "xlsx")

and either:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2665-L2670

    ## These we don't care about
    either <- c("all", "above", "after", "along", "also", "among",
                "any", "both", "can", "few", "it", "less", "log",
                "many", "may", "more", "over", "some", "their",
                "then", "this", "under", "until", "using", "von",
                "when", "where", "which", "will", "without",
                "yet", "you", "your")

are treated the same way:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2683

        alone <- xx %in% c(alone, either)

But this leads to situations where words in either are not capitalized at the beginning of a sentence or after a colon:

tools::toTitleCase("more than writing code")
#> [1] "more than Writing Code"

We may want to separate the logic for alone and either and ensure special handling happens at the correct location for each of them.

In particular, we may want to ensure either words are capitalized at the start of the sentence, while we probably never want this for alone words.

@reikookamoto
Copy link

As mentioned in previous posts, the code has a vector named alone. This vector includes a mix of technical words that are in mixed case and others that are not usually capitalized. However, it is not comprehensive and misses important tools like R. While I am not sure of the best solution, the current setup appears incomplete and difficult to maintain.

@Bisaloo
Copy link

Bisaloo commented Jul 12, 2024

Quoted words are not protected from capitalization change when followed by a comma:

tools::toTitleCase("Import and Export 'SPSS', 'Stata' and 'SAS' Files")
#> [1] "Import and Export 'Spss', 'Stata' and 'SAS' Files"

tools::toTitleCase("Import and Export 'SPSS' 'Stata' and 'SAS' Files")
#> [1] "Import and Export 'SPSS' 'Stata' and 'SAS' Files"

Created on 2024-07-12 with reprex v2.1.1

@shannonpileggi
Copy link
Author

A PR has been made to r-svn that

  1. fixes the intially reported bug
  2. fixes a spaced in the help documentation
  3. adds two examples to the help documentation
  4. adds two tests regarding the bug fixed

r-devel/r-svn#172

@shannonpileggi
Copy link
Author

Hi! Martin has committed changes here: r-devel/r-svn@cd556f5

Does anyone want take lead on summarizing other potential bugs and adding that to the bugzilla report for consideration?

@Bisaloo
Copy link

Bisaloo commented Jul 17, 2024

We still have r-devel/r-svn#174 on hold on bugzilla and I'm happy to post this comment about capitalization at the start of a string as a new report: #32 (comment)

@shannonpileggi, please let me know if the phrasing in #32 (comment) seems good to you or if you'd tweak anything. If it sounds good, a 👍 reaction on the message to confirm your agreement would be great!

@shannonpileggi
Copy link
Author

Yes, it is very through and looks great!! Please post @Bisaloo 🙏

@Bisaloo
Copy link

Bisaloo commented Jul 17, 2024

Thanks, it is now submitted as https://bugs.r-project.org/show_bug.cgi?id=18767.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants