
the state of stringi #80

Closed

JanMarvin opened this issue Mar 6, 2022 · 11 comments

@JanMarvin (Owner)

Having moved imports in f88f995, I wonder whether we still need the stringi dependency or whether we should continue to strengthen its use.

Previously somebody complained that c() and paste0() are slow and introduced stringi to the package. Is this still a valid claim? Especially because I have not paid much attention to it, we currently have a mix of both worlds. Should we move the remaining c() and friends to stringi, or should we ditch stringi altogether?

Right now, the only advantage I see is slightly better Unicode support on non-UTF-8 Windows, but hopefully that will be resolved by the UTF-8-only R releases on Windows.

# from NAMESPACE
importFrom(stringi,stri_c)
importFrom(stringi,stri_conv)
importFrom(stringi,stri_isempty)
importFrom(stringi,stri_join)
importFrom(stringi,stri_match)
importFrom(stringi,stri_replace_all_fixed)
importFrom(stringi,stri_split_fixed)
importFrom(stringi,stri_sub)
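
For reference, a rough sketch (not from this thread) of how the imported stringi functions line up with their base R counterparts; the sample vector is made up for illustration:

```r
library(stringi)

x <- c("foo", "", "bar")

stri_join(x, collapse = ",")         # ~ paste(x, collapse = ",")
stri_isempty(x)                      # ~ !nzchar(x)
stri_replace_all_fixed(x, "o", "0")  # ~ gsub("o", "0", x, fixed = TRUE)
stri_split_fixed("a,b,c", ",")       # ~ strsplit("a,b,c", ",", fixed = TRUE)
stri_sub("hello", 2, 4)              # ~ substr("hello", 2, 4)
```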
@jmbarbone (Collaborator)

Agreed: if it stays, let's incorporate it more.

I might lean more towards removing it, though.

  1. I think most of our use is just stri_join(), which does the same thing as paste() (with different defaults), plus some uses of stri_sub()/stri_replace_all() (same as gsub() or just sub()). So we're not really doing anything different. We could use stri_extract() more to replace the magic we're doing with the regmatches(x, gregexpr(pattern, x)) workflow -- I'm fairly sure stringi is a lot quicker there.
  2. stringi benchmarks better than base R, but I wonder whether we're really working with enough text for it to matter. I've never felt base R's paste() and c() were slow for anything I've done, so the difference probably isn't noticeable until you get to really large/complicated vectors.
  3. It's another dependency. I think we're getting some speed back by changing ReferenceClass to R6 and moving more things into C++; I don't think stringi is really going to save us. Besides, I'm fairly sure some of the changes I've made to incorporate paste_c() (and to be more efficient with string joining) have saved more time than simply replacing all the base R calls with stringi would.
  4. If Unicode handling is especially important/problematic, then I'd rather not have to think about it and let stringi do the thinking for me.

But if we want to keep it I'd happily move towards a more global use.
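
The regmatches(x, gregexpr(...)) workflow from point 1 could be sketched against its stringi counterpart like this (a hypothetical example with made-up input, not code from the package):

```r
library(stringi)

x <- c("r5c7", "A1:B2")

# base R: extract all digit runs from each string
regmatches(x, gregexpr("[0-9]+", x))

# stringi equivalent, typically faster on large vectors
stri_extract_all_regex(x, "[0-9]+")
```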

@JanMarvin (Owner Author)

JanMarvin commented Mar 7, 2022

This is the original pull request: awalker89/openxlsx#440. I guess removing the lists from style has the greatest impact here. I'm somewhat undecided and hoped you had a strong feeling for one or the other :)
pugixml treats everything as UTF-8; stringi, afaik, is a bit more reliable at keeping UTF-8 as UTF-8. But in all honesty I don't want to wrap my head around theoretical issues that will resolve themselves on modern software (we can cross that bridge once we get there).

I'll see if I can make some "replace all" changes to use more stringi and check whether the tests bark.

@jmbarbone (Collaborator)

It might be good to set up some general benchmarks, too. We could run some large tasks and then check whether replacing base with stringi is enough of an improvement.

Okay, my strong-threaded 🧵 opinion would be: keep stringi if we can show it's much faster. Otherwise, let the package install a bit more quickly.

I had played around with some benchmarks for stringi:
https://gist.github.com/jmbarbone/2327d308edcbf79bfeb10dec76cced06

I know I had seen a more comprehensive benchmarking of stringi, but I can't find it. The only section on this I could easily find in the package documentation is small: https://stringi.gagolewski.com/weave/design_principles.html?highlight=bench

@JanMarvin (Owner Author)

Thanks (for the benchmark), I like strong-threaded opinions 😆 And I totally agree. I have no idea how they came to their initial conclusion that using stringi would fix the bottleneck, because in my experience all calls that create R objects (any kind of R vector, such as lists, e.g. all the nice-looking styleObjects) slow things down massively. Nonetheless, I suspect that keeping stringi where it's helpful is fine.

I usually tend to avoid benchmarks because they are often purely synthetic; if someone's life depends on nanoseconds or microseconds, that's tough, but usually the difference is negligible. I only did a few benchmarks with openxlsx2 when I was redeveloping the whole backend, so I was not entirely sure how we compete with openxlsx. Microbenchmark on an M1 MacBook Pro (2021):

library(microbenchmark)
library(openxlsx2)

xlsxFile <- system.file("extdata", "readTest.xlsx", package = "openxlsx2")
# xlsxFile <- system.file("extdata", "loadExample.xlsx", package = "openxlsx2")

xlsx_new <- temp_xlsx()
xlsx_old <- temp_xlsx()
wb_old <- openxlsx::loadWorkbook(xlsxFile)
wb_new <- openxlsx2::loadWorkbook(xlsxFile)

res <- microbenchmark(
  wb_old <- openxlsx::loadWorkbook(xlsxFile),
  wb_new <- openxlsx2::loadWorkbook(xlsxFile),
  x_old <- openxlsx::read.xlsx(wb_old, sheet = 3),
  x_new <- openxlsx2::read.xlsx(wb_new, sheet = 3),
  openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE),
  openxlsx2::saveWorkbook(wb_new, xlsx_new, TRUE),
  times = 10
)

Not so bad after all

> res
Unit: milliseconds
                                             expr        min         lq       mean     median         uq        max neval
       wb_old <- openxlsx::loadWorkbook(xlsxFile) 113.735271 115.945991 134.388582 121.319840 126.267331 200.960147    10
      wb_new <- openxlsx2::loadWorkbook(xlsxFile) 105.957325 106.947844 111.388439 111.474962 113.801281 122.447566    10
  x_old <- openxlsx::read.xlsx(wb_old, sheet = 3)   5.334305   5.672104   5.934488   5.946599   6.242168   6.330523    10
 x_new <- openxlsx2::read.xlsx(wb_new, sheet = 3)  51.637901  55.926952  66.522352  57.760185  65.882449 129.364471    10
   openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 338.549136 341.858205 356.209533 347.462269 357.731109 415.190846    10
  openxlsx2::saveWorkbook(wb_new, xlsx_new, TRUE) 248.617153 251.209173 255.417155 252.374906 256.392475 271.931147    10

Back to topic: I'll see if we gain significant speed from stringi, but I somehow doubt it.

@JanMarvin (Owner Author)

JanMarvin commented Mar 27, 2022

An update. Just to show that, while I was already pleased with the previous results, we have still improved a bit (keep in mind that the readTest example is somewhat unfavourable to us: openxlsx only reads and writes a quarter of the rows on the sheet, while we read the entire thing with everything on it):

> library(microbenchmark)
> # library(openxlsx2)
> # library(openxlsx)
> 
> xlsxFile <- system.file("extdata", "readTest.xlsx", package = "openxlsx2")
> # xlsxFile <- system.file("extdata", "loadExample.xlsx", package = "openxlsx2")
> 
> xlsx_new <- openxlsx2::temp_xlsx()
> xlsx_old <- openxlsx::temp_xlsx()
> wb_old <- openxlsx::loadWorkbook(xlsxFile)
> wb_new <- openxlsx2::loadWorkbook(xlsxFile)
> 
> # profvis::profvis({openxlsx2::wb_to_df(wb_new, sheet = 3)})
> 
> sheet <- 5
> res <- microbenchmark(
+   wb_old <- openxlsx::loadWorkbook(xlsxFile),
+   wb_new <- openxlsx2::loadWorkbook(xlsxFile),
+   x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet),
+   x_new <- openxlsx2::read.xlsx(wb_new, sheet = sheet),
+   openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE),
+   openxlsx2::wb_save(wb_new, xlsx_new, TRUE),
+   x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet),
+   x_new <- openxlsx2::read.xlsx(xlsxFile, sheet = sheet),
+   times = 25,
+   unit = "ms"
+ ); res
Unit: milliseconds
                                                   expr       min        lq      mean    median        uq       max neval
             wb_old <- openxlsx::loadWorkbook(xlsxFile) 114.06011 116.91896 120.63606 119.97449 125.32626 127.92845    25
            wb_new <- openxlsx2::loadWorkbook(xlsxFile)  97.32781  99.83734 105.17499 101.94228 104.09662 170.98488    25
    x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet)  12.07024  13.31081  27.02152  17.09983  24.91381  81.92005    25
   x_new <- openxlsx2::read.xlsx(wb_new, sheet = sheet)  90.66912  98.50570 118.53569 101.88242 152.26818 180.64448    25
         openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 335.47623 347.45782 363.75094 357.61180 368.94383 416.66320    25
             openxlsx2::wb_save(wb_new, xlsx_new, TRUE) 245.38106 252.07743 260.60964 255.55972 259.98461 328.89134    25
  x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet)  79.98370  82.54399  90.22901  84.29079  88.24418 156.33743    25
 x_new <- openxlsx2::read.xlsx(xlsxFile, sheet = sheet) 157.08068 166.63302 190.37043 176.86383 222.48687 246.65481    25

[edit] already obsolete, now:

Unit: milliseconds
                                                   expr       min        lq      mean    median        uq       max neval
             wb_old <- openxlsx::loadWorkbook(xlsxFile) 114.56835 118.46339 128.22633 124.52356 125.91682 186.02368    25
            wb_new <- openxlsx2::loadWorkbook(xlsxFile)  97.90185 101.02084 107.55111 102.46511 110.21075 178.37165    25
    x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet)  12.48958  13.26789  23.93558  13.73180  22.54200  88.13606    25
   x_new <- openxlsx2::read.xlsx(wb_new, sheet = sheet)  58.17527  70.25260  86.41652  76.04299  83.44763 149.41700    25
         openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 340.71262 355.09063 394.10805 368.02978 420.84393 551.88726    25
             openxlsx2::wb_save(wb_new, xlsx_new, TRUE) 235.77870 244.68706 250.40670 246.74021 254.71594 298.25569    25
  x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet)  80.44028  82.04740  87.90603  88.31716  92.29674 101.44950    25
 x_new <- openxlsx2::read.xlsx(xlsxFile, sheet = sheet) 131.38844 135.31841 158.03921 144.35038 170.05090 215.07284    25

[edit2] final update for today:

Unit: milliseconds
                                                   expr       min        lq      mean    median        uq      max neval
             wb_old <- openxlsx::loadWorkbook(xlsxFile) 115.21385 117.28489 123.83189 119.79142 132.34923 135.1745    25
            wb_new <- openxlsx2::loadWorkbook(xlsxFile)  94.89352  96.47255 104.45200  97.88512 112.79448 127.8093    25
    x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet)  12.00849  12.66236  29.23914  26.09047  27.60120 105.1724    25
   x_new <- openxlsx2::read.xlsx(wb_new, sheet = sheet)  58.12714  73.18217  89.22837  85.95888  91.18047 157.2743    25
         openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 329.60691 348.01489 380.94227 359.41129 405.29353 621.1345    25
             openxlsx2::wb_save(wb_new, xlsx_new, TRUE) 233.41152 250.11640 256.15387 252.49128 259.62258 302.5504    25
  x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet)  79.00429  79.83381  88.75215  81.39796  95.93943 165.1624    25
 x_new <- openxlsx2::read.xlsx(xlsxFile, sheet = sheet) 119.75932 137.94233 160.16483 150.45118 160.43333 239.5629    25

@JanMarvin (Owner Author)

I've added a few stri_join()s in f2b8408, but this whole benchmarking business is a bottomless pit. Desperately chasing a few ms ...

@jmbarbone (Collaborator)

So, much of our string manipulation with gsub() and regmatches() uses Perl-compatible regular expressions (perl = TRUE), which stringi doesn't support.

The way I see it, we can do one of these:

  • rewrite the string searches to no longer use perl expression
  • use base R where we need perl expressions, replace what we can with stringi
  • remove stringi

The first option seems more time-consuming and possibly problematic. Keeping stringi where we can could be beneficial, but I don't know by how much, or whether it would be worth the trouble of including the dependency. I don't think the string operations are our biggest slowdown.
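
As an illustration of the second option (a hypothetical example, not code from the package): a PCRE-only construct such as \K has no equivalent in stringi's ICU regex engine, but it can often be rewritten with a lookbehind, which ICU does support:

```r
# base R with perl = TRUE (PCRE): \K drops "foo" from the reported match,
# so only "bar" is replaced
gsub("foo\\Kbar", "baz", "foobar", perl = TRUE)

# stringi uses ICU regular expressions; a lookbehind achieves the same
stringi::stri_replace_all_regex("foobar", "(?<=foo)bar", "baz")
```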

@JanMarvin (Owner Author)

I have been thinking about this a bit lately and discussed it over a few beers earlier this week when I visited a friend in Munich. I did some tests with profvis (the profiler that ships with RStudio), and using stringi was indeed helpful. However, I would not invest much time in forcing conversions to/from stringi; it's just not worth the effort. After all, this isn't competitive programming, and I'm not going to lose sleep over a few wasted milliseconds, especially not if it costs a few hours of development work on my part. As long as we don't see or feel that we are slow, and users don't point it out, we shouldn't waste our time optimizing artificial benchmarks.
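
A minimal profvis sketch along those lines, assuming the openxlsx2 wb_load()/wb_to_df() API used elsewhere in this thread:

```r
library(profvis)
library(openxlsx2)

xlsxFile <- system.file("extdata", "readTest.xlsx", package = "openxlsx2")

# profile a full load/read cycle to see where the time is actually spent
profvis({
  wb <- wb_load(xlsxFile)
  df <- wb_to_df(wb, sheet = 5)
})
```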

@JanMarvin (Owner Author)

After merging #173 (increased loading time) and #178 (decreased loading time) we're at:

> xlsxFile <- system.file("extdata", "readTest.xlsx", package = "openxlsx2")
>
> xlsx_new <- openxlsx2::temp_xlsx()
> xlsx_old <- openxlsx::temp_xlsx()
> wb_old <- openxlsx::loadWorkbook(xlsxFile)
> wb_new <- openxlsx2::wb_load(xlsxFile)
>
> sheet <- 5 # large sheet, 271 x 297; with more columns than rows, we are still a bit slow creating long_to_wide
Unit: milliseconds
                                                   expr       min        lq      mean    median        uq       max neval
             wb_old <- openxlsx::loadWorkbook(xlsxFile) 117.06906 121.93002 125.00688 124.12016 127.09787 140.59572    25
                 wb_new <- openxlsx2::wb_load(xlsxFile)  77.49894  82.88466  88.44689  86.89495  91.82446 108.19031    25
    x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet)  11.91993  12.83833  14.93066  13.58892  16.05548  30.48112    25
   x_new <- openxlsx2::read_xlsx(wb_new, sheet = sheet)  62.33648  64.52658  78.70571  68.56979  83.40298 131.60475    25
         openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 332.48864 347.91272 367.21375 355.74601 369.73673 526.97620    25
             openxlsx2::wb_save(wb_new, xlsx_new, TRUE) 232.77114 241.03814 256.42354 252.86414 255.83221 304.76895    25
  x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet)  77.80480  82.31591  89.16942  84.56348  89.28549 145.26382    25
 x_new <- openxlsx2::read_xlsx(xlsxFile, sheet = sheet) 120.41852 126.09833 147.42198 135.29242 177.04768 189.69425    25

@jmbarbone (Collaborator)

Probably good to close? I think most of the base functions in here have been replaced with {stringi} by now.

@JanMarvin (Owner Author)

Yeah, it was open solely for the benchmarks.
