-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
the state of stringi #80
Comments
Agree that if it stays let's incorporate it more. I might lean more towards removing, though.
But if we want to keep it I'd happily move towards a more global use. |
This is the original pull request: awalker89/openxlsx#440. Guess removing the lists from style has the greatest impact here. I'm somewhat undecided and hoped you have a I'll see if I can make some "replace all changes" to use more stringi and see if the tests bark |
It might be good to set up some general bench marks, too. We could do some large tasks and then also check if replacing base with Okay, my strong-threaded opinion would be: 🧵 keep I had played around with some bench marks for I know I had seen a more comprehensive benchmarking of |
Thanks (for the benchmark), I like strong-threaded opinions 😆 And I totally agree. I have no idea, how they came to their initial opinion, that using stringi would fix the bottleneck, because in my experience all calls to create R objects (any kind of R vectors such as lists, e.g. all the nice looking styleObjects) slow things down massively. Nonetheless, I suspect that keeping stringi if helpful is fine. I usually tend to avoid benchmarks, because they often are purely synthetic, if someone's life depends on nanoseconds or microseconds, that's tough, but usually it is negligible. I only did a few benchmarks with library(microbenchmark)
library(openxlsx2)
xlsxFile <- system.file("extdata", "readTest.xlsx", package = "openxlsx2")
# xlsxFile <- system.file("extdata", "loadExample.xlsx", package = "openxlsx2")
xlsx_new <- temp_xlsx()
xlsx_old <- temp_xlsx()
wb_old <- openxlsx::loadWorkbook(xlsxFile)
wb_new <- openxlsx2::loadWorkbook(xlsxFile)
res <- microbenchmark(
wb_old <- openxlsx::loadWorkbook(xlsxFile),
wb_new <- openxlsx2::loadWorkbook(xlsxFile),
x_old <- openxlsx::read.xlsx(wb_old, sheet = 3),
x_new <- openxlsx2::read.xlsx(wb_new, sheet = 3),
openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE),
openxlsx2::saveWorkbook(wb_new, xlsx_new, TRUE),
times = 10
) Not so bad after all > res
Unit: milliseconds
expr min lq mean median uq max neval
wb_old <- openxlsx::loadWorkbook(xlsxFile) 113.735271 115.945991 134.388582 121.319840 126.267331 200.960147 10
wb_new <- openxlsx2::loadWorkbook(xlsxFile) 105.957325 106.947844 111.388439 111.474962 113.801281 122.447566 10
x_old <- openxlsx::read.xlsx(wb_old, sheet = 3) 5.334305 5.672104 5.934488 5.946599 6.242168 6.330523 10
x_new <- openxlsx2::read.xlsx(wb_new, sheet = 3) 51.637901 55.926952 66.522352 57.760185 65.882449 129.364471 10
openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 338.549136 341.858205 356.209533 347.462269 357.731109 415.190846 10
openxlsx2::saveWorkbook(wb_new, xlsx_new, TRUE) 248.617153 251.209173 255.417155 252.374906 256.392475 271.931147 10 BTT I'll see if we gain significant speed from stringi, but somehow doubt it. |
An update, just to show that why I was already pleased with the previous results, we still have improved a bit (keep in mind that the > library(microbenchmark)
> # library(openxlsx2)
> # library(openxlsx)
>
> xlsxFile <- system.file("extdata", "readTest.xlsx", package = "openxlsx2")
> # xlsxFile <- system.file("extdata", "loadExample.xlsx", package = "openxlsx2")
>
> xlsx_new <- openxlsx::temp_xlsx()
> xlsx_old <- openxlsx2::temp_xlsx()
> wb_old <- openxlsx::loadWorkbook(xlsxFile)
> wb_new <- openxlsx2::loadWorkbook(xlsxFile)
>
> # profvis::profvis({openxlsx2::wb_to_df(wb_new, sheet = 3)})
>
> sheet <- 5
> res <- microbenchmark(
+ wb_old <- openxlsx::loadWorkbook(xlsxFile),
+ wb_new <- openxlsx2::loadWorkbook(xlsxFile),
+ x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet),
+ x_new <- openxlsx2::read.xlsx(wb_new, sheet = sheet),
+ openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE),
+ openxlsx2::wb_save(wb_new, xlsx_new, TRUE),
+ x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet),
+ x_new <- openxlsx2::read.xlsx(xlsxFile, sheet = sheet),
+ times = 25,
+ unit = "ms"
+ ); res
Unit: milliseconds
expr min lq mean median uq max neval
wb_old <- openxlsx::loadWorkbook(xlsxFile) 114.06011 116.91896 120.63606 119.97449 125.32626 127.92845 25
wb_new <- openxlsx2::loadWorkbook(xlsxFile) 97.32781 99.83734 105.17499 101.94228 104.09662 170.98488 25
x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet) 12.07024 13.31081 27.02152 17.09983 24.91381 81.92005 25
x_new <- openxlsx2::read.xlsx(wb_new, sheet = sheet) 90.66912 98.50570 118.53569 101.88242 152.26818 180.64448 25
openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 335.47623 347.45782 363.75094 357.61180 368.94383 416.66320 25
openxlsx2::wb_save(wb_new, xlsx_new, TRUE) 245.38106 252.07743 260.60964 255.55972 259.98461 328.89134 25
x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet) 79.98370 82.54399 90.22901 84.29079 88.24418 156.33743 25
x_new <- openxlsx2::read.xlsx(xlsxFile, sheet = sheet) 157.08068 166.63302 190.37043 176.86383 222.48687 246.65481 25 [edit] already obsolete, now: Unit: milliseconds
expr min lq mean median uq max neval
wb_old <- openxlsx::loadWorkbook(xlsxFile) 114.56835 118.46339 128.22633 124.52356 125.91682 186.02368 25
wb_new <- openxlsx2::loadWorkbook(xlsxFile) 97.90185 101.02084 107.55111 102.46511 110.21075 178.37165 25
x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet) 12.48958 13.26789 23.93558 13.73180 22.54200 88.13606 25
x_new <- openxlsx2::read.xlsx(wb_new, sheet = sheet) 58.17527 70.25260 86.41652 76.04299 83.44763 149.41700 25
openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 340.71262 355.09063 394.10805 368.02978 420.84393 551.88726 25
openxlsx2::wb_save(wb_new, xlsx_new, TRUE) 235.77870 244.68706 250.40670 246.74021 254.71594 298.25569 25
x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet) 80.44028 82.04740 87.90603 88.31716 92.29674 101.44950 25
x_new <- openxlsx2::read.xlsx(xlsxFile, sheet = sheet) 131.38844 135.31841 158.03921 144.35038 170.05090 215.07284 25 [edit2] final update for today: Unit: milliseconds
expr min lq mean median uq max neval
wb_old <- openxlsx::loadWorkbook(xlsxFile) 115.21385 117.28489 123.83189 119.79142 132.34923 135.1745 25
wb_new <- openxlsx2::loadWorkbook(xlsxFile) 94.89352 96.47255 104.45200 97.88512 112.79448 127.8093 25
x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet) 12.00849 12.66236 29.23914 26.09047 27.60120 105.1724 25
x_new <- openxlsx2::read.xlsx(wb_new, sheet = sheet) 58.12714 73.18217 89.22837 85.95888 91.18047 157.2743 25
openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 329.60691 348.01489 380.94227 359.41129 405.29353 621.1345 25
openxlsx2::wb_save(wb_new, xlsx_new, TRUE) 233.41152 250.11640 256.15387 252.49128 259.62258 302.5504 25
x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet) 79.00429 79.83381 88.75215 81.39796 95.93943 165.1624 25
x_new <- openxlsx2::read.xlsx(xlsxFile, sheet = sheet) 119.75932 137.94233 160.16483 150.45118 160.43333 239.5629 25 |
I've added a few stri_joins in f2b8408, but the entire benchmarking is a bottomless pit. Desperately craving for a few ms ... |
So, many of our string manipulation with This is how I see it, we can do one of these:
The first one seems more time expensive and possibly problematic. Keeping |
I have been thinking about this a bit lately and discussed it over a few beers earlier this week when I visited a friend in Munich. I did some tests with |
After merging #173 (increased loading time) and #178 (decreased loading time) we're at: > xlsxFile <- system.file("extdata", "readTest.xlsx", package = "openxlsx2")
>
> xlsx_new <- openxlsx::temp_xlsx()
> xlsx_old <- openxlsx2::temp_xlsx()
> wb_old <- openxlsx::loadWorkbook(xlsxFile)
> wb_new <- openxlsx2::wb_load(xlsxFile)
>
> sheet <- 5 # large sheet 271 x 297. With more columns than rows and we are still a bit slow creating long_to_wide
Unit: milliseconds
expr min lq mean median uq max neval
wb_old <- openxlsx::loadWorkbook(xlsxFile) 117.06906 121.93002 125.00688 124.12016 127.09787 140.59572 25
wb_new <- openxlsx2::wb_load(xlsxFile) 77.49894 82.88466 88.44689 86.89495 91.82446 108.19031 25
x_old <- openxlsx::read.xlsx(wb_old, sheet = sheet) 11.91993 12.83833 14.93066 13.58892 16.05548 30.48112 25
x_new <- openxlsx2::read_xlsx(wb_new, sheet = sheet) 62.33648 64.52658 78.70571 68.56979 83.40298 131.60475 25
openxlsx::saveWorkbook(wb_old, xlsx_old, TRUE) 332.48864 347.91272 367.21375 355.74601 369.73673 526.97620 25
openxlsx2::wb_save(wb_new, xlsx_new, TRUE) 232.77114 241.03814 256.42354 252.86414 255.83221 304.76895 25
x_old <- openxlsx::read.xlsx(xlsxFile, sheet = sheet) 77.80480 82.31591 89.16942 84.56348 89.28549 145.26382 25
x_new <- openxlsx2::read_xlsx(xlsxFile, sheet = sheet) 120.41852 126.09833 147.42198 135.29242 177.04768 189.69425 25 |
Probably good to close? I think most of the base functions in here have been replaced with |
yeah, it was open solely for the benchmarks |
Having moved imports in f88f995 I wondered, if we still need the
stringi
dependency or if we should continue to strengthen its use.Previously somebody complained, that
c()
andpaste/0()
are slow, and introducedstringi
to the package. Is this still a valid claim? Especially because I have not taken care of it, we currently have a mix of both worlds. Should we move the remainingc()
and friends tostringi
or should we ditchstringi
altogether?Right now, the only advantage I see is that we have slightly better unicode support on non utf8 Windows, but hopefully this will be fixed with the Windows unicode only R releases.
The text was updated successfully, but these errors were encountered: