[cbindlist/mergelist] New seqexp helper #6434

MichaelChirico · 2024-08-29T17:25:22Z

Towards #4370.

MichaelChirico · 2024-08-29T17:25:34Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

[cbindlist/mergelist] New seqexp helper #6434 👈
[cbindlist/mergelist] Add some C utilities: NROW/NCOL/isDataTable/isDataList/perhapsDataTable #6433 : 1 other dependent PR (#6435 )
[cbindlist/mergelist] Add ERR to the mult enum in bmerge.c #6432
[cbindlist/mergelist] Whitespace changes #6431
master

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @MichaelChirico and the rest of your teammates on Graphite

@jan-glx

add cbind by reference, timing R prototype of mergelist wording use lower overhead funs stick to int32 for now, correct R_alloc bmerge C refactor for codecov and one loop for speed address revealed codecov gaps refactor vecseq for codecov seqexp helper, some alloccol export on C bmerge codecov, types handled in R bmerge already better comment seqexp bmerge mult=error #655 multiple new C utils swap if branches explain new C utils comments mostly reduce conflicts to PR #4386 comment C code address multiple matches during update-on-join #3747 Revert "address multiple matches during update-on-join #3747" This reverts commit b64c0c3. merge.dt has temporarily mult arg, for testing minor changes to cbindlist c dev mergelist, for single pair now add quiet option to cc() mergelist tests add check for names to perhaps.dt rm mult from merge.dt method rework, clean, polish multer, fix righ and full joins make full join symmetric mergepair inner function to loop on extra check for symmetric mergelist manual ensure no df-dt passed where list expected comments and manual handle 0 cols tables more tests more tests and debugging move more logic closer to bmerge, simplify mergepair more tests revert not used changes reduce not needed checks, cleanup copy arg behavior, manual, no tests yet cbindlist manual, export both cleanup processing bmerge to dtmatch test function match order for easier preview vecseq gets short-circuit batch test allow browser big cleanup remmove unneeded stuff, reduce diff more cleanup, minor manual fixes add proper test scripts Merge branch 'master' into cbind-merge-list comment out not used code for coverage more tests, some nocopy opts rename sql test script, should fix codecov simplify dtmatch inner branch more precise copy, now copy only T or F unused arg not yet in api, wording comments and refer issues codecov hasindex coverage codecov gap tests for join using key, cols argument fix missing import forderv more tests, improve missing on handling more tests for order of inner and full join for long keys new allow.cartesian option, #4383, #914 reduce diff, improve codecov reduce diff, comments need more DT, not lists, mergelist 3+ tbls proper escape heavy check unit tests more tests, address overalloc failure mergelist and cbindlist retain index manual, examples fix manual minor clarify in manual retain keys, right outer join for snowflake schema joins duplicates in cbindlist recycling in cbindlist escape 0 input in copyCols empty input handling closing cbindlist vectorized _on_ and _join.many_ arg rename dtmatch to dtmerge vectorized args: how, mult push down input validation add support for cross join, semi join, anti join full join, reduce overhead for mult=error mult default value dynamic fix manual add "see details" to Rd mention shared on in arg description amend feedback from Michael semi and anti joins will not reorder x columns Merge branch 'master' into cbind-merge-list spelling, thx to @jan-glx check all new funs used and add comments bugfix, sort=T needed for now Merge branch 'master' into cbind-merge-list Update NEWS.md Merge branch 'master' into cbind-merge-list Merge branch 'master' into cbind-merge-list NEWS placement numbering ascArg->order Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list attempt to restore from master Update to stopf() error style Need isFrame for now More quality checks: any(!x)->!all(x); use vapply_1{b,c,i} really restore from master try to PROTECT() before duplicate() update error message in test appease the rchk gods extraneous space missing ';' use catf simplify perhapsDataTableR move sqlite.Rraw.manual into other.Rraw simplify for loop Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list

MichaelChirico · 2024-08-29T17:39:35Z

@jangorecki AFAICT this is just rep(seq_along(x), x), do we actually need this?

github-actions · 2024-08-29T17:45:31Z

Generated via commit f1ee884

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 11 minutes and 36 seconds

Time taken to run atime::atime_pkg on the tests: 5 minutes and 55 seconds

jangorecki · 2024-08-29T22:10:54Z

Don't remember now but comments says something else

MichaelChirico · 2024-08-29T22:14:30Z

Using what's described in the comment:

seqexp = function(x) unlist(lapply(seq_along(x), function(i) rep(i, x[[i]])))
baseonly = function(x) rep(seq_along(x), x)

for (ii in 1e4) {
  input = sample(0:10, sample(10), TRUE)
  stopifnot(identical(seqexp(input), baseonly(input)))
}

So I think we can just use rep(seq_along(x), x) and drop this PR.

jangorecki · 2024-08-30T07:55:47Z

If it doesn't sacrifice performance then make sense

MichaelChirico · 2024-08-30T17:09:38Z

I think your fully custom function is bound to be a bit faster -- I am seeing ~2x speedup vs. the rep() approach. But both are leagues faster (50 or 100x) than the original suggested unlist(lapply(...)) approach:

slow_seqexp = function(x) unlist(lapply(seq_along(x), function(i) rep(i, x[[i]])))
rep_seqexp = function(x) rep(seq_along(x), x)
seqexp = function(x) .Call(Cseqexp, x)

library(microbenchmark)
for (n in 10^(0:7)) {
  x = sample(0:9, n, TRUE)
  message("length(x)=", n)
  print(microbenchmark(times = 50L,
    slow_seqexp(x),
    rep_seqexp(x),
    seqexp(x)
  ))
}

Output:

length(x)=1
Unit: nanoseconds
           expr  min   lq    mean median   uq   max neval
 slow_seqexp(x) 2785 2925 5158.56   3086 5380 55624    50
  rep_seqexp(x)  470  530 1013.98    611  912  8286    50
      seqexp(x)  361  400  867.44    481  591 16491    50
length(x)=10
Unit: nanoseconds
           expr   min    lq     mean  median    uq   max neval
 slow_seqexp(x) 10029 10310 11207.58 10535.0 10800 36198    50
  rep_seqexp(x)   852  1022  1319.74  1097.0  1262  9969    50
      seqexp(x)   441   571  1238.62   671.5   791 18615    50
length(x)=100
Unit: nanoseconds
           expr   min    lq     mean  median    uq    max neval
 slow_seqexp(x) 80481 81802 83225.96 82344.0 83146 104756    50
  rep_seqexp(x)  2846  3166  4221.00  3451.0  3727  36428    50
      seqexp(x)   741   922  1228.06  1071.5  1272   7243    50
length(x)=1000
Unit: microseconds
           expr     min      lq      mean   median      uq      max neval
 slow_seqexp(x) 802.197 841.701 954.25018 868.2700 920.979 2569.766    50
  rep_seqexp(x)  22.803  23.765  28.58256  24.6160  27.992  120.465    50
      seqexp(x)   3.246   3.436   5.82878   3.8175   4.649   41.337    50
length(x)=10000
Unit: microseconds
           expr      min       lq        mean    median        uq       max neval
 slow_seqexp(x) 8350.744 8811.453 10005.48454 9305.6555 11389.805 13688.665    50
  rep_seqexp(x)  269.052  270.976   278.48738  274.4725   278.309   399.265    50
      seqexp(x)   56.044   74.980    87.70662   88.2995    99.846   116.808    50
length(x)=1e+05
Unit: microseconds
           expr        min         lq       mean     median         uq        max neval
 slow_seqexp(x) 101143.434 125845.171 152667.150 146076.487 168876.313 306177.552    50
  rep_seqexp(x)   2672.197   2705.428   3196.966   3375.609   3408.150   5426.176    50
      seqexp(x)    946.787    988.405   1316.806   1053.977   1650.169   2147.478    50
length(x)=1e+06
Unit: milliseconds
           expr         min         lq      mean     median         uq       max neval
 slow_seqexp(x) 1096.518670 1319.06547 1386.3892 1366.99251 1453.26510 1817.7871    50
  rep_seqexp(x)   26.939532   29.19514   41.5856   33.94697   38.86142  141.5816    50
      seqexp(x)    9.680785   12.03521   26.4762   16.21189   21.59973  141.1328    50
1e+07
Unit: milliseconds
           expr         min         lq       mean     median         uq        max neval
 slow_seqexp(x) 11887.56058 13621.8274 15053.2152 14740.4736 16888.6295 18849.8520    50
  rep_seqexp(x)   268.52191   296.4584   470.7072   348.7749   459.7034  1117.5036    50
      seqexp(x)    96.75193   130.7713   323.5593   173.1193   327.9886   945.9111    50

Even at an output of 45,000,000 elements, the difference is <.2 seconds. I think we can drop this helper.

It's also conceivable in the future base R would use ALTREP to speed this up -- we could do the same but probably better to have such logic live in base.

This was referenced Aug 29, 2024

[cbindlist/mergelist] Whitespace changes #6431

Open

[cbindlist/mergelist] Add ERR to the mult enum in bmerge.c #6432

Open

MichaelChirico force-pushed the cbind-merge-list-utils branch from 63141a6 to 627e77d Compare August 29, 2024 17:31

MichaelChirico force-pushed the cbind-merge-list-seqexp branch from 090a21b to f1ee884 Compare August 29, 2024 17:31

MichaelChirico changed the title ~~cbindlist~~ [cbindlist/mergelist] New seqexp helper Aug 29, 2024

MichaelChirico added the graphite-ready PRs that are managed by Graphite, possibly marked as draft, but ready for review label Aug 29, 2024

MichaelChirico closed this Aug 30, 2024

MichaelChirico deleted the cbind-merge-list-seqexp branch August 30, 2024 17:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cbindlist/mergelist] New seqexp helper #6434

[cbindlist/mergelist] New seqexp helper #6434

MichaelChirico commented Aug 29, 2024 •

edited

Loading

MichaelChirico commented Aug 29, 2024 •

edited

Loading

MichaelChirico commented Aug 29, 2024

github-actions bot commented Aug 29, 2024 •

edited

Loading

jangorecki commented Aug 29, 2024

MichaelChirico commented Aug 29, 2024

jangorecki commented Aug 30, 2024

MichaelChirico commented Aug 30, 2024 •

edited

Loading

[cbindlist/mergelist] New seqexp helper #6434

[cbindlist/mergelist] New seqexp helper #6434

Conversation

MichaelChirico commented Aug 29, 2024 • edited Loading

MichaelChirico commented Aug 29, 2024 • edited Loading

MichaelChirico commented Aug 29, 2024

github-actions bot commented Aug 29, 2024 • edited Loading

jangorecki commented Aug 29, 2024

MichaelChirico commented Aug 29, 2024

jangorecki commented Aug 30, 2024

MichaelChirico commented Aug 30, 2024 • edited Loading

MichaelChirico commented Aug 29, 2024 •

edited

Loading

MichaelChirico commented Aug 29, 2024 •

edited

Loading

github-actions bot commented Aug 29, 2024 •

edited

Loading

MichaelChirico commented Aug 30, 2024 •

edited

Loading