Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cbindlist/mergelist] New seqexp helper #6434

Closed

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Aug 29, 2024

Towards #4370.

Copy link
Member Author

MichaelChirico commented Aug 29, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @MichaelChirico and the rest of your teammates on Graphite Graphite

add cbind by reference, timing

R prototype of mergelist

wording

use lower overhead funs

stick to int32 for now, correct R_alloc

bmerge C refactor for codecov and one loop for speed

address revealed codecov gaps

refactor vecseq for codecov

seqexp helper, some alloccol export on C

bmerge codecov, types handled in R bmerge already

better comment seqexp

bmerge mult=error #655

multiple new C utils

swap if branches

explain new C utils

comments mostly

reduce conflicts to PR #4386

comment C code

address multiple matches during update-on-join #3747

Revert "address multiple matches during update-on-join #3747"

This reverts commit b64c0c3.

merge.dt has temporarily mult arg, for testing

minor changes to cbindlist c

dev mergelist, for single pair now

add quiet option to cc()

mergelist tests

add check for names to perhaps.dt

rm mult from merge.dt method

rework, clean, polish multer, fix righ and full joins

make full join symmetric

mergepair inner function to loop on

extra check for symmetric

mergelist manual

ensure no df-dt passed where list expected

comments and manual

handle 0 cols tables

more tests

more tests and debugging

move more logic closer to bmerge, simplify mergepair

more tests

revert not used changes

reduce not needed checks, cleanup

copy arg behavior, manual, no tests yet

cbindlist manual, export both

cleanup processing bmerge to dtmatch

test function match order for easier preview

vecseq gets short-circuit

batch test allow browser

big cleanup

remmove unneeded stuff, reduce diff

more cleanup, minor manual fixes

add proper test scripts

Merge branch 'master' into cbind-merge-list

comment out not used code for coverage

more tests, some nocopy opts

rename sql test script, should fix codecov

simplify dtmatch inner branch

more precise copy, now copy only T or F

unused arg not yet in api, wording

comments and refer issues

codecov

hasindex coverage

codecov gap

tests for join using key, cols argument

fix missing import forderv

more tests, improve missing on handling

more tests for order of inner and full join for long keys

new allow.cartesian option, #4383, #914

reduce diff, improve codecov

reduce diff, comments

need more DT, not lists, mergelist 3+ tbls

proper escape heavy check

unit tests

more tests, address overalloc failure

mergelist and cbindlist retain index

manual, examples

fix manual

minor clarify in manual

retain keys, right outer join for snowflake schema joins

duplicates in cbindlist

recycling in cbindlist

escape 0 input in copyCols

empty input handling

closing cbindlist

vectorized _on_ and _join.many_ arg

rename dtmatch to dtmerge

vectorized args: how, mult
push down input validation
add support for cross join, semi join, anti join

full join, reduce overhead for mult=error

mult default value dynamic

fix manual

add "see details" to Rd

mention shared on in arg description

amend feedback from Michael

semi and anti joins will not reorder x columns

Merge branch 'master' into cbind-merge-list

spelling, thx to @jan-glx

check all new funs used and add comments

bugfix, sort=T needed for now

Merge branch 'master' into cbind-merge-list

Update NEWS.md

Merge branch 'master' into cbind-merge-list

Merge branch 'master' into cbind-merge-list

NEWS placement

numbering

ascArg->order

Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list

attempt to restore from master

Update to stopf() error style

Need isFrame for now

More quality checks: any(!x)->!all(x); use vapply_1{b,c,i}

really restore from master

try to PROTECT() before duplicate()

update error message in test

appease the rchk gods

extraneous space

missing ';'

use catf

simplify perhapsDataTableR

move sqlite.Rraw.manual into other.Rraw

simplify for loop

Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list
@MichaelChirico MichaelChirico changed the title cbindlist [cbindlist/mergelist] New seqexp helper Aug 29, 2024
@MichaelChirico
Copy link
Member Author

@jangorecki AFAICT this is just rep(seq_along(x), x), do we actually need this?

@MichaelChirico MichaelChirico added the graphite-ready PRs that are managed by Graphite, possibly marked as draft, but ready for review label Aug 29, 2024
Copy link

github-actions bot commented Aug 29, 2024

Comparison Plot

Generated via commit f1ee884

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 11 minutes and 36 seconds

Time taken to run atime::atime_pkg on the tests: 5 minutes and 55 seconds

@jangorecki
Copy link
Member

Don't remember now but comments says something else

@MichaelChirico
Copy link
Member Author

Using what's described in the comment:

seqexp = function(x) unlist(lapply(seq_along(x), function(i) rep(i, x[[i]])))
baseonly = function(x) rep(seq_along(x), x)

for (ii in 1e4) {
  input = sample(0:10, sample(10), TRUE)
  stopifnot(identical(seqexp(input), baseonly(input)))
}

So I think we can just use rep(seq_along(x), x) and drop this PR.

@jangorecki
Copy link
Member

If it doesn't sacrifice performance then make sense

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Aug 30, 2024

I think your fully custom function is bound to be a bit faster -- I am seeing ~2x speedup vs. the rep() approach. But both are leagues faster (50 or 100x) than the original suggested unlist(lapply(...)) approach:

slow_seqexp = function(x) unlist(lapply(seq_along(x), function(i) rep(i, x[[i]])))
rep_seqexp = function(x) rep(seq_along(x), x)
seqexp = function(x) .Call(Cseqexp, x)

library(microbenchmark)
for (n in 10^(0:7)) {
  x = sample(0:9, n, TRUE)
  message("length(x)=", n)
  print(microbenchmark(times = 50L,
    slow_seqexp(x),
    rep_seqexp(x),
    seqexp(x)
  ))
}

Output:

length(x)=1
Unit: nanoseconds
           expr  min   lq    mean median   uq   max neval
 slow_seqexp(x) 2785 2925 5158.56   3086 5380 55624    50
  rep_seqexp(x)  470  530 1013.98    611  912  8286    50
      seqexp(x)  361  400  867.44    481  591 16491    50
length(x)=10
Unit: nanoseconds
           expr   min    lq     mean  median    uq   max neval
 slow_seqexp(x) 10029 10310 11207.58 10535.0 10800 36198    50
  rep_seqexp(x)   852  1022  1319.74  1097.0  1262  9969    50
      seqexp(x)   441   571  1238.62   671.5   791 18615    50
length(x)=100
Unit: nanoseconds
           expr   min    lq     mean  median    uq    max neval
 slow_seqexp(x) 80481 81802 83225.96 82344.0 83146 104756    50
  rep_seqexp(x)  2846  3166  4221.00  3451.0  3727  36428    50
      seqexp(x)   741   922  1228.06  1071.5  1272   7243    50
length(x)=1000
Unit: microseconds
           expr     min      lq      mean   median      uq      max neval
 slow_seqexp(x) 802.197 841.701 954.25018 868.2700 920.979 2569.766    50
  rep_seqexp(x)  22.803  23.765  28.58256  24.6160  27.992  120.465    50
      seqexp(x)   3.246   3.436   5.82878   3.8175   4.649   41.337    50
length(x)=10000
Unit: microseconds
           expr      min       lq        mean    median        uq       max neval
 slow_seqexp(x) 8350.744 8811.453 10005.48454 9305.6555 11389.805 13688.665    50
  rep_seqexp(x)  269.052  270.976   278.48738  274.4725   278.309   399.265    50
      seqexp(x)   56.044   74.980    87.70662   88.2995    99.846   116.808    50
length(x)=1e+05
Unit: microseconds
           expr        min         lq       mean     median         uq        max neval
 slow_seqexp(x) 101143.434 125845.171 152667.150 146076.487 168876.313 306177.552    50
  rep_seqexp(x)   2672.197   2705.428   3196.966   3375.609   3408.150   5426.176    50
      seqexp(x)    946.787    988.405   1316.806   1053.977   1650.169   2147.478    50
length(x)=1e+06
Unit: milliseconds
           expr         min         lq      mean     median         uq       max neval
 slow_seqexp(x) 1096.518670 1319.06547 1386.3892 1366.99251 1453.26510 1817.7871    50
  rep_seqexp(x)   26.939532   29.19514   41.5856   33.94697   38.86142  141.5816    50
      seqexp(x)    9.680785   12.03521   26.4762   16.21189   21.59973  141.1328    50
1e+07
Unit: milliseconds
           expr         min         lq       mean     median         uq        max neval
 slow_seqexp(x) 11887.56058 13621.8274 15053.2152 14740.4736 16888.6295 18849.8520    50
  rep_seqexp(x)   268.52191   296.4584   470.7072   348.7749   459.7034  1117.5036    50
      seqexp(x)    96.75193   130.7713   323.5593   173.1193   327.9886   945.9111    50

Even at an output of 45,000,000 elements, the difference is <.2 seconds. I think we can drop this helper.

It's also conceivable in the future base R would use ALTREP to speed this up -- we could do the same but probably better to have such logic live in base.

@MichaelChirico MichaelChirico deleted the cbind-merge-list-seqexp branch August 30, 2024 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
graphite-ready PRs that are managed by Graphite, possibly marked as draft, but ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants