-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45300: [R] Remove data.table from class attribute in metadata #45346
base: main
Are you sure you want to change the base?
Conversation
|
@github-actions crossbow submit test-r-extra-packages |
Revision: 9827ff1 Submitted crossbow builds: ursacomputing/crossbow @ actions-e869f440e0
|
I haven't read the SO post yet, but FWIW, we added some extra tests confirming that generally speaking data.table objects do roundtrip. #43634 has some more details, but IIUC the pointed not being roundtripped is ok and data.table should recreate it when it needs to. |
I guess an alternative to dropping the data.table class, if we wanted to round-trip data.table objects, would be when applying R metadata on read, if class contains data.table, call |
Thanks! I'll take a look at the setDT idea later today. I'm also considering whether it would work to just drop the |
I thought we already did--it's an |
IIUC data.table self-heals in situations like these. You might have a performance hit the first time you do something when it recreates the pointer, but that's expected with any serialization of data.table (it's not unique to arrow or parquet). @TysonStanley might be able to correct me if I'm off here (or advise if there's a better way for this to happen) |
You're both right about the behavior. The > library(data.table)
> dt_in <- data.table(x = 1:10)
> attributes(dt_in)
$names
[1] "x"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: 0x13b821ae0>
> arrow::write_parquet(dt_in, "test.parquet")
> dt_out <- read_parquet("test.parquet")
> attributes(dt_out)
$names
[1] "x"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
> setDT(dt_out)
> attributes(dt_out)
$names
[1] "x"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: 0x13b821ae0> It doesn't make sense to me how the Edit: Reading the SO post now. |
Yeah, this is even true if you don't attach dt_in <- data.table::data.table(x = 1:10)
attributes(dt_in)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x141813ee0>
arrow::write_parquet(dt_in, "test.parquet")
dt_out <- arrow::read_parquet("test.parquet")
attributes(dt_out)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
data.table::setDT(dt_out)
attributes(dt_out)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x141813ee0>
dt_in[, y := 10:1]
attributes(dt_in)
#> $names
#> [1] "x" "y"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x141813ee0>
attributes(dt_out)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x141813ee0>
print(dt_in)
#> x y
#> <int> <int>
#> 1: 1 10
#> 2: 2 9
#> 3: 3 8
#> 4: 4 7
#> 5: 5 6
#> 6: 6 5
#> 7: 7 4
#> 8: 8 3
#> 9: 9 2
#> 10: 10 1
print(dt_out)
#> x
#> <int>
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
#> 6: 6
#> 7: 7
#> 8: 8
#> 9: 9
#> 10: 10 |
Nothing unexpected. It will even create the same library(data.table)
dt = data.table(x=1:3)
attributes(dt)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x5640ce7bc310>
dt2 = data.table(x=1:3)
attributes(dt2)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x5640ce7bc310> But you can always check that they are not the same by checking |
I thought that's what we're already doing, so maybe I misunderstand the original report. |
I consider the initial report as WAI, although our warning could be better since we also warn about the serializing case when |
Revert "Remove data.table from class attribute" This reverts commit 9827ff1. Use setDT when applying metadata
aa03eec
to
459bde8
Compare
We can silence the warning if we do what @nealrichardson suggested in #45346 (comment) though I'm not sure whether we want to or not. I pushed up 459bde8 for discussion. |
Thanks for all the info @TysonStanley and @ben-schwen
nods yeah, this sounds good to me. I'm happy to help with a PR for improving that warning in
Thanks for pushing this up for discussion (and extra thanks for jumping on this overall), IMHO we probably don't want to do this. If I am understanding correctly, there will still be a (small) performance hit when the memory is allocated and we should defer to It sounds like we will likely close this PR, but I can go back to the issue and follow up with a TL;DR there explaining what we learned here (and closing that too since it's working as intended). I'm a little surprised in the original issue they mention this (seeming to) work in older versions of Arrow. There's no way it actually did serialize the pointer and bring it back faithfully, but maybe something in #41969 changed the serialization slightly such that |
Rationale for this change
Fixes #45300.
What changes are included in this PR?
Updated check in
arrow_attributes
in metadata.R.Are these changes tested?
Yes.
Are there any user-facing changes?
Users may be depending on data.tables round-tripping so this may be considered a very minor breaking change.