Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-45300: [R] Remove data.table from class attribute in metadata #45346

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

amoeba
Copy link
Member

@amoeba amoeba commented Jan 24, 2025

Rationale for this change

Fixes #45300.

What changes are included in this PR?

Updated check in arrow_attributes in metadata.R.

Are these changes tested?

Yes.

Are there any user-facing changes?

Users may be depending on data.tables round-tripping so this may be considered a very minor breaking change.

Copy link

⚠️ GitHub issue #45300 has been automatically assigned in GitHub to PR creator.

@jonkeane
Copy link
Member

@github-actions crossbow submit test-r-extra-packages

Copy link

Revision: 9827ff1

Submitted crossbow builds: ursacomputing/crossbow @ actions-e869f440e0

Task Status
test-r-extra-packages GitHub Actions

@jonkeane
Copy link
Member

I haven't read the SO post yet, but FWIW, we added some extra tests confirming that generally speaking data.table objects do roundtrip. #43634 has some more details, but IIUC the pointed not being roundtripped is ok and data.table should recreate it when it needs to.

@nealrichardson
Copy link
Member

I guess an alternative to dropping the data.table class, if we wanted to round-trip data.table objects, would be when applying R metadata on read, if class contains data.table, call setDT() on the data frame?

@amoeba
Copy link
Member Author

amoeba commented Jan 24, 2025

Thanks! I'll take a look at the setDT idea later today. I'm also considering whether it would work to just drop the .internal.selfref attribute.

@nealrichardson
Copy link
Member

nealrichardson commented Jan 24, 2025

Thanks! I'll take a look at the setDT idea later today. I'm also considering whether it would work to just drop the .internal.selfref attribute.

I thought we already did--it's an externalptr right? My understanding of the original issue was that we were dropping the pointer but not the class, so data.table was confused to see an object that declared itself to be a data.table but that didn't have the pointer attached.

@jonkeane
Copy link
Member

so data.table was confused to see an object that declared itself to be a data.table but that didn't have the pointer attached.

IIUC data.table self-heals in situations like these. You might have a performance hit the first time you do something when it recreates the pointer, but that's expected with any serialization of data.table (it's not unique to arrow or parquet). @TysonStanley might be able to correct me if I'm off here (or advise if there's a better way for this to happen)

@amoeba
Copy link
Member Author

amoeba commented Jan 24, 2025

You're both right about the behavior. The .internal.selfref attribute is removed and data.table self-heals like @jonkeane said,

> library(data.table)
> dt_in <- data.table(x = 1:10)
> attributes(dt_in)
$names
[1] "x"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: 0x13b821ae0>
> arrow::write_parquet(dt_in, "test.parquet")
> dt_out <- read_parquet("test.parquet")
> attributes(dt_out)
$names
[1] "x"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"
> setDT(dt_out)
> attributes(dt_out)
$names
[1] "x"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: 0x13b821ae0>

It doesn't make sense to me how the .internal.selfref pointer is the same on both sides though.

Edit: Reading the SO post now.

@TysonStanley
Copy link

Yeah, this is even true if you don't attach data.table but interestingly even with the same .internal.selfref modifying dt_in does not result in changes in dt_out so they are seen as different objects with the same .internal.selfref.

dt_in <- data.table::data.table(x = 1:10)
attributes(dt_in)
#> $names
#> [1] "x"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x141813ee0>
arrow::write_parquet(dt_in, "test.parquet")
dt_out <- arrow::read_parquet("test.parquet")
attributes(dt_out)
#> $names
#> [1] "x"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
data.table::setDT(dt_out)
attributes(dt_out)
#> $names
#> [1] "x"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x141813ee0>
dt_in[, y := 10:1]

attributes(dt_in)
#> $names
#> [1] "x" "y"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x141813ee0>
attributes(dt_out)
#> $names
#> [1] "x"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x141813ee0>

print(dt_in)
#>         x     y
#>     <int> <int>
#>  1:     1    10
#>  2:     2     9
#>  3:     3     8
#>  4:     4     7
#>  5:     5     6
#>  6:     6     5
#>  7:     7     4
#>  8:     8     3
#>  9:     9     2
#> 10:    10     1
print(dt_out)
#>         x
#>     <int>
#>  1:     1
#>  2:     2
#>  3:     3
#>  4:     4
#>  5:     5
#>  6:     6
#>  7:     7
#>  8:     8
#>  9:     9
#> 10:    10

@ben-schwen
Copy link

Yeah, this is even true if you don't attach data.table but interestingly even with the same .internal.selfref modifying dt_in does not result in changes in dt_out so they are seen as different objects with the same .internal.selfref.

Nothing unexpected. It will even create the same .internal.selfref for different objects.

library(data.table)

dt = data.table(x=1:3)
attributes(dt)
#> $names
#> [1] "x"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x5640ce7bc310>

dt2 = data.table(x=1:3)
attributes(dt2)
#> $names
#> [1] "x"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x5640ce7bc310>

But you can always check that they are not the same by checking address(dt) != address(dt2). "Safe option" would be not to drop the data.table class but only drop the .internal.selfref attribute, so data.table will detect that and allocate the right memory.

@nealrichardson
Copy link
Member

"Safe option" would be not to drop the data.table class but only drop the .internal.selfref attribute, so data.table will detect that and allocate the right memory.

I thought that's what we're already doing, so maybe I misunderstand the original report.

@ben-schwen
Copy link

"Safe option" would be not to drop the data.table class but only drop the .internal.selfref attribute, so data.table will detect that and allocate the right memory.

I thought that's what we're already doing, so maybe I misunderstand the original report.

I consider the initial report as WAI, although our warning could be better since we also warn about the serializing case when saveRDS and subsequent readRDS.

Revert "Remove data.table from class attribute"

This reverts commit 9827ff1.

Use setDT when applying metadata
@amoeba amoeba force-pushed the fix/GH-45300--r-data.table-metadata branch from aa03eec to 459bde8 Compare January 25, 2025 01:42
@amoeba
Copy link
Member Author

amoeba commented Jan 25, 2025

We can silence the warning if we do what @nealrichardson suggested in #45346 (comment) though I'm not sure whether we want to or not. I pushed up 459bde8 for discussion.

@jonkeane
Copy link
Member

Thanks for all the info @TysonStanley and @ben-schwen

"Safe option" would be not to drop the data.table class but only drop the .internal.selfref attribute, so data.table will detect that and allocate the right memory. ... I consider the initial report as WAI, although our warning could be better since we also warn about the serializing case when saveRDS and subsequent readRDS.

nods yeah, this sounds good to me. I'm happy to help with a PR for improving that warning in data.table. It makes a lot of sense for it to be there since it's about serializing those objects in any fashion, not just with Arrow/Parquet, but I wonder if there is anything that we should add to our documentation that calls this out as well. We could add something to https://arrow.apache.org/docs/r/articles/metadata.html#r-object-attributes which calls this case out as one of the cases where we can't preserve a portion of metadata, but the way that data.table works that's working as expected and self heals, etc.

We can silence the warning if we do what @nealrichardson suggested in #45346 (comment) though I'm not sure whether we want to or not. I pushed up 459bde8 for discussion.

Thanks for pushing this up for discussion (and extra thanks for jumping on this overall), IMHO we probably don't want to do this. If I am understanding correctly, there will still be a (small) performance hit when the memory is allocated and we should defer to data.table's self healing here since it knows best when and how (and can message appropriately that there might be a small performance degradation that is coming from the fact that any serialization happened).

It sounds like we will likely close this PR, but I can go back to the issue and follow up with a TL;DR there explaining what we learned here (and closing that too since it's working as intended). I'm a little surprised in the original issue they mention this (seeming to) work in older versions of Arrow. There's no way it actually did serialize the pointer and bring it back faithfully, but maybe something in #41969 changed the serialization slightly such that data.table is now better able to recognize and warn about the need to self-heal?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] Arrow write_parquet() breaks data.table ability to set columns by reference
5 participants