Commit

ARROW-10624: [R] Proactively remove "problems" attributes
Closes apache#9092 from jonkeane/r_attr

Authored-by: Jonathan Keane <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
jonkeane authored and nealrichardson committed Jan 4, 2021
1 parent be4cb61 commit e306c35
Showing 3 changed files with 32 additions and 3 deletions.
6 changes: 4 additions & 2 deletions r/NEWS.md
@@ -29,17 +29,19 @@

## Enhancements

-* Table columns can now be added, replaced, or removed by assigning `<-` with either `$` or `[[`
+* Table columns can now be added, replaced, or removed by assigning (`<-`) with either `$` or `[[`
* Column names of Tables and RecordBatches can be renamed by assigning `names()`
* Large string types can now be written to Parquet files
-* The [pronouns `.data` and `.env`](https://rlang.r-lib.org/reference/tidyeval-data.html) are now fully supported in Arrow-dplyr pipelines.
+* The [pronouns `.data` and `.env`](https://rlang.r-lib.org/reference/tidyeval-data.html) are now fully supported in Arrow `dplyr` pipelines.
* Option `arrow.skip_nul` (default `FALSE`, as in `base::scan()`) allows conversion of Arrow string (`utf8()`) type data containing embedded nul `\0` characters to R. If set to `TRUE`, nuls will be stripped and a warning is emitted if any are found.

## Bug fixes

* Fixed a performance regression in converting Arrow string types to R that was present in the 2.0.0 release
* C++ functions now trigger garbage collection when needed
* `write_parquet()` can now write RecordBatches
* Reading a Table from a RecordBatchStreamReader containing 0 batches no longer crashes
+* `readr`'s `problems` attribute is removed when converting to Arrow RecordBatch and table to prevent large amounts of metadata from accumulating inadvertently [ARROW-10624](https://issues.apache.org/jira/browse/ARROW-10624)

## Packaging and installation

5 changes: 5 additions & 0 deletions r/R/record-batch.R
@@ -274,6 +274,11 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, optional = FALSE, ...
 }

 .serialize_arrow_r_metadata <- function(x) {
+  assert_is(x, "list")
+
+  # drop problems attributes (most likely from readr)
+  x[["attributes"]][["problems"]] <- NULL
+
   rawToChar(serialize(x, NULL, ascii = TRUE))
 }

24 changes: 23 additions & 1 deletion r/tests/testthat/test-metadata.R
@@ -73,7 +73,9 @@ test_that("Garbage R metadata doesn't break things", {
     "Invalid metadata$r",
     fixed = TRUE
   )
-  tab$metadata$r <- .serialize_arrow_r_metadata("garbage")
+  # serialize data like .serialize_arrow_r_metadata does, but don't call that
+  # directly since it checks to ensure that the data is a list
+  tab$metadata$r <- rawToChar(serialize("garbage", NULL, ascii = TRUE))
   expect_warning(
     expect_identical(as.data.frame(tab), example_data[1:6]),
     "Invalid metadata$r",
@@ -134,3 +136,23 @@ test_that("metadata keeps attribute of top level data frame", {
   expect_identical(attr(as.data.frame(tab), "foo"), "bar")
   expect_identical(as.data.frame(tab), df)
 })
+
+test_that("metadata drops readr's problems attribute", {
+  readr_like <- tibble::tibble(
+    dbl = 1.1,
+    not_here = NA_character_
+  )
+  attributes(readr_like) <- append(
+    attributes(readr_like),
+    list(problems = tibble::tibble(
+      row = 1L,
+      col = NA_character_,
+      expected = "2 columns",
+      actual = "1 columns",
+      file = "'test'"
+    ))
+  )
+
+  tab <- Table$create(readr_like)
+  expect_null(attr(as.data.frame(tab), "problems"))
+})
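
For readers who want to try the mechanism without installing arrow, here is a minimal base-R sketch of what the patched `.serialize_arrow_r_metadata()` appears to do: assigning `NULL` to a list element removes it before the list is serialized to an ASCII string. The names `serialize_without_problems` and `meta` are hypothetical, the metadata shape is made up for illustration, and `stopifnot(is.list(x))` stands in for the package's `assert_is()` check. This is not the package's actual code.

```r
# Hypothetical stand-in for the patched helper; uses base R only.
serialize_without_problems <- function(x) {
  # Assumption: a plain is.list() check stands in for assert_is(x, "list")
  stopifnot(is.list(x))

  # Assigning NULL to a list element removes that element
  x[["attributes"]][["problems"]] <- NULL

  # Same serialization call as in the diff: ASCII bytes converted to a string
  rawToChar(serialize(x, NULL, ascii = TRUE))
}

# Made-up metadata with a "problems" entry next to another attribute
meta <- list(
  attributes = list(
    foo = "bar",
    problems = data.frame(
      row = 1L,
      expected = "2 columns",
      actual = "1 columns"
    )
  )
)

# Round-trip: the "problems" entry is gone, the other attribute survives
restored <- unserialize(charToRaw(serialize_without_problems(meta)))
print(names(restored$attributes))
```

Because the entry is dropped before serialization, the stored string never contains the `problems` data, which matches the NEWS entry's stated goal of keeping metadata from accumulating.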
