Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add append!(::StructVector, iterator::Any) using Tables.isrowtable #117

Merged
merged 3 commits into from
Feb 14, 2020

Conversation

tkf
Copy link
Member

@tkf tkf commented Feb 13, 2020

This PR implements:

  1. A generic append! so that any "row iterator" can be appended to a struct array.
  2. An optimized path based on Tables.isrowtable.

This is the same API implemented in JuliaData/TypedTables.jl#57. Since this is a PR on top of #116, I'm posting it as a draft.

I'll add some tests if adding this API is OK.

@codecov-io
Copy link

codecov-io commented Feb 13, 2020

Codecov Report

Merging #117 into master will increase coverage by 0.66%.
The diff coverage is 64.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
+ Coverage   93.87%   94.54%   +0.66%     
==========================================
  Files          10       10              
  Lines         392      403      +11     
==========================================
+ Hits          368      381      +13     
+ Misses         24       22       -2
Impacted Files Coverage Δ
src/utils.jl 93.65% <57.14%> (-1%) ⬇️
src/tables.jl 84.61% <71.42%> (+17.94%) ⬆️
src/lazy.jl 100% <0%> (+7.69%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae6a52b...8f1ce3f. Read the comment docs.

@piever
Copy link
Collaborator

piever commented Feb 13, 2020

Thanks! Yes, the API looks good to me, and this is actually very useful, for example to improve IndexedTables.join implementation.

src/tables.jl Show resolved Hide resolved
src/tables.jl Outdated
# Input `rows` is a container of rows _and_ satisfies column
# table interface. Thus, we can add the input column-by-column.
table = Tables.columns(rows)
nt = foldl(Tables.columnnames(table); init = NamedTuple()) do nt, name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that this could be a bit heavy on the compiler. Maybe, when iterating pairs name => Tables.getcolumn(table, name), one could just do append!(getproperty(s, name), Tables.getcolumn(table, name)), without having to generate all the intermediate named tuples.

It'd be also good to check that this works for StructArray{<:Tuple}, I'm not completely sure how Tables.jl works in the case of integer names (rather than symbol names).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integer column names should be fully supported

Copy link
Member Author

@tkf tkf Feb 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test like this:

# Testing integer column "names":
@test invoke(append!, Tuple{StructVector,Any}, StructArray(([0],)), StructArray(([1],))) ==
StructArray(([0, 1],))

I used invoke as I don't know if there is a pre-defined tables with integer column names in Tables.jl. I tried Tables.table(([1, 2], [3, 4])) and Tables.table([1 2; 3 4]) but they didn't work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t = Tables.table([1 2; 3 4]) didn't work in the sense that propertynames(t) is [:Column1, :Column2] rather than [1, 2] (even though getproperty(t, 1) does work).

Since Tables.getcolumn does not have something equivalent to get(container, key, default), there is no exception-safe way to append tables column-by-column, other than checking Tables.columnnames.

@tkf tkf marked this pull request as ready for review February 14, 2020 03:15
@tkf
Copy link
Member Author

tkf commented Feb 14, 2020

@piever Thanks for having look at it. I added some tests and tried to reflect your comments.

table = Tables.columns(rows)
isempty(_setdiff(propertynames(s), Tables.columnnames(rows))) ||
_invalid_columns_error(s, rows)
_foreach(propertynames(s)) do name
Copy link
Collaborator

@piever piever Feb 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally for this StructArrays uses foreachfield, but that assumes that also rows implements getproperty for column access, which in the latest Tables doesn't need to be the case. What I don't understand is: how is foldl better than foreach for tuples? Is it as efficient as the generated function approach?

Copy link
Collaborator

@piever piever Feb 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By looking at the @code_llvm or @code_native for mapfoldl on tuples, it seems to generate the optimal thing, I'm just a bit confused as to where is the special method for tuples. Nevermind, I now see that it dispatches to Base.afoldl.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for merging this. Maybe you've already figured it out why, but the reason why I didn't use foreachfield there was that AFAICT foreachfield does not pass property names to the function.

@piever piever merged commit 00e42b1 into JuliaArrays:master Feb 14, 2020
@tkf tkf deleted the append branch February 14, 2020 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants