Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support appending arbitrary container #57

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

tkf
Copy link

@tkf tkf commented Dec 15, 2019

This PR implements append!(::Table, source) for arbitrary source table types:

julia> append!(Table(a=[1], b=[2]), (a=[3], b=[4]))
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 11  2
 23  4

julia> append!(Table(a=[1], b=[2]), [(a=3, b=4)])
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 11  2
 23  4

julia> append!(Table(a=[1], b=[2]), DataFrame(a=[3], b=[4]))
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 11  2
 23  4

julia> append!(Table(a=[1], b=[2]), StructVector((a=[3], b=[4])))
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 11  2
 23  4

@codecov-io
Copy link

codecov-io commented Dec 15, 2019

Codecov Report

Merging #57 into master will increase coverage by 0.12%.
The diff coverage is 72.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   67.06%   67.19%   +0.12%     
==========================================
  Files           6        6              
  Lines         498      509      +11     
==========================================
+ Hits          334      342       +8     
- Misses        164      167       +3
Impacted Files Coverage Δ
src/Table.jl 75.67% <72.72%> (-0.33%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc1247a...6c09ca2. Read the comment docs.

Copy link
Member

@andyferris andyferris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, this seems super useful!

Now, I maybe overthinking things, but I’m deep in thought about insertable and ordered containers in Dictionaries.jl and I’m slightly worried we might be complecting the interface of appendable containers (or, ordered insertable containers) with the duality between column and row access tables. I’m pushing towards stronger interfaces with guarantees needed for writing robust generic code, which provides boundaries for what generic functions like append! cannot do. There are certain relationships and identities that should hold between generic functions, and in this case we are speaking of appending iterable containers.

An example of the kind of identity I might keep in mind is that

length(b) == 1 && append!(a, b)

might be thought of as equivalent to

push!(a, only(b))

Basically if we could limit this to table containers b where first(b) is a row, then everything follows naturally from the container properties. Appending a column-access table to another should concatenate the columns in my opinion (and it should be easy to switch between the row and column based iteration orders).

What are your thoughts?

@tkf
Copy link
Author

tkf commented Dec 15, 2019

That's an interesting point. I think I agree. But a pain point is that there is no easy way to query if iterate(table) is row-wise or not. We can use Tables.rows(table) === table although I'm not sure if the compiler can always const-prop this (also, it'd be bad if Tables.rows is not O(1)). Anyway, maybe something like this?:

isrowiterator(t) =
    Tables.istable(t) && Tables.rowaccess(t) && Tables.rows(t) === t

function Base.append!(t::Table, t2)
    if isrowiterator(t2) && Tables.columnaccess(t2)
        return append_columnaccess!(t, t2)
    end
    return append_rows!(t, t2)
end

append_row!(t::Table, rows) =  # `rows` is a generic iterator
    mapfoldl(_asnamedtuple(NamedTuple{columnnames(t)}), push!, rows; init = t)

It's nice that now it naturally supports iterators with unknown eltype.

(BTW, it would be nice to add something like isrowiterator to Tables.jl interface if we do this here.)

With this, I think append!(Table(a=[1], b=[2]), Tables.rows((a=[3], b=[4]))) should still work and it's efficient as it uses append_columnaccess! path.

@tkf tkf changed the title Support appending arbitrary table Support appending arbitrary container Dec 16, 2019
@tkf
Copy link
Author

tkf commented Dec 16, 2019

append!(Table(a=[1], b=[2]), Tables.rows((a=[3], b=[4]))) should still work and it's efficient as it uses append_columnaccess! path

This already works. But we need JuliaData/Tables.jl#126 for the optimization (column-oriented memory access).

@quinnj
Copy link
Member

quinnj commented Dec 16, 2019

I'm not sure I quite understand what append_columnaccess! means in:

 if isrowiterator(t2) && Tables.columnaccess(t2)
    return append_columnaccess!(t, t2)
end

is it hcat? or still vcat, but column-by-column?

Also note that:

Tables.istable(t) && Tables.rowaccess(t) && Tables.rows(t) === t

doesn't return true for all valid row-iterators. For example, Tables.istable(t) returns false for a Generator of NamedTuples, even though it's a perfectly valid "row table", and works fine when you call Tables.rows on it.

@tkf
Copy link
Author

tkf commented Dec 16, 2019

still vcat, but column-by-column?

Yes, vcat done column-by-column as an optimization.

Also note that:

Tables.istable(t) && Tables.rowaccess(t) && Tables.rows(t) === t

doesn't return true for all valid row-iterators.

As append_columnaccess! is just an optimization, it is OK for isrowiterator to have false negative.

For example, Tables.istable(t) returns false for a Generator of NamedTuples, even though it's a perfectly valid "row table", and works fine when you call Tables.rows on it.

They hit append_rows! branch so it's fine.

@quinnj
Copy link
Member

quinnj commented Dec 16, 2019

Yes, vcat done column-by-column as an optimization.

So why not just always call Tables.columns and append the result columns one by one anyway? Any time you're switching based on rowaccess/columnaccess, that's a bit of a code-smell w/ the Tables.jl API; the point is to use Tables.rows or Tables.columns as is more appropriate for your use-case. For example, that's what DataFrames does.

@tkf
Copy link
Author

tkf commented Dec 16, 2019

I thought it's nice to avoid allocating unnecessary objects. If performance is not a concern, I think the cleanest approach here would be to add things row-by-row as there is already push!:

mapfoldl(_asnamedtuple(NamedTuple{columnnames(t)}), push!, rows; init = t)

@quinnj
Copy link
Member

quinnj commented Dec 16, 2019

Then I would just check Tables.columnaccess(t) and use the column-based path if so, otherwise fallback to the row-based approach. The default is Tables.columnaccess(x) = false, so it will only be true if a table has explicitly provided support for Tables.columns.

@tkf
Copy link
Author

tkf commented Dec 16, 2019

I initially had something like that. But then @andyferris pointed out that it might be better to strictly require that src in append!(::TypedTable.Table, src) to be (semantically) an iterator over rows #57 (review). Essentially this boils down to that Tables.istable(x) does not imply anything about how iterate(x) works (or not work). (Just to be clear, I'm not saying this is bad. I think Tables.jl has a slick API that demonstrates how much you can do with a flexible trait-based system in Julia.) By the way, this is why I said Tabels.isrowiterator might be a good addition to Tables.jl interface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants