Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ResultSet and Metadata implementation #36

Merged
merged 30 commits into from
Mar 7, 2018
Merged

ResultSet and Metadata implementation #36

merged 30 commits into from
Mar 7, 2018

Conversation

tk3369
Copy link
Owner

@tk3369 tk3369 commented Feb 24, 2018

For #29, #30 and #31

@coveralls
Copy link

coveralls commented Feb 24, 2018

Coverage Status

Coverage increased (+0.4%) to 95.424% when pulling 26e6917 on tk/objects into 924839c on master.

@codecov
Copy link

codecov bot commented Feb 24, 2018

Codecov Report

Merging #36 into master will increase coverage by 0.44%.
The diff coverage is 93.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #36      +/-   ##
==========================================
+ Coverage   94.97%   95.42%   +0.44%     
==========================================
  Files           4        7       +3     
  Lines         697      743      +46     
==========================================
+ Hits          662      709      +47     
+ Misses         35       34       -1
Impacted Files Coverage Δ
src/ObjectPool.jl 100% <100%> (ø) ⬆️
src/Types.jl 100% <100%> (ø)
src/Metadata.jl 100% <100%> (ø)
src/SASLib.jl 94.75% <66.66%> (+0.13%) ⬆️
src/ResultSet.jl 97.67% <97.67%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 924839c...26e6917. Read the comment docs.

…ullresult funciton, commented out experiemental code for IterableTables, updated unit test to include iteration.
Copy link
Contributor

@tbeason tbeason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments

README.md Outdated

It's OK to access the fields directly.
```julia
julia> fieldnames(md)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to be compatible with 0.7 and onward, this line should read fieldnames(typeof(md)). There was a discussion on Slack or Discourse, can't remember which.

README.md Outdated

```julia
julia> metadata(handler)
SASLib.Metadata("productsales.sas7bdat", "US-ASCII", :LittleEndian, :none, 8192, 18, 1440, 10, Pair{Symbol,DataType}[:ACTUAL=>Float64, :PREDICT=>Float64, :COUNTRY=>String, :REGION=>String, :DIVISION=>String, :PRODTYPE=>String, :PRODUCT=>String, :QUARTER=>Float64, :YEAR=>Float64, :MONTH=>Float64])
```

## Why another package?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, this package has matured quite a bit and keeps getting better. Its functionality is moving beyond just a port of some Python code. At this point, you might not need this section anymore. This isn't just a package you made for your own purposes, this is becoming the best way to use SAS files in Julia. I'd suggest merging this section and the one below it into 1 section titled Notes and just have 1 sentence that says something to the effect of "The core functionality of this package borrowed inspiration from Pandas' read_sas and ReadStat.jl."

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree 💯

@tk3369
Copy link
Owner Author

tk3369 commented Feb 24, 2018

@tbeason do you have time to beta test this? I would like to know how it well it works and if there's any other related change requests.

@tbeason
Copy link
Contributor

tbeason commented Feb 24, 2018

I can test it out on Monday. You're working at lightning speed!

@tk3369
Copy link
Owner Author

tk3369 commented Feb 24, 2018

sounds good! enjoy your weekend!

@tbeason
Copy link
Contributor

tbeason commented Feb 27, 2018

This seems to be working well! I like these changes. I tested on v0.6.2 and on v0.7. It works on both.

The ResultSet looks good. Subsetting with rs[1:3] and rs[3,3] worked but rs[1:3,1:3] and rs[1:3,2] did not, likely just a matter of defining additional indexing methods I think. I'd also like to point out that your convention of selecting rows with rs[1:3] is different than the behavior of DataFrames (and most other table packages IIRC) where rs[1:3] would select the first 3 columns.

Printing-wise, the entire floating point number prints. For display purposes, might be good to truncate after so many decimal places.

I'd change the show function for the metadata to be a bit more how it was before. Perhaps something like a compressed vertical Dict? As it is now, the display isn't very useful. You have to access the fields directly to really see anything.

Read performance seems to have increased quite a bit as well, but I didn't do any formal benchmarks. Still seeing improvements on v0.7 relative to v0.6.2 as well.

Would perhaps be nice to define a convenience constructor DataFrame(rs::SASLib.ResultSet) = DataFrame(columns(rs),names(rs)), but I'm uncertain if it can live in this package without REQUIRE'ing DataFrames.jl. If it creates a package dependency, it probably isn't worth it.

@davidanthoff
Copy link
Contributor

Would perhaps be nice to define a convenience constructor DataFrame(rs::SASLib.ResultSet) = DataFrame(columns(rs),names(rs)), but I'm uncertain if it can live in this package without REQUIRE'ing DataFrames.jl. If it creates a package dependency, it probably isn't worth it.

Yes, that would require a dependency on DataFrames.jl. Once this package here is more stable I plan to switch StatFiles.jl over to use it, and then you should be able to just do DataFrame(load(filename)) (and the same pattern will work for pretty much any table type).

@tk3369
Copy link
Owner Author

tk3369 commented Feb 27, 2018

Hi @tbeason, great feedbacks!

Subsetting with rs[1:3] and rs[3,3] worked but rs[1:3,1:3] and rs[1:3,2] did not, likely just a matter of defining additional indexing methods I think.

I wanted to draw a line somewhere such that I don't end up turning ResultSet into a DataFrame. So the idea was to provide only some simple indexing methods - which covers individual cells rs[i,j], individual rows rs[i], individual columns rs[:c], rows subset rs[m:n] and columns subset rs[:c1, :c2, ...]. To support generic rs[x,y] where x is a type of [Int, UnitRange, Colon] and y is a type of [Int, UnitRange, Symbol, NTuple{N,Symbol}, Colon], we will need 15 indexing functions. Right now, there are only two - Int,Int and Int,Symbol. Though a bit unnatural, you could work around by indexing twice - horizontally and vertically or vice versa. The question is - is it worth creating all these indexing functions? (or the user may as well turn it into a DataFrame and work from there.)

julia> rs[1:2,:Column3]
ERROR: MethodError: no method matching getindex(::SASLib.ResultSet, ::UnitRange{Int64}, ::Symbol)

julia> rs[1:2][:Column3]
2-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}:
 0.40982 
 0.928673

I'd also like to point out that your convention of selecting rows with rs[1:3] is different than the behavior of DataFrames (and most other table packages IIRC) where rs[1:3] would select the first 3 columns.

This is an interesting observation. The current design uses numbers for row indexing and symbols for column indexing. Hence it's unambiguous and that's also why rs[1:3] selects rows not columns. Looks like DataFrames.jl mimics R's data.frame behavior, which to me is somewhat confusing...

  1. data.frame - df[1:3] selects first 3 columns whereas df[1:3,] selects first 3 rows
  2. data.table - dt[1:3] selects first 3 rows whereas dt[,1:3] selects first 3 columns
  3. pandas - df.iloc[0:3] selects first 3 rows whereas df.iloc[:,0:3] selects first 3 columns
  4. SASLib ResultSet - rs[1:3] select first 3 rows whereas rs[:c1, :c2, :c3] selects first 3 columns

Using symbols for column selection may sound a bit inflexible but can be easily worked around by mapping and splatting it inside:

julia> rs[names(rs)[1:3]...]
SASLib.ResultSet (10 rows x 3 columns)
Columns 1:Column1, 2:Column2, 3:Column3
1: 0.636, pear, 0.4098203851193005
2: 0.283, dog, 0.9286730337856774
3: 0.452, pear, 0.6644174736708597

Printing-wise, the entire floating point number prints. For display purposes, might be good to truncate after so many decimal places.

This is a standard behavior with tuples. Not sure how I can change that without implementing a new display. For example:

julia> x = (rand(), 1.0 ,2)
(0.19858068095808146, 1.0, 2)

I'd change the show function for the metadata to be a bit more how it was before. Perhaps something like a compressed vertical Dict? As it is now, the display isn't very useful. You have to access the fields directly to really see anything.

Agree. I can add a show function. Note that it would display a subset of columns (maybe 10) if there are too many of them.

@tk3369
Copy link
Owner Author

tk3369 commented Feb 27, 2018

I would rather not make the package depend on DataFrames.jl. Seems overkill for just having a constructor that saves few keystrokes. Doing DataFrame(columns(rs),names(rs)) isn't that bad.

If we implement IterableTables here then we could also do DataFrame(rs), right? I actually tried to put that on (same branch) but I commented out because it fails on 0.7.

@tbeason
Copy link
Contributor

tbeason commented Feb 27, 2018

As far as subsetting/indexing, I see your point. I suppose another way to view the current indexing is to think about it as iterating through rows just like SAS does. And ultimately, I think most analysis will probably be done by converting the ResultSet to DataFrame first. If that's how you view it as well, then this package should provide just enough functionality to make filtering/iterating/etc. possible on the ResultSet. I think that box is checked.

@davidanthoff is obviously the one to comment on IterableTables and the like. My opinion is that more interoperability is better.

@tk3369
Copy link
Owner Author

tk3369 commented Feb 28, 2018

That's right. The user is encouraged to convert it to DataFrame for more complex queries/manipulation.

@tbeason
Copy link
Contributor

tbeason commented Feb 28, 2018

Also, these are some fairly breaking changes. I am not super familiar with depwarn and @deprecated (unsure if those are even the right things to use), but it could be useful to keep around the old functionality for a release or two and just display deprecation warnings for it.

Another approach is to do what ShiftedArrays.jl is doing right now. They print a message during precompilation to tell you that functionality has changed.

@tk3369
Copy link
Owner Author

tk3369 commented Mar 1, 2018

Good point! I don't know how deprecation would work in this scenario though since method signatures are not changed but rather the return type is different. As the package is relatively new, I'm hoping the impact (in terms of number people affected by this) isn't too big. I would rather not keep old behavior around -- if that's what a user want, just check out the older version from github.

The way that ShiftedArrays displays the warning is quite brilliant. That looks like an easy change. In addition, I can put up notice near the top of README.

@tk3369 tk3369 merged commit ab72196 into master Mar 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants