ResultSet and Metadata implementation #36

tk3369 · 2018-02-24T18:32:54Z

For #29, #30 and #31

, #31)

…t info from handler.

coveralls · 2018-02-24T18:39:42Z

Coverage increased (+0.4%) to 95.424% when pulling 26e6917 on tk/objects into 924839c on master.

codecov · 2018-02-24T19:11:58Z

Codecov Report

Merging #36 into master will increase coverage by 0.44%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master      #36      +/-   ##
==========================================
+ Coverage   94.97%   95.42%   +0.44%     
==========================================
  Files           4        7       +3     
  Lines         697      743      +46     
==========================================
+ Hits          662      709      +47     
+ Misses         35       34       -1

Impacted Files	Coverage Δ
src/ObjectPool.jl	`100% <100%> (ø)`	⬆️
src/Types.jl	`100% <100%> (ø)`
src/Metadata.jl	`100% <100%> (ø)`
src/SASLib.jl	`94.75% <66.66%> (+0.13%)`	⬆️
src/ResultSet.jl	`97.67% <97.67%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 924839c...26e6917. Read the comment docs.

…ullresult funciton, commented out experiemental code for IterableTables, updated unit test to include iteration.

tbeason

Minor comments

tbeason · 2018-02-24T21:01:53Z

README.md

+
+It's OK to access the fields directly.
+```julia
+julia> fieldnames(md)


I think to be compatible with 0.7 and onward, this line should read fieldnames(typeof(md)). There was a discussion on Slack or Discourse, can't remember which.

tbeason · 2018-02-24T21:07:13Z

README.md

+
+```julia
+julia> metadata(handler)
+SASLib.Metadata("productsales.sas7bdat", "US-ASCII", :LittleEndian, :none, 8192, 18, 1440, 10, Pair{Symbol,DataType}[:ACTUAL=>Float64, :PREDICT=>Float64, :COUNTRY=>String, :REGION=>String, :DIVISION=>String, :PRODTYPE=>String, :PRODUCT=>String, :QUARTER=>Float64, :YEAR=>Float64, :MONTH=>Float64])
 ```

 ## Why another package?


Honestly, this package has matured quite a bit and keeps getting better. Its functionality is moving beyond just a port of some Python code. At this point, you might not need this section anymore. This isn't just a package you made for your own purposes, this is becoming the best way to use SAS files in Julia. I'd suggest merging this section and the one below it into 1 section titled Notes and just have 1 sentence that says something to the effect of "The core functionality of this package borrowed inspiration from Pandas' read_sas and ReadStat.jl."

…h a new Related Packages section

tk3369 · 2018-02-24T22:46:09Z

@tbeason do you have time to beta test this? I would like to know how it well it works and if there's any other related change requests.

tbeason · 2018-02-24T22:47:42Z

I can test it out on Monday. You're working at lightning speed!

tk3369 · 2018-02-24T22:56:30Z

sounds good! enjoy your weekend!

tbeason · 2018-02-27T04:14:39Z

This seems to be working well! I like these changes. I tested on v0.6.2 and on v0.7. It works on both.

The ResultSet looks good. Subsetting with rs[1:3] and rs[3,3] worked but rs[1:3,1:3] and rs[1:3,2] did not, likely just a matter of defining additional indexing methods I think. I'd also like to point out that your convention of selecting rows with rs[1:3] is different than the behavior of DataFrames (and most other table packages IIRC) where rs[1:3] would select the first 3 columns.

Printing-wise, the entire floating point number prints. For display purposes, might be good to truncate after so many decimal places.

I'd change the show function for the metadata to be a bit more how it was before. Perhaps something like a compressed vertical Dict? As it is now, the display isn't very useful. You have to access the fields directly to really see anything.

Read performance seems to have increased quite a bit as well, but I didn't do any formal benchmarks. Still seeing improvements on v0.7 relative to v0.6.2 as well.

Would perhaps be nice to define a convenience constructor DataFrame(rs::SASLib.ResultSet) = DataFrame(columns(rs),names(rs)), but I'm uncertain if it can live in this package without REQUIRE'ing DataFrames.jl. If it creates a package dependency, it probably isn't worth it.

davidanthoff · 2018-02-27T04:30:20Z

Would perhaps be nice to define a convenience constructor DataFrame(rs::SASLib.ResultSet) = DataFrame(columns(rs),names(rs)), but I'm uncertain if it can live in this package without REQUIRE'ing DataFrames.jl. If it creates a package dependency, it probably isn't worth it.

Yes, that would require a dependency on DataFrames.jl. Once this package here is more stable I plan to switch StatFiles.jl over to use it, and then you should be able to just do DataFrame(load(filename)) (and the same pattern will work for pretty much any table type).

tk3369 · 2018-02-27T07:51:48Z

Hi @tbeason, great feedbacks!

Subsetting with rs[1:3] and rs[3,3] worked but rs[1:3,1:3] and rs[1:3,2] did not, likely just a matter of defining additional indexing methods I think.

I wanted to draw a line somewhere such that I don't end up turning ResultSet into a DataFrame. So the idea was to provide only some simple indexing methods - which covers individual cells rs[i,j], individual rows rs[i], individual columns rs[:c], rows subset rs[m:n] and columns subset rs[:c1, :c2, ...]. To support generic rs[x,y] where x is a type of [Int, UnitRange, Colon] and y is a type of [Int, UnitRange, Symbol, NTuple{N,Symbol}, Colon], we will need 15 indexing functions. Right now, there are only two - Int,Int and Int,Symbol. Though a bit unnatural, you could work around by indexing twice - horizontally and vertically or vice versa. The question is - is it worth creating all these indexing functions? (or the user may as well turn it into a DataFrame and work from there.)

julia> rs[1:2,:Column3]
ERROR: MethodError: no method matching getindex(::SASLib.ResultSet, ::UnitRange{Int64}, ::Symbol)

julia> rs[1:2][:Column3]
2-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}:
 0.40982 
 0.928673

I'd also like to point out that your convention of selecting rows with rs[1:3] is different than the behavior of DataFrames (and most other table packages IIRC) where rs[1:3] would select the first 3 columns.

This is an interesting observation. The current design uses numbers for row indexing and symbols for column indexing. Hence it's unambiguous and that's also why rs[1:3] selects rows not columns. Looks like DataFrames.jl mimics R's data.frame behavior, which to me is somewhat confusing...

data.frame - df[1:3] selects first 3 columns whereas df[1:3,] selects first 3 rows
data.table - dt[1:3] selects first 3 rows whereas dt[,1:3] selects first 3 columns
pandas - df.iloc[0:3] selects first 3 rows whereas df.iloc[:,0:3] selects first 3 columns
SASLib ResultSet - rs[1:3] select first 3 rows whereas rs[:c1, :c2, :c3] selects first 3 columns

Using symbols for column selection may sound a bit inflexible but can be easily worked around by mapping and splatting it inside:

julia> rs[names(rs)[1:3]...]
SASLib.ResultSet (10 rows x 3 columns)
Columns 1:Column1, 2:Column2, 3:Column3
1: 0.636, pear, 0.4098203851193005
2: 0.283, dog, 0.9286730337856774
3: 0.452, pear, 0.6644174736708597

Printing-wise, the entire floating point number prints. For display purposes, might be good to truncate after so many decimal places.

This is a standard behavior with tuples. Not sure how I can change that without implementing a new display. For example:

julia> x = (rand(), 1.0 ,2)
(0.19858068095808146, 1.0, 2)

I'd change the show function for the metadata to be a bit more how it was before. Perhaps something like a compressed vertical Dict? As it is now, the display isn't very useful. You have to access the fields directly to really see anything.

Agree. I can add a show function. Note that it would display a subset of columns (maybe 10) if there are too many of them.

tk3369 · 2018-02-27T08:06:10Z

I would rather not make the package depend on DataFrames.jl. Seems overkill for just having a constructor that saves few keystrokes. Doing DataFrame(columns(rs),names(rs)) isn't that bad.

If we implement IterableTables here then we could also do DataFrame(rs), right? I actually tried to put that on (same branch) but I commented out because it fails on 0.7.

tbeason · 2018-02-27T21:15:08Z

As far as subsetting/indexing, I see your point. I suppose another way to view the current indexing is to think about it as iterating through rows just like SAS does. And ultimately, I think most analysis will probably be done by converting the ResultSet to DataFrame first. If that's how you view it as well, then this package should provide just enough functionality to make filtering/iterating/etc. possible on the ResultSet. I think that box is checked.

@davidanthoff is obviously the one to comment on IterableTables and the like. My opinion is that more interoperability is better.

tk3369 · 2018-02-28T07:51:07Z

That's right. The user is encouraged to convert it to DataFrame for more complex queries/manipulation.

tbeason · 2018-02-28T16:14:39Z

Also, these are some fairly breaking changes. I am not super familiar with depwarn and @deprecated (unsure if those are even the right things to use), but it could be useful to keep around the old functionality for a release or two and just display deprecation warnings for it.

Another approach is to do what ShiftedArrays.jl is doing right now. They print a message during precompilation to tell you that functionality has changed.

tk3369 · 2018-03-01T05:36:00Z

Good point! I don't know how deprecation would work in this scenario though since method signatures are not changed but rather the return type is different. As the package is relatively new, I'm hoping the impact (in terms of number people affected by this) isn't too big. I would rather not keep old behavior around -- if that's what a user want, just check out the older version from github.

The way that ShiftedArrays displays the warning is quite brilliant. That looks like an easy change. In addition, I can put up notice near the top of README.

This reverts commit 4b03b91.

tk3369 added 8 commits February 22, 2018 00:09

Fixed arg type for getindex

441807d

ResultSet/Metadata changes initial commit for beta testing (re: #29, #30

e9056f3

, #31)

added metadata function and updated docstrings

542c675

updated for metadata function and added timestamp in log

e5944d3

reworked show function for Handler; metadata should be used to extrac…

8b111c1

…t info from handler.

Updated show function

6b6e169

Updated user guide for ResultSet & Metadata

21cb884

updated contributors

f094f61

tk3369 mentioned this pull request Feb 24, 2018

Inconsistent naming of :nrows and .row_count #33

Closed

tk3369 added 2 commits February 24, 2018 11:11

removed unused compressionstring function

ce29a94

updated unit tests for Metadata and ResultSet functions

8a69d74

Added ResultSet constructors including an empty one, removed unused n…

4c03d68

…ullresult funciton, commented out experiemental code for IterableTables, updated unit test to include iteration.

tbeason reviewed Feb 24, 2018

View reviewed changes

tk3369 added 2 commits February 24, 2018 14:25

added test for metadata(filename) function

3c0666b

revised fieldnames example and replaced why/porting notes section wit…

b316601

…h a new Related Packages section

tk3369 added 3 commits February 24, 2018 22:33

removed unreachable code in readsas & minor changes in read_chunk

71d3274

verbose_level test

3fabad6

added at least one test for verbosity

c36c2c4

tk3369 mentioned this pull request Feb 27, 2018

handler.compression now returns a number #38

Closed

tk3369 added 14 commits March 1, 2018 20:29

installation notice

5d051e4

updated py/jl benchmarks

4b03b91

Revert "updated py/jl benchmarks"

e6c0188

This reverts commit 4b03b91.

updated py/jl benchmarks

0d60e46

pretty print metadata using TabularDisplay package

06d77f7

updated for metadata change

129b442

added TabularDisplay and others related to IterableTables

f5660e8

fixed metadata test

36cce4e

make sure joinpath is used rather than hardcoding with slash

e7779e0

no need to sort types

8e8e8d6

added tests for private functions in Metadata

9387ad3

consolidated Base imports

76c0bc2

minor edits

4bd90eb

added blurb about breaking changes in 0.5.0

26e6917

tk3369 merged commit ab72196 into master Mar 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ResultSet and Metadata implementation #36

ResultSet and Metadata implementation #36

tk3369 commented Feb 24, 2018

coveralls commented Feb 24, 2018 •

edited

Loading

codecov bot commented Feb 24, 2018 •

edited

Loading

tbeason left a comment

tbeason Feb 24, 2018

tbeason Feb 24, 2018

tk3369 Feb 24, 2018

tk3369 commented Feb 24, 2018

tbeason commented Feb 24, 2018

tk3369 commented Feb 24, 2018

tbeason commented Feb 27, 2018

davidanthoff commented Feb 27, 2018

tk3369 commented Feb 27, 2018

tk3369 commented Feb 27, 2018

tbeason commented Feb 27, 2018

tk3369 commented Feb 28, 2018

tbeason commented Feb 28, 2018

tk3369 commented Mar 1, 2018

ResultSet and Metadata implementation #36

ResultSet and Metadata implementation #36

Conversation

tk3369 commented Feb 24, 2018

coveralls commented Feb 24, 2018 • edited Loading

codecov bot commented Feb 24, 2018 • edited Loading

Codecov Report

tbeason left a comment

Choose a reason for hiding this comment

tbeason Feb 24, 2018

Choose a reason for hiding this comment

tbeason Feb 24, 2018

Choose a reason for hiding this comment

tk3369 Feb 24, 2018

Choose a reason for hiding this comment

tk3369 commented Feb 24, 2018

tbeason commented Feb 24, 2018

tk3369 commented Feb 24, 2018

tbeason commented Feb 27, 2018

davidanthoff commented Feb 27, 2018

tk3369 commented Feb 27, 2018

tk3369 commented Feb 27, 2018

tbeason commented Feb 27, 2018

tk3369 commented Feb 28, 2018

tbeason commented Feb 28, 2018

tk3369 commented Mar 1, 2018

coveralls commented Feb 24, 2018 •

edited

Loading

codecov bot commented Feb 24, 2018 •

edited

Loading