SASLib is a fast reader for sas7bdat files. The goal is to allow easier integration with SAS processes. Only sas7bdat
format is supported. SASLib is licensed under the MIT Expat license.
Pkg.add("SASLib")
I did benchmarking mostly on my Macbook Pro laptop. In general, the Julia implementation is somewhere between 10-100x faster than the Python Pandas. Test results are documented in the test/perf_results_<version>
folders.
Latest performance test results for v1.0.0 is as follows:
Test | Result |
---|---|
py_jl_homimp_50.md | 30x faster than Python/Pandas |
py_jl_numeric_1000000_2_100.md | 10x faster than Python/Pandas |
py_jl_productsales_100.md | 50x faster than Python/Pandas |
py_jl_test1_100.md | 120x faster than Python/Pandas |
py_jl_topical_30.md | 30x faster than Python/Pandas |
julia> using SASLib
Use the readsas
function to read a SAS7BDAT file.
julia> rs = readsas("productsales.sas7bdat")
Read productsales.sas7bdat with size 1440 x 10 in 0.00256 seconds
SASLib.ResultSet (1440 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
3: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
4: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
5: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
⋮
There are several ways to access the data conveniently without using any third party packages. Each cell value may be retrieved directly via the regular [i,j]
index. Accessing an entire row or column returns a tuple and a vector respectively.
julia> rs[4,2]
533.0
julia> rs[4, :PREDICT]
533.0
julia> rs[1]
(ACTUAL = 925.0, PREDICT = 850.0, COUNTRY = "CANADA", REGION = "EAST", DIVISION = "EDUCATION", PRODTYPE = "FURNITURE", PRODUCT = "SOFA", QUARTER = 1.0, YEAR = 1993.0, MONTH = 1993-01-01)
julia> rs[:ACTUAL]
1440-element Array{Float64,1}:
925.0
999.0
608.0
⋮
julia> rs.ACTUAL
1440-element Array{Float64,1}:
925.0
999.0
608.0
⋮
julia> rs[2:4]
SASLib.ResultSet (3 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
2: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
3: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
julia> rs[:ACTUAL, :PREDICT, :YEAR, :MONTH]
SASLib.ResultSet (1440 rows x 4 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:YEAR, 4:MONTH
1: 925.0, 850.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1993.0, 1993-02-01
3: 608.0, 846.0, 1993.0, 1993-03-01
4: 642.0, 533.0, 1993.0, 1993-04-01
5: 656.0, 646.0, 1993.0, 1993-05-01
⋮
You may assign values at the cell level, causing a side effect in memory:
julia> srs = rs[:ACTUAL, :PREDICT, :YEAR, :MONTH][1:2]
SASLib.ResultSet (2 rows x 4 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:YEAR, 4:MONTH
1: 925.0, 850.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1993.0, 1993-02-01
julia> srs[2,2] = 3
3
julia> rs[1:2]
SASLib.ResultSet (2 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 3.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
ResultSet implements the usual standard iteration interface, so it's easy to walk through the results:
julia> mean(r.ACTUAL - r.PREDICT for r in rs)
16.695833333333333
There are simple functions to retrieve meta information about a ResultSet.
names(rs)
size(rs)
length(rs)
It may be beneficial to convert the result set to DataFrame for more complex queries and manipulations.
The SASLib.ResultSet
object implements the Tables.jl interface,
so you can directly create a DataFrame as shown below:
julia> df = DataFrame(rs);
julia> first(df, 5)
5×10 DataFrame
│ Row │ ACTUAL │ PREDICT │ COUNTRY │ REGION │ DIVISION │ PRODTYPE │ PRODUCT │ QUARTER │ YEAR │ MONTH │
│ │ Float64 │ Float64 │ String │ String │ String │ String │ String │ Float64 │ Float64 │ Dates…⍰ │
├─────┼─────────┼─────────┼─────────┼────────┼───────────┼───────────┼─────────┼─────────┼─────────┼────────────┤
│ 1 │ 925.0 │ 850.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 1.0 │ 1993.0 │ 1993-01-01 │
│ 2 │ 999.0 │ 297.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 1.0 │ 1993.0 │ 1993-02-01 │
│ 3 │ 608.0 │ 846.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 1.0 │ 1993.0 │ 1993-03-01 │
│ 4 │ 642.0 │ 533.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 2.0 │ 1993.0 │ 1993-04-01 │
│ 5 │ 656.0 │ 646.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 2.0 │ 1993.0 │ 1993-05-01 │
Column Inclusion
It is always faster to read only the columns that you need. The include_columns
argument comes in handy:
julia> rs = readsas("productsales.sas7bdat", include_columns=[:YEAR, :MONTH, :PRODUCT, :ACTUAL])
Read productsales.sas7bdat with size 1440 x 4 in 0.00151 seconds
SASLib.ResultSet (1440 rows x 4 columns)
Columns 1:ACTUAL, 2:PRODUCT, 3:YEAR, 4:MONTH
1: 925.0, SOFA, 1993.0, 1993-01-01
2: 999.0, SOFA, 1993.0, 1993-02-01
3: 608.0, SOFA, 1993.0, 1993-03-01
4: 642.0, SOFA, 1993.0, 1993-04-01
5: 656.0, SOFA, 1993.0, 1993-05-01
⋮
Column Exclusion
Likewise, you can read all columns except the ones you don't want as specified in exclude_columns
argument:
julia> rs = readsas("productsales.sas7bdat", exclude_columns=[:YEAR, :MONTH, :PRODUCT, :ACTUAL])
Read productsales.sas7bdat with size 1440 x 6 in 0.00265 seconds
SASLib.ResultSet (1440 rows x 6 columns)
Columns 1:PREDICT, 2:COUNTRY, 3:REGION, 4:DIVISION, 5:PRODTYPE, 6:QUARTER
1: 850.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
2: 297.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
3: 846.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
4: 533.0, CANADA, EAST, EDUCATION, FURNITURE, 2.0
5: 646.0, CANADA, EAST, EDUCATION, FURNITURE, 2.0
⋮
Case Sensitivity and Column Number
Column symbols are matched in a case insensitive manner with SAS column names.
Both include_columns
and exclude_columns
accept column number. In fact, you can mixed column symbols and column numbers as such:
julia> readsas("productsales.sas7bdat", include_columns=[:actual, :predict, 8, 9, 10])
Read productsales.sas7bdat with size 1440 x 5 in 0.16378 seconds
SASLib.ResultSet (1440 rows x 5 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:QUARTER, 4:YEAR, 5:MONTH
1: 925.0, 850.0, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1.0, 1993.0, 1993-02-01
3: 608.0, 846.0, 1.0, 1993.0, 1993-03-01
4: 642.0, 533.0, 2.0, 1993.0, 1993-04-01
5: 656.0, 646.0, 2.0, 1993.0, 1993-05-01
⋮
If you need to read files incrementally, you can use the SASLib.open
function to obtain a handle of the file. Then, use the SASLib.read
function to fetch a number of rows. Remember to close the handler with SASLib.close
to avoid memory leak.
julia> handler = SASLib.open("productsales.sas7bdat")
SASLib.Handler[productsales.sas7bdat]
julia> rs = SASLib.read(handler, 2)
Read productsales.sas7bdat with size 2 x 10 in 0.06831 seconds
SASLib.ResultSet (2 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
julia> rs = SASLib.read(handler, 3)
Read productsales.sas7bdat with size 3 x 10 in 0.00046 seconds
SASLib.ResultSet (3 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
2: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
3: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
julia> SASLib.close(handler)
Note that there is no facility at the moment to jump and read a subset of rows.
SASLib always read from the beginning.
By default, string columns are read into a special AbstractArray structure called ObjectPool
in order to conserve memory space that might otherwise be wasted for duplicate string values. SASLib tries to be smart -- when it encounters too many unique values (> 10%) in a large array (> 2000 rows), it falls back to a regular Julia array.
You can use a different array type (e.g. CategoricalArray or PooledArray) for any columns as you wish by specifying a string_array_fn
parameter when reading the file. This argument must be a Dict that maps a column symbol into a function that takes an integer argument and returns any array of that size.
Here's the normal case:
julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION]);
Read productsales.sas7bdat with size 1440 x 2 in 0.00193 seconds
julia> typeof.(columns(rs))
2-element Array{DataType,1}:
SASLib.ObjectPool{String,UInt16}
SASLib.ObjectPool{String,UInt16}
If you really want a regular String array, you can force SASLib to do so as such:
julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
string_array_fn=Dict(:COUNTRY => (n)->fill("",n)));
Read productsales.sas7bdat with size 1440 x 2 in 0.00333 seconds
julia> typeof.(columns(rs))
2-element Array{DataType,1}:
Array{String,1}
SASLib.ObjectPool{String,UInt16}
For convenience, SASLib.REGULAR_STR_ARRAY
is a function that does exactly that. In addition, if you need all columns to be configured the same then the key of the string_array_fn
dict may be just the symbol :_all_
.
julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
string_array_fn=Dict(:_all_ => REGULAR_STR_ARRAY));
Read productsales.sas7bdat with size 1440 x 2 in 0.00063 seconds
julia> typeof.(columns(rs))
2-element Array{DataType,1}:
Array{String,1}
Array{String,1}
In general, SASLib allocates native arrays when returning numerical column data. However, you can provide a custom constructor so you would be able to either pre-allcoate the array or construct a different type of array. The number_array_fn
parameter is a Dict
that maps column symbols to the custom constructors. Similar to string_array_fn
, this Dict may be specified with a special symbol :_all_
to indicate such constructor be used for all numeric columns.
Example - create SharedArray
:
julia> rs = readsas("productsales.sas7bdat", include_columns=[:ACTUAL,:PREDICT],
number_array_fn=Dict(:ACTUAL => (n)->SharedArray{Float64}(n)));
Read productsales.sas7bdat with size 1440 x 2 in 0.00385 seconds
julia> typeof.(columns(rs))
2-element Array{DataType,1}:
SharedArray{Float64,1}
Array{Float64,1}
Example - preallocate arrays:
julia> A = zeros(1440, 2);
julia> f1(n) = @view A[:, 1];
julia> f2(n) = @view A[:, 2];
julia> readsas("productsales.sas7bdat", include_columns=[:ACTUAL,:PREDICT],
number_array_fn=Dict(:ACTUAL => f1, :PREDICT => f2));
Read productsales.sas7bdat with size 1440 x 2 in 0.00041 seconds
julia> A[1:5,:]
5×2 Array{Float64,2}:
925.0 850.0
999.0 297.0
608.0 846.0
642.0 533.0
656.0 646.0
Often, you want a column to be an integer but the SAS7BDAT stores everything as Float64. Specifying the column_type
argument does the conversion for you.
julia> rs = readsas("productsales.sas7bdat", column_types=Dict(:ACTUAL=>Int))
Read productsales.sas7bdat with size 1440 x 10 in 0.08043 seconds
SASLib.ResultSet (1440 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
3: 608, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
4: 642, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
5: 656, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
julia> typeof(rs[:ACTUAL])
Array{Int64,1}
You may obtain meta data for a SAS data file using the metadata
function.
julia> md = metadata("productsales.sas7bdat")
File: productsales.sas7bdat (1440 x 10)
1:ACTUAL(Float64) 5:DIVISION(String) 9:YEAR(Float64)
2:PREDICT(Float64) 6:PRODTYPE(String) 10:MONTH(Date/Missings.Missing)
3:COUNTRY(String) 7:PRODUCT(String)
4:REGION(String) 8:QUARTER(Float64)
It's OK to access the fields directly.
julia> fieldnames(SASLib.Metadata)
9-element Array{Symbol,1}:
:filename
:encoding
:endianness
:compression
:pagesize
:npages
:nrows
:ncols
:columnsinfo
julia> md = metadata("test3.sas7bdat");
julia> md.compression
:RDC
ReadStat.jl uses the ReadStat C-library. However, ReadStat-C does not support reading RDC-compressed binary files.
StatFiles.jl is a higher-level package built on top of ReadStat.jl and implements the FileIO interface.
Python Pandas package has an implementation of SAS file reader that SASLib borrows heavily from.
- Jared Hobbs, the author of the SAS reader code from Pandas. See LICENSE_SAS7BDAT.md.
- Evan Miller, the author of ReadStat C/C++ library. See LICENSE_READSTAT.md.
- David Anthoff, who provided many valuable ideas at the early stage of development.
- Tyler Beason
- susabi
I also want to thank all the active members at the Julia Discourse community. This project wouldn't be possible without all the help I got from the community. That's the beauty of open-source development.