DataFrameIntervals provides two functions that are handy for computing joins over intervals of time, interval_join and groupby_interval_join, and a helper function called quantile_windows. See their doc strings for details.
Rows match in this join if their time spans overlap. The time spans can be represented as
There are several options to support additional types, such as AlignedSpans. One option is to add interface methods to support automatic conversions to intervals; see e.g. #13. Another option is to manually convert to a supported type; this can provide additional control over how the conversion takes place. For example, one can simply convert to TimeSpans:
timespanify = :span => ByRow(TimeSpan) => :span
interval_join(transform(df1, timespanify), transform(df2, timespanify); on=:span)
For AlignedSpans, we can convert to integer indices, after checking the sample rates are all equal:
using Compat # for allequal
if !allequal(Iterators.flatten(((as.sample_rate for as in df1.span), (as.sample_rate for as in df2.span))))
    throw(ArgumentError("Sampling rates do not all match!"))
end
integer_spanify = :span => ByRow(as -> Interval{Int, Closed, Closed}(as.first_index, as.last_index)) => :span
interval_join(transform(df1, integer_spanify), transform(df2, integer_spanify); on=:span)
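The sample-rate check matters because integer indices only describe the same stretch of time when the rates agree. The following is a minimal sketch of that point. `IndexSpan` and `start_seconds` are hypothetical stand-ins written for this illustration (they are not the AlignedSpans type or API), and the 1-based indexing is an assumption.

```julia
# Hypothetical stand-in for an aligned span. The field names mirror those
# used above, but this struct is NOT the type from AlignedSpans.
struct IndexSpan
    sample_rate::Float64
    first_index::Int
    last_index::Int
end

# Start time in seconds, assuming 1-based sample indices.
start_seconds(s::IndexSpan) = (s.first_index - 1) / s.sample_rate

a = IndexSpan(100.0, 101, 200)  # recorded at 100 Hz
b = IndexSpan(50.0, 101, 200)   # same indices, recorded at 50 Hz

same_rate = a.sample_rate == b.sample_rate

# Identical indices, different times: joining on raw indices would
# silently pair up spans that do not cover the same period.
println((same_rate, start_seconds(a), start_seconds(b)))
```

With matching rates the index intervals can be compared directly, which is what the conversion above relies on.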
using TimeSpans
using DataFrames
using DataFrameIntervals
using Distributions
using Random
using Dates
n = 100
tovalue(x) = Nanosecond(round(Int, x * 1e9))
times = cumsum(rand(MersenneTwister(hash((:dataframe_intervals, 2022_06_01))), Gamma(3, 2), n+1))
spans = TimeSpan.(tovalue.(times[1:(end-1)]), tovalue.(times[2:end]))
df = DataFrame(label = rand(('a':'d'), n), x = rand(n), span = spans)
100×3 DataFrame
 Row │ label  x          span
     │ Char   Float64    TimeSpan
─────┼─────────────────────────────────────────────────────
   1 │ b      0.0606309  TimeSpan(00:00:05.164631882, 00:…
   2 │ a      0.961599   TimeSpan(00:00:08.853504418, 00:…
   3 │ c      0.55525    TimeSpan(00:00:13.431519652, 00:…
   4 │ d      0.058248   TimeSpan(00:00:25.929078264, 00:…
  ⋮  │   ⋮        ⋮                      ⋮
  98 │ a      0.995222   TimeSpan(00:08:51.512608520, 00:…
  99 │ d      0.188141   TimeSpan(00:08:56.662988067, 00:…
 100 │ a      0.338053   TimeSpan(00:08:58.445446762, 00:…
quarters = quantile_windows(4, df, label=:quarter)
interval_join(df, quarters, on=:span)
103×6 DataFrame
 Row │ quarter  label  x          span_left                          span_right                         span
     │ Int64    Char   Float64    TimeSpan                           TimeSpan                           TimeSpan
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       1  b      0.0606309  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:05.164631882, 00:…
   2 │       1  a      0.961599   TimeSpan(00:00:08.853504418, 00:…  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:08.853504418, 00:…
   3 │       1  c      0.55525    TimeSpan(00:00:13.431519652, 00:…  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:13.431519652, 00:…
   4 │       1  d      0.058248   TimeSpan(00:00:25.929078264, 00:…  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:25.929078264, 00:…
  ⋮  │    ⋮       ⋮        ⋮                      ⋮                                  ⋮                                  ⋮
 101 │       4  a      0.995222   TimeSpan(00:08:51.512608520, 00:…  TimeSpan(00:06:51.442142229, 00:…  TimeSpan(00:08:51.512608520, 00:…
 102 │       4  d      0.188141   TimeSpan(00:08:56.662988067, 00:…  TimeSpan(00:06:51.442142229, 00:…  TimeSpan(00:08:56.662988067, 00:…
 103 │       4  a      0.338053   TimeSpan(00:08:58.445446762, 00:…  TimeSpan(00:06:51.442142229, 00:…  TimeSpan(00:08:58.445446762, 00:…
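The joined result has 103 rows although df has only 100. That is presumably because a span straddling a window boundary overlaps, and so matches, both neighbouring windows. The sketch below reproduces the effect with a naive nested-loop join over plain integer tuples. It is a hypothetical illustration only, not the algorithm DataFrameIntervals uses, and it assumes half-open spans.

```julia
# Naive overlap join over (start, stop) tuples, treated as half-open.
# Hypothetical illustration; not the DataFrameIntervals implementation.
overlaps(a, b) = a[1] < b[2] && b[1] < a[2]

rows = [(0, 4), (4, 6), (6, 10)]  # three input spans
windows = [(0, 5), (5, 10)]       # two adjacent windows

# Every (row, window) pair that overlaps becomes one output row.
matches = [(row = r, window = w) for r in rows for w in windows if overlaps(r, w)]

# Three input rows yield four matches: (4, 6) straddles the boundary at 5.
println(length(matches))
```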
Below is a list of related packages and a brief indication of their differences from DataFrameIntervals.
- TSx: various operations on time series data. It includes many features DataFrameIntervals does not aim to implement, but it does not implement joins over intervals of time.
- FlexiJoins: generic join operations, including by interval predicates (∈, ⊆, ⊊, ⊋, ⊇, !isdisjoint). Its algorithms are more general purpose and are bound by the complexity of more general purpose data structures (e.g. KD-trees). DataFrameIntervals is (currently) bound by a lower complexity class for its specific use case.
- InMemoryDatasets.jl: includes inequality-like joins over intervals of time (where the interval is represented as two columns). This cannot yet achieve the behavior implemented in DataFrameIntervals, where multiple inequalities must be checked to determine overlap.
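To make the "multiple inequalities" point concrete, here is a minimal sketch of an overlap test written against start and stop values. `spans_overlap` is a hypothetical helper written for this illustration, not part of any of the packages above, and it assumes half-open spans that include the start but exclude the stop.

```julia
using Dates

# Hypothetical helper: two half-open spans [start, stop) overlap only when
# BOTH inequalities hold, so a single inequality cannot express the match.
spans_overlap(a_start, a_stop, b_start, b_stop) = a_start < b_stop && b_start < a_stop

a = (Nanosecond(0), Nanosecond(10))
b = (Nanosecond(5), Nanosecond(15))
c = (Nanosecond(10), Nanosecond(20))

println(spans_overlap(a..., b...))  # a and b share the range 5 to 10
println(spans_overlap(a..., c...))  # c starts exactly where a stops
```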