Union and intersection of TimeSeries #100
base: master
Conversation
- Temp fix frequency computation issue to pass the build sryza#87
- Remove time zone from usage of factory method `hybrib`
- Add comment on `binarySearch` returning a tuple of `(Int, Int)`
```scala
 * b: is the array index of the date-time index where the queried date-time dt could
 *    be inserted. This value is used by the insertionLoc method.
 */
private def binarySearch(low: Int, high: Int, dt: ZonedDateTime): (Int, Int) = {
```
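A sketch of what such a search might look like, written here over plain epoch-millis longs rather than `ZonedDateTime` and not the project's actual implementation. It follows the tuple semantics described in the comment: the first element is the match index (or -1), the second is the insertion location, taken here as the index of the first element greater than the query.

```scala
// Hypothetical sketch (not the PR's code): binary search over a sorted array
// of epoch-millis timestamps, returning (a, b) where a is the index of an
// exact match (-1 if absent) and b is the insertion location, i.e. the index
// of the first timestamp strictly greater than dt.
def binarySearch(ts: Array[Long], dt: Long): (Int, Int) = {
  var low = 0
  var high = ts.length - 1
  while (low <= high) {
    val mid = (low + high) >>> 1
    if (ts(mid) == dt) return (mid, mid + 1)
    else if (ts(mid) < dt) low = mid + 1
    else high = mid - 1
  }
  (-1, low) // not found; all elements below `low` are < dt
}

// e.g. binarySearch(Array(10L, 20L, 30L), 25L) returns (-1, 2)
```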
Can you include a header comment here to indicate what the two tuple values mean? ~ @sryza
Added
Do you think it will be common for users to want to union / intersect both vertically and horizontally at the same time? I'm wondering if it makes sense to have different methods for unioning time-wise vs. column-wise.
I'm sorry, I didn't get what you mean. I think union/intersect time-wise implies combining columns; it is well defined how to handle time overlaps, however it is not defined how to handle overlaps of columns with equal keys. Hence, the keys of unioned/intersected time series should be disjoint. The predefined method
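The disjoint-keys assumption above can be made concrete with a small sketch. This is illustrative only, not the PR's API: the series are modeled as plain `key -> (instant -> value)` maps, so a time-wise union with disjoint keys reduces to map concatenation.

```scala
// Illustrative sketch only (not the project's TimeSeries API): time-wise union
// of two multivariate series with disjoint column keys, each modeled as a
// Map from column key to a Map from time instant to value.
def unionSeries(a: Map[String, Map[Long, Double]],
                b: Map[String, Map[Long, Double]]): Map[String, Map[Long, Double]] = {
  // With disjoint keys there is no column-wise merge ambiguity to resolve.
  require(a.keySet.intersect(b.keySet).isEmpty, "keys must be disjoint")
  a ++ b
}
```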
```scala
 * or a is of size 1 and b is irregular -> d is irregular
 */
def simplify(indices: Array[DateTimeIndex]): Array[DateTimeIndex] = {
  val simplified = new ListBuffer[DateTimeIndex]
```
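The merging pattern `simplify` uses can be sketched on plain integer ranges instead of `DateTimeIndex` values (a hypothetical analogue, not the PR's code): walk the input and fold each range into the previous one when they are contiguous.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical analogue of `simplify` on (start, end) integer ranges:
// merge each range into the previous one whenever they are contiguous.
def simplifyRanges(ranges: Array[(Int, Int)]): Array[(Int, Int)] = {
  val simplified = new ArrayBuffer[(Int, Int)]
  for (r <- ranges) {
    if (simplified.nonEmpty && simplified.last._2 + 1 == r._1)
      simplified(simplified.length - 1) = (simplified.last._1, r._2) // extend previous
    else
      simplified += r
  }
  simplified.toArray
}

// e.g. simplifyRanges(Array((1, 3), (4, 6), (8, 9))) yields Array((1, 6), (8, 9))
```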
ArrayBuffer is preferable to ListBuffer for performance ~ @sryza
Only append operations are performed, and `ListBuffer` is better than `ArrayBuffer` for appends: http://docs.scala-lang.org/overviews/collections/performance-characteristics.html

Am I missing something?
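For context, both buffers give effectively constant-time appends, which is why the choice is close: `ArrayBuffer` append is amortized constant (occasional resize-and-copy over a flat array), while `ListBuffer` append is strictly constant but allocates a cons cell per element. A minimal comparison:

```scala
import scala.collection.mutable.{ArrayBuffer, ListBuffer}

// Both support the same append API; the difference is only in the
// underlying representation (flat array vs. linked cons cells).
val lb = new ListBuffer[Int]
val ab = new ArrayBuffer[Int]
(1 to 5).foreach { i => lb += i; ab += i }
assert(lb.toList == ab.toList)
```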
Thinking about this a little more, my concern is that, as an RDD, TimeSeriesRDD already has a

Then there's also the question of "left joins", both across rows and columns. E.g., I imagine that the most common use case for this type of functionality would be someone who wants to update a dataset with observations from a wider time range. They have a dataset, and they have another dataset, which possibly has more keys than the first and covers a different time range. They want their resulting dataset to include only the keys from the first dataset, but to cover the time ranges of both the first and second datasets.

So ultimately I think there are two ternary parameters that come up when someone wants to join two TimeSeriesRDDs:
I still need to think a little bit about the best way to expose an API for this, but I am of course open to suggestions.
I can see your point. It is more about joins than just horizontal gluing of datasets; in fact, it's about set operations on the time index. The subsequent commits of the original PR #88 add left (right) join and except (or difference) set operations. Those kinds of transformations are basically transformations of the underlying date-time index such that:
We have two concerns here:
Let's tackle the second concern by example: Consider the following two multivariate time series:
Let's consider an outer join, but vary the domain between key, time, and key-time:

**Outer join on key**
Distinct keys and concatenated time indices ... This might result in an improper time index. It could be okay if the time indices are disjoint / non-overlapping.

**Outer join on time**
Distinct time instants and concatenated keys ... This might result in duplicate keys. It could be okay if the keys are disjoint.

**Outer join on key-time**
Distinct time instants and distinct keys ... This might result in multiple values (collisions) for the same key and instant. It could be okay if both keys and time indices are disjoint.

Outer join on key vs. outer join on time; outer join on key-time vs. outer join on time
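The three join domains above can be illustrated with two tiny series, modeled here as plain `(key, instant) -> value` maps rather than the project's TimeSeries class (an illustrative sketch only; ticker names are made up):

```scala
// Two tiny multivariate series as (key, instant) -> value maps.
val a = Map(("AAPL", 1L) -> 10.0, ("AAPL", 2L) -> 11.0)
val b = Map(("MSFT", 2L) -> 20.0, ("MSFT", 3L) -> 21.0)

// Outer join on key: the distinct keys, time indices concatenated per key.
val keys = a.keys.map(_._1).toSet ++ b.keys.map(_._1).toSet   // Set("AAPL", "MSFT")

// Outer join on time: the distinct instants, columns concatenated.
val times = (a.keys.map(_._2) ++ b.keys.map(_._2)).toSet      // Set(1L, 2L, 3L)

// Outer join on key-time: a collision would occur only if some (key, instant)
// pair existed in both maps; here the keys are disjoint, so ++ is well defined.
val joined = a ++ b
```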
Thanks for all the detail above. I agree with most of your points. If I understand correctly, your ultimate assertion is that we should start by exposing binary operators that return a TimeSeriesRDD with a unified time index, but that don't merge the data in time series with the same key? I think that sounds mostly reasonable, but there's a performance concern. In the most common situations, I think users ultimately will want to merge series with the same keys. If that is implemented with

If you want, while we figure out the right approach here, you could post a PR that just includes DateTimeIndex.union (not TimeSeries) and I could review and merge that.
I would say yes. I'd rather say that the API assumes disjoint keys.
👍 good catch
Sounds like a sparse vector. Generalizing on that, how about a vector that holds a special value, maybe call it a span, that holds a value (say NaN) for a contiguous range of indices? Conceptually, I can see a vector here as a multipart function, such that each part is defined within a separate range of indices: one part could be a direct mapping like a vector in the usual sense, another part could be a constant value, another part could be a spline... I think this might help a lot with lengthy synthesized time series (upsampling, for example): you don't store the data, instead you store a generating model of the data.
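The "multipart" idea above might be sketched like this. All the names here (`Part`, `Dense`, `Span`, `PiecewiseVector`) are hypothetical, not anything in the library: each part covers a half-open index range, and a constant span stores one value regardless of how many indices it covers.

```scala
// Hypothetical "multipart" vector: each part defines values over a half-open
// index range [start, end), so long constant spans (e.g. NaN fills) need no
// per-index storage.
sealed trait Part { def start: Int; def end: Int; def apply(i: Int): Double }

// A directly stored range of values, like an ordinary dense vector segment.
case class Dense(start: Int, values: Array[Double]) extends Part {
  def end: Int = start + values.length
  def apply(i: Int): Double = values(i - start)
}

// A constant value over a contiguous range of indices.
case class Span(start: Int, end: Int, value: Double) extends Part {
  def apply(i: Int): Double = value
}

class PiecewiseVector(parts: Seq[Part]) {
  def apply(i: Int): Double =
    parts.find(p => p.start <= i && i < p.end).map(_(i)).getOrElse(Double.NaN)
}
```

A spline or other generating-model part would just be another `Part` implementation that computes values on demand instead of storing them.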
Issued a PR #101 for the first three commits, about a generic rebaser.
I think in the most common cases keys will not be disjoint. For example, I imagine that a common case is that one has a bunch of tick data from one source covering a certain time range, as well as a bunch of tick data for the same ticker symbols from another source, covering a different time range.
Yeah, somewhere between a sparse vector and a dense vector. I.e. a vector with a dense range, and otherwise empty. Using the existing Breeze sparse vector would be inefficient because it would require storing an index alongside every value in the dense range.
Yeah, exactly. That said, I don't think it makes sense to add this in full generality right now.
Additions to the public API:

- `DateTimeIndex`
  - helper methods `millisIterator(): Iterator[Long]` and `zonedDateTimeIterator(): Iterator[ZonedDateTime]`
  - `insertionLoc` methods to find the location at which the given date-time could be inserted. It is the location of the first date-time that is greater than the given date-time. If the given date-time is greater than or equal to the last date-time in the index, the index size is returned. Used in transformations on multiple indices.
  - `atZone(zone: ZoneId)` adjusts the time zone of the index. Used in transformations on multiple indices.
- `TimeSeries`
  - `union` combines multiple multivariate time series of disjoint keys into one multivariate time series by applying `union` on all time indices and rebasing all univariate time series using the union index.
  - `intersect` combines multiple multivariate time series of disjoint keys into one multivariate time series, if possible.

Additions to the private API:

- `DateTimeIndex`
  - a generic rebaser at `TimeSeriesUtils.rebaserGeneric(sourceIndex: DateTimeIndex, targetIndex: DateTimeIndex, defaultValue: Double)`. Helpful for transformations on multiple indices of different types.
- `DateTimeIndexUtils` object holds utility methods for `DateTimeIndex`:
  - `dateTimeIndexOrdering` defines an ordering on `DateTimeIndex` such that for two `DateTimeIndex` values `x` and `y`, `x < y iff x.first < y.first || (x.first == y.first && x.size < y.size)`
  - `simplify(indices: Array[DateTimeIndex]): Array[DateTimeIndex]` merges contiguous indices where possible
  - `union` unions a list of indices into one `DateTimeIndex`
  - `intersect` intersects a list of indices and returns a new index, if possible
- `TimeSeriesUtils`
  - `rebaseAndMerge(tss: Array[TimeSeries[K]], newIndex: DateTimeIndex, defaultValue: Double): TimeSeries[K]`, a utility for rebasing a collection of multivariate time series of disjoint keys and merging them into one multivariate time series
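The `dateTimeIndexOrdering` rule above is lexicographic on (first instant, size) and can be sketched on a simplified stand-in for `DateTimeIndex` (the `Idx` type here is hypothetical, just a first instant in millis and a size, not the project's class):

```scala
// Simplified stand-in for DateTimeIndex: a first instant (epoch millis)
// and a number of date-times.
case class Idx(first: Long, size: Int)

// The described ordering: x < y iff x.first < y.first ||
// (x.first == y.first && x.size < y.size), i.e. lexicographic on (first, size).
val dateTimeIndexOrdering: Ordering[Idx] = Ordering.by((i: Idx) => (i.first, i.size))
```

Sorting indices this way before `simplify` would place indices that start earlier (and, on ties, shorter ones) first, which is a natural preprocessing step for merging contiguous indices.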