`DataSource` is Sif's abstraction for a source of `Partition`s. `DataSource`s adhere to the following interface:
```go
type DataSource interface {
	// See below for details
	Analyze() (PartitionMap, error)
	// For deserializing a PartitionLoader, commonly by
	// constructing a fresh one and calling GobDecode
	DeserializeLoader([]byte) (PartitionLoader, error)
	// Returns true iff this is a streaming DataSource
	IsStreaming() bool
}
```
The first task of any `DataSource` is to produce a `PartitionMap`. A `PartitionMap` represents a sequence of "units of work" (`PartitionLoader`s) which can be assigned to individual `Worker`s. Each "unit of work" is a task which will produce one or more `Partition`s.
For example, when loading a directory of files, `datasource.file` produces a `PartitionMap` where each `PartitionLoader` represents an individual file, and each `Worker` receiving one of these `PartitionLoader`s uses it to read the file and produce `Partition`s.
It is commonplace for a `DataSource` package to also provide a `CreateDataFrame` factory function, which accepts configuration, a `Schema`, and a `DataSourceParser`, instantiates the `DataSource`, and passes it to `datasource.CreateDataFrame()`. For example, `datasource.file`'s factory function looks like this:
```go
type DataSourceConf struct {
	// ...
}

func CreateDataFrame(conf *DataSourceConf, parser sif.DataSourceParser, schema sif.Schema) sif.DataFrame {
	source := &DataSource{conf: conf, parser: parser, schema: schema}
	df := datasource.CreateDataFrame(source, parser, schema)
	return df
}
```
Note: It is conceivable that a `DataSource` may not require a parser (such as one that accesses data directly from a database). In this case, the factory function may omit the `parser` argument, and calls to `PartitionLoader.Load()` would include a `nil` value in place of a `DataSourceParser`.
Explore the included `DataSource`s (particularly `datasource.file`) for concrete implementation examples.