Implementing Custom DataSources

DataSource is Sif's abstraction for a source of Partitions. DataSources adhere to the following interface:

type DataSource interface {
	// Analyze produces the PartitionMap for this DataSource
	// (see "Implementing Analyze()" below for details)
	Analyze() (PartitionMap, error)
	// DeserializeLoader rebuilds a PartitionLoader from bytes, commonly
	// by constructing a fresh one and calling GobDecode on it
	DeserializeLoader([]byte) (PartitionLoader, error)
	// IsStreaming returns true iff this is a streaming DataSource
	IsStreaming() bool
}
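As a rough illustration of how the three methods fit together, the following sketch implements them for a hypothetical source that splits a row range into fixed-size chunks. The `PartitionLoader` and `PartitionMap` interfaces here are simplified local stand-ins, not Sif's actual definitions, and `rangeSource`, `rangeMap` and `rangeLoader` are invented names used only for this example.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"errors"
)

// Simplified local stand-ins (assumptions, NOT Sif's real interfaces).
type PartitionLoader interface {
	GobEncode() ([]byte, error)
	GobDecode([]byte) error
}

type PartitionMap interface {
	HasNext() bool
	Next() PartitionLoader
}

type DataSource interface {
	Analyze() (PartitionMap, error)
	DeserializeLoader([]byte) (PartitionLoader, error)
	IsStreaming() bool
}

// rangeLoader is a hypothetical unit of work covering rows [Start, End).
type rangeLoader struct {
	Start, End int
}

// GobEncode writes the two fields one after the other.
func (l *rangeLoader) GobEncode() ([]byte, error) {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	if err := enc.Encode(l.Start); err != nil {
		return nil, err
	}
	if err := enc.Encode(l.End); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// GobDecode reads the fields back in the order they were written.
func (l *rangeLoader) GobDecode(data []byte) error {
	dec := gob.NewDecoder(bytes.NewReader(data))
	if err := dec.Decode(&l.Start); err != nil {
		return err
	}
	return dec.Decode(&l.End)
}

// rangeMap hands out loaders in order.
type rangeMap struct {
	loaders []*rangeLoader
	next    int
}

func (m *rangeMap) HasNext() bool { return m.next < len(m.loaders) }

func (m *rangeMap) Next() PartitionLoader {
	l := m.loaders[m.next]
	m.next++
	return l
}

// rangeSource is a hypothetical, non-streaming DataSource.
type rangeSource struct {
	total, chunk int
}

// Analyze splits [0, total) into chunks, one loader per chunk.
func (s *rangeSource) Analyze() (PartitionMap, error) {
	if s.chunk <= 0 {
		return nil, errors.New("rangeSource: chunk must be positive")
	}
	m := &rangeMap{}
	for start := 0; start < s.total; start += s.chunk {
		end := start + s.chunk
		if end > s.total {
			end = s.total
		}
		m.loaders = append(m.loaders, &rangeLoader{Start: start, End: end})
	}
	return m, nil
}

// DeserializeLoader constructs a fresh loader and calls GobDecode on it.
func (s *rangeSource) DeserializeLoader(data []byte) (PartitionLoader, error) {
	l := &rangeLoader{}
	if err := l.GobDecode(data); err != nil {
		return nil, err
	}
	return l, nil
}

func (s *rangeSource) IsStreaming() bool { return false }

// Compile-time check that rangeSource satisfies the stand-in interface.
var _ DataSource = (*rangeSource)(nil)
```

Calling `Analyze()` on `&rangeSource{total: 10, chunk: 4}` yields three loaders under these assumptions, and any of them can be round-tripped through `GobEncode()` and `DeserializeLoader()`. A real implementation would use the interfaces exported by Sif rather than the stand-ins above.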

Implementing Analyze()

The first task of any DataSource is to produce a PartitionMap. A PartitionMap represents a sequence of "units of work" (PartitionLoaders) which can be assigned to individual Workers. Each "unit of work" is a task which will produce one or more Partitions.

For example, when loading a directory of files, datasource.file produces a PartitionMap where each PartitionLoader represents an individual file, and each Worker receiving one of these PartitionLoaders uses it to read the file and produce Partitions.
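The file-per-loader idea can be sketched as follows. This is not the code of datasource.file: `fileLoader`, `partition`, `analyze`, `load` and `demo` are hypothetical names, a partition is modelled as a plain batch of text lines, and parsing and schemas are left out entirely.

```go
package main

import (
	"bufio"
	"os"
	"path/filepath"
)

// partition is a hypothetical stand-in: a batch of raw lines.
type partition []string

// fileLoader is a hypothetical unit of work representing one file.
type fileLoader struct {
	path string
}

// analyze returns one loader per regular file directly inside dir.
// os.ReadDir returns entries sorted by filename, so the order is stable.
func analyze(dir string) ([]*fileLoader, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var loaders []*fileLoader
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		loaders = append(loaders, &fileLoader{path: filepath.Join(dir, e.Name())})
	}
	return loaders, nil
}

// load reads the file and groups its lines into partitions of at most
// size lines each (size is expected to be positive).
func (l *fileLoader) load(size int) ([]partition, error) {
	f, err := os.Open(l.path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var parts []partition
	var cur partition
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		cur = append(cur, sc.Text())
		if len(cur) == size {
			parts = append(parts, cur)
			cur = nil
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if len(cur) > 0 {
		parts = append(parts, cur) // final, partially filled partition
	}
	return parts, nil
}

// demo writes two small files to a temporary directory, analyzes the
// directory, and loads every file with a partition size of 2. It returns
// the number of loaders and the total number of partitions produced.
func demo() (int, int, error) {
	dir, err := os.MkdirTemp("", "sif-sketch")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	files := map[string]string{"a.txt": "1\n2\n3\n", "b.txt": "4\n"}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o600); err != nil {
			return 0, 0, err
		}
	}

	loaders, err := analyze(dir)
	if err != nil {
		return 0, 0, err
	}
	total := 0
	for _, l := range loaders {
		parts, err := l.load(2)
		if err != nil {
			return 0, 0, err
		}
		total += len(parts)
	}
	return len(loaders), total, nil
}
```

In this sketch the analysis step only lists files and never opens them; the reading happens later, in `load`, which mirrors the split between producing a PartitionMap and having Workers turn each PartitionLoader into Partitions.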

Factory

A DataSource package commonly also provides a CreateDataFrame factory function. This function accepts configuration, a DataSourceParser and a Schema, instantiates the DataSource, and passes it to datasource.CreateDataFrame(). For example, datasource.file's factory function looks like this:

type DataSourceConf struct {
	// ...
}

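// CreateDataFrame builds a DataSource from conf and wraps it in a DataFrame
func CreateDataFrame(conf *DataSourceConf, parser sif.DataSourceParser, schema sif.Schema) sif.DataFrame {
	source := &DataSource{conf: conf, parser: parser, schema: schema}
	// Delegate DataFrame construction to Sif's datasource package
	return datasource.CreateDataFrame(source, parser, schema)
}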

Note: A DataSource may not require a parser at all, for example one that reads already-structured data from a database. In this case, the factory function may omit the parser argument, and calls to PartitionLoader.Load() would receive nil in place of a DataSourceParser.
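A parser-less loader might look roughly like the sketch below. The `DataSourceParser` interface, the `Load` signature, and the `row`, `queryLoader` and `loadAll` names are all simplified assumptions made for illustration; they do not reproduce Sif's actual types. The `fetch` function stands in for a database query.

```go
package main

import "errors"

// DataSourceParser is a simplified stand-in (assumption, not Sif's type).
type DataSourceParser interface {
	Parse(raw []byte) ([]string, error)
}

// row is a hypothetical, already-structured record.
type row []string

// queryLoader is a hypothetical loader whose source returns structured
// rows directly, so it has no use for a parser.
type queryLoader struct {
	fetch func() ([]row, error)
}

// Load accepts a parser to match the shape of the other loaders, but
// never dereferences it, so a nil parser is safe.
func (l *queryLoader) Load(parser DataSourceParser) ([]row, error) {
	if l.fetch == nil {
		return nil, errors.New("queryLoader: no fetch function configured")
	}
	return l.fetch()
}

// loadAll runs every loader, passing nil where a parser would normally go.
func loadAll(loaders []*queryLoader) ([]row, error) {
	var all []row
	for _, l := range loaders {
		rows, err := l.Load(nil)
		if err != nil {
			return nil, err
		}
		all = append(all, rows...)
	}
	return all, nil
}
```

The point of the sketch is only that `Load` must not assume a non-nil parser when the DataSource was built without one.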

Explore the included DataSources (particularly datasource.file) for concrete implementation examples.