
Review use of FunctionNode #121

Open
etrain opened this issue May 18, 2015 · 4 comments

Comments

@etrain
Contributor

etrain commented May 18, 2015

In several pipelines, we use FunctionNode to handle cases where, for example, fitting an Estimator[A,B] doesn't return a Transformer[A,B] but instead a Transformer[C,D], or where a single-item transformation has no good meaning.

Currently, FunctionNode feels like a "catch-all" because the Transformer/Estimator APIs don't sufficiently cover some of the data transformation operations we need to support.
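As a sketch (a simplified stand-in, not KeystoneML's actual definition), a FunctionNode is essentially an arbitrary function between dataset-level types, with no per-item contract, which is what makes it a catch-all:

```scala
// Simplified stand-in for illustration; not the actual KeystoneML definition.
// A FunctionNode is just an arbitrary function between (possibly
// collection-level) types, so it can express operations that the
// Transformer/Estimator contracts cannot.
case class FunctionNode[In, Out](f: In => Out) {
  def apply(in: In): Out = f(in)
}

// Example: a whole-collection operation (Seq standing in for an RDD).
val lengths = FunctionNode[Seq[String], Seq[Int]](_.map(_.length))
```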

One example of this is NGramsCounts which takes a Seq[Seq[T]] and returns a model of type NGrams[T] => Int.
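Sketched with plain Seq in place of RDDs (hypothetical helper names, not the real implementation), that shape is roughly:

```scala
// Hypothetical sketch of the NGramsCounts shape: fitting consumes a corpus
// of token sequences, but the fitted model maps a single n-gram to its
// count. The fit-input item type (Seq[T]) differs from the model-input type
// (an n-gram, also Seq[T] but a different role), so Estimator[A, B]'s
// "fit on A, transform A" contract doesn't describe it.
def nGrams[T](tokens: Seq[T], n: Int): Seq[Seq[T]] =
  tokens.sliding(n).map(_.toSeq).toSeq

def nGramsCounts[T](corpus: Seq[Seq[T]], n: Int): Seq[T] => Int = {
  val counts: Map[Seq[T], Int] =
    corpus.flatMap(doc => nGrams(doc, n))
      .groupBy(identity)
      .map { case (gram, occurrences) => (gram, occurrences.size) }
  gram => counts.getOrElse(gram, 0)
}

val model = nGramsCounts(Seq(Seq("a", "b", "a", "b")), 2)
```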

Other examples include Windower and Sampler, which are used in the RandomPatchCifar pipeline. These nodes are different in that they do not operate on single items and are thus not transformers; they act as something like an aggregator, to draw a database analogy.
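For instance (a hypothetical one-dimensional simplification, not the actual Windower), a windowing node maps one input item to many outputs, so there is no per-item Transformer it could be:

```scala
// Hypothetical 1-D simplification of a Windower-style node. One input item
// yields many windows, so the operation is collection-to-collection
// (aggregation-like in the database sense), not a per-item Transformer.
def windower(images: Seq[Vector[Int]], windowSize: Int): Seq[Vector[Int]] =
  images.flatMap(img => img.sliding(windowSize).map(_.toVector))
```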

@tomerk
Contributor

tomerk commented May 18, 2015

What percent of these are only being used in the "fitting" part of a pipeline and not the "prediction" part?

@tomerk
Contributor

tomerk commented May 18, 2015

And are these all RDD to RDD?

@concretevitamin
Member

Just randomly chiming in - the aggregation pattern is everywhere in every query processing engine (and you're totally right, it's also in decade-old databases!), so I guess there's a reason.

@tomerk
Contributor

tomerk commented May 18, 2015

So after taking a closer look, it seems to me like the cases where we're using FunctionNode right now fall under either:

  • Some form of 'aggregation' (representable as any RDD transformation that isn't item to item) being done only at 'fit' time
  • Something related to zipping & block transformers & estimators which we still need to figure out how to do cleanly

Some questions I have about the Aggregators are:

  • Do we want to be able to chain these with transformers? (judging by how they're being used right now, it looks like there's at least some interest in it)
  • Where do we want to call these aggregators?
      ◦ Only internally within Estimators?
      ◦ Directly on the training data before we call estimator.fit(data)?
      ◦ Somehow chain it within a pipeline but have it only apply in the 'fitting' stage?
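One possible shape for that last option (purely a hypothetical sketch, with Seq standing in for RDDs, and all names invented here): an Aggregator reshapes the training data before the estimator sees it, and only the fitted transformer chains onward at prediction time:

```scala
// Hypothetical sketch only; none of these are the real KeystoneML APIs.
trait Transformer[A, B] { def apply(in: A): B }
trait Estimator[A, B] { def fit(data: Seq[A]): Transformer[A, B] }

// Any collection-to-collection transformation (not item-to-item).
trait Aggregator[A, B] { def aggregate(data: Seq[A]): Seq[B] }

// Apply the aggregator only in the 'fitting' stage: it rewrites the
// training set, and the fitted transformer is what the pipeline keeps.
def fitWithAggregator[A, B, C](agg: Aggregator[A, B], est: Estimator[B, C],
                               data: Seq[A]): Transformer[B, C] =
  est.fit(agg.aggregate(data))

// Toy usage: a Sampler-like aggregator plus a mean-centering estimator.
val sampler = new Aggregator[Int, Int] {
  def aggregate(data: Seq[Int]): Seq[Int] = data.grouped(2).map(_.head).toSeq
}
val meanShift = new Estimator[Int, Double] {
  def fit(data: Seq[Int]): Transformer[Int, Double] = {
    val mean = data.sum.toDouble / data.size
    new Transformer[Int, Double] { def apply(in: Int): Double = in - mean }
  }
}
val model = fitWithAggregator(sampler, meanShift, Seq(1, 2, 3, 4))
```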
