
Port Cifar Augmented pipeline and its nodes #247

Merged (17 commits) on Mar 22, 2016

Conversation

shivaram
Contributor

Fixes #242

@shivaram
Contributor Author

cc @ericmjonas @Vaishaal

Some notes:

  1. This lacks any unit tests right now. I wanted to get this out early to get some feedback on the interfaces; I will work on unit tests soon.
  2. To do augmentation I was forced to use FunctionNode -- I don't think we have a better way to do this? Also, not being able to chain function nodes in the middle of a pipeline is a pain.
  3. It'll be good to see if some of these APIs, like RandomFlipper, can be generalized as @ericmjonas mentioned in the other thread.

import org.apache.spark.rdd.RDD
import nodes.util.MaxClassifier

object AugmentedExamplesEvaluator extends Serializable {
Contributor

Instead of AugmentedExamplesEvaluator this seems like it could just as easily be called EnsembleEvaluator or something.

In the traditional ensemble setting, a hypothetical EnsembleEstimator could be an Estimator[T, Seq[V]], or, if people wanted to construct ensembles themselves, they could use a Pipeline.gather to collate the results of running several models on an input data item.

Here we have a case that's slightly different, but I think the basic idea still holds.
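The gather idea could look roughly like the following minimal Scala sketch. `Model`, `gather`, and `majorityVote` are illustrative stand-in names, not KeystoneML's actual API:

```scala
// A simplified stand-in for a pipeline stage: one input, one output.
trait Model[T, V] extends Serializable {
  def apply(in: T): V
}

// Run several models on one input and collect their outputs,
// analogous to using Pipeline.gather to collate ensemble results.
def gather[T, V](models: Seq[Model[T, V]]): Model[T, Seq[V]] =
  new Model[T, Seq[V]] {
    def apply(in: T): Seq[V] = models.map(_(in))
  }

// One possible combiner: majority vote over the gathered labels.
def majorityVote[T](models: Seq[Model[T, Int]]): Model[T, Int] =
  new Model[T, Int] {
    def apply(in: T): Int =
      gather(models)(in).groupBy(identity).maxBy(_._2.size)._1
  }
```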

Contributor Author

This is now tracked in #250 and #249

@etrain
Contributor

etrain commented Mar 12, 2016

Alright - code-wise, this mostly looks good. The higher-level question is how we want to represent this pipeline conceptually. Is it a flatMap with some key attached to the features that we can later group on (this is messy because it assumes labels and features have identical partitioning, etc.), or would we rather think of this in more general terms with some kind of ensembling? I think this is one of those situations where a whiteboard and an example pipeline might be useful; then we can decide whether we have sufficient tools to represent that thing or not.

@shivaram
Contributor Author

Cool - we can chat about this on Monday. I don't think we have the right machinery to do this kind of augmented ensembling cleanly yet, but we can try to use this pipeline to design things.
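The "flatMap with a key" view mentioned above can be sketched with plain Scala collections standing in for RDDs. All names here are illustrative, not the pipeline's actual API:

```scala
case class Example(id: Long, features: Vector[Double])

// Augmentation expands each example into several keyed variants
// (a toy perturbation stands in for flips/patches).
def augment(ex: Example, n: Int): Seq[(Long, Vector[Double])] =
  (0 until n).map(i => (ex.id, ex.features.map(_ + i * 0.01)))

// Score every variant, group by the original example id, average the
// per-variant score vectors, and take the arg-max as the final label.
def evaluateAugmented(
    examples: Seq[Example],
    score: Vector[Double] => Array[Double],
    nAug: Int): Map[Long, Int] =
  examples
    .flatMap(augment(_, nAug))
    .groupBy(_._1)
    .map { case (id, variants) =>
      val scores = variants.map(v => score(v._2))
      val avg = scores
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
        .map(_ / scores.size)
      id -> avg.indexOf(avg.max)
    }
```

Grouping by an explicit id sidesteps the partitioning assumption called out above, at the cost of a shuffle.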

case class RandomFlipper(flipChance: Double, seed: Long = 12334L) extends Transformer[Image, Image] {

val rnd = new scala.util.Random(seed)
Contributor

scala.util.Random isn't serializable...

Caused by: java.io.NotSerializableException: scala.util.Random
Serialization stack:
        - object not serializable (class: scala.util.Random, value: scala.util.Random@39652a30)
        - field (class: nodes.images.RandomPatcher, name: rnd, type: class scala.util.Random)
        - object (class nodes.images.RandomPatcher, <function1>)
        - field (class: nodes.images.RandomPatcher$$anonfun$apply$1, name: $outer, type: class nodes.images.RandomPatcher)
        - object (class nodes.images.RandomPatcher$$anonfun$apply$1, <function1>)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
        ... 22 more
16/03/13 23:33:37 ERROR ErrorMonitor: AssociationError [akka.tcp://[email protected]:48870] <- [ak

Contributor Author

Uh - that's a pain. Using java.util.Random now
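Why the switch helps (an illustrative check, not code from the PR): Spark's closure serializer uses Java serialization, and `java.util.Random` implements `java.io.Serializable`, while `scala.util.Random` did not on the Scala versions in use at the time, hence the NotSerializableException above:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Round-trip an object through Java serialization; throws
// NotSerializableException for types that don't support it.
def javaSerialize(obj: java.io.Serializable): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  bytes.toByteArray
}

// java.util.Random serializes without throwing, so a field of this
// type can ride along in a Transformer closure shipped to executors.
val payload = javaSerialize(new java.util.Random(12334L))
```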

@shivaram
Contributor Author

I've fixed all the comments other than the discussion about ensembling. I also introduced a new class RandomImageTransformer instead of RandomFlipper, per @ericmjonas's request.
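The generalized node could look roughly like this: rather than a flip-specific RandomFlipper, take any image transform and apply it with some probability. This is a sketch with simplified stand-in types; the real KeystoneML signatures may differ:

```scala
import java.util.Random

// Simplified stand-in for the per-item pipeline node type.
trait Transformer[A, B] extends Serializable { def apply(in: A): B }

// Apply `transform` to each input with probability `chance`.
case class RandomImageTransformer[I](
    chance: Double,
    transform: I => I,
    seed: Long = 12334L) extends Transformer[I, I] {
  val rnd = new Random(seed) // java.util.Random is Serializable
  def apply(im: I): I =
    if (rnd.nextDouble() < chance) transform(im) else im
}

// RandomFlipper then becomes just a special case, e.g.
// RandomImageTransformer(flipChance, ImageUtils.flipHorizontal)
```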

((unnormFilters(::, *) / (twoNorms + 1e-10)) * whitener.whitener.t, whitener)
}

val trainImagesAugmented = RandomImageTransformer(flipChance, ImageUtils.flipHorizontal).apply(
Contributor

is this written this way because FunctionNode isn't a first-class citizen?

Contributor Author

Yeah - you can't chain anything after function nodes right now
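A rough illustration of the limitation, with stand-in types rather than the library's actual definitions: a per-item node maps one input to one output, so nodes compose by plain function composition, while a FunctionNode maps a whole dataset to a dataset of possibly different size, which the per-item chaining machinery cannot express:

```scala
// A FunctionNode operates on a whole dataset at once.
trait FunctionNode[A, B] extends Serializable {
  def apply(in: Seq[A]): Seq[B]
}

// An augmenter emits `copies` variants per input; there is no
// one-in/one-out node equivalent to this, so nothing can be chained
// after it with the ordinary per-item andThen.
case class Augmenter(copies: Int) extends FunctionNode[Int, Int] {
  def apply(in: Seq[Int]): Seq[Int] =
    in.flatMap(x => Seq.fill(copies)(x))
}
```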

@shivaram
Contributor Author

Alright added unit tests for the two patcher nodes and one for flipHorizontal. I'm going to run the entire pipeline on the local cluster to make sure we get the same results we got for the paper. @etrain Let me know if you have any other comments on this.

val trainImages = ImageExtractor(trainData)

val patchExtractor = new Windower(conf.patchSteps, conf.patchSize)
.andThen(ImageVectorizer.apply)
Contributor

style nit - can we move the andThen to the end of the previous line and remove the spurious ., (, ) and apply here? I know you're copy/pasting old code, but I like that style better.

Contributor Author

Done
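The style in question, on a toy stand-in for the node/chaining API (`Node` and the toy `Windower`/`ImageVectorizer` bodies here are illustrative only):

```scala
trait Node[A, B] extends Serializable { self =>
  def apply(in: A): B
  // Compose two per-item nodes into one.
  def andThen[C](next: Node[B, C]): Node[A, C] = new Node[A, C] {
    def apply(in: A): C = next(self(in))
  }
}

case class Windower(steps: Int, size: Int) extends Node[Int, Int] {
  def apply(in: Int): Int = in + size // toy body
}

object ImageVectorizer extends Node[Int, Int] {
  def apply(in: Int): Int = in * 2 // toy body
}

// The nit: keep andThen on the same line and drop the spurious
// dot, parens and .apply, rather than starting a new line with
// ".andThen(ImageVectorizer.apply)".
val patchExtractor = Windower(2, 6) andThen ImageVectorizer
```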

@shivaram
Contributor Author

Other than the performance problem with repeated convolutions, the test error was fine. I got

16/03/21 01:05:04 INFO RandomPatchCifarAugmented: Test error is: 0.1833

from using 5120 filters (i.e. 40k features)


val unscaledFeaturizer =
new Convolver(filters, augmentPatchSize, augmentPatchSize, numChannels, Some(whitener), true)
.andThen(SymmetricRectifier(alpha=conf.alpha))
Contributor

The next ten or so lines have the old .andThen( syntax still.

Contributor Author

Fixed all the andThens now

@shivaram
Contributor Author

@etrain the latest commit 202317e updates the other CIFAR/MNIST pipelines to also use the style described in #258 -- tests pass, but it would still be good to get another pair of eyes on this commit.


val predictionPipeline = featurizer andThen model andThen MaxClassifier
val predictionPipeline = featurizer andThen MaxClassifier
Contributor

can we add the MaxClassifier to the end of what we're calling featurizer and change that name to predictionPipeline?

Contributor Author

Yep - done

@etrain
Contributor

etrain commented Mar 22, 2016

This LGTM, merging. Thanks @shivaram !

etrain added a commit that referenced this pull request Mar 22, 2016
Port Cifar Augmented pipeline and its nodes
@etrain etrain merged commit 4552d2c into amplab:master Mar 22, 2016