
Imagenet Pipeline #120

Merged
merged 11 commits into from
May 19, 2015

Conversation

shivaram
Contributor

Closes #11

This PR adds the SIFT + LCS + FV ImageNet pipeline. It includes changes to several components that help us avoid doing multiple SIFT passes over the data (e.g., VectorSplitter and the Sampling node).

The pipeline still looks a bit complex due to the sample re-use across PCA and GMM -- let me know if you can think of ways to simplify it.

// In-place deterministic shuffle (Fisher-Yates)
def shuffleArray[T](arr: Array[T]): Array[T] = {
  // Shuffle each row in the same fashion: a fixed seed gives a fixed permutation
  val rnd = new java.util.Random(42)
  for (i <- arr.length - 1 to 1 by -1) {
    val j = rnd.nextInt(i + 1)
    val tmp = arr(i); arr(i) = arr(j); arr(j) = tmp
  }
  arr
}
Contributor

We should probably take the seed as a parameter. Also, is Breeze's shuffle not good enough for what you're trying to do?

Contributor Author

Added a seed param. This differs from Breeze's shuffle in that I am shuffling an Array[DenseVector[Double]], which is the output of calling collect on an RDD[DenseVector[Double]].
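For illustration, a minimal plain-Scala sketch (not from the PR; `seededShuffle`, `features`, and `labels` are hypothetical names, with Array[Double] standing in for Breeze's DenseVector[Double]) of why a seeded shuffle matters here: re-running the same seeded Fisher-Yates over two parallel arrays keeps their rows aligned.

```scala
// Hypothetical sketch: a seeded Fisher-Yates shuffle keeps parallel arrays aligned.
def seededShuffle[T](arr: Array[T], seed: Int): Array[T] = {
  val rnd = new java.util.Random(seed)
  for (i <- arr.length - 1 to 1 by -1) {
    val j = rnd.nextInt(i + 1)
    val tmp = arr(i); arr(i) = arr(j); arr(j) = tmp
  }
  arr
}

val features = Array(Array(1.0), Array(2.0), Array(3.0), Array(4.0))
val labels   = Array("a", "b", "c", "d")
seededShuffle(features, seed = 42)
seededShuffle(labels, seed = 42)
// Rows and labels stay paired because both shuffles used the same seed.
```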

@shivaram
Contributor Author

I've addressed the comments and also added the SignedHellinger step after SIFT (before PCA). Note that I had to add a new BatchedHellingerMapper, and that this uses DenseMatrix[Float] (we really need the Numeric Transformer stuff urgently).
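For context, the signed Hellinger (signed square-root) mapping applies x -> sign(x) * sqrt(|x|) element-wise. A minimal sketch over a flat Array[Float] (the helper name `signedHellinger` is illustrative; the PR's BatchedHellingerMapper operates on DenseMatrix[Float] batches instead):

```scala
// Signed Hellinger mapping, element-wise: x -> sign(x) * sqrt(|x|).
// Sketch only; the PR applies this to batched DenseMatrix[Float] data.
def signedHellinger(xs: Array[Float]): Array[Float] =
  xs.map(x => (math.signum(x) * math.sqrt(math.abs(x))).toFloat)
```

For example, `signedHellinger(Array(4.0f, -9.0f))` yields `Array(2.0f, -3.0f)`: magnitudes are square-rooted while signs are preserved.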

@@ -159,8 +159,15 @@ class BlockLeastSquaresEstimator(blockSize: Int, numIter: Int, lambda: Double =
override def fit(
trainingFeatures: RDD[DenseVector[Double]],
trainingLabels: RDD[DenseVector[Double]]): BlockLinearMapper = {
val vectorSplitter = new VectorSplitter(blockSize)
Contributor

Is it a problem to have a single version of these that takes None, or does it break the Estimator API?

Contributor Author

Yeah, I tried it and it breaks the API.
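For context, a block solver like BlockLeastSquaresEstimator splits each feature vector into consecutive fixed-size blocks so the least-squares system can be solved one block at a time. A hypothetical plain-Scala sketch of the splitting step (`splitVector` is an illustrative stand-in; the real VectorSplitter operates on RDD[DenseVector[Double]]):

```scala
// Hypothetical sketch of block splitting: break a feature vector into
// consecutive blocks of `blockSize` elements (the last block may be shorter).
def splitVector(vec: Array[Double], blockSize: Int): Seq[Array[Double]] =
  vec.grouped(blockSize).toSeq
```

Splitting a 5-element vector with blockSize = 2 gives three blocks of sizes 2, 2, and 1.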

@etrain
Contributor

etrain commented May 19, 2015

Awesome stuff @shivaram! I had a few minor things - one about refactoring to reuse some code in the SIFT/LCS pipeline and one about not reimplementing shuffle. If you want to merge this as-is and save those for a future PR, I'm good with this!

@shivaram
Contributor Author

Alright, merging this to hit milestone 0.1!

shivaram added a commit that referenced this pull request May 19, 2015
@shivaram shivaram merged commit e221536 into amplab:master May 19, 2015