prashan_pul #1
base: master
Commits on Jan 13, 2014
- `97cd27e`
- `27311b1`
- `8038da2` Merge pull request #2 from jegonzal/GraphXCCIssue: Improving documentation and identifying potential bug in CC calculation.
- `30328c3` Updated JavaStreamingContext to make scaladoc compile.
  `sbt/sbt doc` used to fail; this fixed it.
Commits on Jan 14, 2014
- `e2d25d2`
- `01c0d72` Merge pull request #410 from rxin/scaladoc1: Updated JavaStreamingContext to make scaladoc compile. `sbt/sbt doc` used to fail; this fixed it.
- `dc041cd`
- `c0bb38e`
- `1bd5cef`
- `ae4b75d` Add EdgeDirection.Either and use it to fix CC bug.
  The bug was due to a misunderstanding of the activeSetOpt parameter to Graph.mapReduceTriplets. Passing EdgeDirection.Both causes mapReduceTriplets to run only on edges with *both* vertices in the active set. This commit adds EdgeDirection.Either, which causes mapReduceTriplets to run on edges with *either* vertex in the active set. This is what connected components needed.
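For readers following along, here is a minimal sketch of connected components written against the Pregel interface with the new direction. It assumes 2014-era GraphX (using the later `VertexId` spelling) and is illustrative rather than the exact patch:

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Sketch: label each vertex with its own id, then propagate the minimum id.
// activeDirection = EdgeDirection.Either makes the send phase run on any edge
// with *either* endpoint in the active set -- the behavior the fix introduces.
def connectedComponents[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED]): Graph[VertexId, ED] = {
  val ccGraph = graph.mapVertices { case (vid, _) => vid }
  def sendMessage(edge: EdgeTriplet[VertexId, ED]) = {
    if (edge.srcAttr < edge.dstAttr) Iterator((edge.dstId, edge.srcAttr))
    else if (edge.srcAttr > edge.dstAttr) Iterator((edge.srcId, edge.dstAttr))
    else Iterator.empty
  }
  Pregel(ccGraph, Long.MaxValue, activeDirection = EdgeDirection.Either)(
    vprog = (_, attr, msg) => math.min(attr, msg),
    sendMsg = sendMessage,
    mergeMsg = (a, b) => math.min(a, b))
}
```

With `EdgeDirection.Both`, an edge whose newly-updated endpoint faces a quiescent endpoint would never fire, so minimum labels could stop propagating before convergence.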
- `cfe4a29` Improvements in example code for the programming guide, as well as adding serialization support for GraphImpl to address issues with failed closure capture.
- `1233b3d`
- `02a8f54`
- `a4e12af` Merge branch 'graphx' of github.com:ankurdave/incubator-spark into graphx.
  Conflicts: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
- `87f335d`
- `ae06d2c`
- `1dce9ce`
- `79a5ba3`
- `161ab93`
- `622b7f7`
- `552de5d`
- `4c22c55`
- `8e5c732`
- `9317286`
- `0b18bfb`
- `0fbc0b0`
- `d4cd5de`
- `ee8931d`
- `9e84e70` Add default value for HadoopRDD's `cloneRecords` constructor arg, to maintain backwards compatibility.
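The compatibility trick here is plain Scala default parameters. A self-contained miniature (class name and paths hypothetical, not Spark's actual HadoopRDD signature):

```scala
// Giving the new constructor parameter a default means pre-existing call
// sites that omit it keep compiling and keep their old behavior.
class RecordReader(val path: String, val cloneRecords: Boolean = true)

object CompatDemo extends App {
  val legacy = new RecordReader("hdfs:///data/part-00000")         // old call site, unchanged
  val optOut = new RecordReader("hdfs:///data/part-00000", false)  // new call site
  println(s"${legacy.cloneRecords} ${optOut.cloneRecords}")        // true false
}
```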
- `a2fee38` Merge pull request #411 from tdas/filestream-fix: Improved logic of finding new files in FileInputDStream.
  Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T sec) at time T + 1 sec, fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds; that is, even if a file is reported at T + <batch time> sec, the file stream should be able to catch it. The new logic, at a high level, is as follows: it keeps track of the new files it found in the previous interval and the mod time of the oldest of those files (let's call it X). In the current interval, it ignores files that were seen in the previous interval and files whose mod time is older than X. So if a new file is reported by HDFS in the current interval but has a mod time within the previous interval, it will be considered; if its mod time is earlier than the previous interval (that is, earlier than X), it will be ignored. This is the current limitation, and a future version will improve this behavior. Also reduced line lengths in DStream to <=100 chars.
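The selection rule described above reduces to a small predicate. A hedged sketch (names illustrative, not the actual FileInputDStream internals):

```scala
// Accept a file iff (1) its mod time is not older than the oldest mod time
// seen among the previous interval's files ("X" above), and (2) it was not
// already selected in the previous interval.
def shouldSelect(path: String, modTime: Long,
                 prevIntervalFiles: Set[String], oldestPrevModTime: Long): Boolean =
  modTime >= oldestPrevModTime && !prevIntervalFiles.contains(path)
```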
- `33022d6`
- `b07bc02` Merge pull request #412 from harveyfeng/master: Add default value for HadoopRDD's `cloneRecords` constructor arg.
  Small mend to https://github.com/apache/incubator-spark/pull/359/files#diff-1 for backwards compatibility.
- `cc93c2a`
- `8399341`
- `d4d9ece`
- `84d6af8`
- `c6023be` Fix infinite loop in GraphGenerators.generateRandomEdges.
  The loop occurred when numEdges < numVertices. This commit fixes it by allowing generateRandomEdges to generate a multigraph.
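A sketch of why allowing a multigraph removes the loop: sampling destination vertices *with replacement* never has to wait for `numEdges` distinct neighbors. The signature below is approximated from the description, not copied from GraphGenerators:

```scala
import scala.util.Random
import org.apache.spark.graphx.Edge

// With replacement, repeated (src, dst) pairs are allowed, so generating
// numEdges edges always terminates even when numEdges < numVertices.
def randomEdges(src: Long, numEdges: Int, maxVertexId: Int,
                rng: Random = new Random): Array[Edge[Int]] =
  Array.fill(numEdges)(Edge(src, rng.nextInt(maxVertexId).toLong, 1))
```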
- `59e4384`
- `c28e5a0`
- `e14a14b`
- `67795db`
- `6f6f8c9`
- `c6dbfd1`
- `76ebdae`
- `08b9fec` Merge pull request #409 from tdas/unpersist: Automatically unpersisting RDDs that have been cleaned up from DStreams.
  Earlier, RDDs generated by DStreams were forgotten but not unpersisted; the system relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a hammer for cleaning up RDDs, but it has to be set separately and very conservatively (at best, a few minutes). Automatic unpersisting lets the system handle this itself, which reduces memory usage. As a side effect it also improves GC performance, as fewer objects are stored in memory. In fact, for some workloads, it may allow RDDs to be cached as deserialized, which speeds up processing without too much GC overhead. This is disabled by default; to enable it, set the configuration spark.streaming.unpersist to true (a sketch follows this entry). A future release will set it to true by default. Also reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second; from my conversation with Matei, there does not seem to be any good reason for the sleep (for letting messages be sent out) to be so long.
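As promised above, a minimal sketch of opting in, assuming the 0.9-era `SparkConf`-based constructors:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark.streaming.unpersist is off by default in this release; setting it to
// true lets Spark Streaming unpersist RDDs it has dropped from its DStreams.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("UnpersistDemo")
  .set("spark.streaming.unpersist", "true")
val ssc = new StreamingContext(conf, Seconds(1))
```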
- `2cd9358`
- `af645be`
- `0ca0d4d` Merge pull request #401 from andrewor14/master: External sorting - add number of bytes spilled to Web UI.
  Additionally, update the test suite for external sorting to induce spilling.
- `0d94d74`
- `12386b3` Since getLong() and getInt() have side effects, restore their parentheses, and remove an empty line.
- `68641bc` Merge pull request #413 from rxin/scaladoc: Adjusted visibility of various components and documentation for the 0.9.0 release.
- `4bafc4f`
- `945fe7a` Merge pull request #408 from pwendell/external-serializers: Improvements to external sorting.
  1. Adds the option of compressing outputs.
  2. Adds batching to the serialization to prevent OOM on the read side.
  3. Slight renaming of config options.
  4. Use Spark's buffer size for reads in addition to writes.
- `80e73ed`
- `4a805af` Merge pull request #367 from ankurdave/graphx: GraphX: Unifying Graphs and Tables.
  GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph-structured data at scale. See http://amplab.github.io/graphx/. Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak. Tasks left:
  - [x] Graph-level uncache
  - [x] Uncache previous iterations in Pregel
  - [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
  - [x] Describe GC issue with GraphLab
  - [ ] Write `docs/graphx-programming-guide.md`
  - [x] Mention future Bagel support in docs
  - [ ] Section on caching/uncaching in docs: as with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again (see the sketch after this entry).
  - [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
  - [x] Make Graph serializable to work around capture in Spark shell
  - [x] Rename graph -> graphx in package name and subproject
  - [x] Remove standalone PageRank
  - [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
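The caching/uncaching item in the task list can be made concrete. A minimal sketch of the per-iteration discipline, where `step` stands in for one iteration of any graph algorithm (names illustrative):

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

def iterate[VD: ClassTag, ED: ClassTag](input: Graph[VD, ED], numIters: Int)(
    step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var g = input.cache()
  for (_ <- 1 to numIters) {
    val next = step(g).cache()                // cache the new iterate...
    next.vertices.count(); next.edges.count() // ...and force (materialize) it
    g.unpersistVertices(blocking = false)     // drop what won't be referenced again
    g = next
  }
  g
}
```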
- `c2852cf`
- `fdaabdc` Merge pull request #380 from mateiz/py-bayes: Add Naive Bayes to Python MLlib, and some API fixes.
  - Added a Python wrapper for Naive Bayes.
  - Updated the Scala Naive Bayes to better match the style of our other algorithms, and in particular to make it easier to call from Java (added builder pattern, removed default value in the train method).
  - Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives.
  - Added a toString method in LabeledPoint.
  - Made the Python MLlib tests run as part of run-tests as well (before, they could only be run individually through each file).
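A hedged sketch of the resulting Scala-side call style, assuming the 0.9-era MLlib API where features were plain `Array[Double]` (file path and format illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint

object NaiveBayesExample extends App {
  val sc = new SparkContext("local", "NaiveBayesExample")
  val data = sc.textFile("data/naive_bayes.csv").map { line =>
    val values = line.split(',').map(_.toDouble)
    LabeledPoint(values.head, values.tail) // first column is the label
  }
  val model = NaiveBayes.train(data, lambda = 1.0) // smoothing passed explicitly
  println(model.predict(Array(0.0, 1.0, 0.0)))
  sc.stop()
}
```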
- `4e497db` Removed StreamingContext.registerInputStream and registerOutputStream; they were useless, as InputDStream has been made to register itself. Also made DStream.register() private[streaming], since it is not useful to expose the confusing function. Updated a lot of documentation.
- `0984647`
- `055be5c` Merge pull request #415 from pwendell/shuffle-compress: Enable compression by default for spills.
- `a3da468`
- `845e568`
- `f8e239e` Merge remote-tracking branch 'apache/master' into filestream-fix.
  Conflicts: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
- `f8bd828`
- `980250b` Merge pull request #416 from tdas/filestream-fix: Removed unnecessary DStream operations and updated docs.
  Removed StreamingContext.registerInputStream and registerOutputStream; they were useless. InputDStream has been made to register itself, and just registering a DStream as an output stream causes RDD objects to be created but never computed. Also made DStream.register() private[streaming] for the same reasons. Updated docs, especially adding package documentation for the streaming package. Also changed NetworkWordCount's input storage level to MEMORY_ONLY, since replication on the local machine causes warning messages (as replication fails), which is scary for a new user trying out his or her first example.
- `1442cd5` Modifications as suggested in PR feedback:
  - more variants of mapPartitions added to JavaRDDLike
  - move setGenerator to JavaRDDLike
  - clean up
  Saurabh Rawat committed Jan 14, 2014
- `2303479`
- `fa75e5e` Merge pull request #420 from pwendell/header-files: Add missing header files.
- `57fcfc7`
- `486f37c`
- `3fcc68b` Merge pull request #423 from jegonzal/GraphXProgrammingGuide: Improving the graphx-programming-guide.
  This PR tracks a few minor improvements to the content and formatting of the graphx-programming-guide.
- `0bba773`
- `71b3007` Broadcast variable visibility change & doc update.
  Note that previously the Broadcast class was accidentally marked private[spark]. It needs to be public for broadcast variables to work. Also exposing the broadcast variable id.
- `6a12b9e`
- `f8c12e9`
- `55db774`
- `1b5623f` Maintain Serializable API compatibility by reverting back to java.io.Serializable for Broadcast and Accumulator.
- `f12e506`
- `6f965a4`
- `938e4a0`
- `b683608`
- `5b3a3e2`
- `2ce23a5` Merge pull request #425 from rxin/scaladoc: API doc update & make Broadcast public.
  In #413 Broadcast was mistakenly made private[spark]. I changed it back to public. Also exposing id in public, given the R frontend requires that. Copied some of the documentation from the programming guide to the API doc for Broadcast and Accumulator. This should be cherry-picked into branch-0.9 as well for the 0.9.0 release.
- `8ea2cd5`
- `b1b22b7`
- `8ea056d`
- `d601a76` Merge pull request #427 from pwendell/deprecate-aggregator: Deprecate rather than remove old combineValuesByKey function.
- `193a075` Merge pull request #429 from ankurdave/graphx-examples-pom.xml: Add GraphX dependency to examples/pom.xml.
- `74b46ac` Merge pull request #428 from pwendell/writeable-objects: Don't clone records for text files.
Commits on Jan 15, 2014
- `1210ec2`
- `ad294db` Merge pull request #431 from ankurdave/graphx-caching-doc: Describe caching and uncaching in the GraphX programming guide.
- `3a386e2` Merge pull request #424 from jegonzal/GraphXProgrammingGuide: Additional edits for clarity in the graphx programming guide.
  Added an overview of the Graph and GraphOps functions and fixed numerous typos.
- `148757e`
- `f4d9019`
- `147a943`
- `dfb1524`
- `1f4718c` Changed SparkConf to not be serializable. Also fixed unit-test log paths in log4j.properties of external modules.
- `0e15bd7`
- `087487e` Merge pull request #434 from rxin/graphxmaven: Fixed SVDPlusPlusSuite in Maven build.
  This should go into 0.9.0 also.
- `139c24e` Merge pull request #435 from tdas/filestream-fix: Fixed the flaky tests by making SparkConf not serializable.
  SparkConf was being serialized with CoGroupedRDD and Aggregator, which somehow caused OptionalJavaException while being deserialized as part of a ShuffleMapTask. SparkConf should not even be serializable (according to a conversation with Matei). This change fixes that. @mateiz @pwendell
- `0aea33d` Expose method and class, so that we can use them from user code (particularly since the checkpoint directory is autogenerated now).
- `3d9e66d` Merge pull request #436 from ankurdave/VertexId-case: Rename VertexID -> VertexId in GraphX.
remove "-XX:+UseCompressedStrings" option
remove "-XX:+UseCompressedStrings" option from tuning guide since jdk7 no longer supports this.
Configuration menu - View commit details
-
Copy full SHA for 263933d - Browse repository at this point
Copy the full SHA 263933dView commit details -
- `cef2af9` Merge pull request #366 from colorant/yarn-dev: More yarn code refactoring.
  Try to retrieve common code in yarn alpha/stable for Client and WorkerRunnable to reduce duplicated code, by putting it into a trait in the common dir and extending it. The same could be done for the remaining files in alpha/stable, but those have much more overlapping code with different API calls here and there within functions, and would need a much closer review; it might also divide functions into overly small pieces and thus might not deserve to be done this way. So just make it work for these two files first.
- `494d3c0` Merge pull request #433 from markhamstra/debFix: Updated Debian packaging.
- `9259d70`
- `00a3f7e`
- `5fecd25` Merge pull request #441 from pwendell/graphx-build: GraphX shouldn't list Spark as provided.
  I noticed this when building an application against GraphX to audit the released artifacts.
- `9e63753` Made some classes private[streaming] and deprecated a method in JavaStreamingContext.
- `2a05403` Merge pull request #443 from tdas/filestream-fix: Made some classes private[streaming] and deprecated a method in JavaStreamingContext.
  The classes `RawTextHelper`, `RawTextSender` and `RateLimitedOutputStream` are not useful in the streaming API; they are not used by the core functionality and were there as support classes for an obscure example. One of them, RawTextSender, has a main function which can be executed using bin/spark-class even if it is made private[streaming]. In the future I will probably remove these classes completely; for the time being, I am just converting them to private[streaming]. Accessing the underlying JavaSparkContext in JavaStreamingContext was through `JavaStreamingContext.sc`; this is deprecated, and the preferred method is `JavaStreamingContext.sparkContext`, to keep it consistent with `StreamingContext.sparkContext`.
- `59f475c` Merge pull request #442 from pwendell/standalone: Workers should use the working directory as spark home if it's not specified.
  If users don't set SPARK_HOME in their environment file when launching an application, the standalone cluster should default to the spark home of the worker.
- `2ffdaef`
- `4f0c361` Merge pull request #444 from mateiz/py-version: Clarify that Python 2.7 is only needed for MLlib.
Commits on Jan 16, 2014
- `a268d63` Fail rather than hanging if a task crashes the JVM.
  Prior to this commit, if a task crashed the JVM, the task (and all other tasks running on that executor) was marked as KILLED rather than FAILED. As a result, the TaskSetManager would retry the task indefinitely rather than failing the job after maxFailures. This commit fixes that problem by marking tasks as FAILED rather than KILLED when an executor is lost. The downside of this commit is that if task A fails because another task running on the same executor caused the VM to crash, the failure will incorrectly be counted as a failure of task A. This should not be an issue because we typically set maxFailures to 3, and it is unlikely that a task will be co-located with a JVM-crashing task multiple times.
- `0675ca5` Merge pull request #439 from CrazyJvm/master: SPARK-1024 Remove "-XX:+UseCompressedStrings" option from tuning guide.
  Removes the "-XX:+UseCompressedStrings" option from the tuning guide, since JDK 7 no longer supports it.
fix "set MASTER automatically fails" bug.
spark-shell intends to set MASTER automatically if we do not provide the option when we start the shell , but there's a problem. The condition is "if [[ "x" != "x$SPARK_MASTER_IP" && "y" != "y$SPARK_MASTER_PORT" ]];" we sure will set SPARK_MASTER_IP explicitly, the SPARK_MASTER_PORT option, however, we probably do not set just using spark default port 7077. So if we do not set SPARK_MASTER_PORT, the condition will never be true. We should just use default port if users do not set port explicitly I think.
Configuration menu - View commit details
-
Copy full SHA for 7a0c5b5 - Browse repository at this point
Copy the full SHA 7a0c5b5View commit details -
- `8400536`
- `84595ea` Merge pull request #414 from soulmachine/code-style: Code clean-up for mllib.
  - Removed unnecessary parentheses
  - Removed unused imports
  - Simplified `filter...size()` to `count ...`
  - Removed obsolete parameter comments
- `718a13c`
- `c06a307` Merge pull request #445 from kayousterhout/exec_lost: Fail rather than hanging if a task crashes the JVM.
  Prior to this commit, if a task crashed the JVM, the task (and all other tasks running on that executor) was marked as KILLED rather than FAILED. As a result, the TaskSetManager would retry the task indefinitely rather than failing the job after maxFailures. Eventually this made the job hang, because the standalone scheduler removes the application after 10 workers have failed, and then the app is left in a state where it's disconnected from the master and waiting to reconnect. This commit fixes that problem by marking tasks as FAILED rather than KILLED when an executor is lost. The downside of this commit is that if task A fails because another task running on the same executor caused the VM to crash, the failure will incorrectly be counted as a failure of task A. This should not be an issue because we typically set maxFailures to 3, and it is unlikely that a task will be co-located with a JVM-crashing task multiple times.
- `4e510b0` Fixed Windows spark shell launch script error.
  JIRA SPARK-1029: https://spark-project.atlassian.net/browse/SPARK-1029
- `1a0da89`
- `edd82c5`
- `11e6534` Updated Java API docs for streaming, along with very minor changes in the code examples.
Commits on Jan 17, 2014
- `fcb4fc6`
- `d4fd89e` Merge pull request #438 from ScrapCodes/clone-records-java-api: Clone records Java API.
- `d749d47` Merge pull request #451 from Qiuzhuang/master: Fixed Windows spark shell launch script error.
  JIRA SPARK-1029: https://spark-project.atlassian.net/browse/SPARK-1029
- `b690e11`
- `d28bf41`
- `cb13b15`
- `eb2d8c4`
- `dbec69b`
- `c9b4845`
- `5c639d7`
- `4e96757`
- `caf97a2`
- `fa32998`
- `85b95d0`
Commits on Jan 18, 2014
- `e91ad3f`
- `5316bca`
- `aa981e4` Merge pull request #461 from pwendell/master: Use renamed shuffle spill config in CoGroupedRDD.scala.
  This one got missed when it was renamed.
- `fd833e7` Allow files added through SparkContext.addFile() to be overwritten.
  This is useful for cases when a file needs to be refreshed and re-downloaded by the executors periodically. Signed-off-by: Yinan Li <[email protected]>
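A hedged sketch of the intended use, assuming the overwrite behavior is gated by a `spark.files.overwrite` property as in this era's configuration docs (paths illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("RefreshableFileDemo")
  .set("spark.files.overwrite", "true") // allow a re-added file to replace the old copy

val sc = new SparkContext(conf)
sc.addFile("hdfs:///config/lookup-table.txt") // re-adding the same name later refreshes it
```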
Commits on Jan 19, 2014
- `bf56995` Merge pull request #462 from mateiz/conf-file-fix: Remove Typesafe Config usage and conf files to fix nested property names.
  With Typesafe Config we had the subtle problem of no longer allowing nested property names, which are used for a few of our properties: http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html This PR is for branch 0.9 but should be added into master too. (cherry picked from commit 34e911c) Signed-off-by: Patrick Wendell <[email protected]>
- `4c16f79` Merge pull request #426 from mateiz/py-ml-tests: Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+).
  We disabled these earlier because Jenkins didn't have these versions.
- `73dfd42` Merge pull request #437 from mridulm/master: Minor API usability changes.
  - Expose checkpoint directory, since it is autogenerated now
  - Null check for jars
  - Expose SparkHadoopUtil, so that configuration creation is abstracted even from user code, to avoid duplicating functionality already in Spark
- `fe8a354` Merge pull request #459 from srowen/UpdaterL2Regularization: Correct L2 regularized weight update with canonical form.
  Per a thread on the user@ mailing list, and comments from Ameet, I believe the weight update for L2 regularization needs to be corrected (see the note after this entry). See http://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAH3_EVMetuQuhj3__NdUniDLc4P-FMmmrmxw9TS14or8nT4BNQ%40mail.gmail.com%3E
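For reference, the canonical L2-regularized gradient step the correction targets, in standard notation with step size $\alpha_t$ and regularization strength $\lambda$; this is textbook form, not a quote from the patch:

```latex
w_{t+1} = w_t - \alpha_t \bigl( \nabla L(w_t) + \lambda w_t \bigr)
        = (1 - \alpha_t \lambda)\, w_t - \alpha_t \nabla L(w_t)
```

That is, the regularization term shrinks the current weights multiplicatively before the loss gradient is applied.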
- `584323c` Addressed comments from Reynold.
  Signed-off-by: Yinan Li <[email protected]>
- `720836a`
- `ceb79a3`
- `dd56b21`
- `256a355` Merge pull request #458 from tdas/docs-update: Updated Java API docs for streaming, along with very minor changes in the code examples.
  Docs updated for Scala (StreamingContext, DStream, PairDStreamFunctions) and Java (JavaStreamingContext, JavaDStream, JavaPairDStream). Examples updated: JavaQueueStream no longer uses a deprecated method; ActorWordCount uses the public interface the right way.
- `792d908` Merge pull request #470 from tgravescs/fix_spark_examples_yarn: Only log an error on a missing jar so the Spark examples can run on YARN.
  Right now, to run the Spark examples on YARN you have to use the --addJars option and put the jar in HDFS. To make that nicer, so the user doesn't have to specify the --addJars option, change it to simply log an error instead of throwing.
Commits on Jan 20, 2014
- `f9a95d6`
change TestClient & Worker to Some("xxx") kill manager if it is started remove unnecessary .get when fetch "SPARK_HOME" values
Configuration menu - View commit details
-
Copy full SHA for 29f4b6a - Browse repository at this point
Copy the full SHA 29f4b6aView commit details -
- `3e85b87`
Commits on Jan 21, 2014
- `cdb003e`
- `54867e9`
- `1b29914`
- `c324ac1`
- `f84400e`
- `de526ad`
- `d46df96`
- `2e95174`
- `e437069` Restricting /lib to top-level directory in .gitignore.
  This patch was proposed by Sean Mackrory.
- `e0b741d`
- `7373ffb` Merge pull request #483 from pwendell/gitignore: Restricting /lib to top-level directory in .gitignore.
  This patch was proposed by Sean Mackrory.
- `0367981` Merge pull request #482 from tdas/streaming-example-fix: Added StreamingContext.awaitTermination to streaming examples.
  StreamingContext.start() currently starts a non-daemon thread which prevents termination of a Spark Streaming program even if the main function has exited. Since the expected behavior of a streaming program is to run until explicitly killed, this was sort of fine when Spark Streaming applications were launched from the command line. However, when launched in Yarn-standalone mode, this did not work, as the driver effectively got terminated when the main function exited. So Spark Streaming examples did not work on Yarn. This addition to the examples ensures that they work on Yarn, and also teaches everyone that StreamingContext.awaitTermination() is necessary for Spark Streaming programs to keep running. The true bug fix, making sure all threads started by Spark Streaming are daemon threads, is left for post-0.9.
- `6b4eed7` Merge pull request #449 from CrazyJvm/master: SPARK-1028: fix "set MASTER automatically fails" bug.
  spark-shell intends to set MASTER automatically if we do not provide the option when we start the shell, but there's a problem. The condition is `if [[ "x" != "x$SPARK_MASTER_IP" && "y" != "y$SPARK_MASTER_PORT" ]];`. We will surely set SPARK_MASTER_IP explicitly, but we probably do not set SPARK_MASTER_PORT, instead just using Spark's default port 7077. So if we do not set SPARK_MASTER_PORT, the condition is never true. I think we should just use the default port if users do not set the port explicitly.
- `a917a87`
- `65869f8`
- `c67d3d8` Merge pull request #484 from tdas/run-example-fix: Made run-example respect SPARK_JAVA_OPTS and SPARK_MEM.
  The bin/run-example script was not passing Java properties set through SPARK_JAVA_OPTS to the example. This is important for examples like Twitter**, as the Twitter authentication information must be set through Java properties. Hence added the same JAVA_OPTS code to run-example as is in the bin/spark-class script. Also added SPARK_MEM, in case someone wants to run the example with a different amount of memory. This can be removed if it is not in tune with the intended semantics of the run-example scripts. @matei Please check this soon; I want this to go in 0.9-rc4.
- `a9bcc98`
- `77b986f` Merge pull request #480 from pwendell/0.9-fixes: Handful of 0.9 fixes.
  This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate. @mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in #477 and follow-up discussion with him.
- `adf4261` Incorporate Tom's comments: update doc and code to reflect that core requests may not always be honored.
- `3a067b4`
- `f854498` Merge pull request #469 from ajtulloch/use-local-spark-context-in-tests-for-mllib: [MLlib] Use a LocalSparkContext trait in test suites.
  Replaces the 9 instances of

  ```scala
  class XXXSuite extends FunSuite with BeforeAndAfterAll {
    @transient private var sc: SparkContext = _

    override def beforeAll() {
      sc = new SparkContext("local", "test")
    }

    override def afterAll() {
      sc.stop()
      System.clearProperty("spark.driver.port")
    }
  ```

  with

  ```scala
  class XXXSuite extends FunSuite with LocalSparkContext {
  ```
- `069bb94` Clarify spark.default.parallelism.
  It's the task count across the cluster, not per worker, per machine, per core, or anything else.
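To make the clarified semantics concrete, a small sketch: the value is the cluster-wide default partition count, not a per-worker setting (values illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("ParallelismDemo")
  .set("spark.default.parallelism", "200") // total default tasks across the cluster

val sc = new SparkContext(conf)
println(sc.defaultParallelism) // 200: the default for shuffles and ops like groupByKey
```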
- `749f842` Merge pull request #489 from ash211/patch-6: Clarify spark.default.parallelism.
  It's the task count across the cluster, not per worker, per machine, per core, or anything else.
Commits on Jan 22, 2014
- `90ea9d5` Replace the code that checks for Option != None with an Option.isDefined call in Scala code.
  This hopefully will make the code cleaner.
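The cleanup in miniature, as a runnable sketch:

```scala
object OptionStyle extends App {
  val maybePort: Option[Int] = Some(7077)

  if (maybePort != None) println(s"before: ${maybePort.get}")   // works, but indirect
  if (maybePort.isDefined) println(s"after: ${maybePort.get}")  // states the intent
}
```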
- `36f9a64`
- `19da82c` Fixed bug where task set managers are added to queue twice.
  This bug leads to a small performance hit, because task set managers will get offered each rejected resource offer twice, but it doesn't lead to any incorrect functionality.
- `d009b17` Merge pull request #315 from rezazadeh/sparsesvd: Sparse SVD.

  # Singular Value Decomposition
  Given an *m x n* matrix *A*, compute matrices *U, S, V* such that *A = U * S * V^T*. There is no restriction on m, but we require n^2 doubles to fit in memory. Further, n should be less than m. The decomposition is computed by first computing *A^T A = V S^2 V^T*, computing the SVD locally on that (since *n x n* is small), from which we recover *S* and *V*. Then we compute *U* via easy matrix multiplication as *U = A * V * S^-1*. Only the singular vectors associated with the largest k singular values are returned. If there are k such values, then the dimensions of the return will be:
  * *S* is *k x k* and diagonal, holding the singular values on the diagonal.
  * *U* is *m x k* and satisfies *U^T U = eye(k)*.
  * *V* is *n x k* and satisfies *V^T V = eye(k)*.
  All input and output is expected in sparse matrix format, 0-indexed as tuples of the form ((i,j),value), all in RDDs.

  # Testing
  Tests included. They test:
  - Decomposition promise (A = USV^T)
  - For small matrices, output is compared to that of jblas
  - Rank-1 matrix test included
  - Full-rank matrix test included
  - Middle-rank matrix forced via k included

  # Example Usage

      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.linalg.SVD
      import org.apache.spark.mllib.linalg.SparseMatrix
      import org.apache.spark.mllib.linalg.MatrixEntry

      // Load and parse the data file
      val data = sc.textFile("mllib/data/als/test.data").map { line =>
        val parts = line.split(',')
        MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
      }
      val m = 4
      val n = 4

      // recover top 1 singular vector
      val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)

      println("singular values = " + decomposed.S.data.toArray.mkString)

  # Documentation
  Added to docs/mllib-guide.md
- `5bcfd79` Merge pull request #493 from kayousterhout/double_add: Fixed bug where task set managers are added to queue twice.
  @mateiz can you verify that this is a bug and wasn't intentional? (https://github.com/apache/incubator-spark/commit/90a04dab8d9a2a9a372cea7cdf46cc0fd0f2f76c#diff-7fa4f84a961750c374f2120ca70e96edR551) This bug leads to a small performance hit, because task set managers will get offered each rejected resource offer twice, but it doesn't lead to any incorrect functionality. Thanks to @hdc1112 for pointing this out.
- `576c4a4` Merge pull request #478 from sryza/sandy-spark-1033: SPARK-1033. Ask for cores in Yarn container requests.
  Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core.
- `fd0c5b8` Depend on Commons Math explicitly instead of accidentally getting it from Hadoop (which stops working in 2.2.x), and also use the newer commons-math3.
- `a1238bb` Merge pull request #492 from skicavs/master: Fixed job name and usage information for the JavaSparkPi example.
- `4476398`
- `3184fac` Merge pull request #495 from srowen/GraphXCommonsMathDependency: Fix graphx Commons Math dependency.
  `graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`, but the module doesn't declare this dependency. It happens to work because it is included by Hadoop artifacts; however, that stopped being true as of a month or so ago, and building against recent Hadoop would fail. (That's how we noticed.) The simple fix is to declare the dependency, as it should be. It's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where the newer 3.x releases are: a drop-in replacement, but a different artifact and package name. Changing this one usage to `commons-math3` works, tests pass, and it isn't surprising that it does, so it is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.) It does raise another question, though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general-purpose vector/matrix work. Should `graphx` really use Commons Math for this? Beyond the tiny scope here, but worth asking.
Commits on Jan 23, 2014
- `2b3c461`
- `6285513` Fix bug in worker clean-up in UI.
  Introduced in d5a96fe. This should be picked into 0.8 and 0.9 as well.
- `034dce2` Merge pull request #447 from CodingCat/SPARK-1027: fix for SPARK-1027 (https://spark-project.atlassian.net/browse/SPARK-1027).
  Fixes:
  1. change sparkHome from String to Option[String] in ApplicationDesc
  2. remove the sparkHome parameter in the LaunchExecutor message
  3. adjust involved files
- `a1cd185`
- `cc0fd33`
- `a5a513e`
- `19a01c1`
fixed ClassTag in mapPartitions
eklavya committed Jan 23, 2014
Commit 60e7457
Merge pull request #499 from jianpingjwang/dev1
Replace commons-math with jblas in SVDPlusPlus
Commit a2b47da
Merge pull request #406 from eklavya/master
Extending Java API coverage. Hi, I have added three new methods to JavaRDD. Please review and merge.
Commit fad6aac
Commit 0035dbb
Commit 6156990
Commits on Jan 24, 2014
Remove Hadoop object cloning and warn users making Hadoop RDD's.
The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.
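For background, Hadoop RecordReaders reuse a single Writable instance across records, which is why uncopied Hadoop data can turn up duplicated once cached or collected. A minimal sketch of the user-side copying the warning describes, with a hypothetical path and standard text-file types:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical workaround sketch: Hadoop reuses the same Writable objects
// across records, so copy values into fresh, immutable JVM values before
// caching. The path below is illustrative.
def cachedLines(sc: SparkContext): RDD[(Long, String)] =
  sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///tmp/input")
    .map { case (offset, text) => (offset.get, text.toString) } // copy here
    .cache()
```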
Commit 7101017
Fix bug on read-side of external sort when using Snappy.
This case wasn't handled correctly and this patch fixes it.
Commit 0213b40
Commit c58d4ea
Commit f830684
Commit 268ecbd
Merge pull request #501 from JoshRosen/cartesian-rdd-fixes
Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034. This pull request fixes two bugs in PySpark's `cartesian()` method:
- [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws a ClassCastException
- [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark cartesian result
The JIRAs have more details describing the fixes.
Commit cad3002
Merge pull request #502 from pwendell/clone-1
Remove Hadoop object cloning and warn users making Hadoop RDD's. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.
Commit c319617
Commit ff44732
Merge pull request #503 from pwendell/master
Fix bug on read-side of external sort when using Snappy. This case wasn't handled correctly and this patch fixes it.
Commit 3d6e754
Deprecate mapPartitionsWithSplit in PySpark.
Also, replace the last reference to it in the docs. This fixes SPARK-1026.
Commit 4cebb79
Merge pull request #505 from JoshRosen/SPARK-1026
Deprecate mapPartitionsWithSplit in PySpark (SPARK-1026). This commit deprecates `mapPartitionsWithSplit` in PySpark (see [SPARK-1026](https://spark-project.atlassian.net/browse/SPARK-1026)) and removes the remaining references to it from the docs.
Commit 05be704
Commits on Jan 26, 2014
Increase JUnit test verbosity under SBT.
Upgrade junit-interface plugin from 0.9 to 0.10. I noticed that the JavaAPISuite tests didn't appear to display any output locally or under Jenkins, making it difficult to know whether they were running. This change increases the verbosity to more closely match the ScalaTest tests.
Commit 531d9d7
Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)
This fixes an issue where collectAsMap() could fail when called on a JavaPairRDD that was derived by transforming a non-JavaPairRDD. The root problem was that we were creating the JavaPairRDD's ClassTag by casting a ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]]. To fix this, I cast a ClassTag[Tuple2[_, _]] instead, since this actually produces a ClassTag of the appropriate type because ClassTags don't capture type parameters:

scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res8: Boolean = true

scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res9: Boolean = false
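To make the ClassTag trick above concrete, here is a minimal sketch of the casting pattern (the helper name is illustrative, not the actual Spark code):

```scala
import scala.reflect.ClassTag

// Because ClassTags don't capture type parameters, a ClassTag[Tuple2[_, _]]
// can safely be viewed as a ClassTag[(K, V)] for any K and V, and it still
// reports Tuple2 as the runtime class, unlike a cast from ClassTag[AnyRef].
def pairClassTag[K, V]: ClassTag[(K, V)] =
  implicitly[ClassTag[Tuple2[_, _]]].asInstanceOf[ClassTag[(K, V)]]

// pairClassTag[Int, String].runtimeClass == classOf[Tuple2[_, _]]
```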
Commit 740e865
Merge pull request #511 from JoshRosen/SPARK-1040
Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040). This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with a ClassCastException. I applied the same fix to the Spark Streaming Java APIs. The commit message describes the fix in more detail. I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.
Commit c66a2ef
Merge pull request #504 from JoshRosen/SPARK-1025
Fix PySpark hang when input files are deleted (SPARK-1025). This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.
Commit c40619d
Commits on Jan 27, 2014
Commit 6a5af7b
Merge pull request #460 from srowen/RandomInitialALSVectors
Choose initial user/item vectors uniformly on the unit sphere, rather than within the unit square, to possibly avoid bias in the initial state and improve convergence. The current implementation picks the N vector elements uniformly at random from [0,1). This means they all point into one quadrant of the vector space. As N gets just a little large, the vectors tend strongly to point into the "corner", towards (1,1,1,...,1). The vectors are not unit vectors either. I suggest choosing the elements as Gaussian ~ N(0,1) and normalizing. This gets you uniform random choices on the unit sphere, which is more what's of interest here. It has worked a little better for me in the past. This is pretty minor, but I wanted to warm up by suggesting a few tweaks to ALS. Please excuse my Scala; I'm pretty new to it.
Author: Sean Owen <[email protected]>
== Merge branch commits ==
commit 492b13a Author: Sean Owen <[email protected]> Date: Mon Jan 27 08:05:25 2014 +0000 Style: spaces around binary operators
commit ce2b5b5 Author: Sean Owen <[email protected]> Date: Sun Jan 19 22:50:03 2014 +0000 Generate factors with all positive components, per discussion in https://github.com/apache/incubator-spark/pull/460
commit b6f7a8a Author: Sean Owen <[email protected]> Date: Sat Jan 18 15:54:42 2014 +0000 Choose initial user/item vectors uniformly on the unit sphere rather than within the unit square to possibly avoid bias in the initial state and improve convergence
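For illustration, the sampling idea proposed here amounts to the following sketch (not the ALS code itself; note that the merged version further constrains factors to positive components, per the PR discussion):

```scala
import java.util.Random

// Minimal sketch: draw each component i.i.d. from N(0,1) and normalize.
// The resulting direction is uniformly distributed on the unit sphere in R^n.
def randomUnitVector(n: Int, rand: Random): Array[Double] = {
  val v = Array.fill(n)(rand.nextGaussian())
  val norm = math.sqrt(v.map(x => x * x).sum)
  v.map(_ / norm)
}

// Example: a random unit vector in R^10
val u = randomUnitVector(10, new Random(42))
```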
Commit f67ce3e
Merge pull request #490 from hsaputra/modify_checkoption_with_isdefined
Replace the check for None Option with isDefined and isEmpty in Scala code. Propose replacing the Scala check for Option "!= None" with Option.isDefined, and "=== None" with Option.isEmpty. I think using a method call where possible, rather than an operator function plus an argument, will make the Scala code easier to read and understand. Passes compile and tests.
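For illustration, the proposed style change amounts to this (a hypothetical snippet, not taken from the PR):

```scala
val appId: Option[String] = Some("app-20140127")

// Before: comparing an Option against None directly
if (appId != None) println("running")

// After: using the Option API, as this PR proposes
if (appId.isDefined) println("running")
if (appId.isEmpty) println("not started")
```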
Commit f16c21e
Commits on Jan 28, 2014
Merge pull request #516 from sarutak/master
modified SparkPluginBuild.scala to use https protocol for accessing gith... We cannot build Spark behind a proxy even when we execute sbt with the -Dhttp(s).proxyHost, -Dhttp(s).proxyPort, -Dhttp(s).proxyUser, and -Dhttp(s).proxyPassword options. That's because the git protocol is used to clone junit_xml_listener.git. I could build after modifying SparkPluginBuild.scala. I reported this issue to JIRA: https://spark-project.atlassian.net/browse/SPARK-1046
Commit 3d5c03e
Merge pull request #466 from liyinan926/file-overwrite-new
Allow files added through SparkContext.addFile() to be overwritten. This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.
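A sketch of the resulting usage, assuming the option is exposed as the spark.files.overwrite configuration key (my recollection of the key name; treat it as an assumption and check your Spark version's docs):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the token-refresh use case described above. The config key
// "spark.files.overwrite" is an assumption, not confirmed by this PR text.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("token-refresh")
  .set("spark.files.overwrite", "true")
val sc = new SparkContext(conf)

// The driver can re-add the renewed token file; with overwriting enabled,
// executors download the new contents instead of failing on a mismatch.
sc.addFile("/var/run/tokens/hadoop.token")
```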
Commit 84670f2
Commits on Jan 29, 2014
Commit 1381fc7
Commit f8c742c
Merge pull request #497 from tdas/docs-update
Updated Spark Streaming Programming Guide. Here is the updated version of the Spark Streaming Programming Guide. This is still a work in progress, but the major changes are in place, so feedback is most welcome. In general, I have tried to make the guide easier to understand even if the reader does not know much about Spark. The updated website is hosted here: http://www.eecs.berkeley.edu/~tdas/spark_docs/streaming-programming-guide.html
The major changes are:
- Overview illustrates the use cases of Spark Streaming, with its various input and output sources
- An example right after the overview, to quickly give an idea of what a Spark Streaming program looks like
- Made the Java API and examples a first-class citizen like Scala, by using tabs to show both Scala and Java examples (similar to the AMP Camp tutorial's code tabs)
- Highlighted the DStream operations updateStateByKey and transform because of their powerful nature
- Updated the driver node failure recovery text to highlight automatic recovery in Spark standalone mode
- Added information about linking to and using external input sources like Kafka and Flume
- In general, reorganized the sections to better show the Basic section and the more advanced sections like Tuning and Recovery
Todos:
- Links to the docs of external Kafka, Flume, etc.
- Illustrate the window operation with a figure as well as an example
Author: Tathagata Das <[email protected]>
== Merge branch commits ==
commit 18ff105 Author: Tathagata Das <[email protected]> Date: Tue Jan 28 21:49:30 2014 -0800 Fixed a lot of broken links.
commit 34a5a60 Author: Tathagata Das <[email protected]> Date: Tue Jan 28 18:02:28 2014 -0800 Updated github url to use SPARK_GITHUB_URL variable.
commit f338a60 Author: Tathagata Das <[email protected]> Date: Mon Jan 27 22:42:42 2014 -0800 More updates based on Patrick and Harvey's comments.
commit 89a81ff Author: Tathagata Das <[email protected]> Date: Mon Jan 27 13:08:34 2014 -0800 Updated docs based on Patrick's PR comments.
commit d5b6196 Author: Tathagata Das <[email protected]> Date: Sun Jan 26 20:15:58 2014 -0800 Added spark.streaming.unpersist config and info on StreamingListener interface.
commit e3dcb46 Author: Tathagata Das <[email protected]> Date: Sun Jan 26 18:41:12 2014 -0800 Fixed docs on StreamingContext.getOrCreate.
commit 6c29524 Author: Tathagata Das <[email protected]> Date: Thu Jan 23 18:49:39 2014 -0800 Added example and figure for window operations, and links to Kafka and Flume API docs.
commit f06b964 Author: Tathagata Das <[email protected]> Date: Wed Jan 22 22:49:12 2014 -0800 Fixed missing endhighlight tag in the MLlib guide.
commit 036a7d4 Merge: eab351d a1cd185 Author: Tathagata Das <[email protected]> Date: Wed Jan 22 22:17:42 2014 -0800 Merge remote-tracking branch 'apache/master' into docs-update
commit eab351d Author: Tathagata Das <[email protected]> Date: Wed Jan 22 22:17:15 2014 -0800 Update Spark Streaming Programming Guide.
Commit 7930209
Merge pull request #494 from tyro89/worker_registration_issue
Issue with failed worker registrations. I've been going through the Spark source after having some odd issues with workers dying and not coming back. After some digging (I'm very new to Scala and Spark) I believe I've found a worker registration issue. It looks to me like a failed registration follows the same code path as a successful registration, which ends up with workers believing they are connected (since they received a `RegisteredWorker` event) even though they are not registered on the Master. This is a quick fix that I hope addresses this issue (assuming I didn't completely misread the code and I'm about to look like a silly person :P). I'm opening this PR now to start a chat with you guys while I do some more testing on my side :)
Author: Erik Selin <[email protected]>
== Merge branch commits ==
commit 973012f Author: Erik Selin <[email protected]> Date: Tue Jan 28 23:36:12 2014 -0500 break logwarning into two lines to respect line character limit.
commit e3754dc Author: Erik Selin <[email protected]> Date: Tue Jan 28 21:16:21 2014 -0500 add log warning when worker registration fails due to attempt to re-register on same address.
commit 14baca2 Author: Erik Selin <[email protected]> Date: Wed Jan 22 21:23:26 2014 -0500 address code style comment
commit 71c0d7e Author: Erik Selin <[email protected]> Date: Wed Jan 22 16:01:42 2014 -0500 Make a failed registration not persist, not send a `RegisteredWorker` event and not run `schedule`, but rather send a `RegisterWorkerFailed` message to the worker attempting to register.
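A schematic sketch of the fix described above, with all names illustrative rather than taken from the Master code: a duplicate registration should answer with a failure message instead of falling through to the success path.

```scala
// Illustrative sketch only; the real Master actor has far more state.
sealed trait RegistrationResponse
case object RegisteredWorker extends RegistrationResponse
case class RegisterWorkerFailed(reason: String) extends RegistrationResponse

class MasterSketch {
  private val registered = scala.collection.mutable.Set[String]()

  def register(workerAddress: String): RegistrationResponse =
    if (registered.contains(workerAddress)) {
      // Previously the failure case could still look like success to the worker.
      RegisterWorkerFailed("attempt to re-register worker at same address")
    } else {
      registered += workerAddress
      RegisteredWorker
    }
}
```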
Commit 0ff38c2
Commits on Jan 30, 2014
Merge pull request #524 from rxin/doc
Added spark.shuffle.file.buffer.kb to configuration doc.
Author: Reynold Xin <[email protected]>
== Merge branch commits ==
commit 0eea1d7 Author: Reynold Xin <[email protected]> Date: Wed Jan 29 14:40:48 2014 -0800 Added spark.shuffle.file.buffer.kb to configuration doc.
Commit ac712e4
Commits on Feb 1, 2014
Merge pull request #527 from ankurdave/graphx-assembly-pom
Add GraphX to assembly/pom.xml.
Author: Ankur Dave <[email protected]>
== Merge branch commits ==
commit bb0b33e Author: Ankur Dave <[email protected]> Date: Fri Jan 31 15:24:52 2014 -0800 Add GraphX to assembly/pom.xml
Commit a8cf3ec
Commits on Feb 3, 2014
Merge pull request #529 from hsaputra/cleanup_right_arrowop_scala
Change the ⇒ character (maybe from scalariform) to => in Scala code for style consistency. It looks like there are some ⇒ Unicode characters (maybe from scalariform) in the Scala code. This PR changes them to => to get some consistency in the Scala code. If we wanted ⇒ as the default, we could use the sbt plugin scalariform to make sure all Scala code has ⇒ instead of =>. Also removed unused imports found in TwitterInputDStream.scala while I was there =)
Author: Henry Saputra <[email protected]>
== Merge branch commits ==
commit 29c1771 Author: Henry Saputra <[email protected]> Date: Sat Feb 1 22:05:16 2014 -0800 Change the ⇒ character (maybe from scalariform) to => in Scala code for style consistency.
Commit 0386f42
Merge pull request #530 from aarondav/cleanup. Closes #530.
Remove explicit conversion to PairRDDFunctions in cogroup(). As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the conversion explicitly, however.
Author: Aaron Davidson <[email protected]>
== Merge branch commits ==
commit aa4a63f Author: Aaron Davidson <[email protected]> Date: Sun Feb 2 23:48:04 2014 -0800 Remove explicit conversion to PairRDDFunctions in cogroup(). As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the conversion explicitly, however.
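For illustration, the cleanup relies on the implicit conversion to PairRDDFunctions that importing SparkContext._ brings into scope (a sketch, not the changed code itself):

```scala
import org.apache.spark.SparkContext._ // brings rddToPairRDDFunctions into scope
import org.apache.spark.rdd.RDD

// Sketch: with the implicit conversion imported, pair-RDD methods such as
// cogroup() are available directly on an RDD[(K, V)], so there is no need
// to wrap it in PairRDDFunctions explicitly.
def cogroupExample(a: RDD[(String, Int)], b: RDD[(String, Int)]) =
  a.cogroup(b) // the implicit conversion supplies PairRDDFunctions.cogroup
```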
Commit 1625d8c
Merge pull request #528 from mengxr/sample. Closes #528.
Refactor RDD sampling and add randomSplit to RDD (update). Replace SampledRDD with PartitionwiseSampledRDD, which accepts a RandomSampler instance as input. The current sampling with and without replacement can be easily integrated via BernoulliSampler and PoissonSampler. The benefits are: 1) RDD.randomSplit is implemented in the same way, related to https://github.com/apache/incubator-spark/pull/513; 2) stratified sampling and importance sampling can be implemented in the same manner as well. Unit tests are included for the samplers and RDD.randomSplit. This should perform better than my previous request, where the BernoulliSampler creates many Iterator instances: https://github.com/apache/incubator-spark/pull/513
Author: Xiangrui Meng <[email protected]>
== Merge branch commits ==
commit e8ce957 Author: Xiangrui Meng <[email protected]> Date: Mon Feb 3 12:21:08 2014 -0800 more docs to PartitionwiseSampledRDD
commit fbb4586 Author: Xiangrui Meng <[email protected]> Date: Mon Feb 3 00:44:23 2014 -0800 move XORShiftRandom to util.random and use it in BernoulliSampler
commit 987456b Author: Xiangrui Meng <[email protected]> Date: Sat Feb 1 11:06:59 2014 -0800 relax assertions in SortingSuite because the RangePartitioner has large variance in this case
commit 3690aae Author: Xiangrui Meng <[email protected]> Date: Sat Feb 1 09:56:28 2014 -0800 test split ratio of RDD.randomSplit
commit 8a410bc Author: Xiangrui Meng <[email protected]> Date: Sat Feb 1 09:25:22 2014 -0800 add a test to ensure seed distribution and minor style update
commit ce7e866 Author: Xiangrui Meng <[email protected]> Date: Fri Jan 31 18:06:22 2014 -0800 minor style change
commit 750912b Author: Xiangrui Meng <[email protected]> Date: Fri Jan 31 18:04:54 2014 -0800 fix some long lines
commit c446a25 Author: Xiangrui Meng <[email protected]> Date: Fri Jan 31 17:59:59 2014 -0800 add complement to BernoulliSampler and minor style changes
commit dbe2bc2 Author: Xiangrui Meng <[email protected]> Date: Fri Jan 31 17:45:08 2014 -0800 switch to partition-wise sampling for better performance
commit a1fca52 Merge: ac712e4 cf6128f Author: Xiangrui Meng <[email protected]> Date: Fri Jan 31 16:33:09 2014 -0800 Merge branch 'sample' of github.com:mengxr/incubator-spark into sample
commit cf6128f Author: Xiangrui Meng <[email protected]> Date: Sun Jan 26 14:40:07 2014 -0800 set SampledRDD deprecated in 1.0
commit f430f84 Author: Xiangrui Meng <[email protected]> Date: Sun Jan 26 14:38:59 2014 -0800 update code style
commit a8b5e20 Author: Xiangrui Meng <[email protected]> Date: Sun Jan 26 12:56:27 2014 -0800 move package random to util.random
commit ab0fa2c Author: Xiangrui Meng <[email protected]> Date: Sun Jan 26 12:50:35 2014 -0800 add Apache headers and update code style
commit 985609f Author: Xiangrui Meng <[email protected]> Date: Sun Jan 26 11:49:25 2014 -0800 add new lines
commit b21bddf Author: Xiangrui Meng <[email protected]> Date: Sun Jan 26 11:46:35 2014 -0800 move samplers to random.IndependentRandomSampler and add tests
commit c02dacb Author: Xiangrui Meng <[email protected]> Date: Sat Jan 25 15:20:24 2014 -0800 add RandomSampler
commit 8ff7ba3 Author: Xiangrui Meng <[email protected]> Date: Fri Jan 24 13:23:22 2014 -0800 init impl of IndependentlySampledRDD
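As a usage illustration of the randomSplit API this adds (a sketch; the weights and seed are arbitrary):

```scala
import org.apache.spark.rdd.RDD

// Sketch: split an RDD into 70%/30% train/test parts with the randomSplit
// API introduced here; a fixed seed makes the split reproducible.
def trainTestSplit(data: RDD[Double]): (RDD[Double], RDD[Double]) = {
  val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)
  (train, test)
}
```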
Commit 23af00f
Commits on Feb 4, 2014
Merge pull request #535 from sslavic/patch-2. Closes #535.
Fixed typo in scaladoc.
Author: Stevo Slavić <[email protected]>
== Merge branch commits ==
commit 0a77f78 Author: Stevo Slavić <[email protected]> Date: Tue Feb 4 15:30:27 2014 +0100 Fixed typo in scaladoc
Commit 0c05cd3
Merge pull request #534 from sslavic/patch-1. Closes #534.
Fixed wrong path to compute-classpath.cmd: compute-classpath.cmd is in the bin directory, not in sbin.
Author: Stevo Slavić <[email protected]>
== Merge branch commits ==
commit 23deca3 Author: Stevo Slavić <[email protected]> Date: Tue Feb 4 15:01:47 2014 +0100 Fixed wrong path to compute-classpath.cmd: compute-classpath.cmd is in bin, not in sbin directory
Commit 9209287
Commits on Feb 5, 2014
Merge pull request #540 from sslavic/patch-3. Closes #540.
Fix line end character stripping for Windows. The LogQuery Spark example would produce unwanted results when run on Windows because of different, platform-specific trailing line end characters (not only \n but \r too). This fix uses Scala's standard library string functions to properly strip all trailing line end characters, letting Scala handle the platform-specific details.
Author: Stevo Slavić <[email protected]>
== Merge branch commits ==
commit 1e43ba0 Author: Stevo Slavić <[email protected]> Date: Wed Feb 5 14:48:29 2014 +0100 Fix line end character stripping for Windows: LogQuery Spark example would produce unwanted result when run on Windows platform because of different, platform specific trailing line end characters (not only \n but \r too). This fix makes use of Scala's standard library string functions to properly strip all trailing line end characters, letting Scala handle the platform specific stuff.
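One standard-library way to do this is shown below (a sketch; whether the example uses exactly this call is not shown here):

```scala
// stripLineEnd removes a trailing "\n" or "\r\n" from a string, so input
// produced on Windows and on Unix normalizes to the same value.
val unixLine = "GET /index.html HTTP/1.1\n"
val windowsLine = "GET /index.html HTTP/1.1\r\n"

assert(unixLine.stripLineEnd == windowsLine.stripLineEnd)
```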
Commit f7fd80d
Merge pull request #544 from kayousterhout/fix_test_warnings. Closes #544.
Fixed warnings in test compilation. This commit fixes two problems: a redundant import and a deprecated function.
Author: Kay Ousterhout <[email protected]>
== Merge branch commits ==
commit da9d2e1 Author: Kay Ousterhout <[email protected]> Date: Wed Feb 5 11:41:51 2014 -0800 Fixed warnings in test compilation. This commit fixes two problems: a redundant import, and a deprecated function.
Commit cc14ba9
Commits on Feb 6, 2014
Merge pull request #549 from CodingCat/deadcode_master. Closes #549.
Remove actorToWorker in master.scala, which is actually not used. actorToWorker is not actually used in the code, so just remove it.
Author: CodingCat <[email protected]>
== Merge branch commits ==
commit 52656c2 Author: CodingCat <[email protected]> Date: Thu Feb 6 00:28:26 2014 -0500 remove actorToWorker in master.scala, which is actually not used
Commit 18c4ee7
Merge pull request #526 from tgravescs/yarn_client_stop_am_fix. Closes #526.
Spark on YARN: yarn-client mode doesn't always exit immediately. https://spark-project.atlassian.net/browse/SPARK-1049 If you run in yarn-client mode but you don't get all the workers you requested right away, and then you exit your application, the application master stays around until it gets the number of workers you initially requested. This is a waste of resources. The AM should exit immediately upon the client going away. This fix simply checks whether the driver has closed while it's waiting for the initial number of workers.
Author: Thomas Graves <[email protected]>
== Merge branch commits ==
commit 03f40a6 Author: Thomas Graves <[email protected]> Date: Fri Jan 31 11:23:10 2014 -0600 spark on yarn - yarn-client mode doesn't always exit immediately
Commit 3802096
Merge pull request #545 from kayousterhout/fix_progress. Closes #545.
Fix off-by-one error with task progress info log.
Author: Kay Ousterhout <[email protected]>
== Merge branch commits ==
commit 29798fc Author: Kay Ousterhout <[email protected]> Date: Wed Feb 5 13:40:01 2014 -0800 Fix off-by-one error with task progress info log.
Commit 79c9552
Merge pull request #498 from ScrapCodes/python-api. Closes #498.
Python API additions.
Author: Prashant Sharma <[email protected]>
== Merge branch commits ==
commit 8b51591 Author: Prashant Sharma <[email protected]> Date: Fri Jan 24 11:50:29 2014 +0530 Josh's and Patrick's review comments.
commit d37f967 Author: Prashant Sharma <[email protected]> Date: Thu Jan 23 17:27:17 2014 +0530 fixed doc tests
commit 27cb54b Author: Prashant Sharma <[email protected]> Date: Thu Jan 23 16:48:43 2014 +0530 Added keys and values methods for PairFunctions in python
commit 4ce76b3 Author: Prashant Sharma <[email protected]> Date: Thu Jan 23 13:51:26 2014 +0530 Added foreachPartition
commit 05f0534 Author: Prashant Sharma <[email protected]> Date: Thu Jan 23 13:02:59 2014 +0530 Added coalesce function to python API
commit 6568d2c Author: Prashant Sharma <[email protected]> Date: Thu Jan 23 12:52:44 2014 +0530 added repartition function to python API.
Commit 084839b
Merge pull request #554 from sryza/sandy-spark-1056. Closes #554.
SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone.
Author: Sandy Ryza <[email protected]>
== Merge branch commits ==
commit 1f2443d Author: Sandy Ryza <[email protected]> Date: Thu Feb 6 15:03:50 2014 -0800 SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone
Commit 446403b
Commits on Feb 7, 2014
Merge pull request #321 from kayousterhout/ui_kill_fix. Closes #321.
Inform DAG scheduler about all started/finished tasks. Previously, the DAG scheduler was not always informed when tasks started and finished. The simplest example here is for speculated tasks: the DAGScheduler was only told about the first attempt of a task, meaning that SparkListeners were also not told about multiple task attempts, so users can't see what's going on with speculation in the UI. The DAGScheduler also wasn't always told about finished tasks, so in the UI, some tasks will never be shown as finished (this occurs, for example, if a task set gets killed). The other problem is that the fairness accounting was wrong: the number of running tasks in a pool was decreased when a task set was considered done, even if all of its tasks hadn't yet finished.
Author: Kay Ousterhout <[email protected]>
== Merge branch commits ==
commit c8d547d Author: Kay Ousterhout <[email protected]> Date: Wed Jan 15 16:47:33 2014 -0800 Addressed Reynold's review comments. Always use a TaskEndReason (remove the option), and explicitly signal when we don't know the reason. Also, always tell DAGScheduler (and associated listeners) about started tasks, even when they're speculated.
commit 3fee1e2 Author: Kay Ousterhout <[email protected]> Date: Wed Jan 8 22:58:13 2014 -0800 Fixed broken test and improved logging
commit ff12fca Author: Kay Ousterhout <[email protected]> Date: Sun Dec 29 21:08:20 2013 -0800 Inform DAG scheduler about all finished tasks. Previously, the DAG scheduler was not always informed when tasks finished. For example, when a task set was aborted, the DAG scheduler was never told when the tasks in that task set finished. The DAG scheduler was also never told about the completion of speculated tasks. This led to confusion with SparkListeners because information about the completion of those tasks was never passed on to the listeners (so in the UI, for example, some tasks will never be shown as finished). The other problem is that the fairness accounting was wrong -- the number of running tasks in a pool was decreased when a task set was considered done, even if all of its tasks hadn't yet finished.
Commit 18ad59e
Merge pull request #450 from kayousterhout/fetch_failures. Closes #450.
Only run the ResubmitFailedStages event after a fetch fails. Previously, the ResubmitFailedStages event was called every 200 milliseconds, leading to a lot of unnecessary event processing and clogged DAGScheduler logs.
Author: Kay Ousterhout <[email protected]>
== Merge branch commits ==
commit e603784 Author: Kay Ousterhout <[email protected]> Date: Wed Feb 5 11:34:41 2014 -0800 Re-add check for empty set of failed stages
commit d258f0e Author: Kay Ousterhout <[email protected]> Date: Wed Jan 15 23:35:41 2014 -0800 Only run ResubmitFailedStages event after a fetch fails. Previously, the ResubmitFailedStages event was called every 200 milliseconds, leading to a lot of unnecessary event processing and clogged DAGScheduler logs.
Commit 0b448df
Merge pull request #533 from andrewor14/master. Closes #533.
External spilling: generalize batching logic. The existing implementation consists of a hack for Kryo specifically and only works for LZF compression. Introducing an intermediate batch-level stream takes care of pre-fetching and other arbitrary behavior of higher level streams in a more general way.
Author: Andrew Or <[email protected]>
== Merge branch commits ==
commit 3ddeb7e Author: Andrew Or <[email protected]> Date: Wed Feb 5 12:09:32 2014 -0800 Also privatize fields
commit 090544a Author: Andrew Or <[email protected]> Date: Wed Feb 5 10:58:23 2014 -0800 Privatize methods
commit 13920c9 Author: Andrew Or <[email protected]> Date: Tue Feb 4 16:34:15 2014 -0800 Update docs
commit bd5a1d7 Author: Andrew Or <[email protected]> Date: Tue Feb 4 13:44:24 2014 -0800 Typo: phyiscal -> physical
commit 287ef44 Author: Andrew Or <[email protected]> Date: Tue Feb 4 13:38:32 2014 -0800 Avoid reading the entire batch into memory; also simplify streaming logic. Additionally, address formatting comments.
commit 3df7005 Merge: a531d2e 164489d Author: Andrew Or <[email protected]> Date: Mon Feb 3 18:27:49 2014 -0800 Merge branch 'master' of github.com:andrewor14/incubator-spark
commit a531d2e Author: Andrew Or <[email protected]> Date: Mon Feb 3 18:18:04 2014 -0800 Relax assumptions on compressors and serializers when batching. This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
commit 164489d Author: Andrew Or <[email protected]> Date: Mon Feb 3 18:18:04 2014 -0800 Relax assumptions on compressors and serializers when batching. This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
Commit 1896c6e
Merge pull request #506 from ash211/intersection. Closes #506.
SPARK-1062: Add rdd.intersection(otherRdd) method.
Author: Andrew Ash <[email protected]>
== Merge branch commits ==
commit 5d9982b Author: Andrew Ash <[email protected]> Date: Thu Feb 6 18:11:45 2014 -0800 Minor fixes: style: (v,null) => (v, null); mention the shuffle in Javadoc
commit b86d02f Author: Andrew Ash <[email protected]> Date: Sun Feb 2 13:17:40 2014 -0800 Overload .intersection() for numPartitions and custom Partitioner
commit bcaa349 Author: Andrew Ash <[email protected]> Date: Sun Feb 2 13:05:40 2014 -0800 Better naming of parameters in intersection's filter
commit b10a6af Author: Andrew Ash <[email protected]> Date: Sat Jan 25 23:06:26 2014 -0800 Follow spark code format conventions of tab => 2 spaces
commit 965256e Author: Andrew Ash <[email protected]> Date: Fri Jan 24 00:28:01 2014 -0800 Add rdd.intersection(otherRdd) method
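A sketch of how such an intersection can be built from existing primitives, in the spirit of this PR (illustrative, not necessarily the merged code):

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch: cogroup on (value, null) pairs, keep keys that appear on both
// sides, and emit each surviving value once. Incurs a shuffle via cogroup.
def intersectionSketch[T: ClassTag](a: RDD[T], b: RDD[T]): RDD[T] =
  a.map(v => (v, null)).cogroup(b.map(v => (v, null)))
    .filter { case (_, (left, right)) => left.nonEmpty && right.nonEmpty }
    .keys
```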
Commit 3a9d82c
Commits on Feb 8, 2014
Merge pull request #552 from martinjaggi/master. Closes #552.
TeX formulas in the documentation using MathJax, and splitting the MLlib documentation by technique. See JIRA https://spark-project.atlassian.net/browse/MLLIB-19 and https://github.com/shivaram/spark/compare/mathjax
Author: Martin Jaggi <[email protected]>
== Merge branch commits ==
commit 0364bfa Author: Martin Jaggi <[email protected]> Date: Fri Feb 7 03:19:38 2014 +0100 minor polishing, as suggested by @pwendell
commit dcd2142 Author: Martin Jaggi <[email protected]> Date: Thu Feb 6 18:04:26 2014 +0100 enabling inline latex formulas with $.$; same mathjax configuration as used in math.stackexchange.com; sample usage in the linear algebra (SVD) documentation
commit bbafafd Author: Martin Jaggi <[email protected]> Date: Thu Feb 6 17:31:29 2014 +0100 split MLlib documentation by techniques and linked from the main mllib-guide.md site
commit d1c5212 Author: Martin Jaggi <[email protected]> Date: Thu Feb 6 16:59:43 2014 +0100 enable mathjax formula in the .md documentation files; code by @shivaram
commit d73948d Author: Martin Jaggi <[email protected]> Date: Thu Feb 6 16:57:23 2014 +0100 minor update on how to compile the documentation
Commit fabf174
Merge pull request #454 from jey/atomic-sbt-download. Closes #454.
Make sbt download an atomic operation. Modifies the `sbt/sbt` script to gracefully recover when a previous invocation died in the middle of downloading the SBT jar.
Author: Jey Kottalam <[email protected]>
== Merge branch commits ==
commit 6c600eb Author: Jey Kottalam <[email protected]> Date: Fri Jan 17 10:43:54 2014 -0800 Make sbt download an atomic operation
Commit 7805080
Merge pull request #561 from Qiuzhuang/master. Closes #561.
Kill drivers in postStop() for Worker. JIRA SPARK-1068: https://spark-project.atlassian.net/browse/SPARK-1068
Author: Qiuzhuang Lian <[email protected]>
== Merge branch commits ==
commit 9c19ce6 Author: Qiuzhuang Lian <[email protected]> Date: Sat Feb 8 16:07:39 2014 +0800 Kill drivers in postStop() for Worker. JIRA SPARK-1068: https://spark-project.atlassian.net/browse/SPARK-1068
Commit f0ce736
Commits on Feb 9, 2014
Merge pull request #542 from markhamstra/versionBump. Closes #542.
Version number to 1.0.0-SNAPSHOT. Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore. @pwendell
Author: Mark Hamstra <[email protected]>
== Merge branch commits ==
commit 1b00a8a Author: Mark Hamstra <[email protected]> Date: Wed Feb 5 09:30:32 2014 -0800 Version number to 1.0.0-SNAPSHOT
Commit c2341c9
Merge pull request #565 from pwendell/dev-scripts. Closes #565.
SPARK-1066: Add developer scripts to repository. These are some developer scripts I've been maintaining in a separate public repo. This patch adds them to the Spark repository so they can evolve here and are clearly accessible to all committers. I may do some small additional clean-up in this PR, but wanted to put them here in case others want to review. There are a few types of scripts here:
1. A tool to merge pull requests.
2. A script for packaging releases.
3. A script for auditing release candidates.
Author: Patrick Wendell <[email protected]>
== Merge branch commits ==
commit 5d5d331 Author: Patrick Wendell <[email protected]> Date: Sat Feb 8 22:11:47 2014 -0800 SPARK-1066: Add developer scripts to repository.
Commit f892da8
Merge pull request #560 from pwendell/logging. Closes #560.
[WIP] SPARK-1067: Default log4j initialization causes errors for those not using log4j. To fix this, we add a check when initializing log4j.
Author: Patrick Wendell <[email protected]>
== Merge branch commits ==
commit ffdce51 Author: Patrick Wendell <[email protected]> Date: Fri Feb 7 15:22:29 2014 -0800 Logging fix
Commit b6d40b7
Merge pull request #562 from jyotiska/master. Closes #562.
Added example Python code for sort. I added example Python code for sorting. Right now, PySpark has limited examples for new people wanting to use the project. This example code sorts integers stored in a file. I was able to sort 5 million, 10 million and 25 million integers with this code.
Author: jyotiska <[email protected]>
== Merge branch commits ==
commit 8ad8faf Author: jyotiska <[email protected]> Date: Sun Feb 9 11:00:41 2014 +0530 Added comments in code on collect() method
commit 6f98f1e Author: jyotiska <[email protected]> Date: Sat Feb 8 13:12:37 2014 +0530 Updated python example code sort.py
commit 945e39a Author: jyotiska <[email protected]> Date: Sat Feb 8 12:59:09 2014 +0530 Added example python code for sort
Commit 2ef37c9
Merge pull request #556 from CodingCat/JettyUtil. Closes #556.
[SPARK-1060] startJettyServer should explicitly use IP information. https://spark-project.atlassian.net/browse/SPARK-1060
In the current implementation, the web server in Master/Worker is started with:
val (srv, bPort) = JettyUtils.startJettyServer("0.0.0.0", port, handlers)
Inside startJettyServer:
val server = new Server(currentPort) // here, the Server takes "0.0.0.0" as the hostname, i.e. it will always bind to the IP address of the first NIC
This can cause wrong IP binding. E.g., if the host has two NICs, N1 and N2, and the user specifies SPARK_LOCAL_IP as N2's IP address, then when starting the web server, for the reason stated above, it will always bind to N1's address.
Author: CodingCat <[email protected]>
== Merge branch commits ==
commit 6c6d9a8 Author: CodingCat <[email protected]> Date: Thu Feb 6 14:53:34 2014 -0500 startJettyServer should explicitly use IP information
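For illustration, binding Jetty to one specific address rather than all interfaces looks like this (a sketch using the standard Jetty API, not the patched Spark code):

```scala
import java.net.InetSocketAddress
import org.eclipse.jetty.server.Server

// Sketch: new Server(port) binds to all interfaces ("0.0.0.0"); passing an
// InetSocketAddress binds the server to one specific IP, e.g. SPARK_LOCAL_IP.
val host = sys.env.getOrElse("SPARK_LOCAL_IP", "127.0.0.1")
val server = new Server(new InetSocketAddress(host, 8080))
server.start()
```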
Commit b6dba10