Spark
Locate spark-env.sh (in my case, after installing with Ambari, it is in /usr/hdp/2.3.4.0-3485/etc/spark/conf/).
Find the HADOOP_CONF_DIR variable in spark-env.sh (in my case it is set to HADOOP_CONF_DIR=/usr/hdp/2.3.4.0-3485/hadoop/conf).
Copy jets3t.properties from this repository to HADOOP_CONF_DIR, and merge the content of core-site.xml from this repository into the core-site.xml under HADOOP_CONF_DIR (add the properties, don't replace the file).
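The exact values depend on your Hadoop version and NooBaa deployment, but the merged configuration typically looks something like the following sketch. The endpoint host and credentials here are placeholders, not values from this repository; JetS3t is the library behind the s3n connector, which is why it has its own properties file.

```
# jets3t.properties - point the s3n connector at the NooBaa S3 endpoint
# (placeholder host; substitute your NooBaa endpoint)
s3service.s3-endpoint=noobaa.example.com
s3service.https-only=false
s3service.disable-dns-buckets=true
```

```xml
<!-- properties added to core-site.xml under HADOOP_CONF_DIR;
     the key values are placeholders for your NooBaa credentials -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```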
To run the Spark shell you need additional jars. These are the jars I added; adjust the versions and locations as needed, of course.
/usr/bin/spark-shell --jars /usr/hdp/2.3.4.0-3485/hadoop/hadoop-aws-2.7.1.2.3.4.0-3485.jar,/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar,/usr/local/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar
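Once the shell is up, a quick way to confirm that the s3n endpoint and credentials were picked up is to read a few lines from the bucket (this sketch assumes the same bucket and file used in the sample below; adjust to yours):

```scala
// Smoke test from the Spark shell: fails fast if the endpoint or
// credentials in core-site.xml / jets3t.properties are wrong.
sc.textFile("s3n://hadoop/et2.log").take(5).foreach(println)
```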
A sample that counts words in et2.log, which resides in NooBaa's bucket hadoop:
val file = sc.textFile("s3n://hadoop/et2.log");
val counts = file.flatMap(line => line.toLowerCase().replace(".", " ").replace(",", " ").split(" ")).map(word => (word, 1L)).reduceByKey(_ + _)
val sorted_counts = counts.collect().sortBy(wc => -wc._2)
sorted_counts.take(10).foreach(println)
sc.parallelize(sorted_counts).saveAsTextFile("file:///tmp/output")