Spark
Locate spark-env.sh (in my case, after installing with Ambari, it is in /usr/hdp/2.3.4.0-3485/etc/spark/conf/).
Find the HADOOP_CONF_DIR variable in spark-env.sh (in my case it is set to HADOOP_CONF_DIR=/usr/hdp/2.3.4.0-3485/hadoop/conf).
Copy jets3t.properties from this repository to HADOOP_CONF_DIR, and merge the content of core-site.xml from this repository into the core-site.xml under HADOOP_CONF_DIR (add the properties, don't replace the file).
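The exact values depend on your Hadoop version and NooBaa deployment, but the merged configuration typically looks something like the following sketch. The endpoint host and credentials here are placeholders, not values from this repository; JetS3t is the library behind the s3n connector, which is why it has its own properties file.

```
# jets3t.properties - point the s3n connector at the NooBaa S3 endpoint
# (placeholder host; substitute your NooBaa endpoint)
s3service.s3-endpoint=noobaa.example.com
s3service.https-only=false
s3service.disable-dns-buckets=true
```

```xml
<!-- properties added to core-site.xml under HADOOP_CONF_DIR;
     the key values are placeholders for your NooBaa credentials -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```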
To run the Spark shell you need additional jars. These are the jars I added; adjust the versions and locations as needed, of course.
/usr/bin/spark-shell --jars /usr/hdp/2.3.4.0-3485/hadoop/hadoop-aws-2.7.1.2.3.4.0-3485.jar,/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar,/usr/local/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar
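Once the shell is up, a quick way to confirm that the s3n endpoint and credentials were picked up is to read a few lines from the bucket (this sketch assumes the same bucket and file used in the sample below; adjust to yours):

```scala
// Smoke test from the Spark shell: fails fast if the endpoint or
// credentials in core-site.xml / jets3t.properties are wrong.
sc.textFile("s3n://hadoop/et2.log").take(5).foreach(println)
```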
A sample that counts words in et2.log, which resides in NooBaa's bucket hadoop:
val file = sc.textFile("s3n://hadoop/et2.log");
val counts = file.flatMap(line => line.toLowerCase().replace(".", " ").replace(",", " ").split(" ")).map(word => (word, 1L)).reduceByKey(_ + _)
val sorted_counts = counts.collect().sortBy(wc => -wc._2)
sorted_counts.take(10).foreach(println)
sc.parallelize(sorted_counts).saveAsTextFile("file:///tmp/output")