Commit
refactoring code and updated the document (#46)
* test

* removing test file

* adding Delta Lake documentation

* ODI and Data Flow Integration

* ODI and Data Flow Integration

* ODI and Data Flow Integration

* Delta Lake integration documentation and sample code.

* Delta Lake integration documentation and sample code.

* Delta Lake integration documentation and sample code.

* updating the link to the source files

* Iceberg integration and document details

* Iceberg integration spark submit details

* refactoring code and updated the document

* updating the dependency packager documentation.

* updating the dependency packager documentation.

---------

Co-authored-by: Mohan Kumar <[email protected]>
mohanln-cloud and Mohan Kumar authored Mar 30, 2023
1 parent 68b9dd3 commit e2bb85e
Showing 3 changed files with 15 additions and 18 deletions.
14 changes: 1 addition & 13 deletions iceberg/python/iceberg_sample.py
@@ -42,19 +42,6 @@


sqlContext = SQLContext(spark.sparkContext)
# Create a Vertex DataFrame with unique ID column "id"
v = sqlContext.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame

csvFilePath = sys.argv[1]
icebergTablePath = sys.argv[2]
@@ -73,6 +60,7 @@
df1.show()

print("\nHistory for snapshot-id" + icebergTablePath)
# provide valid snapshot id from the metadata.json file
df2 = spark.read.format("iceberg").option("snapshot-id", 3004923173007188796).load(icebergTablePath)
df2.show()

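The new comment above points to the table's metadata.json for a valid snapshot id. As a hedged alternative sketch (assuming the Iceberg runtime bundled with the sample supports path-based metadata tables), the available snapshot ids can also be listed from Spark:

```python
# Sketch: list snapshot ids of a path-based (Hadoop) Iceberg table.
# Assumes `spark` and `icebergTablePath` are the objects defined earlier in the
# sample, and that the "#snapshots" metadata-table suffix is supported by the
# Iceberg version in use.
snapshots_df = spark.read.format("iceberg").load(icebergTablePath + "#snapshots")
snapshots_df.select("snapshot_id", "committed_at", "operation").show(truncate=False)
```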
@@ -3,7 +3,12 @@ This is a Maven project to build an uber (all in one) JAR that contains all the

See [Best Practices for Building Python Spark Streaming Applications](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/spark-streaming.htm#streaming-build-python-app-tips)
# Content
Fully includes
Provide the dependency using any of the suitable options below.

* Use the `--packages` option or the `spark.jars.packages` Spark configuration. An application running with a private endpoint has to allow traffic from the private subnet to the internet to download the package (confirm with PM).
* Provide the Object Storage JAR locations in `--jars` or `spark.jars` as a comma-separated list.
* Create a dependency `archive.zip`.
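For illustration, a hedged sketch of the first two options at submit time (the version and Object Storage path are placeholders, not values taken from this repository):

```sh
# Option 1: let Spark resolve the dependency from Maven Central (requires internet access).
spark-submit --packages com.oracle.oci.sdk:oci-java-sdk-addons-sasl:<version> ...

# Option 2: point --jars at a JAR already uploaded to Object Storage.
spark-submit --jars oci://<bucket>@<namespace>/jars/oci-java-sdk-addons-sasl-<version>.jar ...
```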

```sh
groupId = com.oracle.oci.sdk
artifactId = oci-java-sdk-addons-sasl
@@ -34,9 +39,9 @@ Relocates namespaces
* com.google -> com.shaded.google
* com.oracle.bmc -> com.shaded.oracle.bmc (except com.oracle.bmc.auth.sasl.*)

# Instructions
1. Compile an uber jar.
2. Pack uber jar into archive.zip
# Instructions to create an application jar
1. Provide the dependency using any of the suitable options above.
2. Compile the source code and create an application JAR (for the fat-JAR case, pack the uber JAR into `archive.zip`, as sketched below).
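A minimal sketch of step 2 for the fat-JAR case (the JAR name is hypothetical; check the Data Flow dependency packager documentation for the exact layout expected inside `archive.zip`):

```sh
# Build the uber JAR, then pack it into archive.zip.
mvn package
zip archive.zip target/<uber-jar-name>.jar
```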

## To Compile
```sh
6 changes: 5 additions & 1 deletion python/structured_streaming_kafka_word_count/README.md
@@ -40,7 +40,11 @@ optional arguments:
--raw-stream RAW_STREAM
--stream-username STREAM_USERNAME
```
4. Using `python/structured_streaming_java_dependencies_for_python` create `archive.zip`
4. Provide your dependency using any of the suitable options below (see the sketch after these steps).

* Use the `--packages` option or the `spark.jars.packages` Spark configuration. An application running with a private endpoint has to allow traffic from the private subnet to the internet to download the package (confirm with PM).
* Provide the Object Storage JAR locations in `--jars` or `spark.jars` as a comma-separated list.
* Use `python/structured_streaming_java_dependencies_for_python` to create `archive.zip`.
5. First, start the Structured Streaming app (StructuredKafkaWordCount) locally or in the cloud.
6. Then start the data producer, oss-producer-from-file.py (`python3 oss-producer-from-file.py`), locally or in the cloud.
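A hedged local-run sketch for steps 4 and 5 (the script name, package versions, and argument values are assumptions, not taken from this repository):

```sh
# Kafka connector pulled via --packages; adjust the Spark/Scala versions and
# the OCI SDK SASL add-on version to match your environment.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,com.oracle.oci.sdk:oci-java-sdk-addons-sasl:<version> \
  StructuredKafkaWordCount.py \
  --raw-stream <raw-stream-name> \
  --stream-username <stream-username>
```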

