Commit
refactoring code and updated the document (#46)
* test

* removing test file

* adding Delta Lake documentation

* ODI and Data Flow Integration

* ODI and Data Flow Integration

* ODI and Data Flow Integration

* Delta Lake integration documentation and sample code.

* Delta Lake integration documentation and sample code.

* Delta Lake integration documentation and sample code.

* updating the link to the source files

* Iceberg integration and document details

* Iceberg integration spark submit details

* refactoring code and updated the document

* updating the dependency packager documentation.

* updating the dependency packager documentation.

---------

Co-authored-by: Mohan Kumar <[email protected]>
mohanln-cloud and Mohan Kumar authored Mar 30, 2023
1 parent 68b9dd3 commit e2bb85e
Showing 3 changed files with 15 additions and 18 deletions.
14 changes: 1 addition & 13 deletions iceberg/python/iceberg_sample.py
@@ -42,19 +42,6 @@


sqlContext = SQLContext(spark.sparkContext)
# Create a Vertex DataFrame with unique ID column "id"
v = sqlContext.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame

csvFilePath = sys.argv[1]
icebergTablePath = sys.argv[2]
@@ -73,6 +60,7 @@
df1.show()

print("\nHistory for snapshot-id" + icebergTablePath)
# provide valid snapshot id from the metadata.json file
df2 = spark.read.format("iceberg").option("snapshot-id", 3004923173007188796).load(icebergTablePath)
df2.show()

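The new comment above points to the table's metadata.json for a valid snapshot id. As a hedged alternative sketch (assuming the Iceberg runtime bundled with the sample supports path-based metadata tables), the available snapshot ids can also be listed from Spark:

```python
# Sketch: list snapshot ids of a path-based (Hadoop) Iceberg table.
# Assumes `spark` and `icebergTablePath` are the objects defined earlier in the
# sample, and that the "#snapshots" metadata-table suffix is supported by the
# Iceberg version in use.
snapshots_df = spark.read.format("iceberg").load(icebergTablePath + "#snapshots")
snapshots_df.select("snapshot_id", "committed_at", "operation").show(truncate=False)
```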
@@ -3,7 +3,12 @@ This is a Maven project to build an uber (all in one) JAR that contains all the

See [Best Practices for Building Python Spark Streaming Applications](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/spark-streaming.htm#streaming-build-python-app-tips)
# Content
Fully includes
Provide the dependency using any of the suitable options below.

* Use the `--packages` option or the `spark.jars.packages` Spark configuration. An application running with a private endpoint has to allow traffic from the private subnet to the internet to download the package (confirm with PM).
* Provide the Object Storage JAR locations in `--jars` or `spark.jars` as a comma-separated list.
* Create a dependency `archive.zip`.
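For illustration, a hedged sketch of the first two options at submit time (the version and Object Storage path are placeholders, not values taken from this repository):

```sh
# Option 1: let Spark resolve the dependency from Maven Central (requires internet access).
spark-submit --packages com.oracle.oci.sdk:oci-java-sdk-addons-sasl:<version> ...

# Option 2: point --jars at a JAR already uploaded to Object Storage.
spark-submit --jars oci://<bucket>@<namespace>/jars/oci-java-sdk-addons-sasl-<version>.jar ...
```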

```sh
groupId = com.oracle.oci.sdk
artifactId = oci-java-sdk-addons-sasl
@@ -34,9 +39,9 @@ Relocates namespaces
* com.google -> com.shaded.google
* com.oracle.bmc -> com.shaded.oracle.bmc (except com.oracle.bmc.auth.sasl.*)

# Instructions
1. Compile an uber jar.
2. Pack uber jar into archive.zip
# Instructions to create an application jar
1. Provide the dependency using any of the suitable options above.
2. Compile the source code and create an application JAR (for the fat-JAR case, pack the uber JAR into `archive.zip`, as sketched below).
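A minimal sketch of step 2 for the fat-JAR case (the JAR name is hypothetical; check the Data Flow dependency packager documentation for the exact layout expected inside `archive.zip`):

```sh
# Build the uber JAR, then pack it into archive.zip.
mvn package
zip archive.zip target/<uber-jar-name>.jar
```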

## To Compile
```sh
6 changes: 5 additions & 1 deletion python/structured_streaming_kafka_word_count/README.md
@@ -40,7 +40,11 @@ optional arguments:
--raw-stream RAW_STREAM
--stream-username STREAM_USERNAME
```
4. Using `python/structured_streaming_java_dependencies_for_python` create `archive.zip`
4. Provide your dependency using any of the suitable options below (see the sketch after these steps).

* Use the `--packages` option or the `spark.jars.packages` Spark configuration. An application running with a private endpoint has to allow traffic from the private subnet to the internet to download the package (confirm with PM).
* Provide the Object Storage JAR locations in `--jars` or `spark.jars` as a comma-separated list.
* Use `python/structured_streaming_java_dependencies_for_python` to create `archive.zip`.
5. First, start the Structured Streaming app (StructuredKafkaWordCount) locally or in the cloud.
6. Then start the data producer, oss-producer-from-file.py (`python3 oss-producer-from-file.py`), locally or in the cloud.
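A hedged local-run sketch for steps 4 and 5 (the script name, package versions, and argument values are assumptions, not taken from this repository):

```sh
# Kafka connector pulled via --packages; adjust the Spark/Scala versions and
# the OCI SDK SASL add-on version to match your environment.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,com.oracle.oci.sdk:oci-java-sdk-addons-sasl:<version> \
  StructuredKafkaWordCount.py \
  --raw-stream <raw-stream-name> \
  --stream-username <stream-username>
```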

