
Commit

Formatted ReadMe files
Anand Chandak committed May 21, 2021
1 parent dbba461 commit a6ff6ff
Showing 6 changed files with 161 additions and 61 deletions.
59 changes: 45 additions & 14 deletions CONTRIBUTING.md
@@ -1,24 +1,55 @@
# Contributing to oci-dataflow-samples
# Contributing to this repository

*Copyright (c) 2021, Oracle and/or its affiliates. All rights reserved.*
We welcome your contributions! There are multiple ways to contribute.

Pull requests can be made under
[The Oracle Contributor Agreement](https://www.oracle.com/technetwork/community/oca-486395.html)
(OCA).
## Opening issues

For pull requests to be accepted, the bottom of
your commit message must have the following line using your name and
e-mail address as it appears in the OCA Signatories list.
For bugs or enhancement requests, please file a GitHub issue unless it's
security related. When filing a bug remember that the better written the bug is,
the more likely it is to be fixed. If you think you've found a security
vulnerability, do not raise a GitHub issue and follow the instructions in our
[security policy](./SECURITY.md).

```
## Contributing code

We welcome your code contributions. Before submitting code via a pull request,
you will need to have signed the [Oracle Contributor Agreement][OCA] (OCA) and
your commits need to include the following line using the name and e-mail
address you used to sign the OCA:

```text
Signed-off-by: Your Name <[email protected]>
```

This can be automatically added to pull requests by committing with:
This can be automatically added to pull requests by committing with `--sign-off`
or `-s`, e.g.

```
```text
git commit --signoff
````
```

Only pull requests from committers that can be verified as having signed the OCA
can be accepted.

## Pull request process

1. Ensure there is an issue created to track and discuss the fix or enhancement
you intend to submit.
1. Fork this repository
1. Create a branch in your fork to implement the changes. We recommend using
the issue number as part of your branch name, e.g. `1234-fixes`
1. Ensure that any documentation is updated to reflect your change.
1. Ensure that any samples are updated if the base image has been changed.
1. Submit the pull request. *Do not leave the pull request blank*. Explain exactly
what your changes are meant to do and provide simple steps on how to validate
your changes. Ensure that you reference the issue you created as well.
1. We will assign the pull request to 2-3 people for review before it is merged.

## Code of conduct

Follow the [Golden Rule](https://en.wikipedia.org/wiki/Golden_Rule). If you'd
like more specific guidelines, see the [Contributor Covenant Code of Conduct][COC].

Only pull requests from committers that can be verified as having
signed the OCA can be accepted.
[OCA]: https://oca.opensource.oracle.com
[COC]: https://www.contributor-covenant.org/version/1/4/code-of-conduct/
48 changes: 35 additions & 13 deletions README.md
@@ -1,37 +1,59 @@
# Oracle Cloud Infrastructure Data Flow Samples

This repository provides examples demonstrating how to use Oracle Cloud Infrastructure Data Flow, a service that lets you run any Apache Spark Application at any scale with no infrastructure to deploy or manage.

This repository provides examples demonstrating how to use Oracle Cloud Infrastructure Data Flow.
## What is Oracle Cloud Infrastructure Data Flow

## Setup
* [Quick start](https://docs.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm)
Data Flow is a cloud-based serverless platform with a rich user interface. It allows Spark developers and data scientists to create, edit, and run Spark jobs at any scale without the need for clusters, an operations team, or highly specialized Spark knowledge. Being serverless means there is no infrastructure for you to deploy or manage. It is entirely driven by REST APIs, giving you easy integration with applications or workflows. You can:

* Connect to Apache Spark data sources.

## How To
| Description | Python |
|------------------------------------------------------|:------:|
| CSV to Parquet |[sample](./python/csv_to_parquet)|
| Load to ADW |[sample](./python/loadadw)|
* Create reusable Apache Spark applications.

* Launch Apache Spark jobs in seconds.

For step-by-step instructions, see the README.txt files included with
* Manage all Apache Spark applications from a single platform.

* Process data in the Cloud or on-premises in your data center.

* Create Big Data building blocks that you can easily assemble into advanced Big Data applications.
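
Because the service is driven by REST APIs, you can also work with it programmatically. The snippet below is a minimal sketch using the OCI SDK for Python, assuming the `oci` package is installed and a default config profile exists; the compartment OCID is a placeholder.

```python
import oci

# Assumes ~/.oci/config contains a valid [DEFAULT] profile.
config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

# List the Data Flow applications in a compartment (placeholder OCID).
response = client.list_applications(
    compartment_id="ocid1.compartment.oc1..<unique_id>")
for app in response.data:
    print(app.display_name, app.id)
```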

## Before you Begin

* You must have set up your tenancy and be able to access Data Flow.

* Set up your tenancy: Before Data Flow can run, you must grant permissions that allow effective log capture and run management. See the [Set Up Administration](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm#set_up_admin) section of the Data Flow Service Guide and follow the instructions there.
* Access Data Flow: See [Access Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/data-flow-tutorial/getting-started/dfs_tut_get_started.htm#access_ui) for how to reach the service.

## Samples

| Example | Description | Python |
|-------------------|:-----------:|:------:|
| CSV to Parquet |This application shows how to use PySpark to convert CSV data stored in OCI Object Store to Apache Parquet format, which is then written back to Object Store. |[sample](./python/csv_to_parquet)|
| Load to ADW |This application shows how to read a file from OCI Object Store, perform some transformations, and write the results to an Autonomous Data Warehouse instance. |[sample](./python/loadadw)|

For step-by-step instructions, see the README files included with
each sample.

## Running the Samples:
## Running the Samples

These samples show how to use the OCI Data Flow service and are meant
to be deployed to and run from Oracle Cloud. You can optionally test
these applications locally before you deploy them. To test these
applications locally, Apache Spark needs to be installed.
these applications locally before you deploy them. When they are ready, you can deploy them to Data Flow without reconfiguring them, making code changes, or applying deployment profiles. To test these applications locally, Apache Spark must be installed. See [Set up locally](https://docs.oracle.com/en-us/iaas/data-flow/data-flow-tutorial/develop-apps-locally/front.htm) for the prerequisites to run an application locally.

## Install Spark

To install Spark, visit [spark.apache.org](https://spark.apache.org/docs/latest/api/python/getting_started/index.html)
and pick the installation path that best suits your environment.
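
Once Spark is installed, a quick sanity check is to start a local PySpark session. This sketch assumes the `pyspark` package is available on your machine and is not specific to any one sample.

```python
from pyspark.sql import SparkSession

# Start a local Spark session to confirm the installation works.
spark = SparkSession.builder \
    .appName("local-smoke-test") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)
spark.stop()
```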


## Documentation

You can find the online documentation for OCI Data Flow at [docs.oracle.com](https://docs.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm).

## Security

See [Security](./SECURITY.md)

## Contributing

See [CONTRIBUTING](./CONTRIBUTING.md)
39 changes: 39 additions & 0 deletions SECURITY.md
@@ -0,0 +1,39 @@
# Reporting security vulnerabilities

Oracle values the independent security research community and believes that
responsible disclosure of security vulnerabilities helps us ensure the security
and privacy of all our users.

Please do NOT raise a GitHub Issue to report a security vulnerability. If you
believe you have found a security vulnerability, please submit a report to
<mailto:[email protected]> preferably with a proof of concept. Please review
some additional information on [how to report security vulnerabilities to Oracle][1].
We encourage people who contact Oracle Security to use email encryption using
[our encryption key][2].

We ask that you do not use other channels or contact the project maintainers
directly.

Non-vulnerability related security issues including ideas for new or improved
security features are welcome on GitHub Issues.

## Security updates, alerts and bulletins

Security updates will be released on a regular cadence. Many of our projects
will typically release security fixes in conjunction with the
[Oracle Critical Patch Update][3] program. Security updates are released on the
Tuesday closest to the 17th day of January, April, July and October. A pre-release
announcement will be published on the Thursday preceding each release. Additional
information, including past advisories, is available on our [security alerts][3]
page.

## Security-related information

We will provide security related information such as a threat model, considerations
for secure use, or any known security issues in our documentation. Please note
that labs and sample code are intended to demonstrate a concept and may not be
sufficiently hardened for production use.

[1]: https://www.oracle.com/corporate/security-practices/assurance/vulnerability/reporting.html
[2]: https://www.oracle.com/security-alerts/encryptionkey.html
[3]: https://www.oracle.com/security-alerts/
27 changes: 15 additions & 12 deletions python/csv_to_parquet/README.md
@@ -1,18 +1,19 @@
# Convert CSV data to Parquet.
Sample to convert CSV data to Parquet.
# Convert CSV data to Parquet

The most common first step in data processing applications is to take data from some source and get it into a format that is suitable for reporting and other forms of analytics. In a database, you would load a flat file into the database and create indexes. In Spark, your first step is usually to clean and convert data from a text format into Parquet format. Parquet is an optimized binary format supporting efficient reads, making it ideal for reporting and analytics.

## Prerequisites
Before you begin:
![Convert CSV Data to Parquet](./images/csv_to_parquet.png)

* A - Ensure your tenant is configured according to the instructions [here](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#set_up_admin)
* B - Know your object store namespace.
* C - Know the OCID of a compartment where you want to load your data and create applications.
* D - (Optional, strongly recommended): Install Spark to test your code locally before deploying.
## Prerequisites

Before you begin:

* Ensure your tenant is configured according to the [set up admin](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#set_up_admin) instructions.
* Know your object store namespace.
* Know the OCID of a compartment where you want to load your data and create applications.
* (Optional, strongly recommended): Install Spark to test your code locally before deploying.

## Instructions:
## Instructions

1. Upload a sample CSV file to object store
2. Customize csv_to_parquet.py with the OCI path to your CSV data. The format is ```oci://<bucket>@<namespace>/path```
@@ -26,11 +27,11 @@ Before you begin:
7. Create a Python Data Flow application pointing to ```csv_to_parquet.py```
7a. Refer [here](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_data_flow_library.htm#create_pyspark_app)
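
The steps above customize ```csv_to_parquet.py``` with OCI Object Storage paths of the form ```oci://<bucket>@<namespace>/path```. As a rough illustration only (the actual sample may differ), the core conversion in PySpark looks something like this, with placeholder bucket and namespace values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

# Placeholder paths; replace <bucket> and <namespace> with your own values.
INPUT_PATH = "oci://<bucket>@<namespace>/input.csv"
OUTPUT_PATH = "oci://<bucket>@<namespace>/output_parquet"

# Read the CSV from Object Storage and write it back as Parquet.
df = spark.read.option("header", "true").csv(INPUT_PATH)
df.write.mode("overwrite").parquet(OUTPUT_PATH)
```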


## To use OCI CLI to run the PySpark Application

Create a bucket. Alternatively, you can reuse an existing bucket.
```

```sh
oci os bucket create --name <bucket> --compartment-id <compartment_ocid>
oci os object put --bucket-name <bucket> --file csv_to_parquet.py
oci data-flow application create \
@@ -43,8 +44,10 @@ oci data-flow application create \
--file-uri oci://<bucket>@<namespace>/csv_to_parquet.py \
--language Python
```

Make note of the Application ID produced.
```

```sh
oci data-flow run create \
--compartment-id <compartment_ocid> \
--application-id <application_ocid> \
Binary file added python/csv_to_parquet/images/csv_to_parquet.png
49 changes: 27 additions & 22 deletions python/loadadw/README.md
@@ -1,7 +1,9 @@
# Overview

This example shows you how to use OCI Data Flow to process data in OCI Object Store and save the results to Oracle ADW or ATP.

## Prerequisites

Before you begin:

1. Ensure your tenant is configured for Data Flow by following [instructions](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#set_up_admin)
@@ -13,49 +15,52 @@ Before you begin:
* Extract the driver into a directory called ojdbc.
6. (Optional, strongly recommended): Install Spark to test your code locally before deploying to Data Flow.

## Load Required Data:
## Load Required Data

Upload a sample CSV file to OCI object store.

## Application Setup:
## Application Setup

Customize ```loadadw.py``` with:
* Set INPUT_PATH to the OCI path of your CSV data.
* Set PASSWORD_SECRET_OCID to the OCID of the secret created during Required Setup.
* Set TARGET_TABLE to the table in ADW where data is to be written.
* Set TNSNAME to a TNS name valid for the database.
* Set USER to the user who generated the wallet file.
* Set WALLET_PATH to the path on object store for the wallet.

* Set INPUT_PATH to the OCI path of your CSV data.
* Set PASSWORD_SECRET_OCID to the OCID of the secret created during Required Setup.
* Set TARGET_TABLE to the table in ADW where data is to be written.
* Set TNSNAME to a TNS name valid for the database.
* Set USER to the user who generated the wallet file.
* Set WALLET_PATH to the path on object store for the wallet.
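
As an illustration of the settings above, the top of ```loadadw.py``` might contain a configuration block along these lines; all values shown are placeholders, and the real sample may organize them differently:

```python
# Placeholder values; substitute your own before running.
INPUT_PATH = "oci://<bucket>@<namespace>/input.csv"            # OCI path of your CSV data
PASSWORD_SECRET_OCID = "ocid1.vaultsecret.oc1..<unique_id>"    # secret created during Required Setup
TARGET_TABLE = "PROCESSED_DATA"                                # ADW table to write to
TNSNAME = "<dbname>_high"                                      # TNS name valid for the database
USER = "ADMIN"                                                 # user who generated the wallet file
WALLET_PATH = "oci://<bucket>@<namespace>/Wallet_<dbname>.zip" # Object Storage path of the wallet
```

In the sample, values like these feed the Spark connection that writes the transformed data to ADW.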

Test the Application Locally (recommended):
You can test the application locally using spark-submit:

```
```bash
spark-submit --jars ojdbc/ojdbc8.jar,ojdbc/ucp.jar,ojdbc/oraclepki.jar,ojdbc/osdt_cert.jar,ojdbc/osdt_core.jar loadadw.py
```

## Packaging your Application:
## Packaging your Application

1. Create the Data Flow Dependencies Archive as follows:
```
* Create the Data Flow Dependencies Archive as follows:

```bash
docker pull phx.ocir.io/oracle/dataflow/dependency-packager:latest
docker run --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
```
2. Confirm you have a file named **archive.zip** with the Oracle JDBC driver in it.
```

## Deploy and Run the Application:
* Confirm you have a file named **archive.zip** with the Oracle JDBC driver in it.

1. Copy loadadw.py to object store.
2. Copy archive.zip to object store.
3. Create a Data Flow Python application. Be sure to include archive.zip as the dependency archive.
* Refer [here](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_data_flow_library.htm#create_pyspark_app) for more information.
4. Run the application.
## Deploy and Run the Application

* Copy loadadw.py to object store.
* Copy archive.zip to object store.
* Create a Data Flow Python application. Be sure to include archive.zip as the dependency archive.
* Refer [here](https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_data_flow_library.htm#create_pyspark_app) for more information.
* Run the application.

# Deploy and Run the Application using OCI Cloud Shell or OCI CLI
## Run the Application using OCI Cloud Shell or OCI CLI

Create a bucket. Alternatively, you can reuse an existing bucket.
```

```sh
oci os object put --bucket-name <bucket> --file loadadw.py
oci os object put --bucket-name <bucket> --file archive.zip
oci data-flow application create \
