This example shows you how to use OCI Data Flow to process data in OCI Object Storage and save the results to Oracle Autonomous Data Warehouse (ADW) or Autonomous Transaction Processing (ATP).
Before you begin:
- Ensure your tenancy is configured for Data Flow by following the setup instructions in the Data Flow documentation.
- Provision an ADW or ATP instance.
- Create a wallet for your ADW/ATP instance.
- Store the wallet password in a secret within the OCI Secrets Service (see the CLI example after this list).
- Download the Oracle JDBC driver (version 19c) from the Oracle JDBC driver downloads page.
- Note: use the Java 8 version (ojdbc8) for compatibility with the Data Flow runtime.
- Extract the driver into a directory called ojdbc.
- (Optional but strongly recommended): Install Spark locally so you can test your code before deploying it to Data Flow.
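For example, assuming you already have a vault and a master encryption key, the wallet password can be stored as a secret with the OCI CLI. All names and OCIDs below are placeholders:

oci vault secret create-base64 \
--compartment-id <compartment_ocid> \
--vault-id <vault_ocid> \
--key-id <key_ocid> \
--secret-name adw-wallet-password \
--secret-content-content $(echo -n '<wallet_password>' | base64)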
Upload a sample CSV file to OCI object store.
Customize loadadw.py with the following settings (a sketch of these values appears after this list):
- Set INPUT_PATH to the OCI path of your CSV data.
- Set PASSWORD_SECRET_OCID to the OCID of the secret created during Required Setup.
- Set TARGET_TABLE to the table in ADW where data is to be written.
- Set TNSNAME to a TNS name valid for the database.
- Set USER to the user who generated the wallet file.
- Set WALLET_PATH to the path on object store for the wallet.
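For reference, the configuration block at the top of loadadw.py might look like the following; every value shown is a placeholder, not a working default:

# Placeholder values; replace each with your own before running.
INPUT_PATH = "oci://<bucket>@<namespace>/sample.csv"            # CSV data to load
PASSWORD_SECRET_OCID = "ocid1.vaultsecret.oc1..<unique_id>"     # secret holding the wallet password
TARGET_TABLE = "SAMPLE_TABLE"                                   # destination table in ADW
TNSNAME = "<dbname>_medium"                                     # TNS name from the wallet's tnsnames.ora
USER = "ADMIN"                                                  # user who generated the wallet
WALLET_PATH = "oci://<bucket>@<namespace>/Wallet_<dbname>.zip"  # wallet file on object store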
Test the application locally (recommended) using spark-submit:
spark-submit --jars ojdbc/ojdbc8.jar,ojdbc/ucp.jar,ojdbc/oraclepki.jar,ojdbc/osdt_cert.jar,ojdbc/osdt_core.jar loadadw.py
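For context, the core of a script like this reads the CSV into a DataFrame and writes it to ADW over JDBC. The following is a minimal sketch under those assumptions, not the exact contents of loadadw.py; it presumes the wallet has already been downloaded and unzipped to wallet_dir and the password fetched from the secret into password:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loadadw").getOrCreate()

# Read the input CSV (an Object Storage path, or a local path when testing).
df = spark.read.option("header", "true").csv(INPUT_PATH)

# Write to ADW through the Oracle JDBC thin driver; TNS_ADMIN must point at
# the directory containing the unzipped wallet files.
adw_url = "jdbc:oracle:thin:@{}?TNS_ADMIN={}".format(TNSNAME, wallet_dir)
df.write.jdbc(
    url=adw_url,
    table=TARGET_TABLE,
    mode="overwrite",
    properties={"user": USER, "password": password},
)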
- Create the Data Flow Dependencies Archive as follows:
docker pull phx.ocir.io/oracle/dataflow/dependency-packager:latest
docker run --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
- Confirm you have a file named archive.zip with the Oracle JDBC driver in it.
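To verify, you can list the archive contents, for example:
unzip -l archive.zip | grep ojdbc8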
- Copy loadadw.py to object store.
- Copy archive.zip to object store.
- Create a Data Flow Python application. Be sure to include archive.zip as the dependency archive.
- Refer to the Data Flow documentation for more information.
- Run the application.
Create a bucket (or re-use an existing one), then upload the application and its dependency archive with the OCI CLI:
oci os object put --bucket-name <bucket> --file loadadw.py
oci os object put --bucket-name <bucket> --file archive.zip
oci data-flow application create \
--compartment-id <compartment_ocid> \
--display-name "PySpark Load ADW" \
--driver-shape VM.Standard2.1 \
--executor-shape VM.Standard2.1 \
--num-executors 1 \
--spark-version 2.4.4 \
--file-uri oci://<bucket>@<namespace>/loadadw.py \
--archive-uri oci://<bucket>@<namespace>/archive.zip \
--language Python
oci data-flow run create \
--application-id <application_ocid> \
--compartment-id <compartment_ocid> \
--display-name "PySpark Load ADW"
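Once the run is created, you can check its status from the CLI using the run OCID returned by the create command:

oci data-flow run get --run-id <run_ocid>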