Use the Python Client Library
Overview
This Cloud Shell walkthrough leads you through the steps to use the Cloud Client Libraries for Python to programmatically interact with Dataproc.
As you follow this walkthrough, you run Python code that calls Dataproc gRPC APIs to:
- Create a Dataproc cluster
- Submit a PySpark word sort job to the cluster
- Delete the cluster after job completion
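Each of these operations goes through a generated gRPC client rather than raw HTTP requests. As a minimal illustration (the region value is just an example), constructing a Dataproc cluster client looks like this; the full create/submit/delete flow is sketched later in the walkthrough:

```python
from google.cloud import dataproc_v1

# Dataproc's gRPC API is served from regional endpoints, so the client
# must be pointed at the endpoint for the cluster's region.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
```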
Using the walkthrough
The submit_job_to_cluster.py file used in this walkthrough is opened in the Cloud Shell editor when you launch the walkthrough. You can view the code as you follow the walkthrough steps.
For more information: See Use the Cloud Client Libraries for Python for an explanation of how the code works.
To reload this walkthrough: Run the following command from the ~/python-docs-samples/dataproc directory in Cloud Shell:

```bash
cloudshell launch-tutorial python-api-walkthrough.md
```
To copy and run commands: Click the "Copy to Cloud Shell" button on the side of a code box, then press Enter to run the command.
Prerequisites (1)

- Create or select a Google Cloud project to use for this tutorial.
- Enable the Dataproc, Compute Engine, and Cloud Storage APIs in your project.

```bash
gcloud services enable dataproc.googleapis.com \
    compute.googleapis.com \
    storage-component.googleapis.com \
    --project={{project_id}}
```
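The client libraries you'll run later authenticate through Application Default Credentials, which Cloud Shell provides automatically. As an optional sanity check (not part of the walkthrough files), you can confirm which project and credentials Python will pick up:

```python
import google.auth

# In Cloud Shell, Application Default Credentials come from your logged-in
# account, so no service account key file is needed.
credentials, project = google.auth.default()
print(f"Python client libraries will authenticate against project: {project}")
```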
Prerequisites (2)

- This walkthrough uploads a PySpark file (pyspark_sort.py) to a Cloud Storage bucket in your project.

  - You can use the Cloud Storage browser page in the Google Cloud console to view existing buckets in your project.

    OR

  - To create a new bucket, run the following command. Your bucket name must be unique. (A Python alternative is sketched after this list.)

    ```bash
    gsutil mb -p {{project_id}} gs://your-bucket-name
    ```

- Set environment variables.

  - Set the name of your bucket.

    ```bash
    BUCKET=your-bucket-name
    ```
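If you prefer to stay in Python, the same bucket can be created with the google-cloud-storage client library. This is a minimal sketch, not part of the walkthrough files; the project ID and bucket name are placeholders you must replace:

```python
from google.cloud import storage

# Placeholder values: substitute your project ID and a globally unique bucket name.
project_id = "your-project-id"
bucket_name = "your-bucket-name"

client = storage.Client(project=project_id)
# Raises google.api_core.exceptions.Conflict if the bucket name is already taken.
bucket = client.create_bucket(bucket_name, location="us-central1")
print(f"Created bucket {bucket.name}")
```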
Prerequisites (3)

- Set up a Python virtual environment.

  - Create the virtual environment.

    ```bash
    virtualenv ENV
    ```

  - Activate the virtual environment.

    ```bash
    source ENV/bin/activate
    ```

- Install library dependencies.

  ```bash
  pip install -r requirements.txt
  ```
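After the install finishes, a one-line import check confirms that the Dataproc and Cloud Storage client libraries (which requirements.txt is assumed to pin) are available in the virtual environment:

```python
# Run inside the activated virtualenv; an ImportError here means the
# dependency install did not complete.
from google.cloud import dataproc_v1, storage

print("google-cloud-dataproc and google-cloud-storage imported successfully")
```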
Create a cluster and submit a job

- Set a name for your new cluster.

  ```bash
  CLUSTER=new-cluster-name
  ```

- Set a region where your new cluster will be located. You can change the preset "us-central1" region before you copy and run the following command.

  ```bash
  REGION=us-central1
  ```

- Run submit_job_to_cluster.py to create a new cluster and run the pyspark_sort.py job on the cluster.

  ```bash
  python submit_job_to_cluster.py \
      --project_id={{project_id}} \
      --region=$REGION \
      --cluster_name=$CLUSTER \
      --gcs_bucket=$BUCKET
  ```
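Under the hood, submit_job_to_cluster.py drives the three steps listed in the overview through the dataproc_v1 gRPC clients. The condensed sketch below is not the sample's exact code; the placeholder values and machine types are illustrative, but the calls show the core pattern:

```python
from google.cloud import dataproc_v1

project_id, region = "your-project-id", "us-central1"      # placeholders
cluster_name, bucket = "new-cluster-name", "your-bucket-name"

# Dataproc clients must target the regional gRPC endpoint.
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Create the cluster; create_cluster returns a long-running operation,
#    and .result() blocks until the cluster is ready.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Submit the PySpark job and block until it finishes.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": f"gs://{bucket}/pyspark_sort.py"},
}
job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. Delete the cluster after job completion.
cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```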
Job Output

Job output displayed in the Cloud Shell terminal shows cluster creation, job completion, sorted job output, and then deletion of the cluster.

```
Cluster created successfully: cluster-name.
...
Job finished successfully.
...
['Hello,', 'dog', 'elephant', 'panther', 'world!']
...
Cluster cluster-name successfully deleted.
```
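The sorted list in the middle of that output is the driver output of pyspark_sort.py. Assuming the standard form of this sample (verify against the file uploaded to your bucket), it amounts to only a few lines:

```python
import pyspark

# Build a small RDD of words and print them in sorted order.
sc = pyspark.SparkContext()
rdd = sc.parallelize(["Hello,", "world!", "dog", "elephant", "panther"])
print(sorted(rdd.collect()))
```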
Congratulations on completing the walkthrough!

Next Steps:

- View job details in the Google Cloud console. Select the PySpark job name on the Dataproc Jobs page in the Google Cloud console.

- Delete resources used in the walkthrough. The submit_job_to_cluster.py code deletes the cluster that it created for this walkthrough. If you created a Cloud Storage bucket to use for this walkthrough, you can run the following command to delete the bucket (the bucket must be empty).

  ```bash
  gsutil rb gs://$BUCKET
  ```

  You can run the following command to delete the bucket and all objects within it. Note: the deleted objects cannot be recovered. (A Python alternative for bucket deletion is sketched after this list.)

  ```bash
  gsutil rm -r gs://$BUCKET
  ```

- For more information, see the Dataproc documentation for API reference and product feature information.
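If you'd rather clean up from Python, the google-cloud-storage client can remove the bucket and its contents in one call. A minimal sketch; as with gsutil rm -r, the deletion is unrecoverable:

```python
from google.cloud import storage

bucket_name = "your-bucket-name"  # placeholder: use the bucket you created

bucket = storage.Client().bucket(bucket_name)
# force=True first deletes the objects in the bucket (up to 256 of them),
# then deletes the bucket itself.
bucket.delete(force=True)
print(f"Deleted bucket {bucket_name}")
```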