Apache Spark on GCP Dataproc
Obtaining cluster and Apache Spark configuration recommendations requires the Apache Spark event logs.
Storing the event logs in an accessible Google Cloud Storage bucket requires adding specific properties during cluster creation.
The example below uses the gcloud command line:
```
gcloud dataproc clusters create cluster-name \
    --region=region \
    --image-version=1.4-debian10 \
    --enable-component-gateway \
    --properties='dataproc:job.history.to-gcs.enabled=true,
spark:spark.history.fs.logDirectory=gs://bucket-name/directory-name/spark-job-history,
spark:spark.eventLog.dir=gs://bucket-name/directory-name/spark-job-history,
mapred:mapreduce.jobhistory.done-dir=gs://bucket-name/directory-name/mapreduce-job-history/done,
mapred:mapreduce.jobhistory.intermediate-done-dir=gs://bucket-name/directory-name/mapreduce-job-history/intermediate-done'
```
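As a quick sanity check after a Spark job completes, you can list the event log directory to confirm the logs are being written to Cloud Storage. This is a minimal sketch using gsutil with the same placeholder bucket and directory names as the example above:

```
# List Spark event logs in the configured spark.eventLog.dir
# (bucket-name and directory-name are placeholders from the example above)
gsutil ls gs://bucket-name/directory-name/spark-job-history/
```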
See guidance in the GCP Dataproc documentation [here](https://cloud.google.com/dataproc/docs/concepts/jobs/history-server#set_up_a_job_cluster).
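To collect the event logs needed for the configuration recommendations, one option is to copy them out of the bucket with gsutil. Spark names each event log file after its application ID, so the wildcard pattern below is an assumption about the file names, and the paths reuse the placeholders from the cluster-creation example:

```
# Copy the Spark event logs locally for analysis
# (bucket, directory, and file-name pattern are placeholders)
gsutil cp gs://bucket-name/directory-name/spark-job-history/application_* .
```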