
Apache Spark on GCP Dataproc


Cluster and Apache Spark configuration recommendations require the Apache Spark event logs.

To store the event logs in an accessible Google Cloud Storage location, add specific properties when creating the cluster.
The example below uses the gcloud command line; cluster-name, region, bucket-name, and directory-name are placeholders:

    gcloud dataproc clusters create cluster-name \
        --region=region \
        --image-version=1.4-debian10 \
        --enable-component-gateway \
        --properties='dataproc:job.history.to-gcs.enabled=true,
    spark:spark.history.fs.logDirectory=gs://bucket-name/directory-name/spark-job-history,
    spark:spark.eventLog.dir=gs://bucket-name/directory-name/spark-job-history,
    mapred:mapreduce.jobhistory.done-dir=gs://bucket-name/directory-name/mapreduce-job-history/done,
    mapred:mapreduce.jobhistory.intermediate-done-dir=gs://bucket-name/directory-name/mapreduce-job-history/intermediate-done'
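
After the cluster is created with these properties, one way to confirm that event logs are being written is to run a sample Spark job and then list the configured log directory. This is a minimal sketch using the same placeholders (cluster-name, region, bucket-name, directory-name) as above:

    # Submit the bundled SparkPi example to the cluster
    # (assumes the default Dataproc image layout for the examples jar).
    gcloud dataproc jobs submit spark \
        --cluster=cluster-name \
        --region=region \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000

    # List the event log directory configured via spark.eventLog.dir.
    gsutil ls gs://bucket-name/directory-name/spark-job-history/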

See the guidance in the GCP Dataproc documentation [here](https://cloud.google.com/dataproc/docs/concepts/jobs/history-server#set_up_a_job_cluster).
