griddynamics/gcp-dataflow-evaluation

Dataflow vs Apache Spark Benchmark

Running Dataflow

mvn -Pdataflow-runner compile exec:java \
      -Dexec.mainClass=org.apache.beam.examples.PipelineJavaJSON \
      -Dexec.args="--project=gcp-project \
                  --gcpTempLocation=gs://TEMPORARY_STORAGE \
                  --runner=DataflowRunner \
                  --region=europe-west4"
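
The option string passed through -Dexec.args can be assembled from shell variables, so the project, temporary bucket, and region are set in one place. This is a minimal sketch: the variable names PROJECT_ID, TEMP_LOCATION, REGION, and EXEC_ARGS are illustrative and not part of the repository, and the values are the placeholders from the command above.

```shell
# Hypothetical variable names; values are the placeholders used in this README.
PROJECT_ID="gcp-project"
TEMP_LOCATION="gs://TEMPORARY_STORAGE"
REGION="europe-west4"

# Build the pipeline options as one space-separated string.
EXEC_ARGS="--project=${PROJECT_ID} --gcpTempLocation=${TEMP_LOCATION} --runner=DataflowRunner --region=${REGION}"

# Print the resulting Maven invocation for inspection instead of running it.
printf 'mvn -Pdataflow-runner compile exec:java -Dexec.mainClass=org.apache.beam.examples.PipelineJavaJSON "-Dexec.args=%s"\n' "${EXEC_ARGS}"
```

Quoting the whole -Dexec.args=... argument keeps the options together as a single value when the printed command is pasted into a shell.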

Running a Spark job on a Dataproc cluster

Creating a cluster

gcloud dataproc clusters create cluster_name \
    --image-version=2.0 \
    --region=europe-west4 \
    --enable-component-gateway \
    --master-machine-type=n2-standard-8 \
    --num-workers=2 \
    --worker-machine-type=n2-standard-8 \
    --properties="${PROPERTIES}" \
    --optional-components=DOCKER
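
The create command reads PROPERTIES from the environment, and this README does not show its value. Dataproc expects cluster properties as a comma-separated list of prefix:property=value pairs. A sketch with assumed Spark settings (the values below are illustrative, not the ones used in the benchmark):

```shell
# Assumed example values; the benchmark's actual properties are not given in this README.
PROPERTIES="spark:spark.executor.memory=4g,spark:spark.executor.cores=2"
export PROPERTIES

# Show the flag as it would be expanded in the create command.
echo "--properties=${PROPERTIES}"
```

Set the variable in the same shell session before running the create command so that the flag expands to a non-empty value.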

Submitting a PySpark job to the cluster

gcloud dataproc jobs submit pyspark \
    --cluster=cluster_name \
    --region=europe-west4 \
    spark_pipeline_avro.py \
    --properties spark.jars.packages='org.apache.spark:spark-avro_2.12:3.1.3'
