
How to Create A Cloud Dataflow Pipeline Using Java and Apache Maven

https://medium.com/google-cloud/setting-up-a-java-development-environment-for-apache-beam-on-google-cloud-platform-ec0c6c9fbb39

Cloud Dataflow is a managed service for executing a wide variety of data processing patterns.

This post will explain how to create a simple Maven project with the Apache Beam SDK in order to run a pipeline on the Google Cloud Dataflow service. One advantage of using Maven is that it lets you manage external dependencies for the Java project, making it ideal for automation processes.

This project executes a very simple example: the two strings “Hello” and “World” are the inputs, they are transformed to upper case on GCP Dataflow, and the output is printed to the console log.

Disclaimer: The purpose of this post is to present the steps to create a data pipeline using Dataflow on GCP; Java code syntax is not going to be discussed and is beyond this scope. I hope to write some specific tutorials on this in the future.

0. Pre-requisites

In order to work you will need to enable the APIs, create a Cloud Storage bucket, set up authentication, and set the Google Application credentials.

  • Create a Bucket: This bucket will contain the JAR files and temporary files if necessary.

  • Set up authentication: On APIs & Services -> Credentials -> Create Credentials -> Service Account Key


a. For the Service Account option, select New Service Account.

b. Enter a name for the service account; in this case it will be dataflow-service.

c. The role will be Owner.

  • Set Google Application credentials: With the JSON file previously downloaded, which contains the service account key, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of that file:
export GOOGLE_APPLICATION_CREDENTIALS="my/path/dataflow-test.json"

If you don’t set the Google Application credentials properly you might not be able to access the Google Cloud Storage buckets and will probably see the following error:

An exception occured while executing the Java class. Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions): InvocationTargetException: DataflowRunner requires gcpTempLocation, but failed to retrieve a value from PipelineOptions: Error constructing default value for gcpTempLocation: tempLocation is not a valid GCS path …
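For reference, the console steps above can also be scripted with the gcloud and gsutil command-line tools. The following is only a sketch: the project ID dataflow-test-227715, the bucket example-dataflow-stage, the service account dataflow-service and the key path are the example values used throughout this post and should be replaced with your own.

gcloud services enable dataflow.googleapis.com

gsutil mb gs://example-dataflow-stage/

gcloud iam service-accounts create dataflow-service

gcloud projects add-iam-policy-binding dataflow-test-227715 \
--member="serviceAccount:dataflow-service@dataflow-test-227715.iam.gserviceaccount.com" \
--role="roles/owner"

gcloud iam service-accounts keys create my/path/dataflow-test.json \
--iam-account=dataflow-service@dataflow-test-227715.iam.gserviceaccount.com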

1. Use the Java Dataflow archetype

The Maven Archetype Plugin allows the user to create a Maven project from an existing template called an archetype.

The following command generates a new project from google-cloud-dataflow-java-archetypes-starter

mvn archetype:generate \
-DarchetypeArtifactId=google-cloud-dataflow-java-archetypes-starter \
-DarchetypeGroupId=com.google.cloud.dataflow \
-DgroupId=com.click.example \
-DartifactId=dataflow-example \
-Dversion="[1.0.0,2.0.0]" \
-DinteractiveMode=false

This command will generate an example Java class named StarterPipeline.java that contains the Apache Beam code defining the pipeline steps.
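For context, the generated StarterPipeline.java is roughly along the lines of the following simplified sketch (details may differ slightly depending on the archetype version): it builds a pipeline from the command-line options, feeds it the two hard-coded strings, upper-cases them with a MapElements transform and logs the result.

package com.click.example;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StarterPipeline {
  private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

  public static void main(String[] args) {
    // Build the pipeline from the command-line arguments (--project, --runner, ...).
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(Create.of("Hello", "World"))
        // Transform each input string to upper case.
        .apply(MapElements.via(new SimpleFunction<String, String>() {
          @Override
          public String apply(String input) {
            return input.toUpperCase();
          }
        }))
        // Write each transformed element to the log.
        .apply(ParDo.of(new DoFn<String, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            LOG.info(c.element());
          }
        }));

    p.run();
  }
}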

2. Run Java main from Maven

To compile and run the main method of the Java class with arguments, you need to execute the following command.

mvn compile exec:java -e \
-Dexec.mainClass=com.click.example.StarterPipeline \
-Dexec.args="--project=dataflow-test-227715 \
--stagingLocation=gs://example-dataflow-stage/staging/ \
--tempLocation=gs://example-dataflow-stage/temp/ \
--runner=DataflowRunner"
  • Arguments:

--project: The project ID, in this case dataflow-test-227715.

--stagingLocation: The staging folder in a GCP bucket.

--tempLocation: The temp folder location in a GCP bucket.

--runner: Set to DataflowRunner to run on GCP.
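For completeness, the following is a minimal sketch of how these arguments surface inside the Java code: Beam parses them with PipelineOptionsFactory into a DataflowPipelineOptions object, and the same values could also be set programmatically. The class name OptionsExample is hypothetical; the project and bucket values are the example ones from this post.

package com.click.example;

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OptionsExample {
  public static void main(String[] args) {
    // --project, --stagingLocation, --tempLocation and --runner are parsed from the arguments.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Equivalent programmatic configuration, using the example values from this post.
    options.setProject("dataflow-test-227715");
    options.setStagingLocation("gs://example-dataflow-stage/staging/");
    options.setTempLocation("gs://example-dataflow-stage/temp/");
    options.setRunner(DataflowRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... apply the same transforms as in StarterPipeline ...
    p.run();
  }
}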

3. Check Job is created

Go to the Dataflow dashboard and you should see a new job created and running.

4. Open Job

You should see the different steps and, when the job finishes, the words ‘HELLO’ and ‘WORLD’ in upper case in the console log.
