Spark – Sample Application in Windows


In this post we will see how to run a sample Spark application in local mode on Windows. The main aim of this post is to kick-start a sample Spark application so that we become familiar with the concepts. So, rather than deploying an application in cluster mode straight away, it's better to run it first in our development environment and only then think about more complicated cases. For development, running a Spark job in local mode is the best choice.

Prerequisites:-

  • JDK 8.
  • I will be using Scala as the programming language, so I expect readers to have at least basic Scala knowledge.
    Scala should therefore be installed on the system.
  • IntelliJ IDEA as the IDE of my choice.
  • Maven as the dependency management tool.

Note:-
Spark is built mainly for Linux environments, so to run Spark on Windows we need the Hadoop binaries for Windows. You can download winutils.exe from the link below and put it somewhere on your local drive.
https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin
Once this is done, we are ready for our sample Spark project.
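
Hadoop looks for the file at bin\winutils.exe under its home directory, which it takes from the hadoop.home.dir system property or the HADOOP_HOME environment variable. The snippet below is only a small sketch for checking that layout before running a job. The object name WinutilsCheck and the default folder C:\hadoop are my own assumptions, so use whatever folder you actually picked.

```scala
import java.nio.file.{Files, Path, Paths}

object WinutilsCheck {
  // Builds <hadoopHome>\bin\winutils.exe, the location Hadoop expects.
  def winutilsPath(hadoopHome: String): Path =
    Paths.get(hadoopHome, "bin", "winutils.exe")

  // True only if winutils.exe exists as a regular file at that location.
  def winutilsPresent(hadoopHome: String): Boolean =
    Files.isRegularFile(winutilsPath(hadoopHome))

  def main(args: Array[String]): Unit = {
    // Assumption: C:\hadoop is a hypothetical default, pass your own folder as the first argument.
    val hadoopHome = args.headOption.getOrElse("C:\\hadoop")
    if (winutilsPresent(hadoopHome)) println(s"Found ${winutilsPath(hadoopHome)}")
    else println(s"Missing ${winutilsPath(hadoopHome)}")
  }
}
```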

Steps:-

1) Install JDK 8 and set the JAVA_HOME environment variable.
2) Install IntelliJ IDEA and create a new project as shown in the screenshots.

[Screenshots: 1.Spark_sample_app_1, 1.Spark_sample_app_2, 1.Spark_sample_app_3]

Follow the screenshots to create a sample Spark project. It is a Maven project, so once the project is created we have to add Scala language support to it. To do that, right-click on the project, go to Add Framework Support, select Scala, and click OK.
This adds the Scala language to the project.

[Screenshots: 1.Spark_sample_app_4, 1.Spark_sample_app_5]

The next step is to add all the Spark dependencies to the pom.xml file. I have captured all the dependencies that are needed to run a Spark project. Although we don’t need the Spark MLlib and SQL dependencies for now, I have added them for future use. You can remove them if you want.

Make sure you enable the “Auto import” feature in IDEA for the Maven project, which automatically downloads the dependencies as soon as you copy them into your pom.xml file. Downloading the required Spark binaries will take some time. Once everything is downloaded, you can start writing your job.

Create a scala directory under the source folder and mark it as a source root. Now create a package named job, create a Scala object named SparkJob in it, and paste the contents below.
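
The original listing is not reproduced here, so what follows is only a minimal sketch of what such a job could look like, not the exact code from the post. It assumes winutils.exe sits in C:\hadoop\bin and that the CSV was saved as C:\data\sample.csv. Both paths are hypothetical placeholders, so change them to your own locations. The job runs in local mode and simply counts the lines of the file.

```scala
package job

import org.apache.spark.{SparkConf, SparkContext}

object SparkJob {
  def main(args: Array[String]): Unit = {
    // Assumed path: the folder that contains bin\winutils.exe.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    // local[*] runs Spark inside this JVM using all available cores.
    val conf = new SparkConf()
      .setAppName("SparkJob")
      .setMaster("local[*]")

    val sc = new SparkContext(conf)
    try {
      // Hypothetical input path: point this at the CSV file you downloaded.
      val lines = sc.textFile("C:\\data\\sample.csv")
      println(s"Number of lines: ${lines.count()}")
    } finally {
      // Always release the local Spark resources, even if the job fails.
      sc.stop()
    }
  }
}
```

Setting the master in code keeps the example runnable straight from the IDE. For a real cluster deployment the master would normally be supplied from outside the code instead.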

Make sure that you have put winutils.exe in the path mentioned above, or else change the path to match. For our Spark job we need a huge file. I have downloaded a huge sample CSV file for analysis; you can download it too from the link below.
http://www.stats.govt.nz/tools_and_services/releases_csv_files/csv-files-for-nzstat.aspx

That's all for now. Happy coding. :)
