Spark – Concepts

In the previous post we saw how to run a sample Spark job in local mode. Let's try to understand each line of the application.

 

What are we trying to achieve here?
Our goal for the application is this: we have a huge data source, which can be a flat file like CSV or text, or any other database, and we want to load its contents into Spark's data abstraction, known as the RDD. Once we have the RDD, we can perform any kind of aggregations, calculations, or analytics using the RDD API.

 

Key Bullets:-
  • SparkConf
  • SparkContext
  • RDD

Discussions:-

In the first line of the application we specify the log file path. The path could also be kept in a config file, but for now we are just hard-coding it.
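Since the full listing lives in the previous post, here is a minimal sketch of that first line (the path and file name are just assumptions for illustration):

```scala
// Hard-coded path to the input log file (the actual path is machine-specific)
val logFile = "C:/data/sample.log"
```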

 
Then in the second line we create a SparkConf object. You can think of SparkConf as the configuration manager for a Spark application. As I said earlier, a Spark application can be deployed to a cluster manager like YARN or Mesos, or it can run locally. Every application also has a name and other settings. All of this information is specified on the SparkConf object, and the configuration is managed accordingly.
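A sketch of what that line typically looks like (the application name here is an assumption):

```scala
import org.apache.spark.SparkConf

// "local[*]" runs Spark locally using all available cores;
// on a cluster this would be a YARN or Mesos master URL instead.
val conf = new SparkConf()
  .setAppName("SampleSparkApp") // application name, shown later in the Spark UI
  .setMaster("local[*]")
```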
 
Then comes:-
As we are executing the Spark application on Windows, the winutils.exe binary needs to be present in some directory, and this is the directory where I have placed the file. If you ask why we need winutils.exe, the answer is that Spark is built to run on Linux machines, so to run it on a Windows machine we need some special binaries, and winutils.exe is the binary that helps the Spark app run on Windows as well.
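A sketch of that line; note that winutils.exe itself must sit inside a bin folder under the directory you point to, and the path shown here is just an assumption:

```scala
// Tell Hadoop where to find bin/winutils.exe on Windows
// (the directory is machine-specific).
System.setProperty("hadoop.home.dir", "C:/hadoop")
```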
Then we have the SparkContext object. A SparkContext is like the heart of any Spark application. It represents the connection to the cluster and is the main entry point of the application. After the context is created, the next step is creating an RDD. RDD is a vast topic by itself, so I will keep a separate post for discussing it. For now, just understand that the RDD is the data abstraction Spark provides over any type of data source, whether databases, files, or in-memory collections.
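Putting it together with the snippets above, a minimal sketch of creating the context and loading the file into an RDD (the count at the end is just a placeholder for whatever aggregation you need):

```scala
import org.apache.spark.SparkContext

// The context is built from the SparkConf above and represents
// the connection to the cluster (or to local mode).
val sc = new SparkContext(conf)

// textFile returns an RDD[String], one element per line of the file.
val lines = sc.textFile(logFile)

// A trivial action on the RDD: count the number of lines.
println(s"Line count: ${lines.count()}")
```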
Note that I have put a Thread.sleep(50000) in my program. This is because when a Spark application runs locally, it fires up the Spark UI, and as soon as your application finishes, the Spark UI is gone as well. So, to get time to check it, I have put a thread sleep. You can find the URL of the Spark UI in the console log.
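For reference, the end of the program looks roughly like this (the stop call is good practice, though not mentioned above):

```scala
// Keep the JVM alive for 50 seconds so the Spark UI
// (http://localhost:4040 by default in local mode) can be inspected.
Thread.sleep(50000)

// Shut the context down cleanly when done.
sc.stop()
```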
 
In the next tutorial we will talk about RDDs and the different operations they support. Till then... happy learning. :)
