Creating a Standalone Spark Application in Scala: A Step-by-Step Guide with Twitter Streaming Example
By Prabeesh Keezhathra
This blog post will guide you through the process of building a Spark application in Scala that calculates popular hashtags from a Twitter stream. You will also learn how to use the sbt eclipse plugin to run the application in the Eclipse Integrated Development Environment (IDE). Whether you are new to big data processing or looking to improve your skills in data enginering and analytics, this tutorial has something to offer. Follow along with our step-by-step guide to develop your own stand alone Spark application and enhance your abilities in this exciting field.
Sharing some ideas about how to create a Spark-streaming stand-alone application and how to run the Spark applications in scala-SDK (Eclipse IDE).
Building Spark Application using SBT
A Standalone application in Scala using Apache Spark API. The application is build using Simple Build Tool(SBT).
For creating a stand alone app take the twitter popular tag example
This program calculates popular hashtags (popular topics) over sliding 10 and 60 second windows from a Twitter stream. The stream is instantiated with credentials and optionally filters supplied by the command line arguments.
But here modified the code for talking twitter authentication credentials through command line argument. So it needs to give the arguments as
master
consumerKey
consumerSecret
accessToken
accessTokenSecret
filters
.
// Twitter Authentication credentials
System.setProperty("twitter4j.oauth.consumerKey", args(1))
System.setProperty("twitter4j.oauth.consumerSecret", args(2))
System.setProperty("twitter4j.oauth.accessToken", args(3))
System.setProperty("twitter4j.oauth.accessTokenSecret", args(4))
If you want to read twitter authentication credential from file, refer this link
The sbt configuration file. For more detail about sbt refer
name := "TwitterPopularTags"
version := "0.1.0"
scalaVersion := "2.10.3"
libraryDependencies ++= Seq("org.apache.spark" %%
"spark-core" % "0.9.0-incubating",
"org.apache.spark" %% "spark-streaming" % "0.9.0-incubating",
"org.apache.spark" %% "spark-streaming-twitter" % "0.9.0-incubating")
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
You can find the project from here ##Spark programming in Eclipse Using sbt eclipse plugin, sbt project can run on Eclipse IDE. For more details find here
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.1.0")
then run from the root folder of the project
sbt/sbt eclipse
This command creates a project compatible with Eclipse. Upon opening the eclipse IDE this project can now be imported and the executed with the spark.
You can find the sbt eclipse project from here
To avoid generating eclipse source entries for the java directories and put all libs in the lib_managed directory, that way we can distribute eclipse project files, for this - add the contents to build.sbt
/*put all libs in the lib_managed directory,
that way we can distribute eclipse project files
*/
retrieveManaged := true
EclipseKeys.relativizeLibs := true
// Avoid generating eclipse source entries for the java directories
(unmanagedSourceDirectories in Compile) <<= (scalaSource in Compile)(Seq(_))
(unmanagedSourceDirectories in Test) <<= (scalaSource in Test)(Seq(_))
I hope that this tutorial has provided you with the knowledge and resources needed to create your own standalone Spark application in Scala. By following the steps outlined in this blog post, you should now be able to build a Spark application that calculates popular hashtags from a Twitter stream and authenticate with Twitter credentials. You should also have the skills to use the sbt eclipse plugin to run the application in the Eclipse IDE. As you continue to learn and grow in the field of big data processing, it is important to remember to keep practicing and experimenting with different techniques and tools. With time and dedication, you can become a proficient data engineer and be able to tackle even the most complex data challenges.