It's easy to get started with Apache Spark. You can get a template for a Scala job using the Typesafe Activator and have it running on a local cluster with a small dataset. You can also use the handy spark_ec2 script to launch an EC2 cluster, as detailed in the Running Spark on EC2 documentation. You can then log into the new cluster, launch a PySpark or Scala shell, and use it to explore the Spark API.
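For reference, a minimal launch from the ec2/ directory of a Spark distribution might look like the following; the keypair name, identity file, cluster size, and cluster name are placeholders, and the exact flags can vary between Spark versions:

./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 5 launch my-spark-cluster

Once the launch finishes, ./spark-ec2 -k my-keypair -i ~/my-keypair.pem login my-spark-cluster should drop you into a shell on the master.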
Running a production Spark cluster is not as trivial, though. In this post I'll outline some additional steps, beyond what is documented, for running a simple job on an ephemeral EC2 cluster in a manner similar to how one would use EMR. Here are some tips based on my experience.
Provide S3 credentials
A typical EMR run starts with an input data set resting on S3, with EC2 instances spun up to perform the analysis and store the results back into S3. Spark 1.1 and earlier versions, however, install without support for S3 URLs. To provide it, modify the file /root/ephemeral-hdfs/conf/core-site.xml and add the following elements to the XML configuration:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>{access_key}</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>{secret_key}</value>
</property>
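Alternatively (my own variation, not part of the stock spark-ec2 setup), the same keys can be set programmatically on the SparkContext's Hadoop configuration, so you don't have to edit core-site.xml by hand; the environment variable names below are an assumption:

import org.apache.spark.{SparkConf, SparkContext}

val spark = new SparkContext(new SparkConf())
// Read the keys from the environment instead of hard-coding them in the job.
spark.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
spark.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))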
Now you can load S3 data sources with a simple statement:
val data = spark.textFile("s3n://bucket/key/*")
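To complete the EMR-style round trip described earlier, results can be written back to S3 the same way; the output bucket and key below are placeholders, and the word count itself is just an illustrative computation:

// Count words in the S3 input and store the result back into S3.
val counts = data.flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("s3n://bucket/output/")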
Change temporary files location
By default Spark keeps temporary files in /tmp. On the EC2 instances this directory is mapped to an in-memory partition, so it can fill up fairly quickly. Additionally, if you checkpoint large results of your computations, you need to provide a larger drive to the instances you start up.
Start your cluster with the extra option --ebs-vol-size 100 to attach an additional EBS volume; a sample launch invocation is sketched after the code below. Then, in your job code, point Spark at the new volume, which is mounted under /vol/:
import org.apache.spark.{SparkConf, SparkContext}

// Keep scratch space, shuffle spills, and checkpoints on the EBS volume.
val conf = new SparkConf().set("spark.local.dir", "/vol/")
val spark = new SparkContext(conf)
spark.setCheckpointDir("/vol/")
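For reference, the corresponding launch command with the EBS option added might look like this (again, the keypair, identity file, and cluster name are placeholders):

./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 5 --ebs-vol-size 100 launch my-spark-cluster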
Use sudo to submit jobs
Finally, you want to submit a job to your cluster. Generate a fat JAR by running sbt assembly in your Scala job project. Upload the JAR to the master instance using scp (an example invocation appears at the end of this section). Then log into the cluster and use the spark-submit script to launch the job:
sudo /root/spark/bin/spark-submit --master spark://localhost:7077 --class MyApp --name MyApp /root/MyApp-1.0.0.jar arg0 arg1 arg2...
Make sure to call it using sudo to avoid failures due to missing permissions.
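As a reference for the upload step mentioned above, the scp invocation might look like the following; the identity file, local JAR path, and master address are placeholders (the local path depends on your sbt-assembly settings, and the master address can be retrieved with the spark-ec2 get-master action):

scp -i ~/my-keypair.pem target/scala-2.10/MyApp-1.0.0.jar root@&lt;master-public-dns&gt;:/root/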
Some future improvements could be made to the way the Spark cluster is launched. The spark_ec2 script works well for small clusters of under 10 nodes, but becomes unstable for larger ones. A failure of one machine to spin up correctly (not unusual on EC2) will fail the entire launch and force you to destroy the cluster and start over. Furthermore, the script is largely sequential, while machine configuration would ideally happen in parallel. I'm not aware of a solution to these issues at the moment, but I'm exploring some options. Stay tuned!
If there are any other things you had to do to get your job running, feel free to note them in the comments.