
Would you like to step through your Spark job in a debugger? These steps show you how to configure IntelliJ IDEA to allow just that.

Unlike a traditional Java or Scala application, a Spark job expects to run inside a larger Spark application that provides a SparkContext. Your code interacts with the cluster environment through that SparkContext object. Because of this constraint you can’t simply launch your Spark job from the IDE and expect it to run correctly.

The steps I outline describe how to launch a debugging session for a Scala application; very similar steps apply to Java applications. PySpark applications interact with the Spark cluster in a more involved way, so I doubt this procedure applies to Python.

First you want to have an application class that looks something like this:

import org.apache.spark.{SparkConf, SparkContext}

object Test {
  def main(args: Array[String]): Unit = {
    // Master URL and app name are supplied by spark-submit, so an empty SparkConf is enough here
    val spark = new SparkContext(new SparkConf())

    println("-------------Attach debugger now!--------------")
    Thread.sleep(8000) // leave time to attach the debugger before the job starts

    // Your job code here, with breakpoints set on the lines you want to pause

    spark.stop()
  }
}

Now you want to get this job onto a local cluster. First, package the job and all its dependencies into a fat JAR:

$ sbt assembly
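
The assembly task comes from the sbt-assembly plugin. If your build doesn’t already include it, a minimal project/plugins.sbt entry looks something like this (the plugin version is only an example):

// project/plugins.sbt: provides the `assembly` task that builds the fat JAR
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")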

Next, submit it to a local cluster. You need to have the spark-submit script somewhere on your system:

$ export SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
$ /path/to/spark/bin/spark-submit --class Test --master local[4] --driver-memory 4G --executor-memory 4G /path/to/target/scala-2.10/project-assembly-0.0.1.jar

The first line exports a JVM option that starts the driver with a JDWP debugging agent listening on port 5005. --class needs to point to the fully qualified name of your job’s main class. Finally, give it the path to the fat JAR assembled in the previous step.
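
On newer Spark versions SPARK_JAVA_OPTS is deprecated; as far as I know, the same agent settings can be passed straight to spark-submit instead, for example:

$ /path/to/spark/bin/spark-submit --class Test --master local[4] \
    --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" \
    /path/to/target/scala-2.10/project-assembly-0.0.1.jar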

If you run that command now, it will execute your job without stopping at any breakpoints, because nothing has attached to the debugging port yet. Next we need to configure IntelliJ to connect to the cluster. This process is detailed in the official IDEA documentation. If you just create a default “Remote” Run/Debug configuration and leave the default port of 5005, it should work fine.

Now when you submit the job again and see the message to attach the debugger, you have 8 seconds to switch to IntelliJ IDEA and trigger this run configuration. The program will then continue to execute and pause at any breakpoint you defined. You can then step through it like any normal Scala/Java program. You can even step into Spark functions to see what it’s doing under the hood.

Hopefully this helps; I found it very useful for debugging serialization errors and other difficult-to-trace issues. A similar process could potentially be applied to debugging a job on a real cluster, although you would only be able to debug the code that runs in the driver, not in the executors.

6 Responses to “Debugging Apache Spark Jobs”

  1. Vladimir

    Thanks for the advice!
    For hit-refresh debugging it’s nice to put the job code inside a try-catch and an infinite while loop.
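
    A minimal sketch of that hit-refresh pattern (structure assumed, not the commenter’s actual code): inside Test.main, the job body runs in an endless loop so you can re-attach and re-run without resubmitting, and the try-catch keeps a failed iteration from killing the driver.

    while (true) {
      try {
        println("-------------Attach debugger now!--------------")
        Thread.sleep(8000)
        // Your job code here, with breakpoints set where you want to pause
      } catch {
        case e: Exception => e.printStackTrace() // keep the driver alive for the next attempt
      }
    }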

  2. Carlos Bribiescas

    If you change your export to

    export SPARK_JAVA_OPTS=-Xrunjdwp:transport=dt_socket,server=n,address=localhost:5005,suspend=y,onuncaught=n

    You can start the debugger in listen mode instead, so that when you run your Spark application it will attach automatically and stop at your breakpoints.

  3. JMS

    +1 what Carlos said.
    Use suspend=y and then you don’t need to sleep your thread. The JVM will wait for your debugger to connect.
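
    For example, the export from the post with suspend=y swapped in (my variation, not a line from either comment):

    export SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005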

