When asking Apache Spark related questions, please include the following information:

- Versions of the components used: the output of `java -version`, the build definition (`build.sbt`, `pom.xml`) if applicable, or external dependency versions (Python, R) when applicable.
- The master URL (`local[n]`, Spark standalone, Yarn, Mesos), the deploy mode (`client`, `cluster`), and other submit options if applicable.

Please try to provide minimal example input data in a format that can be used directly in answers without tedious and time-consuming parsing: for example an input file or a local collection, together with all the code required to create the distributed data structures.
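A minimal sketch of such a self-contained example (the session settings, column names, and values here are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, just to make the snippet runnable on its own.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("minimal-example")
  .getOrCreate()

import spark.implicits._

// Tiny, made-up input embedded as a local collection, so answerers can
// recreate the distributed data structure with a single copy-paste.
val df = Seq(
  ("a", 1, 2.0),
  ("b", 2, 3.5)
).toDF("key", "count", "score")
```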
When applicable, always include type information: the `StructType` or the output from `Dataset.printSchema`. Output from `Dataset.show` or `print` can look good, but it doesn't tell us anything about the underlying types.
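As a sketch, continuing with the hypothetical `df` from the example above:

```scala
// show renders values but hides the column types.
df.show()

// printSchema reveals the actual types, for example:
// root
//  |-- key: string (nullable = true)
//  |-- count: integer (nullable = false)
//  |-- score: double (nullable = false)
df.printSchema()
```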
If a particular problem occurs only at scale, use random data generators (Spark provides some useful utilities in `org.apache.spark.mllib.random.RandomRDDs` and `org.apache.spark.graphx.util.GraphGenerators`).
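For instance, a minimal sketch using `RandomRDDs` (the size, partition count, and seed are arbitrary, and `spark` is the hypothetical session from the first sketch):

```scala
import org.apache.spark.mllib.random.RandomRDDs

// One million values drawn from N(0, 1), spread over 10 partitions;
// the fixed seed keeps the data reproducible across runs.
val randomData = RandomRDDs.normalRDD(spark.sparkContext, 1000000L, 10, 42L)
```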
Please use type annotations when possible. While your compiler can easily keep track of the types, it is not so easy for mere mortals. For example:
val lines: RDD[String] = rdd.map(someFunction)
or
def f(x: String): Int = ???
are better than:
val lines = rdd.map(someFunction)
and
def f(x: String) = ???
respectively.
When a question is related to debugging a specific exception, always provide the relevant traceback. While it is advisable to remove duplicated outputs (from different executors or attempts), don't cut tracebacks down to a single line or the exception class only.
Depending on the context, try to provide details like the output of `RDD.toDebugString` or `Dataset.explain`.
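As a sketch, reusing the hypothetical `df` and `randomData` from the earlier examples:

```scala
// Lineage of the RDD: shows the chain of transformations and any caching.
println(randomData.toDebugString)

// Logical and physical plans of the Dataset; true requests the extended output.
df.filter($"count" > 1).explain(true)
```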