Calling Scala jobs from PySpark


Creating a Scala function that receives a Python RDD

Creating a Scala function that receives a Python RDD is easy. What you need is a function that takes a JavaRDD[Any]:

import org.apache.spark.api.java.JavaRDD

def doSomethingByPythonRDD(rdd: JavaRDD[Any]) = {
    // do something with the incoming RDD
    rdd.map { x => ??? }
}
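
One gotcha that is easy to miss: after the deserialization step described in the next section, the elements of the RDD arrive on the JVM as generic objects, so in practice you cast them to the type the Python side actually sent before doing real work. The snippet below is only a sketch (it assumes the Python side sends plain numbers, and the even-count logic is purely illustrative); it would sit inside whatever object you end up exposing to Py4J, which is covered further down:

import org.apache.spark.api.java.JavaRDD

def doSomethingByPythonRDD(rdd: JavaRDD[Any]): Long = {
    // work on the underlying Scala RDD and cast each element
    // to what the Python side actually pickled (numbers here)
    rdd.rdd
       .map(x => x.asInstanceOf[Number].longValue)
       .filter(_ % 2 == 0)
       .count()
}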

Serialize and send a Python RDD to Scala code

In this step you serialize the Python RDD and hand it over to the JVM. The snippet below reuses Spark's own SerDe machinery and the Py4J gateway to call the function from your jar:

from pyspark.serializers import PickleSerializer, AutoBatchedSerializer


rdd = sc.parallelize(range(10000))
reserialized_rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
# hand the reserialized RDD to the JVM; SerDe.pythonToJava unpickles it into a JavaRDD[Any]
rdd_java = rdd.ctx._jvm.SerDe.pythonToJava(reserialized_rdd._jrdd, True)

_jvm = sc._jvm  # the Py4J gateway into the JVM
_jvm.myclass.apps.etc.doSomethingByPythonRDD(rdd_java)
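
For the last call above to resolve, doSomethingByPythonRDD has to be reachable as a static method on the JVM. The usual way is to define it on a top-level Scala object whose fully qualified name matches the path used in the Py4J call. A minimal sketch, reusing the placeholder path myclass.apps.etc from the call above (rename it to your real package and object):

package myclass.apps

import org.apache.spark.api.java.JavaRDD

object etc {
    // The Scala compiler generates static forwarders for methods on
    // top-level objects, which is what lets Py4J call this directly.
    def doSomethingByPythonRDD(rdd: JavaRDD[Any]): Long =
        rdd.count()
}

Simple return values such as a Long or a String come straight back to Python over Py4J.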

How to call spark-submit

To call this code you first have to build a jar from your Scala code. Then call spark-submit like this:

spark-submit --master yarn-client --jars ./my-scala-code.jar --driver-class-path ./my-scala-code.jar main.py
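
How you build my-scala-code.jar is up to you; as one option, a minimal sbt build for the Scala code could look roughly like the sketch below (the project name and the Scala/Spark versions are placeholders you should match to your cluster, and the jar produced by sbt package is what you point --jars and --driver-class-path at):

// build.sbt (minimal sketch; match the versions to your cluster)
name := "my-scala-code"
version := "0.1"
scalaVersion := "2.11.12"

// Spark is already on the cluster at runtime, so mark it "provided"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"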

This allows you to call any Scala code you need from your PySpark jobs.
