Creating a Scala function that receives an python RDD is easy. What you need to build is a function that get a JavaRDD[Any]
import org.apache.spark.api.java.JavaRDD
def doSomethingByPythonRDD(rdd :JavaRDD[Any]) = {
//do something
rdd.map { x => ??? }
}
This part of development you should serialize the python RDD to the JVM. This process uses the main development of Spark to call the jar function.
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
rdd = sc.parallelize(range(10000))
reserialized_rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
rdd_java = rdd.ctx._jvm.SerDe.pythonToJava(rdd._jrdd, True)
_jvm = sc._jvm #This will call the py4j gateway to the JVM.
_jvm.myclass.apps.etc.doSomethingByPythonRDD(rdd_java)
To call this code you should create the jar of your scala code. Than you have to call your spark submit like this:
spark-submit --master yarn-client --jars ./my-scala-code.jar --driver-class-path ./my-scala-code.jar main.py
This will allow you to call any kind of scala code that you need in your pySpark jobs