This library contains methods written in Spark SQL that can be imported and used from Scala, Java, and Python.
The core implementations of the methods are written in an object-oriented style in the com.example package.
They can be called as shown in the snippet below:
import org.apache.spark.sql.DataFrame
import com.example.SumColumns.sumColumns
val df: DataFrame = // The dataframe the method will be applied to
sumColumns(df, "col1", "col2", "sum")
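For orientation, a core method such as SumColumns might look roughly like the sketch below. This is only an assumption about the shape of the code; the actual implementation in com.example may differ.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical sketch; the real com.example.SumColumns may differ.
object SumColumns {
  // Adds a column named outputCol containing the sum of col1 and col2.
  def sumColumns(df: DataFrame, col1: String, col2: String, outputCol: String): DataFrame =
    df.withColumn(outputCol, col(col1) + col(col2))
}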
The core implementations have implicit wrappers that allow the methods to be applied directly to a Spark DataFrame.
These can be found in the com.example.implicits package.
They can be used as shown in the snippet below:
import org.apache.spark.sql.DataFrame
import com.example.implicits.SumColumnsImpl._
val df: DataFrame = // The dataframe the method will be applied to
df.sumColumns("col1", "col2", "sum")
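Under the hood, such a wrapper is typically just an implicit class that delegates to the core method. A minimal sketch, assuming the core signature shown above, might look like this; the actual code in com.example.implicits may differ.
import org.apache.spark.sql.DataFrame

object SumColumnsImpl {
  // Hypothetical sketch: an implicit class that adds sumColumns as a method on DataFrame
  // and delegates to the core implementation in com.example.
  implicit class SumColumnsOps(df: DataFrame) {
    def sumColumns(col1: String, col2: String, outputCol: String): DataFrame =
      com.example.SumColumns.sumColumns(df, col1, col2, outputCol)
  }
}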
The implicit wrappers make calling the methods, and chaining several of them together, cleaner.
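For example, a chained call might look like the sketch below, assuming df is the DataFrame from the snippet above; the column names are purely illustrative.
import com.example.implicits.SumColumnsImpl._

// Each call returns a new DataFrame, so the methods can be chained.
val result = df
  .sumColumns("col1", "col2", "partialSum")
  .sumColumns("partialSum", "col3", "totalSum")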
The core implementations are wrapped in a Java API so that the methods can be used from Java. The API does this by
converting any Java parameters into their Scala equivalents. The API is also required by the Python wrapper.
These APIs can be found in the com.example.api package.
They can be used as shown in the snippet below:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static com.example.api.SumColumnsAPI.sumColumns;
Dataset<Row> df = // The dataframe the method will be applied to
sumColumns(df, "col1", "col2", "sum");The python wrappers use py4j to pass the arguments into the java api, this then converts those arguments and passes
The Python wrappers use Py4J to pass the arguments into the Java API, which then converts them and passes them to the core Scala implementation. The Python files can be found in the python directory.
They can be used as shown in the snippet below:
from methods.sum_columns import sum_columns
df = # The dataframe the method will be applied to
sum_columns(df, "col1", "col2", "sum")The dockerfile will create an environment appropriate for running this project in.
The image can be built by running the following command from the root of this project:
docker build -t spark-library-environment .
The -t argument provides a name for the Docker image; this can be whatever you like.
The . at the end of the command is required. It specifies the build context, i.e. that the Dockerfile is in your current
working directory. If the Dockerfile is not in the current working directory, this can be replaced with a path to the
directory containing it, as shown below.
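For example, assuming the project and its Dockerfile lived in a hypothetical directory path/to/project, the image could be built from elsewhere with:
docker build -t spark-library-environment path/to/project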
Once built, the following command can be run to create a container:
docker run -it -d --rm --name spark-library-test spark-library-environment
You can then open an interactive shell in the container using the command below:
docker exec -it spark-library-test bash
Note: The code in this image is a snapshot of the code at the time the image was built. Changes to the code will not be synced between the container and the host. If syncing is needed, you could add a volume mapping to the run command, as in the example below.
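For example, the run command could mount the project into the container like this, where /opt/spark-library is a hypothetical path and should be replaced with wherever the project is expected to live inside the image:
docker run -it -d --rm --name spark-library-test -v "$(pwd)":/opt/spark-library spark-library-environment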