Apache Spark comes in different versions, and the environment should be configured to use the correct version. This is important for:

- building the `matlab-spark-utility` Jar file,
- generating the Java classpath, and
- setting up properties to instantiate a Spark environment.
The configuration is kept in a JSON file, which contains configurations for different versions of Spark (at the time of writing 2.4.5, 2.4.7 and 3.0.1, but more can be added). In general, there should be no need to edit this file manually, but this can be done, e.g. to add an entry for another version of Spark.

The configuration is handled most directly through the `Config` class.
```matlab
>> C = matlab.sparkutils.Config.getInMemoryConfig
C =
  Config with properties:
           Versions: ["2.4.5"    "2.4.7"    "3.0.1"    "3.0.1-hadoop3.2"]
     CurrentVersion: '3.0.1'
             Master: 'local'
```
As seen here, the output indicates:

- the available versions,
- the active (currently used) version, i.e. the `CurrentVersion`, and
- the default master.

The `CurrentVersion` is what will be used for the tasks listed further above, and it can easily be changed:
```matlab
>> C = matlab.sparkutils.Config()
C =
  Config with properties:
           Versions: ["2.4.5"    "2.4.7"    "3.0.1"    "3.0.1-hadoop3.2"]
     CurrentVersion: '3.0.1'
             Master: 'local'

>> C.CurrentVersion = "2.4.7"
C =
  Config with properties:
           Versions: ["2.4.5"    "2.4.7"    "3.0.1"    "3.0.1-hadoop3.2"]
     CurrentVersion: '2.4.7'
             Master: 'local'
```
Only the available versions can be used:
```matlab
>> C.CurrentVersion = "2.4.8"
Error using matlab.sparkutils.Config/set.CurrentVersion (line 89)
The version 2.4.8 is not part of the available versions (2.4.5 2.4.7 3.0.1 3.0.1-hadoop3.2)
```
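If the target version is not known in advance, the assignment can be guarded. Below is a minimal sketch, using only the documented `Versions` and `CurrentVersion` properties shown above; the variable names are illustrative:

```matlab
% Sketch: only switch versions if the requested one is available.
% Uses the Versions (string array) and CurrentVersion properties shown above.
wanted = "2.4.8";                                 % illustrative value
C = matlab.sparkutils.Config.getInMemoryConfig;
if ismember(wanted, C.Versions)
    C.CurrentVersion = wanted;
else
    fprintf("Version %s is not available; staying on %s\n", ...
        wanted, C.CurrentVersion);
end
```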
To make this change persist in the JSON file, the configuration must be saved (but also refer to the in-memory config section below):

```matlab
C.saveConfig
```
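Putting the pieces together, a persistent version change could look like this; a minimal sketch using only the calls documented in this section:

```matlab
% Sketch: select a version and persist it to the user-level JSON file.
C = matlab.sparkutils.Config();     % load the configuration
C.CurrentVersion = "2.4.7";         % must be one of C.Versions
C.saveConfig                        % saved per-user, not in the package default
```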
A default Config file is part of this package. If a user saves the Config, the default Config file will not be overwritten. Instead, the configuration is saved in the user's MATLAB preferences directory, which can be found by running `prefdir` in MATLAB:
```matlab
>> prefdir
ans =
    '/home/<username>/.matlab/R2021a'
```
This enables different users to have different settings on a shared system. It also avoids the problem of corrupting the default configuration file by mistake.
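To check whether a user-level config has already been saved, one can look in the preferences directory. This is only a sketch: the file name below is a hypothetical placeholder, as the actual name is defined by the package.

```matlab
% Sketch: look for a saved user config in prefdir.
% NOTE: 'sparkConfig.json' is a hypothetical placeholder file name.
userCfg = fullfile(prefdir, 'sparkConfig.json');
if isfile(userCfg)
    disp("A user-level config exists: " + userCfg)
else
    disp("No user-level config found; the package default is used.")
end
```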
If a user config file has been corrupted, it can simply be removed. It can also be overwritten with the default config by issuing the following commands:
```matlab
C = matlab.sparkutils.Config();
C.overwriteUserConfig();
```
There may be cases where the user wants to change settings without saving them, i.e. without changing the defaults on disk. For this, the static method `getInMemoryConfig` is used. This method loads the default Config and keeps it in memory. Changes can be made to this config, and they will be kept in memory, but not saved to disk.
The different functions that use the Config, e.g.

- generating `javaclasspath.txt`,
- building the Spark-Utility Jar, and
- instantiating Spark sessions,

will all use this in-memory Config. This makes it easy to set it once at the beginning of a session, and henceforth all functions will use it, as shown in the sketch below. If the memory of classes is cleared with `clear classes`, this instance of the Config object will vanish, and the user must redo the setting changes. If only a plain `clear` command is issued, the Config is still valid.
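As a sketch of the session workflow just described, using only the documented `getInMemoryConfig` and `CurrentVersion`:

```matlab
% Sketch: set the Spark version once, in memory only, at session start.
C = matlab.sparkutils.Config.getInMemoryConfig;
C.CurrentVersion = "2.4.7";   % used by classpath generation, builds, sessions
% No C.saveConfig here, so the JSON file on disk is left untouched.
% A plain 'clear' keeps this in-memory Config; 'clear classes' discards it.
```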
The `matlab-spark-utility` Jar file, which is needed for certain operations, must be built for the corresponding Spark version. It relies on a few different definitions, so it's recommended to build it with this MATLAB command:

```matlab
matlab.sparkutils.buildMatlabSparkUtility
```
If necessary, the details of the build settings can be found by inspecting the output of the `genMavenBuildCommand` method on the `Config` object:
```matlab
>> C = matlab.sparkutils.Config.getInMemoryConfig()
C =
  Config with properties:
           Versions: ["2.2.0"    "2.4.5"    "2.4.7"    "3.0.1"    "3.0.1-hadoop3.2"    "3.1.2"    "3.2.1"]
     CurrentVersion: '3.0.1'
             Master: 'local'

>> C.genMavenBuildCommand
Run this command to build matlab-spark-utility:
mvn --batch-mode -Dspark.fullversion=3.0.1 -Dspark.version=3.0.1 -Dspark.major.version=3.x -Dscala.version=2.12.10 -Dscala.compat.version=2.12 -Dhadoop.version=2.7.4 clean package
```
It should also be noted that the command furthermore needs a Maven profile set. The utility mentioned above, `matlab.sparkutils.buildMatlabSparkUtility`, handles this for the user, and also makes the distinction between building for Apache Spark and building for Databricks.
A few versions of Spark are supported directly, but more can be added. To achieve this, new entries must be added to the JSON file containing the saved configuration. Below is an example of one of the configurations:
```json
{
    "name": "2.4.7",
    "maven": [
        {
            "name": "spark.version",
            "value": "2.4.7"
        },
        {
            "name": "spark.major.version",
            "value": "2.x"
        },
        {
            "name": "scala.version",
            "value": "2.11.12"
        },
        {
            "name": "scala.compat.version",
            "value": "2.11"
        },
        {
            "name": "hadoop.version",
            "value": "2.7.4"
        }
    ],
    "libs": [
        {
            "base": "delta-core",
            "jar": "delta-core_2.11-0.6.1.jar",
            "src": "https://repo1.maven.org/maven2/io/delta/delta-core_2.11/0.6.1/delta-core_2.11-0.6.1.jar"
        },
        {
            "base": "spark-avro",
            "jar": "spark-avro_2.11-2.4.7.jar",
            "src": "https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.7/spark-avro_2.11-2.4.7.jar"
        }
    ]
},
```
As can be seen, an entry requires information about:

- the Spark version,
- the Scala version,
- the Hadoop version, and
- links to the delta-core and spark-avro Jars.

These versions and links, as seen when inspecting the JSON file, change between Spark versions.
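After adding an entry, one can sanity-check that the file still parses. This is a hedged sketch; the file path below is a hypothetical placeholder, since the actual location is package-defined (see `editConfig` below for opening the real file):

```matlab
% Sketch: verify the edited configuration file is valid JSON.
% NOTE: 'sparkConfig.json' is a hypothetical placeholder path.
cfg = jsondecode(fileread('sparkConfig.json'));
% The new version should then also appear in the Config object:
C = matlab.sparkutils.Config();
disp(C.Versions)
```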
To open the Configuration file in the MATLAB editor, do the following:

```matlab
>> C = matlab.sparkutils.Config.getInMemoryConfig;
>> C.editConfig
```