# Test Automation
Test automation in hadoop_g5k can be done by using Execo's `Engine` class. A generic `HadoopEngine` is provided with hadoop_g5k. This class can be executed directly from the command line or extended for customization. Documentation about the class methods and related classes can be found in [Read the Docs](http://hadoop-g5k.readthedocs.org/en/latest/engine.html).
In order to use it, the following command should be executed:

```
hadoop_engine <cluster> <num_nodes> test_conf.ini
```
This is the `test_conf.ini` file that will be used as an example:
```ini
[test_parameters]
test.summary_file = ./test/summary.csv
test.ds_summary_file = ./test/ds-summary.csv
test.stats_path = ./test/stats
test.output_path = ./test/output

[ds_parameters]
ds.class = hadoop_g5k.dataset.StaticDataset
ds.class.local_path = datasets/ds1
ds.dest = ${data_dir}
ds.size = 1073741824, 2147483648  # 1 | 2 GB
dfs.block.size = 67108864  # 64 MB
dfs.replication = 3

[xp_parameters]
io.sort.factor = 10, 100
io.sort.mb = 500
xp.combiner = true, false
xp.job = program.jar || ${xp.combiner} other_job_options ${xp.input} ${xp.output}
```
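For illustration only, such a file can be read with Python's standard `configparser`, splitting comma-separated values into the lists of values to sweep; the engine's actual parsing may differ, and the splitting logic here is an assumption of the sketch.

```python
import configparser

# Read the example configuration shown above.
config = configparser.ConfigParser(inline_comment_prefixes=("#",))
config.optionxform = str  # keep the case of names such as dfs.block.size
config.read("test_conf.ini")

def sweep_values(section):
    """Return {parameter: [values]}, splitting comma-separated lists."""
    return {name: [v.strip() for v in raw.split(",")]
            for name, raw in config[section].items()}

ds_params = sweep_values("ds_parameters")  # e.g. ds.size -> ['1073741824', '2147483648']
xp_params = sweep_values("xp_parameters")  # e.g. io.sort.factor -> ['10', '100']
```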
The main workflow comprises two loops:

- An external loop that traverses the different dataset parameter combinations, cleans the cluster, and deploys the corresponding dataset; and
- An inner loop that traverses the experiment parameter combinations and executes a Hadoop MapReduce job for each of them, as sketched below.
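The nesting can be pictured with the following sketch, reusing the `ds_params` and `xp_params` dictionaries from the parsing sketch above; the `print` calls only stand in for the cluster cleaning, dataset deployment, and job execution performed by the real engine:

```python
from itertools import product

def combinations(params):
    """Yield one {name: value} dict per combination of the swept values."""
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))

for ds_comb in combinations(ds_params):
    # External loop: clean the cluster and deploy the dataset for this combination
    # (the real engine performs these steps; here they are only indicated).
    print("deploy dataset:", ds_comb)
    for xp_comb in combinations(xp_params):
        # Inner loop: run one Hadoop MapReduce job per experiment combination.
        print("run job:", {**ds_comb, **xp_comb})
```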
In order to let `hadoop_engine` know which parameters correspond to the datasets and which ones to the experiments, the parameters are divided into two sections: `[ds_parameters]` for the former and `[xp_parameters]` for the latter.
A Hadoop test has a set of general test parameters which define the global behaviour of the execution. These are the main properties used in the test:
- `test.summary_file` and `test.ds_summary_file`: These properties indicate the paths of the files that will store the information of each executed experiment and each created dataset.
- `test.stats_path`: If specified, it indicates the path where the experiments' statistics will be copied.
- `test.output_path`: If specified, it indicates the path where the experiments' output will be copied.
As mentioned before, there are some parameters which are used to configure the dataset before deploying it. There are two types of parameters: general dataset parameters, which start with `ds.`, and MapReduce parameters, which have arbitrary names, as they correspond to Hadoop properties. In the second case, these parameters are simply inserted into the Hadoop configuration files (`dfs.block.size` and `dfs.replication` in the given example). The main general parameters are the following:
- `ds.size`: If specified, it indicates the desired size of the dataset deployment. It should be given in bytes.
- `ds.class`: It specifies the class to be used for dataset deployment, which is loaded automatically (see the sketch after this list). It should extend hadoop_g5k's class `Dataset`. Hadoop_g5k already provides two implementations:
  - `StaticDataset`: This class manages already generated datasets stored in the frontend. It uploads the files to the DFS with as much parallelization as possible.
  - `DynamicDataset`: This class dynamically generates a dataset by executing a MapReduce job.
- `ds.dest`: The location of the dataset in the DFS.
- `ds.class.*`: All the parameters starting like this are passed to the deployment class. Each class uses a different set of parameters.
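As a rough illustration of the automatic loading (not the engine's actual code), the dotted `ds.class` name can be resolved with `importlib`; how the `ds.class.*` options are then handed to the class is not shown here, since the `Dataset` interface is documented separately:

```python
import importlib

def load_class(dotted_name):
    """Import a class such as hadoop_g5k.dataset.StaticDataset from its dotted name."""
    module_name, class_name = dotted_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

ds_class = load_class("hadoop_g5k.dataset.StaticDataset")
# The ds.class.* options of the example (local_path = datasets/ds1) would then be
# passed to this class; the exact Dataset constructor is not part of this sketch.
```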
The experiment parameters comprise those used in the execution of the experiment. Arbitrary Hadoop parameters can be used as before (`io.sort.factor` and `io.sort.mb` in the example). A special parameter should always be specified:

- `xp.job`: It indicates the jar containing the job to be executed and its parameters, separated by a double vertical line (`||`), as illustrated below.
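For illustration only (the engine's own handling may differ), the value can be split on the double vertical line to separate the jar from its arguments:

```python
# xp.job exactly as written in the example configuration (macros not yet resolved).
job = "program.jar || ${xp.combiner} other_job_options ${xp.input} ${xp.output}"

jar, _, params = (part.strip() for part in job.partition("||"))
# jar    -> "program.jar"
# params -> "${xp.combiner} other_job_options ${xp.input} ${xp.output}"
```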
There are special values that can be used in the parameter configuration, called macros. The general form of a macro is `${macro_name}`. They are replaced either by internal variables of the engine or by other user-defined parameters. These are the macros defined in the engine:
- `${data_base_dir}`: The base dir for the datasets.
- `${out_base_dir}`: The base dir for the experiment outputs.
- `${data_dir}`: The dir of the used dataset.
- `${comb_id}`: The unique combination identifier.
- `${ds_id}`: The unique dataset identifier.
- `${xp.input}`: The experiment's input dir.
- `${xp.output}`: The experiment's output dir.
Macros referencing user-defined parameters must follow certain rules:

- A test parameter cannot reference a dataset or experiment parameter.
- A dataset parameter cannot reference an experiment parameter.
- Parameter definitions should not contain cycles, e.g., it is not possible to have `xp.a = ${xp.b} other_stuff` and `xp.b = ${xp.a} other_stuff`.
Macros make it possible to specify job parameters as a function of the dataset being used and the experiment being executed. In the given configuration file, for example, the job definition uses a user-defined parameter, `${xp.combiner}`, and two internal variables, `${xp.input}` and `${xp.output}`.
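A minimal sketch of how such a substitution could look for one combination; the concrete input and output paths below are illustrative placeholders, not values produced by `hadoop_engine`:

```python
import re

MACRO = re.compile(r"\$\{([^}]+)\}")

def resolve(value, variables):
    """Replace every ${name} in value with its entry in variables."""
    return MACRO.sub(lambda m: str(variables[m.group(1)]), value)

# One experiment combination: the user-defined xp.combiner plus the engine's
# internal input/output variables (paths here are only examples).
variables = {
    "xp.combiner": "true",
    "xp.input": "/user/test/data/ds_1",
    "xp.output": "/user/test/out/comb_4",
}
job = resolve("program.jar || ${xp.combiner} other_job_options ${xp.input} ${xp.output}",
              variables)
# -> "program.jar || true other_job_options /user/test/data/ds_1 /user/test/out/comb_4"
```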