# Test Automation
Test automation in hadoop_g5k can be done by using Execo's `Engine` class. A generic `HadoopEngine` is provided with hadoop_g5k. This class can be executed directly from the command line or extended for customization. Documentation about the class methods and related classes can be found in [Read the Docs](http://hadoop-g5k.readthedocs.org/en/latest/engine.html).
In order to use it, the following command should be executed:

```
hadoop_engine <cluster> <num_nodes> test_conf.ini
```
This is the `test_conf.ini` file that will be used as an example:
```ini
[test_parameters]
test.summary_file = ./test/summary.csv
test.ds_summary_file = ./test/ds-summary.csv
test.stats_path = ./test/stats
test.output_path = ./test/output

[ds_parameters]
ds.class = hadoop_g5k.dataset.StaticDataset
ds.class.local_path = datasets/ds1
ds.dest = ${data_dir}
ds.size = 1073741824, 2147483648  # 1 | 2 GB
dfs.block.size = 67108864  # 64 MB
dfs.replication = 3

[xp_parameters]
io.sort.factor = 10, 100
io.sort.mb = 500
xp.combiner = true, false
xp.job = program.jar || ${xp.combiner} other_job_options ${xp.input} ${xp.output}
```
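For illustration only, such a file can be read with Python's standard `configparser`, splitting comma-separated values into the lists of values to sweep; the engine's actual parsing may differ, and the splitting logic here is an assumption of the sketch.

```python
import configparser

# Read the example configuration shown above.
config = configparser.ConfigParser(inline_comment_prefixes=("#",))
config.optionxform = str  # keep the case of names such as dfs.block.size
config.read("test_conf.ini")

def sweep_values(section):
    """Return {parameter: [values]}, splitting comma-separated lists."""
    return {name: [v.strip() for v in raw.split(",")]
            for name, raw in config[section].items()}

ds_params = sweep_values("ds_parameters")  # e.g. ds.size -> ['1073741824', '2147483648']
xp_params = sweep_values("xp_parameters")  # e.g. io.sort.factor -> ['10', '100']
```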
The main workflow comprises two loops:

- An external loop that traverses the different dataset parameter combinations, cleans the cluster, and deploys the corresponding dataset; and
- An inner loop that traverses the experiment parameter combinations and executes a Hadoop MapReduce job for each of them, as sketched below.
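The nesting can be pictured with the following sketch, reusing the `ds_params` and `xp_params` dictionaries from the parsing sketch above; the `print` calls only stand in for the cluster cleaning, dataset deployment, and job execution performed by the real engine:

```python
from itertools import product

def combinations(params):
    """Yield one {name: value} dict per combination of the swept values."""
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))

for ds_comb in combinations(ds_params):
    # External loop: clean the cluster and deploy the dataset for this combination
    # (the real engine performs these steps; here they are only indicated).
    print("deploy dataset:", ds_comb)
    for xp_comb in combinations(xp_params):
        # Inner loop: run one Hadoop MapReduce job per experiment combination.
        print("run job:", {**ds_comb, **xp_comb})
```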
In order to let `hadoop_engine` know which parameters correspond to the datasets and which ones to the experiments, the parameters are divided into two sections: `[ds_parameters]` for the former and `[xp_parameters]` for the latter.
A Hadoop test has a set of general test parameters which define the global behaviour of the execution. These are the main properties used in the test:
- `test.summary_file` and `test.ds_summary_file`: These properties indicate the paths of the files that will store the information of each executed experiment and each created dataset.
- `test.stats_path`: If specified, it indicates the path where the experiments' statistics will be copied.
- `test.output_path`: If specified, it indicates the path where the experiments' output will be copied.
As mentioned before, there are some parameters which are used to configure the dataset before deploying it. There are two types of parameters: general dataset parameters, which start with `ds.`, and MapReduce parameters, which have arbitrary names, as they correspond to Hadoop properties. In the second case, these parameters are simply inserted into the Hadoop configuration files (`dfs.block.size` and `dfs.replication` in the given example). The main general parameters are the following:
- `ds.size`: If specified, it indicates the desired size of the dataset deployment. It should be given in bytes.
- `ds.class`: It specifies the class to be used for dataset deployment, which is loaded automatically (see the sketch after this list). It should extend hadoop_g5k's class `Dataset`. Hadoop_g5k already provides two implementations:
  - `StaticDataset`: This class manages already generated datasets stored in the frontend. It uploads the files to the DFS with as much parallelization as possible.
  - `DynamicDataset`: This class dynamically generates a dataset by executing a MapReduce job.
- `ds.dest`: The location of the dataset in the DFS.
- `ds.class.*`: All the parameters starting like this are passed to the deployment class. Each class uses a different set of parameters.
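As a rough illustration of the automatic loading (not the engine's actual code), the dotted `ds.class` name can be resolved with `importlib`; how the `ds.class.*` options are then handed to the class is not shown here, since the `Dataset` interface is documented separately:

```python
import importlib

def load_class(dotted_name):
    """Import a class such as hadoop_g5k.dataset.StaticDataset from its dotted name."""
    module_name, class_name = dotted_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

ds_class = load_class("hadoop_g5k.dataset.StaticDataset")
# The ds.class.* options of the example (local_path = datasets/ds1) would then be
# passed to this class; the exact Dataset constructor is not part of this sketch.
```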
The experiment parameters comprise those used in the execution of the experiment. Arbitrary Hadoop parameters can be used as before (`io.sort.factor` and `io.sort.mb` in the example). A special parameter should always be specified:

- `xp.job`: It indicates the jar containing the job to be executed and its parameters, separated by a double vertical line (`||`), as illustrated below.
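For illustration only (the engine's own handling may differ), the value can be split on the double vertical line to separate the jar from its arguments:

```python
# xp.job exactly as written in the example configuration (macros not yet resolved).
job = "program.jar || ${xp.combiner} other_job_options ${xp.input} ${xp.output}"

jar, _, params = (part.strip() for part in job.partition("||"))
# jar    -> "program.jar"
# params -> "${xp.combiner} other_job_options ${xp.input} ${xp.output}"
```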
There are special values that can be used in the parameter configuration, called macros. The general form of a macro is `${macro_name}`. They are replaced either by internal variables of the engine or by other user-defined parameters. These are the macros defined in the engine:
- `${data_base_dir}`: The base dir for the datasets.
- `${out_base_dir}`: The base dir for the experiment outputs.
- `${data_dir}`: The dir of the used dataset.
- `${comb_id}`: The unique combination identifier.
- `${ds_id}`: The unique dataset identifier.
- `${xp.input}`: The experiment's input dir.
- `${xp.output}`: The experiment's output dir.
Macros referencing user-defined parameters must follow certain rules:

- A test parameter cannot reference a dataset or experiment parameter.
- A dataset parameter cannot reference an experiment parameter.
- Parameter definitions should not contain cycles, e.g., it is not possible to have `xp.a = ${xp.b} other_stuff` and `xp.b = ${xp.a} other_stuff`.
Macros make it possible to specify job parameters as a function of the dataset being used and the experiment being executed. In the given configuration file, for example, the job definition uses a user-defined parameter, `${xp.combiner}`, and two internal variables, `${xp.input}` and `${xp.output}`.
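A minimal sketch of how such a substitution could look for one combination; the concrete input and output paths below are illustrative placeholders, not values produced by `hadoop_engine`:

```python
import re

MACRO = re.compile(r"\$\{([^}]+)\}")

def resolve(value, variables):
    """Replace every ${name} in value with its entry in variables."""
    return MACRO.sub(lambda m: str(variables[m.group(1)]), value)

# One experiment combination: the user-defined xp.combiner plus the engine's
# internal input/output variables (paths here are only examples).
variables = {
    "xp.combiner": "true",
    "xp.input": "/user/test/data/ds_1",
    "xp.output": "/user/test/out/comb_4",
}
job = resolve("program.jar || ${xp.combiner} other_job_options ${xp.input} ${xp.output}",
              variables)
# -> "program.jar || true other_job_options /user/test/data/ds_1 /user/test/out/comb_4"
```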