Yothalot\Job

With the Yothalot\Job class you can create, tune and control jobs. A job holds the connection, the algorithm, the input data and several performance settings. For a complete example of creating a job, see our "hello world" example.

The most important member functions of Yothalot\Job are the ones that add input data to the job, and the method that starts the job. You can also set all sorts of tuning parameters to make your job faster, or to reduce the amount of resources that your job uses.

class Yothalot\Job
{
    // constructor
    public function __construct(Yothalot\Connection $connection, $algorithm);

    // adding input data to a job
    public function add($data);
    public function server($data, $server = null);
    public function file($data, $file = null);

    // running the job
    public function start();
    public function detach();
    public function wait();

    // tuning the job
    public function local($local);
    public function modulo($modulo);
    public function maxfiles($max);
    public function maxbytes($max);
    public function maxprocesses($max);
    public function maxmappers($max);
    public function maxreducers($max);
    public function maxwriters($max);

}

Constructor

The constructor takes two parameters: a Yothalot\Connection and an instance of your own class in which your algorithm is implemented. The algorithm should be an instance of Yothalot\MapReduce for map/reduce jobs, an instance of Yothalot\Race for race jobs, or an instance of Yothalot\Task for a normal job. Creating a job looks like:

// create a connection
$connection = new Yothalot\Connection(array(
    "host"  =>  "rabbit1.example.com",
    "vhost" =>  "yothalot"
));

// create a job
$job = new Yothalot\Job($connection, new MyMapReduceAlgorithm());

Adding input data

The add() method is used to add input data to the job. You must add the input data before you start the job. All the data that you add to the job is passed to the map() method of your own mapreduce object, or to the process() method if you assigned a race algorithm.

There is a one-to-one relation between the number of times that you call the add() method and the number of mapper or race processes that are started on the cluster: every call to add() results in one mapper or race process, and the value that you pass to add() is used as the input data for the map() or process() method of that process.

Because of this direct relation, it is best practice to only add things like file names to your job, so that every process has a substantial amount of data to work on and no trivial processes are started.

If you do add a file name to the job, you must of course make sure that this file is available on every node in the cluster, because you cannot know in advance on which server a process is going to run. It is therefore advisable to store the input data on GlusterFS.

// create a connection
$connection = new Yothalot\Connection(array(
    "host"  =>  "rabbit1.example.com",
    "vhost" =>  "yothalot"
));

// create a job
$job = new Yothalot\Job($connection, new MyMapReduceAlgorithm());

// feed the job with input data
$job->add("input data");
$job->add("more data");
$job->add("even more data");

// start the job
$job->start();
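
To follow the advice of adding file names rather than small values, the calls to add() can be generated from a list of files. The sketch below is only an illustration: addInputFiles() is a hypothetical helper that is not part of the Yothalot API, and it assumes that every path it receives is readable on every node in the cluster.

```php
<?php
/**
 *  Hypothetical helper (not part of Yothalot): add one input per file name.
 *  Every add() call results in one mapper or race process, so each file
 *  should be large enough to justify a process of its own.
 *
 *  @param  object  $job    a Yothalot\Job, or any other object with an add() method
 *  @param  array   $paths  file names that are readable on every node
 *  @return int             number of inputs that were added
 */
function addInputFiles($job, array $paths)
{
    $count = 0;
    foreach ($paths as $path)
    {
        // skip entries that are not a usable file name
        if (!is_string($path) || $path === '') continue;

        // the file name itself is the input data for map() or process()
        $job->add($path);
        $count++;
    }
    return $count;
}
```

With a job constructed as in the example above, the helper would be called as addInputFiles($job, $paths) before $job->start(), where $paths is an array of file names that you collected yourself.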

Controlling the server

Every call to the add() method results in a process that is started somewhere on the Yothalot cluster. The server is not chosen deliberately: the first available node receives the process. However, this may not always be the best choice. Since Yothalot uses a distributed file system, it would be ideal to run the process on a server where the data is locally available, to avoid network overhead.

The file() and server() methods exist for this purpose. They do exactly the same as the add() method, with one exception: they accept a second parameter that you can use to give a hint on which server to run the process.

The hint for file() is a filename. It tells Yothalot that the data you pass belongs to that particular file, so that the process should be scheduled on a server where that file is locally available. If you pass only one parameter to file(), that parameter is assumed to be the filename.

The hint for server() is the name of the server on which the process should ideally run. You can use both methods like:

// construct a job
$job = new Yothalot\Job($connection, new MyMapReduceAlgorithm());

// add data for which it is not important on which server it runs
$job->add("random data");
$job->add("more random data");

// add data that can best be processed on "server7"
$job->server("server specific data", "server7.example.com");
$job->server("more server data", "server7.example.com");

// add data that can best be processed on a server that has local
// access to a specific file
$job->file("file specific data", "path/to/some/file.txt");

// add data that can best be processed on a server that has local
// access to a specific file, and the data is the file name itself
$job->file("path/to/some/file.txt");

// start the job
$job->start();

When the jobs are distributed over the Yothalot nodes, the master Yothalot server will do its best to assign the server specific jobs to "server7.example.com", and the file specific jobs to a server that holds a local copy of the specified file. This is only a hint: if the desired server is not available, the job will be assigned to a different server instead.
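
When the preferred server is already known for every piece of input, the choice between add() and server() can be made in a loop. The following is a minimal sketch under that assumption: addGroupedByServer() is a hypothetical helper, not part of the Yothalot API, and the convention that an empty key means "no preference" is made up for this example.

```php
<?php
/**
 *  Hypothetical helper (not part of Yothalot): add inputs grouped by the
 *  server on which they can best be processed.
 *
 *  @param  object  $job            a Yothalot\Job, or any object with add() and server()
 *  @param  array   $itemsByServer  server name => list of inputs; the key ""
 *                                  holds inputs without a preferred server
 */
function addGroupedByServer($job, array $itemsByServer)
{
    foreach ($itemsByServer as $server => $items)
    {
        foreach ($items as $item)
        {
            // no preference: let the cluster pick any available node
            if ($server === '') $job->add($item);

            // otherwise pass the server name as a scheduling hint
            else $job->server($item, $server);
        }
    }
}
```

Because the server name is only a hint, the helper does not need to check whether the server is reachable.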

Note that the input data should be serializable because it is transferred to other servers.
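
Whether a value is serializable can be checked before it is added to the job. The sketch below assumes that surviving a round trip through PHP's own serialize() and unserialize() functions is a reasonable indication; how Yothalot transfers the data internally is not described here, and survivesSerialization() is a hypothetical helper.

```php
<?php
/**
 *  Hypothetical pre-flight check (not part of Yothalot). It assumes that
 *  surviving PHP's own serialize()/unserialize() round trip indicates
 *  that the data can be transferred to another server.
 *
 *  @param  mixed   $data   the input that you intend to add to a job
 *  @return bool
 */
function survivesSerialization($data)
{
    try
    {
        // closures, for example, make serialize() throw an exception
        $copy = unserialize(serialize($data));
    }
    catch (\Throwable $error)
    {
        return false;
    }

    // loose comparison, so that objects are compared by their properties
    return $copy == $data;
}
```

Strings such as file names and plain arrays pass this check, while closures do not.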

Running a job

After you have added all the data to your job you can start the job with start(). Once a job has been started it is no longer possible to add input data to the job, and it is no longer possible to tune the job.

If you want to wait for your job to finish, you can call the wait() method. This blocks your PHP script until the job is finished. This could take some time, so you should only call wait() if it is not a problem that your PHP script is blocked for that long.

// construct a job
$job = new Yothalot\Job($connection, new MyMapReduceAlgorithm());

// add data for which it is not important on which server it runs
$job->add("random data");
$job->add("more random data");

// start the job
$job->start();

// wait for the job to finish - and store the job stats in a variable
$stats = $job->wait();

// because the job is now finished, we can use the output data of the job
$output = file_get_contents("path/to/output/file.txt");

The counterpart of the wait() method is the detach() method. This detaches the script from the job, while the job continues to run in the background. In practice it is not necessary to explicitly call detach(), because active jobs are automatically detached when the PHP script ends. The only effect of the detach() call is that it becomes impossible to call wait() later on, because the job is already detached.
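
The choice between wait() and detach() can be wrapped in a small function. This is a sketch only: runJob() is a hypothetical helper that is not part of the Yothalot API, and it assumes that all input data was added and all tuning was done before it is called.

```php
<?php
/**
 *  Hypothetical helper (not part of Yothalot): start a job and either
 *  block until it is finished or leave it running in the background.
 *
 *  @param  object  $job       a Yothalot\Job to which all input was already added
 *  @param  bool    $blocking  should the script wait for the job to finish?
 *  @return mixed              whatever wait() returns, or null when detached
 */
function runJob($job, $blocking)
{
    // after this call no input can be added and no tuning can be done
    $job->start();

    // blocking mode: the script is halted until the job is finished
    if ($blocking) return $job->wait();

    // background mode: wait() can no longer be called after this
    $job->detach();
    return null;
}
```

A web request would typically pass false for $blocking, while a command line script that needs the output of the job would pass true.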

Getting information from your job

Besides blocking your script until the job is complete, the wait() method also returns a Yothalot\Result object with all kinds of information about the performance and behavior of the job.

Tuning the job

There are many methods to tune your job's performance. You can for example set the modulo so that the mapped data is split up into multiple groups that are individually reduced and written, or you can limit the number of processes that are started.

Most of the tuning parameters only apply to map/reduce jobs. For race algorithms, only the maxprocesses() setting is relevant.

class Yothalot\Job
{
    // functions for performance tuning
    public function local($local);
    public function modulo($modulo);
    public function maxfiles($max);
    public function maxbytes($max);
    public function maxprocesses($max);
    public function maxmappers($max);
    public function maxreducers($max);
    public function maxwriters($max);
}

All of the above methods except local() accept one parameter: an integer value with the setting. The local() method accepts a boolean. You must set these tuning parameters before you start the job. For an explanation of the meaning of all the tuning parameters, see the special in-depth article about tuning mapreduce jobs.
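
When the settings come from a configuration array, they can be applied in a loop that also enforces the parameter types mentioned above. The following is a sketch under that assumption: applyTuning() is a hypothetical helper, not part of the Yothalot API, and it only uses the method names listed in the class overview.

```php
<?php
/**
 *  Hypothetical helper (not part of Yothalot): apply tuning settings from
 *  an array. The keys are names of tuning methods, the values are the
 *  settings. Must be called before the job is started.
 *
 *  @param  object  $job       a Yothalot\Job that was not started yet
 *  @param  array   $settings  for example array('local' => false, 'modulo' => 4)
 */
function applyTuning($job, array $settings)
{
    // the tuning methods that expect an integer
    $integers = array('modulo', 'maxfiles', 'maxbytes', 'maxprocesses',
                      'maxmappers', 'maxreducers', 'maxwriters');

    foreach ($settings as $name => $value)
    {
        // local() is the only method that expects a boolean
        if ($name === 'local') $job->local((bool)$value);

        // all other known methods expect an integer
        elseif (in_array($name, $integers, true)) $job->$name((int)$value);

        // refuse unknown names so that typos do not go unnoticed
        else throw new InvalidArgumentException("unknown tuning setting: $name");
    }
}
```

The values in the docblock are arbitrary and not a recommendation; suitable values depend on your data and your cluster.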