skelebot

Machine Learning Project Development Tool

Jobs

Jobs are the core of what makes Skelebot useful. They allow you to explicitly define the different execution sequences that your project offers and provide a simple way for others to find and understand not only what the job does, but how to run it.

Jobs are configured manually inside of the skelebot.yaml file.

...
jobs:
- name: example
  source: src/jobs/example.sh
  mode: i
  native: optional
  host: ssh://root@me.local
  help: EXAMPLE JOB
  mappings:
  - data/
  - ~/myname.keytab:~/root/keytabs
  - models/:app/model-output/
  args:
  - name: date
    help: the date on which to pull data for the job
  params:
  - name: env
    alt: e
    default: local
    help: the environment from which the job will pull data
    choices:
    - local
    - dev
    - prod
...

A job must contain three things in order to work: a name, so it can be called from the command line; a source file to execute; and a help message so users can understand it. Skelebot supports calling both Python (.py) and Bash (.sh) scripts. Jobs can also contain several additional fields, such as mode, native, host, mappings, args, and params, as shown in the example above.

Executing a job is as simple as passing the job name to the Skelebot command.

In the example below the argument date is passed as 2018-01-01 and the parameter env is passed as dev. These are then passed along to the script src/jobs/example.sh inside of a Docker container.

If a job is executed in Docker (the default), any output it generates will remain inside the Docker container and will not appear on the host machine. As a result, in order to get output files from jobs, it is necessary to add the output folder to the mappings list in the job’s config.

> skelebot example 2018-01-01 --env dev
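For example, if a job writes its results to an output/ folder inside the container (a hypothetical folder name, not part of the example above), adding that folder to mappings persists the files to the host. A minimal sketch:

```yaml
jobs:
- name: example
  source: src/jobs/example.sh
  help: EXAMPLE JOB
  mappings:
  - output/   # maps ./output on the host to the same path in the container
```

With this mapping in place, files the job writes to output/ survive after the container exits.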

Global Job Parameters

Often you may have the same parameter that applies to every job in the project, such as setting the log level. For this situation Skelebot offers the ability to specify global parameters that apply to every job.

These parameters are defined exactly the same way that job parameters are defined, but they are specified at the root level of the config instead of inside the config of a single job. When a parameter is specified in the root-level params list, it is applied to every job defined in the config.

...
jobs:
- ...
params:
- name: log-level
  alt: l
  default: info
  help: The level at which logs should be output from the jobs
  choices:
  - debug
  - info
  - warn
  - error
  - critical
...
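With this configuration, the global parameter can be passed to any job just like a job-level parameter. Reusing the example job from earlier (a sketch, assuming that job is defined in the same config):

```
> skelebot example 2018-01-01 --log-level debug
```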

Mapping Ports

Skelebot provides the ports property in the skelebot.yaml config file for specifying on which ports jobs will run and expose their services. This property accepts a list of strings of a specific format: {host-port}:{container-port}.

The host-port specifies the port that is exposed on the host machine. This is the port that you will use to access whatever you may be serving.

The container-port specifies the port inside the Docker Container that will be mapped.

Ports can be specified at two different levels.

The global level will apply the port mappings to every job.

jobs:
- ...
ports:
- 8080:8080
- 8888:8888

The job level will apply the port mappings to only the specified job.

jobs:
- name: run
  source: run.py
  help: run the server
  ports:
  - 8080:8888

Primary Job

By default the Docker Image that is built by Skelebot will not run a command, but instead requires skelebot to provide it with a script, arguments, and parameters to run when a job is executed. For the purpose of building images that can be distributed, Skelebot offers a way to specify a job as the project’s Primary Job.

primaryJob: example

This is done by simply using the name of one of the jobs in the primaryJob attribute of the config file. This will allow Skelebot to set this job as the default command for the docker image that is built, thereby making a more easily distributable Docker Image for the sake of deployment.
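Assuming the image has been built with example set as the primary job and tagged my-image (a hypothetical tag, not defined above), it can then be run directly, with no job name required:

```
> docker run my-image
```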

Primary Exe

The primaryExe field in the skelebot.yaml config allows for the specification of an execution command to use in the Dockerfile. This field accepts two different values: “ENTRYPOINT” and “CMD”. If the field is not specified in the config, the default “CMD” value is used.

primaryExe: (ENTRYPOINT, CMD)

Using “CMD” as the primary execution method requires that the primary job is configured to use only parameters, not arguments, and that each parameter has a default value. This allows the command string to be constructed in full so that it can be run without any extra parameters in the Docker Run command.

In this scenario it is possible to set default parameter values to environment variables, allowing different users to set different parameter values without altering the manner in which the image is executed.

...
jobs:
- name: test
  source: jobs/test.sh
  mode: it
  help: Run the test cases for the project
  params:
  - name: runner
    alt: r
    default: $RUNNER
    help: The name of the person running the tests
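Because the runner parameter defaults to the $RUNNER environment variable, each user can supply their own value at run time with Docker's -e flag, without changing the image (a sketch, assuming the image is tagged my-image):

```
> docker run -e RUNNER=ME my-image
```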

When using the “ENTRYPOINT” execution method, the parameters and arguments are not used in the construction of the Dockerfile; instead, they must be supplied manually when the container is run.

Were the example job above to be configured with an “ENTRYPOINT” execution, the params could be specified at runtime in the following manner.

> docker run my-image --runner ME

For more information on the details of “CMD” and “ENTRYPOINT” please refer to Docker’s Documentation.

Skelebot Parameters

Skelebot has some optional parameters that allow you to control how the jobs are run. These parameters apply to everything in Skelebot, not just the jobs. As such, they are specified in the command line after the skelebot command and before the job argument.
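In general, the command line follows this ordering (a schematic, not a literal command):

```
> skelebot [skelebot-params] <job> [job-args-and-params]
```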

Chaining Jobs

Jobs can also be chained together in a single command (executed one after another) by joining them with a + character. This allows multiple jobs to be executed in one command, which can be a real time saver for long-running sequences of jobs.

> skelebot query + wrangle --all + train --output-folder results/

The example above would first execute the query job, followed by the wrangle job with the all flag set to TRUE, and finally the train job with output-folder set to results/.


<< Image Commands | Priming >>