Getting started

Basic concepts

Step

A data science experiment consists of many steps: data loading, data preprocessing, model training, model evaluation, and so on. In Villard, each step is a Python function that takes some parameters and returns a result. A parameter can be user-defined data or the result of another step. Interconnected steps form a directed graph, which we call a pipeline.
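As a rough illustration of the idea, the sketch below writes two steps as plain Python functions. The function names, the scale parameter, and the data are hypothetical, and no Villard API is used here; the point is only that a step may receive both a user-defined value and the result of another step.

```python
# Hypothetical steps written as plain functions (no Villard API involved).

def load_data():
    """A step with no upstream dependency: returns some raw values."""
    return [3.0, 1.0, 2.0]


def preprocess_data(data, scale):
    """Takes another step's result (data) and a user-defined value (scale)."""
    return [x * scale for x in sorted(data)]


raw = load_data()                        # result of the first step
clean = preprocess_data(raw, scale=2.0)  # consumes that result plus user data
print(clean)  # [2.0, 4.0, 6.0]
```

Wiring the calls together by hand like this is the part a pipeline definition is meant to take over.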

Pipeline

Several interconnected steps comprise a pipeline. A data science project can have multiple pipelines, e.g., a data preprocessing pipeline, a model training pipeline, and a model inference pipeline.

Run

A run is the execution of all the steps in a pipeline, from beginning to end.

Consider the following example pipeline:

flowchart LR
  load_data --> preprocess_data;
  preprocess_data --> train_model;
  train_model --> evaluate_model;
  preprocess_data --> evaluate_model;

Each box represents a step. The arrows represent the dependencies between steps. A single run of the example pipeline executes every step, starting at load_data and ending at evaluate_model.
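To see why a run ends up in that order, the dependencies in the diagram can be sorted topologically. The sketch below uses graphlib from the Python standard library (Python 3.9 or later). It is only an illustration of ordering a dependency graph and makes no claim about how Villard schedules steps internally.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on (the arrows in the diagram).
dependencies = {
    "load_data": set(),
    "preprocess_data": {"load_data"},
    "train_model": {"preprocess_data"},
    "evaluate_model": {"train_model", "preprocess_data"},
}

# static_order() yields every step after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
# ['load_data', 'preprocess_data', 'train_model', 'evaluate_model']
```

For this particular graph only one valid order exists, because evaluate_model needs both preprocess_data and train_model to finish first.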

Hello world pipeline

For our first example, let’s consider the following simple pipeline:

flowchart LR
    combine_name --> greet;

Villard is all about pipeline orchestration. It is designed to cope with frequent pipeline changes. All pipeline definitions and changes are made in configuration files. Let’s create one. In your project folder, create a config.yml file.

Note

Villard supports not only YAML, but also JSON and Jsonnet.

At the very least, a configuration file must contain a pipeline_definition section, and that section must contain a _default key.

step_implementation_modules:
    - implementation

pipeline_definition:
    _default:
        combine_name:
            first_name: "John"
            last_name: "Doe"
        greet:
            name: "ref::combine_name"

A configuration file can hold multiple pipelines, so you may add more definitions alongside _default.

The combine_name step takes the values of first_name and last_name and outputs a string that joins the two names. That output is then passed to the greet step to be displayed. Note that the name value of the greet step is ref::combine_name. This means that the greet step refers to the output of the combine_name step (the concatenated name).
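The way a ref:: value links two steps can be sketched with a toy resolver. Everything below is an assumption made for illustration: the run helper, the steps dictionary, and the resolution strategy are hypothetical and are not Villard's implementation. The greet function here returns its string instead of printing it so that the result can be inspected.

```python
# Toy resolver, NOT Villard's implementation: it only illustrates the idea
# behind "ref::<step>" parameter values.
REF_PREFIX = "ref::"


def combine_name(first_name, last_name):
    return first_name + " " + last_name


def greet(name):
    return "Hello, " + name + "!"


# Hypothetical registry mapping step names to functions.
steps = {"combine_name": combine_name, "greet": greet}

# Same structure as the _default pipeline in config.yml.
definition = {
    "combine_name": {"first_name": "John", "last_name": "Doe"},
    "greet": {"name": "ref::combine_name"},
}


def run(definition, steps):
    results = {}

    def resolve(step_name):
        # Each step runs at most once; later references reuse its result.
        if step_name in results:
            return results[step_name]
        kwargs = {}
        for key, value in definition[step_name].items():
            # A "ref::" value is replaced by the referenced step's output.
            if isinstance(value, str) and value.startswith(REF_PREFIX):
                value = resolve(value[len(REF_PREFIX):])
            kwargs[key] = value
        results[step_name] = steps[step_name](**kwargs)
        return results[step_name]

    for name in definition:
        resolve(name)
    return results


results = run(definition, steps)
print(results["greet"])  # Hello, John Doe!
```

Resolving references recursively means a step's inputs are computed before the step itself, whatever order the steps appear in within the definition.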

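Next, implement the pipeline steps in implementation.py. The file name matches the implementation entry listed under step_implementation_modules in the configuration file.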

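from villard import pipeline


# Implements the combine_name step defined in config.yml.
@pipeline.step("combine_name")
def combine_name(first_name, last_name):
    return first_name + " " + last_name


# Implements the greet step; name receives the output of combine_name.
@pipeline.step("greet")
def greet(name):
    print("Hello, " + name + "!")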

Now execute the pipeline by running the following command:

$ villard run config.yml