flowchart LR load_data --> preprocess_data; preprocess_data --> train_model; train_model --> evaluate_model; preprocess_data --> evaluate_model;
Your data science experiment consists of many steps. For example, data loading step, data preprocessing step, model training step, model evaluation step, and so on. In Villard, each step is a Python function that takes some parameters and returns a result. The parameters can be some user-defined data, or the result from other steps. Interconnected steps resemble a directed graph, and we call it a pipeline.
Several interconnected steps comprise a pipeline. Data science projects can have multiple pipelines, e.g., data preprocessing pipeline, model training pipeline, model inference pipeline, etc.
A run is the execution of all the steps in pipeline, from the beginning to the end.
Consider following pipeline example:
flowchart LR load_data --> preprocess_data; preprocess_data --> train_model; train_model --> evaluate_model; preprocess_data --> evaluate_model;
Each box represents a step. The arrows represent the dependencies between steps. A single run of the example pipeline would be the execution starting from load_data
to evaluate_model
.
For our first example, let’s consider the following simple pipeline:
flowchart LR get_name --> greet;
Villard is all about pipeline orchestration. It is designed to cope with frequent pipeline changes. All pipeline definitions and changes are made in configuration files. Let’s create one. In your project folder, create a config.yml
file.
At very least, a configuration file must contain a pipeline_definition
section, and that section must contain _default
key.
step_implementation_modules:
- implementation
pipeline_definition:
_default:
combine_name:
first_name: "John"
last_name: "Doe"
greet:
name: "ref::combine_name"
It also accommodates multiple pipelines, so you may add more definitions other than _default
.
The combine_name
step will take the values of first_name
and last_name
. This step will output a string that is the concatenation of both names. The output of combine_name
is then passed to greet
step to be displayed. Note that the name
value of greet
step is ref::combine_name
. It means that the greet
step refers to the output of combine_name
step (which is the concatenation names).
Then we make the pipeline steps’ implementation in implementation.py
.
from villard import pipeline
@pipeline.step("combine_name")
def combine_name(first_name, last_name):
return first_name + " " + last_name
@pipeline.step("greet")
def greet(name):
print ("Hello, " + name + "!")
Now execute the pipeline by invoking following command:
$ villard run config.yml