# Data Science Industrialization Boilerplate

This boilerplate enables data scientists to develop, serve and version models within a CI/CD pipeline hosted on OpenShift, with the goal that one does not have to take care of, or change, much of the required pipeline and infrastructure.

For pull requests and discussion regarding direction, please pull in @hugowschneider, @sklingel and @gerardcl.
## Basic Setup

The boilerplate provides a two-pod setup in OpenShift: one pod for the training service and one pod for the prediction service.
### Container services

Both the `training` and `prediction` services are built from one `Dockerfile` located in the `docker` folder. If required, one can edit it in order to provide different dependency-management workflows for each service.

- The `training` service provides a pod that is able to reproduce/retrain the model developed in the current commit, either locally on OpenShift or by executing the training on a remote Linux system via SSH. The training process is wrapped in a Flask server so that it can be monitored and, if needed, restarted. Moreover, the training service offers an endpoint for downloading the created model afterwards. Additionally, unit tests and integration tests are executed on the training pod, in order not to depend on runtime dependencies in the Jenkins slave.
- The `prediction` service provides a simple Flask service for getting new predictions out of your model by making JSON POSTs to the prediction REST endpoint. The prediction service is built with the newly trained model from the training pod.
## Jenkins

The `Jenkinsfile` orchestrates the correct succession of spinning up the training pod, executing the training, and starting the new deployment of the prediction service. Additionally, it triggers unit tests ensuring the code is functional before a new training process is started. Moreover, integration tests are run against the reproduced model wrapped in the prediction REST endpoint, to ensure that the reproduced model behaves (and performs) as expected when wrapped in the Flask service.
## External Files

External files that are needed either for building your model or the Docker images are stored under `resources`. For demonstration purposes, a training and a test CSV file are stored in `resources`. This approach has to be re-evaluated for each new use case, considering data size and confidentiality.
## src - the heart of your service

The `src` folder contains the infrastructure code needed for providing the services in OpenShift, located in `src/services`. Custom code for developing your prediction service is organized in the `src/model` package.

In the common `src/requirements.txt` you can specify Python dependencies for training, prediction and tests. To keep it simple, there is only one `requirements.txt` for both pods.
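Based on the libraries mentioned in this README (Flask, pandas, scikit-learn, dvc), a shared `requirements.txt` might look like the following; the exact package list and any version pins are illustrative, not prescribed by the boilerplate:

```text
flask
pandas
scikit-learn
dvc
```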
## How to Code Your Own Models

To run your own customized models there is usually no need to change the `Jenkinsfile`, the OpenShift setup, or the training and prediction microservices. Custom model code goes under `src/model` and can be organized in custom packages, as showcased with `src/model/data_cleaning` and `src/model/feature_prep`. In general, it can be organized as the user prefers.

There are no further restrictions on developing in the style you want, with the exception of providing the mandatory functions and attributes in `src/model/model_wrapper.py` for the `ModelWrapper` class:

- `prep_and_train`: called by the train script (which one can customize), which in turn is called by the training service; it expects a pandas DataFrame (current implementation).
- `prep_and_predict`: called by the predict endpoint of the prediction service. It consumes the JSON POST body as a dictionary.
- Good practice: `source_features`, specifying the names that are used as input for the model. These are really the source columns from which more complicated features are derived within the model boundaries.
- Good practice: `target_variable`, the name of the variable that should be used as the target for a possible supervised approach.
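A minimal `ModelWrapper` satisfying this contract might look as follows; the trivial majority-class "model" and the column names are illustrative assumptions, not part of the boilerplate:

```python
import pandas as pd

class ModelWrapper:
    # Good practice: declare source columns and target explicitly.
    source_features = ["sepal_length", "sepal_width"]  # illustrative names
    target_variable = "species"

    def __init__(self):
        self._majority_class = None

    def prep_and_train(self, df: pd.DataFrame):
        # Illustrative "training": remember the most frequent target value.
        self._majority_class = df[self.target_variable].mode()[0]

    def prep_and_predict(self, payload: dict):
        # payload is the parsed JSON body of the POST to the predict endpoint
        return {"prediction": self._majority_class}

wrapper = ModelWrapper()
wrapper.prep_and_train(pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.2],
    "sepal_width": [3.5, 3.0, 2.9],
    "species": ["setosa", "setosa", "virginica"],
}))
print(wrapper.prep_and_predict({"sepal_length": 5.0, "sepal_width": 3.3}))
# {'prediction': 'setosa'}
```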
The same applies to the `train` function in `src/trainer.py`, which specifies how the model should be trained.

Make sure you have specified all dependencies in `src/requirements.txt`.
## How to Develop your Model Locally

It is recommended to develop your code against the Python interpreter and dependencies specified in the Docker images. This can easily be achieved either by using an IDE that supports it (e.g. PyCharm) or by doing it manually in the Docker container.
## Data Versioning

In order to ensure complete reproducibility in cases where train and/or test data cannot be committed to a Git repository due to size or confidentiality/data-privacy considerations, data versioning can be achieved using the built-in `dvc` data versioning capabilities. Moreover, a technical user account is needed so that the CI/CD pipeline is able to pull the data dependencies from the remote data-versioning repository.
Do the following steps in order to make use of the data versioning capabilities:

1. Initialize the quickstarter repository as a `dvc` repository: `dvc init`
2. Set up a remote repository on a remote SSH machine, e.g. a data lake: `dvc remote add <remote_name> ssh://<remote_server>:/<path_to_storage>`
3. Configure authentication. For local development you can set your own user account, assuming it has access to `<path_to_storage>`, or use a technical user account: `dvc remote modify <remote_name> user <technical_user_account>`. Also enable prompting for the password, so that you don't commit your password to the repository: `dvc remote modify <remote_name> ask_password true`
4. Start adding files that should be tracked by data versioning: `dvc add <some_file>`. This creates a new file with meta information about `<some_file>`, called `<some_file>.dvc`. This meta file needs to be tracked with Git, so that each Git commit is linked to a specific data version: `git add .gitignore <some_file>.dvc`
5. Modify your `train()` and potentially the integration tests to pull the data dependencies from the remote repository. A helper class is provided in `src/services/remote/dvc/data_sync.py` that can be used as follows:

   ```python
   from services.infrastructure.remote.dvc.data_sync import DataSync

   syncer = DataSync(dvc_data_repo, dvc_ssh_user, dvc_ssh_password)
   syncer.pull_data_dependency(file_name)
   ```

6. Commit your code and push the data-versioned files to the remote repository:

   ```
   git commit
   dvc push -r <remote_name>
   git push
   ```

7. For a successful Jenkins build, the following environment variables need to be set in the `training` pod deployment: `DSI_DVC_REMOTE`, `DSI_SSH_USERNAME`, `DSI_SSH_PASSWORD`
## Example & Example Dataset

An example implementation of a custom model is given in `src/model`, to demonstrate how to organize custom code. A logistic regression using scikit-learn, with some (unnecessary) feature cleaning and engineering, is trained on the Iris flower data set.

Iris flower data set. (n.d.). In Wikipedia. Retrieved January 7, 2019, from https://en.wikipedia.org/wiki/Iris_flower_data_set
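The bundled example corresponds roughly to the following sketch (a plain scikit-learn logistic regression on Iris); the train/test split and hyperparameters here are illustrative, not the boilerplate's exact code:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the Iris flower data set bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A simple logistic regression, as showcased in src/model.
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out quarter
```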
## Structure of the quickstarter

- Training
  - Build Config
    - name: `<componentId>-training-service`
    - variables: None
  - Deployment Config
    - name: `<componentId>-training-service`
    - variables:
      - `DSI_EXECUTE_ON`: LOCAL
      - `DSI_TRAINING_SERVICE_USERNAME`: auto-generated username
      - `DSI_TRAINING_SERVICE_PASSWORD`: auto-generated password
  - Route: None by default - no routes exposed to the internet
- Prediction
  - Build Config
    - name: `<componentId>-prediction-service`
    - variables: None
  - Deployment Config
    - name: `<componentId>-prediction-service`
    - variables:
      - `DSI_TRAINING_BASE_URL`: `http://<componentId>-training-service.<env>.svc:8080`
      - `DSI_TRAINING_SERVICE_USERNAME`: username of the training service
      - `DSI_TRAINING_SERVICE_PASSWORD`: password of the training service
      - `DSI_PREDICTION_SERVICE_USERNAME`: auto-generated username
      - `DSI_PREDICTION_SERVICE_PASSWORD`: auto-generated password
  - Route: None by default - no routes exposed to the internet
## Remote Training

Remote training allows you to run your training outside of the OpenShift training pod, on a Linux node, using an SSH connection. A conda environment is installed on the remote node and the requirements specified in `src/requirements.txt` are installed into it. Once this step is finished, the training is executed on that node and the trained model is transferred back to the training pod.

To enable remote training, set the `DSI_EXECUTE_ON` variable in OpenShift to `SSH` and specify the connection information in the environment variables `DSI_SSH_HOST`, `DSI_SSH_PORT`, `DSI_SSH_USERNAME` and `DSI_SSH_PASSWORD`.
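The dispatch between local and remote execution can be pictured as follows; this is an illustrative sketch of the contract described above (the function name is hypothetical), not the boilerplate's actual implementation:

```python
import os

def resolve_execution_mode(env=None):
    """Validate DSI_EXECUTE_ON and the SSH settings it requires."""
    env = os.environ if env is None else env
    mode = env.get("DSI_EXECUTE_ON", "LOCAL").upper()
    if mode not in {"LOCAL", "SSH"}:
        raise ValueError(f"DSI_EXECUTE_ON must be LOCAL or SSH, got {mode!r}")
    if mode == "SSH":
        # Remote training needs the SSH connection details.
        required = ("DSI_SSH_HOST", "DSI_SSH_USERNAME", "DSI_SSH_PASSWORD")
        missing = [name for name in required if not env.get(name)]
        if missing:
            raise ValueError(f"missing SSH settings: {', '.join(missing)}")
        # DSI_SSH_PORT is optional; assume the usual SSH default of 22.
        port = int(env.get("DSI_SSH_PORT", "22"))
        return mode, port
    return mode, None

print(resolve_execution_mode({"DSI_EXECUTE_ON": "LOCAL"}))  # ('LOCAL', None)
```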
## Endpoints

### Training Endpoint

- `/`: returns all information about the training service
- `/start`: starts the training
- `/finished`: checks whether the current training task is finished
- `/getmodel`: downloads the latest trained model

None of the training endpoints requires any kind of payload.

### Prediction Endpoint

- `/predict`: returns a prediction from the trained model
  - payload: should be a JSON object containing the data necessary for the prediction. The payload is not predefined; it is defined by the trained model.
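The JSON contract of the prediction endpoint can be exercised locally with Flask's test client; the endpoint body and the dummy `prep_and_predict` below are illustrative stand-ins for the real prediction service:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def prep_and_predict(payload):
    # Dummy stand-in: a real ModelWrapper would run the trained model here.
    return sum(payload.get("features", []))

@app.route("/predict", methods=["POST"])
def predict():
    # The payload layout is defined by the trained model, not the service.
    payload = request.get_json()
    return jsonify({"prediction": prep_and_predict(payload)})

# No running server needed: exercise the endpoint with the test client.
client = app.test_client()
resp = client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
print(resp.get_json())  # {'prediction': 6.0}
```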
## Environment Variables for training

| Environment Variable | Description | Allowed Values |
|---|---|---|
| DSI_DEBUG_MODE | Enables debug mode | `true`, `1` or `yes` for debug mode, otherwise debug is disabled |
| DSI_EXECUTE_ON | Where the training should be executed | `LOCAL`, `SSH` |
| DSI_TRAINING_SERVICE_USERNAME | Username to be set as default username for accessing the services | string, required |
| DSI_TRAINING_SERVICE_PASSWORD | Password to be set as default password for accessing the services | string, required |
| DSI_SSH_HOST | SSH host name where the training should be executed (only applicable if `DSI_EXECUTE_ON=SSH`) | host names or IP addresses |
| DSI_SSH_PORT | SSH host port where the training should be executed (only applicable if `DSI_EXECUTE_ON=SSH`) | port numbers (default: 22) |
| DSI_SSH_USERNAME | SSH username for remote execution | string |
| DSI_SSH_PASSWORD | SSH password for remote execution | string |
| DSI_SSH_HTTP_PROXY | HTTP proxy URL for remote execution, needed if the remote machine requires a proxy to download packages and resources | string |
| DSI_SSH_HTTPS_PROXY | HTTPS proxy URL for remote execution, needed if the remote machine requires a proxy to download packages and resources | string |
| DSI_DVC_REMOTE | Name of the dvc remote repository that has been initialized with dvc | string |
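The `DSI_DEBUG_MODE` convention in the table ("true, 1 or yes", otherwise disabled) can be sketched as a small helper; the function name and the case-insensitive comparison are assumptions for illustration:

```python
import os

TRUTHY = {"true", "1", "yes"}

def debug_enabled(env=None):
    # DSI_DEBUG_MODE counts as enabled for "true", "1" or "yes";
    # any other value (or an unset variable) disables debug mode.
    env = os.environ if env is None else env
    return env.get("DSI_DEBUG_MODE", "").strip().lower() in TRUTHY

print(debug_enabled({"DSI_DEBUG_MODE": "yes"}))  # True
print(debug_enabled({"DSI_DEBUG_MODE": "no"}))   # False
```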
## Environment Variables for prediction

| Environment Variable | Description | Allowed Values |
|---|---|---|
| DSI_DEBUG_MODE | Enables debug mode | `true`, `1` or `yes` for debug mode, otherwise debug is disabled |
| DSI_TRAINING_BASE_URL | The base URL from which the prediction service should get the model | URL (e.g. `https://training.OpenShift.svc`) |
| DSI_TRAINING_SERVICE_USERNAME | Username of the training service | string, required |
| DSI_TRAINING_SERVICE_PASSWORD | Password of the training service | string, required |
| DSI_PREDICTION_SERVICE_USERNAME | Username to be set as default username for accessing the service | string, required |
| DSI_PREDICTION_SERVICE_PASSWORD | Password to be set as default password for accessing the service | string, required |
## How this quickstarter is built through Jenkins

The build pipeline is defined in the `Jenkinsfile` in the project root. The main stages of the pipeline are:

- Prepare build
- SonarQube checks
- Build training image
- Deploy training pod
- Unit tests
- Execute/reproduce training, either on the OpenShift pod or on an SSH remote machine
- Integration tests against the newly trained model wrapped in the Flask prediction endpoint
- Build prediction image
- Deploy prediction service
## Known limitations

- Not ready for R models yet.
- When building the Docker image from behind a proxy and encountering certificate issues, adding `-k` to the `curl` command can mitigate that; however, consider the implications of disabling certificate verification.
- Consider moving to SSH remote-server training if you expect a high and long-lasting computational load during the training phase; otherwise it might cause unnecessary stress on the OpenShift cluster.