Tuesday, May 1, 2018

A model-training template I sort of like

I hope someone can prove me wrong about this, but I don't think any single model-training script is appropriate for all cases. Even with the advent of automated machine learning (TPOT, AutoML, etc.), every data set has its pathologies, and every project has its bizarre use cases and requirements.

After a large handful of projects in which I've made a variety of mistakes, I have put together a modeling template that I'm more or less happy with. The tension in making such a script is that on one hand we want something general enough that it can be cloned and applied to the widest variety of modeling tasks, and on the other hand we want it to handle each individual task very well. I think this template is a fair compromise between the two.

You can clone the repo here.

Project structure, and the point of it all

The idea is that the main script, process.py, implements the usual modeling steps at an abstract level, and that you can edit the contents of the individual steps to fit your project's needs. An example is included.

The project skeleton is based on CookieCutter Data Science's template, which I've adjusted for my needs. I also kept parts of the original template that I thought would be useful, like the notebooks/ directory for exploratory studies. That template contains placeholders for data processing and modeling modules, and I've implemented those to a limited degree.

Project structure is critical, and you may want to rearrange it to suit your needs. As the saying is paraphrased, "code is meant to be read by humans." That applies to the project structure as well: we want collaborators or strangers to be able to understand what we are doing, and to continue or replicate our work. In this project template, there's only one script in the home directory, and it refers to modules with obvious names.

The modeling script itself loosely follows suggestions published on Maciej's blog, but I've added and subtracted to conform to my preferred workflow. It looks something like this:
  1. Load and pre-process the raw data (element-wise cleaning only)
  2. Construct a pipeline that includes feature building and model training
  3. Run the pipeline on the training data
  4. Evaluate on the test data
  5. Save the results
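
To make that flow concrete, here is a minimal sketch of what such a driver script can look like. The function names and module paths are placeholders standing in for the template's own modules, not the exact contents of process.py:

    # Hypothetical imports: in the real template these steps live in its
    # data-processing and modeling modules; the names here are placeholders.
    from src.data import load_data, load_variable_codes, preprocess
    from src.models import build_pipeline, fit_model, evaluate, save_results

    def main():
        df = load_data()                               # multi-indexed by 'train'/'test'
        codes = load_variable_codes()                  # per-feature metadata
        df = preprocess(df, codes)                     # element-wise cleaning only
        train, test = df.loc['train'], df.loc['test']  # split by index label
        pipeline = build_pipeline(codes)               # feature building + model
        model = fit_model(pipeline, train)             # e.g. a grid search
        report = evaluate(model, test)                 # predict and score on test
        save_results(model, report)                    # persist artifacts

    if __name__ == '__main__':
        main()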

Usage

To set up and run the example modeling task, follow these steps.
  1. Clone the repo
  2. Create and activate the Conda environment. If you don't have Conda, you can install it here. To create the environment on Windows, open a command prompt in example_pipeline/example_pipeline/ and run the command

    conda env create -f environment.yml

    Once that finishes, run the command

    activate example_pipeline

    to activate the environment.
  3. Run the script using
    python process.py

    Results will appear in a timestamped directory inside results/.

The script

I want to discuss the choices I made when building this script, roughly in the order they appear.
  • Set parameters
    • Usually I would use a config file here, especially if I needed credentials for the script. In this case I just hard-coded the file paths, and there were no other free parameters.
  • Load data
    • This is a custom function that will change depending on your situation. The output should be a multi-index data frame, where training data are indexed as 'train' and test data are indexed as 'test'. Here, I'm loading from two files. If you're loading all data from a single file and then selecting train/test data from that, it should be done in the load_data() function (sketched after this list).
  • Load variable codes
    • Every time I have a modeling task, I make a variable codes file. For each base feature in the data, this file contains the feature name, types, levels if categorical or ordinal, and indicators for whether or not to use each feature in a particular model. We can also add columns to indicate processing steps or to identify weight or target data columns. Not only is it convenient to use in code, but I catch data problems by going through the exercise of building it.
  • Pre-process data
    • This is where we strip whitespace from strings, relabel the target, and so on. Note that this is not the same as feature creation; that comes later. The pre-processing step is only allowed to include element-wise operations that are agnostic of any other elements. Only you can prevent data leakage.
    • In this specific case the target can have two values, but they're labeled slightly differently in the train and test sets. I provide a dictionary to do the mapping.
  • Split into train and test
    • I prefer for this to be done by label during the load step. The earlier we can distinguish between test and train, the better.
  • Build the pipeline
    • This will, of course, be project specific. But the pipeline should include any feature selection/creation steps as well as one or more model fitting steps.
    • In this case, I pass the pipeline builder a list of features to include. I prefer to read this list from the variable codes file, where we can always add columns for additional models, but feel free to pass a hard-coded list. 
    • For the example, I've chosen to use a standard scaler on numerical columns and one-hot encoding on categorical columns. I'm using gradient boosting as the example model here.
    • For the transformations, I'm using sklearn_pandas, in which each feature is given a transformation, and we filter the list of transformations according to which features we want to include.
    • I also include a grid search for good measure. As currently set up, it's awkward to set the parameter grid; you could read it from a config file or modify it manually. I'm thinking ahead to a future blog post in which I'll do automated machine learning, and hopefully this issue will go away at that time. (A sketch of the pipeline and grid search appears after this list.)
  • Train the model
    • Split into X and y for convenience. I'm passing DataFrames instead of arrays to scikit-learn.
    • Fit the pipeline and choose the best estimator. In principle this step can be much more complicated, involving training hundreds of models and ensembling. 
  • Test the model
    • Same thing as above, but predict using the best pipeline instead of running the full grid search.
    • Evaluate the results in whatever way you prefer. Here I'm just printing out a classification report.
  • Save the results
    • I err on the side of recording too much information. In this example, I save the model, the classification report, and the details of the pipeline object. In practice, I would also save a snapshot of the .py scripts, unless doing so would reveal sensitive info.
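
Two of the custom pieces above are easy to sketch. A load_data() along the lines described (reading two files and concatenating them under a 'train'/'test' index) and an element-wise pre-processing step might look like the following; the file paths, column names, and label spellings are made up for illustration:

    import pandas as pd

    # Hypothetical paths; the real script hard-codes its own.
    TRAIN_PATH, TEST_PATH = 'data/train.csv', 'data/test.csv'

    def load_data():
        """Return one frame, multi-indexed by 'train'/'test' at the outer level."""
        train = pd.read_csv(TRAIN_PATH)
        test = pd.read_csv(TEST_PATH)
        return pd.concat([train, test], keys=['train', 'test'])

    # Element-wise operations only. For example, if the two target labels are
    # spelled differently in the train and test files, map them to a common form
    # (this particular mapping is invented for the sketch).
    target_map = {'yes': 1, 'yes.': 1, 'no': 0, 'no.': 0}

    def preprocess(df):
        df = df.copy()
        df['target'] = df['target'].map(target_map)
        df['occupation'] = df['occupation'].str.strip()  # hypothetical string column
        return df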
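And here is a self-contained sketch of the pipeline, grid-search, evaluation, and save steps described above. The synthetic data, column names, parameter grid, and output path are illustrative assumptions rather than the repo's actual example:

    import joblib
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelBinarizer, StandardScaler
    from sklearn_pandas import DataFrameMapper

    # Stand-in for the variable codes file: name, type, and an indicator
    # for whether to use each feature in this particular model.
    codes = pd.DataFrame({
        'feature': ['age', 'hours_per_week', 'workclass'],
        'type': ['numeric', 'numeric', 'categorical'],
        'use': [1, 1, 1],
    })

    # Synthetic train/test frames standing in for the real load_data() output.
    rng = np.random.RandomState(0)
    def make_frame(n):
        return pd.DataFrame({
            'age': rng.randint(18, 70, n),
            'hours_per_week': rng.randint(10, 60, n),
            'workclass': rng.choice(['private', 'gov', 'self'], n),
            'target': rng.choice(['no', 'yes'], n),
        })
    train, test = make_frame(200), make_frame(50)

    # One transformation per selected feature: scale numerics, one-hot
    # encode categoricals (LabelBinarizer handles a single string column).
    transformations = []
    for _, row in codes[codes['use'] == 1].iterrows():
        if row['type'] == 'numeric':
            transformations.append(([row['feature']], StandardScaler()))
        else:
            transformations.append((row['feature'], LabelBinarizer()))

    pipeline = Pipeline([
        ('featurize', DataFrameMapper(transformations)),
        ('model', GradientBoostingClassifier()),
    ])

    # The awkward hard-coded parameter grid mentioned above.
    param_grid = {'model__n_estimators': [50, 100], 'model__max_depth': [2, 3]}

    X_train, y_train = train.drop(columns='target'), train['target']
    X_test, y_test = test.drop(columns='target'), test['target']

    # Fit the grid search on the training data and keep the best pipeline.
    search = GridSearchCV(pipeline, param_grid, cv=3)
    search.fit(X_train, y_train)
    best_pipeline = search.best_estimator_

    # Evaluate on the held-out test set and persist the fitted model.
    print(classification_report(y_test, best_pipeline.predict(X_test)))
    joblib.dump(best_pipeline, 'model.joblib')

The point is less the specific transformers than the pattern: build one transformation per feature, filter that list by the variable codes, and let the pipeline own everything from feature creation through model fitting.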

Some comments on complexity

This exercise has reminded me again of the trade-off between general applicability and complexity. As we adapt a strategy to be applicable to more distinct cases, we must make the strategy more complex, or we must accept that it will handle the cases worse on average. I think this project structure and script strike a good balance - they should be adaptable to most modeling tasks, and most of the added complexity takes the form of custom modules for each sub-task. Thanks to CookieCutter Data Science's prior work, the structure is intuitive despite being complex.

Monday, April 30, 2018

Some new post ideas

It's been two and a half years since I started my first job as a data scientist, and I find that I have a few things to say by now. So I'm starting up the blog again!

Here are some post ideas, in no particular order:

  • What I discuss with data science applicants in interviews
  • What can go wrong if you don't use pipelines in model training
  • The relationship between model complexity and validation robustness
  • The changing role of data science in business
  • Exotic types of data leakage
  • Why neural nets are usually not very good
  • Very general approaches to new data science projects
  • Ethics of data and data science
  • How to succeed by making yourself obsolete
We'll see what I can get to in my free time. See you soon.

- b