Approaching any Tabular Problem using PyTorch Tabular
Level: Beginner
In this tutorial, we will look at how to tackle any tabular machine learning problem (classification or regression) using PyTorch Tabular. We will use the Covertype dataset from the UCI repository. The dataset contains 581012 rows and 54 columns and poses a multi-class classification problem: the goal is to predict the forest cover type from cartographic variables only (no remotely sensed data).
In a typical machine learning workflow, we would do the following steps:
1. Load the dataset
2. Analyze the dataset
3. Split the dataset into train and test
4. Preprocess the dataset
5. Define the model
6. Train the model
7. Make predictions on new data
8. Evaluate the model
Let's see how we do the same using PyTorch Tabular.
Step 1: Load the Data
Cover Type Dataset
Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).
This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
There is a simple utility method in PyTorch Tabular to load this particular dataset. It downloads the data from the UCI ML Repository. The original dataset has two pieces of categorical information, Soil Type and Wilderness Area, but they are one-hot encoded. The utility method converts them to categorical columns to bring the data closer to real-life datasets in the wild.
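To make the conversion concrete, here is a minimal pandas sketch of collapsing a one-hot encoded group back into a single categorical column. The toy data and column names are made up for illustration, and this is not necessarily how the PyTorch Tabular utility implements it.

```python
import pandas as pd

# Toy frame with one one-hot encoded group (illustrative values only).
df = pd.DataFrame(
    {
        "Elevation": [2596, 2590, 2804],
        "Wilderness_Area1": [1, 0, 0],
        "Wilderness_Area2": [0, 0, 1],
        "Wilderness_Area3": [0, 1, 0],
    }
)

one_hot_cols = [c for c in df.columns if c.startswith("Wilderness_Area")]

# idxmax(axis=1) returns, for each row, the name of the column holding the 1.
df["Wilderness_Area"] = df[one_hot_cols].idxmax(axis=1).astype("category")

# The indicator columns are now redundant.
df = df.drop(columns=one_hot_cols)

print(df)
```

The same pattern would apply to the Soil Type indicators, leaving one column per categorical variable.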
Step 2: Analyze the dataset
In this step, we will explore the data to understand it better. Exploratory Data Analysis (EDA) can take many forms, depending on the data and the problem we are trying to solve, and it helps us decide how to proceed. Here, we will restrict ourselves to the most basic analysis: just enough to understand which columns are continuous, which are categorical, and whether there are any missing values.
One of the easiest ways to identify categorical features is the pandas select_dtypes function, but this is not always reliable. For example, if we have a column called month with values from 1 to 12, it is a categorical column, yet select_dtypes will treat it as a continuous column. So, we need to be careful and use our judgement.
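The pitfall is easy to reproduce on a toy dataframe. The sketch below uses made-up columns: an integer-coded month is reported as numeric until we override its dtype ourselves.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "month": [1, 6, 12],              # integer-coded, but really a category
        "temperature": [3.5, 21.0, 1.2],  # genuinely continuous
        "city": ["Oslo", "Rome", "Oslo"],
    }
)

# Dtype-based detection puts `month` with the numeric columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Encode the domain knowledge explicitly, then detect again.
df["month"] = df["month"].astype("category")
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```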
Another approach is to look at the number of unique values in each column. But this is not reliable either. For example, we have a column called Soil_Type with 40 unique values. How do we decide whether it is a categorical column or a continuous column? We need to use our judgement here as well.
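A unique-value heuristic can be sketched as follows. The toy data and the cut-off `MAX_CATEGORIES` are arbitrary assumptions; a different threshold would classify columns differently, which is exactly why the result should be treated only as a list of candidates.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "soil_type": [1, 2, 3, 1, 2, 3, 1, 2],
        "elevation": [2596, 2590, 2804, 2785, 2595, 2579, 2606, 2605],
        "slope": [3, 2, 9, 18, 2, 6, 7, 4],
    }
)

MAX_CATEGORIES = 5  # arbitrary cut-off, chosen only for this illustration

# Count distinct values per column and flag the low-cardinality ones.
unique_counts = df.nunique()
candidates = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()

print(unique_counts)
print(candidates)
```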
Reading the data description, understanding the domain, and using our judgement is the best way to decide whether a column is categorical or continuous.
Here we will consider Wilderness_Area and Soil_Type as categorical features. We know Cover_Type is the target column, and that makes the rest of the columns continuous features.
from rich import print  # the print calls below rely on rich markup such as [bold purple4]
from pytorch_tabular.utils import load_covertype_dataset

# This separation has already been done for you while loading this particular dataset from `PyTorch Tabular`. Let's load the dataset in the right way.
data, cat_col_names, num_col_names, target_col = load_covertype_dataset()
# Let's also print out a few details
print(f"Data Shape: {data.shape} | # of cat cols: {len(cat_col_names)} | # of num cols: {len(num_col_names)}")
print(f"[bold dodger_blue2] Features: {num_col_names + cat_col_names}[/bold dodger_blue2]")
print(f"[bold purple4]Target: {target_col}[/bold purple4]")
Great news! There are no missing values in the dataset. If there were any, we would need to handle them. Kaggle has a good tutorial on how to handle missing values.
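A basic missing-value check can be sketched with pandas as follows. The toy data is made up, and median imputation is just one simple option among many, not a recommendation for this dataset.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "elevation": [2596.0, float("nan"), 2804.0, 2785.0],
        "soil_type": ["a", "b", None, "b"],
        "slope": [3, 2, 9, 18],
    }
)

# Number of missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# One simple strategy for a continuous column: fill with the median.
df["elevation"] = df["elevation"].fillna(df["elevation"].median())
```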
Step 3: Split the dataset into train and test
In all tabular problems, when we apply machine learning we need a training set, a validation set and a test set. We use the training set to train the model, the validation set to make modelling decisions (like the hyperparameters or the kind of model to use) and the test set to evaluate the final model. Since the dataset doesn't come with a test set, we will split the data into training, validation and test sets.
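One common way to get the three sets is to call scikit-learn's train_test_split twice. The sketch below runs on a toy dataframe; the helper name `three_way_split`, the 20% split sizes and the seed are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, test_size=0.2, val_size=0.2, seed=42):
    """Split a dataframe into train, validation and test frames."""
    # First carve out the test set ...
    train_val, test = train_test_split(df, test_size=test_size, random_state=seed)
    # ... then split what remains into train and validation.
    train, val = train_test_split(train_val, test_size=val_size, random_state=seed)
    return train, val, test


toy = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})
train, val, test = three_way_split(toy)
print(len(train), len(val), len(test))
```

For an imbalanced target, passing the target column through the `stratify` argument of train_test_split may help keep class proportions similar across the splits.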
Step 4: Preprocess the dataset
In a typical machine learning project, this is the most time-consuming step: we create new features, clean the data, handle missing values, handle outliers, scale the data, encode categorical features and so on.
In a scikit-learn based project, pseudocode for this step would look like this:
data = create_new_features(data)
data = clean_data(data)
data = handle_missing_values(data)
data = handle_outliers(data)
data, cat_encoder = encode_categorical_features(data)
data, scaler = scale_data(data)
X, y = split_features_target(data)
But one of the allures of deep learning is that we don't need to spend time on feature engineering. We can just use the raw data and let the model figure out the best features to use. But we still need to do some data preparation. And for that, PyTorch Tabular takes care of some of these needs:
- Missing values in categorical features are handled natively
- Categorical features are encoded automatically using embeddings
- Continuous features are scaled automatically using StandardScaler
- Date features like month, day, year are extracted automatically
- Target transformations like log, power, quantile and box-cox can be enabled with a parameter. This also handles the inverse transformation automatically.
- Continuous features can be transformed using box-cox, quantile normal etc. with a parameter
While we have all these features, we can also choose to do any of these manually. For example, we can choose to encode categorical features using one hot encoding or target encoding and consider them as continuous features. We can also choose to scale the continuous features using MinMaxScaler or RobustScaler and turn off the automatic scaling.
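As an illustration of the manual route, the sketch below scales a continuous column with scikit-learn's MinMaxScaler on toy data. It assumes that automatic scaling would then be switched off through the data config (for example with a flag such as `normalize_continuous_features=False`, which should be checked against the DataConfig documentation).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy splits with a single continuous column (illustrative values only).
train = pd.DataFrame({"elevation": [2000.0, 2500.0, 3000.0]})
test = pd.DataFrame({"elevation": [2250.0, 2750.0]})

num_cols = ["elevation"]
scaler = MinMaxScaler()

# Fit on the training split only, then reuse the fitted scaler on other
# splits so that no information leaks from the test data.
train[num_cols] = scaler.fit_transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])

# Assumption: automatic scaling is disabled in the data config, otherwise
# the features would be scaled a second time.
print(train)
print(test)
```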
So, here, we won't be doing any of these. We will just use the data as is and let PyTorch Tabular handle the rest.
Step 5: Define the Model
Now, we will define the model. In a scikit-learn workflow, this would amount to instantiating an estimator with its hyperparameters.
This is where PyTorch Tabular is different because we need to define a few configs before we define the model. One of the reasons is that PyTorch Tabular model handles a lot of things under the hood. So, we need to tell the model what kind of data we are dealing with. We also need to define the training dynamics, along with the model parameters. The configs we need to define are:
- DataConfig - This is where we define the data-related configs like the target column, categorical columns, continuous columns, date columns, categorical embedding dimensions, etc. The good news is that most of these are optional. If we don't define them, PyTorch Tabular will try to infer them from the data or fall back on rules of thumb. The bare minimum we need to define is the target column name, the continuous columns and the categorical columns. By default, categorical columns are embedded, numerical columns are scaled and date features are extracted from date columns.
- TrainerConfig - This is where we define the training-related configs like the batch size, number of epochs, early stopping, etc. Again, all of these are optional. If we don't define them, PyTorch Tabular will use default values. By default, PyTorch Tabular runs with a batch size of 64, early stopping with a patience of 3 epochs, and checkpointing enabled. This means that the model is checkpointed at the end of every epoch and the best model is kept, and training stops if the validation loss doesn't improve for 3 epochs. Although all of TrainerConfig is optional, it is highly customizable: the entire PyTorch Lightning Trainer is exposed, either through explicit parameters in TrainerConfig or through a catch-all trainer_kwargs parameter.
- OptimizerConfig - This is where we define the optimizer-related configs like the kind of optimizer, weight decay, learning rate schedulers, etc. Again, all of these are optional. If we don't define them, PyTorch Tabular will use default values. By default, PyTorch Tabular uses the Adam optimizer and doesn't use any learning rate decay. Although all of OptimizerConfig is optional, it is also customizable.
- ExperimentConfig - This is where we define how to track the experiment for logging and reproducibility. By default, PyTorch Tabular uses tensorboard for logging, but we can also use wandb. We can also choose not to log anything (although this is not recommended) by not defining an ExperimentConfig.
- <modelspecificconfig> - This is where we define which model to use and the corresponding hyperparameters. In PyTorch Tabular, each implemented model has its own config class. For example, if we want to use TabNet, we need to define TabNetConfig. If we want to use GANDALF, we need to define GANDALFConfig, and so on. Each of these config classes has its own set of model-specific hyperparameters, as well as some common parameters like the loss function, metrics, learning rate, etc. Again, all of these are optional. If we don't define them, PyTorch Tabular will use default values. The learning rate is set to 1e-3 by default. The loss function is set to CrossEntropyLoss for classification and MSELoss for regression. The metrics are set to Accuracy for classification and MSE for regression. All the model-specific hyperparameters are set to the defaults suggested in their respective papers, or to values that work well in practice.
Here, let's use the GANDALF model. We will define the configs as follows:
from pytorch_tabular.models import GANDALFConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

data_config = DataConfig(
    target=[target_col],  # target should always be a list
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    batch_size=1024,
    max_epochs=100,
)
optimizer_config = OptimizerConfig()
model_config = GANDALFConfig(
    task="classification",
    gflu_stages=6,
    gflu_feature_init_sparsity=0.3,
    gflu_dropout=0.0,
    learning_rate=1e-3,
)
Now that we have defined all the configs, we can define the TabularModel. Apart from the configs, there are some additional parameters we can pass to the model to control the verbosity of the model.
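Putting the configs together might look like the sketch below. The helper name `build_tabular_model` is hypothetical and exists only to keep the example self-contained; the keyword arguments are the four config objects plus the `verbose` flag.

```python
def build_tabular_model(data_config, model_config, optimizer_config, trainer_config):
    """Assemble a TabularModel from the four config objects."""
    # Imported here so the helper can be defined before the library is loaded.
    from pytorch_tabular import TabularModel

    return TabularModel(
        data_config=data_config,
        model_config=model_config,
        optimizer_config=optimizer_config,
        trainer_config=trainer_config,
        verbose=True,  # log setup details while the model is being built
    )
```

With the configs defined in this step, the call would be `tabular_model = build_tabular_model(data_config, model_config, optimizer_config, trainer_config)`.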
Since we passed verbose=True, the model has already logged that experiment tracking is disabled.
Step 6: Train the model
Now, we can train the model. In the scikit-learn workflow, we would call the estimator's fit method on the training data.
In PyTorch Tabular, there are two ways we can do this:
- High-Level API - A fit method which is very similar to the scikit-learn API, but with many more parameters to control the training dynamics. This is the recommended way to train the model.
- Low-Level API - A collection of methods - prepare_dataloader, prepare_model, and train. This is for advanced users who want more control over the training process.
Let's stick to the high-level API in this introductory tutorial. We will use the fit method to train the model. The only compulsory parameter of the fit method is the train data. We can also pass the validation data explicitly; if it is not provided, 20% of the training data is used as validation data. In addition, there are many other parameters, like custom loss functions, metrics and custom optimizers, which make the training process more customizable.
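A call through the high-level API might be sketched as follows. The wrapper `train_model` is a hypothetical helper used only to keep the example self-contained; `train` and `val` are assumed to be pandas dataframes produced by an earlier split.

```python
def train_model(tabular_model, train, val=None):
    """Train through the high-level API and return whatever `fit` returns."""
    # Only `train` is compulsory. When `validation` is None, part of the
    # training data is held out for validation instead.
    return tabular_model.fit(train=train, validation=val)
```

Passing the validation frame explicitly keeps the split under our control, which makes runs easier to compare.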
Step 7: Making predictions on new data
Now that we have trained the model, we can make predictions on new data. In a scikit-learn workflow, we would call the estimator's predict method on the new data.
In PyTorch Tabular, we can do something very similar. We can use the predict method to make predictions on new data. This method returns the predictions as a pandas dataframe. For classification problems, it returns the class probabilities and the final predicted class (the class with the highest probability, which for a binary problem corresponds to a 0.5 threshold). All we have to do is pass in a dataframe with at least all the features that were used for training.
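To illustrate how class probabilities map to a final class, here is a small sketch on made-up numbers. With more than two classes, a natural rule is to pick the class with the highest probability, which for two classes coincides with a 0.5 threshold. The column names are hypothetical and not necessarily the ones the predict method emits.

```python
import pandas as pd

# Made-up class probabilities for three samples and three classes.
proba = pd.DataFrame(
    {
        "class_1_probability": [0.7, 0.3, 0.1],
        "class_2_probability": [0.2, 0.4, 0.3],
        "class_3_probability": [0.1, 0.3, 0.6],
    }
)
labels = [1, 2, 3]  # class label for each probability column, in order

# Position of the largest probability in each row.
best = proba.to_numpy().argmax(axis=1)
proba["prediction"] = [labels[i] for i in best]

print(proba)
```

Note that the second sample is assigned class 2 even though no probability reaches 0.5.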
Step 8: Evaluating the Model
Now, we can evaluate the model. In the scikit-learn workflow, we would compute metrics such as accuracy from the predictions on the test set.
In PyTorch Tabular, there are two ways we can do this:
- Get the predictions on the test set and calculate the metrics manually
- Use the evaluate method, which returns the metrics (the same ones we defined during training)
We will see the second way here. We can use the evaluate method to evaluate the model on the test set. This method returns a dictionary of metrics.
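For completeness, the first option (computing metrics manually from predictions) can be sketched with pandas. The dataframe below stands in for a prediction output joined with the true labels; its values and the column name `prediction` are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for true labels next to predicted labels (made-up values).
pred_df = pd.DataFrame(
    {
        "Cover_Type": [1, 2, 2, 3],
        "prediction": [1, 2, 3, 3],
    }
)

correct = pred_df["Cover_Type"] == pred_df["prediction"]

# Overall accuracy: the fraction of rows predicted correctly.
accuracy = correct.mean()

# Fraction predicted correctly within each true class.
per_class = correct.groupby(pred_df["Cover_Type"]).mean()

print(accuracy)
print(per_class)
```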
Step 9: Saving and Loading the Model
After the model is trained, we can save it and load it later to make predictions on new data. In a scikit-learn workflow, we would typically persist the fitted estimator with a tool such as joblib or pickle.
In PyTorch Tabular, we can do something very similar. We can use the save_model method to save the model. This method saves everything required to make predictions on new data. By default, it also saves the datamodule, which contains the training, validation and test data. We can choose not to save the datamodule by setting inference_only=True.
Now we can load the saved model using the load_model method. This method returns the model and the datamodule. We can use the model to make predictions on new data.
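A save-and-reload round trip might be sketched as follows. The helper `save_and_reload` and the default directory are hypothetical, chosen only for illustration.

```python
def save_and_reload(tabular_model, path="saved_models/covertype_gandalf"):
    """Save a trained model to `path` and load it back."""
    # Imported here so the helper can be defined before the library is loaded.
    from pytorch_tabular import TabularModel

    # Saves what is needed for prediction. Passing inference_only=True to
    # save_model would leave the datamodule out.
    tabular_model.save_model(path)
    return TabularModel.load_model(path)
```

The reloaded model can then be used for prediction in the same way as the original one.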
- Check out the PyTorch Tabular Documentation to learn more about the library
- Use alternate models like TabNet, CategoryEmbedding, etc.
- Use different datasets and try out the workflow.
- Check out other tutorials and how-to guides in the documentation.