TabularModel
After defining all the configs, we need to put it all together, and this is where TabularModel comes in. TabularModel is the core workhorse, which orchestrates and sets everything up.
TabularModel parses the configs and:
- initializes the model
- sets up the experiment tracking framework
- initializes and sets up the TabularDatamodule, which handles all the data transformations and preparation of the DataLoaders
- sets up the callbacks and the PyTorch Lightning Trainer
- enables you to train, save, load, and predict
Initializing Tabular Model
Basic Usage:
- data_config: DataConfig: DataConfig object or path to the yaml file.
- model_config: ModelConfig: A subclass of ModelConfig or path to the yaml file. Determines which model to run from the type of config.
- optimizer_config: OptimizerConfig: OptimizerConfig object or path to the yaml file.
- trainer_config: TrainerConfig: TrainerConfig object or path to the yaml file.
- experiment_config: ExperimentConfig: ExperimentConfig object or path to the yaml file.
Usage Example
tabular_model = TabularModel(
data_config=data_config,
model_config=model_config,
optimizer_config=optimizer_config,
trainer_config=trainer_config,
experiment_config=experiment_config
)
Advanced Usage
- config: DictConfig: Another way of initializing TabularModel is with a DictConfig from omegaconf. Although not recommended, you can create a normal dictionary with all the parameters dumped into it, create a DictConfig from omegaconf, and pass it here. The downside is that you'll be skipping all the validation (both type validation and logical validation). This is primarily used internally to load a saved model from a checkpoint.
- model_callable: Optional[Callable]: Usually, the model callable and parameters are inferred from the ModelConfig. But in special cases, like when working with a custom model, you can pass the class (not the initialized object) to this parameter and override the config-based initialization. A sketch of this path is shown below.
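A minimal sketch of the advanced path, assuming a hypothetical custom model class MyCustomModel that follows the library's model interface; the config objects are the ones created earlier:
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,        # still supplies task, learning rate, etc.
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
    model_callable=MyCustomModel,     # pass the class itself, not an instance
)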
Functions
pytorch_tabular.tabular_model.TabularModel.fit(self, train, validation=None, test=None, loss=None, metrics=None, optimizer=None, optimizer_params={}, train_sampler=None, target_transform=None, max_epochs=None, min_epochs=None, reset=False)
The fit method which takes in the data and triggers the training
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train | DataFrame | Training Dataframe | required |
validation | Optional[pd.DataFrame] | If provided, will use this dataframe as the validation while training. Used in Early Stopping and Logging. If left empty, will use 20% of Train data as validation. Defaults to None. | None |
test | Optional[pd.DataFrame] | If provided, will use as the hold-out data, on which you'll be able to check performance after the model is trained. Defaults to None. | None |
loss | Optional[torch.nn.Module] | Custom loss function which is not in the standard PyTorch library | None |
metrics | Optional[List[Callable]] | Custom metric functions (Callable) which have the signature metric_fn(y_hat, y) and work on torch tensor inputs | None |
optimizer | Optional[torch.optim.Optimizer] | Custom optimizer which is a drop-in replacement for a standard PyTorch optimizer. This should be the class, not the initialized object | None |
optimizer_params | Dict | The parameters to initialize the custom optimizer. | {} |
train_sampler | Optional[torch.utils.data.Sampler] | Custom PyTorch batch sampler which will be passed to the DataLoaders. Useful for dealing with imbalanced data and other custom batching strategies | None |
target_transform | Union[sklearn.base.TransformerMixin, Tuple] | If provided, applies the transform to the target before modelling and inverses the transform during prediction. The parameter can either be a sklearn Transformer which has an inverse_transform method, or a tuple of callables (transform_func, inverse_transform_func) | None |
max_epochs | Optional[int] | Overwrite maximum number of epochs to be run | None |
min_epochs | Optional[int] | Overwrite minimum number of epochs to be run | None |
reset | bool | Flag to reset the model and train again from scratch | False |
Source code in pytorch_tabular/tabular_model.py
def fit(
self,
train: pd.DataFrame,
validation: Optional[pd.DataFrame] = None,
test: Optional[pd.DataFrame] = None,
loss: Optional[torch.nn.Module] = None,
metrics: Optional[List[Callable]] = None,
optimizer: Optional[torch.optim.Optimizer] = None,
optimizer_params: Dict = {},
train_sampler: Optional[torch.utils.data.Sampler] = None,
target_transform: Optional[Union[TransformerMixin, Tuple]] = None,
max_epochs: Optional[int] = None,
min_epochs: Optional[int] = None,
reset: bool = False,
) -> None:
"""The fit method which takes in the data and triggers the training
Args:
train (pd.DataFrame): Training Dataframe
valid (Optional[pd.DataFrame], optional): If provided, will use this dataframe as the validation while training.
Used in Early Stopping and Logging. If left empty, will use 20% of Train data as validation. Defaults to None.
test (Optional[pd.DataFrame], optional): If provided, will use as the hold-out data,
which you'll be able to check performance after the model is trained. Defaults to None.
loss (Optional[torch.nn.Module], optional): Custom Loss functions which are not in standard pytorch library
metrics (Optional[List[Callable]], optional): Custom metric functions(Callable) which has the
signature metric_fn(y_hat, y) and works on torch tensor inputs
optimizer (Optional[torch.optim.Optimizer], optional): Custom optimizers which are a drop in replacements for standard PyToch optimizers.
This should be the Class and not the initialized object
optimizer_params (Optional[Dict], optional): The parmeters to initialize the custom optimizer.
train_sampler (Optional[torch.utils.data.Sampler], optional): Custom PyTorch batch samplers which will be passed to the DataLoaders. Useful for dealing with imbalanced data and other custom batching strategies
target_transform (Optional[Union[TransformerMixin, Tuple(Callable)]], optional): If provided, applies the transform to the target before modelling
and inverse the transform during prediction. The parameter can either be a sklearn Transformer which has an inverse_transform method, or
a tuple of callables (transform_func, inverse_transform_func)
max_epochs (Optional[int]): Overwrite maximum number of epochs to be run
min_epochs (Optional[int]): Overwrite minimum number of epochs to be run
reset: (bool): Flag to reset the model and train again from scratch
"""
train_loader, val_loader = self._pre_fit(
train,
validation,
test,
loss,
metrics,
optimizer,
optimizer_params,
train_sampler,
target_transform,
max_epochs,
min_epochs,
reset,
)
self.model.train()
if self.config.auto_lr_find and (not self.config.fast_dev_run):
self.trainer.tune(self.model, train_loader, val_loader)
# Parameters in NODE needs to be initialized again
if self.config._model_name in ["CategoryEmbeddingNODEModel", "NODEModel"]:
self.data_aware_initialization()
self.model.train()
self.trainer.fit(self.model, train_loader, val_loader)
logger.info("Training the model completed...")
if self.config.load_best:
self.load_best_model()
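A minimal usage sketch of fit, assuming train_df and valid_df are pandas DataFrames with the columns declared in DataConfig; the custom loss and optimizer shown here are illustrative choices, not requirements:
import torch

tabular_model.fit(
    train=train_df,
    validation=valid_df,
    loss=torch.nn.L1Loss(),          # an initialized nn.Module
    optimizer=torch.optim.AdamW,     # the optimizer class, not an instance
    optimizer_params={"weight_decay": 1e-4},
    max_epochs=50,
)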
pytorch_tabular.tabular_model.TabularModel.evaluate(self, test)
Evaluates the dataframe using the loss and metrics already set in config
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test | Optional[pd.DataFrame] | The dataframe to be evaluated. If not provided, will try to use the test provided during fit. If that was also not provided, will return an empty dictionary | required |
Returns:
Type | Description |
---|---|
Union[dict, list] | The final test result dictionary. |
Source code in pytorch_tabular/tabular_model.py
def evaluate(self, test: Optional[pd.DataFrame]) -> Union[dict, list]:
"""Evaluates the dataframe using the loss and metrics already set in config
Args:
test (Optional[pd.DataFrame]): The dataframe to be evaluated. If not provided, will try to use the
test provided during fit. If that was also not provided will return an empty dictionary
Returns:
Union[dict, list]: The final test result dictionary.
"""
if test is not None:
test_loader = self.datamodule.prepare_inference_dataloader(test)
elif self.test is not None:
test_loader = self.datamodule.test_dataloader()
else:
return {}
result = self.trainer.test(
test_dataloaders=test_loader,
ckpt_path="best" if self.config.checkpoints else None,
)
return result
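A minimal usage sketch, assuming test_df is a pandas DataFrame with the same columns used during training (including the target, so the loss and metrics can be computed):
result = tabular_model.evaluate(test_df)
print(result)   # the Trainer's test output: loss and the configured metrics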
pytorch_tabular.tabular_model.TabularModel.predict(self, test, quantiles=[0.25, 0.5, 0.75], n_samples=100, ret_logits=False)
Uses the trained model to predict on new data and return as a dataframe
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test | DataFrame | The new dataframe with the features defined during training | required |
quantiles | Optional[List] | For probabilistic models like Mixture Density Networks, this specifies the different quantiles to be extracted apart from the central_tendency and added to the dataframe. For other models it is ignored. Defaults to [0.25, 0.5, 0.75] | [0.25, 0.5, 0.75] |
n_samples | Optional[int] | Number of samples to draw from the posterior to estimate the quantiles. Ignored for non-probabilistic models. Defaults to 100 | 100 |
ret_logits | bool | Flag to return raw model outputs/logits except the backbone features along with the dataframe. Defaults to False | False |
Returns:
Type | Description |
---|---|
pd.DataFrame | Returns a dataframe with predictions and features. If classification, it returns probabilities and final prediction |
Source code in pytorch_tabular/tabular_model.py
def predict(
self,
test: pd.DataFrame,
quantiles: Optional[List] = [0.25, 0.5, 0.75],
n_samples: Optional[int] = 100,
ret_logits=False,
) -> pd.DataFrame:
"""Uses the trained model to predict on new data and return as a dataframe
Args:
test (pd.DataFrame): The new dataframe with the features defined during training
quantiles (Optional[List]): For probabilistic models like Mixture Density Networks, this specifies
the different quantiles to be extracted apart from the `central_tendency` and added to the dataframe.
For other models it is ignored. Defaults to [0.25, 0.5, 0.75]
n_samples (Optional[int]): Number of samples to draw from the posterior to estimate the quantiles.
Ignored for non-probabilistic models. Defaults to 100
ret_logits (bool): Flag to return raw model outputs/logits except the backbone features along
with the dataframe. Defaults to False
Returns:
pd.DataFrame: Returns a dataframe with predictions and features.
If classification, it returns probabilities and final prediction
"""
assert all(
[q <= 1 and q >= 0 for q in quantiles]
), "Quantiles should be a decimal between 0 and 1"
self.model.eval()
inference_dataloader = self.datamodule.prepare_inference_dataloader(test)
point_predictions = []
quantile_predictions = []
logits_predictions = defaultdict(list)
is_probabilistic = (
hasattr(self.model.hparams, "_probabilistic")
and self.model.hparams._probabilistic
)
for batch in tqdm(inference_dataloader, desc="Generating Predictions..."):
for k, v in batch.items():
if isinstance(v, list) and (len(v) == 0):
# Skipping empty list
continue
batch[k] = v.to(self.model.device)
if is_probabilistic:
samples, ret_value = self.model.sample(
batch, n_samples, ret_model_output=True
)
y_hat = torch.mean(samples, dim=-1)
quantile_preds = []
for q in quantiles:
quantile_preds.append(
torch.quantile(samples, q=q, dim=-1).unsqueeze(1)
)
else:
y_hat, ret_value = self.model.predict(batch, ret_model_output=True)
if ret_logits:
for k, v in ret_value.items():
# if k == "backbone_features":
# continue
logits_predictions[k].append(v.detach().cpu())
point_predictions.append(y_hat.detach().cpu())
if is_probabilistic:
quantile_predictions.append(
torch.cat(quantile_preds, dim=-1).detach().cpu()
)
point_predictions = torch.cat(point_predictions, dim=0)
if point_predictions.ndim == 1:
point_predictions = point_predictions.unsqueeze(-1)
if is_probabilistic:
quantile_predictions = torch.cat(quantile_predictions, dim=0).unsqueeze(-1)
if quantile_predictions.ndim == 2:
quantile_predictions = quantile_predictions.unsqueeze(-1)
pred_df = test.copy()
if self.config.task == "regression":
point_predictions = point_predictions.numpy()
# Probabilistic Models are only implemented for Regression
if is_probabilistic:
quantile_predictions = quantile_predictions.numpy()
for i, target_col in enumerate(self.config.target):
if self.datamodule.do_target_transform:
if self.config.target[i] in pred_df.columns:
pred_df[
self.config.target[i]
] = self.datamodule.target_transforms[i].inverse_transform(
pred_df[self.config.target[i]].values.reshape(-1, 1)
)
pred_df[
f"{target_col}_prediction"
] = self.datamodule.target_transforms[i].inverse_transform(
point_predictions[:, i].reshape(-1, 1)
)
if is_probabilistic:
for j, q in enumerate(quantiles):
pred_df[
f"{target_col}_q{int(q*100)}"
] = self.datamodule.target_transforms[i].inverse_transform(
quantile_predictions[:, j, i].reshape(-1, 1)
)
else:
pred_df[f"{target_col}_prediction"] = point_predictions[:, i]
if is_probabilistic:
for j, q in enumerate(quantiles):
pred_df[
f"{target_col}_q{int(q*100)}"
] = quantile_predictions[:, j, i].reshape(-1, 1)
elif self.config.task == "classification":
point_predictions = nn.Softmax(dim=-1)(point_predictions).numpy()
for i, class_ in enumerate(self.datamodule.label_encoder.classes_):
pred_df[f"{class_}_probability"] = point_predictions[:, i]
pred_df[f"prediction"] = self.datamodule.label_encoder.inverse_transform(
np.argmax(point_predictions, axis=1)
)
if ret_logits:
for k, v in logits_predictions.items():
v = torch.cat(v, dim=0).numpy()
if v.ndim == 1:
v = v.reshape(-1, 1)
for i in range(v.shape[-1]):
if v.shape[-1] > 1:
pred_df[f"{k}_{i}"] = v[:, i]
else:
pred_df[f"{k}"] = v[:, i]
return pred_df
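A minimal usage sketch, assuming new_df carries the feature columns used during training (no target needed):
pred_df = tabular_model.predict(new_df)
# Regression: adds a "<target>_prediction" column per target (plus quantile
# columns for probabilistic models). Classification: adds per-class
# "<class>_probability" columns and a final "prediction" column.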
pytorch_tabular.tabular_model.TabularModel.save_model(self, dir)
Saves the model and checkpoints in the specified directory
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir | str | The path to the directory to save the model | required |
Source code in pytorch_tabular/tabular_model.py
def save_model(self, dir: str):
"""Saves the model and checkpoints in the specified directory
Args:
dir (str): The path to the directory to save the model
"""
if os.path.exists(dir) and (os.listdir(dir)):
logger.warning("Directory is not empty. Overwriting the contents.")
for f in os.listdir(dir):
os.remove(os.path.join(dir, f))
os.makedirs(dir, exist_ok=True)
with open(os.path.join(dir, "config.yml"), "w") as fp:
OmegaConf.save(self.config, fp, resolve=True)
joblib.dump(self.datamodule, os.path.join(dir, "datamodule.sav"))
if hasattr(self.config, "log_target") and self.config.log_target is not None:
joblib.dump(self.logger, os.path.join(dir, "exp_logger.sav"))
if hasattr(self, "callbacks"):
joblib.dump(self.callbacks, os.path.join(dir, "callbacks.sav"))
self.trainer.save_checkpoint(os.path.join(dir, "model.ckpt"))
custom_params = {}
custom_params["custom_loss"] = self.model.custom_loss
custom_params["custom_metrics"] = self.model.custom_metrics
custom_params["custom_optimizer"] = self.model.custom_optimizer
custom_params["custom_optimizer_params"] = self.model.custom_optimizer_params
joblib.dump(custom_params, os.path.join(dir, "custom_params.sav"))
if self.custom_model:
joblib.dump(
self.model_callable, os.path.join(dir, "custom_model_callable.sav")
)
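A minimal usage sketch, assuming "saved_model" is a writable directory path:
tabular_model.save_model("saved_model")
# The directory ends up with config.yml, datamodule.sav, model.ckpt and,
# when applicable, the saved callbacks, experiment logger, and custom parameters.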
pytorch_tabular.tabular_model.TabularModel.load_from_checkpoint(dir)
classmethod
Loads a saved model from the directory
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir | str | The directory where the model was saved, along with the checkpoints | required |
Returns:
Type | Description |
---|---|
TabularModel | The saved TabularModel |
Source code in pytorch_tabular/tabular_model.py
@classmethod
def load_from_checkpoint(cls, dir: str):
"""Loads a saved model from the directory
Args:
dir (str): The directory where the model wa saved, along with the checkpoints
Returns:
TabularModel: The saved TabularModel
"""
config = OmegaConf.load(os.path.join(dir, "config.yml"))
datamodule = joblib.load(os.path.join(dir, "datamodule.sav"))
if (
hasattr(config, "log_target")
and (config.log_target is not None)
and os.path.exists(os.path.join(dir, "exp_logger.sav"))
):
logger = joblib.load(os.path.join(dir, "exp_logger.sav"))
else:
logger = None
if os.path.exists(os.path.join(dir, "callbacks.sav")):
callbacks = joblib.load(os.path.join(dir, "callbacks.sav"))
else:
callbacks = []
if os.path.exists(os.path.join(dir, "custom_model_callable.sav")):
model_callable = joblib.load(os.path.join(dir, "custom_model_callable.sav"))
custom_model = True
else:
model_callable = getattr(
getattr(models, config._module_src), config._model_name
)
custom_model = False
custom_params = joblib.load(os.path.join(dir, "custom_params.sav"))
model_args = {}
if custom_params.get("custom_loss") is not None:
model_args['loss'] = "MSELoss"
if custom_params.get("custom_metrics") is not None:
model_args['metrics'] = ["mean_squared_error"]
model_args['metric_params'] = [{}]
if custom_params.get("custom_optimizer") is not None:
model_args['optimizer'] = "Adam"
if custom_params.get("custom_optimizer_params") is not None:
model_args['optimizer_params'] = {}
# Initializing with default metrics, losses, and optimizers. Will revert once initialized
model = model_callable.load_from_checkpoint(
checkpoint_path=os.path.join(dir, "model.ckpt"),
**model_args
)
# else:
# # Initializing with default values
# model = model_callable.load_from_checkpoint(
# checkpoint_path=os.path.join(dir, "model.ckpt"),
# )
# Updating config with custom parameters for experiment tracking
if custom_params.get("custom_loss") is not None:
model.custom_loss = custom_params["custom_loss"]
if custom_params.get("custom_metrics") is not None:
model.custom_metrics = custom_params["custom_metrics"]
if custom_params.get("custom_optimizer") is not None:
model.custom_optimizer = custom_params["custom_optimizer"]
if custom_params.get("custom_optimizer_params") is not None:
model.custom_optimizer_params = custom_params["custom_optimizer_params"]
model._setup_loss()
model._setup_metrics()
tabular_model = cls(config=config, model_callable=model_callable)
tabular_model.model = model
tabular_model.custom_model = custom_model
tabular_model.datamodule = datamodule
tabular_model.callbacks = callbacks
tabular_model._prepare_trainer()
tabular_model.trainer.model = model
tabular_model.logger = logger
return tabular_model
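A minimal usage sketch, reloading the directory saved above and predicting on new data (new_df as assumed before):
from pytorch_tabular import TabularModel

loaded_model = TabularModel.load_from_checkpoint("saved_model")
pred_df = loaded_model.predict(new_df)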
pytorch_tabular.tabular_model.TabularModel.find_learning_rate(self, train, validation=None, test=None, loss=None, metrics=None, optimizer=None, optimizer_params={}, min_lr=1e-08, max_lr=1, num_training=100, mode='exponential', early_stop_threshold=4.0, plot=True)
Enables the user to do a range test of good initial learning rates, to reduce the amount of guesswork in picking a good starting learning rate.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train | DataFrame | Training Dataframe | required |
validation | Optional[pd.DataFrame] | If provided, will use this dataframe as the validation while training. Used in Early Stopping and Logging. If left empty, will use 20% of Train data as validation. Defaults to None. | None |
test | Optional[pd.DataFrame] | If provided, will use as the hold-out data, on which you'll be able to check performance after the model is trained. Defaults to None. | None |
loss | Optional[torch.nn.Module] | Custom loss function which is not in the standard PyTorch library | None |
metrics | Optional[List[Callable]] | Custom metric functions (Callable) which have the signature metric_fn(y_hat, y) | None |
optimizer | Optional[torch.optim.Optimizer] | Custom optimizer which is a drop-in replacement for a standard PyTorch optimizer. This should be the class, not the initialized object | None |
optimizer_params | Dict | The parameters to initialize the custom optimizer. | {} |
min_lr | float | Minimum learning rate to investigate | 1e-08 |
max_lr | float | Maximum learning rate to investigate | 1 |
num_training | int | Number of learning rates to test | 100 |
mode | str | Search strategy, either 'linear' or 'exponential'. If set to 'linear' the learning rate will be searched by linearly increasing after each batch. If set to 'exponential', will increase the learning rate exponentially. | 'exponential' |
early_stop_threshold | Optional[float] | Threshold for stopping the search. If the loss at any point is larger than early_stop_threshold*best_loss then the search is stopped. To disable, set to None. | 4.0 |
plot | bool | If true, will plot using matplotlib | True |
Source code in pytorch_tabular/tabular_model.py
def find_learning_rate(
self,
train: pd.DataFrame,
validation: Optional[pd.DataFrame] = None,
test: Optional[pd.DataFrame] = None,
loss: Optional[torch.nn.Module] = None,
metrics: Optional[List[Callable]] = None,
optimizer: Optional[torch.optim.Optimizer] = None,
optimizer_params: Dict = {},
min_lr: float = 1e-8,
max_lr: float = 1,
num_training: int = 100,
mode: str = "exponential",
early_stop_threshold: float = 4.0,
plot=True,
) -> None:
"""Enables the user to do a range test of good initial learning rates, to reduce the amount of guesswork in picking a good starting learning rate.
Args:
train (pd.DataFrame): Training Dataframe
valid (Optional[pd.DataFrame], optional): If provided, will use this dataframe as the validation while training.
Used in Early Stopping and Logging. If left empty, will use 20% of Train data as validation. Defaults to None.
test (Optional[pd.DataFrame], optional): If provided, will use as the hold-out data,
which you'll be able to check performance after the model is trained. Defaults to None.
loss (Optional[torch.nn.Module], optional): Custom Loss functions which are not in standard pytorch library
metrics (Optional[List[Callable]], optional): Custom metric functions(Callable) which has the signature metric_fn(y_hat, y)
optimizer (Optional[torch.optim.Optimizer], optional): Custom optimizers which are a drop in replacements for standard PyToch optimizers.
This should be the Class and not the initialized object
optimizer_params (Optional[Dict], optional): The parmeters to initialize the custom optimizer.
min_lr (Optional[float], optional): minimum learning rate to investigate
max_lr (Optional[float], optional): maximum learning rate to investigate
num_training (Optional[int], optional): number of learning rates to test
mode (Optional[str], optional): search strategy, either 'linear' or 'exponential'. If set to
'linear' the learning rate will be searched by linearly increasing
after each batch. If set to 'exponential', will increase learning
rate exponentially.
early_stop_threshold(Optional[float], optional): threshold for stopping the search. If the
loss at any point is larger than early_stop_threshold*best_loss
then the search is stopped. To disable, set to None.
plot(bool, optional): If true, will plot using matplotlib
"""
train_loader, val_loader = self._pre_fit(
train,
validation,
test,
loss,
metrics,
optimizer,
optimizer_params,
target_transform=None,
max_epochs=None,
min_epochs=None,
reset=True,
)
lr_finder = self.trainer.tuner.lr_find(
self.model,
train_loader,
val_loader,
min_lr,
max_lr,
num_training,
mode,
early_stop_threshold,
)
if plot:
fig = lr_finder.plot(suggest=True)
fig.show()
new_lr = lr_finder.suggestion()
# cancelling the model and trainer that was loaded
self.model = None
self.trainer = None
self.datamodule = None
return new_lr, pd.DataFrame(lr_finder.results)
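A minimal usage sketch, assuming train_df is the training DataFrame. Since the range test discards the model, trainer, and datamodule it created, run it before fit and feed the suggestion back into your config's learning rate:
suggested_lr, lr_results = tabular_model.find_learning_rate(train=train_df, plot=False)
print(suggested_lr)      # suggested starting learning rate
# lr_results is a DataFrame of the (lr, loss) points explored during the sweep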