Data

PyTorch Tabular uses Pandas DataFrames as the container that holds data. As Pandas is the most popular way of handling tabular data, this was an obvious choice. Keeping ease of use in mind, PyTorch Tabular accepts dataframes as-is, i.e. there is no need to split the data into X and y as in scikit-learn.

PyTorch Tabular handles this using a DataConfig object.

Basic Usage

  • target: List[str]: A list of strings with the names of the target column(s)
  • continuous_cols: List[str]: Column names of the numeric fields. Defaults to []
  • categorical_cols: List[str]: Column names of the categorical fields to treat differently

Usage Example

data_config = DataConfig(
    target=["label"],
    continuous_cols=["feature_1", "feature_2"],
    categorical_cols=["cat_feature_1", "cat_feature_2"],
)

Advanced Usage

Date Columns

If you have date columns in the dataframe, list them in the date_columns parameter and set encode_date_columns to True. This will extract relevant features such as the Month, Week, Quarter, etc. and add them to your feature list internally.

date_columns is not just a list of column names, but a list of (column name, freq) tuples. The freq is a standard Pandas date frequency string that denotes the lowest temporal granularity relevant to the problem.

For example, if there is a date column for the Launch Date of a Product and products are only launched once a month, there is no sense in extracting features like the week or day. So we keep the frequency at M:

date_columns = [("launch_date", "M")]

Feature Transformations

Feature scaling is an almost essential step for getting good performance from most machine learning algorithms, and deep learning is no exception. The normalize_continuous_features flag (which is True by default) scales the input continuous features using a StandardScaler.

Sometimes, changing the feature distributions using non-linear transformations helps the machine learning/deep learning algorithms.

PyTorch Tabular offers 4 standard transformations using the continuous_feature_transform parameter:

  • yeo-johnson
  • box-cox
  • quantile_uniform
  • quantile_normal

yeo-johnson and box-cox are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness. box-cox can only be applied to strictly positive data. scikit-learn has a good write-up about them.

quantile_normal and quantile_uniform are monotonic, non-parametric transformations which aim to transform the features to a normal distribution or a uniform distribution, respectively. By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.
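
As an illustrative sketch (column names are placeholders), both options can be combined; the continuous features are first power-transformed and then standardized:

data_config = DataConfig(
    target=["label"],
    continuous_cols=["feature_1", "feature_2"],
    categorical_cols=["cat_feature_1"],
    # non-linear transform applied to the continuous features before modelling
    continuous_feature_transform="yeo-johnson",
    # StandardScaler applied to the (transformed) continuous features; True is the default
    normalize_continuous_features=True,
)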

pytorch_tabular.config.DataConfig dataclass

Data configuration.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| target | Optional[List[str]] | A list of strings with the names of the target column(s). It is mandatory for all except SSL tasks. | None |
| continuous_cols | List | Column names of the numeric fields. Defaults to [] | list() |
| categorical_cols | List | Column names of the categorical fields to treat differently. Defaults to [] | list() |
| date_columns | List | (Column name, Freq, Format) tuples of the date fields. For eg. a field named introduction_date with a monthly frequency like "2023-12" should have an entry ('intro_date','M','%Y-%m') | list() |
| encode_date_columns | bool | Whether to encode the derived variables from date | True |
| validation_split | Optional[float] | Percentage of Training rows to keep aside as validation. Used only if Validation Data is not given separately | 0.2 |
| continuous_feature_transform | Optional[str] | Whether to transform the features before modelling. By default, it is turned off. Choices are: [None, yeo-johnson, box-cox, quantile_normal, quantile_uniform]. | None |
| normalize_continuous_features | bool | Flag to normalize the input (continuous) features | True |
| quantile_noise | int | NOT IMPLEMENTED. If specified, fits the QuantileTransformer on data with added Gaussian noise with std = quantile_noise * data.std; this makes discrete values more separable. Note that this transformation does NOT apply Gaussian noise to the resulting data; the noise is only applied while fitting the QuantileTransformer. | 0 |
| num_workers | Optional[int] | The number of workers used for data loading. For Windows, always set to 0 | 0 |
| pin_memory | bool | Whether to pin memory for data loading. | True |
| handle_unknown_categories | bool | Whether to handle unknown or new values in categorical columns as unknown | True |
| handle_missing_values | bool | Whether to handle missing values in categorical columns as unknown | True |
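
To show how a DataConfig with a few of these parameters fits into the usual workflow, here is a hedged end-to-end sketch; the model and trainer settings are placeholders, and train_df is assumed to be a Pandas DataFrame containing both the features and the target column:

from pytorch_tabular import TabularModel
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig
from pytorch_tabular.models import CategoryEmbeddingModelConfig

data_config = DataConfig(
    target=["label"],
    continuous_cols=["feature_1", "feature_2"],
    categorical_cols=["cat_feature_1"],
    validation_split=0.2,            # 20% of train_df is held out for validation
    handle_unknown_categories=True,  # unseen categories at inference are treated as unknown
)
model_config = CategoryEmbeddingModelConfig(task="classification")
trainer_config = TrainerConfig(max_epochs=10)
optimizer_config = OptimizerConfig()

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
# the dataframe is passed as-is; no X/y split is needed
tabular_model.fit(train=train_df)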
Source code in src/pytorch_tabular/config/config.py
@dataclass
class DataConfig:
    """Data configuration.

    Args:
        target (Optional[List[str]]): A list of strings with the names of the target column(s). It is
                mandatory for all except SSL tasks.

        continuous_cols (List): Column names of the numeric fields. Defaults to []

        categorical_cols (List): Column names of the categorical fields to treat differently. Defaults to
                []

        date_columns (List): (Column name, Freq, Format) tuples of the date fields. For eg. a field named
                introduction_date and with a monthly frequency like "2023-12" should have
                an entry ('intro_date','M','%Y-%m')

        encode_date_columns (bool): Whether to encode the derived variables from date

        validation_split (Optional[float]): Percentage of Training rows to keep aside as validation. Used
                only if Validation Data is not given separately

        continuous_feature_transform (Optional[str]): Whether to transform the features before
                modelling. By default, it is turned off. Choices are: [`None`,`yeo-johnson`,`box-cox`,
                `quantile_normal`,`quantile_uniform`].

        normalize_continuous_features (bool): Flag to normalize the input features(continuous)

        quantile_noise (int): NOT IMPLEMENTED. If specified fits QuantileTransformer on data with added
                gaussian noise with std = :quantile_noise: * data.std ; this will cause discrete values to be more
                separable. Please note that this transformation does NOT apply gaussian noise to the resulting
                data, the noise is only applied for QuantileTransformer

        num_workers (Optional[int]): The number of workers used for data loading. For windows always set to
                0

        pin_memory (bool): Whether to pin memory for data loading.

        handle_unknown_categories (bool): Whether to handle unknown or new values in categorical
                columns as unknown

        handle_missing_values (bool): Whether to handle missing values in categorical columns as
                unknown

    """

    target: Optional[List[str]] = field(
        default=None,
        metadata={
            "help": "A list of strings with the names of the target column(s)."
            " It is mandatory for all except SSL tasks."
        },
    )
    continuous_cols: List = field(
        default_factory=list,
        metadata={"help": "Column names of the numeric fields. Defaults to []"},
    )
    categorical_cols: List = field(
        default_factory=list,
        metadata={"help": "Column names of the categorical fields to treat differently. Defaults to []"},
    )
    date_columns: List = field(
        default_factory=list,
        metadata={
            "help": "(Column names, Freq) tuples of the date fields. For eg. a field named"
            " introduction_date and with a monthly frequency like '2023-12' should have"
            " an entry ('intro_date','M','%Y-%m')"
        },
    )

    encode_date_columns: bool = field(
        default=True,
        metadata={"help": "Whether or not to encode the derived variables from date"},
    )
    validation_split: Optional[float] = field(
        default=0.2,
        metadata={
            "help": "Percentage of Training rows to keep aside as validation."
            " Used only if Validation Data is not given separately"
        },
    )
    continuous_feature_transform: Optional[str] = field(
        default=None,
        metadata={
            "help": "Whether or not to transform the features before modelling. By default it is turned off.",
            "choices": [
                None,
                "yeo-johnson",
                "box-cox",
                "quantile_normal",
                "quantile_uniform",
            ],
        },
    )
    normalize_continuous_features: bool = field(
        default=True,
        metadata={"help": "Flag to normalize the input features (continuous)"},
    )
    quantile_noise: int = field(
        default=0,
        metadata={
            "help": "NOT IMPLEMENTED. If specified fits QuantileTransformer on data with added gaussian noise"
            " with std = :quantile_noise: * data.std ; this will cause discrete values to be more separable."
            " Please not that this transformation does NOT apply gaussian noise to the resulting data,"
            " the noise is only applied for QuantileTransformer"
        },
    )
    num_workers: Optional[int] = field(
        default=0,
        metadata={"help": "The number of workers used for data loading. For windows always set to 0"},
    )
    pin_memory: bool = field(
        default=True,
        metadata={"help": "Whether or not to pin memory for data loading."},
    )
    handle_unknown_categories: bool = field(
        default=True,
        metadata={"help": "Whether or not to handle unknown or new values in categorical columns as unknown"},
    )
    handle_missing_values: bool = field(
        default=True,
        metadata={"help": "Whether or not to handle missing values in categorical columns as unknown"},
    )

    def __post_init__(self):
        assert (
            len(self.categorical_cols) + len(self.continuous_cols) + len(self.date_columns) > 0
        ), "There should be at-least one feature defined in categorical, continuous, or date columns"
        _validate_choices(self)
        if os.name == "nt" and self.num_workers != 0:
            print("Windows does not support num_workers > 0. Setting num_workers to 0")
            self.num_workers = 0