Data

PyTorch Tabular uses pandas DataFrames as the container which holds data. As pandas is the most popular way of handling tabular data, this was an obvious choice. Keeping ease of use in mind, PyTorch Tabular accepts dataframes as-is, i.e. there is no need to split the data into X and y as in scikit-learn.

PyTorch Tabular handles this using a DataConfig object.
Basic Usage

- target: List[str]: A list of strings with the names of the target column(s)
- continuous_cols: List[str]: Column names of the numeric fields. Defaults to []
- categorical_cols: List[str]: Column names of the categorical fields to treat differently
Usage Example

```python
data_config = DataConfig(
    target=["label"],
    continuous_cols=["feature_1", "feature_2"],
    categorical_cols=["cat_feature_1", "cat_feature_2"],
)
```
Advanced Usage

Date Columns

If you have date columns in the dataframe, mention the column names in the date_columns parameter and set encode_date_columns to True. This will extract relevant features like the Month, Week, Quarter, etc. and add them to your feature list internally.

date_columns is not just a list of column names, but a list of (column name, freq) tuples. The freq is a standard pandas date frequency string which denotes the lowest temporal granularity relevant for the problem.

For example, if there is a date column for the launch date of a product, and products launch only once a month, then there is no sense in extracting features like week or day. So we keep the frequency at M.
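As a plain-pandas sketch of the kind of features that get derived from a monthly date column (independent of PyTorch Tabular; the exact column names the library generates internally may differ):

```python
import pandas as pd

# Hypothetical dataframe with a monthly launch-date column.
df = pd.DataFrame({"launch_date": pd.to_datetime(["2021-01-15", "2021-06-20"])})

# At a monthly granularity, Month and Quarter are meaningful features,
# while Week or Day would only add noise.
df["launch_date_Month"] = df["launch_date"].dt.month
df["launch_date_Quarter"] = df["launch_date"].dt.quarter

print(df[["launch_date_Month", "launch_date_Quarter"]])
```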
Feature Transformations

Feature scaling is an almost essential step to get good performance from most machine learning algorithms, and deep learning is no exception. The normalize_continuous_features flag (which is True by default) scales the input continuous features using a StandardScaler.
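To illustrate what the StandardScaler step does, here is a minimal NumPy sketch of the same transformation (subtract the mean, divide by the standard deviation), with hypothetical values:

```python
import numpy as np

# A small continuous feature (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 10.0])

# StandardScaler's transformation: subtract the mean and divide by the
# standard deviation, yielding zero mean and unit variance.
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())  # approximately 0.0 and 1.0
```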
Sometimes, changing the feature distributions using non-linear transformations helps the machine learning/deep learning algorithms. PyTorch Tabular offers 4 standard transformations using the continuous_feature_transform parameter:

- yeo-johnson
- box-cox
- quantile_uniform
- quantile_normal
yeo-johnson and box-cox are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness. box-cox can only be applied to strictly positive data. scikit-learn has a good write-up about them.

quantile_normal and quantile_uniform are monotonic, non-parametric transformations which aim to transform the features to a normal distribution or a uniform distribution, respectively. By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.
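A minimal sketch of the rank idea behind quantile_uniform, in plain NumPy with ties ignored (the library relies on scikit-learn's QuantileTransformer for the real thing):

```python
import numpy as np

# A feature with a heavy outlier (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 1000.0])

# Rank transformation to a uniform distribution: each value is mapped to
# its empirical quantile in [0, 1]. Only the rank matters, so the
# outlier's magnitude has no influence on the result.
ranks = x.argsort().argsort()
uniform = ranks / (len(x) - 1)

print(uniform)  # 0, 1/3, 2/3, 1 regardless of how extreme 1000.0 is
```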
pytorch_tabular.config.DataConfig (dataclass)

Data configuration.

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target | Optional[List[str]] | A list of strings with the names of the target column(s). It is mandatory for all except SSL tasks. | None |
| continuous_cols | List | Column names of the numeric fields. Defaults to [] | list() |
| categorical_cols | List | Column names of the categorical fields to treat differently. Defaults to [] | list() |
| date_columns | List | (Column name, Freq, Format) tuples of the date fields. For e.g. a field named introduction_date with a monthly frequency like "2023-12" should have an entry ('intro_date','M','%Y-%m') | list() |
| encode_date_columns | bool | Whether to encode the derived variables from date | True |
| validation_split | Optional[float] | Percentage of training rows to keep aside as validation. Used only if validation data is not given separately | 0.2 |
| continuous_feature_transform | Optional[str] | Whether to transform the features before modelling. By default, it is turned off. Choices are: [None, yeo-johnson, box-cox, quantile_normal, quantile_uniform] | None |
| normalize_continuous_features | bool | Flag to normalize the input features (continuous) | True |
| quantile_noise | int | NOT IMPLEMENTED. If specified, fits QuantileTransformer on data with added gaussian noise with std = :quantile_noise: * data.std; this will cause discrete values to be more separable. Please note that this transformation does NOT apply gaussian noise to the resulting data; the noise is only applied for QuantileTransformer | 0 |
| num_workers | Optional[int] | The number of workers used for data loading. For Windows, always set to 0 | 0 |
| pin_memory | bool | Whether to pin memory for data loading. | True |
| handle_unknown_categories | bool | Whether to handle unknown or new values in categorical columns as unknown | True |
| handle_missing_values | bool | Whether to handle missing values in categorical columns as unknown | True |
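Putting several of these parameters together, a fuller configuration might look like the sketch below. The column names are hypothetical placeholders; replace them with your own.

```python
from pytorch_tabular.config import DataConfig

data_config = DataConfig(
    target=["label"],
    continuous_cols=["feature_1", "feature_2"],
    categorical_cols=["cat_feature_1"],
    # (Column name, Freq, Format) tuples for date fields.
    date_columns=[("intro_date", "M", "%Y-%m")],
    encode_date_columns=True,
    validation_split=0.2,
    continuous_feature_transform="yeo-johnson",
    normalize_continuous_features=True,
    num_workers=0,  # keep at 0 on Windows
)
```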
Source code in src/pytorch_tabular/config/config.py