Common Modules
Embeddings¶
Bases: Module
Enables different values in a categorical feature to have different embeddings.
Source code in src/pytorch_tabular/models/common/layers/embeddings.py
Bases: Module
Embeds categorical and continuous features into a 2D tensor.
Source code in src/pytorch_tabular/models/common/layers/embeddings.py
__init__(continuous_dim, categorical_cardinality, embedding_dim, shared_embedding_strategy=None, frac_shared_embed=0.25, embedding_bias=False, batch_norm_continuous_input=False, virtual_batch_size=None, embedding_dropout=0.0, initialization=None)¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| continuous_dim | int | number of continuous features | required |
| categorical_cardinality | List[int] | list of cardinalities of categorical features | required |
| embedding_dim | int | embedding dimension | required |
| shared_embedding_strategy | Optional[str] | strategy to use for shared embeddings | None |
| frac_shared_embed | float | fraction of embeddings to share | 0.25 |
| embedding_bias | bool | whether to use bias in embedding layers | False |
| batch_norm_continuous_input | bool | whether to use batch norm on continuous features | False |
| embedding_dropout | float | dropout to apply to embeddings | 0.0 |
| initialization | Optional[str] | initialization strategy to use for embedding layers | None |
Source code in src/pytorch_tabular/models/common/layers/embeddings.py
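A minimal usage sketch. The class name `Embedding2dLayer` and the dict-based input format (with `"continuous"` and `"categorical"` tensors) are assumptions inferred from the module path above, not something stated on this page:

```python
import torch
from pytorch_tabular.models.common.layers.embeddings import Embedding2dLayer  # assumed class name

# Two continuous features plus two categorical features (cardinalities 5 and 10),
# every feature embedded into its own 16-dimensional token.
embedding = Embedding2dLayer(
    continuous_dim=2,
    categorical_cardinality=[5, 10],
    embedding_dim=16,
    embedding_dropout=0.1,
)

batch = {
    "continuous": torch.randn(32, 2),             # (batch_size, continuous_dim)
    "categorical": torch.randint(0, 5, (32, 2)),  # integer category codes, (batch_size, n_categorical)
}
tokens = embedding(batch)   # assumed dict input format with "continuous"/"categorical" keys
print(tokens.shape)         # expected (32, 4, 16): one embedding token per feature
```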
Bases: Module
Takes in pre-encoded categorical variables and just concatenates them with the continuous variables. There is no learnable component.
Source code in src/pytorch_tabular/models/common/layers/embeddings.py
Bases: Module
Enables different values in a categorical feature to share part of their embeddings.
Source code in src/pytorch_tabular/models/common/layers/embeddings.py
Gated Units¶
Bases: Module
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
Bases: Module
GEGLU: a Gated Linear Unit (GLU) variant that uses GELU as the gating activation.
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
__init__(d_model, d_ff, dropout=0.1)¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| d_model | int | dimension of the model | required |
| d_ff | int | dimension of the feedforward layer | required |
| dropout | float | dropout probability | 0.1 |
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
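A short usage sketch, assuming the class above is exposed as `GEGLU` in `gated_units.py` (the class name itself is not shown in this extract):

```python
import torch
from pytorch_tabular.models.common.layers.gated_units import GEGLU  # assumed class name

geglu = GEGLU(d_model=32, d_ff=128, dropout=0.1)  # d_ff ~ 4 * d_model, as is conventional
x = torch.randn(8, 10, 32)   # (batch, tokens, d_model)
out = geglu(x)
print(out.shape)             # expected (8, 10, 32): the unit maps d_model back to d_model
```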
Bases: Module
ReGLU: a Gated Linear Unit (GLU) variant that uses ReLU as the gating activation.
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
__init__(d_model, d_ff, dropout=0.1)¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| d_model | int | dimension of the model | required |
| d_ff | int | dimension of the feedforward layer | required |
| dropout | float | dropout probability | 0.1 |
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
Bases: Module
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
__init__(d_model, d_ff, dropout=0.1)¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| d_model | int | dimension of the model | required |
| d_ff | int | dimension of the feedforward layer | required |
| dropout | float | dropout probability | 0.1 |
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
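All three gated units above share the same `(d_model, d_ff, dropout)` constructor and map a `(..., d_model)` tensor back to `(..., d_model)`. A hedged sketch exercising the other two variants, assuming they are named `ReGLU` and `SwiGLU` in `gated_units.py` (the class names are not visible in this extract):

```python
import torch
from pytorch_tabular.models.common.layers.gated_units import ReGLU, SwiGLU  # assumed class names

x = torch.randn(8, 10, 32)   # (batch, tokens, d_model)
for unit_cls in (ReGLU, SwiGLU):
    unit = unit_cls(d_model=32, d_ff=128, dropout=0.1)
    assert unit(x).shape == x.shape   # each variant preserves the d_model dimension
```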
Bases: Module
Position-wise Feed-Forward Network (FFN)¶
This is a PyTorch implementation of the position-wise feedforward network used in transformers. The FFN consists of two fully connected layers. The number of dimensions in the hidden layer, $d_{ff}$, is generally set to around four times that of the token embedding, $d_{model}$, so it is sometimes also called the expand-and-contract network. There is an activation at the hidden layer, usually ReLU (Rectified Linear Unit), $$\max(0, x)$$ That is, the FFN function is $$FFN(x, W_1, W_2, b_1, b_2) = \max(0, x W_1 + b_1) W_2 + b_2$$ where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable parameters. Sometimes the GELU (Gaussian Error Linear Unit) activation is used instead of ReLU, $$x \Phi(x)$$ where $\Phi(x) = P(X \le x),\ X \sim \mathcal{N}(0, 1)$.
Gated Linear Units¶
This is a generic implementation that supports different variants including Gated Linear Units (GLU).
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
__init__(d_model, d_ff, dropout=0.1, activation=nn.ReLU(), is_gated=False, bias1=True, bias2=True, bias_gate=True)¶
- `d_model` is the number of features in a token embedding
- `d_ff` is the number of features in the hidden layer of the FFN
- `dropout` is the dropout probability for the hidden layer
- `is_gated` specifies whether the hidden layer is gated
- `bias1` specifies whether the first fully connected layer should have a learnable bias
- `bias2` specifies whether the second fully connected layer should have a learnable bias
- `bias_gate` specifies whether the fully connected layer for the gate should have a learnable bias
Source code in src/pytorch_tabular/models/common/layers/gated_units.py
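A sketch showing how this generic implementation covers both the plain FFN and a GLU-style gated variant. The constructor arguments come from the signature above; the class name `PositionWiseFeedForward` and the output shapes are assumptions:

```python
import torch
from torch import nn
from pytorch_tabular.models.common.layers.gated_units import PositionWiseFeedForward  # assumed class name

d_model, d_ff = 64, 256   # d_ff is typically around 4 * d_model ("expand-and-contract")

# Plain FFN: max(0, x W1 + b1) W2 + b2 (defaults: ReLU activation, no gating)
ffn = PositionWiseFeedForward(d_model=d_model, d_ff=d_ff, dropout=0.1)

# GLU-style gated variant (assumed behavior): a sigmoid of one projection
# gates a second linear projection of the input before the contracting layer.
glu_ffn = PositionWiseFeedForward(
    d_model=d_model, d_ff=d_ff, dropout=0.1,
    activation=nn.Sigmoid(), is_gated=True,
)

x = torch.randn(4, 16, d_model)          # (batch, tokens, d_model)
print(ffn(x).shape, glu_ffn(x).shape)    # both expected to be (4, 16, 64)
```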
Soft Trees¶
Bases: Module
Source code in src/pytorch_tabular/models/common/layers/soft_trees.py
Bases: ModuleWithInit
Source code in src/pytorch_tabular/models/common/layers/soft_trees.py
__init__(in_features, num_trees, depth=6, tree_output_dim=1, flatten_output=True, choice_function=sparsemax, bin_function=sparsemoid, initialize_response_=nn.init.normal_, initialize_selection_logits_=nn.init.uniform_, threshold_init_beta=1.0, threshold_init_cutoff=1.0)¶
Oblivious Differentiable Sparsemax Trees (see http://tinyurl.com/odst-readmore). One can drop (sic!) this module anywhere instead of nn.Linear.
:param in_features: number of features in the input tensor
:param num_trees: number of trees in this layer
:param tree_output_dim: number of response channels in the response of an individual tree
:param depth: number of splits in every tree
:param flatten_output: if False, returns [..., num_trees, tree_output_dim]; by default returns [..., num_trees * tree_output_dim]
:param choice_function: f(tensor, dim) -> R_simplex; computes feature weights such that f(tensor, dim).sum(dim) == 1
:param bin_function: f(tensor) -> R[0, 1]; computes tree leaf weights
:param initialize_response_: in-place initializer for the tree output tensor
:param initialize_selection_logits_: in-place initializer for the logits that select features for the tree. Both thresholds and scales are initialized with a data-aware init (or .load_state_dict)
:param threshold_init_beta: initializes thresholds to a q-th quantile of data points, where q ~ Beta(threshold_init_beta, threshold_init_beta). If this parameter is set to 1, initial thresholds will have the same distribution as the data points. If greater than 1 (e.g. 10), thresholds will be closer to the median data value. If less than 1 (e.g. 0.1), thresholds will approach the min/max data values.
:param threshold_init_cutoff: initializer for threshold log-temperatures, in (0, inf). By default (1.0), log-temperatures are initialized so that all bin selectors end up in the linear region of the sparse-sigmoid; the temperatures are then scaled by this parameter. Setting this value > 1.0 results in some margin between data points and the sparse-sigmoid cutoff value. Setting this value < 1.0 causes a (1 - value) fraction of data points to end up in the flat sparse-sigmoid region; for instance, threshold_init_cutoff = 0.9 sets 10% of points exactly to 0.0 or 1.0. All points will lie between (0.5 - 0.5 / threshold_init_cutoff) and (0.5 + 0.5 / threshold_init_cutoff).
Source code in src/pytorch_tabular/models/common/layers/soft_trees.py
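A hedged usage sketch, assuming the layer is exposed as `ODST` in `soft_trees.py`; per the docstring, the data-aware initialization of thresholds and scales happens on the first batch, so no explicit init call is shown:

```python
import torch
from pytorch_tabular.models.common.layers.soft_trees import ODST  # assumed class name

# 16 input features, 8 oblivious trees of depth 4, each emitting a 3-channel response.
odst = ODST(in_features=16, num_trees=8, depth=4, tree_output_dim=3, flatten_output=True)

x = torch.randn(256, 16)   # the first batch also triggers the data-aware initialization
out = odst(x)
print(out.shape)           # expected (256, 24) = (batch, num_trees * tree_output_dim)
```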
Transformers¶
Bases: Module
Applies LayerNorm and Dropout, and adds the result to the input. This is the standard Add & Norm operation in Transformers.
Source code in src/pytorch_tabular/models/common/layers/transformers.py
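A hedged sketch of the Add & Norm wrapper, assuming it is named `AddNorm` with an `(input_dim, dropout)` constructor and a forward pass that takes the residual input followed by the sublayer output (none of which is spelled out above):

```python
import torch
from pytorch_tabular.models.common.layers.transformers import AddNorm  # assumed class name

add_norm = AddNorm(32, 0.1)            # assumed signature: (embedding dim, dropout probability)
x = torch.randn(8, 10, 32)             # residual (skip-connection) input
sublayer_out = torch.randn(8, 10, 32)  # e.g. the output of an attention block
out = add_norm(x, sublayer_out)        # assumed call order: (input, sublayer output)
print(out.shape)                       # (8, 10, 32)
```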
Bases: Module
Appends the [CLS] token for BERT-like inference.
Source code in src/pytorch_tabular/models/common/layers/transformers.py
Bases: Module
Multi Headed Attention Block in Transformers.
Source code in src/pytorch_tabular/models/common/layers/transformers.py
Bases: Module
A single Transformer Encoder Block.
Source code in src/pytorch_tabular/models/common/layers/transformers.py
__init__(input_embed_dim, num_heads=8, ff_hidden_multiplier=4, ff_activation='GEGLU', attn_dropout=0.1, keep_attn=True, ff_dropout=0.1, add_norm_dropout=0.1, transformer_head_dim=None)¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_embed_dim | int | The input embedding dimension | required |
| num_heads | int | The number of attention heads | 8 |
| ff_hidden_multiplier | int | The hidden dimension multiplier for the position-wise feed-forward layer | 4 |
| ff_activation | str | The activation function for the position-wise feed-forward layer | 'GEGLU' |
| attn_dropout | float | The dropout probability for the attention layer | 0.1 |
| keep_attn | bool | Whether to keep the attention weights | True |
| ff_dropout | float | The dropout probability for the position-wise feed-forward layer | 0.1 |
| add_norm_dropout | float | The dropout probability for the residual connections | 0.1 |
| transformer_head_dim | Optional[int] | The dimension of the attention heads. If None, will default to input_embed_dim | None |
Source code in src/pytorch_tabular/models/common/layers/transformers.py
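A usage sketch built from the signature above; the class name `TransformerEncoderBlock` and the `(batch, tokens, embed_dim)` input convention are assumptions:

```python
import torch
from pytorch_tabular.models.common.layers.transformers import TransformerEncoderBlock  # assumed class name

block = TransformerEncoderBlock(
    input_embed_dim=32,
    num_heads=4,
    ff_hidden_multiplier=4,   # position-wise FFN hidden dim = 4 * 32
    ff_activation="GEGLU",
    attn_dropout=0.1,
    ff_dropout=0.1,
)
tokens = torch.randn(8, 10, 32)   # (batch, n_feature_tokens, input_embed_dim)
out = block(tokens)
print(out.shape)                  # expected (8, 10, 32): the block preserves the token shape
```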
Miscellaneous¶
Bases: Module
A wrapper that exposes a lambda function as a PyTorch module.
Source code in src/pytorch_tabular/models/common/layers/misc.py
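A tiny sketch, assuming the wrapper is named `Lambda` and takes the wrapped function as its only constructor argument:

```python
import torch
from pytorch_tabular.models.common.layers.misc import Lambda  # assumed class name

# Wrap an arbitrary function so it can sit inside nn.Sequential like any other module.
square = Lambda(lambda x: x ** 2)
print(square(torch.tensor([1.0, 2.0, 3.0])))  # tensor([1., 4., 9.])
```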
Bases: Module
Base class for a PyTorch module with a data-aware initializer that runs on the first batch.
Source code in src/pytorch_tabular/models/common/layers/misc.py
Activations¶
Bases: Function
A highly optimized equivalent of `lambda x: Entmax15([x, 0])`.
Source code in src/pytorch_tabular/models/common/layers/activations.py
1.5-entmax: normalizing sparse transform (a la softmax).
Solves the optimization problem:
max_p <x, p> - H_1.5(p) s.t. p >= 0, sum(p) == 1.
where H_1.5(p) is the Tsallis alpha-entropy with alpha=1.5.
Parameters¶
X : torch.Tensor
The input tensor.
dim : int
The dimension along which to apply 1.5-entmax.
k : int or None
Number of largest elements to partial-sort over. For optimal performance, this should be slightly bigger than the expected number of nonzeros in the solution. If the solution is more than k-sparse, this function is recursively called with a 2*k schedule. If None, full sorting is performed from the beginning.
Returns¶
P : torch.Tensor, same shape as X
The projection result, such that P.sum(dim=dim) == 1 elementwise.
sparsemax: normalizing sparse transform (a la softmax).
Solves the projection:
min_p ||x - p||_2 s.t. p >= 0, sum(p) == 1.
Parameters¶
X : torch.Tensor
The input tensor.
dim : int
The dimension along which to apply sparsemax.
k : int or None
Number of largest elements to partial-sort over. For optimal performance, this should be slightly bigger than the expected number of nonzeros in the solution. If the solution is more than k-sparse, this function is recursively called with a 2*k schedule. If None, full sorting is performed from the beginning.
Returns¶
P : torch.Tensor, same shape as X
The projection result, such that P.sum(dim=dim) == 1 elementwise.
Source code in src/pytorch_tabular/models/common/layers/activations.py
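A small comparison sketch. The function names `entmax15` and `sparsemax` are assumed from the descriptions above (the ODST signature earlier on this page already references `sparsemax` by name):

```python
import torch
from pytorch_tabular.models.common.layers.activations import entmax15, sparsemax  # assumed function names

logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])
p_soft = torch.softmax(logits, dim=-1)   # strictly positive everywhere
p_ent = entmax15(logits, dim=-1)         # can assign exact zeros to small logits
p_sparse = sparsemax(logits, dim=-1)     # typically the sparsest of the three
print(p_ent.sum(dim=-1), p_sparse.sum(dim=-1))  # both rows sum to 1
```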
Bases: Module