CategoryEncoding
class keras.layers.CategoryEncoding(
    num_tokens=None, output_mode="multi_hot", sparse=False, **kwargs
)
A preprocessing layer which encodes integer features.
This layer provides options for condensing data into a categorical encoding
when the total number of tokens is known in advance. It accepts integer
values as inputs, and it outputs a dense or sparse representation of those
inputs. For integer inputs where the total number of tokens is not known,
use keras.layers.IntegerLookup instead.
Note: This layer is safe to use inside a tf.data pipeline
(independently of which backend you're using).
Examples
One-hot encoding data
>>> layer = keras.layers.CategoryEncoding(
...     num_tokens=4, output_mode="one_hot")
>>> layer([3, 2, 0, 1])
array([[0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])
Multi-hot encoding data
>>> layer = keras.layers.CategoryEncoding(
...     num_tokens=4, output_mode="multi_hot")
>>> layer([[0, 1], [0, 0], [1, 2], [3, 1]])
array([[1., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 1., 0.],
       [0., 1., 0., 1.]])
Using weighted inputs in "count" mode
>>> import numpy as np
>>> layer = keras.layers.CategoryEncoding(
...     num_tokens=4, output_mode="count")
>>> count_weights = np.array([[.1, .2], [.1, .1], [.2, .3], [.4, .2]])
>>> layer([[0, 1], [0, 0], [1, 2], [3, 1]], count_weights=count_weights)
array([[0.1, 0.2, 0. , 0. ],
       [0.2, 0. , 0. , 0. ],
       [0. , 0.2, 0.3, 0. ],
       [0. , 0.2, 0. , 0.4]])
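To make the arithmetic in the weighted example explicit, here is a minimal pure-Python sketch of how such a weighted count could be accumulated. It is not the Keras implementation: `weighted_count` is a hypothetical helper written for illustration, and it assumes that each weight is simply added to the output slot of its token.

```python
def weighted_count(samples, weights, num_tokens):
    """Sum each token's weight into its slot, producing one row per sample.

    Hypothetical helper for illustration only; not part of Keras.
    """
    output = []
    for tokens, token_weights in zip(samples, weights):
        row = [0.0] * num_tokens
        for token, weight in zip(tokens, token_weights):
            # A token that repeats within a sample accumulates its weights.
            row[token] += weight
        output.append(row)
    return output


samples = [[0, 1], [0, 0], [1, 2], [3, 1]]
weights = [[0.1, 0.2], [0.1, 0.1], [0.2, 0.3], [0.4, 0.2]]
encoded = weighted_count(samples, weights, num_tokens=4)
```

In the second sample the token 0 appears twice, so its two weights of 0.1 add up to the 0.2 shown in the first column of the second output row above.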
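Arguments
num_tokens: The total number of tokens the layer should support. All inputs
to the layer must be integers in the range 0 <= value < num_tokens, or an
error will be thrown.
output_mode: Specification for the output of the layer. Values can be
"one_hot", "multi_hot" or "count", configuring the layer as follows: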
- "one_hot": Encodes each individual element in the input into an array of
num_tokens size, containing a 1 at the element index. If the last dimension
is size 1, will encode on that dimension. If the last dimension is not
size 1, will append a new dimension for the encoded output.
- "multi_hot": Encodes each sample in the input into a single array of
num_tokens size, containing a 1 for each vocabulary term present in the
sample. Treats the last dimension as the sample dimension: if the input
shape is (..., sample_length), the output shape will be (..., num_tokens).
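- "count": Like "multi_hot", but the int array contains a count of the
number of times the token at that index appeared in the sample.
For all output modes, currently only output up to rank 2 is supported.
Defaults to "multi_hot".
sparse: Whether to return a sparse tensor instead of a dense one, on
backends that support sparse tensors. Defaults to False.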
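The three output modes can be sketched in plain Python to show how they differ. This is an illustrative approximation rather than the layer's actual code: the helpers `check_range`, `one_hot`, `count` and `multi_hot` are hypothetical names, inputs are plain Python lists, and raising `ValueError` for out-of-range tokens is an assumption (the documentation only states that an error is thrown).

```python
def check_range(tokens, num_tokens):
    """Reject any token outside 0 <= value < num_tokens."""
    for token in tokens:
        if not 0 <= token < num_tokens:
            raise ValueError(f"token {token} is outside [0, {num_tokens})")


def one_hot(tokens, num_tokens):
    """One output row per element of a 1D input, with a 1 at the element index."""
    check_range(tokens, num_tokens)
    return [
        [1.0 if index == token else 0.0 for index in range(num_tokens)]
        for token in tokens
    ]


def count(samples, num_tokens):
    """One output row per sample, counting how often each token appears."""
    rows = []
    for sample in samples:
        check_range(sample, num_tokens)
        row = [0.0] * num_tokens
        for token in sample:
            row[token] += 1.0
        rows.append(row)
    return rows


def multi_hot(samples, num_tokens):
    """Like count, but each slot only records presence (1) or absence (0)."""
    return [
        [1.0 if value > 0 else 0.0 for value in row]
        for row in count(samples, num_tokens)
    ]


samples = [[0, 1], [0, 0], [1, 2], [3, 1]]
one_hot_rows = one_hot([3, 2, 0, 1], num_tokens=4)
multi_hot_rows = multi_hot(samples, num_tokens=4)
count_rows = count(samples, num_tokens=4)
```

The sample `[0, 0]` shows the difference between the last two modes: `multi_hot` yields `[1.0, 0.0, 0.0, 0.0]` for it, while `count` yields `[2.0, 0.0, 0.0, 0.0]`.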
Call arguments
inputs: A 1D or 2D tensor of integer inputs.
count_weights: A tensor in the same shape as inputs indicating the weight
for each sample value when summing up in count mode. Not used in
"multi_hot" or "one_hot" modes.