WordPieceTokenizer
keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=None,
    sequence_length=None,
    lowercase=False,
    strip_accents=False,
    split=True,
    split_on_cjk=True,
    suffix_indicator="##",
    oov_token="[UNK]",
    special_tokens=None,
    special_tokens_in_strings=False,
    dtype="int32",
    **kwargs
)
A WordPiece tokenizer layer.
This layer provides an efficient, in-graph implementation of the WordPiece algorithm used by BERT and other models.

To make this layer more useful out of the box, the layer will pre-tokenize the input, which will optionally lower-case, strip accents, and split the input on whitespace and punctuation. None of these pre-tokenization steps is reversible: the detokenize method will join words with a space, and will not invert tokenize exactly.
If a more custom pre-tokenization step is desired, the layer can be configured to apply only the strict WordPiece algorithm by passing lowercase=False, strip_accents=False and split=False. In this case, inputs should be pre-split string tensors or ragged tensors.
Tokenizer outputs can either be padded and truncated with a sequence_length argument, or left un-truncated. The exact output will depend on the rank of the input tensors.
If input is a batch of strings (rank > 0): by default, the layer will output a tf.RaggedTensor where the last dimension of the output is ragged. If sequence_length is set, the layer will output a dense tf.Tensor where all inputs have been padded or truncated to sequence_length.
If input is a scalar string (rank == 0): by default, the layer will output a dense tf.Tensor with static shape [None]. If sequence_length is set, the output will be a dense tf.Tensor of shape [sequence_length].
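For instance, a minimal sketch (not from the upstream docs) of how the batched output type changes with sequence_length, assuming the same toy vocabulary used in the examples below:

import keras_hub

vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
batch = ["The quick brown fox.", "The fox."]

# Without sequence_length, the batched output is ragged: each row keeps
# its own token count (7 and 3 tokens here).
ragged_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab, lowercase=True
)
ragged_outputs = ragged_tokenizer(batch)  # tf.RaggedTensor

# With sequence_length, every row is padded or truncated to the same
# length, so the output is a dense tensor of shape (2, 8).
dense_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab, lowercase=True, sequence_length=8
)
dense_outputs = dense_tokenizer(batch)  # tf.Tensor of shape (2, 8)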
The output dtype can be controlled via the dtype argument, which should be either an integer or string type.
Arguments

vocabulary: A list of strings or a string filename path. If passing a list, each element should be a single WordPiece token string. If passing a filename, the file should be a plain text file containing a single WordPiece token per line.
sequence_length: If set, the output will be converted to a dense tensor and padded/trimmed so all outputs are of sequence_length.
lowercase: If True, the input text will be lowercased before tokenization. Defaults to False.
strip_accents: If True, all accent marks will be removed from text before tokenization. Defaults to False.
split: If True, input will be split on whitespace and punctuation marks, and all punctuation marks will be kept as tokens. If False, input should be split ("pre-tokenized") before calling the tokenizer, and passed as a dense or ragged tensor of whole words. Defaults to True.
split_on_cjk: If True, input will also be split on CJK characters. Only applicable when split is True. Defaults to True.
suffix_indicator: The characters prepended to a WordPiece to indicate that it is a suffix to another subword, e.g. "##ing". Defaults to "##".
oov_token: The string value to substitute for an unknown token. It must be included in the vocabulary. Defaults to "[UNK]".
special_tokens: A list of special token strings. When special_tokens_in_strings is True, every special token found in an input string is mapped to its id, even if it contains characters that would normally be split during pre-tokenization.
special_tokens_in_strings: If True, the tokenizer expects special tokens to appear verbatim in the input strings and tokenizes them accordingly. Defaults to False.
Examples
Ragged outputs.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The quick brown fox."
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... lowercase=True,
... )
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([1, 2, 3, 4, 5, 6, 7], dtype=int32)
Dense outputs.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = ["The quick brown fox."]
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... sequence_length=10,
... lowercase=True,
... )
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([[1, 2, 3, 4, 5, 6, 7, 0, 0, 0]], dtype=int32)
String output.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The quick brown fox."
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... lowercase=True,
... dtype="string",
... )
>>> tokenizer(inputs)
['the', 'qu', '##ick', 'br', '##own', 'fox', '.']
Detokenization.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The quick brown fox."
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... lowercase=True,
... )
>>> tokenizer.detokenize(tokenizer.tokenize(inputs))
'the quick brown fox .'
Custom splitting.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The$quick$brown$fox"
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... split=False,
... lowercase=True,
... dtype='string',
... )
>>> split_inputs = tf.strings.split(inputs, sep="$")
>>> tokenizer(split_inputs)
['the', 'qu', '##ick', 'br', '##own', 'fox']
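Out-of-vocabulary tokens. A minimal added sketch (not part of the upstream examples, and assuming the same toy vocabulary): any word that cannot be matched against the vocabulary is replaced by the oov_token, "[UNK]" by default.

vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=True,
    dtype="string",
)
# "jumps" has no matching WordPiece in the vocabulary, so it falls back to "[UNK]".
tokenizer("The fox jumps.")
# Expected tokens: ['the', 'fox', '[UNK]', '.']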
tokenize method

WordPieceTokenizer.tokenize(inputs)

Transform input tensors of strings into output tokens.

Arguments

inputs: Input tensor, or dict/list/tuple of input tensors.
detokenize method

WordPieceTokenizer.detokenize(inputs)

Transform tokens back into strings.

Arguments

inputs: Input tensor, or dict/list/tuple of input tensors.
get_vocabulary method

WordPieceTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string tokens.
vocabulary_size method

WordPieceTokenizer.vocabulary_size()

Get the integer size of the tokenizer vocabulary.
token_to_id method

WordPieceTokenizer.token_to_id(token)

Convert a string token to an integer id.
id_to_token method

WordPieceTokenizer.id_to_token(id)

Convert an integer id to a string token.
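A minimal usage sketch of these vocabulary helpers (an added illustration, assuming the toy vocabulary from the examples above):

vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(vocabulary=vocab)

tokenizer.vocabulary_size()     # 8
tokenizer.get_vocabulary()[:3]  # ['[UNK]', 'the', 'qu']
tokenizer.token_to_id("fox")    # 6
tokenizer.id_to_token(3)        # '##ick'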