tokenizer

Tokenizer from tiktoken.

Classes

Tokenizer([name, remove_stop_words])

Tokenizer component that wraps around the tokenizer from tiktoken.

class Tokenizer(name: str = 'cl100k_base', remove_stop_words: bool = False)

Bases: object

Tokenizer component that wraps around the tokenizer from tiktoken. __call__ is the same as forward/encode, so the component can be used in Sequential. Additionally, you can use the encode and decode methods directly.

Parameters:

name (str, optional) – The name of the tokenizer. Defaults to “cl100k_base”. You can find more information in the tiktoken documentation.

remove_stop_words (bool, optional) – Whether to remove stop words. Defaults to False.
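The idea that __call__ mirrors encode, so the object can sit in a chain of callables, can be illustrated with a minimal sketch. ByteTokenizer and run_in_sequence are hypothetical stand-ins invented for this example. They are not part of this module or of tiktoken, and the byte-level encoding only keeps the sketch dependency-free.

```python
from typing import Any, Callable, List


class ByteTokenizer:
    """Hypothetical stand-in: token IDs are simply UTF-8 byte values."""

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: List[int]) -> str:
        return bytes(tokens).decode("utf-8")

    def __call__(self, text: str) -> List[int]:
        # Calling the object is the same as calling encode.
        return self.encode(text)


def run_in_sequence(steps: List[Callable[[Any], Any]], value: Any) -> Any:
    """Hypothetical helper: feed each step's output into the next step."""
    for step in steps:
        value = step(value)
    return value


tokenizer = ByteTokenizer()

# The tokenizer is used like any other callable step in the chain.
token_count = run_in_sequence([str.strip, tokenizer, len], "  hello world  ")
print(token_count)  # 11
```

Because the tokenizer is an ordinary callable, it needs no adapter code to be placed between a text-cleaning step and a counting step.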

preprocess(text: str) → List[str]

encode(text: str) → List[int]: Encodes the input text/word into token IDs.

decode(tokens: List[str]) → str: Decodes the input tokens into text.

count_tokens(text: str) → int: Counts the number of tokens in the input text.

get_string_tokens(text: str) → List[str]: Returns the string tokens from the input text.
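As a rough sketch, count_tokens and get_string_tokens could be built on top of encode and decode as shown below. This is one plausible approach under stated assumptions, not necessarily how the class implements them. It uses tiktoken's get_encoding, Encoding.encode, and Encoding.decode. The _Utf8Fallback class is a hypothetical stand-in that is only used when tiktoken or its encoding file is unavailable.

```python
from typing import List


class _Utf8Fallback:
    """Hypothetical stand-in used only when tiktoken cannot be loaded."""

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: List[int]) -> str:
        return bytes(tokens).decode("utf-8", errors="replace")


try:
    import tiktoken

    # Same default encoding name as the Tokenizer signature above.
    enc = tiktoken.get_encoding("cl100k_base")
except Exception:
    # tiktoken is missing or the encoding file could not be fetched.
    enc = _Utf8Fallback()


def count_tokens(text: str) -> int:
    """Number of token IDs produced for the text."""
    return len(enc.encode(text))


def get_string_tokens(text: str) -> List[str]:
    """Decode each token ID on its own to see the text piece it covers."""
    return [enc.decode([token_id]) for token_id in enc.encode(text)]


text = "Hello, world!"
token_ids = enc.encode(text)
print(token_ids)
print(get_string_tokens(text))
print(count_tokens(text))
```

For plain ASCII input such as this, joining the string tokens reproduces the original text. For multi-byte characters, a single token may cover only part of a character, so decoding tokens one by one can yield replacement characters.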