tokenizer#

Tokenizer from tiktoken.

Classes

Tokenizer([name, remove_stop_words])

Tokenizer component that wraps around the tokenizer from tiktoken.

class Tokenizer(name: str = 'cl100k_base', remove_stop_words: bool = False)[source]#

Bases: object

Tokenizer component that wraps around the tokenizer from tiktoken. __call__ is the same as forward/encode, so that it can be used in a Sequential. Additionally, you can also use the encode and decode methods.

Parameters:
  • name (str, optional) – The name of the tokenizer. Defaults to “cl100k_base”. You can find more information in the tiktoken documentation.

preprocess(text: str) → List[str][source]#
encode(text: str) → List[int][source]#

Encodes the input text/word into token IDs.

decode(tokens: List[str]) → str[source]#

Decodes the input tokens into text.

count_tokens(text: str) → int[source]#

Counts the number of tokens in the input text.

get_string_tokens(text: str) → List[str][source]#

Returns the string tokens from the input text.