Tokenize
Text tokenization, including character, word, and sentencepiece tokenizers.
char_tokenizer(text)
Tokenize input text by splitting it into characters
Parameters:

Name | Type | Description | Default |
---|---|---|---|
text | str | Input text | required |

Returns:

Type | Description |
---|---|
List[str] | Tokenized text |
Source code in thunder/text_processing/tokenizer.py
def char_tokenizer(text: str) -> List[str]:
    """Tokenize input text splitting into characters

    Args:
        text: Input text

    Returns:
        Tokenized text
    """
    return list(text)
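Example usage (a minimal sketch, assuming the module import path shown above):

from thunder.text_processing.tokenizer import char_tokenizer

tokens = char_tokenizer("hello world")
print(tokens)  # ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'] -- the space is also a token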
get_most_frequent_tokens(corpus, tokenize_function, minimum_frequency=1, max_number_of_tokens=None)
Helper function to get the most frequent tokens from a text corpus.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
corpus | str | Text corpus to be used. This is a long string containing all of your text. | required |
tokenize_function | Callable | The same tokenizer function that will be used during training | required |
minimum_frequency | int | Remove any token whose frequency is lower than this value | 1 |
max_number_of_tokens | Optional[int] | Optionally limit the output to the K most frequent tokens | None |

Returns:

Type | Description |
---|---|
List[str] | All of the unique, most frequent tokens, ordered by frequency |
Source code in thunder/text_processing/tokenizer.py
def get_most_frequent_tokens(
    corpus: str,
    tokenize_function: Callable,
    minimum_frequency: int = 1,
    max_number_of_tokens: Optional[int] = None,
) -> List[str]:
    """Helper function to get the most frequent tokens from a text corpus.

    Args:
        corpus: Text corpus to be used, this is a long string containing all of your text
        tokenize_function: Same tokenizer function that will be used during training
        minimum_frequency: Remove any token with frequency less than that. Defaults to 1.
        max_number_of_tokens: Optionally limit to the K most frequent tokens. Defaults to None.

    Returns:
        All of the unique, most frequent tokens, ordered by frequency.
    """
    tokenized = tokenize_function(corpus)
    token_counter = Counter(tokenized)
    output_tokens = []
    for token, count in token_counter.most_common(max_number_of_tokens):
        if count >= minimum_frequency:
            output_tokens.append(token)
    return output_tokens
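Example usage (a minimal sketch; the corpus string is an illustrative placeholder, paired with word_tokenizer from this module):

from thunder.text_processing.tokenizer import get_most_frequent_tokens, word_tokenizer

corpus = "the cat sat on the mat the end"  # placeholder corpus
vocab = get_most_frequent_tokens(
    corpus,
    tokenize_function=word_tokenizer,
    minimum_frequency=2,       # drop tokens that appear only once
    max_number_of_tokens=10,   # keep at most the 10 most frequent tokens
)
print(vocab)  # ['the'] -- the only token that appears at least twice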
train_sentencepiece_model(data_file, vocab_size, output_dir, sample_size=-1, do_lower_case=True, tokenizer_type='unigram', character_coverage=1.0, train_extremely_large_corpus=False, max_sentencepiece_length=-1)
Creates a SentencePiece tokenizer model from a data file.
This is a direct port of create_spt_model from the NeMo toolkit
(nemo/collections/common/tokenizers/sentencepiece_tokenizer.py).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data_file | str | Text file containing the sentences that will be used to train the model | required |
vocab_size | int | Maximum vocabulary size | required |
output_dir | str | Folder in which to save the created tokenizer model and vocab | required |
sample_size | int | Maximum number of sentences the trainer loads. -1 means use all the data. | -1 |
do_lower_case | bool | Whether the text should be lowercased before the tokenizer model is created | True |
tokenizer_type | str | Controls the sentencepiece model type | 'unigram' |
character_coverage | float | Value between 0 and 1 (as a percentage). For languages with a vast charset it can be < 1.0, but for all other languages it should be set to 1.0. | 1.0 |
train_extremely_large_corpus | bool | If training on huge datasets, pass this flag to allow SentencePiece to build the tokenizer | False |
max_sentencepiece_length | int | Limits the maximum length of the SentencePiece subword that can be constructed. By default, no limit is placed. | -1 |
Source code in thunder/text_processing/tokenizer.py
def train_sentencepiece_model(
    data_file: str,
    vocab_size: int,
    output_dir: str,
    sample_size: int = -1,
    do_lower_case: bool = True,
    tokenizer_type: str = "unigram",
    character_coverage: float = 1.0,
    train_extremely_large_corpus: bool = False,
    max_sentencepiece_length: int = -1,
) -> str:
    """
    Creates sentence piece tokenizer model from data file.
    This is a direct port of `create_spt_model` present on the NEMO
    toolkit (nemo/collections/common/tokenizers/sentencepiece_tokenizer.py)

    Args:
        data_file: text file containing the sentences that will be used to train the model
        vocab_size: maximum vocabulary size
        output_dir: folder to save created tokenizer model and vocab
        sample_size: maximum number of sentences the trainer loads. -1 means to use all the data.
        do_lower_case: if text should be lower cased before tokenizer model is created
        tokenizer_type: controls the sentencepiece model type.
        character_coverage: float value between 0 and 1 (as a percentage). For languages with a vast charset,
            can be < 1.0, but for all other languages, it should be set as 1.0
        train_extremely_large_corpus: If training on huge datasets, pass this flag to allow SentencePiece
            to build the tokenizer.
        max_sentencepiece_length: Limits the maximum length of the SentencePiece subword that can be constructed.
            By default, no limit is placed.
    """
    data_file = Path(data_file)
    if not data_file or not data_file.exists():
        raise ValueError(f"data_file must be valid file path, but got {data_file}")
    output_dir = Path(output_dir)
    if (output_dir / "tokenizer.model").exists():
        warn(
            "There's already a trained sentencepiece model at the output directory. Skipping train."
        )
        return str(output_dir)
    output_dir.mkdir(exist_ok=True)
    cmd = (
        f"--input={data_file} --model_prefix={output_dir}/tokenizer "
        f"--vocab_size={vocab_size} "
        f"--shuffle_input_sentence=true --hard_vocab_limit=false "
        f"--model_type={tokenizer_type} "
        f"--character_coverage={character_coverage}"
    )
    if do_lower_case:
        cmd += " --normalization_rule_name=nmt_nfkc_cf"
    if sample_size > 0:
        cmd += f" --input_sentence_size={sample_size}"
    if train_extremely_large_corpus:
        cmd += " --train_extremely_large_corpus=true"
    if max_sentencepiece_length >= 0:
        cmd += f" --max_sentencepiece_length={max_sentencepiece_length}"
    sentencepiece.SentencePieceTrainer.Train(cmd)
    return str(output_dir)
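Example usage (a sketch with hypothetical file paths; corpus.txt is assumed to contain one sentence per line). As the source above shows, the function returns the output directory as a string, with the trained tokenizer model and vocab saved inside it:

from thunder.text_processing.tokenizer import train_sentencepiece_model

tokenizer_dir = train_sentencepiece_model(
    data_file="corpus.txt",      # hypothetical training text, one sentence per line
    vocab_size=128,
    output_dir="spm_tokenizer",  # hypothetical output folder
    do_lower_case=True,
    tokenizer_type="unigram",
)
print(tokenizer_dir)  # "spm_tokenizer", now containing tokenizer.model and its vocab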
word_tokenizer(text)
Tokenize input text by splitting it into words
Parameters:

Name | Type | Description | Default |
---|---|---|---|
text | str | Input text | required |

Returns:

Type | Description |
---|---|
List[str] | Tokenized text |
Source code in thunder/text_processing/tokenizer.py
def word_tokenizer(text: str) -> List[str]:
    """Tokenize input text splitting into words

    Args:
        text: Input text

    Returns:
        Tokenized text
    """
    return text.split()
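Example usage (a minimal sketch, mirroring the character-level example above; tokens are whitespace-delimited words):

from thunder.text_processing.tokenizer import word_tokenizer

tokens = word_tokenizer("the quick brown fox")
print(tokens)  # ['the', 'quick', 'brown', 'fox']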