Skip to content

Preprocess

Text preprocessing functionality

expand_numbers(text, language='en')

Expand the numbers present inside the text. That means converting "42" into "forty two". It also detects if the number is ordinal automatically.

Parameters:

Name Type Description Default
text str

Input text

required
language str

Language used to expand the numbers. Defaults to "en".

'en'

Returns:

Type Description
str

Output text

Source code in thunder/text_processing/preprocess.py
def expand_numbers(text: str, language: str = "en") -> str:
    """Expand the numbers present inside the text.
    That means converting "42" into "forty two".
    It also detects if the number is ordinal automatically.

    Args:
        text: Input text
        language: Language used to expand the numbers. Defaults to "en".

    Returns:
        Output text
    """
    number_regex = re.compile(r"\d+º*")

    all_numbers = number_regex.findall(text)
    for num in all_numbers:
        if "º" in num:
            pure_number = num.replace("º", "").strip()
            expanded = num2words(int(pure_number), lang=language, to="ordinal")
        else:
            expanded = num2words(int(num), lang=language)
        text = text.replace(num, expanded)
    return text

lower_text(text)

Transform all the text to lowercase.

Parameters:

Name Type Description Default
text str

Input text

required

Returns:

Type Description
str

Output text

Source code in thunder/text_processing/preprocess.py
def lower_text(text: str) -> str:
    """Transform all the text to lowercase.

    Args:
        text: Input text

    Returns:
        Output text
    """
    return text.lower()

normalize_text(text)

Normalize the text to remove accents and ensure all the characters are valid ascii symbols.

Parameters:

Name Type Description Default
text str

Input text

required

Returns:

Type Description
str

Output text

Source code in thunder/text_processing/preprocess.py
def normalize_text(text: str) -> str:
    """Normalize the text to remove accents
    and ensure all the characters are valid
    ascii symbols.

    Args:
        text: Input text

    Returns:
        Output text
    """
    nfkd_form = unicodedata.normalize("NFKD", text)
    only_ascii = nfkd_form.encode("ASCII", "ignore")
    return only_ascii.decode()