Preparation of text data

As is typical in NLP tasks, all strings are converted to lowercase. Since the model will consider sequences of characters (not sequences of words), we obtain the training vocabulary as the set of unique characters used in the dataset. We add a character, P, that corresponds to padding: since we will need to define a fixed input length, NB_CHARS_MAX, strings that are shorter than that will be padded:

list_of_existing_chars = list(set(texts.str.cat(sep=' ')))
vocabulary = ''.join(list_of_existing_chars)
vocabulary += 'P' # add padding character
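
The lowercase conversion itself is not shown above; assuming texts is a pandas Series of transcripts (which its use of the .str accessor suggests), it is a one-liner:

texts = texts.str.lower()

Lowercasing also guarantees that the uppercase P cannot collide with a character already present in the transcripts.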

Each character is then associated with an integer that will represent it:

# Create association between vocabulary and id
vocabulary_id = {}
i = 0
for char in list(vocabulary):
    vocabulary_id[char] = i
    i += 1
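
This loop can also be written as a dictionary comprehension using enumerate; the following one-liner is equivalent:

vocabulary_id = {char: i for i, char in enumerate(vocabulary)}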

We are now ready to transform the text data. We define a function, transform_text_for_ml, which takes a list of strings, list_of_strings, where each string is a transcript; a dictionary, vocabulary_ids, which maps the characters in the vocabulary to integers; and the maximum number of characters per transcript, max_length. This function transforms each transcript into the list of its character ids (in order of appearance) and appends as many padding characters as necessary so that every transcript has exactly max_length characters:

import numpy as np
from tqdm import tqdm

def transform_text_for_ml(list_of_strings, vocabulary_ids, max_length):
    transformed_data = []

    for string in tqdm(list_of_strings):
        list_of_char = list(string)
        list_of_char_id = [vocabulary_ids[char] for char in list_of_char]

        nb_char = len(list_of_char_id)

        # padding for fixed input length
        if nb_char < max_length:
            for i in range(max_length - nb_char):
                list_of_char_id.append(vocabulary_ids['P'])

        transformed_data.append(list_of_char_id)

    ml_input_training = np.array(transformed_data)

    return ml_input_training
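
As a quick sanity check, here is the function applied to a toy input, with a hypothetical three-character vocabulary and a max_length of 5; the two real characters are followed by three padding ids:

toy_vocabulary = {'h': 0, 'i': 1, 'P': 2}
print(transform_text_for_ml(['hi'], toy_vocabulary, 5))
# [[0 1 2 2 2]]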

We can now apply this processing function:

text_input_ml = transform_text_for_ml(texts.values,
                                      vocabulary_id,
                                      NB_CHARS_MAX)
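
It is worth verifying the shape of the result: one row per transcript and NB_CHARS_MAX columns:

print(text_input_ml.shape) # (number of transcripts, NB_CHARS_MAX)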

The Python script, 2_create_text_dataset.py, loads the text data, processes it (as shown previously), splits the result into training and testing sets, and dumps both sets as pickle files. The vocabulary dictionary that maps characters to their associated integers is also saved:

# split into training and testing
len_train = int(TRAIN_SET_RATIO * len(metadata))
text_input_ml_training = text_input_ml[:len_train]
text_input_ml_testing = text_input_ml[len_train:]

# save data
joblib.dump(text_input_ml_training, 'data/text_input_ml_training.pkl')
joblib.dump(text_input_ml_testing, 'data/text_input_ml_testing.pkl')

joblib.dump(vocabulary_id, 'data/vocabulary.pkl')
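
At training time, these artifacts can be loaded back with joblib.load. The sketch below also builds the inverse mapping, which is handy for decoding predicted ids back into characters (assuming the file paths used above):

import joblib

text_input_ml_training = joblib.load('data/text_input_ml_training.pkl')
vocabulary_id = joblib.load('data/vocabulary.pkl')

# invert the mapping to turn ids back into characters
id_to_char = {i: char for char, i in vocabulary_id.items()}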