In this step, we will preprocess the text. First, we create a Tokenizer object and fit it on our training data.
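Here, training_data is assumed to be a plain Python list of strings, one line of text per entry; a hypothetical stand-in corpus could look like this:

import tensorflow as tf

# Hypothetical stand-in corpus; replace with your own list of text lines
training_data = [
    "deep learning models work with numbers",
    "we turn every line of text into a sequence of integers",
    "the tokenizer builds a word index from the training text"
]

# Create the tokenizer and build its vocabulary from the training text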
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(training_data)
After fitting, we can inspect the tokenizer's word index and compute the total number of words in the vocabulary. We add 1 because index 0 is reserved for padding and never assigned to a word.
total_words = len(tokenizer.word_index) + 1
print(tokenizer.word_index)
print(total_words)
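To see how this index is used, we can (purely as a hypothetical check) map one of the training lines back to its integer representation with texts_to_sequences, the same call we will rely on in the next step:

# Hypothetical sanity check: convert one line into its sequence of word indices
sample_line = training_data[0]
print(tokenizer.texts_to_sequences([sample_line])[0])
# Prints one integer per word, each taken from tokenizer.word_index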
One of the most important tasks in this preprocessing step is splitting each sentence into n-grams. Before doing that, we need to transform each sentence into a sequence of integers: deep learning networks work with numbers and cannot handle strings directly, so we convert each line of the training data into a sequence of word indices.
We then split each of these integer sequences into n-grams and store them in the input_sequences list.
input_sequences = []
for single_line in training_data:
    # Convert the line of text into a list of word indices
    token_list = tokenizer.texts_to_sequences([single_line])[0]
    # Build every n-gram prefix, from the first two tokens up to the full sequence
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        input_sequences.append(n_gram_sequence)
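To make the loop concrete, here is a small hypothetical walk-through with a made-up token list:

# Hypothetical example: one tokenized line and the n-gram prefixes the loop builds from it
example_tokens = [4, 12, 7, 9]   # made-up output of texts_to_sequences for one line
example_ngrams = [example_tokens[:i + 1] for i in range(1, len(example_tokens))]
print(example_ngrams)            # [[4, 12], [4, 12, 7], [4, 12, 7, 9]]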
Afterwards, we need to find the length of the longest sequence and pad every shorter sequence with zeros at the front (pre-padding) so that all sequences share this maximum length.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Length of the longest n-gram sequence
max_sequence_len = max(len(x) for x in input_sequences)
# Left-pad every sequence with zeros so they all have the same length
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
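As a quick check, the padded array should now be rectangular: one row per n-gram, each row of length max_sequence_len, with zeros on the left of the shorter sequences.

# Inspect the padded result: shape is (number_of_ngrams, max_sequence_len)
print(input_sequences.shape)
# Shorter n-grams are left-padded with zeros, e.g. [0 0 ... 0 4 12]
print(input_sequences[0])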