Text generation can be done word by word or character by character. The second option works surprisingly well: models manage to reconstruct whole words as well as punctuation.
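To make the difference between the two granularities concrete, here is a minimal sketch on a made-up sentence (it is not taken from the corpus loaded below). It only illustrates that a character-level vocabulary stays very small, while a word-level vocabulary grows with every new word.

```python
# Illustrative sentence, not taken from the notebook's corpus.
text = "we the people"

# Word-level units: one token per whitespace-separated word.
word_tokens = text.split()

# Character-level units: one token per character, spaces included.
char_tokens = list(text)

print(word_tokens)
print(char_tokens)
print(len(set(word_tokens)), "distinct words,", len(set(char_tokens)), "distinct characters")
```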
import re
import sys
import unicodedata
from pprint import pprint

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
speechs = pd.read_json(INPUT_FILE)
speechs.sample(10)

corpus = " ".join(list(speechs["content"]))
pprint(corpus[:1000])
('My fellow citizens: I stand here today humbled by the task before us, '
 'grateful for the trust you have bestowed, mindful of the sacrifices borne by '
 'our ancestors. I thank President Bush for his service to our nation, as well '
 'as the generosity and cooperation he has shown throughout this transition. '
 'Forty-four Americans have now taken the presidential oath. The words have '
 'been spoken during rising tides of prosperity and the still waters of peace. '
 'Yet, every so often the oath is taken amidst gathering clouds and raging '
 'storms. At these moments, America has carried on not simply because of the '
 'skill or vision of those in high office, but because We the People have '
 'remained faithful to the ideals of our forbearers, and true to our founding '
 'documents. So it has been. So it must be with this generation of Americans. '
 'That we are in the midst of crisis is now well understood. Our nation is at '
 'war, against a far-reaching network of violence and hatred. Our economy is '
 'badly weakened, a consequen')
def preprocess_text(text):
    # strip accents
    new_text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8')
    # convert to lowercase
    new_text = new_text.lower()
    # keep only lowercase letters, basic punctuation (, . ' -) and newlines
    new_text = re.sub('[^a-z,.\'\n-]+', ' ', new_text)
    # remove blank lines and collapse double spaces
    new_text = new_text.replace('\n\n', '').replace('  ', ' ')
    return new_text


clean_corpus = preprocess_text(corpus)
pprint(clean_corpus[:1000])
('my fellow citizens i stand here today humbled by the task before us, '
 'grateful for the trust you have bestowed, mindful of the sacrifices borne by '
 'our ancestors. i thank president bush for his service to our nation, as well '
 'as the generosity and cooperation he has shown throughout this transition. '
 'forty-four americans have now taken the presidential oath. the words have '
 'been spoken during rising tides of prosperity and the still waters of peace. '
 'yet, every so often the oath is taken amidst gathering clouds and raging '
 'storms. at these moments, america has carried on not simply because of the '
 'skill or vision of those in high office, but because we the people have '
 'remained faithful to the ideals of our forbearers, and true to our founding '
 'documents. so it has been. so it must be with this generation of americans. '
 'that we are in the midst of crisis is now well understood. our nation is at '
 'war, against a far-reaching network of violence and hatred. our economy is '
 'badly weakened, a consequenc')
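print('Corpus size:', len(clean_corpus))

# vocabulary: every distinct character in the cleaned corpus
chars = sorted(list(set(clean_corpus)))
print('Total chars:', len(chars))

# lookup tables: character -> index and index -> character
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))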
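The two lookup tables are inverses of each other, which is what lets the model's integer predictions be turned back into text later. A minimal sketch of that round trip, using a made-up string and `toy_`-prefixed names (both are assumptions for illustration, not part of the notebook):

```python
# Toy string standing in for clean_corpus (illustrative only).
toy_text = "yes we can"

# Same construction as above, on the toy string.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integers, then decode it back.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```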
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(clean_corpus) - maxlen, step):
    sentences.append(clean_corpus[i: i + maxlen])
    next_chars.append(clean_corpus[i + maxlen])
print('Total sequences:', len(sentences))
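The sliding window is easier to follow on a tiny input. The sketch below applies the same loop to a ten-character toy string with a window of 4 and a step of 3 (toy values chosen for readability, not the notebook's settings): each window is paired with the character that immediately follows it.

```python
# Toy input and window settings (illustrative; the notebook uses maxlen=40, step=3).
toy_text = "abcdefghij"
toy_maxlen = 4
toy_step = 3

windows = []
targets = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])  # input sequence
    targets.append(toy_text[i + toy_maxlen])     # character to predict

for window, target in zip(windows, targets):
    print(repr(window), "->", repr(target))
```

Consecutive windows overlap by `toy_maxlen - toy_step` characters, which is why the original comment calls the sequences semi-redundant.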
print('Vectorization...')
# one-hot encode inputs (samples, timesteps, vocabulary) and targets (samples, vocabulary)
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
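As a sanity check on the encoding, here is a minimal sketch with a three-character toy vocabulary. The helper `one_hot_sequence` is hypothetical and introduced only for this illustration. Each timestep should have exactly one active position, and taking the argmax along the vocabulary axis should recover the original indices.

```python
import numpy as np

# Toy vocabulary (illustrative only).
toy_chars = ["a", "b", "c"]
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}


def one_hot_sequence(seq, char_indices):
    """Hypothetical helper: encode a string as a (len(seq), vocab_size) boolean array."""
    out = np.zeros((len(seq), len(char_indices)), dtype=bool)
    for t, ch in enumerate(seq):
        out[t, char_indices[ch]] = True
    return out


encoded = one_hot_sequence("cab", toy_char_indices)
print(encoded.astype(int))

# Recover the indices, then the characters.
recovered = "".join(toy_chars[i] for i in encoded.argmax(axis=1))
print(recovered)
```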
# build the model: a single LSTM
print('Build model...')
model = keras.Sequential()
model.add(keras.layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(keras.layers.Dense(len(chars), activation='softmax'))
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 lstm (LSTM)                 (None, 128)               81920

 dense (Dense)               (None, 31)                3999

=================================================================
Total params: 85,919
Trainable params: 85,919
Non-trainable params: 0
_________________________________________________________________
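The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias. A dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes the default layer settings (biases enabled) and uses two hypothetical helper functions written for this check.

```python
def lstm_param_count(units, input_dim):
    """Hypothetical helper: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(units, input_dim):
    """Hypothetical helper: weight matrix plus one bias per output unit."""
    return units * input_dim + units


vocab_size = 31  # number of distinct characters, read off the Dense output shape
lstm_params = lstm_param_count(units=128, input_dim=vocab_size)
dense_params = dense_param_count(units=vocab_size, input_dim=128)

print(lstm_params, dense_params, lstm_params + dense_params)
```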
optimizer = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=["accuracy"])

BATCH_SIZE = 128
EPOCHS = 20
history = model.fit(x, y, batch_size=BATCH_SIZE, epochs=EPOCHS)
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    # rescale the log-probabilities by the temperature, then renormalize
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw a single index from the rescaled distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
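To see what the temperature does before any random draw happens, the sketch below applies the same rescaling to a made-up three-class distribution. The function `apply_temperature` is a hypothetical helper that mirrors the first half of `sample`; the only difference is that it subtracts the maximum before exponentiating, for numerical stability. Low temperatures sharpen the distribution towards the most likely character, and temperatures above 1 flatten it.

```python
import numpy as np


def apply_temperature(preds, temperature):
    """Hypothetical helper: rescale a probability vector by a temperature."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits - logits.max())  # subtracting the max does not change the result
    return exp_logits / exp_logits.sum()


# Made-up distribution over three characters (illustrative only).
probs = np.array([0.6, 0.3, 0.1])
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

This matches the generated samples further down: at low temperature the model repeats its safest phrases, and at high temperature it produces more varied text but more spelling errors.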
def predict_text(start_index):
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('----- temperature:', temperature)

        # seed the generation with maxlen characters taken from the corpus
        generated = ''
        sentence = clean_corpus[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # one-hot encode the current window
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            # predict the next character and sample it at the given temperature
            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = indices_char[next_index]

            # slide the window forward by one character
            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


start_index = np.random.randint(0, len(clean_corpus) - maxlen - 1)
predict_text(start_index)
----- temperature: 0.2
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatting the support the country and the challenges that the change the support the same persons and senator the support that we have to do the same american people and the country that the president that the same contracts that the same country the same country that the country that the same country that the support the same country and the same country that the country where the same country that t
----- temperature: 0.5
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatting and a capable judge the change the support the same allow that we should lear individual states and supplion senator who have a faith has been in the political program of the support of the caused on the support. i get that you have the crisis who want to work the senate to order that we should ever bestit of the succeed to come to get to come to the planet the sost that our political contra
----- temperature: 1.0
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchans clease here when the other childrens, most our engogions and members movement would sezemes of president senator and aftersers to accomps is. kid our womanbers and flexible to come topight alant, choices and help uspeed is not the few threat in end met amendment to take herse health care-know that consequence made to just stood obligation in greatesm clan emphersable very a drardes who are not
----- temperature: 1.2
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatenthisis, it was, race. intide-seried from natia. you truahs, the ral begin -- from incregitious know but collinal committee, neam to be princility. undemo, it yearogphilom, and our fadry lenglies.
toepre fack inligistimated to no shore are wreates strugnion introrved docure down there's think mudeman very bara. dick islat learned our future, now there about a tcroying recouse spack there agown