Step 1: Setting Up the Environment
──────────────────────────────
Explanation: In this step, we prepare our working environment by installing the necessary libraries and downloading the resources required for natural language processing (NLP). We use pip to install TensorFlow (for deep learning), NLTK (for tokenization and text processing), and Flask (for building a simple web interface later).
──────────────────────────────
Code:
──────────────────────────────
Install the necessary libraries. Open your terminal or command prompt and run:

pip install tensorflow nltk flask

Once installed, we set up NLTK by downloading the required datasets.

import nltk

# Download the Punkt tokenizer for sentence splitting
nltk.download('punkt')

# Download a sample literary dataset from the Gutenberg collection
nltk.download('gutenberg')

print("Environment setup complete. Libraries installed and datasets downloaded.")
──────────────────────────────
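If you want to confirm that the installation succeeded before moving on, a quick sanity check such as the following is enough (the printed version numbers will vary with your environment):

import tensorflow as tf
import nltk
import flask

# Print the installed versions; a successful import confirms the setup.
print("TensorFlow:", tf.__version__)
print("NLTK:", nltk.__version__)
print("Flask:", flask.__version__)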
Step 2: Data Collection and Pre-processing
──────────────────────────────
Explanation: We now collect a dataset for training our generative model. We use the Gutenberg corpus available via NLTK, which consists of many public-domain literary texts. We combine these texts into a single corpus and then tokenize it into sentences to prepare our data. You could also add cleaning steps such as lowercasing, removing punctuation, or filtering short sentences as needed; a sketch of such cleaning follows the code below.
──────────────────────────────
Code:
──────────────────────────────
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize

# Load all texts from the Gutenberg corpus
texts = [gutenberg.raw(fileid) for fileid in gutenberg.fileids()]

# Combine all texts into one large string (corpus)
corpus = ' '.join(texts)

# Tokenize the corpus into sentences for more manageable processing
sentences = sent_tokenize(corpus)

# Optional simple preprocessing: lowercase each sentence
sentences = [sentence.lower() for sentence in sentences]

print(f"Number of sentences in the corpus: {len(sentences)}")
──────────────────────────────
Step 3: Text Vectorization
──────────────────────────────
Explanation: Neural networks require numerical input. We achieve this by tokenizing the text into words and converting those words into sequences of integers, where each unique word gets a unique index. We then produce n-gram sequences (partial sequences) so the model can later learn to predict the next word. Finally, we pad these sequences so they have a uniform length across batches.
──────────────────────────────
Code:
──────────────────────────────
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize and fit the tokenizer on our sentence data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# Total word count (vocabulary size); +1 accounts for the reserved 0 index used for padding
total_words = len(tokenizer.word_index) + 1
print(f"Vocabulary Size: {total_words}")

# Prepare input sequences: create sequences of words that gradually
# increase in length so the model can learn from growing context.
input_sequences = []
for sentence in sentences:
    # Convert the sentence to a sequence of integers
    token_list = tokenizer.texts_to_sequences([sentence])[0]
    # Generate n-gram sequences for the sentence
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        input_sequences.append(n_gram_sequence)

# Determine the maximum sequence length for padding purposes
max_sequence_length = max(len(seq) for seq in input_sequences)
print(f"Maximum sequence length: {max_sequence_length}")

# Pad sequences so that all are of equal length, pre-padding with zeros
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

print(f"Total number of input sequences: {len(input_sequences)}")
──────────────────────────────
Step 4: Creating the Model
──────────────────────────────
Explanation: Now we build our deep learning model to generate text. We use an embedding layer to convert word indices into dense vectors of fixed size. The LSTM layers capture sequential dependencies in the text, and a Dense layer with softmax activation outputs a probability distribution over the vocabulary for predicting the next word.
──────────────────────────────
Code:
──────────────────────────────
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Construct the Sequential model
model = Sequential()

# 1. Embedding layer: transforms each word index into a 100-dimensional vector.
#    Inputs are one token shorter than the padded sequences because the last
#    token of each sequence is used as the label.
#    (Note: Keras 3 / TF 2.16+ removed the input_length argument; simply omit it there.)
model.add(Embedding(input_dim=total_words, output_dim=100, input_length=max_sequence_length - 1))

# 2. First LSTM layer: returns sequences to feed into the next LSTM layer.
model.add(LSTM(150, return_sequences=True))

# 3. Second LSTM layer: processes the sequence further.
model.add(LSTM(150))

# 4. Dense output layer: predicts the next word's probability distribution.
model.add(Dense(total_words, activation='softmax'))

# Compile the model with categorical crossentropy loss and the Adam optimizer.
# We use 'categorical_crossentropy' because the labels are one-hot encoded in
# Step 5; use 'sparse_categorical_crossentropy' if you keep integer labels.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print("Model architecture successfully compiled.")
──────────────────────────────
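Before training, it can help to sanity-check the architecture with Keras's standard summary(); building on a known input shape first makes the layer output shapes visible:

# Build the model on a known input shape, then print layer output shapes
# and parameter counts to verify the architecture before training.
model.build(input_shape=(None, max_sequence_length - 1))
model.summary()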
Step 5: Training the Model
──────────────────────────────
Explanation: Before training, we separate our input data into predictors (X) and labels (y). The model learns to predict the next word from the preceding sequence of words. We then convert the labels into one-hot encoded vectors and train the model for a desired number of epochs with a suitable batch size.
──────────────────────────────
Code:
──────────────────────────────
from tensorflow.keras.utils import to_categorical

# Split the padded sequences: all columns except the last form the
# predictors (X); the last column is the target word (y).
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

# Convert the labels to one-hot encoding
y = to_categorical(y, num_classes=total_words)

# Fit the model
history = model.fit(X, y, epochs=50, batch_size=64)

print("Training complete.")
──────────────────────────────
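A caveat worth knowing: to_categorical allocates an array of shape (number of sequences, total_words), which can exhaust memory on a corpus as large as Gutenberg's. A minimal alternative sketch keeps the integer labels and switches to the sparse loss instead:

# Memory-friendly variant: keep y as integer word indices (skip to_categorical)
# and use the sparse loss, which expects integer class labels.
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(X, y, epochs=50, batch_size=64)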
Step 6: Generating Text
──────────────────────────────
Explanation: After training our model, we use it to generate text from a user's seed prompt. The function below takes the seed text and generates a specified number of additional words. After each prediction, the new word is appended to the seed text, and the model predicts the next word from the extended sequence.
──────────────────────────────
Code:
──────────────────────────────
def generate_text(seed_text, next_words, model, max_sequence_length):
    """Generate text using the trained model.

    Parameters:
        seed_text (str): Initial text from which to generate further words.
        next_words (int): Number of words to generate.
        model (tf.keras.Model): The trained model.
        max_sequence_length (int): Maximum sequence length for padding.

    Returns:
        str: The generated text.
    """
    for _ in range(next_words):
        # Tokenize and pad the input seed text to the required length.
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_length - 1, padding='pre')

        # Predict the probability distribution over the next word
        predicted = model.predict(token_list, verbose=0)
        predicted_word_index = np.argmax(predicted, axis=-1)[0]

        # Map the predicted index back to the corresponding word
        output_word = None
        for word, index in tokenizer.word_index.items():
            if index == predicted_word_index:
                output_word = word
                break
        if output_word is None:
            # If no word is found (edge case), stop generating.
            break

        # Append the predicted word to the seed text to grow the story.
        seed_text += " " + output_word
    return seed_text

# Test the text generation function with an initial prompt.
generated_story = generate_text("once upon a time", 20, model, max_sequence_length)
print("\nGenerated Story:\n", generated_story)
──────────────────────────────
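generate_text always picks the single most likely next word (greedy argmax), which tends to loop on the same phrases. A common remedy is temperature sampling. Here is a minimal sketch of the sampling step, assuming predicted is the model's probability vector as above; the temperature value 0.8 is an illustrative choice:

import numpy as np

def sample_word_index(probabilities, temperature=0.8):
    """Sample a word index from a probability vector.

    Lower temperature approaches greedy argmax; higher adds randomness.
    """
    probs = np.asarray(probabilities, dtype=np.float64)
    # Rescale the log-probabilities by the temperature and renormalize.
    logits = np.log(probs + 1e-9) / temperature
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / np.sum(exp_logits)
    return int(np.random.choice(len(probs), p=probs))

# Inside the generation loop, replace the argmax line with:
# predicted_word_index = sample_word_index(predicted[0], temperature=0.8)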
Step 7: Building the User Interface with Flask
──────────────────────────────
Explanation: Finally, we create a simple web interface that lets users generate stories by entering a seed text into an HTML form. Flask is a lightweight web framework that makes it easy to put our model behind a user-facing application. The code below sets up a basic Flask web server with routes to render an HTML page and handle form submissions.
──────────────────────────────
Code:
──────────────────────────────
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/')
def home():
    # Render the home page, which has a simple form for entering seed text.
    return render_template('index.html')

@app.route('/generate', methods=['POST'])
def generate():
    # Retrieve the seed text from the submitted form
    seed_text = request.form['seed_text']
    # Generate a story using the provided seed text
    generated_story = generate_text(seed_text, 20, model, max_sequence_length)
    # Render the page again, passing the generated story for display
    return render_template('index.html', generated_story=generated_story)

if __name__ == "__main__":
    # Run the Flask web server in debug mode for development purposes.
    app.run(debug=True)
──────────────────────────────
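As written, the app only works if the model was trained in the same Python session. A minimal sketch of persisting the model instead, so the server does not retrain on every restart; the file name storyteller.keras is an illustrative choice, and the tokenizer and max_sequence_length would also need to be saved (for example with pickle) or rebuilt:

import tensorflow as tf

# After training (Step 5), save the trained model once:
model.save('storyteller.keras')

# In the Flask app, load the saved model at startup instead of retraining:
model = tf.keras.models.load_model('storyteller.keras')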
Additional Note: Creating 'index.html'. Ensure you have an 'index.html' file inside a folder named "templates" in your project directory. This file should include a simple form for entering seed text and a placeholder to display the generated story. A basic 'index.html' might look like this:
──────────────────────────────
<!DOCTYPE html>
<html>
<head>
    <title>Generative AI Storyteller</title>
</head>
<body>
    <h1>Generative AI-Powered Storyteller</h1>

    <form action="/generate" method="post">
        <label for="seed_text">Enter Seed Text:</label>
        <input type="text" id="seed_text" name="seed_text" required>
        <input type="submit" value="Generate">
    </form>

    {% if generated_story %}
    <h2>Generated Story:</h2>
    <p>{{ generated_story }}</p>
    {% endif %}
</body>
</html>
──────────────────────────────

Conclusion: This detailed, step-by-step guide shows you how to develop a generative AI storyteller using Python. Each stage, from setting up the environment and preprocessing text data to building and training an LSTM model and deploying a simple Flask web interface, is accompanied by code and a clear explanation. Experiment with different architectures, extend the preprocessing routines, or expand the web interface to enhance the user experience even further. Enjoy your journey into generative AI and creative storytelling!