Code Along for Dual NLP - GPT API & spaCy

Steps and Tasks

spaCy:

1. Preprocess the text data

  • Purge unnecessary characters or symbols: Use regular expressions (regex) to replace or remove unwanted characters from the text. You can refer to the Python re module documentation for more information and examples: re - Regular expression operations.
  • Deploy spaCy to tokenize the text into sentences: Import the language model from spaCy (en_core_web_sm, for instance), and use it to process your text and split it into sentences. You can follow the spaCy documentation on how to install and use spaCy and its language models: spaCy - Usage.
  • Clean and normalize sentences for optimal performance: This might include transforming all text to lower case, removing punctuation, and possibly even lemmatizing words (i.e., reducing them to their root form). You can use spaCy’s built-in functionality for these tasks. The spaCy documentation provides examples and explanations of how to perform text preprocessing: spaCy - Processing Pipelines.
Example code: click to view only if you are having trouble getting started!
import spacy
from spacy.lang.en import English

# Load a spaCy model
nlp = spacy.load("en_core_web_sm")

# Preprocess the text data
def preprocess_text(text):
    # Remove unnecessary characters or symbols using regex
    cleaned_text = text.replace("[^a-zA-Z0-9]", " ")

    # Tokenize the text into sentences using spaCy
    doc = nlp(cleaned_text)
    sentences = [sent.text for sent in doc.sents]

    # Clean and normalize sentences
    cleaned_sentences = []
    for sent in sentences:
        # Convert to lowercase
        sent = sent.lower()
        # Remove punctuation
        sent = sent.translate(str.maketrans("", "", string.punctuation))
        # Lemmatize words
        sent = " ".join([token.lemma_ for token in nlp(sent)])
        cleaned_sentences.append(sent)

    return cleaned_sentences

# Example usage
text = "This is an example text. It contains multiple sentences."
cleaned_sentences = preprocess_text(text)
print(cleaned_sentences)

2. Implement extractive summarization

  • Load the preprocessed text into spaCy: Use the nlp() function in spaCy, where nlp is the language model you loaded before.

  • Employ an extractive summarization algorithm to isolate key sentences: One approach could be ranking sentences based on their “importance” (e.g., using the frequency of words). You can refer to the spaCy documentation on how to work with sentences and perform sentence ranking: spaCy - Sentence Segmentation.

  • Stitch together the key sentences to form a concise summary: After ranking and selecting the sentences, concatenate them to form the summary.

  • Evaluate the quality of the extractive summary using metrics like ROUGE or BLEU: You can use libraries like NLTK and the rouge_score library to evaluate the summaries. The NLTK documentation provides information on how to use the nltk.translate.bleu_score module for BLEU score evaluation: NLTK - BLEU Score. The rouge_score library documentation offers examples and explanations of how to calculate ROUGE scores: rouge_score - README.

    These metrics enable the measurement of machine-generated summary quality by comparing them to a reference or “gold standard” summary. To conduct such evaluations, a collection of high-quality reference summaries is required against which the machine-generated summaries can be compared. Although creating your own golden copy can be challenging, you can start with established datasets:

  • CNN/Daily Mail Dataset: This widely used dataset comprises news articles from CNN and Daily Mail, accompanied by human-written highlights that serve as reference summaries.

  • The New York Times Annotated Corpus: This corpus contains 1.8 million articles from The New York Times, complete with summaries. Please note that accessing this dataset requires licensing.

  • PubMed: PubMed is a dataset well-suited for biomedical text summarization. It includes biomedical articles along with their abstracts serving as reference summaries.

  • XSum: XSum presents a challenging dataset with extreme summarization tasks, where each document is associated with a concise, one-sentence summary.

  • BigPatent: BigPatent is a dataset suitable for those interested in technical or legal text summarization. It comprises patent documents along with their abstracts serving as reference summaries.

By using these established datasets as reference summaries, you can evaluate the quality and effectiveness of your machine-generated extractive summaries using ROUGE or BLEU metrics, gaining valuable insights into the performance of your summarization techniques.

Example code: Click to view the code after you have given it a shot!
import spacy

# Load a spaCy model
nlp = spacy.load("en_core_web_sm")

# Preprocessed sentences
sentences = ["This is sentence 1.", "This is sentence 2.", "This is sentence 3."]

# Implement extractive summarization
def extractive_summarization(sentences):
    # Load the preprocessed sentences into spaCy
    doc = nlp(" ".join(sentences))
    
    # Rank sentences based on importance (e.g., word frequency)
    sentence_scores = {}
    for sent in doc.sents:
        for token in sent:
            if token.is_stop:
                continue
            if token.text not in sentence_scores:
                sentence_scores[token.text] = 1
            else:
                sentence_scores[token.text] += 1
    
    # Select the top-ranked sentences
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:2]
    
    # Stitch together the key sentences to form a concise summary
    summary = " ".join(summary_sentences)
    
    return summary

# Example usage
summary = extractive_summarization(sentences)
print(summary)

3. Improve the summarization application

  • Experiment with different sentence segmentation methods: SpaCy uses a dependency parse-based sentence segmentation method by default, but you can also use rule-based methods for segmentation. You can explore different segmentation techniques and compare their impact on the quality of extractive summaries. The spaCy documentation provides details on rule-based sentence segmentation: spaCy - Rule-based Matching.

  • Leverage Named Entity Recognition (NER) and other linguistic features: SpaCy offers a wide range of linguistic features that can be used to better understand and summarize the text. For example, you can prioritize sentences with recognized entities or specific parts of speech to create more informative summaries. The spaCy documentation provides information on Named Entity Recognition and other linguistic annotations: spaCy - Linguistic Features.

  • Customize the pipeline: SpaCy allows you to customize its processing pipeline by adding or removing different components (e.g., parser, tagger, NER). You can experiment with different configurations to see how they affect the quality of the summaries. The spaCy documentation provides a detailed guide on custom pipeline components: spaCy - Custom Pipeline Components.

  • Experiment with text categorization: Text categorization can be used to summarize text by categorizing it into predefined categories. You can experiment with different text categorization models and techniques within spaCy to enhance your summarization approach. The spaCy documentation provides information on text categorization using the TextCategorizer: spaCy - TextCategorizer.

  • Fine-tuning models: SpaCy provides support for fine-tuning its models on custom datasets. If you have a dataset for which you would like to optimize your summaries, fine-tuning your spaCy model on this data could potentially improve summary quality. The spaCy documentation offers a detailed guide on how to fine-tune models: spaCy - Fine-tuning.

Final code snippet for the SpaCy portion!
import spacy

# Load a spaCy model
nlp = spacy.load("en_core_web_sm")

# Custom sentence segmentation
def custom_sentencizer(doc):
    for i in range(len(doc) - 1):
        if doc[i].text == ';':
            doc[i+1].is_sent_start = True
    return doc

# Add the custom sentencizer into the pipeline
nlp.add_pipe(custom_sentencizer, before='parser')

# Remove Named Entity Recognition (NER)
nlp.remove_pipe("ner")

# Fine-tuning
# Define training data, then train the NER model
TRAIN_DATA = [
    ("This is an example sentence.", {"entities": [(5, 8, "CUSTOM_LABEL")]}),
    ("Another sentence here.", {"entities": [(9, 15, "CUSTOM_LABEL")]}),
]

ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('CUSTOM_LABEL')

optimizer = nlp.begin_training()
for i in range(10):  # Number of training iterations
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)


GPT APIs:

1. Implement abstractive summarization using the GPT API

  • Initialize your OpenAI API credentials: Follow the OpenAI API documentation to correctly set up your API keys. You can refer to the OpenAI API documentation for instructions on how to get started: OpenAI API Documentation.

  • Strategically define the prompt for summary generation: The prompt should be clear and specific to guide the model in generating a useful summary. Consider providing context and instructions to the model. You can refer to the OpenAI Cookbook’s guide on how to construct prompts for text generation: OpenAI Cookbook - Prompt Engineering.

  • Leverage the GPT API to generate a thoughtful, coherent abstractive summary: Use the openai.ChatCompletion.create() function to generate a summary with the GPT API. You can refer to the OpenAI API documentation for information on how to make requests using the API: OpenAI API Documentation.

  • Assess the quality of the abstractive summary using human judgment or conventional evaluation metrics: Consider readability, coherence, and how well the summary captures the main points of the text. You can follow established evaluation methods for text summarization or seek human feedback for evaluation.

2. Improve the summarization application

  • Experiment with different parameters and strategies to improve summary quality

    • You can adjust parameters like temperature and max tokens in the GPT API to improve the quality of the generated summaries. Experiment with different values to achieve the desired output. The OpenAI API documentation provides details on how to adjust parameters: OpenAI API Documentation.
  • Add advanced features like length control, fine-tuning, or custom prompt engineering

    • To control the length of the generated summaries, you can adjust the max_tokens parameter in the GPT API request. You can also explore fine-tuning techniques to optimize the model for your specific summarization task. The OpenAI Cookbook provides a guide on how to fine-tune the GPT model: OpenAI Cookbook - Fine-tuning.
  • Explore other GPT models or architectures and their impact on summary quality

    • OpenAI offers various GPT models and architectures (e.g., gpt-3.5-turbo, text-davinci-002). You can experiment with different versions of the GPT model to evaluate their impact on the quality of the generated summaries. The OpenAI API documentation provides information on the available models: OpenAI API Documentation.
Python code snippet that demonstrates these guidelines
import openai

openai.api_key = 'your-api-key'

# Experiment with different parameters to improve summary quality
response = openai.Completion.create(
  engine="text-davinci-002",  # Explore other GPT models
  prompt="Translate the following English text to French: '{}'",
  max_tokens=60,  # Adjust the max tokens parameter to control length
  temperature=0.5  # Adjust the temperature parameter
)

# Advanced features - custom prompt engineering
prompt = """
Title: AI in Healthcare
Content: Artificial Intelligence is revolutionizing the healthcare industry. It allows the prediction and diagnosis of diseases more accurately. By leveraging machine learning algorithms, AI enables doctors to make better decisions about diagnosis and treatment. It can also help predict patient outcomes and automate administrative tasks, thus saving time for healthcare professionals.

Please summarize the content above.
"""

response = openai.Completion.create(
  engine="text-davinci-002",
  prompt=prompt,
  max_tokens=60
)

# Fine-tuning is more advanced and is typically done with a custom dataset and more computation.
# However, this is the basic idea:

training_args = {
    'learning_rate': 1e-5,
    'weight_decay': 0.01,
    'adam_epsilon': 1e-6,
    'max_grad_norm': 1.0,
    'num_train_epochs': 3,  # The total number of training epochs
    'warmup_steps': 100,
    'logging_steps': 10,
}

# Model and Tokenizer settings
model_config = {
    'max_length': 1024,
    'temperature': 0.7,
    'top_p': 0.8,
    'num_return_sequences': 1
}

# Then we need to create a PyTorch dataloader and a Trainer instance for fine-tuning
# For more details, refer to OpenAI fine-tuning guide.

In this code snippet, the OpenAI API key is set, and different parameters are experimented with to improve the summary quality. The openai.Completion.create() function is used to generate summaries using the GPT API, with parameters such as engine, prompt, max_tokens, and temperature adjusted according to the guidelines.

The snippet also demonstrates advanced features like custom prompt engineering, where a specific prompt is used to guide the summary generation. Additionally, it mentions that fine-tuning is a more advanced process typically performed with a custom dataset and additional computation, and refers to the OpenAI fine-tuning guide for more information.

Please note that fine-tuning requires further steps and details that go beyond the scope of the provided snippet. The code should be adjusted and expanded based on the specific fine-tuning requirements and guidelines provided by OpenAI.