Basic Model Fine-tuning: Customizing LLMs with Hugging Face AutoTrain
Objective
The objective of this project is to learn practical model fine-tuning using high-level tools and APIs that abstract away complex training details. Learners will customize pre-trained models for specific tasks using Hugging Face’s AutoTrain, understand when fine-tuning is necessary versus prompt engineering, prepare simple datasets, and evaluate model improvements. This project bridges the gap between using generic models and creating task-specific AI assistants without requiring deep machine learning expertise.
Learning Outcomes
- Understand Fine-tuning Use Cases: Identify when fine-tuning is beneficial versus using prompt engineering or few-shot learning. Compare cost-benefit trade-offs between approaches.
- Prepare Simple Datasets: Format existing data into instruction-response pairs, clean and validate training data, and create proper train/validation splits using spreadsheets or simple scripts.
- Use AutoTrain Interface: Navigate Hugging Face AutoTrain UI and API to upload datasets, configure basic training parameters, and monitor training progress without writing training loops.
- Evaluate Model Improvements: Compare fine-tuned model outputs against base models using simple metrics and qualitative assessment. Understand basic evaluation concepts like overfitting.
- Deploy via Inference API: Use Hugging Face Inference API to serve fine-tuned models, integrate custom models into simple applications, and manage API costs.
- Troubleshoot Common Issues: Debug dataset formatting problems, handle training failures, and optimize for limited budgets and resources.
By achieving these learning outcomes, participants will be able to create customized LLMs for specific business needs using user-friendly tools.
Prerequisites
- Basic Python programming (functions, dictionaries, loops)
- Understanding of what LLMs are and how to use them via APIs
- Familiarity with JSON and CSV file formats
- Access to Google Colab or similar cloud notebook environment
- Hugging Face account (free tier)
- Small budget for training ($5-20) or free credits
Skills & Tools
Skills You’ll Develop
- Data Preparation: Converting existing content into training format, quality checking
- Training Configuration: Selecting appropriate base models, setting epochs and batch sizes
- Cost Management: Estimating training costs, optimizing dataset size
- Model Selection: Choosing between model sizes based on task requirements
- Quality Assessment: A/B testing outputs, identifying improvement areas
- API Integration: Using inference endpoints, managing authentication
Tools You’ll Master
- Hugging Face AutoTrain: No-code/low-code fine-tuning platform
- Google Colab: Cloud-based notebook environment
- Hugging Face Hub: Model repository and versioning
- Gradio: Simple UI creation for model testing
- pandas: Basic data manipulation and CSV handling
- Hugging Face Inference API: Model deployment and serving
Steps and Tasks
Part 1: Understanding When to Fine-tune
1. Fine-tuning vs Alternatives Decision Framework
Learn to make informed decisions about when fine-tuning is worth the investment.
Decision framework and cost analysis
class FineTuningDecisionHelper:
"""Help decide between fine-tuning, prompt engineering, or RAG."""
def __init__(self):
self.decision_factors = {
'task_specificity': 0, # 0-10: How specific is your task?
'data_availability': 0, # 0-10: How much quality data do you have?
'budget': 0, # 0-10: Budget availability
'latency_requirements': 0, # 0-10: Need for fast responses?
'accuracy_needs': 0 # 0-10: How critical is accuracy?
}
def analyze_use_case(self, use_case: str) -> dict:
"""Analyze whether fine-tuning is recommended."""
recommendations = {
'prompt_engineering': {
'suitable_when': [
'Task is general or varies frequently',
'Limited training data (<100 examples)',
'Need to iterate quickly',
'Budget is very limited'
],
'example': 'Generic customer support, creative writing'
},
'few_shot_learning': {
'suitable_when': [
'Have 5-20 good examples',
'Task is well-defined but not unique',
'Need some customization without training'
],
'example': 'Specific format generation, style matching'
},
'fine_tuning': {
'suitable_when': [
'Very specific domain or task',
'Have 500+ quality examples',
'Consistent format/style needed',
'Reducing API costs long-term'
],
'example': 'Medical coding, legal document analysis'
},
'rag': {
'suitable_when': [
'Need up-to-date information',
'Large knowledge base exists',
'Factual accuracy is critical'
],
'example': 'Technical documentation Q&A, policy lookup'
}
}
return recommendations
def estimate_costs(self, num_examples: int, model_size: str = 'small'):
"""Estimate fine-tuning costs."""
# Rough estimates for AutoTrain
cost_per_1k_tokens = {
'small': 0.008, # ~1B parameters
'medium': 0.03, # ~7B parameters
'large': 0.12 # ~13B+ parameters
}
avg_tokens_per_example = 150
total_tokens = num_examples * avg_tokens_per_example
estimated_cost = (total_tokens / 1000) * cost_per_1k_tokens[model_size]
return {
'estimated_cost_usd': round(estimated_cost, 2),
'training_time_hours': round(total_tokens / 100000, 1),
'recommended_model': self._recommend_model(num_examples)
}
def _recommend_model(self, num_examples: int) -> str:
"""Recommend model size based on dataset size."""
if num_examples < 500:
return "distilgpt2 or gpt2-small"
elif num_examples < 2000:
return "gpt2-medium or llama-2-7b"
else:
return "gpt2-large or llama-2-13b"
# Example usage
helper = FineTuningDecisionHelper()
print("Use case analysis:", helper.analyze_use_case("customer support"))
print("Cost estimate:", helper.estimate_costs(num_examples=1000))
2. Creating Your First Dataset
Learn to prepare data in the simple formats required for fine-tuning.
Dataset preparation basics
import pandas as pd
import json
from typing import List, Dict
class SimpleDatasetPreparer:
"""Prepare datasets for AutoTrain fine-tuning."""
def __init__(self):
self.dataset = []
self.validation_errors = []
def create_from_faq(self, faq_file: str) -> pd.DataFrame:
"""Convert FAQ document to training dataset."""
# Example: Read FAQ from CSV
faq_df = pd.read_csv(faq_file)
training_data = []
for _, row in faq_df.iterrows():
# Format as instruction-response
formatted = {
'instruction': row['question'],
'response': row['answer']
}
training_data.append(formatted)
return pd.DataFrame(training_data)
    def create_from_conversations(self, conversations: List[Dict]) -> pd.DataFrame:
        """Convert chat logs to training format."""
        training_data = []
        instruction = None
        for conv in conversations:
            # Pair each user message with the assistant reply that follows it
            if conv['role'] == 'user':
                instruction = conv['content']
            elif conv['role'] == 'assistant' and instruction is not None:
                training_data.append({
                    'instruction': instruction,
                    'response': conv['content']
                })
                instruction = None
        return pd.DataFrame(training_data)
def validate_dataset(self, df: pd.DataFrame) -> Dict:
"""Basic validation for training data."""
issues = []
# Check for empty values
empty_instructions = df['instruction'].isna().sum()
empty_responses = df['response'].isna().sum()
if empty_instructions > 0:
issues.append(f"{empty_instructions} empty instructions found")
if empty_responses > 0:
issues.append(f"{empty_responses} empty responses found")
# Check length distribution
inst_lengths = df['instruction'].str.len()
resp_lengths = df['response'].str.len()
# Flag very short or very long examples
short_inst = (inst_lengths < 10).sum()
long_inst = (inst_lengths > 500).sum()
if short_inst > 0:
issues.append(f"{short_inst} very short instructions")
if long_inst > 0:
issues.append(f"{long_inst} very long instructions")
# Check for duplicates
duplicates = df.duplicated().sum()
if duplicates > 0:
issues.append(f"{duplicates} duplicate examples")
return {
'total_examples': len(df),
'issues': issues,
'ready_for_training': len(issues) == 0,
'instruction_avg_length': inst_lengths.mean(),
'response_avg_length': resp_lengths.mean()
}
def create_train_test_split(self, df: pd.DataFrame, test_size: float = 0.1):
"""Split dataset for training and validation."""
# Simple random split
test_samples = int(len(df) * test_size)
test_df = df.sample(n=test_samples, random_state=42)
train_df = df.drop(test_df.index)
# Save to CSV files for AutoTrain
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)
print(f"Training examples: {len(train_df)}")
print(f"Test examples: {len(test_df)}")
return train_df, test_df
# Example usage
preparer = SimpleDatasetPreparer()
# Create sample dataset
sample_data = [
{'instruction': 'How do I reset my password?',
'response': 'To reset your password, click on "Forgot Password" on the login page...'},
{'instruction': 'What are your business hours?',
'response': 'Our business hours are Monday-Friday, 9 AM to 5 PM EST...'}
]
df = pd.DataFrame(sample_data)
validation = preparer.validate_dataset(df)
print("Validation results:", validation)
Solving the Cold Start Problem
Data Generation and Augmentation Strategies
For learners who don’t have ready-made datasets, here are practical approaches to create training data from scratch.
Synthetic Data Generation Methods
import pandas as pd
import random
from typing import List, Dict, Tuple
from transformers import pipeline
class DataGenerator:
"""Generate synthetic training data for fine-tuning."""
    def __init__(self, base_model: str = "gpt2"):
        # Loaded for optional model-based generation; the methods below rely on
        # simple rule-based heuristics and do not require it
        self.generator = pipeline("text-generation", model=base_model)
def create_from_seed_examples(self, seed_examples: List[Tuple[str, str]],
variations_per_example: int = 3) -> pd.DataFrame:
"""Generate variations of seed examples using paraphrasing."""
training_data = []
for instruction, response in seed_examples:
# Generate instruction variations
instruction_variations = self._paraphrase_text(
instruction,
num_variations=variations_per_example
)
# Generate response variations
response_variations = self._paraphrase_text(
response,
num_variations=variations_per_example
)
# Create combinations
for inst_var in instruction_variations:
for resp_var in response_variations:
training_data.append({
'instruction': inst_var,
'response': resp_var,
'source': 'synthetic_variation'
})
# Also keep the original
training_data.append({
'instruction': instruction,
'response': response,
'source': 'seed_original'
})
return pd.DataFrame(training_data)
def create_from_template(self, template_config: Dict) -> pd.DataFrame:
"""Generate data from predefined templates."""
training_data = []
        # Example: customer support templates (also used as a generic fallback)
        if template_config.get('domain', 'customer_support') in ('customer_support', 'general'):
templates = [
"How do I {action} my {item}?",
"What is your policy on {topic}?",
"Can you help me with {problem}?",
"How long does {process} take?",
"Why is my {item} not {working_state}?"
]
responses = [
"To {action} your {item}, please follow these steps: {steps}",
"Our policy on {topic} is as follows: {policy_details}",
"I can certainly help with {problem}. Here's what you need to do: {solution}",
"The {process} typically takes {time_estimate}.",
"If your {item} is not {working_state}, try these troubleshooting steps: {troubleshooting}"
]
# Fill templates with realistic values
fillers = {
'action': ['reset', 'cancel', 'return', 'track', 'update'],
'item': ['account', 'order', 'password', 'subscription', 'device'],
'topic': ['returns', 'shipping', 'privacy', 'refunds', 'cancellations'],
'problem': ['login issues', 'payment problems', 'delivery delays', 'technical glitches'],
'process': ['shipping', 'processing', 'verification', 'delivery'],
'working_state': ['working', 'responding', 'loading', 'connecting'],
'steps': [
"1. Go to Settings\n2. Click on the option\n3. Confirm your choice",
"1. Visit the help page\n2. Submit a request\n3. Wait for confirmation"
],
'policy_details': [
"we allow returns within 30 days with original packaging",
"shipping is free for orders over $50",
"we process refunds within 5-7 business days"
],
'solution': [
"clear your browser cache and try again",
"check your internet connection and restart the application",
"contact support with your order number for immediate assistance"
],
'time_estimate': ["2-3 business days", "24-48 hours", "5-7 business days"],
'troubleshooting': [
"restart the device and check connections",
"update to the latest software version",
"check for any service outages in your area"
]
}
# Generate examples
            for _ in range(template_config.get('num_examples', 50)):
                # Instruction and response templates are parallel lists,
                # so reuse one index to keep question and answer aligned
                template_idx = random.randint(0, len(templates) - 1)
                instruction = templates[template_idx]
                response = responses[template_idx]
# Replace placeholders
for key, values in fillers.items():
if f"{{{key}}}" in instruction:
instruction = instruction.replace(f"{{{key}}}", random.choice(values))
if f"{{{key}}}" in response:
response = response.replace(f"{{{key}}}", random.choice(values))
training_data.append({
'instruction': instruction,
'response': response,
'source': 'template_generated'
})
return pd.DataFrame(training_data)
def create_from_qg_pipeline(self, context_documents: List[str]) -> pd.DataFrame:
"""Generate Q&A pairs from existing documents (Question-Answer Generation)."""
# Simple rule-based QG for beginners (can be enhanced with proper QG models)
training_data = []
for doc in context_documents:
sentences = doc.split('. ')
for i, sentence in enumerate(sentences):
if len(sentence.split()) > 8: # Only use substantial sentences
# Create simple questions
question = self._sentence_to_question(sentence)
if question:
training_data.append({
'instruction': question,
'response': sentence.strip(),
'source': 'qg_generated'
})
return pd.DataFrame(training_data)
def _paraphrase_text(self, text: str, num_variations: int = 3) -> List[str]:
"""Create paraphrased versions of text using simple transformations."""
variations = [text] # Always include original
# Simple paraphrasing rules (can be enhanced with proper paraphrasing models)
words = text.split()
if len(words) > 4:
# Variation 1: Change word order (if it makes sense)
if "?" not in text and "!" not in text:
try:
# Simple reordering for questions/statements
if text.startswith(('How', 'What', 'When', 'Where', 'Why')):
# Keep question words at start
variations.append(text)
else:
# Try some reordering
if len(words) > 6:
reordered = " ".join(words[2:] + words[:2])
if self._makes_sense(reordered):
variations.append(reordered)
                except Exception:
                    pass
# Variation 2: Synonym replacement for common words
synonyms = {
'how': ['what is the process for', 'what is the way to'],
'can you': ['could you', 'would you be able to', 'is it possible for you to'],
'help': ['assist with', 'support with', 'guide me through'],
'problem': ['issue', 'difficulty', 'challenge'],
'thank you': ['thanks', 'appreciate it', 'thank you so much']
}
for original, replacements in synonyms.items():
if original in text.lower():
for replacement in replacements:
variation = text.lower().replace(original, replacement)
variations.append(variation.capitalize())
# Ensure we have the requested number of variations
while len(variations) < num_variations + 1: # +1 because original is included
variations.append(text) # Fallback to original
return variations[:num_variations + 1]
def _sentence_to_question(self, sentence: str) -> str:
"""Convert a statement into a simple question."""
sentence = sentence.strip()
if not sentence or len(sentence) < 20:
return None
# Simple rule-based conversion
words = sentence.split()
if len(words) < 5:
return None
# Look for key phrases to form questions
if 'can' in sentence.lower() and 'by' in sentence.lower():
return "How can I accomplish this?"
elif 'should' in sentence.lower():
return "What should I do in this situation?"
elif 'must' in sentence.lower():
return "What are the requirements for this?"
elif 'because' in sentence.lower():
return "Why is this important?"
elif 'after' in sentence.lower() or 'before' in sentence.lower():
return "When should this be done?"
# Default question
return "Can you tell me more about this?"
def _makes_sense(self, text: str) -> bool:
"""Basic check if text makes grammatical sense."""
# Simple heuristic - check if it starts with capital and ends with punctuation
if len(text) < 10:
return False
if not text[0].isupper():
return False
if text[-1] not in '.!?':
return False
return True
# Example usage
generator = DataGenerator()
# Method 1: From seed examples
seed_examples = [
("How do I reset my password?", "You can reset your password by clicking 'Forgot Password' on the login page."),
("What are your business hours?", "Our customer support is available Monday to Friday, 9 AM to 6 PM EST.")
]
synthetic_data = generator.create_from_seed_examples(seed_examples, variations_per_example=2)
print(f"Generated {len(synthetic_data)} examples from seeds")
# Method 2: From templates
template_config = {
'domain': 'customer_support',
'num_examples': 20
}
template_data = generator.create_from_template(template_config)
print(f"Generated {len(template_data)} examples from templates")
# Combine all data
all_training_data = pd.concat([synthetic_data, template_data], ignore_index=True)
print(f"Total training examples: {len(all_training_data)}")
Data Quality Assessment and Filtering
Quality filtering for synthetic data
class DataQualityFilter:
"""Filter and clean generated training data."""
def __init__(self):
self.quality_metrics = {}
def assess_quality(self, df: pd.DataFrame) -> pd.DataFrame:
"""Assess quality of generated examples."""
quality_scores = []
for _, row in df.iterrows():
score = self._calculate_quality_score(row['instruction'], row['response'])
quality_scores.append(score)
df['quality_score'] = quality_scores
df['is_high_quality'] = df['quality_score'] >= 0.7
return df
def _calculate_quality_score(self, instruction: str, response: str) -> float:
"""Calculate quality score between 0 and 1."""
score = 0.0
# Length appropriateness (20%)
inst_words = len(instruction.split())
resp_words = len(response.split())
if 5 <= inst_words <= 30:
score += 0.2
elif 3 <= inst_words <= 50:
score += 0.1
if 10 <= resp_words <= 150:
score += 0.2
elif 5 <= resp_words <= 200:
score += 0.1
# Grammar and formatting (30%)
if instruction.strip() and instruction[0].isupper() and instruction[-1] in '.!?':
score += 0.15
if response.strip() and response[0].isupper() and response[-1] in '.!?':
score += 0.15
        # Diversity and uniqueness (20%) -- guard against empty strings
        inst_tokens = instruction.lower().split()
        resp_tokens = response.lower().split()
        if inst_tokens and len(set(inst_tokens)) / len(inst_tokens) > 0.7:
            score += 0.1
        if resp_tokens and len(set(resp_tokens)) / len(resp_tokens) > 0.6:
            score += 0.1
# Content quality (30%)
if not any(word in instruction.lower() for word in ['test', 'example', 'placeholder']):
score += 0.15
if not any(word in response.lower() for word in ['test', 'example', 'placeholder']):
score += 0.15
return min(score, 1.0)
def filter_low_quality(self, df: pd.DataFrame, min_quality: float = 0.6) -> pd.DataFrame:
"""Remove low-quality examples."""
if 'quality_score' not in df.columns:
df = self.assess_quality(df)
filtered_df = df[df['quality_score'] >= min_quality].copy()
removed_count = len(df) - len(filtered_df)
print(f"Removed {removed_count} low-quality examples")
print(f"Keeping {len(filtered_df)} high-quality examples")
return filtered_df
def remove_duplicates(self, df: pd.DataFrame, similarity_threshold: float = 0.8) -> pd.DataFrame:
"""Remove near-duplicate examples."""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Combine instruction and response for similarity check
texts = (df['instruction'] + " " + df['response']).tolist()
# Calculate TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
try:
tfidf_matrix = vectorizer.fit_transform(texts)
# Calculate cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)
# Find duplicates
duplicates = set()
for i in range(len(similarity_matrix)):
for j in range(i+1, len(similarity_matrix)):
if similarity_matrix[i][j] > similarity_threshold:
duplicates.add(j)
# Remove duplicates
filtered_df = df.drop(df.index[list(duplicates)]).reset_index(drop=True)
print(f"Removed {len(duplicates)} duplicate examples")
return filtered_df
except Exception as e:
print(f"Error in duplicate removal: {e}")
return df
# Example usage
quality_filter = DataQualityFilter()  # Avoid shadowing Python's built-in filter()
# Assess quality of generated data
quality_df = quality_filter.assess_quality(all_training_data)
print(f"High-quality examples: {quality_df['is_high_quality'].sum()}/{len(quality_df)}")
# Filter out low-quality data
filtered_data = quality_filter.filter_low_quality(quality_df, min_quality=0.6)
final_data = quality_filter.remove_duplicates(filtered_data)
print(f"Final dataset size: {len(final_data)} examples")
Integration with Existing Dataset Preparation
Enhanced Dataset Preparer
# Add this method to the existing SimpleDatasetPreparer class
def enhance_with_synthetic_data(self, df: pd.DataFrame,
target_size: int = 500,
use_templates: bool = True) -> pd.DataFrame:
"""Enhance small dataset with synthetic examples."""
current_size = len(df)
if current_size >= target_size:
print(f"Dataset already has {current_size} examples, no enhancement needed")
return df
needed = target_size - current_size
print(f"Generating {needed} synthetic examples...")
generator = DataGenerator()
filter = DataQualityFilter()
# Convert existing data to seed format
seed_examples = []
for _, row in df.iterrows():
seed_examples.append((row['instruction'], row['response']))
# Generate synthetic data
synthetic_df = generator.create_from_seed_examples(
seed_examples,
variations_per_example=max(2, needed // len(seed_examples))
)
# Add template data if needed
if use_templates and len(synthetic_df) < needed:
template_config = {
'domain': 'general', # Could be detected from existing data
'num_examples': needed - len(synthetic_df)
}
template_df = generator.create_from_template(template_config)
synthetic_df = pd.concat([synthetic_df, template_df], ignore_index=True)
# Filter for quality
synthetic_df = filter.filter_low_quality(synthetic_df)
synthetic_df = filter.remove_duplicates(synthetic_df)
# Combine with original
enhanced_df = pd.concat([df, synthetic_df], ignore_index=True)
enhanced_df = enhanced_df.sample(frac=1).reset_index(drop=True) # Shuffle
print(f"Enhanced dataset from {current_size} to {len(enhanced_df)} examples")
return enhanced_df
Practical Usage Example
End-to-end cold start workflow
# Complete workflow for cold start
def create_dataset_from_scratch():
"""Complete workflow for creating training data from scratch."""
# Step 1: Start with whatever you have (even just 2-3 examples)
starter_examples = [
{
'instruction': 'How do I contact support?',
'response': 'You can contact our support team by email at support@company.com or phone at 1-800-HELP.'
},
{
'instruction': 'Where can I find pricing information?',
'response': 'Our pricing plans are available on our website at company.com/pricing.'
}
]
starter_df = pd.DataFrame(starter_examples)
# Step 2: Enhance with synthetic data
preparer = SimpleDatasetPreparer()
enhanced_df = preparer.enhance_with_synthetic_data(
starter_df,
target_size=100, # Aim for 100 examples
use_templates=True
)
# Step 3: Validate the enhanced dataset
validation = preparer.validate_dataset(enhanced_df)
print("Enhanced dataset validation:", validation)
# Step 4: Create train/test split
train_df, test_df = preparer.create_train_test_split(enhanced_df)
return train_df, test_df
# Run the cold start pipeline
train_data, test_data = create_dataset_from_scratch()
These strategies help learners overcome the initial data hurdle, making the fine-tuning project accessible even when starting with only a handful of training examples. The synthetic data generation maintains quality through filtering and uses multiple augmentation strategies to create diverse, useful training examples.
Part 2: Using Hugging Face AutoTrain
1. AutoTrain Setup and Configuration
Learn to use the no-code AutoTrain interface for fine-tuning.
AutoTrain setup guide
import os
from huggingface_hub import HfApi, login
import requests
class AutoTrainSetup:
"""Setup and configure AutoTrain for fine-tuning."""
def __init__(self, hf_token: str = None):
"""Initialize with Hugging Face token."""
if hf_token:
login(token=hf_token)
else:
# Prompt for token if not provided
login()
self.api = HfApi()
def create_autotrain_project(self, project_name: str, task: str = "text-generation"):
"""Create a new AutoTrain project."""
# AutoTrain project configuration
config = {
'project_name': project_name,
'task': task,
'language': 'en',
'max_models': 1, # Train one model to save costs
'dataset_split': {
'train': 'train.csv',
'validation': 'test.csv'
}
}
return config
def upload_dataset_to_hub(self, dataset_path: str, repo_name: str):
"""Upload dataset to Hugging Face Hub for AutoTrain."""
try:
            # Create dataset repository (exist_ok avoids failing if it already exists)
            self.api.create_repo(
                repo_id=repo_name,
                repo_type="dataset",
                private=True,  # Keep dataset private
                exist_ok=True
            )
# Upload files
self.api.upload_file(
path_or_fileobj=f"{dataset_path}/train.csv",
path_in_repo="train.csv",
repo_id=repo_name,
repo_type="dataset"
)
self.api.upload_file(
path_or_fileobj=f"{dataset_path}/test.csv",
path_in_repo="test.csv",
repo_id=repo_name,
repo_type="dataset"
)
print(f"Dataset uploaded to: huggingface.co/datasets/{repo_name}")
return f"datasets/{repo_name}"
except Exception as e:
print(f"Error uploading dataset: {e}")
return None
def configure_training_params(self, base_model: str = "gpt2"):
"""Configure training parameters for AutoTrain."""
# Simple configuration for beginners
params = {
'base_model': base_model,
'num_train_epochs': 3, # Start with 3 epochs
'batch_size': 4, # Small batch size for free tier
'learning_rate': 2e-5, # Standard learning rate
'warmup_ratio': 0.1,
'gradient_accumulation': 4, # Simulate larger batch
'max_tokens': 512, # Maximum sequence length
'save_steps': 100,
'logging_steps': 10
}
# Estimate training time and cost
self._estimate_training_metrics(params)
return params
def _estimate_training_metrics(self, params: dict):
"""Estimate training time and cost."""
        # Rough estimates (check distilgpt2 first, since 'gpt2' is a substring of it)
        if 'distilgpt2' in params['base_model']:
            cost_per_hour = 0.40
            tokens_per_second = 1500
        elif 'gpt2' in params['base_model']:
            cost_per_hour = 0.60  # GPU cost estimate
            tokens_per_second = 1000
        else:
            cost_per_hour = 1.20
            tokens_per_second = 500
        # tokens_per_second could be used to refine the time estimate
        print(f"Estimated cost per hour: ${cost_per_hour}")
        print("Recommended training time: 1-2 hours for small datasets")
# Example usage
setup = AutoTrainSetup()
config = setup.create_autotrain_project("my-custom-assistant")
params = setup.configure_training_params(base_model="distilgpt2")
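The example above creates the project config and training parameters but does not exercise the upload step. Here is a short continuation, assuming train.csv and test.csv sit in the current directory and that "your-username/my-dataset" is a placeholder repository id:
# Upload the prepared CSVs so AutoTrain can reference them on the Hub
dataset_repo = setup.upload_dataset_to_hub(
    dataset_path=".",                      # Folder containing train.csv and test.csv
    repo_name="your-username/my-dataset"   # Placeholder; use your own namespace
)
print("Project:", config['project_name'])
print("Dataset reference for AutoTrain:", dataset_repo)
print("Training parameters:", params)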
2. Monitoring Training Progress
Track and understand training metrics without deep ML knowledge.
Training monitoring basics
import matplotlib.pyplot as plt
import pandas as pd
from typing import List, Dict
class TrainingMonitor:
"""Monitor AutoTrain progress with simple metrics."""
def __init__(self):
self.training_history = []
def explain_metrics(self) -> Dict[str, str]:
"""Explain training metrics in simple terms."""
explanations = {
'loss': """
Loss (lower is better):
Think of this as the model's 'error score'.
It should decrease as training progresses.
If it stops decreasing, training might be done.
""",
'learning_rate': """
Learning Rate:
How big of steps the model takes when learning.
Usually decreases over time for stability.
""",
'epoch': """
Epoch:
One complete pass through all training data.
Most fine-tuning needs 3-5 epochs.
""",
'eval_loss': """
Validation Loss:
Error score on test data the model hasn't seen.
If this increases while training loss decreases,
the model is overfitting (memorizing, not learning).
"""
}
return explanations
def check_for_overfitting(self, train_loss: List[float], val_loss: List[float]) -> bool:
"""Simple overfitting detection."""
if len(train_loss) < 5 or len(val_loss) < 5:
return False # Not enough data
# Check if validation loss is increasing while training decreases
recent_train = train_loss[-5:]
recent_val = val_loss[-5:]
train_improving = recent_train[-1] < recent_train[0]
val_worsening = recent_val[-1] > recent_val[0]
if train_improving and val_worsening:
print("⚠️ Warning: Model might be overfitting!")
print("Training is improving but validation is getting worse.")
print("Consider stopping training or using more data.")
return True
return False
def plot_training_progress(self, metrics: Dict[str, List[float]]):
"""Create simple training visualization."""
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Plot losses
if 'train_loss' in metrics and 'eval_loss' in metrics:
axes[0].plot(metrics['train_loss'], label='Training Loss')
axes[0].plot(metrics['eval_loss'], label='Validation Loss')
axes[0].set_xlabel('Steps')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Progress')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Plot learning rate
if 'learning_rate' in metrics:
axes[1].plot(metrics['learning_rate'])
axes[1].set_xlabel('Steps')
axes[1].set_ylabel('Learning Rate')
axes[1].set_title('Learning Rate Schedule')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def should_stop_training(self, metrics: Dict) -> bool:
"""Simple early stopping logic."""
if 'eval_loss' not in metrics:
return False
eval_losses = metrics['eval_loss']
if len(eval_losses) < 10:
return False # Too early to tell
# Check if validation loss hasn't improved in last 5 checks
recent_losses = eval_losses[-5:]
best_recent = min(recent_losses)
best_overall = min(eval_losses[:-5])
if best_recent > best_overall:
print("Validation loss hasn't improved recently.")
print("Consider stopping training to save costs.")
return True
return False
# Example usage
monitor = TrainingMonitor()
print(monitor.explain_metrics()['loss'])
# Simulate some metrics
sample_metrics = {
'train_loss': [1.5, 1.3, 1.1, 0.9, 0.7, 0.5],
'eval_loss': [1.4, 1.2, 1.0, 0.95, 1.0, 1.1]
}
monitor.check_for_overfitting(
sample_metrics['train_loss'],
sample_metrics['eval_loss']
)
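You can feed the same simulated metrics to the plotting and early-stopping helpers (matplotlib renders the figure inline in Colab):
# Visualize the simulated run and check whether early stopping is advised
monitor.plot_training_progress(sample_metrics)
if monitor.should_stop_training(sample_metrics):
    print("Early stopping recommended to save costs.")
else:
    print("Not enough evaluation points yet; keep monitoring.")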
Part 3: Testing and Evaluation
1. Simple Model Evaluation
Compare your fine-tuned model against the base model.
Basic evaluation methods
from transformers import pipeline
import pandas as pd
from typing import List, Dict
class SimpleModelEvaluator:
"""Evaluate fine-tuned models with simple methods."""
def __init__(self, base_model_name: str, finetuned_model_name: str):
self.base_pipeline = pipeline("text-generation", model=base_model_name)
self.finetuned_pipeline = pipeline("text-generation", model=finetuned_model_name)
self.test_results = []
def create_test_prompts(self) -> List[str]:
"""Create test prompts relevant to your use case."""
# Example for customer service fine-tuning
test_prompts = [
"How can I track my order?",
"What is your return policy?",
"My product arrived damaged, what should I do?",
"How long does shipping usually take?",
"Can I change my delivery address?"
]
return test_prompts
def compare_outputs(self, prompt: str, max_length: int = 100) -> Dict:
"""Compare outputs from both models."""
# Generate with base model
base_output = self.base_pipeline(
prompt,
max_length=max_length,
temperature=0.7,
do_sample=True,
pad_token_id=50256
)[0]['generated_text']
# Generate with fine-tuned model
finetuned_output = self.finetuned_pipeline(
prompt,
max_length=max_length,
temperature=0.7,
do_sample=True,
pad_token_id=50256
)[0]['generated_text']
return {
'prompt': prompt,
'base_response': base_output.replace(prompt, '').strip(),
'finetuned_response': finetuned_output.replace(prompt, '').strip()
}
def simple_quality_scores(self, response: str, criteria: Dict) -> Dict:
"""Score responses on simple criteria."""
scores = {}
# Length appropriateness
word_count = len(response.split())
if 20 <= word_count <= 100:
scores['length_appropriate'] = 1.0
elif 10 <= word_count <= 150:
scores['length_appropriate'] = 0.5
else:
scores['length_appropriate'] = 0.0
# Check for key phrases (domain-specific)
if criteria.get('required_phrases'):
phrases_found = sum(
1 for phrase in criteria['required_phrases']
if phrase.lower() in response.lower()
)
scores['key_phrases'] = phrases_found / len(criteria['required_phrases'])
        # Basic coherence (ends with punctuation, no repetition)
        stripped = response.strip()
        scores['ends_properly'] = 1.0 if stripped and stripped[-1] in '.!?' else 0.0
# Check for obvious repetition
words = response.lower().split()
unique_ratio = len(set(words)) / len(words) if words else 0
scores['no_repetition'] = 1.0 if unique_ratio > 0.7 else 0.0
return scores
def run_a_b_test(self, test_prompts: List[str]) -> pd.DataFrame:
"""Run A/B test between models."""
results = []
for prompt in test_prompts:
comparison = self.compare_outputs(prompt)
# Score both responses
base_scores = self.simple_quality_scores(
comparison['base_response'],
{'required_phrases': []}
)
finetuned_scores = self.simple_quality_scores(
comparison['finetuned_response'],
{'required_phrases': []}
)
results.append({
'prompt': prompt,
'base_avg_score': sum(base_scores.values()) / len(base_scores),
'finetuned_avg_score': sum(finetuned_scores.values()) / len(finetuned_scores),
'improvement': sum(finetuned_scores.values()) - sum(base_scores.values())
})
df = pd.DataFrame(results)
# Summary statistics
print("\n=== A/B Test Results ===")
print(f"Average Base Model Score: {df['base_avg_score'].mean():.2f}")
print(f"Average Fine-tuned Score: {df['finetuned_avg_score'].mean():.2f}")
print(f"Average Improvement: {df['improvement'].mean():.2f}")
print(f"Fine-tuned wins: {(df['improvement'] > 0).sum()}/{len(df)} prompts")
return df
# Example usage (with dummy models for illustration)
evaluator = SimpleModelEvaluator(
base_model_name="gpt2",
finetuned_model_name="username/my-finetuned-model"
)
test_prompts = evaluator.create_test_prompts()
results_df = evaluator.run_a_b_test(test_prompts)
2. Creating a Demo Interface
Build a simple interface to showcase your fine-tuned model.
Gradio demo creation
import gradio as gr
from transformers import pipeline
class ModelDemo:
"""Create interactive demo for fine-tuned model."""
def __init__(self, model_name: str):
self.pipeline = pipeline("text-generation", model=model_name)
self.conversation_history = []
def create_gradio_interface(self):
"""Create a simple Gradio interface."""
def generate_response(prompt, temperature, max_length):
"""Generate response from model."""
# Add to history
self.conversation_history.append(f"User: {prompt}")
# Generate response
response = self.pipeline(
prompt,
max_length=max_length,
temperature=temperature,
do_sample=True,
pad_token_id=50256
)[0]['generated_text']
# Clean response
clean_response = response.replace(prompt, '').strip()
self.conversation_history.append(f"Assistant: {clean_response}")
# Return last 5 exchanges
history_text = "\n".join(self.conversation_history[-10:])
return clean_response, history_text
# Create interface
interface = gr.Interface(
fn=generate_response,
inputs=[
gr.Textbox(
label="Enter your prompt",
placeholder="Ask me anything...",
lines=2
),
gr.Slider(
minimum=0.1, maximum=1.0, value=0.7, step=0.1,
label="Temperature (creativity)"
),
gr.Slider(
minimum=50, maximum=300, value=100, step=10,
label="Max Length"
)
],
outputs=[
gr.Textbox(label="Model Response", lines=3),
gr.Textbox(label="Conversation History", lines=10)
],
title="Fine-tuned Model Demo",
description="Test your fine-tuned model with different prompts and settings.",
examples=[
["How can I help you today?", 0.7, 100],
["Tell me about your features.", 0.5, 150],
["What's the weather like?", 0.8, 100]
]
)
return interface
def create_comparison_interface(self, base_model_name: str):
"""Create interface comparing base and fine-tuned models."""
base_pipeline = pipeline("text-generation", model=base_model_name)
def compare_models(prompt, temperature):
"""Generate responses from both models."""
# Base model response
base_response = base_pipeline(
prompt,
max_length=100,
temperature=temperature,
do_sample=True,
pad_token_id=50256
)[0]['generated_text'].replace(prompt, '').strip()
# Fine-tuned model response
finetuned_response = self.pipeline(
prompt,
max_length=100,
temperature=temperature,
do_sample=True,
pad_token_id=50256
)[0]['generated_text'].replace(prompt, '').strip()
return base_response, finetuned_response
comparison_interface = gr.Interface(
fn=compare_models,
inputs=[
gr.Textbox(label="Prompt", lines=2),
gr.Slider(0.1, 1.0, value=0.7, label="Temperature")
],
outputs=[
gr.Textbox(label="Base Model", lines=3),
gr.Textbox(label="Fine-tuned Model", lines=3)
],
title="Model Comparison Demo",
description="Compare responses from base and fine-tuned models side by side."
)
return comparison_interface
# Example usage
demo = ModelDemo("username/my-finetuned-model")
interface = demo.create_gradio_interface()
# interface.launch() # Uncomment to launch
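3. Deploying via the Inference API
Serve your fine-tuned model without managing infrastructure. The snippet below is a minimal sketch, assuming your fine-tuned model has been pushed to the Hub (the repo id "username/my-finetuned-model" is a placeholder) and that a Hugging Face access token is available; it uses huggingface_hub's InferenceClient rather than calling the raw HTTPS endpoint directly.
Inference API deployment sketch
import os
from huggingface_hub import InferenceClient
class InferenceAPIClient:
    """Query a fine-tuned model hosted on the Hugging Face Inference API."""
    def __init__(self, model_id: str, hf_token: str = None):
        # model_id is a placeholder such as "username/my-finetuned-model"
        token = hf_token or os.environ.get("HF_TOKEN")
        self.client = InferenceClient(model=model_id, token=token)
    def ask(self, prompt: str, max_new_tokens: int = 100, temperature: float = 0.7) -> str:
        """Send a prompt and return the generated continuation."""
        # Free-tier calls are rate limited; cache or batch requests to manage costs
        return self.client.text_generation(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True
        )
# Example usage (replace the placeholder model id with your own)
# api_client = InferenceAPIClient("username/my-finetuned-model")
# print(api_client.ask("How do I reset my password?"))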
Assessment Readiness Indicators
By completing this module, you should be able to:
- Determine when fine-tuning is the right solution versus using prompting or RAG
- Prepare and validate datasets in the correct format for AutoTrain
- Use Hugging Face AutoTrain to fine-tune models without writing training code
- Understand basic training metrics and identify common issues like overfitting
- Compare fine-tuned models against base models using simple evaluation methods
- Deploy fine-tuned models using Hugging Face Inference API
- Create interactive demos to showcase model improvements
- Estimate costs and training time for different dataset sizes