Basic Model Fine-tuning: Customizing LLMs with Hugging Face AutoTrain
Objective
The objective of this project is to learn practical model fine-tuning using high-level tools and APIs that abstract away complex training details. Learners will customize pre-trained models for specific tasks using Hugging Face’s AutoTrain, understand when fine-tuning is necessary versus prompt engineering, prepare simple datasets, and evaluate model improvements. This project bridges the gap between using generic models and creating task-specific AI assistants without requiring deep machine learning expertise.
Learning Outcomes
- Understand Fine-tuning Use Cases: Identify when fine-tuning is beneficial versus using prompt engineering or few-shot learning. Compare cost-benefit trade-offs between approaches.
- Prepare Simple Datasets: Format existing data into instruction-response pairs, clean and validate training data, and create proper train/validation splits using spreadsheets or simple scripts.
- Use AutoTrain Interface: Navigate Hugging Face AutoTrain UI and API to upload datasets, configure basic training parameters, and monitor training progress without writing training loops.
- Evaluate Model Improvements: Compare fine-tuned model outputs against base models using simple metrics and qualitative assessment. Understand basic evaluation concepts like overfitting.
- Deploy via Inference API: Use Hugging Face Inference API to serve fine-tuned models, integrate custom models into simple applications, and manage API costs.
- Troubleshoot Common Issues: Debug dataset formatting problems, handle training failures, and optimize for limited budgets and resources.
By achieving these learning outcomes, participants will be able to create customized LLMs for specific business needs using user-friendly tools.
Prerequisites
- Basic Python programming (functions, dictionaries, loops)
- Understanding of what LLMs are and how to use them via APIs
- Familiarity with JSON and CSV file formats
- Access to Google Colab or similar cloud notebook environment
- Hugging Face account (free tier)
- Small budget for training ($5-20) or free credits
Skills & Tools
Skills You’ll Develop
- Data Preparation: Converting existing content into training format, quality checking
- Training Configuration: Selecting appropriate base models, setting epochs and batch sizes
- Cost Management: Estimating training costs, optimizing dataset size
- Model Selection: Choosing between model sizes based on task requirements
- Quality Assessment: A/B testing outputs, identifying improvement areas
- API Integration: Using inference endpoints, managing authentication
Tools You’ll Master
- Hugging Face AutoTrain: No-code/low-code fine-tuning platform
- Google Colab: Cloud-based notebook environment
- Hugging Face Hub: Model repository and versioning
- Gradio: Simple UI creation for model testing
- pandas: Basic data manipulation and CSV handling
- Hugging Face Inference API: Model deployment and serving
Steps and Tasks
Part 1: Understanding When to Fine-tune
1. Fine-tuning vs Alternatives Decision Framework
Learn to make informed decisions about when fine-tuning is worth the investment.
Decision framework and cost analysis
class FineTuningDecisionHelper:
"""Help decide between fine-tuning, prompt engineering, or RAG."""
def __init__(self):
self.decision_factors = {
'task_specificity': 0, # 0-10: How specific is your task?
'data_availability': 0, # 0-10: How much quality data do you have?
'budget': 0, # 0-10: Budget availability
'latency_requirements': 0, # 0-10: Need for fast responses?
'accuracy_needs': 0 # 0-10: How critical is accuracy?
}
def analyze_use_case(self, use_case: str) -> dict:
"""Analyze whether fine-tuning is recommended."""
recommendations = {
'prompt_engineering': {
'suitable_when': [
'Task is general or varies frequently',
'Limited training data (<100 examples)',
'Need to iterate quickly',
'Budget is very limited'
],
'example': 'Generic customer support, creative writing'
},
'few_shot_learning': {
'suitable_when': [
'Have 5-20 good examples',
'Task is well-defined but not unique',
'Need some customization without training'
],
'example': 'Specific format generation, style matching'
},
'fine_tuning': {
'suitable_when': [
'Very specific domain or task',
'Have 500+ quality examples',
'Consistent format/style needed',
'Reducing API costs long-term'
],
'example': 'Medical coding, legal document analysis'
},
'rag': {
'suitable_when': [
'Need up-to-date information',
'Large knowledge base exists',
'Factual accuracy is critical'
],
'example': 'Technical documentation Q&A, policy lookup'
}
}
return recommendations
def estimate_costs(self, num_examples: int, model_size: str = 'small'):
"""Estimate fine-tuning costs."""
# Rough estimates for AutoTrain
cost_per_1k_tokens = {
'small': 0.008, # ~1B parameters
'medium': 0.03, # ~7B parameters
'large': 0.12 # ~13B+ parameters
}
avg_tokens_per_example = 150
total_tokens = num_examples * avg_tokens_per_example
estimated_cost = (total_tokens / 1000) * cost_per_1k_tokens[model_size]
return {
'estimated_cost_usd': round(estimated_cost, 2),
'training_time_hours': round(total_tokens / 100000, 1),
'recommended_model': self._recommend_model(num_examples)
}
def _recommend_model(self, num_examples: int) -> str:
"""Recommend model size based on dataset size."""
if num_examples < 500:
return "distilgpt2 or gpt2-small"
elif num_examples < 2000:
return "gpt2-medium or llama-2-7b"
else:
return "gpt2-large or llama-2-13b"
# Example usage
helper = FineTuningDecisionHelper()
print("Use case analysis:", helper.analyze_use_case("customer support"))
print("Cost estimate:", helper.estimate_costs(num_examples=1000))
2. Creating Your First Dataset
Learn to prepare data in the simple formats required for fine-tuning.
Dataset preparation basics
import pandas as pd
import json
from typing import List, Dict
class SimpleDatasetPreparer:
"""Prepare datasets for AutoTrain fine-tuning."""
def __init__(self):
self.dataset = []
self.validation_errors = []
def create_from_faq(self, faq_file: str) -> pd.DataFrame:
"""Convert FAQ document to training dataset."""
# Example: Read FAQ from CSV
faq_df = pd.read_csv(faq_file)
training_data = []
for _, row in faq_df.iterrows():
# Format as instruction-response
formatted = {
'instruction': row['question'],
'response': row['answer']
}
training_data.append(formatted)
return pd.DataFrame(training_data)
    def create_from_conversations(self, conversations: List[Dict]) -> pd.DataFrame:
        """Convert chat logs to training format."""
        training_data = []
        instruction = None
        for conv in conversations:
            # Pair each user message with the assistant reply that follows it
            if conv['role'] == 'user':
                instruction = conv['content']
            elif conv['role'] == 'assistant' and instruction is not None:
                training_data.append({
                    'instruction': instruction,
                    'response': conv['content']
                })
                instruction = None
        return pd.DataFrame(training_data)
def validate_dataset(self, df: pd.DataFrame) -> Dict:
"""Basic validation for training data."""
issues = []
# Check for empty values
empty_instructions = df['instruction'].isna().sum()
empty_responses = df['response'].isna().sum()
if empty_instructions > 0:
issues.append(f"{empty_instructions} empty instructions found")
if empty_responses > 0:
issues.append(f"{empty_responses} empty responses found")
# Check length distribution
inst_lengths = df['instruction'].str.len()
resp_lengths = df['response'].str.len()
# Flag very short or very long examples
short_inst = (inst_lengths < 10).sum()
long_inst = (inst_lengths > 500).sum()
if short_inst > 0:
issues.append(f"{short_inst} very short instructions")
if long_inst > 0:
issues.append(f"{long_inst} very long instructions")
# Check for duplicates
duplicates = df.duplicated().sum()
if duplicates > 0:
issues.append(f"{duplicates} duplicate examples")
return {
'total_examples': len(df),
'issues': issues,
'ready_for_training': len(issues) == 0,
'instruction_avg_length': inst_lengths.mean(),
'response_avg_length': resp_lengths.mean()
}
def create_train_test_split(self, df: pd.DataFrame, test_size: float = 0.1):
"""Split dataset for training and validation."""
# Simple random split
test_samples = int(len(df) * test_size)
test_df = df.sample(n=test_samples, random_state=42)
train_df = df.drop(test_df.index)
# Save to CSV files for AutoTrain
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)
print(f"Training examples: {len(train_df)}")
print(f"Test examples: {len(test_df)}")
return train_df, test_df
# Example usage
preparer = SimpleDatasetPreparer()
# Create sample dataset
sample_data = [
{'instruction': 'How do I reset my password?',
'response': 'To reset your password, click on "Forgot Password" on the login page...'},
{'instruction': 'What are your business hours?',
'response': 'Our business hours are Monday-Friday, 9 AM to 5 PM EST...'}
]
df = pd.DataFrame(sample_data)
validation = preparer.validate_dataset(df)
print("Validation results:", validation)
Solving the Cold Start Problem
Data Generation and Augmentation Strategies
For learners who don’t have ready-made datasets, here are practical approaches to create training data from scratch.
Synthetic Data Generation Methods
import pandas as pd
import random
from typing import List, Dict, Tuple
from transformers import pipeline
class DataGenerator:
"""Generate synthetic training data for fine-tuning."""
    def __init__(self, base_model: str = "gpt2"):
        # Loaded for optional model-based generation; the methods below rely on
        # simple rule-based heuristics and do not require it
        self.generator = pipeline("text-generation", model=base_model)
def create_from_seed_examples(self, seed_examples: List[Tuple[str, str]],
variations_per_example: int = 3) -> pd.DataFrame:
"""Generate variations of seed examples using paraphrasing."""
training_data = []
for instruction, response in seed_examples:
# Generate instruction variations
instruction_variations = self._paraphrase_text(
instruction,
num_variations=variations_per_example
)
# Generate response variations
response_variations = self._paraphrase_text(
response,
num_variations=variations_per_example
)
# Create combinations
for inst_var in instruction_variations:
for resp_var in response_variations:
training_data.append({
'instruction': inst_var,
'response': resp_var,
'source': 'synthetic_variation'
})
# Also keep the original
training_data.append({
'instruction': instruction,
'response': response,
'source': 'seed_original'
})
return pd.DataFrame(training_data)
def create_from_template(self, template_config: Dict) -> pd.DataFrame:
"""Generate data from predefined templates."""
training_data = []
        # Example: customer support templates (also used as a generic fallback)
        if template_config.get('domain', 'customer_support') in ('customer_support', 'general'):
templates = [
"How do I {action} my {item}?",
"What is your policy on {topic}?",
"Can you help me with {problem}?",
"How long does {process} take?",
"Why is my {item} not {working_state}?"
]
responses = [
"To {action} your {item}, please follow these steps: {steps}",
"Our policy on {topic} is as follows: {policy_details}",
"I can certainly help with {problem}. Here's what you need to do: {solution}",
"The {process} typically takes {time_estimate}.",
"If your {item} is not {working_state}, try these troubleshooting steps: {troubleshooting}"
]
# Fill templates with realistic values
fillers = {
'action': ['reset', 'cancel', 'return', 'track', 'update'],
'item': ['account', 'order', 'password', 'subscription', 'device'],
'topic': ['returns', 'shipping', 'privacy', 'refunds', 'cancellations'],
'problem': ['login issues', 'payment problems', 'delivery delays', 'technical glitches'],
'process': ['shipping', 'processing', 'verification', 'delivery'],
'working_state': ['working', 'responding', 'loading', 'connecting'],
'steps': [
"1. Go to Settings\n2. Click on the option\n3. Confirm your choice",
"1. Visit the help page\n2. Submit a request\n3. Wait for confirmation"
],
'policy_details': [
"we allow returns within 30 days with original packaging",
"shipping is free for orders over $50",
"we process refunds within 5-7 business days"
],
'solution': [
"clear your browser cache and try again",
"check your internet connection and restart the application",
"contact support with your order number for immediate assistance"
],
'time_estimate': ["2-3 business days", "24-48 hours", "5-7 business days"],
'troubleshooting': [
"restart the device and check connections",
"update to the latest software version",
"check for any service outages in your area"
]
}
# Generate examples
            for _ in range(template_config.get('num_examples', 50)):
                # Instruction and response templates are parallel lists,
                # so reuse one index to keep question and answer aligned
                template_idx = random.randint(0, len(templates) - 1)
                instruction = templates[template_idx]
                response = responses[template_idx]
# Replace placeholders
for key, values in fillers.items():
if f"{{{key}}}" in instruction:
instruction = instruction.replace(f"{{{key}}}", random.choice(values))
if f"{{{key}}}" in response:
response = response.replace(f"{{{key}}}", random.choice(values))
training_data.append({
'instruction': instruction,
'response': response,
'source': 'template_generated'
})
return pd.DataFrame(training_data)
def create_from_qg_pipeline(self, context_documents: List[str]) -> pd.DataFrame:
"""Generate Q&A pairs from existing documents (Question-Answer Generation)."""
# Simple rule-based QG for beginners (can be enhanced with proper QG models)
training_data = []
for doc in context_documents:
sentences = doc.split('. ')
for i, sentence in enumerate(sentences):
if len(sentence.split()) > 8: # Only use substantial sentences
# Create simple questions
question = self._sentence_to_question(sentence)
if question:
training_data.append({
'instruction': question,
'response': sentence.strip(),
'source': 'qg_generated'
})
return pd.DataFrame(training_data)
def _paraphrase_text(self, text: str, num_variations: int = 3) -> List[str]:
"""Create paraphrased versions of text using simple transformations."""
variations = [text] # Always include original
# Simple paraphrasing rules (can be enhanced with proper paraphrasing models)
words = text.split()
if len(words) > 4:
# Variation 1: Change word order (if it makes sense)
if "?" not in text and "!" not in text:
try:
# Simple reordering for questions/statements
if text.startswith(('How', 'What', 'When', 'Where', 'Why')):
# Keep question words at start
variations.append(text)
else:
# Try some reordering
if len(words) > 6:
reordered = " ".join(words[2:] + words[:2])
if self._makes_sense(reordered):
variations.append(reordered)
                except Exception:
                    pass
# Variation 2: Synonym replacement for common words
synonyms = {
'how': ['what is the process for', 'what is the way to'],
'can you': ['could you', 'would you be able to', 'is it possible for you to'],
'help': ['assist with', 'support with', 'guide me through'],
'problem': ['issue', 'difficulty', 'challenge'],
'thank you': ['thanks', 'appreciate it', 'thank you so much']
}
for original, replacements in synonyms.items():
if original in text.lower():
for replacement in replacements:
variation = text.lower().replace(original, replacement)
variations.append(variation.capitalize())
# Ensure we have the requested number of variations
while len(variations) < num_variations + 1: # +1 because original is included
variations.append(text) # Fallback to original
return variations[:num_variations + 1]
def _sentence_to_question(self, sentence: str) -> str:
"""Convert a statement into a simple question."""
sentence = sentence.strip()
if not sentence or len(sentence) < 20:
return None
# Simple rule-based conversion
words = sentence.split()
if len(words) < 5:
return None
# Look for key phrases to form questions
if 'can' in sentence.lower() and 'by' in sentence.lower():
return "How can I accomplish this?"
elif 'should' in sentence.lower():
return "What should I do in this situation?"
elif 'must' in sentence.lower():
return "What are the requirements for this?"
elif 'because' in sentence.lower():
return "Why is this important?"
elif 'after' in sentence.lower() or 'before' in sentence.lower():
return "When should this be done?"
# Default question
return "Can you tell me more about this?"
def _makes_sense(self, text: str) -> bool:
"""Basic check if text makes grammatical sense."""
# Simple heuristic - check if it starts with capital and ends with punctuation
if len(text) < 10:
return False
if not text[0].isupper():
return False
if text[-1] not in '.!?':
return False
return True
# Example usage
generator = DataGenerator()
# Method 1: From seed examples
seed_examples = [
("How do I reset my password?", "You can reset your password by clicking 'Forgot Password' on the login page."),
("What are your business hours?", "Our customer support is available Monday to Friday, 9 AM to 6 PM EST.")
]
synthetic_data = generator.create_from_seed_examples(seed_examples, variations_per_example=2)
print(f"Generated {len(synthetic_data)} examples from seeds")
# Method 2: From templates
template_config = {
'domain': 'customer_support',
'num_examples': 20
}
template_data = generator.create_from_template(template_config)
print(f"Generated {len(template_data)} examples from templates")
# Combine all data
all_training_data = pd.concat([synthetic_data, template_data], ignore_index=True)
print(f"Total training examples: {len(all_training_data)}")
Data Quality Assessment and Filtering
Quality filtering for synthetic data
class DataQualityFilter:
"""Filter and clean generated training data."""
def __init__(self):
self.quality_metrics = {}
def assess_quality(self, df: pd.DataFrame) -> pd.DataFrame:
"""Assess quality of generated examples."""
quality_scores = []
for _, row in df.iterrows():
score = self._calculate_quality_score(row['instruction'], row['response'])
quality_scores.append(score)
df['quality_score'] = quality_scores
df['is_high_quality'] = df['quality_score'] >= 0.7
return df
def _calculate_quality_score(self, instruction: str, response: str) -> float:
"""Calculate quality score between 0 and 1."""
score = 0.0
# Length appropriateness (20%)
inst_words = len(instruction.split())
resp_words = len(response.split())
if 5 <= inst_words <= 30:
score += 0.2
elif 3 <= inst_words <= 50:
score += 0.1
if 10 <= resp_words <= 150:
score += 0.2
elif 5 <= resp_words <= 200:
score += 0.1
# Grammar and formatting (30%)
if instruction.strip() and instruction[0].isupper() and instruction[-1] in '.!?':
score += 0.15
if response.strip() and response[0].isupper() and response[-1] in '.!?':
score += 0.15
        # Diversity and uniqueness (20%) -- guard against empty strings
        inst_tokens = instruction.lower().split()
        resp_tokens = response.lower().split()
        if inst_tokens and len(set(inst_tokens)) / len(inst_tokens) > 0.7:
            score += 0.1
        if resp_tokens and len(set(resp_tokens)) / len(resp_tokens) > 0.6:
            score += 0.1
# Content quality (30%)
if not any(word in instruction.lower() for word in ['test', 'example', 'placeholder']):
score += 0.15
if not any(word in response.lower() for word in ['test', 'example', 'placeholder']):
score += 0.15
return min(score, 1.0)
def filter_low_quality(self, df: pd.DataFrame, min_quality: float = 0.6) -> pd.DataFrame:
"""Remove low-quality examples."""
if 'quality_score' not in df.columns:
df = self.assess_quality(df)
filtered_df = df[df['quality_score'] >= min_quality].copy()
removed_count = len(df) - len(filtered_df)
print(f"Removed {removed_count} low-quality examples")
print(f"Keeping {len(filtered_df)} high-quality examples")
return filtered_df
def remove_duplicates(self, df: pd.DataFrame, similarity_threshold: float = 0.8) -> pd.DataFrame:
"""Remove near-duplicate examples."""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Combine instruction and response for similarity check
texts = (df['instruction'] + " " + df['response']).tolist()
# Calculate TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
try:
tfidf_matrix = vectorizer.fit_transform(texts)
# Calculate cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)
# Find duplicates
duplicates = set()
for i in range(len(similarity_matrix)):
for j in range(i+1, len(similarity_matrix)):
if similarity_matrix[i][j] > similarity_threshold:
duplicates.add(j)
# Remove duplicates
filtered_df = df.drop(df.index[list(duplicates)]).reset_index(drop=True)
print(f"Removed {len(duplicates)} duplicate examples")
return filtered_df
except Exception as e:
print(f"Error in duplicate removal: {e}")
return df
# Example usage
quality_filter = DataQualityFilter()  # Avoid shadowing Python's built-in filter()
# Assess quality of generated data
quality_df = quality_filter.assess_quality(all_training_data)
print(f"High-quality examples: {quality_df['is_high_quality'].sum()}/{len(quality_df)}")
# Filter out low-quality data
filtered_data = quality_filter.filter_low_quality(quality_df, min_quality=0.6)
final_data = quality_filter.remove_duplicates(filtered_data)
print(f"Final dataset size: {len(final_data)} examples")
Integration with Existing Dataset Preparation
Enhanced Dataset Preparer
# Add this method to the existing SimpleDatasetPreparer class
def enhance_with_synthetic_data(self, df: pd.DataFrame,
target_size: int = 500,
use_templates: bool = True) -> pd.DataFrame:
"""Enhance small dataset with synthetic examples."""
current_size = len(df)
if current_size >= target_size:
print(f"Dataset already has {current_size} examples, no enhancement needed")
return df
needed = target_size - current_size
print(f"Generating {needed} synthetic examples...")
generator = DataGenerator()
filter = DataQualityFilter()
# Convert existing data to seed format
seed_examples = []
for _, row in df.iterrows():
seed_examples.append((row['instruction'], row['response']))
# Generate synthetic data
synthetic_df = generator.create_from_seed_examples(
seed_examples,
variations_per_example=max(2, needed // len(seed_examples))
)
# Add template data if needed
if use_templates and len(synthetic_df) < needed:
template_config = {
'domain': 'general', # Could be detected from existing data
'num_examples': needed - len(synthetic_df)
}
template_df = generator.create_from_template(template_config)
synthetic_df = pd.concat([synthetic_df, template_df], ignore_index=True)
# Filter for quality
synthetic_df = filter.filter_low_quality(synthetic_df)
synthetic_df = filter.remove_duplicates(synthetic_df)
# Combine with original
enhanced_df = pd.concat([df, synthetic_df], ignore_index=True)
enhanced_df = enhanced_df.sample(frac=1).reset_index(drop=True) # Shuffle
print(f"Enhanced dataset from {current_size} to {len(enhanced_df)} examples")
return enhanced_df
Practical Usage Example
End-to-end cold start workflow
# Complete workflow for cold start
def create_dataset_from_scratch():
"""Complete workflow for creating training data from scratch."""
# Step 1: Start with whatever you have (even just 2-3 examples)
starter_examples = [
{
'instruction': 'How do I contact support?',
'response': 'You can contact our support team by email at support@company.com or phone at 1-800-HELP.'
},
{
'instruction': 'Where can I find pricing information?',
'response': 'Our pricing plans are available on our website at company.com/pricing.'
}
]
starter_df = pd.DataFrame(starter_examples)
# Step 2: Enhance with synthetic data
preparer = SimpleDatasetPreparer()
enhanced_df = preparer.enhance_with_synthetic_data(
starter_df,
target_size=100, # Aim for 100 examples
use_templates=True
)
# Step 3: Validate the enhanced dataset
validation = preparer.validate_dataset(enhanced_df)
print("Enhanced dataset validation:", validation)
# Step 4: Create train/test split
train_df, test_df = preparer.create_train_test_split(enhanced_df)
return train_df, test_df
# Run the cold start pipeline
train_data, test_data = create_dataset_from_scratch()
These strategies help learners overcome the initial data hurdle, making the fine-tuning project accessible even when starting with only a handful of training examples. The synthetic data generation maintains quality through filtering and uses multiple augmentation strategies to create diverse, useful training examples.
Part 2: Using Hugging Face AutoTrain
1. AutoTrain Setup and Configuration
Learn to use the no-code AutoTrain interface for fine-tuning.
AutoTrain setup guide
import os
from huggingface_hub import HfApi, login
import requests
class AutoTrainSetup:
"""Setup and configure AutoTrain for fine-tuning."""
def __init__(self, hf_token: str = None):
"""Initialize with Hugging Face token."""
if hf_token:
login(token=hf_token)
else:
# Prompt for token if not provided
login()
self.api = HfApi()
def create_autotrain_project(self, project_name: str, task: str = "text-generation"):
"""Create a new AutoTrain project."""
# AutoTrain project configuration
config = {
'project_name': project_name,
'task': task,
'language': 'en',
'max_models': 1, # Train one model to save costs
'dataset_split': {
'train': 'train.csv',
'validation': 'test.csv'
}
}
return config
def upload_dataset_to_hub(self, dataset_path: str, repo_name: str):
"""Upload dataset to Hugging Face Hub for AutoTrain."""
try:
            # Create dataset repository (exist_ok avoids failing if it already exists)
            self.api.create_repo(
                repo_id=repo_name,
                repo_type="dataset",
                private=True,  # Keep dataset private
                exist_ok=True
            )
# Upload files
self.api.upload_file(
path_or_fileobj=f"{dataset_path}/train.csv",
path_in_repo="train.csv",
repo_id=repo_name,
repo_type="dataset"
)
self.api.upload_file(
path_or_fileobj=f"{dataset_path}/test.csv",
path_in_repo="test.csv",
repo_id=repo_name,
repo_type="dataset"
)
print(f"Dataset uploaded to: huggingface.co/datasets/{repo_name}")
return f"datasets/{repo_name}"
except Exception as e:
print(f"Error uploading dataset: {e}")
return None
def configure_training_params(self, base_model: str = "gpt2"):
"""Configure training parameters for AutoTrain."""
# Simple configuration for beginners
params = {
'base_model': base_model,
'num_train_epochs': 3, # Start with 3 epochs
'batch_size': 4, # Small batch size for free tier
'learning_rate': 2e-5, # Standard learning rate
'warmup_ratio': 0.1,
'gradient_accumulation': 4, # Simulate larger batch
'max_tokens': 512, # Maximum sequence length
'save_steps': 100,
'logging_steps': 10
}
# Estimate training time and cost
self._estimate_training_metrics(params)
return params
def _estimate_training_metrics(self, params: dict):
"""Estimate training time and cost."""
        # Rough estimates (check distilgpt2 first, since 'gpt2' is a substring of it)
        if 'distilgpt2' in params['base_model']:
            cost_per_hour = 0.40
            tokens_per_second = 1500
        elif 'gpt2' in params['base_model']:
            cost_per_hour = 0.60  # GPU cost estimate
            tokens_per_second = 1000
        else:
            cost_per_hour = 1.20
            tokens_per_second = 500
        # tokens_per_second could be used to refine the time estimate
        print(f"Estimated cost per hour: ${cost_per_hour}")
        print("Recommended training time: 1-2 hours for small datasets")
# Example usage
setup = AutoTrainSetup()
config = setup.create_autotrain_project("my-custom-assistant")
params = setup.configure_training_params(base_model="distilgpt2")
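The example above creates the project config and training parameters but does not exercise the upload step. Here is a short continuation, assuming train.csv and test.csv sit in the current directory and that "your-username/my-dataset" is a placeholder repository id:
# Upload the prepared CSVs so AutoTrain can reference them on the Hub
dataset_repo = setup.upload_dataset_to_hub(
    dataset_path=".",                      # Folder containing train.csv and test.csv
    repo_name="your-username/my-dataset"   # Placeholder; use your own namespace
)
print("Project:", config['project_name'])
print("Dataset reference for AutoTrain:", dataset_repo)
print("Training parameters:", params)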
2. Monitoring Training Progress
Track and understand training metrics without deep ML knowledge.
Training monitoring basics
import matplotlib.pyplot as plt
import pandas as pd
from typing import List, Dict
class TrainingMonitor:
"""Monitor AutoTrain progress with simple metrics."""
def __init__(self):
self.training_history = []
def explain_metrics(self) -> Dict[str, str]:
"""Explain training metrics in simple terms."""
explanations = {
'loss': """
Loss (lower is better):
Think of this as the model's 'error score'.
It should decrease as training progresses.
If it stops decreasing, training might be done.
""",
'learning_rate': """
Learning Rate:
How big of steps the model takes when learning.
Usually decreases over time for stability.
""",
'epoch': """
Epoch:
One complete pass through all training data.
Most fine-tuning needs 3-5 epochs.
""",
'eval_loss': """
Validation Loss:
Error score on test data the model hasn't seen.
If this increases while training loss decreases,
the model is overfitting (memorizing, not learning).
"""
}
return explanations
def check_for_overfitting(self, train_loss: List[float], val_loss: List[float]) -> bool:
"""Simple overfitting detection."""
if len(train_loss) < 5 or len(val_loss) < 5:
return False # Not enough data
# Check if validation loss is increasing while training decreases
recent_train = train_loss[-5:]
recent_val = val_loss[-5:]
train_improving = recent_train[-1] < recent_train[0]
val_worsening = recent_val[-1] > recent_val[0]
if train_improving and val_worsening:
print("⚠️ Warning: Model might be overfitting!")
print("Training is improving but validation is getting worse.")
print("Consider stopping training or using more data.")
return True
return False
def plot_training_progress(self, metrics: Dict[str, List[float]]):
"""Create simple training visualization."""
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Plot losses
if 'train_loss' in metrics and 'eval_loss' in metrics:
axes[0].plot(metrics['train_loss'], label='Training Loss')
axes[0].plot(metrics['eval_loss'], label='Validation Loss')
axes[0].set_xlabel('Steps')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Progress')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Plot learning rate
if 'learning_rate' in metrics:
axes[1].plot(metrics['learning_rate'])
axes[1].set_xlabel('Steps')
axes[1].set_ylabel('Learning Rate')
axes[1].set_title('Learning Rate Schedule')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def should_stop_training(self, metrics: Dict) -> bool:
"""Simple early stopping logic."""
if 'eval_loss' not in metrics:
return False
eval_losses = metrics['eval_loss']
if len(eval_losses) < 10:
return False # Too early to tell
# Check if validation loss hasn't improved in last 5 checks
recent_losses = eval_losses[-5:]
best_recent = min(recent_losses)
best_overall = min(eval_losses[:-5])
if best_recent > best_overall:
print("Validation loss hasn't improved recently.")
print("Consider stopping training to save costs.")
return True
return False
# Example usage
monitor = TrainingMonitor()
print(monitor.explain_metrics()['loss'])
# Simulate some metrics
sample_metrics = {
'train_loss': [1.5, 1.3, 1.1, 0.9, 0.7, 0.5],
'eval_loss': [1.4, 1.2, 1.0, 0.95, 1.0, 1.1]
}
monitor.check_for_overfitting(
sample_metrics['train_loss'],
sample_metrics['eval_loss']
)
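You can feed the same simulated metrics to the plotting and early-stopping helpers (matplotlib renders the figure inline in Colab):
# Visualize the simulated run and check whether early stopping is advised
monitor.plot_training_progress(sample_metrics)
if monitor.should_stop_training(sample_metrics):
    print("Early stopping recommended to save costs.")
else:
    print("Not enough evaluation points yet; keep monitoring.")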
Part 3: Testing and Evaluation
1. Simple Model Evaluation
Compare your fine-tuned model against the base model.
Basic evaluation methods
from transformers import pipeline
import pandas as pd
from typing import List, Dict
class SimpleModelEvaluator:
"""Evaluate fine-tuned models with simple methods."""
def __init__(self, base_model_name: str, finetuned_model_name: str):
self.base_pipeline = pipeline("text-generation", model=base_model_name)
self.finetuned_pipeline = pipeline("text-generation", model=finetuned_model_name)
self.test_results = []
def create_test_prompts(self) -> List[str]:
"""Create test prompts relevant to your use case."""
# Example for customer service fine-tuning
test_prompts = [
"How can I track my order?",
"What is your return policy?",
"My product arrived damaged, what should I do?",
"How long does shipping usually take?",
"Can I change my delivery address?"
]
return test_prompts
def compare_outputs(self, prompt: str, max_length: int = 100) -> Dict:
"""Compare outputs from both models."""
# Generate with base model
base_output = self.base_pipeline(
prompt,
max_length=max_length,
temperature=0.7,
do_sample=True,
pad_token_id=50256
)[0]['generated_text']
# Generate with fine-tuned model
finetuned_output = self.finetuned_pipeline(
prompt,
max_length=max_length,
temperature=0.7,
do_sample=True,
pad_token_id=50256
)[0]['generated_text']
return {
'prompt': prompt,
'base_response': base_output.replace(prompt, '').strip(),
'finetuned_response': finetuned_output.replace(prompt, '').strip()
}
def simple_quality_scores(self, response: str, criteria: Dict) -> Dict:
"""Score responses on simple criteria."""
scores = {}
# Length appropriateness
word_count = len(response.split())
if 20 <= word_count <= 100:
scores['length_appropriate'] = 1.0
elif 10 <= word_count <= 150:
scores['length_appropriate'] = 0.5
else:
scores['length_appropriate'] = 0.0
# Check for key phrases (domain-specific)
if criteria.get('required_phrases'):
phrases_found = sum(
1 for phrase in criteria['required_phrases']
if phrase.lower() in response.lower()
)
scores['key_phrases'] = phrases_found / len(criteria['required_phrases'])
        # Basic coherence (ends with punctuation, no repetition)
        stripped = response.strip()
        scores['ends_properly'] = 1.0 if stripped and stripped[-1] in '.!?' else 0.0
# Check for obvious repetition
words = response.lower().split()
unique_ratio = len(set(words)) / len(words) if words else 0
scores['no_repetition'] = 1.0 if unique_ratio > 0.7 else 0.0
return scores
def run_a_b_test(self, test_prompts: List[str]) -> pd.DataFrame:
"""Run A/B test between models."""
results = []
for prompt in test_prompts:
comparison = self.compare_outputs(prompt)
# Score both responses
base_scores = self.simple_quality_scores(
comparison['base_response'],
{'required_phrases': []}
)
finetuned_scores = self.simple_quality_scores(
comparison['finetuned_response'],
{'required_phrases': []}
)
results.append({
'prompt': prompt,
'base_avg_score': sum(base_scores.values()) / len(base_scores),
'finetuned_avg_score': sum(finetuned_scores.values()) / len(finetuned_scores),
'improvement': sum(finetuned_scores.values()) - sum(base_scores.values())
})
df = pd.DataFrame(results)
# Summary statistics
print("\n=== A/B Test Results ===")
print(f"Average Base Model Score: {df['base_avg_score'].mean():.2f}")
print(f"Average Fine-tuned Score: {df['finetuned_avg_score'].mean():.2f}")
print(f"Average Improvement: {df['improvement'].mean():.2f}")
print(f"Fine-tuned wins: {(df['improvement'] > 0).sum()}/{len(df)} prompts")
return df
# Example usage (with dummy models for illustration)
evaluator = SimpleModelEvaluator(
base_model_name="gpt2",
finetuned_model_name="username/my-finetuned-model"
)
test_prompts = evaluator.create_test_prompts()
results_df = evaluator.run_a_b_test(test_prompts)
2. Creating a Demo Interface
Build a simple interface to showcase your fine-tuned model.
Gradio demo creation
import gradio as gr
from transformers import pipeline
class ModelDemo:
"""Create interactive demo for fine-tuned model."""
def __init__(self, model_name: str):
self.pipeline = pipeline("text-generation", model=model_name)
self.conversation_history = []
def create_gradio_interface(self):
"""Create a simple Gradio interface."""
def generate_response(prompt, temperature, max_length):
"""Generate response from model."""
# Add to history
self.conversation_history.append(f"User: {prompt}")
# Generate response
response = self.pipeline(
prompt,
max_length=max_length,
temperature=temperature,
do_sample=True,
pad_token_id=50256
)[0]['generated_text']
# Clean response
clean_response = response.replace(prompt, '').strip()
self.conversation_history.append(f"Assistant: {clean_response}")
# Return last 5 exchanges
history_text = "\n".join(self.conversation_history[-10:])
return clean_response, history_text
# Create interface
interface = gr.Interface(
fn=generate_response,
inputs=[
gr.Textbox(
label="Enter your prompt",
placeholder="Ask me anything...",
lines=2
),
gr.Slider(
minimum=0.1, maximum=1.0, value=0.7, step=0.1,
label="Temperature (creativity)"
),
gr.Slider(
minimum=50, maximum=300, value=100, step=10,
label="Max Length"
)
],
outputs=[
gr.Textbox(label="Model Response", lines=3),
gr.Textbox(label="Conversation History", lines=10)
],
title="Fine-tuned Model Demo",
description="Test your fine-tuned model with different prompts and settings.",
examples=[
["How can I help you today?", 0.7, 100],
["Tell me about your features.", 0.5, 150],
["What's the weather like?", 0.8, 100]
]
)
return interface
def create_comparison_interface(self, base_model_name: str):
"""Create interface comparing base and fine-tuned models."""
base_pipeline = pipeline("text-generation", model=base_model_name)
def compare_models(prompt, temperature):
"""Generate responses from both models."""
# Base model response
base_response = base_pipeline(
prompt,
max_length=100,
temperature=temperature,
do_sample=True,
pad_token_id=50256
)[0]['generated_text'].replace(prompt, '').strip()
# Fine-tuned model response
finetuned_response = self.pipeline(
prompt,
max_length=100,
temperature=temperature,
do_sample=True,
pad_token_id=50256
)[0]['generated_text'].replace(prompt, '').strip()
return base_response, finetuned_response
comparison_interface = gr.Interface(
fn=compare_models,
inputs=[
gr.Textbox(label="Prompt", lines=2),
gr.Slider(0.1, 1.0, value=0.7, label="Temperature")
],
outputs=[
gr.Textbox(label="Base Model", lines=3),
gr.Textbox(label="Fine-tuned Model", lines=3)
],
title="Model Comparison Demo",
description="Compare responses from base and fine-tuned models side by side."
)
return comparison_interface
# Example usage
demo = ModelDemo("username/my-finetuned-model")
interface = demo.create_gradio_interface()
# interface.launch() # Uncomment to launch
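3. Deploying via the Inference API
Serve your fine-tuned model without managing infrastructure. The snippet below is a minimal sketch, assuming your fine-tuned model has been pushed to the Hub (the repo id "username/my-finetuned-model" is a placeholder) and that a Hugging Face access token is available; it uses huggingface_hub's InferenceClient rather than calling the raw HTTPS endpoint directly.
Inference API deployment sketch
import os
from huggingface_hub import InferenceClient
class InferenceAPIClient:
    """Query a fine-tuned model hosted on the Hugging Face Inference API."""
    def __init__(self, model_id: str, hf_token: str = None):
        # model_id is a placeholder such as "username/my-finetuned-model"
        token = hf_token or os.environ.get("HF_TOKEN")
        self.client = InferenceClient(model=model_id, token=token)
    def ask(self, prompt: str, max_new_tokens: int = 100, temperature: float = 0.7) -> str:
        """Send a prompt and return the generated continuation."""
        # Free-tier calls are rate limited; cache or batch requests to manage costs
        return self.client.text_generation(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True
        )
# Example usage (replace the placeholder model id with your own)
# api_client = InferenceAPIClient("username/my-finetuned-model")
# print(api_client.ask("How do I reset my password?"))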
Assessment Readiness Indicators
By completing this module, you should be able to:
- Determine when fine-tuning is the right solution versus using prompting or RAG
- Prepare and validate datasets in the correct format for AutoTrain
- Use Hugging Face AutoTrain to fine-tune models without writing training code
- Understand basic training metrics and identify common issues like overfitting
- Compare fine-tuned models against base models using simple evaluation methods
- Deploy fine-tuned models using Hugging Face Inference API
- Create interactive demos to showcase model improvements
- Estimate costs and training time for different dataset sizes