By: Kolena Editorial Team
OpenAI GPT Fine-Tuning: Step By Step Guide
Oct 10, 2024

How Does Fine-Tuning Work in OpenAI GPT Models? 

Fine-tuning in OpenAI GPT models, such as GPT-3.5 and GPT-4, involves taking a pre-trained model and further training it on a custom dataset tailored to specific tasks or use cases. OpenAI offers a managed UI and API that allows you to provide custom datasets and perform fine-tuning on OpenAI’s infrastructure.

This process adjusts the weights of the model, enabling it to perform better on particular tasks by learning from the new data. The custom dataset typically includes examples that reflect the desired input-output behavior, allowing the model to learn patterns and responses that are more aligned with the specific requirements.

The fine-tuning process starts with preparing the dataset in the appropriate format. For GPT-3.5-turbo (the model used in ChatGPT 3.5), this means organizing data in a conversational format with roles and messages. The dataset is then used to train the model, often utilizing techniques like gradient descent to minimize the loss function, which measures the difference between the model’s predictions and the actual desired outputs.

This is part of a series of articles about large language models.

When to Use Fine-Tuning for GPT: Common Use Cases 

Fine-tuning GPT models can be a complex process. Before opting for fine-tuning, it’s advisable to maximize the model’s performance through other methods such as prompt engineering, prompt chaining, and using built-in functions. These approaches often yield sufficient improvements, making fine-tuning unnecessary in many cases.

Fine-tuning may be necessary in the following scenarios:

  • Customizing output characteristics: Fine-tuning helps shape the model’s responses to match a desired style, tone, or format, ensuring consistency with specific qualitative requirements.
  • Enhancing reliability: For applications where consistent output is critical, fine-tuning enhances the model’s dependability, reducing the variance in responses.
  • Handling complex prompts: When the model struggles to follow intricate instructions, fine-tuning improves its ability to understand and execute complex prompts effectively.
  • Managing edge cases: Fine-tuning enables the model to handle numerous edge cases in a predetermined manner, increasing its versatility and accuracy.
  • Learning new skills or tasks: Introducing the model to new skills or tasks that are hard to encapsulate within a prompt can be achieved through fine-tuning, equipping the model with capabilities needed for specialized applications.

Which OpenAI Models Offer Fine-Tuning?

At the time of this writing, the following GPT models support fine-tuning:

  • gpt-4o-2024-05-13
  • gpt-4-0613
  • gpt-3.5-turbo-0613
  • gpt-3.5-turbo-1106 
  • gpt-3.5-turbo-0125 
  • davinci-002 
  • babbage-002 

Currently, fine-tuning is not available for gpt-4-turbo models. Other GPT-4 models are still in an experimental phase, so access may be limited; users must request access when creating a fine-tuning job in the fine-tuning UI. The remaining models can be fine-tuned using the newer endpoint, /v1/fine_tuning/jobs.

For more information on fine-tuning OpenAI’s GPT models, see the documentation.
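
For illustration, here is a minimal sketch of calling the /v1/fine_tuning/jobs endpoint directly with Python's requests library; "file-example000" is a placeholder file ID, and the snippet assumes your API key is available in the OPENAI_API_KEY environment variable. The SDK call in Step 2 of the tutorial below is equivalent.

import os
import requests

# POST a new fine-tuning job to the /v1/fine_tuning/jobs endpoint.
response = requests.post(
    "https://api.openai.com/v1/fine_tuning/jobs",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"training_file": "file-example000", "model": "gpt-3.5-turbo"},
)
print(response.json())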

Key Features of OpenAI Fine-Tuning API 

The OpenAI Fine-Tuning API offers several advanced features to enhance model performance, improve accuracy, and reduce latency and costs. These features provide developers with the tools needed to customize models for various tasks and improve their overall quality:

  • Fine-tuning dashboard: OpenAI provides a dashboard that allows users to manage fine-tuning tasks, with features like detailed training metrics and the ability to rerun jobs from previous configurations. This makes it easy to monitor and manage fine-tuning tasks.
  • Comparative Playground: The Comparative Playground UI enables developers to compare the quality and performance of different models or fine-tuning snapshots side-by-side. This feature facilitates human evaluation of multiple outputs against a single prompt, allowing for more precise adjustments and improvements.
  • Epoch-based checkpoint creation: This feature allows for the automatic generation of a full fine-tuned model checkpoint during each training epoch. This capability reduces the need for subsequent retraining and helps prevent overfitting, ensuring the model maintains high performance over time (a sketch for listing these checkpoints follows this list).
  • Comprehensive validation metrics: Developers can compute comprehensive metrics such as loss and accuracy over the entire validation dataset, rather than relying on a sampled batch. This provides deeper insights into model quality, enabling more informed decisions during the fine-tuning process.
  • Hyperparameter configuration: The API provides the ability to configure hyperparameters directly from the dashboard. This user-friendly interface simplifies the process of adjusting training parameters, making it more accessible and less error-prone compared to configuration solely through the API or SDK.
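
The epoch-based checkpoints mentioned above can be retrieved programmatically once a job has finished. Below is a minimal sketch using the checkpoints endpoint of the OpenAI Python SDK; "ftjob-example000" is a placeholder job ID, and the endpoint's availability may depend on your SDK version.

from openai import OpenAI
client = OpenAI()

# Each checkpoint corresponds to a training epoch and can itself be used
# for inference or as the base model for further fine-tuning.
checkpoints = client.fine_tuning.jobs.checkpoints.list("ftjob-example000")
for checkpoint in checkpoints:
    print(checkpoint.step_number, checkpoint.fine_tuned_model_checkpoint)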

Tutorial: Fine-Tuning a ChatGPT Model 

This tutorial is adapted from the OpenAI documentation.

Step 1: Prepare the Dataset

Having determined that fine-tuning is the best option for the intended use case, you need to start preparing the data to train the model. This involves creating a diverse set of example conversations that mirror those the model will need to handle in production.

All example conversations should have the same format as the OpenAI Chat Completions API (see the API reference). This includes a list of messages with each message having a role and content and, optionally, a name. A few of these training examples should reflect target cases where the current model is not responding as desired. The assistant messages provided in the dataset should reflect the responses the model should provide.

Example formats
This example aims to build a chatbot that can respond factually while adding optimistic remarks. Here are three example training conversations that could be included in the dataset:

{
    "messages": [
        {"role": "system", "content": "Joe is a factual chatbot that always replies optimistically."},
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "Blue, symbolizing hope and promising good weather."}
    ]
}
{
    "messages": [
        {"role": "system", "content": "Joe is a factual chatbot that always replies optimistically."},
        {"role": "user", "content": "Who wrote 'Harry Potter'?"},
        {"role": "assistant", "content": "J.K. Rowling. With luck she will soon write another best-seller."}
    ]
}
{
    "messages": [
        {"role": "system", "content": "Joe is a factual chatbot that always replies optimistically."},
        {"role": "user", "content": "What distance is the moon from the earth?"},
        {"role": "assistant", "content": "Just 384,400 kilometers. With the progress being made in space travel, this will soon be a very short distance indeed."}
    ]
}

Multi-turn chats
The chat format also allows a single training example to contain multiple assistant messages. The standard approach to fine-tuning is to train on all assistant messages within an example. However, you can skip specific messages by adding a weight key that disables fine-tuning on them; this lets you control which messages the model learns from. The weight value must be 0 or 1.

Here are some examples that use weight in the chat format:

{
    "messages": [
        {"role": "system", "content": "Joe is a factual chatbot that always replies optimistically."},
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "Blue", "weight": 0},
        {"role": "user", "content": "Can you be more optimistic?"},
        {"role": "assistant", "content": "Blue, symbolizing hope and promising good weather.", "weight": 1}
    ]
}
{
    "messages": [
        {"role": "system", "content": "Joe is a factual chatbot that always replies optimistically."},
        {"role": "user", "content": "Who wrote 'Harry Potter'?"},
        {"role": "assistant", "content": "J.K. Rowling", "weight": 0},
        {"role": "user", "content": "Can you be more optimistic?"},
        {"role": "assistant", "content": "J.K. Rowling. With luck she will soon write another best-seller.", "weight": 1}
    ]
}
{
    "messages": [
        {"role": "system", "content": "Joe is a factual chatbot that always replies optimistically."},
        {"role": "user", "content": "What distance is the moon from the earth?"},
        {"role": "assistant", "content": "384,400 kilometers", "weight": 0},
        {"role": "user", "content": "Can you be more optimistic?"},
        {"role": "assistant", "content": "Just 384,400 kilometers. With the progress being made in space travel, this will soon be a very short distance indeed.", "weight": 1}
    ]
}

The right number of examples

Fine-tuning a model requires at least 10 training examples, although it is recommended to use a minimum of 50-100 examples for models like gpt-3.5-turbo. The appropriate number of examples depends on the use case.

Start with 50 good examples and evaluate the model's improvement after the initial fine-tuning. This might be enough, but you will usually need to add more examples. Even this limited training set should improve the model to some extent, though it may not be production-ready. If there is no improvement, you might need to rethink the prompts or training data.

Token limits
A model’s token limit determines the possible length of the example conversation. The token limit depends on the model used. At the time of this writing, GPT-3.5 supports up to 16,385 tokens, while GPT-4 supports 128,000 tokens. Refer to the OpenAI models page for current information.

Examples longer than the limit are automatically truncated to fit, meaning that tokens are removed from the end of the example. To ensure that each training example retains its full context, check that its total token count stays below the limit.
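
One way to check is to count tokens per example before uploading. Below is a rough sketch using the tiktoken library with the cl100k_base encoding used by gpt-3.5-turbo; it counts message content only, ignoring the small per-message formatting overhead, and "mydata.jsonl" is a placeholder file name.

import json
import tiktoken

# cl100k_base is the tokenizer used by gpt-3.5-turbo and gpt-4.
encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens(example):
    # Rough count: content tokens only, per-message overhead ignored.
    return sum(len(encoding.encode(m["content"])) for m in example["messages"])

with open("mydata.jsonl") as f:
    counts = [num_tokens(json.loads(line)) for line in f]

print(f"examples: {len(counts)}, max tokens: {max(counts)}, total tokens: {sum(counts)}")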

Estimating costs
You can calculate the cost of fine-tuning training jobs using this formula:

(base training cost per 1M input tokens ÷ 1M) × number of tokens in the input file × number of epochs trained

For example, to train gpt-3.5-turbo-0125 for 3 epochs with a file containing 100,000 tokens, the cost should be approximately $2.40.
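
As a sanity check, here is the same calculation in Python. The $8.00 per 1M training tokens rate is an assumption consistent with the example above; verify it against OpenAI's current pricing page.

# Assumed training rate for gpt-3.5-turbo-0125 (USD per 1M tokens).
base_cost_per_1m_tokens = 8.00
tokens_in_file = 100_000
n_epochs = 3

estimated_cost = (base_cost_per_1m_tokens / 1_000_000) * tokens_in_file * n_epochs
print(f"Estimated cost: ${estimated_cost:.2f}")  # -> Estimated cost: $2.40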

Checking the data format
After compiling the dataset, check that the data is in the right format before creating a fine-tuning job. You can check data formatting using a Python script, which can help identify errors, count tokens, and estimate fine-tuning job costs.
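
A minimal version of such a check might look like the following sketch. It verifies only the basic message structure used in this tutorial (not token counts or costs), and "mydata.jsonl" is a placeholder file name.

import json

VALID_ROLES = {"system", "user", "assistant"}

with open("mydata.jsonl") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        messages = example.get("messages")
        assert isinstance(messages, list) and messages, f"line {i}: missing messages"
        for m in messages:
            assert m.get("role") in VALID_ROLES, f"line {i}: bad role {m.get('role')!r}"
            assert isinstance(m.get("content"), str), f"line {i}: content must be a string"
        # Each example needs at least one assistant message to learn from.
        assert any(m["role"] == "assistant" for m in messages), f"line {i}: no assistant message"

print("No format errors found")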

Uploading the training dataset

To upload the training file, use:

from openai import OpenAI
client = OpenAI()

training_file = client.files.create(
  file=open("mydata.jsonl", "rb"),
  purpose="fine-tune"
)
print(training_file.id)  # save this file ID to reference when creating the job

Once the file is uploaded, processing may take some time. While the system supports file uploads of up to 1 GB, files that large are not recommended; a smaller amount of data is usually sufficient to achieve significant improvements when fine-tuning a model.

Step 2: Create a Fine-Tuned Model

With the training data ready, you can start creating the fine-tuning job. You can do this programmatically or using OpenAI’s fine-tuning UI.

To create a fine-tuning job with the OpenAI SDK:

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-example000", 
  model="gpt-3.5-turbo"
)

The fine-tuning job may take a while to complete, especially if the job is queued behind others in the OpenAI system. The actual model training can take anywhere from a few minutes to several hours depending on the type of model and size of the dataset. When the training is complete, you’ll receive an email confirmation.
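
If you prefer not to wait for the email, a simple polling loop can track the job programmatically. This is a minimal sketch; "ftjob-example000" is a placeholder job ID, and the terminal status values shown are those documented for fine-tuning jobs.

import time
from openai import OpenAI
client = OpenAI()

# Poll the job status until it reaches a terminal state.
while True:
    job = client.fine_tuning.jobs.retrieve("ftjob-example000")
    print(job.status)  # e.g. validating_files, queued, running
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.fine_tuned_model)  # populated once the job succeeds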

Instead of creating a new fine-tuning job, you can use the SDK to list existing training jobs, view the status of jobs, or cancel jobs:

from openai import OpenAI
client = OpenAI()

# List 20 fine-tuning jobs
client.fine_tuning.jobs.list(limit=20)

# Get the state of a fine-tuning job
client.fine_tuning.jobs.retrieve("ftjob-example000")

# Cancel a job
client.fine_tuning.jobs.cancel("ftjob-example000")

# List up to 20 events from a fine-tuning job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-example000", limit=20)

# Delete a model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:example000")

Step 3: Using a Fine-Tuned ChatGPT Model

Upon the successful completion of a fine-tuning job, the fine_tuned_model field in the job details should be populated with the model’s name. You can use the OpenAI Playground to make requests to the model.

The model should be immediately available for inference after training. In some cases, it might take a few minutes for the model to be able to handle requests (if it is still being loaded). 

from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo:example-org:example000:id",
  messages=[
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "Hi there!"}
  ]
)
print(completion.choices[0].message)

Step 4: Analyze the Fine-Tuned Model

You can track the following metrics calculated throughout the training process:

  • Training loss
  • Training token accuracy
  • Valid loss
  • Valid token accuracy

Valid loss and valid token accuracy are calculated both on a small batch of data at each step and on the full validation split at the end of each epoch. The full valid loss and full valid token accuracy are the most reliable metrics of a model's overall performance. They provide a sanity check that training was effective, ideally showing that loss decreased while token accuracy increased.

You can view event objects during the fine-tuning job:

{
    "object": "fine_tuning.job.event",
    "id": "ftevent-example000",
    "created_at": 1693582679,
    "level": "info",
    "message": "Step 1000/1000: training loss=0.15, validation loss=0.27, full validation loss=0.40",
    "data": {
        "step": 300,
        "train_loss": 0.14991648495197296,
        "valid_loss": 0.26569826706596045,
        "total_steps": 300,
        "full_valid_loss": 0.4032616495084362,
        "train_mean_token_accuracy": 0.9444444179534912,
        "valid_mean_token_accuracy": 0.9565217391304348,
        "full_valid_mean_token_accuracy": 0.9089635854341737
    },
    "type": "metrics"
}

After the job is completed, you should be able to view metrics on the training process by querying the fine-tuning job and extracting a file ID from result_files (the sketch after this list shows how). The results CSV file should contain the following columns:

  • step 
  • train_loss
  • train_accuracy 
  • valid_loss 
  • valid_mean_token_accuracy
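
The following sketch retrieves the results CSV for offline analysis. It assumes the job has succeeded and produced at least one result file; "ftjob-example000" is a placeholder job ID.

from openai import OpenAI
client = OpenAI()

# Look up the finished job and take the first result file ID.
job = client.fine_tuning.jobs.retrieve("ftjob-example000")
result_file_id = job.result_files[0]

# Download the CSV content and save it locally.
content = client.files.content(result_file_id)
with open("results.csv", "wb") as f:
    f.write(content.read())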

In addition to viewing metrics, evaluate samples from your fine-tuned model to gauge its quality. Generate samples from the base model as well as the fine-tuned model on the same test set and compare these samples. The test set should include the full range of inputs that the model might encounter in production. If you don’t have time for a full manual evaluation with every iteration, you can use the Evals library to automate subsequent evaluations.
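
A simple manual comparison might look like the sketch below, reusing the optimistic-chatbot prompts from Step 1; the fine-tuned model name is a placeholder.

from openai import OpenAI
client = OpenAI()

test_prompts = ["What color is the sky?", "Who wrote 'Harry Potter'?"]
models = ["gpt-3.5-turbo", "ft:gpt-3.5-turbo:example-org:example000:id"]

# Generate answers from the base and fine-tuned models for side-by-side review.
for model in models:
    for prompt in test_prompts:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Joe is a factual chatbot that always replies optimistically."},
                {"role": "user", "content": prompt},
            ],
        )
        print(f"{model}: {completion.choices[0].message.content}")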

Step 5: Experiment with Different Hyperparameters

You can specify hyperparameters such as the number of epochs, learning rate multiplier, and batch size. For the initial training run, avoid specifying hyperparameters; the system will pick defaults based on the dataset's size. You should adjust them if:

  • The model does not adhere to the training data closely enough: In this case, add 1 or 2 epochs.
  • The model lacks diversity: In this case, decrease the number of epochs by 1 or 2.
  • The model is not converging: In this case, scale up the learning rate multiplier.

Here’s an example of how to set the hyperparameters:

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-example000", 
  model="gpt-3.5-turbo", 
  hyperparameters={
    "n_epochs":3
  }
)
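
The hyperparameters object also accepts batch_size and learning_rate_multiplier. Here is a variant setting all three; the specific values are illustrative only, and "auto" leaves a value to the system default.

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-example000",
  model="gpt-3.5-turbo",
  hyperparameters={
    "n_epochs": 4,                  # add epochs if the model underfits the training data
    "batch_size": "auto",           # leave batch size to the system default
    "learning_rate_multiplier": 2   # scale up if the model is not converging
  }
)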

AI Testing & Validation with Kolena

Kolena is an AI/ML testing & validation platform that solves one of AI’s biggest problems: the lack of trust in model effectiveness. The use cases for AI are enormous, but AI lacks trust from both builders and the public. It is our responsibility to build that trust with full transparency and explainability of ML model performance, not just from a high-level aggregate ‘accuracy’ number, but from rigorous testing and evaluation at scenario levels.

With Kolena, machine learning engineers and data scientists can uncover hidden machine learning model behaviors, easily identify gaps in test data coverage, and truly learn where and why a model is underperforming, all in minutes, not weeks. Kolena's AI/ML model testing and validation solution helps developers build safe, reliable, and fair systems by allowing companies to instantly stitch together razor-sharp test cases from their datasets, enabling them to scrutinize AI/ML models in the precise scenarios those models will face in the real world. The Kolena platform transforms AI development from an experimental practice into an engineering discipline that can be trusted and automated.

Learn more about Kolena