LLaMA 3 Fine-Tuning: The Basics and Four Ways to Fine-Tune Your LLaMA

Oct 10, 2024

What Is LLM Fine-Tuning?

Fine-tuning is a process used with large language models (LLMs) to adapt a pre-trained model to a specific task or dataset. It involves making relatively small adjustments to the model’s parameters to better cater to particular needs without retraining the model from scratch. Fine-tuning takes the foundational capabilities of a broad model and refines them to improve accuracy and performance on specialized data.

The approach is useful when dealing with large models where full retraining would be computationally expensive or impractical. By only adjusting the final layers or certain parameters of a model, fine-tuning allows for significant improvements in performance with a relatively small investment in time and resources.
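
As a minimal illustration of the idea, the PyTorch sketch below freezes a pre-trained network and trains only a small task-specific head (the layer sizes are purely illustrative):

import torch
import torch.nn as nn

# A stand-in for a pre-trained network (illustrative; in practice this is an LLM)
backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))

# Freeze the pre-trained parameters so they are not updated
for param in backbone.parameters():
    param.requires_grad = False

# Add a small task-specific head; only its weights will be trained
head = nn.Linear(768, 2)

# The optimizer only sees the trainable parameters of the head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)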

What Is LLaMA 3? 

LLaMA 3, developed by Meta AI, is an open source LLM offered in two sizes, 8 billion and 70 billion parameters, each available in pretrained and instruction-tuned versions. It provides an 8K context window, double that of LLaMA 2. Compared to LLaMA 2, it can handle multi-step tasks, has significantly lower false refusal rates, and offers significantly better reasoning, code generation, and instruction following.

A key feature of LLaMA 3 is its efficiency. Unlike many large-scale models that require extensive computational resources, LLaMA 3 has been optimized to perform well even on less powerful hardware. This makes it accessible to a broader range of users and applications, helping democratize the use of AI in research and industry settings.

Why Use LLaMA for Fine-Tuning? 

LLaMA models are useful for fine-tuning projects for several reasons:

  • Efficiency: The design of models like LLaMA 3 allows for fine-tuning with lower computational costs compared to other large language models.
  • Flexibility: They can be adapted to a range of tasks, from generating text to answering questions.
  • Accessibility: LLaMA’s architecture allows it to run on less powerful hardware, broadening its usability to more users and environments.
  • Performance: Fine-tuning LLaMA models has been shown to significantly enhance their performance on specific tasks, providing more accurate and contextually relevant outputs.
  • Community support: With a large community and extensive documentation, users can easily find resources and support for fine-tuning their models.

Important Concepts and Tools for LLaMA 3 Fine-Tuning 

Experiment Tracking

Experiment tracking aims to monitor and log the performance of different model configurations. By keeping detailed records of each experiment, including the hyperparameters used, the dataset, and the resulting metrics, researchers can systematically analyze what changes lead to improvements.
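
For example, a minimal tracking setup with Weights & Biases (the wandb package, which is also installed later in this tutorial) could log a run’s hyperparameters and per-step metrics; the project name and values below are purely illustrative:

import wandb

# Record the configuration of this experiment (illustrative values)
wandb.init(project="llama3-finetuning", config={
    "learning_rate": 2e-4,
    "lora_r": 64,
    "batch_size": 2,
    "dataset": "openassistant-guanaco",
})

# Inside the training loop, log metrics at each step
training_losses = [2.1, 1.8, 1.5]  # placeholder values for illustration
for step, loss in enumerate(training_losses):
    wandb.log({"train/loss": loss, "step": step})

wandb.finish()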

LoRA and QLoRA

Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA) are techniques used to fine-tune large models with reduced computational requirements. LoRA reduces the number of trainable parameters by decomposing the weight matrices into lower-dimensional matrices, which simplifies the model without significantly impacting performance. QLoRA further optimizes this process by applying quantization, reducing the precision of the weights to lower the memory footprint and accelerate training.
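
The core idea behind LoRA can be sketched in a few lines of PyTorch: instead of updating a full weight matrix, it trains two much smaller matrices whose product forms a low-rank update. The class below is a simplified illustration, not the implementation used by any particular library:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight matrix W
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors: B (out_features x r) and A (r x in_features)
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.normal_(self.lora_A, std=0.01)
        self.scaling = alpha / r

    def forward(self, x):
        # Output of the frozen layer plus the scaled low-rank update B @ A
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling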

LLaMA Recipes Repository

The LLaMA Recipes Repository is a collection of guides, code snippets, and best practices for fine-tuning LLaMA models. It includes step-by-step instructions for setting up the environment, preparing datasets, and executing fine-tuning processes. This repository serves as a useful resource for both novice and experienced users, providing practical insights and solutions to common challenges encountered during fine-tuning.

Torchtune

Torchtune is a fine-tuning toolkit that works seamlessly with PyTorch, simplifying the process of adapting LLaMA models to specific tasks. It provides pre-built modules for common operations, such as data loading, training loops, and evaluation metrics. By abstracting these details, Torchtune allows users to focus on experimenting with different configurations.
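
Once installed (see Option 2 below), torchtune’s command-line interface can list the recipes and configurations it ships with, which is a quick way to see what is available:

tune ls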

Hugging Face PEFT LoRA Library

The Hugging Face PEFT (Parameter-Efficient Fine-Tuning) LoRA Library offers an easy-to-use interface for applying LoRA to LLaMA models. This library integrates with the Hugging Face ecosystem, making it easy to load pre-trained models, apply LoRA techniques, and fine-tune the models on custom datasets. The PEFT library simplifies the adaptation process, enabling fine-tuning with minimal code and effort.
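
For instance, the basic workflow looks roughly like the following sketch (the target modules and hyperparameters are illustrative and vary by model; LLaMA checkpoints also require access approval on Hugging Face):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base model (name is illustrative; gated models need an access token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Configure LoRA: rank, scaling, dropout, and which modules to adapt
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the LoRA parameters remain trainable
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()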

Tutorial: Four Ways to Fine-Tune Meta LLaMA 3

Fine-tuning LLaMA typically involves PEFT (Parameter-Efficient Fine-Tuning) methods such as LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation). Here’s an overview of the process.

Option 1: Using LLaMA Recipes Repository

To carry out fine-tuning using PEFT recipes in the LLaMA-recipes repository, follow these steps:

1. Create a conda environment with PyTorch and dependencies:

conda create -n LLaMA-ft python=3.8

conda activate LLaMA-ft
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Important: Ensure the CUDA version requested here (pytorch-cuda=12.1) is supported by your GPU driver, otherwise you may see errors like “GPU is not present”. Check the driver’s supported CUDA version by running: nvidia-smi
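
To confirm that PyTorch itself can see the GPU after installation, a quick check such as the following can help:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"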

2. Install the recipes:

git clone https://github.com/meta-llama/llama-recipes

cd llama-recipes

pip3 install -r requirements.txt

3. Download the LFS git extension from https://git-lfs.com/ and run the following commands:

git lfs install 

Or, alternatively:

wget https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh

chmod u+x script.deb.sh

./script.deb.sh

sudo apt-get install git-lfs

4. Download the desired model from Hugging Face using the following command. Note that it can take a long time to execute, depending on the size of the model.

git clone https://<Your-Huggingface-Username>:<User-Token>@huggingface.co/meta-llama/Llama-2-7b-hf
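
Alternatively, the huggingface_hub Python client can download the same repository without git LFS (a sketch, assuming you have been granted access to the model and created an access token):

from huggingface_hub import snapshot_download

# Download the model repository into the local directory ./Llama-2-7b-hf
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    local_dir="./Llama-2-7b-hf",
    token="<User-Token>",  # your Hugging Face access token
)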

5. Run the fine-tuning script:

python -m llama_recipes.finetuning \
    --use_peft --peft_method lora --quantization \
    --model_name ./Llama-2-7b-hf \
    --output_dir ./Llama-2-7b-hf/7B-peft \
    --batch_size_training 2 --gradient_accumulation_steps 2

This script fine-tunes the LLaMA model using LoRA with quantization, making it suitable for single GPU setups.

Option 2: Using torchtune

To use torchtune for LLaMA fine-tuning, install it, then follow the instructions to download the model weights and run the fine-tuning recipe:

1. Install torchtune:

pip3 install torchtune

2. Download the configuration file from the PyTorch GitHub repository.

3. Download model weights:

tune download meta-llama/Meta-Llama-3-8B \
    --output-dir /tmp/Meta-Llama-3-8B-Instruct \
    --hf-token <ACCESS TOKEN>

4. Run LoRA fine-tuning on a single device:

tune run lora_finetune_single_device --config llama3/8B_lora_single_device.yaml

This setup allows fine-tuning on consumer-grade GPUs and supports both single and multi-GPU configurations.
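
For example, a multi-GPU LoRA run can be launched with torchtune’s distributed recipe (a sketch; adjust the number of processes and the config to your hardware):

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora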

Option 3: Using Hugging Face LoRA Library

Here’s an example of fine-tuning Llama 2-7b on the OpenAssistant (Guanaco) dataset, using the example SFT script included in the Hugging Face TRL library:

1. Install dependencies:

pip3 install trl

git clone https://github.com/huggingface/trl

cd trl

pip3 install wandb

2. Run the fine-tuning script:

python ./examples/scripts/sft.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --dataset_name timdettmers/openassistant-guanaco \
    --load_in_4bit \
    --use_peft \
    --batch_size 4 \
    --gradient_accumulation_steps 2 \
    --per_device_train_batch_size 2 \
    --lora_r 64 \
    --log_with wandb \
    --output_dir /home/ubuntu/output \
    --max_seq_length 512 \
    --fp16

3. Inference with the fine-tuned model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import LoraConfig, PeftModel

# Define the model and output paths
model_name = "meta-llama/Llama-2-7b-hf"
new_model = "output"
device_map = {"": 0}# Use the first GPU device

# Load the base model with low CPU memory usage and on the specified device
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

# Load the tokenizer with the appropriate settings
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Set the pad token to be the same as the end-of-sequence token
tokenizer.padding_side = "right" # Ensure padding is done on the right side

# Define the prompt for text generation
prompt = "When was the movie Jurassic Park made?"

# Generate with the base model first; loading the adapter below modifies base_model in place
pipe = pipeline(task="text-generation", model=base_model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text']) # Print the generated text from the base model

# Load the fine-tuned adapter using the base model as the starting point
model = PeftModel.from_pretrained(base_model, new_model)

# Merge and unload the LoRA layers for inference efficiency
model = model.merge_and_unload()

# Create a text-generation pipeline using the fine-tuned model
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)

# Generate text using the fine-tuned model
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text']) # Print the generated text from the fine-tuned model

Note: We saved the above code in a file called p1.py, one level up from the output directory that contains our fine-tuned model, and ran it with:

python p1.py

Option 4: Fine-Tuning with QLoRA 

Here are the general steps involved in fine-tuning a LLaMA model using QLoRA with the OpenAssistant dataset:

Setup and Installation

1. Start by cloning the QLoRA repository and installing the necessary dependencies:

git clone https://github.com/artidoro/qlora

cd qlora

pip3 install -U -r requirements.txt

2. Run the provided fine-tuning script to fine-tune the LLaMA 2-7b model. This script will handle the setup and fine-tuning process:

scripts/finetune_llama2_guanaco_7b.sh

Alternatively, run qlora.py directly with the following flags, which control memory requirements and GPU tuning:

python qlora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --use_auth \
    --output_dir ./output/llama-2-guanaco-7b \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --per_device_eval_batch_size 8 \
    --max_new_tokens 32 \
    --dataloader_num_workers 4 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --do_mmlu_eval \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 4 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset oasst1 \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --max_steps 1875 \
    --eval_steps 187 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0
This process takes approximately 9 hours on a single GPU, utilizing around 11GB of GPU memory. Upon completion, the output_dir specified in the script will contain subfolders named checkpoint-<step> (for example, checkpoint-1875), which include the fine-tuned adapter model files.

Running Inference

After fine-tuning, you can run inference using the following script:

1. Import the required libraries for loading the model and running inference:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import LoraConfig, PeftModel
import torch

2. Load the pretrained model with quantization and the fine-tuned adapter model:

# Define the model ID and the path to the fine-tuned adapter model
model_id = "meta-llama/Llama-2-7b-hf"
new_model = "output/LLaMA-2-guanaco-7b/checkpoint-1875/adapter_model"  # Adjust the path if necessary

# Configuration for loading the model with 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, # Enable loading the model in 4-bit precision
    bnb_4bit_compute_dtype=torch.bfloat16, # Set the compute dtype to bfloat16
    bnb_4bit_use_double_quant=True, # Use double quantization to further reduce memory usage
    bnb_4bit_quant_type='nf4' # Specify the quantization type as 'nf4'
)

# Load the base model with the specified quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True, # Optimize for low CPU memory usage
    quantization_config=quantization_config, # Apply the quantization configuration
    torch_dtype=torch.float16, # Set the torch dtype to float16
    device_map='auto' # Automatically map the model to available devices
)

# Load the fine-tuned adapter model on top of the base model
model = PeftModel.from_pretrained(model, new_model)

# Load the tokenizer corresponding to the base model
tokenizer = AutoTokenizer.from_pretrained(model_id)

3. Create a text generation pipeline and run inference with the fine-tuned model:

prompt = "When was the movie Jurassic Park made?"

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Note: We have stored all the above in a file called p2.py inside the qlora folder. Execute the code using the following command:

python p2.py

AI Testing & Validation with Kolena

Kolena is an AI/ML testing & validation platform that solves one of AI’s biggest problems: the lack of trust in model effectiveness. The use cases for AI are enormous, but AI lacks trust from both builders and the public. It is our responsibility to build that trust with full transparency and explainability of ML model performance, not just from a high-level aggregate ‘accuracy’ number, but from rigorous testing and evaluation at scenario levels.

With Kolena, machine learning engineers and data scientists can uncover hidden machine learning model behaviors, easily identify gaps in test data coverage, and truly learn where and why a model is underperforming, all in minutes, not weeks. Kolena’s AI/ML model testing and validation solution helps developers build safe, reliable, and fair systems by allowing companies to instantly stitch together razor-sharp test cases from their datasets, enabling them to scrutinize AI/ML models in the precise scenarios those models will encounter in the real world. The Kolena platform transforms AI development from an experimental practice into an engineering discipline that can be trusted and automated.

Learn more about Kolena