In this section, we will discuss one of the most important techniques for adapting language models beyond prompt engineering: fine-tuning. Fine-tuning a machine learning model, particularly a transformer-based model such as OpenAI's GPT-3.5, involves adapting a pre-trained model to a specific task using additional training data. This process enhances the model's performance on a particular application by providing it with domain-specific knowledge. In this example, we fine-tune a model to classify medical reports into specific medical specialties. The steps include data preprocessing, tokenization, formatting the data for training, and executing the fine-tuning process through the OpenAI API.
This project was undertaken for a client in the pharmaceutical sector with the primary objective of analyzing data to support further research aimed at solving health-related problems. Our client sought to explore the potential of foundation models in creating a chat-based application for extracting valuable information from medical reports.
To validate this approach, we conducted initial experiments using an MIT-licensed dataset, focusing on the classification of medical reports into specific medical specialties. The findings from these experiments are shared here. The final model, which is built on confidential data, aims to enhance the client's healthcare app by providing accurate and efficient classification of medical conditions, thereby supporting better decision-making in healthcare delivery.
The project involved fine-tuning a pre-trained transformer model, OpenAI's GPT-3.5 Turbo, adapting it to the specific needs of the client. This process included data preprocessing, tokenization, formatting the data for training, and executing the fine-tuning job. The goal was to evaluate whether foundation models could be applied effectively in this context and to provide valuable insights for future research and development in the healthcare sector.
We will start the fine-tuning by importing the required libraries:
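import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tiktoken
import openai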
These imports include essential libraries:
pandas and numpy for data manipulation and numerical operations.
matplotlib and seaborn for data visualization.
tiktoken for tokenizing text data.
openai for interacting with the OpenAI API.
# Load the dataset and inspect summary statistics
data = pd.read_csv("file.csv")
data.describe()

# Drop rows with missing report text
data = data.dropna(subset=['report'])
data.info()
Next, we group and split the data:
Grouping and Counting: The data is grouped by medical_specialty and counted to understand the class distribution.
Sampling Data: 20 reports per specialty are sampled; 10 of these per specialty are held out and split evenly into validation and test sets (5 each).
Data Partitioning: The remaining reports form the training set, ensuring no overlap with the validation and test sets.
# Count reports per specialty to understand the class distribution
grouped_data = data.groupby('medical_specialty').count()
grouped_data

# Sample 20 reports per specialty
sampled_data = data.groupby('medical_specialty').sample(20, random_state=42)

# Hold out 10 reports per specialty for validation and testing
val_test_data = sampled_data.groupby('medical_specialty').sample(10, random_state=42)

# Split the hold-out evenly: the first 5 reports per specialty for validation, the last 5 for test
val_data = val_test_data.groupby('medical_specialty').head(5)
test_data = val_test_data.groupby('medical_specialty').tail(5)
len(val_data)
len(test_data)

# The remaining sampled reports form the training set, with no overlap with the hold-out
train_data = sampled_data[~sampled_data.index.isin(val_test_data.index)]
len(train_data)
Since fine-tuning is billed per training token, it comes at a price. So before launching the job, we estimated the cost in the following way.
def num_token_from_string(string):
    # Count tokens with the cl100k_base encoding used by GPT-3.5 models
    encoder = tiktoken.get_encoding('cl100k_base')
    num_token = len(encoder.encode(string))
    return num_token
# Token counts per training report
token_counts = train_data['report'].apply(num_token_from_string)
token_counts.describe()
token_counts.sum()

# Total training tokens (338,456 here) times the $0.008-per-1K-token training price;
# note that the billed total also scales with the number of epochs
cost = (338456 * 0.008) / 1000
print(cost)
system_prompt = "Based on the medical description, please classify the disease into one of the following: 'Cardiovascular / Pulmonary', 'Gastroenterology', 'Neurology', 'Radiology', 'Surgery'"
def df_to_format(df):
    # Convert each report into the chat format expected by the fine-tuning API
    formatted_data = []
    for index, row in df.iterrows():
        entry = {
            'messages': [
                {'role': 'system', 'content': system_prompt},
                {'role': 'user', 'content': row['report']},
                {'role': 'assistant', 'content': row['medical_specialty']}
            ]
        }
        formatted_data.append(entry)
    return formatted_data
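Before uploading, the formatted examples must be serialized as JSON Lines (one JSON object per line). A minimal sketch that writes the training and validation splits to the file names used in the next step (write_jsonl is a helper introduced here, not part of the OpenAI SDK):

import json

def write_jsonl(records, path):
    # Write one JSON object per line, as required by the fine-tuning endpoint
    with open(path, 'w') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')

write_jsonl(df_to_format(train_data), 'ft_training_data.jsonl')
write_jsonl(df_to_format(val_data), 'ft_validation_data.jsonl')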
from openai import OpenAI

# Initialize the client (replace the placeholder with your own API key)
client = OpenAI(api_key='YOUR_API_KEY')
# Upload the training and validation files to OpenAI
training_file_response = client.files.create(file=open('ft_training_data.jsonl', 'rb'), purpose="fine-tune")
validation_file_response = client.files.create(file=open('ft_validation_data.jsonl', 'rb'), purpose="fine-tune")

# Launch the fine-tuning job on gpt-3.5-turbo-0125 for 5 epochs
fine_tune_response = client.fine_tuning.jobs.create(
    training_file=training_file_response.id,
    model='gpt-3.5-turbo-0125',
    hyperparameters={'n_epochs': 5},
    validation_file=validation_file_response.id
)
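Fine-tuning jobs run asynchronously, so the script must wait for completion before the model can be used. A minimal polling sketch, using the v1 OpenAI Python client's retrieve call:

import time

job_id = fine_tune_response.id
while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(job.status)
    if job.status in ('succeeded', 'failed', 'cancelled'):
        break
    time.sleep(60)  # poll once a minute

# Once the job succeeds, the fine-tuned model id is available on the job object
fine_tuned_model = job.fine_tuned_model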
def classify_reports(report, model, system_prompt):
    # Query the fine-tuned model with the same system prompt used during training
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': report}
        ]
    )
    return completion.choices[0].message.content
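With the fine-tuned model id reported below, we can classify the held-out test reports and score them. A minimal sketch, assuming the model returns the specialty label verbatim so that an exact-match comparison is meaningful:

fine_tuned_model = 'ft:gpt-3.5-turbo-0125:personal::9tFL36Wm'

# Classify every test report and compare against the ground-truth specialty
predictions = test_data['report'].apply(
    lambda report: classify_reports(report, fine_tuned_model, system_prompt)
)
accuracy = (predictions == test_data['medical_specialty']).mean()
print(f"Test accuracy: {accuracy:.2%}")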
The completed fine-tuning job reported the following details:
Model: ft:gpt-3.5-turbo-0125:personal::9tFL36Wm
Job ID: ftjob-jS0mPDxvhTad5lX9ef9KAJUE
Base model: gpt-3.5-turbo-0125
Output model: ft:gpt-3.5-turbo-0125:personal::9tFL36Wm
Trained tokens: 163,135
Epochs: 5
Batch size: 1
LR multiplier: 2
Seed: 260798329
Training Loss and Validation Loss (0.0000):
Training Loss: A training loss of 0.0000 suggests that the model has perfectly learned the training data.
Validation Loss: A validation loss of 0.0000 indicates that the model performs perfectly on the validation set during training.
Full Validation Loss (0.5745):
This metric likely represents the loss on a more comprehensive or final evaluation set. A value of 0.5745 is reasonable and typical for many models, indicating decent but not perfect performance.
Possible Issues:
Overfitting:
The large gap between the training/validation loss (0.0000) and the full validation loss (0.5745) suggests overfitting: the model has learned the training and initial validation data too well, potentially memorizing it rather than generalizing from it.
Data Leakage:
Such a discrepancy might also indicate data leakage, where information from the training set has unintentionally been used in the validation set.
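One quick way to rule out leakage in our own pipeline is to confirm that the splits share no rows. A minimal sanity check on the DataFrames defined earlier:

# Each intersection should be empty if the splits are leak-free
print(len(train_data.index.intersection(val_data.index)))
print(len(train_data.index.intersection(test_data.index)))
print(len(val_data.index.intersection(test_data.index)))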
Our project aimed to explore the potential of foundation models in a healthcare application, specifically to classify medical reports into various medical specialties. Initially, we conducted experiments using an MIT-licensed dataset to fine-tune a pre-trained transformer model, resulting in extremely low training and validation losses.
The excessively low training loss and validation loss indicated that our model was overfitting on the training data. This overfitting could be attributed to several factors:
Limited Data Diversity: The initial dataset might have lacked sufficient diversity, leading the model to learn the training data too well without generalizing to unseen data.
Data Imbalance: An imbalance in the distribution of different medical specialties could cause the model to perform exceptionally well on overrepresented classes while underperforming on others.
Inadequate Data Augmentation: Not implementing data augmentation techniques could limit the model's exposure to varied inputs during training.
To address these issues and improve the model's performance, we took several steps while working with the actual, confidential dataset:
Diverse Data Collection:
We ensured that the new dataset had a more diverse and representative set of medical reports, covering a broader range of medical specialties.
Balanced Data Sampling:
We used stratified sampling to maintain a balanced distribution of classes, preventing the model from being biased toward any particular specialty (see the sketch after this list).
Data Augmentation:
Implementing data augmentation techniques such as paraphrasing reports and adding slight variations helped the model learn more robust features.
Hyperparameter Tuning:
We experimented with different hyperparameters, such as learning rate, batch size, and the number of epochs, to find the optimal settings that reduced overfitting and improved generalization.
Cross-Validation:
Implementing cross-validation allowed us to better evaluate the model's performance and ensure it was not overly fitted to any single subset of the data.
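To make the stratified sampling and cross-validation steps concrete, here is a minimal sketch using scikit-learn on the public dataset from earlier; scikit-learn is an assumption here, as the rest of the chapter does not use it:

from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified split: preserves the proportion of each medical_specialty in both partitions
train_df, holdout_df = train_test_split(
    data, test_size=0.2, stratify=data['medical_specialty'], random_state=42
)

# Stratified 5-fold cross-validation for a more robust performance estimate
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(data, data['medical_specialty'])):
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")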
As a result, the model trained on the actual dataset demonstrated improved performance, with more realistic training and validation loss values. This improvement highlighted the importance of using high-quality, diverse, and well-processed data for training foundation models. By addressing the initial shortcomings, we created a robust chat-based model that can accurately classify medical reports, providing valuable support in healthcare research and decision-making.