Fine-tuning is a powerful approach in natural language processing (NLP) and generative AI, allowing businesses to tailor pre-trained large language models (LLMs) for specific tasks. This process involves updating the model’s weights to improve its performance on targeted applications. By fine-tuning, the LLM can adapt its knowledge base to specific data and tasks, resulting in enhanced task-specific capabilities. To achieve optimal results, having a clean, high-quality dataset is of paramount importance. A well-curated dataset forms the foundation for successful fine-tuning. Additionally, careful adjustment of hyperparameters such as learning rate multiplier and batch size plays a crucial role in optimizing the model’s adaptation to the target task.
The capabilities in Amazon Bedrock for fine-tuning LLMs offer substantial benefits for enterprises. This feature enables companies to optimize models like Anthropic’s Claude 3 Haiku on Amazon Bedrock for custom use cases, potentially achieving performance levels comparable to or even surpassing more advanced models such as Anthropic’s Claude 3 Opus or Anthropic’s Claude 3.5 Sonnet. The result is a significant improvement in task-specific performance, while potentially reducing costs and latency. This approach offers a versatile solution to satisfy your goals for performance and response time, allowing businesses to balance capability, domain knowledge, and efficiency in your AI-powered applications.
In this post, we explore the best practices and lessons learned for fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock. We discuss the important components of fine-tuning, including use case definition, data preparation, model customization, and performance evaluation. This post dives deep into key aspects such as hyperparameter optimization, data cleaning techniques, and the effectiveness of fine-tuning compared to base models. We also provide insights on how to achieve optimal results for different dataset sizes and use cases, backed by experimental data and performance metrics.
As part of this post, we first introduce general best practices for fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock, and then present specific examples with the TAT- QA dataset (Tabular And Textual dataset for Question Answering).
The use cases that are the most well-suited for fine-tuning Anthropic’s Claude 3 Haiku include the following:
Fine-tuning Anthropic’s Claude 3 Haiku has demonstrated superior performance compared to few-shot prompt engineering on base Anthropic’s Claude 3 Haiku, Anthropic’s Claude 3 Sonnet, and Anthropic’s Claude 3.5 Sonnet across various tasks. These tasks include summarization, classification, information retrieval, open-book Q&A, and custom language generation such as SQL. However, achieving optimal performance with fine-tuning requires effort and adherence to best practices.
To better illustrate the effectiveness of fine-tuning compared to other approaches, the following table provides a comprehensive overview of various problem types, examples, and their likelihood of success when using fine-tuning versus prompting with Retrieval Augmented Generation (RAG). This comparison can help you understand when and how to apply these different techniques effectively.
Problem | Examples | Likelihood of Success with Fine-tuning | Likelihood of Success with Prompting + RAG |
Make the model follow a specific format or tone | Instruct the model to use a specific JSON schema or talk like the organization’s customer service reps | Very High | High |
Teach the model a new skill | Teach the model how to call APIs, fill out proprietary documents, or classify customer support tickets | High | Medium |
Teach the model a new skill, and hope it learns similar skills | Teach the model to summarize contract documents, in order to learn how to write better contract documents | Low | Medium |
Teach the model new knowledge, and expect it to use that knowledge for general tasks | Teach the model the organizations’ acronyms or more music facts | Low | Medium |
Before diving into the best practices and optimizing fine-tuning LLMs on Amazon Bedrock, familiarize yourself with the general process and how-to outlined in Fine-tune Anthropic’s Claude 3 Haiku in Amazon Bedrock to boost model accuracy and quality. The post provides essential background information and context for the fine-tuning process, including step-by-step guidance on fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock both through the Amazon Bedrock console and Amazon Bedrock API.
The process of fine-tuning an LLM like Anthropic’s Claude 3 Haiku on Amazon Bedrock typically follows these key stages:
Throughout this journey, depending on the business case, you may choose to combine fine-tuning with techniques like prompt engineering for optimal results. The process is inherently iterative, allowing for continuous improvement as new data or requirements emerge.
The TAT-QA dataset is related to a use case for question answering on a hybrid of tabular and textual content in finance where tabular data is organized in table formats such as HTML, JSON, Markdown, and LaTeX. We focus on the task of answering questions about the table. The evaluation metric is the F1 score that measures the word-to-word matching of the extracted content between the generated output and the ground truth answer. The TAT-QA dataset has been divided into train (28,832 rows), dev (3,632 rows), and test (3,572 rows).
The following screenshot provides a snapshot of the TAT-QA data, which comprises a table with tabular and textual financial data. Following this financial data table, a detailed question-answer set is presented to demonstrate the complexity and depth of analysis possible with the TAT-QA dataset. This comprehensive table is from the paper TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance, and it includes several key components:
The following screenshot shows a formatted version of the data as JSONL and is passed to Anthropic’s Claude 3 Haiku for fine-tuning training data. The preceding table has been structured in JSONL format with system, user role (which contains the data and the question), and assistant role (which has answers). The table is enclosed within the XML tag <table><table>
, helping Anthropic’s Claude 3 Haiku parse the prompt with the data from the table. For the model fine-tuning and performance evaluation, we randomly selected 10,000 examples from the TAT-QA dataset to fine-tune the model, and randomly picked 3,572 records from the remainder of the dataset as testing data.
When fine-tuning the Anthropic’s Claude 3 Haiku model, the quality of training data is paramount and serves as the primary determinant of the output quality, surpassing the importance of any other step in the fine-tuning process. Our experiments have consistently shown that high-quality datasets, even if smaller in size, yield better results than a larger but less refined one. This “quality over quantity” approach should guide the entire data preparation process. Data cleaning and validation are essential steps in maintaining the quality of the training set. The following are two effective methods:
{'prompt': {
'system': "You are a reliable and impartial expert judge in question/answering data assessment. ",
'messages': [
{'role': 'user', 'content': [{'type': 'text', 'text': 'Your task is to take a question, an answer, and a context which may include multiple documents, and provide a judgment on whether the answer to the question is correct or not. This decision should be based either on the provided context or your general knowledge and memory. If the answer contradicts the information in context, it's incorrect. A correct answer is ideally derived from the given context. If no context is given, a correct answer should be factually true and directly and unambiguously address the question.nnProvide a short step-by-step reasoning with a maximum of 4 sentences within the <reason></reason> xml tags and provide a single correct or incorrect response within the <judgement></judgement> xml tags.n <context>n...n</context>n<question>n...n</question>n<answer>n...n</answer>n'}]}]}}
The following is a sample output from Anthropic’s Claude 3.5 Sonnet:
{'id': 'job_id',
'type': 'message',
'role': 'assistant',
'model': 'claude-3-5-sonnet-20240620',
'content': [{'type': 'text',
'text': '<reason>n1. I'll check the table for information... </reason>nn<judgement>correct</judgement>'}],
'stop_reason': 'end_turn',
'stop_sequence': None,
'usage': {'input_tokens': 923, 'output_tokens': 90}}
This LLM-as-a-judge approach is effective for large datasets, allowing for efficient and consistent quality assessment across a wide range of examples. It can help identify and filter out low-quality or irrelevant data points, making sure only the most suitable examples are used for fine-tuning.
The format of your training data is equally important. Although it’s optional, it’s highly recommended to include a system prompt that clearly defines the model’s role and tasks. In addition, including rationales within XML tags can provide valuable context for the model and facilitate extraction of key information. Prompt optimization is one of the key factors in improving model performance. Following established guidelines, such as those provided by Anthropic, can significantly enhance results. This might include structuring prompts with semantic blocks within XML tags, both in training samples and at inference time.
By adhering to these best practices in data cleaning, validation, and formatting, you can create a high-quality dataset that forms the foundation for successful fine-tuning. In the world of model training, quality outweighs quantity, and a well-prepared dataset is key to unlocking the full potential of fine-tuning Anthropic’s Claude 3 Haiku.
When fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock, it’s crucial to optimize your training parameters to achieve the best possible performance. Our experiments have revealed several key insights that can guide you in effectively setting up your customization training jobs.
One of the most critical aspects of fine-tuning is selecting the right hyperparameters, particularly learning rate multiplier and batch size (see the appendix in this post for definitions). Our experiment results have shown that these two factors can significantly impact the model’s performance, with improvements ranging from 2–10% across different tasks. For the learning rate multiplier, the value ranges between 0.1–2.0, with a default value of 1.0. We suggest starting with the default value and potentially adjusting this value based on your evaluation result. Batch size is another important parameter, and its optimal value can vary depending on your dataset size. Based on our hyperparameter tuning experiments across different use cases, the API allows a range of 4–256, with a default of 32. However, we’ve observed that dynamically adjusting the batch size based on your dataset size can lead to better results:
The following chart illustrates how model performance improves as the size of the training dataset increases, as well as the change of optimal parameters, using the TAT-QA dataset. Each data point is annotated with the optimal learning rate multiplier (LRM), batch size (BS), and number of epochs (Epoch) used to achieve the best performance with the dataset size. We can observe that larger datasets tend to benefit from higher learning rates and batch sizes, whereas smaller datasets require more training epochs. The red dashed line is the baseline Anthropic’s Claude 3 Haiku performance without fine-tuning efforts.
By following these guidelines, you can configure an Anthropic’s Claude 3 Haiku fine-tuning job with a higher chance of success. However, remember that these are general recommendations and the optimal settings may vary depending on your specific use case and dataset characteristics.
In scenarios with large amounts of data (1,000–10,000 examples), the learning rate tends to have a more significant impact on performance. Conversely, for smaller datasets (32–100 examples), the batch size becomes the dominant factor.
The fine-tuned Anthropic’s Claude 3 Haiku model demonstrated substantial performance improvements over base models when evaluated on the financial Q&A task, highlighting the effectiveness of the fine-tuning process on specialized data. Based on the evaluation results, we found the following:
The following table provides a detailed comparison of the performance metrics for the fine-tuned Claude 3 Haiku model against various base models, illustrating the significant improvements achieved through fine-tuning.
. | . | . | . | . | Fine-Tuned Model Performance | Base Model Performance | Improvement: Fine-Tuned Anthropic’s Claude 3 Haiku vs. Base Models | ||||
Target Use Case | Task Type | Fine-Tuning Data Size | Test Data Size | Eval Metric | Anthropic’s Claude 3 Haiku | Anthropic’s Claude 3 Haiku (Base Model) | Anthropic’s Claude 3 Sonnet | Anthropic’s Claude 3.5 Sonnet | vs. Anthropic’s Claude 3 Haiku Base | vs. Anthropic’s Claude 3 Sonnet Base | vs. Anthropic’s Claude 3.5 Sonnet Base |
TAT-QA | Q&A on financial text and tabular content | 10,000 | 3,572 | F1 score | 91.2% | 73.2% | 76.3% | 83.0% | 24.6% | 19.6% | 9.9% |
Few-shot examples improve performance not only on the base model, but also on fine-tuned models, especially when the fine-tuning data is small.
Fine-tuning also demonstrated significant benefits in reducing token usage. On the TAT-QA HTML test set (893 examples), the fine-tuned Anthropic’s Claude 3 Haiku model reduced the average output token count by 35% compared to the base model, as shown in the following table.
Model | Average Output Token | % Reduced | Median | % Reduced | Standard Deviation | Minimum Token | Maximum Token |
Anthropic’s Claude 3 Haiku Base | 34 | – | 28 | – | 27 | 13 | 245 |
Anthropic’s Claude 3 Haiku Fine-Tuned | 22 | 35% | 17 | 39% | 14 | 13 | 179 |
We use the following figures to illustrate the token count distribution for both the base Anthropic’s Claude 3 Haiku and fine-tuned Anthropic’s Claude 3 Haiku models. The left graph shows the distribution for the base model, and the right graph displays the distribution for the fine-tuned model. These histograms demonstrate a shift towards more concise output in the fine-tuned model, with a notable reduction in the frequency of longer token sequences.
To further illustrate this improvement, consider the following example from the test set:
"How did the company adopt Topic 606?"
"the modified retrospective method"
"The company adopted the provisions of Topic 606 in fiscal 2019 utilizing the modified retrospective method"
"the modified retrospective method"
As evident from this example, the fine-tuned model produces a more concise and precise answer, matching the ground truth exactly, whereas the base model includes additional, unnecessary information. This reduction in token usage, combined with improved accuracy, can lead to enhanced efficiency and reduced costs in production deployments.
Fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock offers significant performance improvements for specialized tasks. Our experiments demonstrate that careful attention to data quality, hyperparameter optimization, and best practices in the fine-tuning process can yield substantial gains over base models. Key takeaways include the following:
Although fine-tuning provides impressive results, combining it with other techniques like prompt engineering may lead to even better outcomes. As LLM technology continues to evolve, mastering fine-tuning techniques will be crucial for organizations looking to use these powerful models for specific use cases and tasks.
Now you’re ready to fine-tune Anthropic’s Claude 3 Haiku on Amazon Bedrock for your use case. We look forward to seeing what you build when you put this new technology to work for your business.
We used the following hyperparameters as part of our fine-tuning:
For our evaluation, we used the F1 score, which is an evaluation metric to assess the performance of LLMs and traditional ML models.
To compute the F1 score for LLM evaluation, we need to define precision and recall at the token level. Precision measures the proportion of generated tokens that match the reference tokens, and recall measures the proportion of reference tokens that are captured by the generated tokens. The F1 score ranges from 0–100, with 100 being the best possible score and 0 being the lowest. However, interpretation can vary depending on the specific task and requirements.
We calculate these metrics as follows:
For example, let’s say the LLM generates the sentence “The cat sits on the mat in the sun” and the reference sentence is “The cat sits on the soft mat under the warm sun.” The precision would be 6/9 (6 matching tokens out of 9 generated tokens), and the recall would be 6/11 (6 matching tokens out of 11 reference tokens).
TL;DR A conversation with 4o about the potential demise of companies like Anthropic. As artificial…
Whether a company begins with a proof-of-concept or live deployment, they should start small, test…
Digital tools are not always superior. Here are some WIRED-tested agendas and notebooks to keep…
Machine learning (ML) models are built upon data.
Editor’s note: This is the second post in a series that explores a range of…
David J. Berg*, David Casler^, Romain Cledat*, Qian Huang*, Rui Lin*, Nissan Pow*, Nurcan Sonmez*,…