Meet FreeWilly, Our Large And Mighty Instruction Fine-Tuned Models

Stability AI and its CarperAI lab are proud to announce FreeWilly1 and its successor FreeWilly2, two powerful new open-access Large Language Models (LLMs). Both models demonstrate exceptional reasoning ability across varied benchmarks. FreeWilly1 builds on the original LLaMA 65B foundation model and was carefully fine-tuned on a new synthetically generated dataset using Supervised Fine-Tuning (SFT) in standard Alpaca format. Similarly, FreeWilly2 builds on the LLaMA 2 70B foundation model and reaches performance that compares favorably with GPT-3.5 on some tasks.
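For readers unfamiliar with the Alpaca format mentioned above, here is a minimal sketch of how one training record could be rendered into text. The two templates are the ones published by the Stanford Alpaca project; whether FreeWilly used them verbatim is an assumption, and the function name format_alpaca is hypothetical.

```python
# Prompt templates as published by the Stanford Alpaca project.
# Assumption: the exact template used to train FreeWilly is not stated in this post.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def format_alpaca(example: dict) -> str:
    """Render one {"instruction", "input", "output"} record as SFT text."""
    if example.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    else:
        # Records without extra context use the shorter template.
        prompt = PROMPT_NO_INPUT.format(instruction=example["instruction"])
    # The target response is appended directly after the "### Response:" marker.
    return prompt + example["output"]


if __name__ == "__main__":
    print(format_alpaca({"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}))
```

During supervised fine-tuning, the model is trained to continue the prompt portion with the response portion.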

Both models are research experiments and are released under a non-commercial license to foster open research. While we have conducted internal red-teaming to ensure the models remain polite and harmless, we welcome the community’s feedback and help in further red-teaming.

Data Generation and Collection

The training for the FreeWilly models was directly inspired by the methodology pioneered by Microsoft in its paper: “Orca: Progressive Learning from Complex Explanation Traces of GPT-4.” While our data generation process is similar, we differ in our data sources.

Our variant of the dataset, containing 600,000 data points (roughly 10% of the dataset size the original Orca paper used), was created by prompting language models with high-quality instructions from the following datasets created by Enrico Shippole:

  1. COT Submix Original

  2. NIV2 Submix Original

  3. FLAN 2021 Submix Original

  4. T0 Submix Original

With this approach, we generated 500,000 examples with a simpler LLM and an additional 100,000 with a more sophisticated LLM. To ensure fair comparisons, we carefully filtered these datasets and removed examples that originated from evaluation benchmarks. Despite training on one-tenth the sample size of the original Orca paper (significantly reducing the cost and carbon footprint of training compared to the original paper), the resulting FreeWilly models demonstrate exceptional performance across various benchmarks, which validates our approach to synthetically generated datasets.
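The benchmark filtering step described above can be illustrated with a short sketch. The post does not say how the filter was implemented; the version below assumes a simple exact match on normalized instruction text, and the function names and toy records are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace so trivial edits still match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def remove_benchmark_overlap(examples, benchmark_prompts):
    """Drop training examples whose instruction matches a benchmark prompt.

    Assumption: exact match after normalization. A real pipeline might also
    use n-gram overlap or dataset provenance metadata.
    """
    blocked = {normalize(prompt) for prompt in benchmark_prompts}
    return [ex for ex in examples if normalize(ex["instruction"]) not in blocked]


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    train = [
        {"instruction": "Which gas do plants absorb?", "output": "Carbon dioxide."},
        {"instruction": "Translate 'hello' to French.", "output": "Bonjour."},
    ]
    benchmark = ["which gas do plants absorb"]
    kept = remove_benchmark_overlap(train, benchmark)
    print(f"kept {len(kept)} of {len(train)} examples")
```

Exact matching is cheap but conservative: it only catches near-verbatim copies, so paraphrased benchmark items would pass through.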

Performance Evaluation

To internally evaluate these models, we used EleutherAI’s lm-eval-harness, to which we added AGIEval.

Both FreeWilly models excel in many areas, including intricate reasoning, understanding linguistic subtleties, and answering complex questions in specialized domains such as law and mathematical problem-solving.

Open LLM Leaderboard benchmarks:

These FreeWilly results were evaluated by Stability AI researchers and independently reproduced by Hugging Face on July 21st, 2023, and published in their leaderboard.

GPT4ALL benchmarks (all 0-shot):

AGI Eval (all 0-shot):

Contributing to an open future

FreeWilly1 and FreeWilly2 set a new standard in the field of open access Large Language Models. They both significantly advance research, enhance natural language understanding, and enable complex tasks. We are excited about the endless possibilities that these models will bring to the AI community, and the new applications they will inspire.

We would like to express our sincere gratitude to our passionate team of researchers, engineers, and collaborators, whose remarkable efforts and dedication have enabled us to reach this significant milestone.

Stay tuned for more exciting developments, and begin exploring the incredible potential of FreeWilly today!


1. The weights for FreeWilly2 are released as-is, while FreeWilly1’s are released as deltas over the original model. Both models are released under CC-BY-NC 4.0.

2. These include the ARC-Challenge and others on the Open LLM Leaderboard, and GPT4ALL’s Performance Benchmarks

3. As reported in the “GPT-4 Technical Report” from OpenAI (March 27th, 2023)

4. As reported in the paper “Orca: Progressive Learning from Complex Explanation Traces of GPT-4” from Microsoft Research (June 5th, 2023)
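Note 1 says FreeWilly1’s weights are released as deltas over the original model. As a rough illustration of what that implies, the sketch below assumes the common convention that each delta is the fine-tuned weight minus the base weight, so the usable weights are recovered by adding the two. The actual release tooling is not described here, and the function name, parameter names, and values are hypothetical.

```python
def apply_delta(base: dict, delta: dict) -> dict:
    """Return base + delta for every parameter.

    Both arguments map a parameter name to a flat list of floats.
    Assumption: delta = fine-tuned weights - base weights.
    """
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    merged = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise addition reconstructs the fine-tuned parameter.
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


if __name__ == "__main__":
    # Hypothetical toy tensors standing in for real model weights.
    base = {"layer0.weight": [1.0, -2.0], "layer0.bias": [0.0]}
    delta = {"layer0.weight": [0.5, 0.25], "layer0.bias": [1.0]}
    print(apply_delta(base, delta))
```

Distributing only the difference means a user needs access to the original base weights to reconstruct the fine-tuned model.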