How to run LLaMA-13B or OpenChat-8192 on a Single GPU — Pragnakalp Techlabs: AI, NLP, Chatbot, Python Development

Recently, numerous open-source large language models (LLMs) have been launched. These powerful models hold great potential for a wide range of applications. However, one major challenge that arises is the limitation of resources when it comes to testing these models. While platforms like Google Colab Pro offer the ability to test up to 7B models, what options do we have when we wish to experiment with even larger models, such as 13B?

In this blog post, we will see how we can run the LLaMA-13B and OpenChat-8192 models on a single GPU. Here we are using Google Colab Pro, which provides a T4 GPU and 25 GB of system RAM. Let’s check how to run it step by step.
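Before starting, it is worth confirming which GPU your Colab session was assigned; a minimal check, assuming a standard Colab runtime where PyTorch is preinstalled:

import torch

# Check that a CUDA device is available and which GPU was assigned
print(torch.cuda.is_available())                               # expect True
print(torch.cuda.get_device_name(0))                           # e.g. "Tesla T4"
print(torch.cuda.get_device_properties(0).total_memory / 1e9)  # VRAM in GB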

Step 1:

Install the requirements. You need to install accelerate and transformers from source, and make sure you have the latest version of the bitsandbytes library (0.39.0) installed.

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install sentencepiece
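Once the installs finish, you can quickly confirm that the bitsandbytes version meets the 0.39.0 requirement mentioned above:

import bitsandbytes

# The 4-bit features used below require bitsandbytes 0.39.0 or newer
print(bitsandbytes.__version__)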

Step 2:

We are using the quantization technique in our approach, employing the BitsAndBytes functionality from the transformers library. This technique allows us to perform quantization using various 4-bit variants, such as NF4 (normalized float 4, which is the default) or pure FP4 quantization. With 4-bit bitsandbytes, weights are stored in 4 bits, while the computation can still occur in 16 or 32 bits. Different combinations, including float16, bfloat16, and float32, can be chosen for computation.

To make matrix multiplication and training more efficient, we recommend using a 16-bit compute dtype (the default is torch.float32). The recently introduced BitsAndBytesConfig in transformers gives us the flexibility to modify these parameters according to specific requirements.

import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
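For reference, the same config object also exposes the variants described above. A minimal sketch of an alternative configuration, reusing the imports above (the flag values here are illustrative, not required by this tutorial):

# Illustrative alternative: explicit NF4 with nested ("double") quantization
quantization_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # "nf4" (default) or "fp4"
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # weights stay 4-bit, compute runs in bf16
)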

Step 3:

With the configuration in place, in this step we will load the tokenizer and the model. Here we are using the OpenChat model, but you can use any 13B model available on the Hugging Face Hub.

If you want to use the LLaMA-13B model instead, just change the model_id to “openlm-research/open_llama_13b” and run the steps below again.

model_id = "openchat/openchat_8192" tokenizer = AutoTokenizer.from_pretrained(model_id) model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

Step 4:

Once we have loaded the model, it is time to test it. You can provide any input of your choice, and set the “max_new_tokens” parameter to the number of tokens you would like to generate.

text = "Q: What is the largest animal?nA:" device = "cuda:0" inputs = tokenizer(text, return_tensors="pt").to(device) outputs = model_bf16.generate(**inputs, max_new_tokens=35) print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

You can run any 13B model on a single GPU or Google Colab Pro using this quantization technique.
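One last note: the generation example above decodes greedily. If you want more varied completions, generate() accepts the usual sampling parameters; a minimal sketch with illustrative, untuned values:

# Illustrative sampling settings for more varied completions
outputs = model_bf16.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # < 1.0 sharpens the distribution
    top_p=0.9,         # nucleus sampling over the top 90% probability mass
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))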

Originally published at How To Run LLaMA-13B Or OpenChat-8192 On A Single GPU on July 14, 2023.

