How to run LLaMA-13B or OpenChat-8192 on a Single GPU — Pragnakalp Techlabs: AI, NLP, Chatbot, Python Development

Recently, numerous open-source large language models (LLMs) have been released. These powerful models hold great potential for a wide range of applications. However, one major challenge is the limited hardware available for testing them. While platforms like Google Colab Pro make it possible to test models of up to about 7B parameters, what options do we have when we want to experiment with even larger models, such as 13B?

In this blog post, we will see how to run the LLaMA-13B and OpenChat-8192 models on a single GPU. Here we are using Google Colab Pro, which provides a T4 GPU and 25 GB of system RAM. Let's walk through it step by step.

Step 1:

Install the requirements. You need to install accelerate and transformers from source, and make sure you have the latest version of the bitsandbytes library (0.39.0) installed.

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install sentencepiece

Step 2:

Our approach uses 4-bit quantization, via the BitsAndBytes integration in the transformers library. This technique supports several 4-bit variants, such as NF4 (normalized float 4, the default) or pure FP4 quantization. With 4-bit bitsandbytes, weights are stored in 4 bits, while computation can still occur in 16 or 32 bits; any of float16, bfloat16, and float32 can be chosen as the compute dtype.

For efficient matrix multiplication and training, we recommend using a 16-bit compute dtype (the default is torch.float32). The recently introduced BitsAndBytesConfig in transformers provides the flexibility to adjust these parameters to your requirements.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Store weights in 4 bits; run computation in bfloat16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
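The same BitsAndBytesConfig also controls which 4-bit variant is used. As a minimal sketch (assuming the bnb_4bit_quant_type and bnb_4bit_use_double_quant options available in recent transformers releases), you could request NF4 explicitly and enable nested quantization:

# Explicitly select NF4 (the default) and enable "double" quantization, which
# also quantizes the quantization constants to save a little more memory.
# Swap "nf4" for "fp4" to use pure FP4 quantization instead.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

Passing nf4_config in place of quantization_config in the next step would load the model with these settings.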

Step 3:

With the configuration in place, in this step we will load the tokenizer and the model. Here we are using the OpenChat model, but you can use any 13B model available on the Hugging Face Hub.

If you want to use the LLaMA-13B model instead, just change the model_id to "openlm-research/open_llama_13b" and run the steps below again (a sketch of this swap follows the code).

model_id = "openchat/openchat_8192"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model weights in 4 bits using the quantization config above
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config
)
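For instance, the LLaMA-13B swap mentioned above is a one-line change; everything else stays the same:

# Same loading code, with the OpenLLaMA-13B checkpoint swapped in
model_id = "openlm-research/open_llama_13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config
)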

Step 4:

Once the model is loaded, it is time to test it. You can provide any input of your choice, and increase the "max_new_tokens" parameter to the number of tokens you would like to generate.

text = "Q: What is the largest animal?\nA:"
device = "cuda:0"

# Tokenize the prompt and move the input tensors to the GPU
inputs = tokenizer(text, return_tensors="pt").to(device)

# Generate up to 35 new tokens and decode the result
outputs = model_bf16.generate(**inputs, max_new_tokens=35)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

Using this quantization technique, you can run any 13B model on a single GPU or on Google Colab Pro.
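As an optional sanity check that the quantized model really fits on a single GPU, you can inspect its memory usage. A minimal sketch, assuming the get_memory_footprint() helper that transformers models expose and PyTorch's CUDA memory statistics:

# Size of the quantized model's weights (bytes converted to GB)
print(f"Model footprint: {model_bf16.get_memory_footprint() / 1e9:.2f} GB")

# Peak GPU memory allocated in this session, including generation buffers
print(f"Peak CUDA memory: {torch.cuda.max_memory_allocated(device) / 1e9:.2f} GB")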

Originally published at How To Run LLaMA-13B Or OpenChat-8192 On A Single GPU on July 14, 2023.

