Add Llama-2 70B (akash-network#423)
* add Llama-2-70B

* Update README.md

* Update deploy.yaml
yuravorobei authored Sep 1, 2023
1 parent 80fb797 commit 4e83673
Showing 8 changed files with 383 additions and 0 deletions.
19 changes: 19 additions & 0 deletions Llama-2-70B/Dockerfile
@@ -0,0 +1,19 @@
FROM python:3.9

WORKDIR /app

# avoid interactive prompts from apt during the build
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get -y update && apt-get upgrade -y
RUN apt-get install -y tzdata software-properties-common

# Install Miniconda and PyTorch + CUDA 11.8
RUN curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN sh Miniconda3-latest-Linux-x86_64.sh -b -p ./miniconda
RUN miniconda/bin/conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install required python packages
RUN pip install -U scipy sentencepiece protobuf uvicorn fastapi bitsandbytes
RUN pip install -U git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git

# Copy source files
COPY . /app
43 changes: 43 additions & 0 deletions Llama-2-70B/README.md
@@ -0,0 +1,43 @@
# Llama-2 70B

Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 is an auto-regressive language model that uses an optimized transformer architecture.

This deployment uses the [meta-llama/Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf) pretrained model, which generates a continuation of the incoming text. However, you must be granted access by Meta before you can use this model. Nothing complicated, but it is a bit inconvenient, so I created my own openly accessible Hugging Face repository, [cryptoman/converted-llama-2-70b](https://huggingface.co/cryptoman/converted-llama-2-70b), with the same model weights, since the license allows it.

There is also a [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) model, which is optimized for dialogue use cases and can answer questions.
The "converted" and "hf" in the model names mean that the model has been converted to the Hugging Face Transformers format.


In this deployment, the model is loaded using QLoRA, a method that reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit finetuning; it enables finetuning a 33B model on a single 24GB GPU and a 65B model on a single 48GB GPU. QLoRA uses 4-bit quantization to compress a pretrained language model. Here the model is loaded in 4-bit with NF4 quantization, double quantization, and a bfloat16 compute dtype.
More details can be found in [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
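
This is the corresponding loading code from [main.py](main.py):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and a bfloat16 compute dtype
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "cryptoman/converted-llama-2-70b",
    quantization_config=nf4_config,
    device_map="auto"  # place layers automatically on the available GPU(s)
)
```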


## Deploying

This model **requires more than 40 GB of GPU VRAM**. Tested on NVIDIA A6000 and H100 GPUs.
![a6000_smi](https://github.com/yuravorobei/awesome-akash/assets/19820490/5ad4c8a7-8d58-4a68-8d92-7f8be088b437)
![h100_smi](https://github.com/yuravorobei/awesome-akash/assets/19820490/80dad510-af3f-44f2-b634-6eccfda1b961)


An NVIDIA A100 (40 GB) falls only about 300 MB short of the required VRAM, so the model does not run on it; perhaps there are settings that would make it work, but I did not succeed. The application starts, but when the generation function is called, a "CUDA error: out of memory" appears.
![a100_smi](https://github.com/yuravorobei/awesome-akash/assets/19820490/cc3ae93e-f478-4aa5-a7c3-94d4db8c1f31)


When the deployment begins, 15 model weight files totaling **130 GB** are downloaded and loaded into memory, which may take some time. You can watch the loading process in the logs.

Use [this SDL](deploy.yaml) to deploy the application on Akash. The SDL defines two environment variables:
- `MAX_INPUT_TOKEN_LENGTH` — the maximum number of incoming text tokens the model will actually process. The input text is truncated on the left, since the model writes a continuation of the entered text. A larger value lets the model see more of the context of a long input, but also requires more computing resources;
- `MAX_NEW_TOKENS` — the number of new tokens the model will generate. The larger this value, the more computing resources are required.

Choose these values based on the tasks you intend for the model and the power of the available GPUs.
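
For reference, this is how the two variables are consumed in [main.py](main.py) (an excerpt, not self-contained; `tokenizer` and `model` are created at startup):

```python
MAX_INPUT_TOKEN_LENGTH = int(os.environ.get('MAX_INPUT_TOKEN_LENGTH', 256))
MAX_NEW_TOKENS = int(os.environ.get('MAX_NEW_TOKENS', 50))

# the input is cut to at most MAX_INPUT_TOKEN_LENGTH tokens before generation
input_ids = tokenizer(input, return_tensors="pt", truncation=True,
                      max_length=MAX_INPUT_TOKEN_LENGTH).input_ids.to('cuda')

# the model generates at most MAX_NEW_TOKENS tokens on top of the input
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9,
                            max_new_tokens=MAX_NEW_TOKENS)
```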


## Logs
The logs in the screenshot below show that the model weight files have finished loading, the Uvicorn web server has started, and the application is ready to work.
![logs](https://github.com/yuravorobei/awesome-akash/assets/19820490/bfc6606d-63bf-4e48-82be-e533d24d493b)
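
Once the server is up, you can also query the endpoint directly. A minimal client sketch (the base URL is a placeholder for the URI your Akash provider assigns):

```python
import requests

# hypothetical deployment URI; replace with the one assigned by your provider
BASE_URL = "http://your-deployment-uri"

resp = requests.get(f"{BASE_URL}/talk",
                    params={"input": "My name is Merve and my favorite "})
print(resp.json())  # {"output": "..."} on success, {"error": "..."} on failure
```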


## Demo Video
https://github.com/yuravorobei/awesome-akash/assets/19820490/7b18b65c-9d51-4c3d-8099-fa6aa954a47a

50 changes: 50 additions & 0 deletions Llama-2-70B/deploy.yaml
@@ -0,0 +1,50 @@
---
version: "2.0"

services:
  app:
    image: yuravorobei/llama-2:0.6
    env:
      - "MAX_INPUT_TOKEN_LENGTH=256"
      - "MAX_NEW_TOKENS=50"

    command:
      - "bash"
      - "-c"
    args:
      - 'uvicorn main:app --host 0.0.0.0 --port 7860'

    expose:
      - port: 7860
        as: 80
        to:
          - global: true

profiles:
  compute:
    app:
      resources:
        cpu:
          units: 4
        memory:
          size: 15Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
        storage:
          - size: 150Gi
  placement:
    akash:
      attributes:
      pricing:
        app:
          denom: uakt
          amount: 100000

deployment:
  app:
    akash:
      profile: app
      count: 1
63 changes: 63 additions & 0 deletions Llama-2-70B/main.py
@@ -0,0 +1,63 @@
import os

import torch
from fastapi import FastAPI, Request
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer


MAX_INPUT_TOKEN_LENGTH = int(os.environ.get('MAX_INPUT_TOKEN_LENGTH', 256))
MAX_NEW_TOKENS = int(os.environ.get('MAX_NEW_TOKENS', 50))

# load the model in 4-bit using NF4 quantization with double quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "cryptoman/converted-llama-2-70b",
    quantization_config=nf4_config,
    device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained("cryptoman/converted-llama-2-70b")

app = FastAPI()

# set FastAPI directories
app.mount("/static", StaticFiles(directory="static"), name="static")
templates = Jinja2Templates(directory="templates")


@app.get("/talk")
def talk(input: str):
    try:
        # convert the input text into tokens, cutting it to at most
        # MAX_INPUT_TOKEN_LENGTH tokens
        input_ids = tokenizer(
            input,
            return_tensors="pt",
            truncation=True,
            max_length=MAX_INPUT_TOKEN_LENGTH
        ).input_ids.to('cuda')

        # call the model to generate a continuation of the input
        gen_tokens = model.generate(
            input_ids,
            do_sample=True,
            temperature=0.9,
            max_new_tokens=MAX_NEW_TOKENS,  # new tokens generated on top of the input
        )
        generated_text = tokenizer.batch_decode(gen_tokens)[0]  # decode generated tokens to text
        truncated_text = tokenizer.batch_decode(input_ids)[0]   # decode the truncated input tokens
        output_text = generated_text[len(truncated_text):]      # strip the input text from the output

        return {"output": output_text}
    except Exception as e:
        return {"error": f"Server ERROR: {e}"}


# main page
@app.get("/")
def index(request: Request):
    # render the main template
    return templates.TemplateResponse("index.html", {
        "request": request,
    })
40 changes: 40 additions & 0 deletions Llama-2-70B/static/script.js
@@ -0,0 +1,40 @@

$(document).ready(function() {
    var text_box = $('#text-box');
    var talk_button = document.getElementById("talk-button");

    text_box.val("My name is Merve and my favorite ");  // add default text to the text box

    // 'GENERATE' button click handler
    talk_button.addEventListener('click', async (event) => {
        try {
            $('.lds-spinner').show();  // show the loading spinner
            const text = encodeURI($('#text-box').val());  // URI-encode the input text
            const response = await fetch(`talk?input=${text}`);  // send the text-box text to the server
            const responseJson = await response.json();  // parse the response as JSON

            if (responseJson.error == undefined) {  // no errors
                const output_text = responseJson.output;  // get the result text from the response
                console.log(output_text);

                // type the result into the text box with a typing effect
                new TypeIt("#text-box", {
                    strings: output_text,
                    speed: 20,
                }).go();
            } else {  // the server returned an error
                text_box.css('color', '#ff0303');
                text_box.val(responseJson.error);
            }
        } catch (error) {
            alert('server request error');
        }
        $('.lds-spinner').hide();  // hide the loading spinner
    });
});


132 changes: 132 additions & 0 deletions Llama-2-70B/static/style.css
@@ -0,0 +1,132 @@
body {
  background-color: rgb(19 40 47);
  color: #ffffff;
}


.form-block {
  display: flex;
  flex-direction: column;
  flex-wrap: wrap;
  align-content: flex-start;
  align-items: center;
}

.form-block .form-header a {
  color: #ffffff;
}

.form-block .form-body {
  display: flex;
  flex-direction: column;
  width: 50em;
}

#text-box {
  width: 100%;
  min-height: 300px;
  background-color: rgb(19 40 47);
  color: #ffffff;
  border: 2px solid rgb(50 95 110);
  padding: 15px;
  font-size: 15px;
}

#talk-button {
  margin: 10px;
  height: 40px;
  width: 100px;
  border: 2px solid rgb(50 95 110);
  background-color: #3a6370;
  color: white;
}

#talk-button:hover {
  background-color: grey;
}

#talk-button:active {
  height: 38px;
  width: 98px;
}


.lds-spinner {
  position: relative;
  width: 39px;
  height: 27px;
  display: inline-block;
}
.lds-spinner div {
  transform-origin: 40px 22px;
  animation: lds-spinner 1.2s linear infinite;
}
.lds-spinner div:after {
  content: " ";
  display: block;
  position: absolute;
  top: 3px;
  left: 37px;
  width: 6px;
  height: 18px;
  border-radius: 20%;
  background: #fff;
}
.lds-spinner div:nth-child(1) {
  transform: rotate(0deg);
  animation-delay: -1.1s;
}
.lds-spinner div:nth-child(2) {
  transform: rotate(30deg);
  animation-delay: -1s;
}
.lds-spinner div:nth-child(3) {
  transform: rotate(60deg);
  animation-delay: -0.9s;
}
.lds-spinner div:nth-child(4) {
  transform: rotate(90deg);
  animation-delay: -0.8s;
}
.lds-spinner div:nth-child(5) {
  transform: rotate(120deg);
  animation-delay: -0.7s;
}
.lds-spinner div:nth-child(6) {
  transform: rotate(150deg);
  animation-delay: -0.6s;
}
.lds-spinner div:nth-child(7) {
  transform: rotate(180deg);
  animation-delay: -0.5s;
}
.lds-spinner div:nth-child(8) {
  transform: rotate(210deg);
  animation-delay: -0.4s;
}
.lds-spinner div:nth-child(9) {
  transform: rotate(240deg);
  animation-delay: -0.3s;
}
.lds-spinner div:nth-child(10) {
  transform: rotate(270deg);
  animation-delay: -0.2s;
}
.lds-spinner div:nth-child(11) {
  transform: rotate(300deg);
  animation-delay: -0.1s;
}
.lds-spinner div:nth-child(12) {
  transform: rotate(330deg);
  animation-delay: 0s;
}
@keyframes lds-spinner {
  0% {
    opacity: 1;
  }
  100% {
    opacity: 0;
  }
}

35 changes: 35 additions & 0 deletions Llama-2-70B/templates/index.html
@@ -0,0 +1,35 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>Llama2-70B</title>
  <link rel="stylesheet" href="{{ url_for('static', path='/style.css') }}" />
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
  <script src="https://unpkg.com/[email protected]/dist/index.umd.js"></script>
  <script type="text/javascript" src="{{ url_for('static', path='/script.js') }}"></script>
</head>
<body>
  <main>
    <div class="form-block">
      <div class="form-header">
        <h1>Llama-2-70b</h1>
        <p>Model link:
          <a href="https://huggingface.co/meta-llama/Llama-2-70b" rel="noreferrer" target="_blank">meta-llama/Llama-2-70b</a>
        </p>
      </div>
      <div class="form-body">
        <p style="text-align: center; font-size: 20px; color: white;">This model generates a logical continuation of your text</p>
        <textarea id="text-box"></textarea>
        <div>
          <button id="talk-button">GENERATE</button>
          <div class="lds-spinner" style="display: none;"><div></div><div></div><div></div><div></div><div></div><div></div>
            <div></div><div></div><div></div><div></div><div></div><div></div></div>
        </div>
      </div>
    </div>
  </main>
</body>
</html>
1 change: 1 addition & 0 deletions README.md
@@ -77,6 +77,7 @@ Available on Testnet Only
- [FastChat](FastChat)
- [Flan-T5 XXL](flan-t5-xxl)
- [GPT-Neo](gpt-neo)
- [Llama-2-70B](Llama-2-70B)
- [RedPajama-INCITE-7B-Instruct](redpajama-incite-7b-instruct)
- [Semantra](semantra)
- [Serge](serge-gpu)
