Add Llama-2 70B (akash-network#423)
* add Llama-2-70B

* Update README.md

* Update deploy.yaml
yuravorobei authored Sep 1, 2023
1 parent 80fb797 commit 4e83673
Showing 8 changed files with 383 additions and 0 deletions.
19 changes: 19 additions & 0 deletions Llama-2-70B/Dockerfile
@@ -0,0 +1,19 @@
FROM python:3.9

WORKDIR /app

# avoid interactive prompts from apt during the build
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get -y update && apt-get upgrade -y
RUN apt-get install -y tzdata software-properties-common

# Install Miniconda and PyTorch + CUDA 11.8
RUN curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN sh Miniconda3-latest-Linux-x86_64.sh -b -p ./miniconda
RUN miniconda/bin/conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install required python packages
RUN pip install -U scipy sentencepiece protobuf uvicorn fastapi bitsandbytes
RUN pip install -U git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git

# Copy source files
COPY . /app
43 changes: 43 additions & 0 deletions Llama-2-70B/README.md
@@ -0,0 +1,43 @@
# Llama-2 70B

Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 is an auto-regressive language model that uses an optimized transformer architecture.

This deployment uses the [meta-llama/Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf) pretrained model, which generates a continuation of the incoming text. However, you must be granted access by Meta before you can use this model. Nothing complicated, but it is a bit inconvenient, so I created my own openly accessible Hugging Face repository, [cryptoman/converted-llama-2-70b](https://huggingface.co/cryptoman/converted-llama-2-70b), with the same model weights, since the license allows it.

There is also a [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) model, which is optimized for dialogue use cases and can answer questions.
The "converted" and "hf" in the model names mean that the model has been converted to the Hugging Face Transformers format.


In this deployment, the model is loaded using QLoRA, a method that reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit finetuning; it enables finetuning a 33B model on a single 24GB GPU and a 65B model on a single 48GB GPU. QLoRA uses 4-bit quantization to compress a pretrained language model. Here the model is loaded in 4-bit with NF4 quantization, double quantization, and a bfloat16 compute dtype.
More details can be found in [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
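
This is the corresponding loading code from [main.py](main.py):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and a bfloat16 compute dtype
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "cryptoman/converted-llama-2-70b",
    quantization_config=nf4_config,
    device_map="auto"  # place layers automatically on the available GPU(s)
)
```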


## Deploying

This model **requires more than 40 GB of GPU VRAM**. Tested on NVIDIA A6000 and H100 GPUs.
![a6000_smi](https://github.com/yuravorobei/awesome-akash/assets/19820490/5ad4c8a7-8d58-4a68-8d92-7f8be088b437)
![h100_smi](https://github.com/yuravorobei/awesome-akash/assets/19820490/80dad510-af3f-44f2-b634-6eccfda1b961)


An NVIDIA A100 (40 GB) falls only about 300 MB short of the required VRAM, so the model does not run on it; perhaps there are settings that would make it work, but I did not succeed. The application starts, but when the generation function is called, a "CUDA error: out of memory" appears.
![a100_smi](https://github.com/yuravorobei/awesome-akash/assets/19820490/cc3ae93e-f478-4aa5-a7c3-94d4db8c1f31)


When the deployment begins, 15 model weight files totaling **130 GB** are downloaded and loaded into memory, which may take some time. You can watch the loading process in the logs.

Use [this SDL](deploy.yaml) to deploy the application on Akash. The SDL defines two environment variables:
- `MAX_INPUT_TOKEN_LENGTH` — the maximum number of incoming text tokens the model will actually process. The input text is truncated on the left, since the model writes a continuation of the entered text. A larger value lets the model see more of the context of a long input, but also requires more computing resources;
- `MAX_NEW_TOKENS` — the number of new tokens the model will generate. The larger this value, the more computing resources are required.

Choose these values based on the tasks you intend for the model and the power of the available GPUs.
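
For reference, this is how the two variables are consumed in [main.py](main.py) (an excerpt, not self-contained; `tokenizer` and `model` are created at startup):

```python
MAX_INPUT_TOKEN_LENGTH = int(os.environ.get('MAX_INPUT_TOKEN_LENGTH', 256))
MAX_NEW_TOKENS = int(os.environ.get('MAX_NEW_TOKENS', 50))

# the input is cut to at most MAX_INPUT_TOKEN_LENGTH tokens before generation
input_ids = tokenizer(input, return_tensors="pt", truncation=True,
                      max_length=MAX_INPUT_TOKEN_LENGTH).input_ids.to('cuda')

# the model generates at most MAX_NEW_TOKENS tokens on top of the input
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9,
                            max_new_tokens=MAX_NEW_TOKENS)
```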


## Logs
The logs in the screenshot below show that the model weight files have finished loading, the Uvicorn web server has started, and the application is ready to work.
![logs](https://github.com/yuravorobei/awesome-akash/assets/19820490/bfc6606d-63bf-4e48-82be-e533d24d493b)
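
Once the server is up, you can also query the endpoint directly. A minimal client sketch (the base URL is a placeholder for the URI your Akash provider assigns):

```python
import requests

# hypothetical deployment URI; replace with the one assigned by your provider
BASE_URL = "http://your-deployment-uri"

resp = requests.get(f"{BASE_URL}/talk",
                    params={"input": "My name is Merve and my favorite "})
print(resp.json())  # {"output": "..."} on success, {"error": "..."} on failure
```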


## Demo Video
https://github.com/yuravorobei/awesome-akash/assets/19820490/7b18b65c-9d51-4c3d-8099-fa6aa954a47a

50 changes: 50 additions & 0 deletions Llama-2-70B/deploy.yaml
@@ -0,0 +1,50 @@
---
version: "2.0"

services:
  app:
    image: yuravorobei/llama-2:0.6
    env:
      - "MAX_INPUT_TOKEN_LENGTH=256"
      - "MAX_NEW_TOKENS=50"

    command:
      - "bash"
      - "-c"
    args:
      - 'uvicorn main:app --host 0.0.0.0 --port 7860'

    expose:
      - port: 7860
        as: 80
        to:
          - global: true

profiles:
  compute:
    app:
      resources:
        cpu:
          units: 4
        memory:
          size: 15Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
        storage:
          - size: 150Gi
  placement:
    akash:
      attributes:
      pricing:
        app:
          denom: uakt
          amount: 100000

deployment:
  app:
    akash:
      profile: app
      count: 1
63 changes: 63 additions & 0 deletions Llama-2-70B/main.py
@@ -0,0 +1,63 @@
import os

import torch
from fastapi import FastAPI, Request
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer


MAX_INPUT_TOKEN_LENGTH = int(os.environ.get('MAX_INPUT_TOKEN_LENGTH', 256))
MAX_NEW_TOKENS = int(os.environ.get('MAX_NEW_TOKENS', 50))

# load the model in 4-bit using NF4 quantization with double quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "cryptoman/converted-llama-2-70b",
    quantization_config=nf4_config,
    device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained("cryptoman/converted-llama-2-70b")

app = FastAPI()

# set FastAPI directories
app.mount("/static", StaticFiles(directory="static"), name="static")
templates = Jinja2Templates(directory="templates")


@app.get("/talk")
def talk(input: str):
    try:
        # convert the input text into tokens, cutting it to at most
        # MAX_INPUT_TOKEN_LENGTH tokens
        input_ids = tokenizer(
            input,
            return_tensors="pt",
            truncation=True,
            max_length=MAX_INPUT_TOKEN_LENGTH
        ).input_ids.to('cuda')

        # call the model to generate a continuation of the input
        gen_tokens = model.generate(
            input_ids,
            do_sample=True,
            temperature=0.9,
            max_new_tokens=MAX_NEW_TOKENS,  # new tokens generated on top of the input
        )
        generated_text = tokenizer.batch_decode(gen_tokens)[0]  # decode generated tokens to text
        truncated_text = tokenizer.batch_decode(input_ids)[0]   # decode the truncated input tokens
        output_text = generated_text[len(truncated_text):]      # strip the input text from the output

        return {"output": output_text}
    except Exception as e:
        return {"error": f"Server ERROR: {e}"}


# main page
@app.get("/")
def index(request: Request):
    # render the main template
    return templates.TemplateResponse("index.html", {
        "request": request,
    })
40 changes: 40 additions & 0 deletions Llama-2-70B/static/script.js
@@ -0,0 +1,40 @@

$(document).ready(function() {
    var text_box = $('#text-box');
    var talk_button = document.getElementById("talk-button");

    text_box.val("My name is Merve and my favorite ");  // add default text to the text box

    // 'GENERATE' button click handler
    talk_button.addEventListener('click', async (event) => {
        try {
            $('.lds-spinner').show();  // show the loading spinner
            const text = encodeURI($('#text-box').val());  // URI-encode the input text
            const response = await fetch(`talk?input=${text}`);  // send the text-box text to the server
            const responseJson = await response.json();  // parse the response as JSON

            if (responseJson.error == undefined) {  // no errors
                const output_text = responseJson.output;  // get the result text from the response
                console.log(output_text);

                // type the result into the text box with a typing effect
                new TypeIt("#text-box", {
                    strings: output_text,
                    speed: 20,
                }).go();
            } else {  // the server returned an error
                text_box.css('color', '#ff0303');
                text_box.val(responseJson.error);
            }
        } catch (error) {
            alert('server request error');
        }
        $('.lds-spinner').hide();  // hide the loading spinner
    });
});


132 changes: 132 additions & 0 deletions Llama-2-70B/static/style.css
@@ -0,0 +1,132 @@
body {
  background-color: rgb(19 40 47);
  color: #ffffff;
}


.form-block {
  display: flex;
  flex-direction: column;
  flex-wrap: wrap;
  align-content: flex-start;
  align-items: center;
}

.form-block .form-header a {
  color: #ffffff;
}

.form-block .form-body {
  display: flex;
  flex-direction: column;
  width: 50em;
}

#text-box {
  width: 100%;
  min-height: 300px;
  background-color: rgb(19 40 47);
  color: #ffffff;
  border: 2px solid rgb(50 95 110);
  padding: 15px;
  font-size: 15px;
}

#talk-button {
  margin: 10px;
  height: 40px;
  width: 100px;
  border: 2px solid rgb(50 95 110);
  background-color: #3a6370;
  color: white;
}

#talk-button:hover {
  background-color: grey;
}

#talk-button:active {
  height: 38px;
  width: 98px;
}


.lds-spinner {
  position: relative;
  width: 39px;
  height: 27px;
  display: inline-block;
}
.lds-spinner div {
  transform-origin: 40px 22px;
  animation: lds-spinner 1.2s linear infinite;
}
.lds-spinner div:after {
  content: " ";
  display: block;
  position: absolute;
  top: 3px;
  left: 37px;
  width: 6px;
  height: 18px;
  border-radius: 20%;
  background: #fff;
}
.lds-spinner div:nth-child(1) {
  transform: rotate(0deg);
  animation-delay: -1.1s;
}
.lds-spinner div:nth-child(2) {
  transform: rotate(30deg);
  animation-delay: -1s;
}
.lds-spinner div:nth-child(3) {
  transform: rotate(60deg);
  animation-delay: -0.9s;
}
.lds-spinner div:nth-child(4) {
  transform: rotate(90deg);
  animation-delay: -0.8s;
}
.lds-spinner div:nth-child(5) {
  transform: rotate(120deg);
  animation-delay: -0.7s;
}
.lds-spinner div:nth-child(6) {
  transform: rotate(150deg);
  animation-delay: -0.6s;
}
.lds-spinner div:nth-child(7) {
  transform: rotate(180deg);
  animation-delay: -0.5s;
}
.lds-spinner div:nth-child(8) {
  transform: rotate(210deg);
  animation-delay: -0.4s;
}
.lds-spinner div:nth-child(9) {
  transform: rotate(240deg);
  animation-delay: -0.3s;
}
.lds-spinner div:nth-child(10) {
  transform: rotate(270deg);
  animation-delay: -0.2s;
}
.lds-spinner div:nth-child(11) {
  transform: rotate(300deg);
  animation-delay: -0.1s;
}
.lds-spinner div:nth-child(12) {
  transform: rotate(330deg);
  animation-delay: 0s;
}
@keyframes lds-spinner {
  0% {
    opacity: 1;
  }
  100% {
    opacity: 0;
  }
}

35 changes: 35 additions & 0 deletions Llama-2-70B/templates/index.html
@@ -0,0 +1,35 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>Llama2-70B</title>
  <link rel="stylesheet" href="{{ url_for('static', path='/style.css') }}" />
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
  <script src="https://unpkg.com/[email protected]/dist/index.umd.js"></script>
  <script type="text/javascript" src="{{ url_for('static', path='/script.js') }}"></script>
</head>
<body>
  <main>
    <div class="form-block">
      <div class="form-header">
        <h1>Llama-2-70b</h1>
        <p>Model link:
          <a href="https://huggingface.co/meta-llama/Llama-2-70b" rel="noreferrer" target="_blank">meta-llama/Llama-2-70b</a>
        </p>
      </div>
      <div class="form-body">
        <p style="text-align: center; font-size: 20px; color: white;">This model generates a logical continuation of your text</p>
        <textarea id="text-box"></textarea>
        <div>
          <button id="talk-button">GENERATE</button>
          <div class="lds-spinner" style="display: none;"><div></div><div></div><div></div><div></div><div></div><div></div>
            <div></div><div></div><div></div><div></div><div></div><div></div></div>
        </div>
      </div>
    </div>
  </main>
</body>
</html>
1 change: 1 addition & 0 deletions README.md
@@ -77,6 +77,7 @@ Available on Testnet Only
- [FastChat](FastChat)
- [Flan-T5 XXL](flan-t5-xxl)
- [GPT-Neo](gpt-neo)
- [Llama-2-70B](Llama-2-70B)
- [RedPajama-INCITE-7B-Instruct](redpajama-incite-7b-instruct)
- [Semantra](semantra)
- [Serge](serge-gpu)
