🐞 Bug: vram estimate issue with gemma2 Ollama model #117

Closed
1 task
robbiemu opened this issue Sep 21, 2024 · 10 comments · Fixed by #118
Assignees
Labels
bug Something isn't working

Comments

@robbiemu

Description

For some reason, of all my models, Gemma2 is the only one that doesn't get good VRAM estimates.

ollama show gemma2:27b-instruct-q6_K
  Model
    architecture        gemma2
    parameters          27.2B
    context length      8192
    embedding length    4608
    quantization        Q6_K

  Parameters
    num_ctx    4096
    stop       "<start_of_turn>"
    stop       "<end_of_turn>"
    
gollama -vram gemma2:27b-instruct-q6_K --fits 40
📊 VRAM Estimation for Model: gemma2:27b-instruct-q6_K

| QUANT|CTX | BPW  | 2K  | 8K  |     16K      |     32K      |     49K      |     64K      |
|-----------|------|-----|-----|--------------|--------------|--------------|--------------|
| IQ1_S     | 1.56 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ2_XXS   | 2.06 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ2_XS    | 2.31 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ2_S     | 2.50 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ2_M     | 2.70 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ3_XXS   | 3.06 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ3_XS    | 3.30 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q2_K      | 3.35 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q3_K_S    | 3.50 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ3_S     | 3.50 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ3_M     | 3.70 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q3_K_M    | 3.91 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ4_XS    | 4.25 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q3_K_L    | 4.27 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| IQ4_NL    | 4.50 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q4_0      | 4.55 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q4_K_S    | 4.58 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q4_K_M    | 4.85 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q4_K_L    | 4.90 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q5_0      | 5.54 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q5_K_S    | 5.54 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q5_K_M    | 5.69 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q5_K_L    | 5.75 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q6_K      | 6.59 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |
| Q8_0      | 8.50 | NaN | NaN | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) | NaN(NaN,NaN) |

Environment

  • OS and version: Apple M3 Max, macOS Sequoia 15.0
  • Install source: Downloaded from releases (macOS)
  • Go version: go1.22.3 darwin/arm64

Can you contribute?

I'm not very familiar with Go, but if I get a good sense of what to look at I would be happy to help.

  • I will attempt to implement a fix for this bug
@robbiemu robbiemu added the bug Something isn't working label Sep 21, 2024
@robbiemu
Author

robbiemu commented Sep 21, 2024

While I'm passively offering to look at code in exchange for some orientation: I'd love it if there were a CLI flag for this command that let you specify the top context length to search for, either as a power of 2 (for example, --to-nth 16 would be the current default of 65536) or just as a flat number, and for the default to become the max context length listed in the model's show card (Ollama only, I suppose).
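To make that concrete, here's a rough sketch in Python of what I mean (purely illustrative; the flag name and range are just my proposal, and the columns gollama actually prints aren't exactly these):

def context_columns(to_nth: int = 16) -> list[int]:
    """Context sizes as powers of two, topping out at 2**to_nth tokens.

    to_nth=16 gives a top of 65536; ideally the default would instead be
    derived from the model's own max context length in its show card.
    """
    # start at 2K (2**11) and double up to the requested top
    return [2 ** k for k in range(11, to_nth + 1)]

print(context_columns(16))  # [2048, 4096, 8192, 16384, 32768, 65536]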

@sammcj
Owner

sammcj commented Sep 21, 2024

That’s a really good idea!
I’ll implement that soon.

Re: NaNs: that’s “cool” 😂, will look at this as well

@robbiemu
Author

Hey @sammcj, I think I recognize you from the Ollama Discord :) thank you for the help!

@sammcj
Owner

sammcj commented Sep 21, 2024

It's not just Gemma - it's a lot of (all?) models! Not quite sure how that happened, but I'm looking into it.

@sammcj
Owner

sammcj commented Sep 21, 2024

[Screenshot: SCR-20240921-opiu]

@sammcj
Owner

sammcj commented Sep 21, 2024

Also supports context size shorthand, e.g. 96k:

[Screenshot: SCR-20240921-oroc]
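(For context, the shorthand presumably just treats a trailing "k" as ×1024, e.g. 96k → 98304. Roughly this, sketched in Python rather than the actual Go implementation:)

def parse_context_size(value: str) -> int:
    """Parse '96k' or '98304' into a token count (assuming 'k' means x1024)."""
    v = value.strip().lower()
    return int(v[:-1]) * 1024 if v.endswith("k") else int(v)

print(parse_context_size("96k"))  # 98304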

@sammcj
Owner

sammcj commented Sep 21, 2024

There was a bit to it, but got there 😅 https://github.com/sammcj/gollama/pull/118/files

PR #118: fix: #117 feat: add --vram-to-nth to specify context size to calculate out to
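Going by the PR title and the script later in this thread, the new flag presumably slots into the original example like this (taking gemma2's 8192 context length from ollama show):

gollama -vram gemma2:27b-instruct-q6_K --fits 40 --vram-to-nth 8192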

@robbiemu
Author

robbiemu commented Sep 21, 2024

Awesome :) thank you. Here's a script I've now written to make use of it, to simplify the most frequent task I have with this vram command (it's written for a Mac, and would be complicated to add to gollama across various architectures):

#!/usr/bin/env python3
import argparse
import subprocess
import re
import sys
import logging


def run_command(command):
    """Run a shell command and return its output."""
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE, 
                                text=True, check=True)
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        logging.error(f"Error running command {' '.join(command)}: {e}")
        sys.exit(1)


def extract_quant(model_name):
    """Extract the quant value from gollama -l output for the given model."""
    output = run_command(["gollama", "-l"])
    pattern = re.compile(rf"{re.escape(model_name)}\s+\S+\s+(\S+)")
    match = pattern.search(output)
    if match:
        quant = match.group(1)
        logging.info(f"Quant value for {model_name}: {quant}")
        return quant
    else:
        logging.error(f"Could not find quant value for model '{model_name}'")
        sys.exit(1)


def extract_vram_nth(model_name):
    """Extract the context length (vram_nth) from ollama show <model> output."""
    output = run_command(["ollama", "show", model_name])
    pattern = re.compile(r"context length\s+(\d+)")
    match = pattern.search(output)
    if match:
        vram_nth = match.group(1)
        logging.info(f"VRAM nth for {model_name}: {vram_nth}")
        return vram_nth
    else:
        logging.error(f"Could not extract context length for model '{model_name}'")
        sys.exit(1)


def run_vram_estimation(model_name, vram_nth, fits_limit):
    """Run gollama -vram estimation and return the output."""
    logging.info(f"Running gollama -vram for {model_name} with fits={fits_limit} GB")
    output = run_command([
        "gollama", "-vram", model_name, "--fits", str(fits_limit), 
        "--vram-to-nth", vram_nth
    ])
    return output


def find_largest_below_fits(vram_output, quant, fits):
    """Find the largest A, B, and C values below the fits limit, along with their column names."""
    lines = vram_output.splitlines()
    header = lines[0]
    separator = lines[1]
    labels = lines[2]
    rows = lines[3:]

    logging.info("VRAM output header, labels, and rows gathered")

    # Find the quant row
    quant_row = None
    for row in rows:
        if quant in row:
            quant_row = row
            break

    if not quant_row:
        logging.error(f"Could not find matching row for quant '{quant}'")
        sys.exit(1)

    logging.info(f"Quant row: {quant_row}")

    # Extract column names from labels. Note: the header's first label
    # ("QUANT|CTX") contains a literal pipe, so slicing labels from index 3
    # leaves ["BPW", "2K", "8K", ...], while the data row's slice from index 3
    # starts at the 2K value -- hence the idx + 1 lookup further down.
    column_names = [col.strip() for col in labels.split('|')[3:]]
    columns = quant_row.split("|")[3:]

    max_A, max_A_ctx = None, None
    max_B, max_B_ctx = None, None
    max_C, max_C_ctx = None, None

    for idx, col in enumerate(columns):
        col = col.strip()
        match = re.match(r'([\d\.]+)(?:\(([\d\.]+),\s*([\d\.]+)\))?', col)
        if match:
            A_val = float(match.group(1))
            B_val = float(match.group(2) or 0)
            C_val = float(match.group(3) or 0)

            ctx_size = column_names[idx + 1] if idx + 1 < \
                len(column_names) else "Unknown"

            if A_val <= fits and (max_A is None or A_val >= max_A):
                max_A = A_val
                max_A_ctx = ctx_size

            if B_val <= fits and (max_B is None or B_val >= max_B):
                max_B = B_val
                max_B_ctx = ctx_size

            if C_val <= fits and (max_C is None or C_val >= max_C):
                max_C = C_val
                max_C_ctx = ctx_size

    logging.info(f"Max A: {max_A} at {max_A_ctx}")
    logging.info(f"Max B: {max_B} at {max_B_ctx}")
    logging.info(f"Max C: {max_C} at {max_C_ctx}")

    if max_A is not None or max_B is not None or max_C is not None:
        final_output = f"{max_A_ctx}@{max_A} ({max_B_ctx}@{max_B}, {max_C_ctx}@{max_C})"
        logging.info(f"Final Output: {final_output}")
        return header, labels, separator, final_output
    else:
        logging.error(f"No values found below the fits limit of {fits} GB")
        sys.exit(1)


def get_default_fits():
    """Get the default fits value from sysctl iogpu.wired_limit_mb."""
    output = run_command(["sysctl", "iogpu.wired_limit_mb"])
    match = re.search(r"(\d+)", output)
    if match:
        wired_limit_mb = int(match.group(1))
        fits = wired_limit_mb / 1024  # Convert MB to GB
        logging.info(f"Default fits value from sysctl: {fits} GB")
        return fits
    else:
        logging.error("Could not retrieve iogpu.wired_limit_mb from sysctl")
        sys.exit(1)


if __name__ == "__main__":
    parser = argparse\
        .ArgumentParser(description="Estimate VRAM usage for a given model.")
    parser.add_argument("model_name", help="Name of the model")
    parser.add_argument(
        "--fits", type=float, default=None,
        help="Fits limit in GB (default: iogpu.wired_limit_mb / 1024)"
    )
    parser.add_argument("--verbose", "-v", 
                        action="store_true", help="Verbose output")

    args = parser.parse_args()

    # Set up logging with 'VERBOSE' instead of 'DEBUG'
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.ERROR,
                        format='VERBOSE: %(message)s' if args.verbose else '%(message)s',
                        stream=sys.stderr)

    model_name = args.model_name
    fits_limit = args.fits or get_default_fits()

    quant = extract_quant(model_name)
    vram_nth = extract_vram_nth(model_name)
    vram_output = run_vram_estimation(model_name, vram_nth, fits_limit)
    header, labels, separator, largest_col = \
        find_largest_below_fits(vram_output, quant, fits_limit)

    print(f"Using fits value: {fits_limit:.2f} GB")
    print(largest_col)

sample output:

$ vram llama3.1:8b-instruct-q8_0 --verbose
VERBOSE: Default fits value from sysctl: 40.0 GB
VERBOSE: Quant value for llama3.1:8b-instruct-q8_0: Q8_0
VERBOSE: VRAM nth for llama3.1:8b-instruct-q8_0: 131072
VERBOSE: Running gollama -vram for llama3.1:8b-instruct-q8_0 with fits=40.0 GB
VERBOSE: VRAM output header, labels, and rows gathered
VERBOSE: Quant row: | Q8_0      | 8.50 | 9.1 | 10.9 | 13.4(12.4,11.9) | 18.4(16.4,15.4) | 28.3(24.3,22.3) | 48.2(40.2,36.2) |
VERBOSE: Max A: 28.3 at       64K
VERBOSE: Max B: 24.3 at       64K
VERBOSE: Max C: 36.2 at      128K
VERBOSE: Final Output:       64K      @28.3 (      64K      @24.3,      128K      @36.2)
Using fits value: 40.00 GB
      64K      @28.3 (      64K      @24.3,      128K      @36.2)
$ vram llama3.1:8b-instruct-q8_0
Using fits value: 40.00 GB
      64K      @28.3 (      64K      @24.3,      128K      @36.2)

@robbiemu
Author

much, much better than my old script -- your vram thing is what drew me to gollama:

def estimate_model_memory(context_length, param_count, quant_bits, n_heads=16, embedding_length=2048):
    """
    Estimate total memory requirements for a transformer model.
    
    Args:
    context_length (int): Maximum sequence length the model can handle.
    param_count (int): Total number of parameters in the model.
    quant_bits (float): Average number of bits used for quantization.
    n_heads (int): Number of attention heads (default 16).
    embedding_length (int): Dimension of the model (default 2048).
    
    Returns:
    float: Total memory estimate in bytes.
    """
    # Estimate model size in bytes
    model_size_bytes = param_count * quant_bits / 8
    
    # KV cache size and activation memory
    kv_cache_size = 2 * context_length * embedding_length * n_heads * quant_bits / 8
    activation_memory = 4 * context_length * embedding_length
    
    # Total memory estimate
    return model_size_bytes + kv_cache_size + activation_memory

# for quantization bits see https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md

# Example usage
# -- gemma2:27b-instruct-q6_K
# context_length = 8192
# param_count = 272e8  
# quant_bits = 6.56  # q6_K quantization
# n_heads = 16
# embedding_length = 4608
# total_memory = estimate_model_memory(context_length, param_count, quant_bits, n_heads, embedding_length)
# print(f"Total Estimated Memory: {total_memory / (1024 ** 2):.2f} MB")
# Total Estimated Memory: 22359.39 MB

# -- mistral-nemo:12b-instruct-2407-q8_0
# context_length = 65536#1.024e+06
# param_count = 122e8  
# quant_bits = 8  # q8_0 quantization
# n_heads = 32
# embedding_length = 5120
# total_memory = estimate_model_memory(context_length, param_count, quant_bits, n_heads, embedding_length)
# print(f"Total Estimated Memory: {total_memory / (1024 ** 2):.2f} MB")
# Total Estimated Memory: 351634.83 MB
# @ 65536 Total Estimated Memory: 33394.83 MB

# -- llama3.1:70b-instruct-q3_K_M
# context_length = 6144#131072
# param_count = 706e8  
# quant_bits = 3.89
# n_heads = 64
# embedding_length = 6144#8192
# total_memory = estimate_model_memory(context_length, param_count, quant_bits, n_heads, embedding_length)
# print(f"Total Estimated Memory: {total_memory / (1024 ** 2):.2f} MB")
# Total Estimated Memory: 100568.68 MB
# @ 8192 Total Estimated Memory: 36978.28 MB
# @ 6144/6144 Total Estimated Memory: 35123.56 MB

# -- llama3.1:8b-instruct-q8_0
# context_length = 65536#32768#131072
# param_count = 8e9  
# quant_bits = 8
# n_heads = 32
# embedding_length = 8192
# total_memory = estimate_model_memory(context_length, param_count, quant_bits, n_heads, embedding_length)
# print(f"Total Estimated Memory: {total_memory / (1024 ** 2):.2f} MB")
# Total Estimated Memory: 77261.39 MB
# @ 32768 Total Estimated Memory: 25037.39 MB

# -- qwen2.5:32b-instruct-q5_K_M
context_length = 32768#131072
param_count = 32e9  
quant_bits = 5.67
n_heads = 40
embedding_length = 5120
total_memory = estimate_model_memory(context_length, param_count, quant_bits, n_heads, embedding_length)
print(f"Total Estimated Memory: {total_memory / (1024 ** 2):.2f} MB")
# Total Estimated Memory: 60477.33 MB
# @ 32768 Total Estimated Memory: 31341.33 MB

@robbiemu
Author

I'm actually not clear about the three numbers though 🥹
