
[User] ./quantize fails with illegal hardware instruction when AVX is not supported #1654

Closed · 4 tasks done
happysmash27 opened this issue May 31, 2023 · 2 comments

happysmash27 commented May 31, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

When running ./build/bin/quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q8_0.bin q8_0, ./quantize should either successfully quantise the model, saving it as ggml-model-q8_0.bin, or print an error message without crashing.

Current Behavior

When running this command, ./quantize instead crashes with an illegal hardware instruction (SIGILL). As per Valgrind:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF1 0xEF 0xC9 0xC4 0xC1 0x71 0xC4 0xCF 0x0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==13694== valgrind: Unrecognised instruction at address 0x13c4f0.
==13694==    at 0x13C4F0: ggml_init (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==13694==    by 0x113219: llama_init_backend (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==13694==    by 0x10FCF9: main (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using:
 ~ % lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  24
  On-line CPU(s) list:   0-23
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
    CPU family:          6
    Model:               44
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           2
    Stepping:            2
    Frequency boost:     enabled
    CPU max MHz:         3468,0000
    CPU min MHz:         1600,0000
    BogoMIPS:            6932.61
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe
                         syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf
                         pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt lahf_lm epb pti tpr_shad
                         ow vnmi flexpriority ept vpid dtherm ida arat
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   384 KiB (12 instances)
  L1i:                   384 KiB (12 instances)
  L2:                    3 MiB (12 instances)
  L3:                    24 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-5,12-17
  NUMA node1 CPU(s):     6-11,18-23
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
  • Operating System:
 ~ % uname -a
Linux computer-pig.net 5.19.3-gentoo #2 SMP PREEMPT_DYNAMIC Fri Aug 26 20:41:30 PDT 2022 x86_64 Intel(R) Xeon(R) CPU X5690 @ 3.47GHz GenuineIntel GNU/Linux
  • SDK version:
 ~ % python3 --version
Python 3.9.16
 ~ % make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

I built with CMake, not make, so I am also including:

~ % cmake --version
cmake version 3.25.2

CMake suite maintained and supported by Kitware (kitware.com/cmake).
 ~ % g++ --version
g++ (Gentoo 11.3.0 p4) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Pip cannot install requirements.txt verbatim due to the externally-managed-environment error on Gentoo, and sentencepiece is not in the native Gentoo repositories, so I installed that package manually using pip with the --user --break-system-packages flags; numpy appears to be managed by the system package manager. The following is the current package info:

 % pip list | egrep "torch|numpy|sentencepiece"
numpy                         1.24.2
sentencepiece                 0.1.98
torch                         1.13.0+cpu
torchaudio                    0.13.0+cpu
torchvision                   0.14.0+cpu

Failure Information (for bugs)

Valgrind gives extra details; the excerpt is identical to the one shown under "Current Behavior" above, and the full output appears under "Failure Logs" below.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Clone repository.
  2. Install numpy 1.24.2 using Gentoo package manager, and sentencepiece 0.1.98 using pip.
  3. mkdir build; cd build; cmake ..
  4. cmake --build . --config Release
  5. cd ..
  6. Using cp --reflink=auto, copy all model files to the models directory
  7. python convert.py models/30B/
  8. ./build/bin/quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q8_0.bin q8_0

Important notes

Originally, I had installed sentencepiece 0.1.98 several days ago. When I initially tried to convert the model today, I had the following error:

Traceback (most recent call last):
  File "/mnt/Biblioteko/Gits/llama.cpp/convert.py", line 25, in <module>
    from sentencepiece import SentencePieceProcessor  # type: ignore
  File "/home/happysmash27/.local/lib/python3.9/site-packages/sentencepiece/__init__.py", line 13, in <module>
    from . import _sentencepiece
ImportError: /home/happysmash27/.local/lib/python3.9/site-packages/sentencepiece/_sentencepiece.cpython-39-x86_64-linux-gnu.so: file too short

So, I ran pip3 install --upgrade --user --break-system-packages sentencepiece to upgrade it, and after this the conversion seemingly succeeded. The quantisation, however, did not, which prompted me to start creating this issue.

In the process of making this issue, however, it was necessary to rule out this slightly newer version as the cause of the error. Upon downgrading back to 0.1.98 and re-running convert.py, the conversion still works, but the error from ./quantize is the same; the full Valgrind log for that run (PID 18826) appears under "Failure Logs" below.

For this reason, the steps to reproduce above do not include this upgrade/downgrade cycle, so as not to make the issue too disorganised. However, to give as much context as possible, here are the steps I have actually followed so far, in more detail:

  1. Clone repository.
  2. Install numpy 1.24.2 using Gentoo package manager, and sentencepiece 0.1.98 using pip.
  3. mkdir build; cd build; cmake ..
  4. cmake --build . --config Release
  5. cd ..
  6. Using cp --reflink=auto, copy all model files to the models directory
  7. python convert.py models/30B/ results in error.
  8. Attempt to fix the error: pip3 install --upgrade --user --break-system-packages sentencepiece (this installed 0.1.99; if 0.1.99 is no longer the latest version, modify the command to install that version specifically).
  9. python convert.py models/30B/ runs seemingly successfully.
  10. ./quantize does not run successfully.
  11. While making the issue, downgrade back to 0.1.98: pip3 install --user --break-system-packages sentencepiece==0.1.98
  12. ./quantize still fails.
  13. mv models/30B/ggml-model-f16.bin models/30B/ggml-model-f16.bin.old
  14. python convert.py models/30B/
  15. As the conversion is finishing, delete ggml-model-f16.bin.old
  16. Run ./quantize again. The error still appears to be identical.

After looking through the documentation some more and finding a number of sha256sums, I also ran sha256sum on the converted model to be absolutely sure that convert.py is not causing the error. The result is as follows:

 % sha256sum models/30B/ggml-model-f16.bin
7e1b524061a9f4b27c22a12d6d2a5bf13b8ebbea73e99f218809351ed9cf7d37  models/30B/ggml-model-f16.bin

This appears to be distinct from #41 as I am on x86_64, not ARM.

Failure Logs

When compiling, some warnings appeared which may or may not be related:

 % cmake --build . --config Release
[  3%] Built target BUILD_INFO
[  6%] Building C object CMakeFiles/ggml.dir/ggml.c.o
/home/happysmash27/Gits/llama.cpp/ggml.c: In function ‘ggml_graph_export_leaf’:
/home/happysmash27/Gits/llama.cpp/ggml.c:14584:39: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 6 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
14584 |     fprintf(fout, "%-6s %-12s %8d %8lld %8lld %8lld %8lld %16zu %16zu %16zu %16zu %16p %16s\n",
      |                                   ~~~~^
      |                                       |
      |                                       long long int
      |                                   %8ld
......
14588 |             ne[0], ne[1], ne[2], ne[3],
      |             ~~~~~
      |               |
      |               int64_t {aka long int}
/home/happysmash27/Gits/llama.cpp/ggml.c:14584:45: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 7 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
14584 |     fprintf(fout, "%-6s %-12s %8d %8lld %8lld %8lld %8lld %16zu %16zu %16zu %16zu %16p %16s\n",
      |                                         ~~~~^
      |                                             |
      |                                             long long int
      |                                         %8ld
......
14588 |             ne[0], ne[1], ne[2], ne[3],
      |                    ~~~~~
      |                      |
      |                      int64_t {aka long int}
/home/happysmash27/Gits/llama.cpp/ggml.c:14584:51: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 8 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
14584 |     fprintf(fout, "%-6s %-12s %8d %8lld %8lld %8lld %8lld %16zu %16zu %16zu %16zu %16p %16s\n",
      |                                               ~~~~^
      |                                                   |
      |                                                   long long int
      |                                               %8ld
......
14588 |             ne[0], ne[1], ne[2], ne[3],
      |                           ~~~~~
      |                             |
      |                             int64_t {aka long int}
/home/happysmash27/Gits/llama.cpp/ggml.c:14584:57: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 9 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
14584 |     fprintf(fout, "%-6s %-12s %8d %8lld %8lld %8lld %8lld %16zu %16zu %16zu %16zu %16p %16s\n",
      |                                                     ~~~~^
      |                                                         |
      |                                                         long long int
      |                                                     %8ld
......
14588 |             ne[0], ne[1], ne[2], ne[3],
      |                                  ~~~~~
      |                                    |
      |                                    int64_t {aka long int}
/home/happysmash27/Gits/llama.cpp/ggml.c: In function ‘ggml_graph_export_node’:
/home/happysmash27/Gits/llama.cpp/ggml.c:14598:44: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 7 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
14598 |     fprintf(fout, "%-6s %-6s %-12s %8d %8lld %8lld %8lld %8lld %16zu %16zu %16zu %16zu %8d %16p %16s\n",
      |                                        ~~~~^
      |                                            |
      |                                            long long int
      |                                        %8ld
......
14603 |             ne[0], ne[1], ne[2], ne[3],
      |             ~~~~~
      |               |
      |               int64_t {aka long int}
/home/happysmash27/Gits/llama.cpp/ggml.c:14598:50: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 8 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
14598 |     fprintf(fout, "%-6s %-6s %-12s %8d %8lld %8lld %8lld %8lld %16zu %16zu %16zu %16zu %8d %16p %16s\n",
      |                                              ~~~~^
      |                                                  |
      |                                                  long long int
      |                                              %8ld
......
14603 |             ne[0], ne[1], ne[2], ne[3],
      |                    ~~~~~
      |                      |
      |                      int64_t {aka long int}
/home/happysmash27/Gits/llama.cpp/ggml.c:14598:56: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 9 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
14598 |     fprintf(fout, "%-6s %-6s %-12s %8d %8lld %8lld %8lld %8lld %16zu %16zu %16zu %16zu %8d %16p %16s\n",
      |                                                    ~~~~^
      |                                                        |
      |                                                        long long int
      |                                                    %8ld
......
14603 |             ne[0], ne[1], ne[2], ne[3],
      |                           ~~~~~
      |                             |
      |                             int64_t {aka long int}
/home/happysmash27/Gits/llama.cpp/ggml.c:14598:62: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 10 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
14598 |     fprintf(fout, "%-6s %-6s %-12s %8d %8lld %8lld %8lld %8lld %16zu %16zu %16zu %16zu %8d %16p %16s\n",
      |                                                          ~~~~^
      |                                                              |
      |                                                              long long int
      |                                                          %8ld
......
14603 |             ne[0], ne[1], ne[2], ne[3],
      |                                  ~~~~~
      |                                    |
      |                                    int64_t {aka long int}
/home/happysmash27/Gits/llama.cpp/ggml.c: In function ‘ggml_graph_export’:
/home/happysmash27/Gits/llama.cpp/ggml.c:14631:34: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 4 has type ‘uint64_t’ {aka ‘long unsigned int’} [-Wformat=]
14631 |         fprintf(fout, "%-16s %8llu\n", "eval",    size_eval);
      |                              ~~~~^                ~~~~~~~~~
      |                                  |                |
      |                                  |                uint64_t {aka long unsigned int}
      |                                  long long unsigned int
      |                              %8lu
/home/happysmash27/Gits/llama.cpp/ggml.c: In function ‘ggml_graph_import’:
/home/happysmash27/Gits/llama.cpp/ggml.c:14865:9: warning: ignoring return value of ‘fread’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
14865 |         fread(data->data, sizeof(char), fsize, fin);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[  6%] Built target ggml
[  9%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
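
(Aside: these -Wformat warnings look like a separate, benign issue. On LP64 Linux, int64_t is long, not long long, so the %lld specifiers mismatch; the portable fix would be the PRId64 macro from <inttypes.h>. A minimal standalone sketch of the pattern, nothing llama.cpp-specific:)

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    int64_t ne = 42;
    /* "%lld" assumes long long, but on LP64 platforms int64_t is long,
     * hence the -Wformat warning. PRId64 expands to the correct length
     * modifier for the current platform: */
    printf("%8" PRId64 "\n", ne);
    return 0;
}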

Every command I have run that produces this error:

 % ./build/bin/quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q8_0.bin q8_0
zsh: illegal hardware instruction  ./build/bin/quantize ./models/30B/ggml-model-f16.bin  q8_0
 % valgrind ./build/bin/quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q8_0.bin q8_0
==13694== Memcheck, a memory error detector
==13694== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==13694== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==13694== Command: ./build/bin/quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q8_0.bin q8_0
==13694==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF1 0xEF 0xC9 0xC4 0xC1 0x71 0xC4 0xCF 0x0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==13694== valgrind: Unrecognised instruction at address 0x13c4f0.
==13694==    at 0x13C4F0: ggml_init (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==13694==    by 0x113219: llama_init_backend (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==13694==    by 0x10FCF9: main (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==13694== Your program just tried to execute an instruction that Valgrind
==13694== did not recognise.  There are two possible reasons for this.
==13694== 1. Your program has a bug and erroneously jumped to a non-code
==13694==    location.  If you are running Memcheck and you just saw a
==13694==    warning about a bad jump, it's probably your program's fault.
==13694== 2. The instruction is legitimate but Valgrind doesn't handle it,
==13694==    i.e. it's Valgrind's fault.  If you think this is the case or
==13694==    you are not sure, please let us know and we'll try to fix it.
==13694== Either way, Valgrind will now raise a SIGILL signal which will
==13694== probably kill your program.
==13694==
==13694== Process terminating with default action of signal 4 (SIGILL)
==13694==  Illegal opcode at address 0x13C4F0
==13694==    at 0x13C4F0: ggml_init (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==13694==    by 0x113219: llama_init_backend (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==13694==    by 0x10FCF9: main (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==13694==
==13694== HEAP SUMMARY:
==13694==     in use at exit: 73,064 bytes in 6 blocks
==13694==   total heap usage: 6 allocs, 0 frees, 73,064 bytes allocated
==13694==
==13694== LEAK SUMMARY:
==13694==    definitely lost: 0 bytes in 0 blocks
==13694==    indirectly lost: 0 bytes in 0 blocks
==13694==      possibly lost: 0 bytes in 0 blocks
==13694==    still reachable: 73,064 bytes in 6 blocks
==13694==         suppressed: 0 bytes in 0 blocks
==13694== Rerun with --leak-check=full to see details of leaked memory
==13694==
==13694== For lists of detected and suppressed errors, rerun with: -s
==13694== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
zsh: illegal hardware instruction  valgrind ./build/bin/quantize ./models/30B/ggml-model-f16.bin  q8_0

I also tried running from the build/bin directory, in case the issue was related to linking; it still persists:

 % ./quantize ../../models/30B/ggml-model-f16.bin ../../models/30B/ggml-model-q8_0.bin q8_0
zsh: illegal hardware instruction  ./quantize ../../models/30B/ggml-model-f16.bin  q8_0

File info:

 % file quantize
quantize: ELF 64-bit LSB pie executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, not stripped
% cat ../../requirements.txt
numpy==1.24
sentencepiece==0.1.98

As mentioned in "Important notes", initially I had converted using sentencepiece-0.1.99. Downgrading to sentencepiece-0.1.98 to see if it fixes it:

 % pip3 install --user --break-system-packages sentencepiece==0.1.98
Collecting sentencepiece==0.1.98
  Using cached sentencepiece-0.1.98-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
  Attempting uninstall: sentencepiece
    Found existing installation: sentencepiece 0.1.99
    Uninstalling sentencepiece-0.1.99:
      Successfully uninstalled sentencepiece-0.1.99
Successfully installed sentencepiece-0.1.98

Same result, both before re-converting the model and after:

 % valgrind ./build/bin/quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q8_0.bin q8_0
==18826== Memcheck, a memory error detector
==18826== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18826== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==18826== Command: ./build/bin/quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q8_0.bin q8_0
==18826==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF1 0xEF 0xC9 0xC4 0xC1 0x71 0xC4 0xCF 0x0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==18826== valgrind: Unrecognised instruction at address 0x13c4f0.
==18826==    at 0x13C4F0: ggml_init (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==18826==    by 0x113219: llama_init_backend (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==18826==    by 0x10FCF9: main (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==18826== Your program just tried to execute an instruction that Valgrind
==18826== did not recognise.  There are two possible reasons for this.
==18826== 1. Your program has a bug and erroneously jumped to a non-code
==18826==    location.  If you are running Memcheck and you just saw a
==18826==    warning about a bad jump, it's probably your program's fault.
==18826== 2. The instruction is legitimate but Valgrind doesn't handle it,
==18826==    i.e. it's Valgrind's fault.  If you think this is the case or
==18826==    you are not sure, please let us know and we'll try to fix it.
==18826== Either way, Valgrind will now raise a SIGILL signal which will
==18826== probably kill your program.
==18826==
==18826== Process terminating with default action of signal 4 (SIGILL)
==18826==  Illegal opcode at address 0x13C4F0
==18826==    at 0x13C4F0: ggml_init (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==18826==    by 0x113219: llama_init_backend (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==18826==    by 0x10FCF9: main (in /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize)
==18826==
==18826== HEAP SUMMARY:
==18826==     in use at exit: 73,064 bytes in 6 blocks
==18826==   total heap usage: 6 allocs, 0 frees, 73,064 bytes allocated
==18826==
==18826== LEAK SUMMARY:
==18826==    definitely lost: 0 bytes in 0 blocks
==18826==    indirectly lost: 0 bytes in 0 blocks
==18826==      possibly lost: 0 bytes in 0 blocks
==18826==    still reachable: 73,064 bytes in 6 blocks
==18826==         suppressed: 0 bytes in 0 blocks
==18826== Rerun with --leak-check=full to see details of leaked memory
==18826==
==18826== For lists of detected and suppressed errors, rerun with: -s
==18826== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
zsh: illegal hardware instruction  valgrind ./build/bin/quantize ./models/30B/ggml-model-f16.bin  q8_0

Update: I have recompiled with -DLLAMA_NATIVE=ON (to try to remove the unsupported instructions, which did not seem to work) and with debug symbols, and have run the binary under gdb. The output is as follows:

(gdb) run
Starting program: /mnt/Biblioteko/Gits/llama.cpp/build/bin/quantize ../models/30B/ggml-model-f16.bin ../models/30B/ggml-model-q8_0.bin q8_0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction.
0x000055555559b449 in _cvtsh_ss (__S=0) at /usr/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/f16cintrin.h:40
40	  __v8hi __H = __extension__ (__v8hi){ (short) __S, 0, 0, 0, 0, 0, 0, 0 };
(gdb) backtrace
#0  0x000055555559b449 in _cvtsh_ss (__S=0) at /usr/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/f16cintrin.h:40
#1  ggml_init (params=...) at /home/happysmash27/Gits/llama.cpp/ggml.c:3905
#2  0x000055555555d269 in llama_init_backend () at /home/happysmash27/Gits/llama.cpp/llama.cpp:865
#3  0x0000555555559cc1 in main (argc=4, argv=0x7fffffffd1f8) at /home/happysmash27/Gits/llama.cpp/examples/quantize/quantize.cpp:53
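
(_cvtsh_ss is GCC's F16C intrinsic for converting one IEEE half-precision value to float; it compiles to the VEX-encoded vcvtph2ps instruction, which requires F16C/AVX support. A standalone sketch that should reproduce the same SIGILL on this CPU, assuming it is built with -mf16c:)

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    /* _cvtsh_ss() converts a binary16 value (passed as an unsigned short)
     * to float using vcvtph2ps. Built with gcc -mf16c, this will raise
     * SIGILL on any CPU that lacks F16C, such as a Westmere-era Xeon. */
    float f = _cvtsh_ss(0x3C00);  /* 0x3C00 is 1.0 in binary16 */
    printf("%f\n", f);
    return 0;
}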

Update 2:

(gdb) disassemble
Dump of assembler code for function ggml_init:
   0x000055555559b3c8 <+0>:	push   %rbp
   0x000055555559b3c9 <+1>:	mov    %rsp,%rbp
   0x000055555559b3cc <+4>:	sub    $0x1940,%rsp
   0x000055555559b3d3 <+11>:	mov    %fs:0x28,%rax
   0x000055555559b3dc <+20>:	mov    %rax,-0x8(%rbp)
   0x000055555559b3e0 <+24>:	xor    %eax,%eax
   0x000055555559b3e2 <+26>:	call   0x55555559abab <ggml_critical_section_start>
   0x000055555559b3e7 <+31>:	movzbl 0x500e7(%rip),%eax        # 0x5555555eb4d5 <is_first_call.52>
   0x000055555559b3ee <+38>:	test   %al,%al
   0x000055555559b3f0 <+40>:	je     0x55555559b66f <ggml_init+679>
   0x000055555559b3f6 <+46>:	call   0x555555592a67 <ggml_time_init>
   0x000055555559b3fb <+51>:	call   0x555555592adf <ggml_time_us>
   0x000055555559b400 <+56>:	mov    %rax,-0x1918(%rbp)
   0x000055555559b407 <+63>:	movl   $0x0,-0x1930(%rbp)
   0x000055555559b411 <+73>:	jmp    0x55555559b5db <ggml_init+531>
   0x000055555559b416 <+78>:	mov    -0x1930(%rbp),%eax
   0x000055555559b41c <+84>:	mov    %ax,-0x1870(%rbp)
   0x000055555559b423 <+91>:	movzwl -0x1870(%rbp),%eax
   0x000055555559b42a <+98>:	mov    %ax,-0x1934(%rbp)
   0x000055555559b431 <+105>:	movzwl -0x1934(%rbp),%eax
   0x000055555559b438 <+112>:	movzwl %ax,%eax
   0x000055555559b43b <+115>:	mov    %ax,-0x1932(%rbp)
   0x000055555559b442 <+122>:	movzwl -0x1932(%rbp),%eax
=> 0x000055555559b449 <+129>:	vpxor  %xmm0,%xmm0,%xmm0
   0x000055555559b44d <+133>:	vpinsrw $0x0,%eax,%xmm0,%xmm0
   0x000055555559b452 <+138>:	vmovdqa %xmm0,-0x1890(%rbp)
   0x000055555559b45a <+146>:	vmovdqa -0x1890(%rbp),%xmm0
   0x000055555559b462 <+154>:	vcvtph2ps %xmm0,%xmm0
   0x000055555559b467 <+159>:	vmovaps %xmm0,-0x1880(%rbp)
   0x000055555559b46f <+167>:	vmovaps -0x1880(%rbp),%xmm0
   0x000055555559b477 <+175>:	mov    -0x1930(%rbp),%eax
   0x000055555559b47d <+181>:	cltq

After some more research, using a combination of several ChatGPT sessions and manually consulting documentation (to better deal with hallucinations), I believe the issue is that this code executes an AVX instruction, while AVX was not introduced in Intel CPUs until a generation or so after mine were made.
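
(To double-check this theory without reading disassembly, the CPU can be queried at runtime with GCC's __builtin_cpu_supports; on this Xeon X5690 it should report sse4.2 as present but avx, avx2, fma, and f16c as absent. A minimal standalone sketch:)

#include <stdio.h>

int main(void) {
    /* __builtin_cpu_supports() (GCC >= 4.8) runs CPUID once and returns
     * nonzero if the CPU advertises the feature. */
    printf("sse4.2: %d\n", !!__builtin_cpu_supports("sse4.2"));
    printf("avx:    %d\n", !!__builtin_cpu_supports("avx"));
    printf("avx2:   %d\n", !!__builtin_cpu_supports("avx2"));
    printf("fma:    %d\n", !!__builtin_cpu_supports("fma"));
    printf("f16c:   %d\n", !!__builtin_cpu_supports("f16c"));
    return 0;
}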

Update 3:

Edited Update 1 to note that I had also added -DLLAMA_NATIVE=ON at that point. Now I have also added -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF to my CMake flags. Unfortunately, the error still persists.

Update 4:

I now realise that by Update 3 the Valgrind output had actually changed:

==6513== Memcheck, a memory error detector
==6513== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==6513== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==6513== Command: ./bin/quantize ../models/30B/ggml-model-f16.bin ../models/30B/ggml-model-q8_0.bin q8_0
==6513==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0xEF 0xC0 0xC5 0xF9 0xC4 0xC0 0x0 0xC5
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==6513== valgrind: Unrecognised instruction at address 0x14f775.
==6513==    at 0x14F775: _cvtsh_ss (f16cintrin.h:40)
==6513==    by 0x14F775: ggml_init (ggml.c:3905)
==6513==    by 0x111268: llama_init_backend (llama.cpp:865)
==6513==    by 0x10DCC0: main (quantize.cpp:53)
==6513== Your program just tried to execute an instruction that Valgrind
==6513== did not recognise.  There are two possible reasons for this.
==6513== 1. Your program has a bug and erroneously jumped to a non-code
==6513==    location.  If you are running Memcheck and you just saw a
==6513==    warning about a bad jump, it's probably your program's fault.
==6513== 2. The instruction is legitimate but Valgrind doesn't handle it,
==6513==    i.e. it's Valgrind's fault.  If you think this is the case or
==6513==    you are not sure, please let us know and we'll try to fix it.
==6513== Either way, Valgrind will now raise a SIGILL signal which will
==6513== probably kill your program.
==6513==
==6513== Process terminating with default action of signal 4 (SIGILL)
==6513==  Illegal opcode at address 0x14F775
==6513==    at 0x14F775: _cvtsh_ss (f16cintrin.h:40)
==6513==    by 0x14F775: ggml_init (ggml.c:3905)
==6513==    by 0x111268: llama_init_backend (llama.cpp:865)
==6513==    by 0x10DCC0: main (quantize.cpp:53)
==6513==
==6513== HEAP SUMMARY:
==6513==     in use at exit: 73,064 bytes in 6 blocks
==6513==   total heap usage: 6 allocs, 0 frees, 73,064 bytes allocated
==6513==
==6513== LEAK SUMMARY:
==6513==    definitely lost: 0 bytes in 0 blocks
==6513==    indirectly lost: 0 bytes in 0 blocks
==6513==      possibly lost: 0 bytes in 0 blocks
==6513==    still reachable: 73,064 bytes in 6 blocks
==6513==         suppressed: 0 bytes in 0 blocks
==6513== Rerun with --leak-check=full to see details of leaked memory
==6513==
==6513== For lists of detected and suppressed errors, rerun with: -s
==6513== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
zsh: illegal hardware instruction  valgrind ./bin/quantize ../models/30B/ggml-model-f16.bin  q8_0

More importantly, though, after adding yet another flag to disable F16C (which I had noticed is the only one of these variables that shows up in the relevant source code), the Valgrind output has changed again:

==14777== Memcheck, a memory error detector
==14777== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==14777== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==14777== Command: ./bin/quantize ../models/30B/ggml-model-f16.bin ../models/30B/ggml-model-q8_0.bin q8_0
==14777==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x10 0x5 0xAE 0x63 0x3 0x0 0xC5 0xFA
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==14777== valgrind: Unrecognised instruction at address 0x1467ae.
==14777==    at 0x1467AE: ggml_compute_fp16_to_fp32 (ggml.c:269)
==14777==    by 0x14F596: ggml_init (ggml.c:3905)
==14777==    by 0x111268: llama_init_backend (llama.cpp:865)
==14777==    by 0x10DCC0: main (quantize.cpp:53)
==14777== Your program just tried to execute an instruction that Valgrind
==14777== did not recognise.  There are two possible reasons for this.
==14777== 1. Your program has a bug and erroneously jumped to a non-code
==14777==    location.  If you are running Memcheck and you just saw a
==14777==    warning about a bad jump, it's probably your program's fault.
==14777== 2. The instruction is legitimate but Valgrind doesn't handle it,
==14777==    i.e. it's Valgrind's fault.  If you think this is the case or
==14777==    you are not sure, please let us know and we'll try to fix it.
==14777== Either way, Valgrind will now raise a SIGILL signal which will
==14777== probably kill your program.
==14777==
==14777== Process terminating with default action of signal 4 (SIGILL)
==14777==  Illegal opcode at address 0x1467AE
==14777==    at 0x1467AE: ggml_compute_fp16_to_fp32 (ggml.c:269)
==14777==    by 0x14F596: ggml_init (ggml.c:3905)
==14777==    by 0x111268: llama_init_backend (llama.cpp:865)
==14777==    by 0x10DCC0: main (quantize.cpp:53)
==14777==
==14777== HEAP SUMMARY:
==14777==     in use at exit: 73,064 bytes in 6 blocks
==14777==   total heap usage: 6 allocs, 0 frees, 73,064 bytes allocated
==14777==
==14777== LEAK SUMMARY:
==14777==    definitely lost: 0 bytes in 0 blocks
==14777==    indirectly lost: 0 bytes in 0 blocks
==14777==      possibly lost: 0 bytes in 0 blocks
==14777==    still reachable: 73,064 bytes in 6 blocks
==14777==         suppressed: 0 bytes in 0 blocks
==14777== Rerun with --leak-check=full to see details of leaked memory
==14777==
==14777== For lists of detected and suppressed errors, rerun with: -s
==14777== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
zsh: illegal hardware instruction  valgrind ./bin/quantize ../models/30B/ggml-model-f16.bin  q8_0

This doesn't actually solve the issue, unfortunately, but I consider it good progress that the crash is at least reached through a different route than before.
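
(For context: with F16C disabled, ggml does this conversion in plain scalar C, which needs no SIMD at all. A generic sketch of IEEE binary16-to-binary32 conversion — illustrative only, not the exact ggml code — is below. Since the scalar path contains no vector intrinsics, the remaining unhandled bytes 0xC5 0xFA 0x10 (a VEX-encoded vmovss) suggest the compiler itself was still emitting AVX encodings, i.e. an AVX-enabling -m flag was still on the command line.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Generic scalar IEEE-754 binary16 -> binary32 conversion (illustrative). */
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  =  h        & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                    /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {                 /* normal number */
        bits = sign | ((exp + (127u - 15u)) << 23) | (man << 13);
    } else if (man != 0) {                 /* subnormal: renormalise */
        int e = -1;
        do { man <<= 1; e++; } while ((man & 0x400u) == 0);
        bits = sign | ((uint32_t)(127 - 15 - e) << 23) | ((man & 0x3FFu) << 13);
    } else {                               /* signed zero */
        bits = sign;
    }

    float f;
    memcpy(&f, &bits, sizeof f);           /* bit-cast without aliasing UB */
    return f;
}

int main(void) {
    /* Expected output: 1 -2 5.96046e-08 */
    printf("%g %g %g\n", fp16_to_fp32(0x3C00), fp16_to_fp32(0xC000), fp16_to_fp32(0x0001));
    return 0;
}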

happysmash27 changed the title from "[User] Insert summary of your issue or enhancement.." to "[User] ./quantize fails with illegal hardware instruction" on May 31, 2023
happysmash27 changed the title to "[User] ./quantize fails with illegal hardware instruction when AVX is not supported" on May 31, 2023
happysmash27 (Author) commented

I finally, finally got it to work! All I had to do was disable yet another one of the new instruction sets listed in CMakeLists.txt that are enabled by default: FMA. Passing the following flags when configuring with CMake makes it work completely correctly:

-DLLAMA_NATIVE=ON -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_F16C=OFF -DLLAMA_FMA=OFF

rankaiyx (Contributor) commented Jun 1, 2023

When using make, adding -mfma or -mf16c also seems to cause AVX to be enabled.
See https://github.com/ggerganov/llama.cpp/pull/1659/files
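
(A quick way to confirm this: with GCC, -mfma and -mf16c each imply -mavx, because FMA3 and F16C instructions only exist in VEX (AVX) encodings. Compiling the following standalone sketch with gcc -mfma prints "__AVX__ defined" even though only FMA was requested:)

#include <stdio.h>

int main(void) {
    /* Each block reports whether the compiler enabled that instruction set
     * for this translation unit. */
#ifdef __AVX__
    puts("__AVX__ defined");
#endif
#ifdef __FMA__
    puts("__FMA__ defined");
#endif
#ifdef __F16C__
    puts("__F16C__ defined");
#endif
    return 0;
}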

I see you are using a machine with NUMA. You can try zrm's branch to see if inference performance can be doubled.
See #1556
