
train-text-from-scratch.cpp: dereferenced NULL KQ_pos->data #3389

Closed
ttkciar opened this issue Sep 29, 2023 · 1 comment · Fixed by #3392

ttkciar commented Sep 29, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

train-text-from-scratch should train a model when invoked via:

```
train-text-from-scratch --vocab-model models/ggml-vocab-llama.gguf --train-data ../cruft.llama/icbmlog.ttk.2.txt --adam-iter 500 --head 16 --layer 16
```

Current Behavior

Segmentation fault in llama_build_train_graphs():

```
main: init model
print_params: n_vocab: 32000
print_params: n_ctx:   128
print_params: n_embd:  256
print_params: n_head:  16
print_params: n_ff:    768
print_params: n_layer: 16
print_params: n_rot:   16
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: model_size = 240290304 bytes (229.2 MB)
main: opt_size  = 360288288 bytes (343.6 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
Segmentation fault
```

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:

```
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Stepping:                        10
CPU MHz:                         800.332
CPU max MHz:                     4500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5199.98
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1.5 MiB
L3 cache:                        12 MiB
NUMA node0 CPU(s):               0-11
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall n
                                 x pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dte
                                 s64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave
                                 avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ep
                                 t vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv
                                 1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```
  • Operating System, e.g. for Linux:
```
Linux kirov.ciar.org 5.4.10 #1 SMP Thu Jan 9 14:13:31 CST 2020 x86_64 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz GenuineIntel GNU/Linux
```
  • SDK version, e.g. for Linux:
```
$ python3 --version
Python 3.8.1

$ make --version
GNU Make 4.2.1

$ g++ --version
g++ (GCC) 12.1.0
```

Failure Information (for bugs)

In train-text-from-scratch.cpp, llama_build_train_graphs tries to initialize KQ_pos->data while it is NULL.

llama_build_train_graphs calls ggml_new_tensor_1d, which calls ggml_new_tensor, which calls ggml_new_tensor_impl.

In ggml_new_tensor_impl:

  • view_src is NULL, so view_offs is never set.
  • Since view_src is NULL and ctx->no_alloc is true, data is never assigned any memory, and the tensor is returned with result->data = NULL.

Immediately after assigning the returned result to KQ_pos, llama_build_train_graphs tries to set each of the N elements of KQ_pos->data to n_past + i.

Since KQ_pos->data is NULL, this write causes the segfault.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. ```
     ./train-text-from-scratch --vocab-model models/ggml-vocab-llama.gguf --train-data ../cruft.llama/icbmlog.ttk.2.txt --adam-iter 500 --head 16 --layer 16
     ```

Failure Logs

```
$ git log | head -1
commit bc39553c901a91cfcb757863586250838c83eeab

$ pip3 list | egrep "torch|numpy|sentencepiece"
numpy                    1.22.1
sentencepiece            0.1.99
torch                    2.0.1
torchvision              0.15.2

$ make --version | head -1
GNU Make 4.2.1
```

ttkciar commented Sep 29, 2023

Confirmed that this update fixes the problem. Thanks for all that you do, folks!

I do notice this at the end of the training run, but it might be unrelated:

```
main: total training time: 00:11:53
double free or corruption (!prev)
Aborted
```
