
train-text-from-scratch.cpp: dereferenced NULL KQ_pos->data #3389

Closed
ttkciar opened this issue Sep 29, 2023 · 1 comment · Fixed by #3392

ttkciar commented Sep 29, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

train-text-from-scratch should train a model when invoked via:

```
train-text-from-scratch --vocab-model models/ggml-vocab-llama.gguf --train-data ../cruft.llama/icbmlog.ttk.2.txt --adam-iter 500 --head 16 --layer 16
```

Current Behavior

Segmentation fault in llama_build_train_graphs():

```
main: init model
print_params: n_vocab: 32000
print_params: n_ctx:   128
print_params: n_embd:  256
print_params: n_head:  16
print_params: n_ff:    768
print_params: n_layer: 16
print_params: n_rot:   16
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: model_size = 240290304 bytes (229.2 MB)
main: opt_size  = 360288288 bytes (343.6 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
Segmentation fault
```

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:

```
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Stepping:                        10
CPU MHz:                         800.332
CPU max MHz:                     4500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5199.98
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1.5 MiB
L3 cache:                        12 MiB
NUMA node0 CPU(s):               0-11
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall n
                                 x pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dte
                                 s64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave
                                 avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ep
                                 t vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv
                                 1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```
  • Operating System, e.g. for Linux:
```
Linux kirov.ciar.org 5.4.10 #1 SMP Thu Jan 9 14:13:31 CST 2020 x86_64 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz GenuineIntel GNU/Linux
```
  • SDK version, e.g. for Linux:
```
$ python3 --version
Python 3.8.1

$ make --version
GNU Make 4.2.1

$ g++ --version
g++ (GCC) 12.1.0
```

Failure Information (for bugs)

In train-text-from-scratch.cpp, llama_build_train_graphs tries to initialize KQ_pos->data while it is NULL.

llama_build_train_graphs calls ggml_new_tensor_1d, which calls ggml_new_tensor, which calls ggml_new_tensor_impl.

In ggml_new_tensor_impl:

  • view_src is NULL, so view_offs is never set.
  • Since view_src is NULL and ctx->no_alloc is true, data is never assigned any memory, and the tensor is returned with result->data = NULL.

Immediately after assigning the returned result to KQ_pos, llama_build_train_graphs tries to set each of the N elements of KQ_pos->data to n_past + i.

Since KQ_pos->data is NULL, this write causes the segfault.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. ```
     ./train-text-from-scratch --vocab-model models/ggml-vocab-llama.gguf --train-data ../cruft.llama/icbmlog.ttk.2.txt --adam-iter 500 --head 16 --layer 16
     ```

Failure Logs

```
$ git log | head -1
commit bc39553c901a91cfcb757863586250838c83eeab

$ pip3 list | egrep "torch|numpy|sentencepiece"
numpy                    1.22.1
sentencepiece            0.1.99
torch                    2.0.1
torchvision              0.15.2

$ make --version | head -1
GNU Make 4.2.1
```

ttkciar commented Sep 29, 2023

Confirmed that this update fixes the problem. Thanks for all that you do, folks!

I do notice this at the end of the training run, but it might be unrelated:

```
main: total training time: 00:11:53
double free or corruption (!prev)
Aborted
```
