Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simple convert scripts run out of tmpfs space on Linux #3433

Closed
cebtenzzre opened this issue Oct 2, 2023 · 18 comments
Closed

simple convert scripts run out of tmpfs space on Linux #3433

cebtenzzre opened this issue Oct 2, 2023 · 18 comments
Labels

Comments

@cebtenzzre
Copy link
Collaborator

cebtenzzre commented Oct 2, 2023

By default, Linux prevents tmpfs from using more than 50% of available system memory. This is normally a good thing, but the simple convert scripts write all tensor data to tmpfs before saving the output file, causing this exception if the converted model is larger than 50% of system RAM (ref):

Traceback (most recent call last):
  File "/home/cebtenzzre/src/forks/llama.cpp/convert-baichuan-hf-to-gguf.py", line 279, in <module>
    gguf_writer.add_tensor(new_name, data)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 622, in add_tensor
    tensor.tofile(self.temp_file)
OSError: Not enough free space to write 140247040 bytes

This is annoying. You can set TMPDIR=/var/tmp to work around this, but then you need twice as much free disk space as the size of the output file - which I don't always have.

The smallest change that would fix this problem would be to provide a way to effectively disable use_temp_file on Linux, while still supporting use of e.g. /var/tmp if desired. That way, I could leverage 100% of my RAM, and my swap space, in order to convert these models. If we choose this route, we should make sure the converted tensor data is freed as it is written, to avoid unnecessary swapping - right now it is not.

We can't just change the simple scripts to convert on-the-fly, because they load one pytorch file at a time, so making two passes over the input tensors would have a high I/O cost.

We could make LazyUnpickler part of the gguf module. That way, we can significantly reduce memory usage, and avoid the need to load the entire model into memory, while hopefully still keeping the non-LLaMA conversion scripts relatively simple.

@ggerganov what do you think?

@staviq
Copy link
Contributor

staviq commented Oct 2, 2023

If I'm reading the code right, that temp file is only being written to, no seeks so sequentially, until the point where it's rewound and copied to the output file.

If that's correct, as far as Linux is concerned, there shouldn't really be any benefit from using a tempfile for "write only" workload, since kernel caches already work in a similar way, and final write to file still takes as long as it takes.

If however, tempfile does in fact provide performance benefits, something else must be non optimal.

Python's open seems to allow for specifying buffer size, which isn't used in gguf.py, and as per docs, this means "default" os buffering is used, which is likely in order of kb.

I would imagine, specifying buffer size in open ( something like say 100M ) would likely allow to get rid of the tempfile.

@KerfuffleV2
Copy link
Collaborator

KerfuffleV2 commented Oct 2, 2023

Back when I did my cleanup stuff, I also added a feature to GGUFWriter where you can turn off using a temp file. Just pass use_temp_file = False to the constructor. I left it off by default just to avoid breaking stuff, I'm not 100% sure what the advantage of using the tempfile is.

Just for example, the GGML to GGUF converter script always sets that:

def save(self):
print('* Preparing to save GGUF file')
gguf_writer = gguf.GGUFWriter(
self.cfg.output,
gguf.MODEL_ARCH_NAMES[gguf.MODEL_ARCH.LLAMA],
use_temp_file = False )

@cebtenzzre
Copy link
Collaborator Author

cebtenzzre commented Oct 2, 2023

I'm not 100% sure what the advantage of using the tempfile is.

The only reason to use a tempfile is on platforms other than Linux (*BSD, macOS, Windows) where the temp folder is usually not a ramdisk. This allows you to convert a model with an output size larger than your RAM+swap space. But on typical Linux installs, use_temp_file has no benefit unless you set TMPDIR to disk.

@ggerganov
Copy link
Owner

We could make LazyUnpickler part of the gguf module. That way, we can significantly reduce memory usage, and avoid the need to load the entire model into memory, while hopefully still keeping the non-LLaMA conversion scripts relatively simple.

This sounds like the best option, but I'm not sure how difficult it is to achieve and if we can implement a simple enough API to reuse it in the simple convert scripts

@KerfuffleV2
Copy link
Collaborator

There's an approach that should mostly avoid the issues temp file and memory issues and that's to just numpy.memmap the files. PyTorch files are ZIP but they never actually compress the data, it's always just STOREed. numpy also supports mapping as copy-on-write, so you can do stuff like permute the tensor data without affecting the underlying file which can still be opened readonly. Ref: https://numpy.org/doc/stable/reference/generated/numpy.memmap.html

There is one limitation though, and that is on 32bit systems only files up to 2GB can be mmaped. Would that be acceptable?

@cebtenzzre
Copy link
Collaborator Author

cebtenzzre commented Oct 7, 2023

There is one limitation though, and that is on 32bit systems only files up to 2GB can be mmaped. Would that be acceptable?

AFAIK python on 32-bit Linux is built with 64-bit file offsets (-D_FILE_OFFSET_BITS=64), so the limit is actually 4GB. That's also the practical limit for loading on CPU, so it's probably fine.

edit: we use mmap for safetensors files in convert.py anyway.

@KerfuffleV2
Copy link
Collaborator

KerfuffleV2 commented Oct 7, 2023

the limit is actually 4GB.

The numpy reference for memmap specifically says it only supports 2GB so it's possible their wrapper adds some extra restrictions. (But it would also be possible to bypass that and just use Python's mmap instead if necessary.)

The source for the function is here: https://github.com/numpy/numpy/blob/main/numpy/core/memmap.py - it's pretty short if you want to take a look and see if there's a reason it would only support 2GB on 32bit. (Kind of looks like the main reason to use that function is because it simplifies using offset, where with mmap it has to be a multiple of the page size usually.)

@cebtenzzre
Copy link
Collaborator Author

just numpy.memmap the files

Unfortunately, an approach like this would not work with convert.py because the original LLaMA tensors are split between multiple files, and must be concatenated. And I would rather not have different versions of LazyUnpickler between convert.py and the gguf package.

@KerfuffleV2
Copy link
Collaborator

an approach like this would not work with convert.py because the original LLaMA tensors are split between multiple files, and must be concatenated.

Why not? Like I mentioned, it can be mmaped CoW so there aren't any restrictions to manipulating the data. The only thing I was thinking is it might be necessary to make that part more lazy (if it isn't already) because concatting/permuting all the tensors in advance would be basically the same as loading them into memory.

And I would rather not have different versions of LazyUnpickler between convert.py and the gguf package.

Sure, there should be only one version of it in one place. I'm not sure how that relates to loading the data via mmap instead of just reading it from a file though.

@cebtenzzre
Copy link
Collaborator Author

LazyUnpickler already waits until the last second to load tensor data from disk - it builds a chain of 'load' functions and then calls them when the data is needed. mmap would not be any simpler (it requires extra effort to find the offset in the zipfile, which is not directly exposed by the API) or save significant memory usage in convert.py as far as I can tell.

@KerfuffleV2
Copy link
Collaborator

it requires extra effort to find the offset in the zipfile, which is not directly exposed by the API)

That's true, but it's not too hard since it does expose an offset to the local file header (you basically just have to find the name length and extra part length and the raw data will be right past that).

or save significant memory usage in convert.py as far as I can tell.

Well, you wouldn't run out of temp space because there'd no longer be a need to use that approach. Right?

Also when concatting tensors, the data from the first in the sequence wouldn't be written to so it would basically be the same as a read only mmap. I'm not sure if permuting will touch every element or not. The gguf interface could probably be changed a little to support taking a list of tensor arrays that need to be saved, that way it potentially wouldn't be necessary to concat everything into one big numpy array at all.

@cebtenzzre
Copy link
Collaborator Author

convert.py doesn't use tmpfs, it calls GGUFWriter.write_tensor_data directly. This issue refers to the simple scripts that call GGUFWriter.add_tensor.

@cebtenzzre cebtenzzre mentioned this issue Oct 18, 2023
12 tasks
@Galunid
Copy link
Collaborator

Galunid commented Nov 7, 2023

I wonder if a better temporary solution is to set dir to a model directory, rather than $TMPDIR. I think it's much more likely user's tmpfs won't be able to fit a whole model, whereas "normal" partition is much more likely to be able to fit 2*model size in a pessimistic case.

@KerfuffleV2
Copy link
Collaborator

Python's temporary file handling stuff should respect environment variables like TMPDIR, so I think all you'd need to do is to add a little something to the docs to make it more obvious that large files will be created in the default temp directory and an explanation of how the user can change that if they want.

@Galunid
Copy link
Collaborator

Galunid commented Nov 7, 2023

I'm not sure if you understood what I meant. Python's SpooledTemporaryFile exposes dir argument in constructor, and allows for changing default behavior. I feel it's warranted in this case, since (unquantized) models usually are 9GB+. In most cases the whole tmpfs has ~8-16GB (default is 50% of ram excluding swap). I think it's better to overwrite the default, rather than let it fail. This issue has been pretty frustrating for me recently. When you convert larger models it usually fails near the end, after it run for a good 30-40 minutes, and you have to start from scratch. Of course the best solution is to use LazyUnpickler, but that's a more complex approach.

@KerfuffleV2
Copy link
Collaborator

I'm not sure if you understood what I meant.

What I meant is there's already a convention for allowing users to set where temp file get stored, and SpooledTemporaryFile respects that as far as I know. So we don't need to override that behavior, we just need to make users aware that they may need to do that. See what I mean?

I actually think it would probably be weird to override that behavior and use some other temp directory.

@cebtenzzre
Copy link
Collaborator Author

The point of SpooledTemporaryFile is actually that dir is supposed to be on disk - it's designed to use memory until a certain threshold, then write to dir, which is assumed to have more space, not less. Otherwise, you would just use RAM and not use a file object at all.

TMPDIR is the default for consistency with e.g. NamedTemporaryFile, but on Linux we really should not spool to /tmp - we should use the current directory (like transformers does) or /var/tmp.

Copy link
Contributor

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants