Instructions for setup and running on Mac Silicon chips #25
Comments
I heard that pytorch updated to include Apple MPS in the latest nightly release as well. Will this improve performance on M1 devices by utilizing Metal? |
With Homebrew:

```
brew install python@3.10
pip3 install torch torchvision
pip3 install setuptools_rust
pip3 install -U git+https://github.com/huggingface/diffusers.git
pip3 install transformers scipy ftfy
```

Then start Stable Diffusion. It is CPU-only on M1 Macs because not all the pytorch ops are implemented for Metal. Generating one image with 50 steps takes 4-5 minutes. |
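In case it helps anyone following these steps, here is a minimal sketch of the diffusers invocation implied by the last step (the model id, the Hugging Face login requirement, and the output attribute are assumptions; details vary by diffusers version):

```python
# Sketch only: assumes you have accepted the model license on Hugging Face
# and logged in with `huggingface-cli login`.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,
)
pipe = pipe.to("cpu")  # CPU for now, since not all ops are implemented for MPS

result = pipe("a photograph of an astronaut riding a horse", num_inference_steps=50)
result.images[0].save("astronaut.png")
```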
Hi @mja, thanks for these steps. I can get as far as the last one, but then installing transformers fails with this error. (The install of setuptools_rust was successful.)
For context, the first step failed to install python3.10 with brew, so I did it with Conda instead. Not sure if having a full anaconda env installed is the problem |
Just tried the pytorch nightly build with mps support and have some good news. On my cpu (M1 Max) it runs very slow, almost 9 minutes per image, but with mps enabled it's ~18x faster: less than 30 seconds per image🤩 |
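For anyone trying to reproduce this, a quick way to confirm that your PyTorch build can actually see the MPS device is shown below (a small sketch; these attributes exist in recent nightly/stable builds but not in older releases):

```python
import torch

print("MPS built:    ", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    # Run a trivial op on the GPU and copy the result back to the CPU.
    x = torch.randn(2, 3, device="mps")
    print((x * 2).cpu())
```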
Incredible! Would you mind sharing your exact setup so I can duplicate on my end? |
Unfortunately I got it working by many hours of trial and error, and in the end I don't know what worked. I'm not even a programmer, I'm just really good at googling stuff. Basically my process was:
I'm sorry that I can't be more helpful than this. |
Thanks. What are you currently using for checkpoints? Are you using research weights or are you using another model for now? |
I don't have access to the model so I haven't tested it, but based off of what @filipux said, I created this pull request to add mps support. If you can't wait for them to merge it you can clone my fork and switch to the apple-silicon-mps-support branch and try it out. Just follow the normal instructions but instead of running |
I couldn't quite get your fork to work @magnusviri, but based on most of @filipux's suggestions, I was able to install and generate samples on my M2 machine using https://github.com/einanao/stable-diffusion/tree/apple-silicon |
Edit: If you're looking at this comment now, you probably shouldn't follow this. Apparently a lot can change in 2 weeks!

Old comment: I got it to work fully natively without the CPU fallback, sort of. The way I did things is ugly since I prioritized making it work. I can't comment on speeds, but my assumption is that using only the native MPS backend is faster? I used the mps_master branch from kulinseth/pytorch as a base, since it contains an implementation for the missing op. If you want to use my ugly changes, you'll have to compile PyTorch from scratch as I couldn't get the CPU fallback to work:

```
# clone the modified mps_master branch
git clone --recursive -b mps_master https://github.com/Raymonf/pytorch.git pytorch_mps && cd pytorch_mps

# dependencies to build (including for distributed)
# slightly modified from the docs
conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses pkg-config libuv

# build pytorch with explicit USE_DISTRIBUTED=1
USE_DISTRIBUTED=1 MACOSX_DEPLOYMENT_TARGET=12.4 CC=clang CXX=clang++ python setup.py install
```

I based my version of the Stable Diffusion code on the code from PR #47's branch; you can find my fork here: https://github.com/Raymonf/stable-diffusion/tree/apple-silicon-mps-support Just your typical setup from there.

Edit: It definitely takes more than 20 seconds per image at the default settings with either sampler, not sure if I did something wrong. Might be hitting pytorch/pytorch#77799 :(

@magnusviri: You are free to take anything from my branch for yourself if it's helpful at all, thanks for the PR 😃 |
@Raymonf: I merged your changes with mine, so they are in the pull request now. It caught everything that I missed and is almost identical to the changes that @einanao made as well. The only difference I could see was in ldm/models/diffusion/plms.py (both versions are quoted in the reply below).
I don't know what the code differences are, except that I read that adding .contiguous() fixes bugs when falling back to the CPU. |
Pretty sure my version is redundant (I also added a downstream call to .contiguous(), but forgot to remove this one)
einanao:

```python
def register_buffer(self, name, attr):
    if type(attr) == torch.Tensor:
        if attr.device != torch.device("cuda"):
            attr = attr.type(torch.float32).to(torch.device("mps")).contiguous()
```

Raymonf:

```python
def register_buffer(self, name, attr):
    if type(attr) == torch.Tensor:
        if attr.device != torch.device(self.device_available):
            attr = attr.to(torch.float32).to(torch.device(self.device_available))
```
|
@einanao Maybe not! How long does yours take to run the default seed and prompt with full precision? GNU time reports |
It takes me 1.5 minutes to generate 1 sample on a 13 inch M2 2022 |
I'm getting this error when trying to run with the laion400 data set:
Is this an issue with the torch functional.py script? |
Yes, see @filipux's earlier comment:
|
@einanao thank you. One step closer, but now I'm getting this: Here is my function:
|
For benchmarking purposes - I'm at ~150s (2.5 minutes) on each iteration past the first, which was over 500s after setting up with the steps in these comments. 14" 2021 Macbook Pro with base specs. (M1 Pro chip) |
This worked for me. I'm seeing about 30 seconds per image on a 14" M1 Max MacBook Pro (32 GPU core). |
What steps did you follow? |
@henrique-galimberti I followed these steps:
|
mps support for aten::index.Tensor_out is now in pytorch nightly according to Denis |
Looks like there's a ticket for the reshape error at pytorch/pytorch#80800 |
Is that the pytorch nightly branch? That particular branch is 1068 commits ahead and 28606 commits behind the master. The last commit was 15 hours ago. But master has commits kind of non-stop for the last 9 hours. |
Where can I find the functional.py file? |
For me the path is below. Your path will be different.
Then replace |
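If you're unsure where your own copy lives, one way to locate it (just a sketch; the exact path depends on your Python environment) is to ask Python where torch is installed:

```
# Prints the location of torch/nn/functional.py inside your active environment.
python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'nn', 'functional.py'))"
```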
It was merged 5 days ago so it should be in the regular PyTorch nightly that you can get directly from the PyTorch site. |
I also followed these steps and confirmed MPS was being used (printed the return value of
|
I've installed miniconda from the repo. Then I try to run: python scripts/dream.py --full_precision |
I believe "osx-arm64" is for M1 Macs, if you have an Intel Mac that's the wrong version. I'd just try and install Miniconda from the website and use that, just in case that's the issue. |
@AntonEssenetial @hawtdawg is correct. Also, I believe your existing environment is part of the problem. Try removing it first with: Then, because you are on Intel, you should try modifying the command to be:
I'm not sure if someone has made a more complete guide for Intel Macs, but the default instructions on lstein may not work for you right now. I anticipate the |
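For what it's worth, the commands being hinted at probably look something like the sketch below (the environment name `ldm` and the file name `environment-mac.yml` are assumptions based on the lstein fork; adjust them to match your checkout):

```
# Remove the existing (arm64) environment first; assumes it is named "ldm".
conda env remove -n ldm

# Recreate it, forcing conda to resolve Intel (osx-64) packages instead of osx-arm64.
CONDA_SUBDIR=osx-64 conda env create -f environment-mac.yml
```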
I got the lstein development branch to work on my Intel Mac without changing anything, with the only exception being installing Conda for Intel instead of Arm. So I think this is the only thing that needs to be done, and everything is the same from there on. |
That would be great 😌. |
ldm is active, but: python scripts/dream.py --full_precision
|
Do you have the model file at |
@Birch-san I think there is a bug in your birch-mps-waifu branch that means txt2img_fork.py picks the last sample file's filename number and overwrites it (on each start). E.g. base_count is somehow off by one. Edit: ah, I see. It just uses |
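For context, a collision-proof way to pick the next output index (just a sketch of the idea, not the actual fix that later landed; it assumes output filenames start with a numeric prefix) could look like this:

```python
import os
import re

def next_base_count(sample_path: str) -> int:
    """Return one past the highest numeric prefix among existing sample files."""
    indices = []
    for name in os.listdir(sample_path):
        m = re.match(r"(\d+)", name)
        if m:
            indices.append(int(m.group(1)))
    return max(indices, default=-1) + 1
```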
@HenkPoley phew! I was scared for a moment there, since someone else reported the same thing, but I think they had also deleted some files themselves. Recent stuff I've been working on is trying to optimize attention (e.g. trying matmul instead of einsum, trying the changes from Doggettx / neonsecret, trying opt_einsum, trying cosine similarity attention), but none of those ideas improved the speed. The next thing I want to add is latent walks. I'm trying to do it without losing support for multi-prompt or multi-sample, so it's a bit harder than copying existing code. I also want to look at better img2img capabilities. |
fix underway for |
Thx, solved, file was damaged 🤦🏻♂️ |
Maybe someone knows how to switch to the GPU on a MacBook Pro with an AMD Radeon Pro 5500M 4 GB; for some reason it runs on the CPU. (ldm) ➜ stable-diffusion git:(main) python scripts/dream.py --full_precision
|
That message doesn't mean it all runs on the CPU, just some instructions. Check Activity Monitor and you'll see it using a big chunk of GPU. |
thx |
Looks like PyTorch nightly 1.13.0.dev20220915 (or slightly earlier) fixes the 'leaked semaphore' problem (I might misattribute it to PyTorch). Or at least I haven't seen it in a while.
|
btw, if you're running on a nightly build: beware that there's a bug with |
@Birch-san 👀 "Speed up stable diffusion by ~50% using flash attention" https://twitter.com/labmlai/status/1573634095732490240 ..might just be a CUDA thing, the way it (doesn't) have access to large caches. |
@HenkPoley oh wow! I'll have a look into it, but I think the point of Flash attention is using the hardware better. As such, I think there's only a CUDA version of it at the moment. I know there's one merged into PyTorch, but by default it isn't even built. Perhaps this requires building PyTorch from source, enabling that build flag, and relies on having CUDA? Will investigate. |
okay yeah, it uses HazyResearch's implementation, which is definitely CUDA-specific. |
Hello everyone! I've got both x86_64 (Anaconda) and arm64 (Miniforge3) Conda environments, but the arm64 one runs much slower than the x86 one. I have no idea how to speed it up under arm64, because it's as slow as the CPU. I am using the same code, and I'm on an M1 chip. |
@AkiKagura This repository has never been changed to enable MPS, so it is running on the CPU on Apple Silicon. Try another fork; I use this one. Also, don't use the pytorch nightlies: they've totally tanked torch.einsum performance recently, so stick with the current stable. |
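If you're not sure which build you ended up with, a quick check and a way to drop back to the stable wheels is sketched below (version numbers are whatever PyPI currently serves; nothing is pinned here):

```
# Show the installed build; nightly builds carry a ".dev" date suffix.
python -c "import torch; print(torch.__version__)"

# Reinstall the current stable wheels from PyPI.
pip3 install --force-reinstall torch torchvision
```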
@Vargol In fact, I am using a version of SD that has been modified to add mps support, and it can generate a picture in about 3-4 minutes on the x86 environment (but on arm64 it's much slower). |
For comparison, I'm on an 8GB Mac mini, and despite it still causing a little swap usage now and again I get 4 it/s, so I understand it's significantly faster with 16GB+. An M1 Max with 64GB should take ~30 seconds per image for a 50-step 512x512 image, plus the model loading time, according to the benchmarks I've seen. |
@AkiKagura M1 Max with 64GB RAM. The time is 26-27s for 50 steps. It was 30s, but an extra optimization was added a while ago in https://github.com/invoke-ai/InvokeAI
I suggest you use that repo if you are on M1 because a lot of people are there, collaborating, and it has many features, such as inpainting, outpainting, textual inversion, as well as allowing you to generate large images (e.g. 1024x1024 and beyond) without out-of-memory problems, etc. Also I suggest you check out @Birch-san's repo. |
- Change occurrences (hardcodings, default arguments) of "cuda" to accept other torch devices ("mps", "cpu")
- Auto-detect and set torch device when running on appropriate hardware
- Don't use unsupported autocast when running on MPS, and always use full precision (float32)
- Port CrossAttention optimizations from InvokeAI stable diffusion frontend, which dramatically speed up inference
- Fix seed instability caused by torch.randn not using the global seed on MPS hardware
- Various other bugfixes from (CompVis/stable-diffusion#25)
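As a rough illustration of the device auto-detection and the MPS seeding workaround mentioned above (a sketch under assumed names, not the actual patch):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then MPS, then CPU, mirroring the auto-detection described above."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def seeded_randn(shape, seed: int, device: torch.device) -> torch.Tensor:
    """Work around torch.randn ignoring the global seed on MPS by sampling on CPU first."""
    generator = torch.Generator(device="cpu").manual_seed(seed)
    return torch.randn(shape, generator=generator, device="cpu").to(device)

noise = seeded_randn((1, 4, 64, 64), seed=42, device=pick_device())
```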
I'm using stable-diffusion on a 2022 MacBook M2 Air with 24 GB unified memory. I see this taking about 2.0s/it.

I've moved many deps from pip to conda-forge, to take advantage of the precompiled binaries.

Some notes for Mac users, since I've seen a lot of confusion about this: One doesn't need the `apple` channel to run this on a Mac -- that's only used by `tensorflow-deps`, required for running tensorflow-metal. For that, I have an example environment.yml here: https://developer.apple.com/forums/thread/711792?answerId=723276022#723276022

However, the `CONDA_ENV=osx-arm64` environment variable *is* needed to ensure that you do not run any Intel-specific packages such as `mkl`, which will fail with [cryptic errors](CompVis/stable-diffusion#25 (comment)) on the ARM architecture and cause the environment to break.

I've also added a comment in the env file about 3.10 not working yet. When it becomes possible to update, those commands run on an osx-arm64 machine should work to determine the new version set.

Here's what a successful run of dream.py should look like:

```
$ python scripts/dream.py --full_precision                  SIGABRT(6) ↵  08:42:59
* Initializing, be patient...
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Using slower but more accurate full-precision math (--full_precision)
>> Setting Sampler to k_lms
model loaded in 6.12s
* Initialization done! Awaiting your command (-h for help, 'q' to quit)
dream> "an astronaut riding a horse"
Generating:   0%|          | 0/1 [00:00<?, ?it/s]/Users/corajr/Documents/lstein/ldm/modules/embedding_manager.py:152: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1662016319283/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  placeholder_idx = torch.where(
100%|██████████████████████████████████████████| 50/50 [01:37<00:00,  1.95s/it]
Generating: 100%|███████████████████████████████| 1/1 [01:38<00:00, 98.55s/it]
Usage stats:
  1 image(s) generated in 98.60s
  Max VRAM used for this generation: 0.00G
Outputs:
outputs/img-samples/000001.1525943180.png: "an astronaut riding a horse" -s50 -W512 -H512 -C7.5 -Ak_lms -F -S1525943180
```
Does anyone know what commands will increase the resolution to photographic quality, like you get from the Stable Diffusion website, and whether you can get more than one image at a time? This is the only command I have right now to define the output: |
@BeginAnAdventure In txt2img, see stable-diffusion/scripts/txt2img.py lines 163 to 174 (at commit 21f890f).
For more than one image at a time, you might have a bad time, but the option is actually there. But do consider using InvokeAI or some other UI; the scripts in this repository aren't exactly "feature complete". |
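For example, an invocation along these lines (flag names taken from the stock scripts/txt2img.py options referenced above; the prompt and sizes are placeholders, and large sizes may exceed your memory):

```
python scripts/txt2img.py \
  --prompt "a photograph of an astronaut riding a horse" \
  --H 768 --W 768 \
  --n_samples 1 --n_iter 2 \
  --ddim_steps 50
```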
That's what I thought. I did try using this one: But I keep getting errors despite no issues with doing all in Terminal. Will check out InvokeAI, just wondering if I'll need to create a whole new setup for their UI. Thanks! |
Hi,
I've heard it is possible to run Stable-Diffusion on Mac Silicon (albeit slowly); it would be good to include basic setup and instructions for doing this.
Thanks,
Chris