
Mac/Windows Support #3

Closed
pierotofy opened this issue Feb 16, 2024 · 33 comments · Fixed by #50
Labels
enhancement New feature or request

Comments

@pierotofy
Owner

pierotofy commented Feb 16, 2024

Haven't tested there, but should work with minimal/no changes.

@pierotofy pierotofy added the enhancement New feature or request label Feb 16, 2024
@pierotofy pierotofy changed the title from Mac/Windows Builds to Mac/Windows Support Feb 16, 2024
@DiHubKi
Contributor

DiHubKi commented Feb 18, 2024

It's working on Windows


Changes

  • model.cpp line 189 (see the note after this list)

std::vector<long int> repeats;
to
std::vector<int64_t> repeats;

  • opensplat.cpp line 125

model.savePlySplat(p.replace_filename(fs::path(p.stem().string() + "_" + std::to_string(step) + p.extension().string()).string()));
to
model.savePlySplat((p.replace_filename(fs::path(p.stem().string() + "_" + std::to_string(step) + p.extension().string())).string()));

  • CMakeLists.txt line 48
if (MSVC)
   file(GLOB TORCH_DLLS "${TORCH_INSTALL_PREFIX}/lib/*.dll")
   file(GLOB OPENCV_DLL "${OPENCV_DIR}/x64/vc16/bin/opencv_world490.dll")
   set(DLLS_TO_COPY ${TORCH_DLLS} ${OPENCV_DLL})
   add_custom_command(TARGET opensplat
       POST_BUILD
       COMMAND ${CMAKE_COMMAND} -E copy_if_different
       ${DLLS_TO_COPY}
       $<TARGET_FILE_DIR:opensplat>)
endif (MSVC)
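A note on why the first change is likely needed (an inference; the comment above doesn't spell it out): MSVC uses the LLP64 data model, where long is only 32 bits, while libtorch's tensor sizes and indices are int64_t. A minimal sketch:

#include <cstdint>
#include <vector>

int main() {
    // int64_t is 64-bit on every platform.
    static_assert(sizeof(int64_t) == 8, "int64_t is always 64-bit");
    // sizeof(long) is 4 with MSVC on Windows (LLP64) but 8 on typical
    // Linux/macOS builds (LP64), so std::vector<long int> does not line up
    // with libtorch APIs that expect 64-bit values. Hence the portable type:
    std::vector<int64_t> repeats;
    return 0;
}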

@pierotofy
Owner Author

That's awesome! Thanks for testing and confirming it works.

Would you be interested in opening a pull request with these changes? 🙏

@DiHubKi
Contributor

DiHubKi commented Feb 19, 2024

Okay

@DiHubKi DiHubKi mentioned this issue Feb 19, 2024
@pierotofy
Owner Author

Mac support will likely require porting gsplat to CPU (maybe via HIP).

@BarnabasTakacs

Hi, this is great stuff, thank you for doing it.

I am trying to recompile it on Windows too, but I get an error when it reaches the make -j$(nproc) command.
How did you compile it? Did you use the same steps as on the GitHub page, or did you need to modify them a bit?

@dm-de

dm-de commented Feb 23, 2024

run
cmake --build .

@dm-de

dm-de commented Feb 23, 2024

@Disa-Kizonda
Which versions do you use?
MSVC?
CUDA?
OpenCV?
libtorch?

My versions:
msvc 19.39.33520.0
cuda 11.8 (https://developer.download.nvidia.com/compute/cuda/11.8.0/network_installers/cuda_11.8.0_windows_network.exe)
opencv 4.9 (https://github.com/opencv/opencv/releases/download/4.9.0/opencv-4.9.0-windows.exe)

I had no success building with:
https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.2.1%2Bcu118.zip

I had success building with:
https://download.pytorch.org/libtorch/libtorch-win-shared-with-deps-1.13.1%2Bcu116.zip

I got many warnings - but the exe seems to start.
It's a pity that we had no build information for Windows so far.

@pierotofy
Owner Author

Instructions for Windows will be a bit different (e.g. using cmake --build ., as has been pointed out). We need to update the README. I'd love it if somebody could document the process once they have it running on Windows.

@dm-de

dm-de commented Feb 24, 2024

Now, I used libtorch 2.1.0 for CUDA 11.8:
https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.1.0%2Bcu118.zip

edit: not listed, but this seems to be a more up-to-date version:
https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.1.2%2Bcu118.zip

According to this, it's the best selection:
#17 (comment)

I had no more serious warnings.

Only while compiling the CUDA code:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\cuda\std\detail\libcxx\include\support\atomic\atomic_msvc.h(15): warning C4005: "_Compiler_barrier": macro redefinition

C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.39.33519\include\xatomic.h(55): note: see previous definition of "_Compiler_barrier"

(the same warning and note are printed a second time)

I don't know if this is normal...

@BarnabasTakacs

BarnabasTakacs commented Feb 24, 2024

Confirming Windows compilation; this is what you should see when it all works:

-- Building for: Visual Studio 16 2019
-- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19045.
-- The C compiler identification is MSVC 19.29.30147.0
-- The CXX compiler identification is MSVC 19.29.30147.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Warning (dev) at CMakeLists.txt:5 (set):
implicitly converting 'OPENCV_DIR' to 'STRING' type.
This warning is for project developers. Use -Wno-dev to suppress it.

-- Found CUDA: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8 (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/include (found version "11.8.89")
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe
-- Caffe2: CUDA toolkit directory: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8
-- Caffe2: Header version is: 11.8
-- C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/lib/x64/nvrtc.lib shorthash is dd482e34
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- Autodetected CUDA architecture(s): 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Found Torch: E:/BarnaDesktop/OpenSplat/Installs/libtorch-win-shared-with-deps-2.1.2+cu118/lib/torch.lib
-- OpenCV ARCH: x64
-- OpenCV RUNTIME: vc16
-- OpenCV STATIC: OFF
-- Found OpenCV: C:/opencv/build (found version "4.9.0")
-- Found OpenCV 4.9.0 in C:/opencv/build/x64/vc16/lib
-- You might need to add C:\opencv\build\x64\vc16\bin to your PATH to be able to run your applications.
-- Configuring done
-- Generating done
-- Build files have been written to: E:/OpenSplat/build

It shows a few warnings (in yellow), but the exe is generated: opensplat.vcxproj -> E:\OpenSplat\build\Debug\opensplat.exe

@BarnabasTakacs

If the above installations are correct, you can open build/opensplat.sln directly in Visual Studio and compile (or debug) there.

@dm-de

dm-de commented Feb 24, 2024

Installed software:
Visual Studio 2022 C++
https://github.com/Kitware/CMake/releases/download/v3.28.3/cmake-3.28.3-windows-x86_64.msi
https://developer.download.nvidia.com/compute/cuda/11.8.0/network_installers/cuda_11.8.0_windows_network.exe
https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.1.2%2Bcu118.zip
https://github.com/opencv/opencv/releases/download/4.9.0/opencv-4.9.0-windows.exe

Build:
"C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Auxiliary/Build/vcvars64.bat"
git clone https://github.com/pierotofy/OpenSplat OpenSplat
cd OpenSplat
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=C:/path_to/libtorch_2.1.2_cu11.8 -DOPENCV_DIR=C:/path_to/OpenCV_4.9.0/build -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release

Optional: edit the CUDA target (only if required) before running cmake --build .:
C:/path_to/OpenSplat/build/gsplat.vcxproj
for example: arch=compute_75,code=sm_75

Run:
cd Release
opensplat /path/to/banana -n 2000

@TimBoostGraphics

TimBoostGraphics commented Feb 27, 2024

@dm-de did you manage to get it working? I'm trying, but after building and running "opensplat /path/to/banana -n 2000" it just loads the images and stops.

@pfxuan
Collaborator

pfxuan commented Feb 27, 2024

@pierotofy Any ideas on porting the gsplat CUDA code to an MPS, MLX, or CPU-based backend? I can easily compile gsplat on a Mac M2 chip in NO_CUDA mode, but the most important part (the csrc CUDA extension) is skipped: https://github.com/nerfstudio-project/gsplat/blob/main/setup.py#L132

@DiHubKi
Contributor

DiHubKi commented Feb 28, 2024

@TimBoostGraphics

Not enough VRAM; try -d 8.

@TimBoostGraphics

@Disa-Kizonda Your solution works, but I have an A6000 with 48 GB of VRAM, and the opensplat command is barely using any VRAM while running. It's crashing in opensplat.cpp on this line:
torch::Tensor gt = cam.getImage(model.getDownscaleFactor(step));

@dm-de

dm-de commented Feb 28, 2024

@dm-de did you manage to get it working? I'm trying, but after building and running "opensplat /path/to/banana -n 2000" it just loads the images and stops.

I have got it working...
I was able to run it with 16 banana images scaled down to 1000px with 8 GB of VRAM.
A downscale factor lower than 3 does not work for me.
Memory consumption seems huge...
I only have 8 GB, and I don't think I'll use OpenSplat as it is today.

I also used:
https://github.com/MrNeRF/gaussian-splatting-cuda
I was able to run it with 250 images of around the same size.
It's harder to compile; follow the instructions here:
MrNeRF/gaussian-splatting-cuda#4 (comment)

@pierotofy
Owner Author

I only have 8 GB, and I don't think I'll use OpenSplat as it is today.

That's strange; I've run the banana dataset with the defaults on a card that has 2GB of memory. There might be something going on.

@pierotofy
Owner Author

pierotofy commented Feb 28, 2024

@pfxuan gsplat currently requires CUDA. (The BUILD_NO_CUDA option does not make it work without CUDA; it just doesn't build the CUDA parts for you, which is helpful during development.)

Adding CPU support will require a rewrite of gsplat for the CPU using multithreading, or porting the CUDA code to HIP (https://github.com/ROCm/HIP), which should have support for compiling to CPU. The latter might be easier to do.

@dm-de

dm-de commented Feb 28, 2024

That's strange; I've run the banana dataset with the defaults on a card that has 2GB of memory. There might be something going on.

banana with -d 3:
total system VRAM usage: 6.5 GB (starting from ~2 GB)
VRAM grows quickly at first and then stabilizes

banana with -d 2:
stops immediately after image loading; zero steps run
Crash without an error message (a 330 MB crash dump is saved at %appdata%\local\CrashDumps)

edit: the crash typically happens when the graphics memory runs out

GFX card: Quadro RTX 4000

@pierotofy
Owner Author

I wonder if it's Windows related (I've run the software on Linux).

@salovision
Contributor

I've tested this on Windows with an RTX 2080 and 8 GB of memory. I can run opensplat /path/to/banana -n 2000 successfully and it takes about 2.7 GB of VRAM (peak) according to nvidia-smi. However, I had to fix a couple of issues first:

  1. cv_utils.cpp, in function tensorToImage(), replace:
    uint8_t *dataPtr = static_cast<uint8_t *>((t * 255.0).toType(torch::kU8).data_ptr());
    with
    torch::Tensor scaledTensor = (t * 255.0).toType( torch::kU8 );
    uint8_t* dataPtr = static_cast<uint8_t*>(scaledTensor.data_ptr());
    Reason: This sometimes crashes because data_ptr() points into a temporary Tensor that has already been destructed by the time the pointer is used; binding the tensor to a named variable keeps it alive.

  2. opensplat.cpp, in function main(), replace:
    InfiniteRandomIterator<Camera> camsIter(cams);
    with
    std::vector< size_t > indices( cams.size() );
    std::iota( indices.begin(), indices.end(), 0 );
    InfiniteRandomIterator<size_t> camsIter(indices);
    and
    Camera cam = camsIter.next();
    with
    Camera& cam = cams[ camsIter.next() ];
    Reason: The original camsIter.next() returned a new copy-constructed Camera, so Camera::getImage() could not cache the imagePyramids properly and ended up resizing the image every time, slowing things down.

  3. model.cpp, in function Model::forward(), I removed the calls to cam.scaleOutputResolution() and calculated the scaled cam.fx, cam.fy, cam.cx, cam.cy, cam.height and cam.width locally in that function (see the sketch after this list). The problem with scaleOutputResolution() is that it quantizes the camera width/height, since it operates with floats and ints, and the rescaling back does not really work: it does not return the original image size, which later causes crashes where the rgb and gt tensor sizes do not match.
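A minimal sketch of what fix 3 could look like (an assumption based on the description above; the actual Model::forward() code and Camera fields may differ):

#include <cmath>

// Compute downscaled pinhole intrinsics locally rather than mutating the
// Camera. Names mirror the fields mentioned above; this is an illustration,
// not the actual OpenSplat code.
struct ScaledIntrinsics {
    float fx, fy, cx, cy;
    int width, height;
};

ScaledIntrinsics scaleIntrinsics(float fx, float fy, float cx, float cy,
                                 int width, int height, float downscale) {
    ScaledIntrinsics s;
    s.fx = fx / downscale;
    s.fy = fy / downscale;
    s.cx = cx / downscale;
    s.cy = cy / downscale;
    // Round once, locally; the original width/height stay untouched, so there
    // is no lossy scale-down/scale-up round trip on the Camera itself, and the
    // rendered rgb tensor can always match the gt image size.
    s.width = static_cast<int>(std::round(width / downscale));
    s.height = static_cast<int>(std::round(height / downscale));
    return s;
}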

It also currently seems to be CPU-bound: mainLoss.backward() appears to take the majority of the time, so there are probably still a lot of opportunities for further optimization. Thanks for taking the initiative to rewrite this in C++/CUDA instead of Python!

@pierotofy
Owner Author

pierotofy commented Mar 11, 2024

Thanks for testing and the detailed explanation of changes!

Would you be interested in opening a pull request with the aforementioned changes? It might help other users on Windows.

@salovision
Contributor

Thanks for testing and the detailed explanation of changes!

Would you be interested in opening a pull request with the aforementioned changes? It might help other users on Windows.

I can do that. There are obviously multiple ways of fixing some of these things; in the case of scaleOutputResolution(), I would personally remove the function altogether and do the calculations inside Model::forward(), as it's safer not to modify the original camera parameters. But I can make a pull request and you can decide if those fixes sound good to you :)

@pierotofy
Owner Author

pierotofy commented Mar 12, 2024

@dm-de try with the latest main branch, as the changes from #37 might have fixed the memory issue you were experiencing? 🙏

@ichsan2895
Contributor

Something is going on. I tried it on Ubuntu 22.04 LTS, and it consumes >10 GB with -d 1 on 251 images at 960 x 540 resolution, as reported by nvidia-smi. But as @pierotofy has said, memory allocated by libtorch is not automatically released; it is kept in an available state. So don't trust the number from nvidia-smi; the actual usage is much lower.
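For anyone who wants to see the actual usage instead of the number from nvidia-smi, a sketch using libtorch's caching-allocator statistics (the exact API location can vary between libtorch versions):

#include <c10/cuda/CUDACachingAllocator.h>
#include <iostream>

void printCudaMemoryStats() {
    // Index 0 of each stat array is the aggregate across allocation pools.
    const auto stats = c10::cuda::CUDACachingAllocator::getDeviceStats(0);
    std::cout << "allocated (actually in use): "
              << stats.allocated_bytes[0].current << " bytes\n"
              << "reserved (cached by libtorch): "
              << stats.reserved_bytes[0].current << " bytes\n";
}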

@dm-de

dm-de commented Mar 13, 2024

My results Windows x64 / Quadro RTX 4000 8GB

iter (x1000) | time  | splats (x1000) | VRAM (GB) | note
start        | 12:33 | 0              | 0.8       |
1            | 12:34 | 82             | 2.0       |
2            | 12:34 | 273            | 3.7       |
2.9          | 12:35 | 407            | 5.5       | slow after 3k (?)
4            | 12:37 | 423            | 6.1       |
5            | 12:38 | 454            | 6.8       |
5.9          | 12:39 | 471            | 7.7       | very slow after 6k
7            | 12:44 | 465            | 4.8       |
8            | 12:48 | 468            | 4.9       |
8.9          | 12:52 | 470            | 4.9       |
10           | 12:57 | 465            | 6.6       |
11           | 13:01 | 465            | 6.6       |
11.9         | 13:04 | 464            | 6.6       |
stop         |       |                | 0.6       |

@salovision
Contributor

Indeed, there is a problem with the memory management: the refining steps (densify/culling) essentially recreate all tensors (by using Tensor::index), and the current CUDA memory manager likes to keep the old tensors in memory for caching. While this may work in some projects, here it just exhausts all memory, after which everything slows down dramatically.

Fortunately, there is an easy fix: empty the CUDA memory cache after every refine step.

In the Model::afterTrain() function, add c10::cuda::CUDACachingAllocator::emptyCache(); at the end of the function, after the line max2DSize = torch::Tensor();. You also need to #include <c10/cuda/CUDACachingAllocator.h> at the beginning of the file. (A sketch follows below.)
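For reference, a sketch of where the call lands (assuming the function layout described above; the real Model::afterTrain() contains more logic):

#include <c10/cuda/CUDACachingAllocator.h>

void Model::afterTrain(){
    // ... densify / culling refinement logic ...
    max2DSize = torch::Tensor();
    // Return cached blocks to the driver so stale tensors recreated during
    // refinement don't pile up in the allocator's cache:
    c10::cuda::CUDACachingAllocator::emptyCache();
}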

Here's a comparison before and after the change:

Windows 10 / RTX 2080 / 8GB VRAM

Step | Splats (before) | Time (before) | VRAM (before) | Splats (after) | Time (after) | VRAM (after)
200  | 33951   | 00:15 | 1364MB | 33951   | 00:15 | 1364MB
400  | 33951   | 00:30 | 1364MB | 33951   | 00:30 | 1364MB
600  | 33951   | 00:45 | 1364MB | 33951   | 00:45 | 1364MB
800  | 46498   | 01:00 | 1372MB | 46423   | 00:59 | 1372MB
1000 | 82050   | 01:15 | 1416MB | 82165   | 01:14 | 1414MB
1200 | 134395  | 01:31 | 1610MB | 134517  | 01:30 | 1488MB
1400 | 200933  | 01:47 | 2044MB | 201939  | 01:46 | 1598MB
1600 | 274549  | 02:03 | 2602MB | 276601  | 02:02 | 1650MB
1800 | 353564  | 02:20 | 3322MB | 357344  | 02:19 | 1802MB
2000 | 431318  | 02:38 | 4358MB | 437070  | 02:36 | 1972MB
2200 | 508949  | 02:57 | 5496MB | 516026  | 02:55 | 2134MB
2400 | 585358  | 03:16 | 6948MB | 593383  | 03:14 | 2232MB
2600 | 664535  | 03:37 | 8191MB | 673355  | 03:34 | 2420MB
2800 | 740199  | 09:29 | 8191MB | 750424  | 03:55 | 2530MB
3000 | 816083  | 16:05 | 8191MB | 827623  | 04:16 | 2766MB
3200 |         |       |        | 827623  | 04:40 | 2472MB
3400 |         |       |        | 833699  | 05:03 | 2474MB
3600 |         |       |        | 812580  | 05:27 | 2752MB
3800 |         |       |        | 839227  | 05:50 | 2774MB
4000 |         |       |        | 871056  | 06:14 | 2812MB
4200 |         |       |        | 906333  | 06:38 | 2836MB
4400 |         |       |        | 947372  | 07:02 | 3028MB
4600 |         |       |        | 991100  | 07:27 | 3090MB
4800 |         |       |        | 1035020 | 07:52 | 3140MB
5000 |         |       |        | 1085761 | 08:18 | 3222MB

Previously, learning stalled after 2600 steps once VRAM was full: reaching 3000 steps took 16 minutes, where it now takes about 4 minutes when the cache is cleared after refining. Now I can finally use this project with bigger datasets without running out of VRAM, even on an 8 GB card.

@pfxuan
Collaborator

pfxuan commented Mar 17, 2024

Thanks for bringing up the cache issue. It appears emptyCache() also helps control VRAM consumption on AMD GPUs.

GPU: AMD RX 6700 XT

Before
VRAM: 12,200 MiB
Time: 02:35

After
VRAM: 2,544 MiB
Time: 02:35

I'll add this fix to both the CUDA and ROCm code shortly.

@ichsan2895
Contributor

ichsan2895 commented Mar 18, 2024

It's an important fix (it also fixes #43); I think @pierotofy should tag OpenSplat v1.0.3.

@josephldobson

Is rewriting the CUDA kernels in MSL possible? This is something I'd be interested in learning how to do over the summer. Has anyone started on it?

@pierotofy
Owner Author

pierotofy commented Mar 18, 2024

That would be a cool project! I don't see why it wouldn't be possible; I'm very close to having a CPU-only implementation ready (https://github.com/pierotofy/OpenSplat/tree/cpudiff), so an MSL version would just be another port.

This was referenced Mar 20, 2024
@andrewkchan
Contributor

@josephldobson fyi! #76
