
[WIP] Allocate Stan data on a local stack in the model #848

Closed

Conversation

SteveBronder
Contributor

Summary

This is a first hack at getting the data in a Stan model to sit on a local stack allocator. To do this we need to:

  1. Put a stan::math::stack_alloc in the private members of the model class.
  2. Use the constructor's initializer list to construct the data and set the size of the memory we want on the stack.
  • For the stack allocator, the size we want is the total size of the data and transformed data.
  • Each initializer calls a function
stan::io::initialize_data<{datas_type}>("data_name", context__, model_arena_, {data dimensions})

That constructs the data from the context directly into the memory allocated by model_arena_; a rough sketch of what the generated constructor could look like is given below.
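To make the scheme concrete, here is a minimal sketch of the shape the generated model class could take, assuming the initialize_data helper described above; the model name, data names, dimensions, and exact constructor arguments are illustrative, not taken from real generated code.

class my_model final : public model_base_crtp<my_model> {
 private:
  // (1) the local stack allocator lives inside the model
  stan::math::stack_alloc model_arena_;
  int N;
  int K;
  // data members are views into memory handed out by model_arena_
  Eigen::Map<Eigen::MatrixXd> X;
  Eigen::Map<Eigen::VectorXd> Y;

 public:
  my_model(stan::io::var_context& context__, unsigned int random_seed__)
      : model_arena_(/* total bytes needed for data + transformed data */),
        N(context__.vals_i("N")[0]),
        K(context__.vals_i("K")[0]),
        // (2) each data member is constructed straight into the arena
        X(stan::io::initialize_data<Eigen::MatrixXd>("X", context__,
                                                     model_arena_, N, K)),
        Y(stan::io::initialize_data<Eigen::VectorXd>("Y", context__,
                                                     model_arena_, N)) {
    // constraint checks and transformed data statements stay in the body
  }
};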

The code gen is not all there yet; I'm still fighting the fmt library a bit. There are two other big things this is missing, though.

  1. How to keep the transformed data and constraint checks. I think it would be fine to leave the constraint checks in the constructor body, and the manipulations of the transformed data should still happen in the constructor body as well. We could also change the signature of initialize_data to take an enum as a tag for which constraint to check, such as
enum class constrain_checks {none, l_bound, u_bound, lu_bound, ...}
// in the constructor
x(stan::io::initialize_data<std::vector<double>, constrain_checks::u_bound>("x", context__, model_arena_, 10))

I'm fine with whatever folks think would be easier or better; a rough sketch of what the tagged signature could look like is below.
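For reference, one possible shape for that tagged overload. The enum, the extra template parameter, and the trailing dimension pack are all hypothetical, not an existing API, and a real version would also need the bound values passed in.

enum class constrain_checks { none, l_bound, u_bound, lu_bound };

namespace stan {
namespace io {
// Sketch: construct the named data in the arena and run the constraint
// check selected by Check.
template <typename RetT, constrain_checks Check = constrain_checks::none,
          typename... Dims>
RetT initialize_data(const char* name, const stan::io::var_context& context,
                     stan::math::stack_alloc& arena, Dims... dims);
}  // namespace io
}  // namespace stan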

  2. How to keep the *_opencl__ assignments (or how to do them in the initializer list). I think we could do a second pass over the data and, if any are OpenCL-tagged, also add those to the initializer list. Though we will still need to keep OpenCL data derived from transformed data in the constructor body.

It will probably be best if (1) waits on @rybern's current refactor of the constraints, and (2) might need to wait for @rok-cesnovar's MIR stuff for OpenCL.

Release notes

Allows the Stan model to place data and transformed data on a local stack.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

@SteveBronder
Contributor Author

Shoot, I meant to mark this as a draft, but I don't see a button to change it to a draft.

bbbales2 marked this pull request as draft on March 10, 2021, 14:57
@bbbales2
Member

change it to a draft

It's hidden under the reviewers list, before the Assignees section, in a small font.

@SteveBronder
Contributor Author

Weird, it's not there for me?

@rok-cesnovar
Member

Ben converted it already.

Check any other non-draft PR:

[screenshot of the PR sidebar showing the convert-to-draft option]

@SteveBronder
Contributor Author

Thanks! My eyes must be going goofy.

@SteveBronder
Contributor Author

SteveBronder commented Mar 28, 2021

Alright, so I got this almost up and running to the point that I can use this branch of stan to run a toy model, with a few fixes to the cpp that I'll discuss below. I'm using the diamonds example from posteriordb, but modified to use normal() instead of normal_id_glm(): normal_id_glm() does not hold copies of the data, while normal() makes us call multiply(), so we can see the benefit of this scheme from not having to copy the data in multiply().

Y ~ normal(Intercept + multiply(Xc, b), sigma);

I also took the data Y and X and just replicated them 6 times, so instead of 5K rows it's 30K rows.

Running cmdstan with current develop I get

 Elapsed Time: 55.757 seconds (Warm-up)
               73.029 seconds (Sampling)
               128.786 seconds (Total)

and running with this PR and the stan branch I get

 Elapsed Time: 25.293 seconds (Warm-up)
               30.034 seconds (Sampling)
               55.327 seconds (Total)

So, pretty nice, more than a 2x speedup!!! And these results should get even sillier once we add parallel chains to cmdstan, since all the data will be shared across all of the threads without copies. Though there are still a few issues to tackle.

I'm not doing the initializer list for transformed data correctly. I totally forgot that transformed data's integers and sizes can depend on the data (lol), so since they weren't being set until the constructor body ran, I wasn't allocating the right amount of memory for transformed data in the initializer list. I think I need to ditch the initializer list because of the issue below.

Right now I'm just initializing the local allocator with the default size of 64KB (i.e. calling the default constructor model_arena_()). But our allocator is greedy: once it runs out of memory to give out, it mallocs twice as much memory as it needed last time. This is bad because if we have, say,

data {
  int huge_N;
  int huge_M;
  int small_N;
  matrix[huge_N, huge_M] X;
  vector[small_N] other_X;
}

Say huge_N and huge_M make some matrix that requires something like 3GB of memory, and after making X there's not enough space left in our currently reserved memory for other_X. When the stack allocator tries to get the memory for other_X and sees there is not enough, it's going to allocate a block of 6GB when we really only needed maybe a few KB! Not good! The only thing I can think of doing right now is adding a .conservative_alloc(size_t some_size) method to the stack allocator, where, if it needs to allocate new memory, instead of taking 2 * curr_block_size it allocates only as much as was requested. We could also cut down the greediness of the stack allocator to only grab, say, 1.2 * curr_block_size, or at least add a constructor for the stack allocator that lets us set the growth rate.
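A rough sketch of what those two options could look like as additions to the stack allocator; the member names and signatures here are hypothetical, not existing Stan Math API.

class stack_alloc {
 public:
  // Option A: when the current block is exhausted, grab only as much new
  // memory as was actually requested instead of 2 * curr_block_size.
  void* conservative_alloc(size_t requested_bytes);

  // Option B: make the growth rate configurable, e.g. 1.2 instead of 2.0,
  // so one large allocation doesn't double an already-huge block.
  explicit stack_alloc(size_t initial_nbytes = 1 << 16,
                       double growth_rate = 2.0);
};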

@bbbales2
Member

and running with this PR and the stan branch I get

Nice nice, this is cool!

and after making X there's not enough space left in our currently reserved memory

If I remember from the earlier issue, we're using special allocators here so that we can use type-logic to avoid copies in Math (to_arena knows not to copy these). If those special allocators just forward to malloc/free, then memory usage should be efficient. I don't think we need to stack-allocate these, since these allocations shouldn't be performance-limiting themselves.

I guess this also means the end of const std::vector<T>& in Math (everything will need to use requires to handle std::vectors with special allocators).
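If I'm reading the suggestion right, this is roughly the kind of allocator being described: a dedicated type that just forwards to malloc/free so Math's type traits (e.g. to_arena) can recognize these containers and skip the copy. This is only a sketch of the idea, not anything that exists in Stan Math today.

#include <cstddef>
#include <cstdlib>
#include <new>

// Minimal allocator that forwards to malloc/free; its only job is to give
// std::vector<T, model_data_allocator<T>> a distinct type the traits can
// detect, while the memory itself is handled by plain malloc.
template <typename T>
struct model_data_allocator {
  using value_type = T;

  model_data_allocator() = default;
  template <typename U>
  model_data_allocator(const model_data_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    if (auto* p = static_cast<T*>(std::malloc(n * sizeof(T)))) {
      return p;
    }
    throw std::bad_alloc();
  }
  void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U>
bool operator==(const model_data_allocator<T>&,
                const model_data_allocator<U>&) noexcept {
  return true;
}
template <typename T, typename U>
bool operator!=(const model_data_allocator<T>&,
                const model_data_allocator<U>&) noexcept {
  return false;
}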

I'd like to get threading finished up before formally reviewing and pulling this stuff in, though, since they are both changes to the memory model and I'm scared that if I'm not focused on each of them individually I'll miss stuff.

@SteveBronder
Contributor Author

SteveBronder commented Mar 29, 2021

If I remember from the earlier issue, we're using special allocators here so that we can use type-logic to avoid copies in Math (to_arena knows not to copy these). If those special allocators just forward to malloc/free, then memory usage should be efficient. I don't think we need to stack-allocate these, since these allocations shouldn't be performance-limiting themselves.

I'm not sure what you mean here; the allocator we are using is just a stan::math::stack_alloc.

I guess this also means the end of const std::vector<T>& in Math (everything will need to use requires to handle std::vectors with special allocators).

It actually should be fine. At least for now these can be made into std::vector<Eigen::Map<>>, and we won't need to do any extra templating. I can't think of any cases in the math library, off the top of my head, where we put a std::vector holding Eigen matrices directly onto the stack.
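As an illustration of the std::vector<Eigen::Map<>> idea (the function and pointer handling here are hypothetical): the matrices live in arena memory and the vector only holds lightweight views, so the container itself needs no special allocator.

#include <Eigen/Dense>
#include <vector>

using MapMatrixXd = Eigen::Map<Eigen::MatrixXd>;

// Wrap n_matrices blocks of arena memory in Eigen::Map views; each Map is
// just a pointer plus dimensions, so nothing is copied out of the arena.
std::vector<MapMatrixXd> wrap_arena_matrices(double* arena_memory,
                                             int n_matrices, int rows,
                                             int cols) {
  std::vector<MapMatrixXd> views;
  views.reserve(n_matrices);
  for (int i = 0; i < n_matrices; ++i) {
    views.emplace_back(arena_memory + i * rows * cols, rows, cols);
  }
  return views;
}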

I'd like to get threading finished up before formally reviewing and pulling this stuff in, though, since they are both changes to the memory model and I'm scared that if I'm not focused on each of them individually I'll miss stuff.

Yeah, I agree. I'm going to make a pass at that today and try to flesh out the cmdstan side more.

@SteveBronder
Contributor Author

I'm going to close this, as I figured out a nicer way to do this and want to open it up in a separate PR.
