
[WIP] Allocate Stan data on a local stack in the model #848

Closed

Conversation

SteveBronder
Contributor

Summary

This is a first hack at getting the data in a Stan model to sit on a local stack allocator. To do this we need to:

  1. Put a stan::math::stack_alloc in the private members of the model class.
  2. Use the constructor's initializer list to construct the data and set the size of the memory we want on the stack.
  • For the stack allocator, the size we want is the total size of the data and transformed data.
  • Each initializer calls a function
stan::io::initialize_data<{datas_type}>("data_name", context__, model_arena_, {data dimensions})

That constructs the data from the context directly into the memory allocated by model_arena_; a rough sketch of what the generated constructor could look like is given below.
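To make the scheme concrete, here is a minimal sketch of the shape the generated model class could take, assuming the initialize_data helper described above; the model name, data names, dimensions, and exact constructor arguments are illustrative, not taken from real generated code.

class my_model final : public model_base_crtp<my_model> {
 private:
  // (1) the local stack allocator lives inside the model
  stan::math::stack_alloc model_arena_;
  int N;
  int K;
  // data members are views into memory handed out by model_arena_
  Eigen::Map<Eigen::MatrixXd> X;
  Eigen::Map<Eigen::VectorXd> Y;

 public:
  my_model(stan::io::var_context& context__, unsigned int random_seed__)
      : model_arena_(/* total bytes needed for data + transformed data */),
        N(context__.vals_i("N")[0]),
        K(context__.vals_i("K")[0]),
        // (2) each data member is constructed straight into the arena
        X(stan::io::initialize_data<Eigen::MatrixXd>("X", context__,
                                                     model_arena_, N, K)),
        Y(stan::io::initialize_data<Eigen::VectorXd>("Y", context__,
                                                     model_arena_, N)) {
    // constraint checks and transformed data statements stay in the body
  }
};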

The code gen is not all there yet; I'm still fighting the fmt library a bit. There are two other big things this is missing, though.

  1. How to keep the transformed data and constraint checks. I think it would be fine to leave the constraint checks in the constructor body, and the manipulations of the transformed data should still happen in the constructor body as well. We could also change the signature of initialize_data to take an enum as a tag for which constraint to check, such as
enum class constrain_checks {none, l_bound, u_bound, lu_bound, ...}
// in the constructor
x(stan::io::initialize_data<std::vector<double>, constrain_checks::u_bound>("x", context__, model_arena_, 10))

I'm fine with whatever folks think would be easier or better; a rough sketch of what the tagged signature could look like is below.
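For reference, one possible shape for that tagged overload. The enum, the extra template parameter, and the trailing dimension pack are all hypothetical, not an existing API, and a real version would also need the bound values passed in.

enum class constrain_checks { none, l_bound, u_bound, lu_bound };

namespace stan {
namespace io {
// Sketch: construct the named data in the arena and run the constraint
// check selected by Check.
template <typename RetT, constrain_checks Check = constrain_checks::none,
          typename... Dims>
RetT initialize_data(const char* name, const stan::io::var_context& context,
                     stan::math::stack_alloc& arena, Dims... dims);
}  // namespace io
}  // namespace stan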

  2. How to keep the *_opencl__ assignments (or how to do them in the initializer list). I think we could do a second pass over the data and, if any are OpenCL-tagged, also add those to the initializer list. Though we will still need to keep OpenCL data derived from transformed data in the constructor body.

It will probably be best if (1) waits on @rybern's current refactor of the constraints, and (2) might need to wait for @rok-cesnovar's MIR stuff for OpenCL.

Release notes

Allows the Stan model to place data and transformed data on a local stack.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

@SteveBronder
Contributor Author

Shoot, I meant to mark this as a draft, but I don't see a button to change it to a draft.

bbbales2 marked this pull request as draft on March 10, 2021, 14:57
@bbbales2
Member

change it to a draft

It's hidden under the reviewers list, before the Assignees section, in a small font.

@SteveBronder
Contributor Author

Weird, it's not there for me?

@rok-cesnovar
Member

Ben converted it already.

Check any other non-draft PR:

[screenshot of the PR sidebar showing the convert-to-draft option]

@SteveBronder
Contributor Author

Thanks! My eyes must be going goofy.

@SteveBronder
Contributor Author

SteveBronder commented Mar 28, 2021

Alright, so I got this almost up and running to the point that I can use this branch of stan to run a toy model, with a few fixes to the cpp that I'll discuss below. I'm using the diamonds example from posteriordb, but modified to use normal() instead of normal_id_glm(): normal_id_glm() does not hold copies of the data, while normal() makes us call multiply(), so we can see the benefit of this scheme from not having to copy the data in multiply().

Y ~ normal(Intercept + multiply(Xc, b), sigma);

I also took the data Y and X and just replicated them 6 times, so instead of 5K rows it's 30K rows.

Running cmdstan with current develop I get

 Elapsed Time: 55.757 seconds (Warm-up)
               73.029 seconds (Sampling)
               128.786 seconds (Total)

and running with this PR and the stan branch I get

 Elapsed Time: 25.293 seconds (Warm-up)
               30.034 seconds (Sampling)
               55.327 seconds (Total)

So, pretty nice, more than a 2x speedup!!! And these results should get even sillier once we add parallel chains to cmdstan, since all the data will be shared across all of the threads without copies. Though there are still a few issues to tackle.

I'm not doing the initializer list for transformed data correctly. I totally forgot that transformed data's integers and sizes can depend on the data (lol), so since they weren't being set until the constructor body ran, I wasn't allocating the right amount of memory for transformed data in the initializer list. I think I need to ditch the initializer list because of the issue below.

Right now I'm just initializing the local allocator with the default size of 64KB (i.e. calling the default constructor model_arena_()). But our allocator is greedy: once it runs out of memory to give out, it mallocs twice as much memory as it needed last time. This is bad because if we have, say,

data {
  int huge_N;
  int huge_M;
  int small_N;
  matrix[huge_N, huge_M] X;
  vector[small_N] other_X;
}

Say huge_N and huge_M make some matrix that requires something like 3GB of memory, and after making X there's not enough space left in our currently reserved memory for other_X. When the stack allocator tries to get the memory for other_X and sees there is not enough, it's going to allocate a block of 6GB when we really only needed maybe a few KB! Not good! The only thing I can think of doing right now is adding a .conservative_alloc(size_t some_size) method to the stack allocator, where, if it needs to allocate new memory, instead of taking 2 * curr_block_size it allocates only as much as was requested. We could also cut down the greediness of the stack allocator to only grab, say, 1.2 * curr_block_size, or at least add a constructor for the stack allocator that lets us set the growth rate.
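A rough sketch of what those two options could look like as additions to the stack allocator; the member names and signatures here are hypothetical, not existing Stan Math API.

class stack_alloc {
 public:
  // Option A: when the current block is exhausted, grab only as much new
  // memory as was actually requested instead of 2 * curr_block_size.
  void* conservative_alloc(size_t requested_bytes);

  // Option B: make the growth rate configurable, e.g. 1.2 instead of 2.0,
  // so one large allocation doesn't double an already-huge block.
  explicit stack_alloc(size_t initial_nbytes = 1 << 16,
                       double growth_rate = 2.0);
};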

@bbbales2
Member

and running with this PR and the stan branch I get

Nice nice, this is cool!

and after making X there's not enough space left in our currently reserved memory

If I remember from the earlier issue, we're using special allocators here so that we can use type-logic to avoid copies in Math (to_arena knows not to copy these). If those special allocators just forward to malloc/free, then memory usage should be efficient. I don't think we need to stack-allocate these, since these allocations shouldn't be performance-limiting themselves.

I guess this also means the end of const std::vector<T>& in Math (everything will need to use requires to handle std::vectors with special allocators).
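If I'm reading the suggestion right, this is roughly the kind of allocator being described: a dedicated type that just forwards to malloc/free so Math's type traits (e.g. to_arena) can recognize these containers and skip the copy. This is only a sketch of the idea, not anything that exists in Stan Math today.

#include <cstddef>
#include <cstdlib>
#include <new>

// Minimal allocator that forwards to malloc/free; its only job is to give
// std::vector<T, model_data_allocator<T>> a distinct type the traits can
// detect, while the memory itself is handled by plain malloc.
template <typename T>
struct model_data_allocator {
  using value_type = T;

  model_data_allocator() = default;
  template <typename U>
  model_data_allocator(const model_data_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    if (auto* p = static_cast<T*>(std::malloc(n * sizeof(T)))) {
      return p;
    }
    throw std::bad_alloc();
  }
  void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U>
bool operator==(const model_data_allocator<T>&,
                const model_data_allocator<U>&) noexcept {
  return true;
}
template <typename T, typename U>
bool operator!=(const model_data_allocator<T>&,
                const model_data_allocator<U>&) noexcept {
  return false;
}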

I'd like to get threading finished up before formally reviewing and pulling this stuff in, though, since they are both changes to the memory model and I'm scared that if I'm not focused on each of them individually I'll miss stuff.

@SteveBronder
Contributor Author

SteveBronder commented Mar 29, 2021

If I remember from the earlier issue, we're using special allocators here so that we can use type-logic to avoid copies in Math (to_arena knows not to copy these). If those special allocators just forward to malloc/free, then memory usage should be efficient. I don't think we need to stack-allocate these, since these allocations shouldn't be performance-limiting themselves.

I'm not sure what you mean here; the allocator we are using is just a stan::math::stack_alloc.

I guess this also means the end of const std::vector<T>& in Math (everything will need to use requires to handle std::vectors with special allocators).

It actually should be fine. At least for now these can be made into std::vector<Eigen::Map<>>, and we won't need to do any extra templating. I can't think of any cases in the math library, off the top of my head, where we put a std::vector holding Eigen matrices directly onto the stack.
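As an illustration of the std::vector<Eigen::Map<>> idea (the function and pointer handling here are hypothetical): the matrices live in arena memory and the vector only holds lightweight views, so the container itself needs no special allocator.

#include <Eigen/Dense>
#include <vector>

using MapMatrixXd = Eigen::Map<Eigen::MatrixXd>;

// Wrap n_matrices blocks of arena memory in Eigen::Map views; each Map is
// just a pointer plus dimensions, so nothing is copied out of the arena.
std::vector<MapMatrixXd> wrap_arena_matrices(double* arena_memory,
                                             int n_matrices, int rows,
                                             int cols) {
  std::vector<MapMatrixXd> views;
  views.reserve(n_matrices);
  for (int i = 0; i < n_matrices; ++i) {
    views.emplace_back(arena_memory + i * rows * cols, rows, cols);
  }
  return views;
}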

I'd like to get threading finished up before formally reviewing and pulling this stuff in, though, since they are both changes to the memory model and I'm scared that if I'm not focused on each of them individually I'll miss stuff.

Yeah, I agree. I'm going to make a pass at that today and try to flesh out the cmdstan side more.

@SteveBronder
Contributor Author

I'm going to close this, as I figured out a nicer way to do this and want to open it up in a separate PR.
