
Compile time blowup when calling push_back repeatedly #4395

Closed
spalladino opened this issue Feb 16, 2024 · 35 comments · Fixed by #4643
Labels: aztec.nr, bug, optimization

@spalladino (Contributor)

Aim

Given a large enough array (about 8k elements), copying it into another array by iteratively calling push_back increases compile time to several minutes (over 5 minutes).

This happens when declaring a very large array as an argument to an Aztec contract, since arguments get passed into the Hasher, which copies each element into its own array using push_back.

https://github.com/AztecProtocol/aztec-packages/blob/f14474571210a46e7159cb9d2f0bc9374a837d3d/noir-projects/aztec-nr/aztec/src/hasher.nr#L22-L26

Expected Behavior

Compile time should not blow up.

Bug

The following vanilla Noir snippet reproduces the issue:

struct Hasher {
    fields: [Field],
}

impl Hasher {
    pub fn new() -> Self {
        Self { fields: [] }
    }

    pub fn add(&mut self, field: Field) {
        self.fields = self.fields.push_back(field);
    }

    pub fn add_multiple<N>(&mut self, fields: [Field; N]) {
        for i in 0..N {
            self.fields = self.fields.push_back(fields[i]);
        }
    }
}

fn main(expected: pub Field, first: Field, input: [Field;8192]) {
    let mut hasher = Hasher::new();
    hasher.add(first);
    hasher.add_multiple(input);
    assert(hasher.fields[0] == expected);
}

To Reproduce

No response

Project Impact

None

Impact Context

This makes it very annoying to add large arrays as arguments to Aztec contracts.

Workaround

None

Workaround Description

We can work around it by injecting the array through an oracle call. Still, if there is an equivalent operation to push_back that would yield better results, it'd be most welcome!
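(For context, a minimal sketch of what injecting an array through an oracle call can look like in vanilla Noir. The oracle name and shape here are illustrative, not the actual Aztec mechanism; the host side has to register a resolver for the oracle, and anything returned this way is unconstrained until it is checked in-circuit.)

#[oracle(get_large_input)]
unconstrained fn get_large_input_oracle() -> [Field; 8192] {}

// Oracles can only be called from unconstrained functions, so wrap it.
unconstrained fn get_large_input() -> [Field; 8192] {
    get_large_input_oracle()
}

fn main(expected: pub Field) {
    let input = get_large_input();
    // The oracle result is unconstrained; constrain whatever property
    // the circuit actually relies on.
    assert(input[0] == expected);
}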

Additional Context

No response

Installation Method

Binary (noirup default)

Nargo Version

0.23.0+5be9f9d7e2f39ca228df10e5a530474af0331704

NoirJS Version

No response

Would you like to submit a PR for this Issue?

None

Support Needs

No response

@spalladino spalladino added the bug Something isn't working label Feb 16, 2024
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Noir Feb 16, 2024
@Savio-Sou Savio-Sou added the aztec.nr Helpful for development of Aztec.nr the smart contract framework label Feb 16, 2024
PhilWindle added a commit to AztecProtocol/aztec-packages that referenced this issue Feb 27, 2024
Updates the `ContractDeployer` to use the new deployment flow. Instead
of creating txs flagged as `isDeployment`, the deployer now creates a
batch call that registers the contract class (if needed), publicly
deploys the contract via the canonical `InstanceDeployer`, and then
initializes it by calling the constructor. This batch call is then sent
via an account contract entrypoint and executed in a single tx. Check
out `e2e_deploy` _publicly deploys and initializes a contract_ to see it
in action.

However, since this requires sending txs through an account entrypoint,
it breaks for deploying accounts. So we still keep the old deployment
flow around for those scenarios, under the `LegacyDeployMethod` name.
We'll need to tweak account contract constructors so they can be
initialized and optionally deployed in the same single direct call, and
also pay fees for it. Alternatively, we can also add a canonical
`MultiCall` contract, which could take care of this, though fee payment
is unclear.

This PR also changes the `ClassRegisterer` API so that bytecode is
provided via a capsule instead of an arg, since having args of size over
~2k elements was causing a loop in the Hasher that massively increased
build times (see noir-lang/noir#4395).

**Breaking changes:** Since deployments are now done from an account,
deploying a contract via `Contract.deploy` now requires a `Wallet`
instance instead of a `PXE`, and deploying a contract via the CLI now
requires a `--private-key`.

---------

Co-authored-by: PhilWindle <[email protected]>
@jfecher (Contributor) commented Mar 19, 2024

We optimized the implementation of as_slice() recently for converting arrays to slices. Is that applicable here?

You could, for example, rewrite the example to use as_slice and push to the resulting slice for possibly better performance:

fn main(expected: pub Field, first: Field, input: [Field;8192]) {
    let mut hasher_slice = input.as_slice();
    hasher_slice = hasher_slice.push_front(first);
    assert(hasher_slice[0] == expected);
}

takes 0.644 seconds for me on nargo compiled in debug mode.

push_front can be hard to use if this code is auto-generated from a macro, since it works somewhat backwards. If that is the case, then for now I think you can replace:

pub fn add_multiple<N>(&mut self, fields: [Field; N]) {
    for i in 0..N {
        self.fields = self.fields.push_back(fields[i]);
    }
}

with the existing append in the stdlib:

pub fn add_multiple<N>(&mut self, fields: [Field; N]) {
    self.fields = self.fields.append(fields.as_slice());
}

This won't give you a speedup yet, but when we eventually optimize append you'll then see that speedup automatically too.
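(For concreteness, a sketch of the original reproduction with only add_multiple swapped for the append-based version; as noted above, this shouldn't be expected to be faster yet until append itself is optimized.)

struct Hasher {
    fields: [Field],
}

impl Hasher {
    pub fn new() -> Self {
        Self { fields: [] }
    }

    pub fn add(&mut self, field: Field) {
        self.fields = self.fields.push_back(field);
    }

    // One stdlib append instead of N separate push_back calls.
    pub fn add_multiple<N>(&mut self, fields: [Field; N]) {
        self.fields = self.fields.append(fields.as_slice());
    }
}

fn main(expected: pub Field, first: Field, input: [Field; 8192]) {
    let mut hasher = Hasher::new();
    hasher.add(first);
    hasher.add_multiple(input);
    assert(hasher.fields[0] == expected);
}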

@vezenovm (Contributor)

takes 0.644 seconds for me on nargo compiled in debug mode

Execution was ~0.9s for me as well using this update.

@zac-williamson

@jfecher is anyone working on this bug? I appreciate that a workaround exists for @spalladino, but other developers will also encounter this issue.

@jfecher (Contributor) commented Mar 20, 2024

@zac-williamson I spoke with @vezenovm about adding a builtin for append yesterday. It doesn't seem too difficult but nobody is working on adding it at this moment.

@spalladino (Contributor, Author)

@jfecher I've just tried the approach you suggested (apologies for the delay!) in this branch in aztec-packages, but I'm running into a "should have slice capacity set for value v776 being stored at v764" error. Any hints? Note that I have modified the compiler to allow skipping code autogeneration for the hasher, but that should've affected only the aztec macros.

@jfecher (Contributor) commented Mar 25, 2024

@spalladino what version of the compiler are you using? There was a bug related to .as_slice() calls, fixed a couple of days ago, that triggered that error message. You may want to try updating to the latest version of master.

@spalladino (Contributor, Author)

@jfecher assuming that fix was #4610, then yeah, it wasn't synced to the aztec-packages codebase yet. It landed just earlier today. I'll retry and report back!

@spalladino (Contributor, Author)

@jfecher no luck, same error as before (though the index has moved a little). I've pushed the latest to the same branch as before: AztecProtocol/aztec-packages#5410.

The application panicked (crashed).
Message:  ICE: should have slice capacity set for value v700 being stored at v688
Location: compiler/noirc_evaluator/src/ssa/opt/flatten_cfg/capacity_tracker.rs:134

@jfecher (Contributor) commented Mar 26, 2024

Another issue. On master, there's been a large regression. The compiler now panics for the original code example with:

The application panicked (crashed).
Message:  assertion `left == right` failed
  left: 2
 right: 1
Location: compiler/noirc_evaluator/src/ssa/opt/inlining.rs:420

I think the issue stems from the slice created by the new function not having an associated capacity returned as well:

Initial SSA:
acir fn main f0 {
  b0(v0: Field, v1: Field, v2: [Field; 8192]):
    inc_rc v2
    v4, v5 = call f1()
    ...
}
acir fn new f1 {
  b0():
    inc_rc []
    return []
}

In the SSA above, v4, v5 = call f1() expects two return values, but f1 only returns a single [] instead of Field 0, [].

Edit: This is due to a bug in the array/slice split. new returns an array even though the field should only accept a slice. Changing new to return &[] fixes this.

Edit 2: I think the array/slice split just exposed this bug rather than introduced it. It looks like the coercion code hasn't been changed in quite a while. Before the split, the array in this issue would just be inferred to be a slice instead of being coerced into one.
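(A sketch of that change applied to the Hasher from the issue body, assuming the post-split slice literal &[] is also accepted in the struct initializer; not taken verbatim from any PR.)

struct Hasher {
    fields: [Field],
}

impl Hasher {
    pub fn new() -> Self {
        // An explicit slice literal, so no empty-array-to-slice coercion
        // is needed and new returns the expected Field 0, [] pair in SSA.
        Self { fields: &[] }
    }
}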

@vezenovm (Contributor)

Edit: This is due to a bug in the array/slice split. new returns an array even though the field should only accept a slice. Changing new to return &[] fixes this.

This can probably be resolved with an improved type check. I am looking at resolving the initial bug, but I can take a look at this one as well.

@jfecher (Contributor) commented Mar 26, 2024

@vezenovm I fixed it and am putting up a PR for it now, after I make a new issue. It looks like the array-to-slice coercion has always been broken, which makes me question how it ever worked.

@vezenovm (Contributor)

It looks like the array-to-slice coercion has always been broken, which makes me question how it ever worked.

I don't think we ever allowed empty arrays before. So if you wrote [], it was always a slice, so you would have Field 0, [].

@jfecher (Contributor) commented Mar 26, 2024

We allowed empty arrays, but arrays and slices were polymorphic before, so in that case it'd just be inferred to be a slice instead of coerced into one.

@jfecher (Contributor) commented Mar 26, 2024

@spalladino may be worth testing again after #4639 is merged

@vezenovm (Contributor)

@spalladino may be worth testing again after #4639 is merged

He didn't use the Hasher struct in his PR. I'm looking into why his PR might be failing.

@jfecher (Contributor) commented Mar 26, 2024

@vezenovm I was assuming there may still be an as_slice coercion inserted somewhere by the aztec macro for hashing.

@vezenovm (Contributor)

@vezenovm I was assuming there may still be an as_slice coercion inserted somewhere by the aztec macro for hashing.

That PR doesn't use the macro which hashes the args. I have a fix.

github-merge-queue bot pushed a commit that referenced this issue Mar 26, 2024
# Description

## Problem\*

Resolves
#4395 (comment)

## Summary\*

We were not accurately accounting for situations where the slice
capacity tracker was expecting a capacity from slice intrinsic results.
I have added a case for handling the slice capacity of `AsSlice`.

I also tested this against the `aztec-packages` work and we can
successfully compile now.

@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Noir Mar 26, 2024
@spalladino (Contributor, Author)

Thanks for looking into this, and apologies as I got a bit lost in the discussion. Does this mean that the version of the code that uses slices and push_front (i.e. the one shared here) should compile fast, as long as #4643 is merged into aztec-packages?

@vezenovm (Contributor) commented Mar 26, 2024

Does this mean that the version of the code that uses slices and push_front (i.e. the one shared here) should compile fast, as long as #4643 is merged into aztec-packages?

Yes, it should compile fast now, but until we have a builtin append you will have to use push_front.

@spalladino (Contributor, Author)

@jfecher @vezenovm I can confirm compile time has gone down from 7 minutes to just 30 seconds using push_front here.

However, given a recent increase in max bytecode size, we now have to increase the ARGS_HASH_CHUNK_LENGTH and ARGS_HASH_CHUNK_COUNT used in the hashing loop here from 64 to 128 each:

https://github.com/AztecProtocol/aztec-packages/pull/5410/files#diff-047179fc6de7179088a0fad53d28c70b04f85c09a5dd1271bf341cf54006d9e2R36-R52

This means going from hashing 4k field elements to hashing 16k, and that causes compile times to spike again, going from 30s to 5 minutes. Is this expected, or is there something further we can do to improve those times?

@spalladino (Contributor, Author)

Commenting out the pedersen_hash inside the loop does not speed things up, so it's the copying that's taking up time.

https://github.com/AztecProtocol/aztec-packages/pull/5410/files#diff-047179fc6de7179088a0fad53d28c70b04f85c09a5dd1271bf341cf54006d9e2R48

@vezenovm (Contributor)

Commenting out the pedersen_hash inside the loop does not speed things up, so it's the copying that's taking up time.

@jfecher I'm suspicious this is a case of #4621, as the copying is done inside an if statement that checks the dynamic args.len().

@jfecher (Contributor) commented Mar 27, 2024

@vezenovm possibly. When investigating #4621 further for the hashmap use case, though, I found that it didn't actually apply there: the last_uses check already applied to every array set and made them all mutable, even with dynamic indices. I think I'll push a PR from a branch I made which turns the last_use analysis into an actual SSA pass, to make it clearer for debugging which instructions it applies to.

I'm also not sure whether adding reference counting to ACIR is plausible in general, or whether we'd need a static analysis like the last_use we have now.

If there were a minimal repro for the aztec-packages PR, we could test to see how many array sets are made mutable.

@vezenovm (Contributor)

This looks to just be an issue when flattening with such a large loop bound. I made the following vanilla Noir program:

use dep::std::hash::pedersen_hash;
global ARGS_HASH_CHUNK_COUNT = 64;
global ARGS_HASH_CHUNK_LENGTH = 64;
fn main(expected: pub Field, first: Field, input: [Field; 20]) {
    let mut hasher_slice = input.as_slice();
    hasher_slice = hasher_slice.push_front(first);
    assert(hasher_slice[0] == expected);

    let hash = hash_args_slice(hasher_slice);
    println(hash);
}

fn hash_args_slice(args: [Field]) -> Field {
    if args.len() == 0 {
        0
    } else {
        let mut chunks_hashes = [0; ARGS_HASH_CHUNK_COUNT];
        for i in 0..ARGS_HASH_CHUNK_COUNT {
            let mut chunk_hash = 0;
            let start_chunk_index = i * ARGS_HASH_CHUNK_LENGTH;
            if start_chunk_index < args.len() {
                let mut chunk_args = [0; ARGS_HASH_CHUNK_LENGTH];
                for j in 0..ARGS_HASH_CHUNK_LENGTH {
                    let item_index = i * ARGS_HASH_CHUNK_LENGTH + j;
                    if item_index < args.len() {
                        chunk_args[j] = args[item_index];
                    }
                }
                chunk_hash = pedersen_hash(chunk_args);
            }
            chunks_hashes[i] = chunk_hash;
        }
        pedersen_hash(chunks_hashes)
    }
}

Running `time nargo compile --force` on master gives me ~10s, no matter the size of input. I tried sizes of 20, 2000, and 15000. This makes sense, as we are bound by the loop, not the input, but I just wanted to try it to be sure.

After switching the loop bound constants to 128, the compile time jumped to ~2 minutes.

I also made an example without any if statements in the loop.

fn hash_args_slice_no_if(args: [Field]) -> Field {
    if args.len() == 0 {
        0
    } else {
        let mut chunks_hashes = [0; ARGS_HASH_CHUNK_COUNT];
        for i in 0..ARGS_HASH_CHUNK_COUNT {
            let mut chunk_hash = 0;
            let mut chunk_args = [0; ARGS_HASH_CHUNK_LENGTH];
            for j in 0..ARGS_HASH_CHUNK_LENGTH {
                let item_index = i * ARGS_HASH_CHUNK_LENGTH + j;
                chunk_args[j] = args[item_index];
            }
            chunk_hash = pedersen_hash(chunk_args);
            chunks_hashes[i] = chunk_hash;
        }
        pedersen_hash(chunks_hashes)
    }
}

I then made input equal to [Field; ARGS_HASH_CHUNK_COUNT * ARGS_HASH_CHUNK_LENGTH - NUM_PUSH_FRONTS]. This took ~1s to compile with the loop bounds at 64 and ~4s with the bounds at 128.

@vezenovm (Contributor)

It would be good to add some benchmarking on certain tests to see how much time we are spending on each compilation pass.

@jfecher (Contributor) commented Mar 27, 2024

It worries me if the reason for this slowdown is just a large nested loop, since that isn't really fixable. We can always try to optimize these passes further, but even if we got the entire compiler running in half the time, that'd just mean we'd compile just as slowly for a loop bound 2x higher.

@vezenovm (Contributor)

It worries me if the reason for this slowdown is just a large nested loop, since that isn't really fixable. We can always try to optimize these passes further, but even if we got the entire compiler running in half the time, that'd just mean we'd compile just as slowly for a loop bound 2x higher.

I did some basic timing using std::time::Instant and got this breakdown for loop bounds of 64 in the example above (with the if statements in the loop):

After Defunctionalization:: 0
After Removing Paired rc_inc & rc_decs:: 0
After Inlining:: 0
After Mem2Reg:: 0
After Simplifying:: 3
After Flattening:: 1658
After Removing Bit Shifts:: 16
After Mem2Reg:: 3957
After Constant Folding:: 906
After Constraint Folding:: 1240
After Dead Instruction Elimination:: 273
ACIR process: 2180
nargo compile --force  8.65s user 8.85s system 159% cpu 10.960 total

And then loop bounds of 128:

After Defunctionalization:: 0
After Removing Paired rc_inc & rc_decs:: 0
After Inlining:: 0
After Mem2Reg:: 0
After Simplifying:: 17
After Flattening:: 15480
After Removing Bit Shifts:: 140
After Mem2Reg:: 45459
After Constant Folding:: 12410
After Constraint Folding:: 11968
After Dead Instruction Elimination:: 2604
ACIR process: 26105
nargo compile --force  92.16s user 29.98s system 105% cpu 1:56.29 total

@jfecher (Contributor) commented Mar 27, 2024

I'm assuming you changed both loop bounds to 128. Can you change only one so that we're comparing 64 * 64 to 64 * 128? That way we can see whether they grow linearly or not.

@vezenovm (Contributor) commented Mar 27, 2024

Changing just ARGS_HASH_CHUNK_LENGTH to 128 (the inner loop bound):

After Defunctionalization:: 0
After Removing Paired rc_inc & rc_decs:: 0
After Inlining:: 0
After Mem2Reg:: 0
After Simplifying:: 7
After Flattening:: 7161
After Removing Bit Shifts:: 93
After Mem2Reg:: 21578
After Constant Folding:: 5410
After Constraint Folding:: 5063
After Dead Instruction Elimination:: 1172
ACIR process: 9440
nargo compile --force  41.29s user 15.25s system 110% cpu 51.085 total

Changing just ARGS_HASH_CHUNK_COUNT to 128 (the outer loop bound):

After Defunctionalization:: 0
After Removing Paired rc_inc & rc_decs:: 0
After Inlining:: 0
After Mem2Reg:: 0
After Simplifying:: 15
After Flattening:: 5662
After Removing Bit Shifts:: 60
After Mem2Reg:: 15511
After Constant Folding:: 2237
After Constraint Folding:: 2696
After Dead Instruction Elimination:: 596
ACIR process: 4155
nargo compile --force  27.18s user 6.29s system 103% cpu 32.465 total

@jfecher (Contributor) commented Mar 27, 2024

Looks like both are growing faster than linearly. mem2reg and flattening take up the most time, with mem2reg growing a bit faster. Generally the passes take longer as the SSA grows towards the end after loop unrolling and flattening.

Is loop unrolling included in another pass or just excluded from these timings?

@vezenovm (Contributor) commented Mar 27, 2024

Is loop unrolling included in another pass or just excluded from these timings?

Ah, good catch. I was just printing inside of run_pass, not try_run_pass. Let me just make a general PR for this.

@vezenovm (Contributor)

I did it quickly and loop unrolling is only ~200 ms in both cases above.

@spalladino (Contributor, Author)

I also made an example without any if statements in the loop. (...) I then made input equal to [Field; ARGS_HASH_CHUNK_COUNT * ARGS_HASH_CHUNK_LENGTH - NUM_PUSH_FRONTS]. This took ~1s to compile with the loop bounds at 64 and ~4s with the bounds at 128.

Hold on, so the main culprit is the if statements? Given that I know the size of the args array at compile time, do you think I can somehow work around this? If I changed the input from a slice to a [Field; N], would those conditionals be erased at compile time based on the value of N? However, if I did that, I'd be back at square one, since I'd need to copy each argument into the "args" array, which was the original problem that we fixed with slices...

@vezenovm (Contributor)

If I changed the input from a slice to a [Field; N], would those conditionals be erased at compile time based on the value of N?

Yeah, they should be. Since you do know the size of the args array at compile time, another option is that we could add the builtin append function we mentioned earlier, and then you would simply attach extra dummy data to args to fill MAX_ARGS_LENGTH. The copying would at least be simplified then. However, this is not super user-friendly, and like you said, we are essentially back to square one.
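(A hedged sketch of the fixed-size variant being discussed, i.e. the earlier hash_args_slice with the slice parameter replaced by [Field; N]; the function name and generic are illustrative, and the outer args.len() == 0 guard is dropped since it too becomes a compile-time check. Because args.len() is then the compile-time constant N, the bound checks should constant-fold after unrolling rather than turning into flattened branches.)

use dep::std::hash::pedersen_hash;

global ARGS_HASH_CHUNK_COUNT = 64;
global ARGS_HASH_CHUNK_LENGTH = 64;

fn hash_args_fixed<N>(args: [Field; N]) -> Field {
    let mut chunks_hashes = [0; ARGS_HASH_CHUNK_COUNT];
    for i in 0..ARGS_HASH_CHUNK_COUNT {
        let mut chunk_hash = 0;
        let start_chunk_index = i * ARGS_HASH_CHUNK_LENGTH;
        // args.len() is N, a compile-time constant, so after unrolling this
        // comparison is between two constants and the branch should fold away.
        if start_chunk_index < args.len() {
            let mut chunk_args = [0; ARGS_HASH_CHUNK_LENGTH];
            for j in 0..ARGS_HASH_CHUNK_LENGTH {
                let item_index = i * ARGS_HASH_CHUNK_LENGTH + j;
                if item_index < args.len() {
                    chunk_args[j] = args[item_index];
                }
            }
            chunk_hash = pedersen_hash(chunk_args);
        }
        chunks_hashes[i] = chunk_hash;
    }
    pedersen_hash(chunks_hashes)
}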

@spalladino (Contributor, Author)

However, this is not super user-friendly, and like you said, we are essentially back to square one.

Just to be clear, the issue is not the user-friendliness of the syntax, but the increase in the compilation time.

TomAFrench pushed a commit that referenced this issue Apr 3, 2024