Skip to content
This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Make stack painting fast again! 🇪🇺 #302

Merged
merged 7 commits into from
Feb 25, 2022
Merged

Make stack painting fast again! 🇪🇺 #302

merged 7 commits into from
Feb 25, 2022

Conversation

Urhengulas
Copy link
Member

This PR implements the first one of improvements outlined in #258.

Fixes #258.

But what is "stack painting" anyways?

The idea is to write a specific byte pattern to (part of) the stack before the program is getting executed. After the program finished, either because it is done with its task, or because there was an error, we read out the previously painted area and check how much of it is still intact. If the pattern is still the same, we can be rather certain that the program didn't write to this part of the stack. This information helps to either know if there was a stack overflow, or just to measure how much of the stack was used.

So far both reading and writing of the memory was done via the probe. While this works it is also rather slow, because the host and probe communicate via USB which takes time.

The new approach is writing a subroutine to the MCU, which will paint the memory from within.

Mesurements

In following table you can see the measurement how much time the old and new approach take for memory from 8 to 256KiB.

data

The results are pretty impressive. The new approach is about 170 times faster!

Further work

  • A similar approach can also be applied to reading out the stack after the program finished.
  • Additionally the stack canary can be simplified quite a lot. So far we are not painting the whole stack, except the user asks for it, because this was slow. Because it is fast now we can always paint all of it, which simplifies the code and removes the need for the --measure-stack flag.

src/canary.rs Outdated
@@ -186,3 +185,88 @@ impl Canary {
}
}
}

/// Write [`CANARY_VALUE`] to the stack.
fn paint_stack(core: &mut Core, start: u32, mut end: u32) -> Result<(), probe_rs::Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a note here to observe that start should be the numerically lower address, and end should be the numerically higher address: that is start < end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also panic if that is not upheld. Otherwise the subtraction at line 197 will underflow and panic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does start also have to be aligned?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a note here to observe that start should be the numerically lower address, and end should be the numerically higher address: that is start < end.

Good point, I am asserting this in line 250 in fn subroutine, but it is probably better to already check it before.

Does start also have to be aligned?

I think so, but I was not sure how to go about it if it is not. Probably subtracting the modulo, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I am asserting this in line 250 in fn subroutine, but it is probably better to already check it before.

I'd apply the comment to whichever functions take start and end as arguments :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is, since the initial stack pointer points to the address directly above the stack, it itself doesn't belong to the stack anymore. Can anything bad happen when we are writing to this address? (I don't think so because static variables and so on will only be created later, but I still want to double check).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the stack is at the top of the RAM, the initial stack pointer value will be an invalid address.

From memory.x in cortex-m-rt:

_stack_start = ORIGIN(CCRAM) + LENGTH(CCRAM);

The problem, I think, is a range doing -1 when we should be doing -4, because we work in 4-byte units. Or perhaps we shouldn't hold start and end and instead hold start and length-in-32-bit-words.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so it turns out you can't see my comments unless I actually press "Comment" 🤣

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem, I think, is a range doing -1 when we should be doing -4, because we work in 4-byte units. Or perhaps we shouldn't hold start and end and instead hold start and length-in-32-bit-words.

So changing it to initial_stack_pointer - 4 should be fine?
I'd like to avoid bigger refactorings of other parts of the code as part of this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 67c1576.

src/canary.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@jonathanpallant jonathanpallant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some minor suggestions for improving comments (specifically around document pre-conditions like address alignment). But otherwise this is brilliant! Great work :) 👍

@Urhengulas
Copy link
Member Author

The error messages can probably also be improved, but I'd like to do this in a follow-up PR, when reworking the canary (see "Additionally the stack canary can be simplified quite a lot. [...]".

@Urhengulas Urhengulas removed the request for review from japaric February 22, 2022 15:23
@Urhengulas Urhengulas requested review from jonathanpallant and removed request for jonathanpallant February 23, 2022 13:37
@Urhengulas
Copy link
Member Author

bors r+

@Urhengulas
Copy link
Member Author

bors cancel

bors bot added a commit that referenced this pull request Feb 25, 2022
302: Make stack painting fast again! 🇪🇺  r=Urhengulas a=Urhengulas

This PR implements the first one of improvements outlined in #258.

Fixes #258.

## But what is "stack painting" anyways?

The idea is to write a specific byte pattern to (part of) the stack before the program is getting executed. After the program finished, either because it is done with its task, or because there was an error, we read out the previously painted area and check how much of it is still intact. If the pattern is still the same, we can be rather certain that the program didn't write to this part of the stack. This information helps to either know if there was a stack overflow, or just to measure how much of the stack was used.

So far both reading and writing of the memory was done via the probe. While this works it is also rather slow, because the host and probe communicate via USB which takes time.

The new approach is writing a subroutine to the MCU, which will paint the memory from within.

## Mesurements

In following table you can see the measurement how much time the old and new approach take for memory from 8 to 256KiB.

![data](https://user-images.githubusercontent.com/37087391/154973187-c17e66f7-cb22-4e56-8dff-a9798ab3a39a.png)

The results are pretty impressive. The new approach is about 170 times faster!

## Further work

- A similar approach can also be applied to reading out the stack after the program finished.
- Additionally the stack canary can be simplified quite a lot. So far we are not painting the whole stack, except the user asks for it, because this _was_ slow. Because it is fast now we can always paint all of it, which simplifies the code and removes the need for the `--measure-stack` flag.

Co-authored-by: Johann Hemmann <[email protected]>
@bors
Copy link
Contributor

bors bot commented Feb 25, 2022

Canceled.

@Urhengulas
Copy link
Member Author

bors r=jonathanpallant

@bors
Copy link
Contributor

bors bot commented Feb 25, 2022

Build succeeded:

@bors bors bot merged commit 12228a0 into main Feb 25, 2022
@bors bors bot deleted the stack-painting branch February 25, 2022 09:55
bors bot added a commit that referenced this pull request Aug 8, 2022
327: Optimize stack usage measuring r=jonathanpallant a=Urhengulas

This PR optimizes the stack usage measuring by not using the probe but a subroutine to search through the memory.

Fixes #258.

## Measurements

The speedup is similarly impressive as in #302:

| canary size | main (bcaf997) | this PR (3ad3f86) |
| :---: | :---: | :---: |
| 1024 B     | 0.007s | 0.014s |
| 261060 B | 1.912s | 0.028s |

It makes sense that the time actually gets worse for a small canary size of 1024 bytes, since we are doing a lot of setup work (flash subroutine, set registers, set and reset program counter etc.). But we see that this totally pays off, since for a rather big canary of 256KiB we are almost 70 times faster!

## Further work
This PR enables us to drastically simplify the canary logic, because since both painting and measuring are pretty fast now, we can always paint the full stack.

Co-authored-by: Urhengulas <[email protected]>
Co-authored-by: Johann Hemmann <[email protected]>
@Urhengulas Urhengulas mentioned this pull request Jun 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stack painting with --measure-stack is slow
3 participants