-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
driver/mtd_spi_nor, pkg/littlefs: improve reliability with corrupted flash (new PR) #20589
base: master
Are you sure you want to change the base?
Conversation
881f219
to
7f550ac
Compare
@benpicco I think I found a suitable solution to get rid of the Kconfig configuration options and have a versatile option to enable manufacturer dependent security options. The flags are defined in the global header file https://github.com/RIOT-OS/RIOT/blob/master/drivers/include/mtd_spi_nor.h Therefore I would introduce four new structures: two for Macronix with security (1byte and 4byte access) and two for ISSI (1byte and 4byte access). I'll probably have the preliminary commit with the proposed changes ready in a couple of hours, that should make everything clearer. |
e07f123
to
c293317
Compare
The static-test codespell complains about these lines in drivers/mtd_spi_nor/mtd_spi_nor.c being a typo (it would like to have WELL instead of WEL):
I'm not sure how to fix that? Oh wel... 😅 |
the static-test checks if the tool that is running a test is installed - so you probably do not have codespell installed.
it there is a long form (write enable L... what is the L? eg WREL?) or alternative of WEL (STATUS_WEL ?) you could use that -and workaround the spell checker or as it suggest it could be added to the ignored words list. |
WEL stands for "Write Enable Latch". However I found out that codespell added an inline option to ignore certain words. That would avoid adding a global ignored word (and potentially missing typos of "well" in the future): https://github.com/codespell-project/codespell?tab=readme-ov-file#inline-ignore |
citing from https://github.com/codespell-project/codespell?tab=readme-ov-file#ignoring-words
i don't think you need to go the inline route ( |
Okay, this was a bit naive of me. Obviously the spellcheck only checks with the ignored word list that is currently in master. So either I have to ignore the failed spell test or open a separate PR with just the updated ignored word list 🤔 |
that PR also would get spellchecked i think -> there is a egg and chicken problem seems like (our)codespell does not follow their doc codespell-project/codespell#3272 (don't know which date our version is seems like our codespell is case-insensitive in either case |
i read it wrong by case sensitive they mean you need to type it the way they have it in their typo database (usually lowercase) (i don't know why they do it like this) codespell-project/codespell#2451 (comment) so for WEL to not trigger you need to add wel but then wel and wEl and Wel would also not trigger |
So this was quite a rabbit hole. No matter what I did I could not reproduce the issue... It turns out that Linux Mint had an old version of codespell. Now I installed it via pip and not form the package repository and now it complains about WEL as well, yay. BUT... the latest Release of codespell does not support the inline ignores yet. It's only implemented in the development version. A workaround is changing "WEL" to "WEL-Flag" and now the spellcheck passes. |
2113699
to
1e2953d
Compare
The last PR adds a new test for the mtd_spi_nor driver which tests the security features. Contrary to what I wrote in the first post, you can trigger the P_FAIL (Program Failed) and E_FAIL (Erase Failed) flags by protecting a block with the Block Protect flags in the status register. The test is not suuuper neat because I had to copy some internal functions from the mtd_spi_nor driver to have direct access to the register. This seemed the least intrusive way to create the tests. This is a log from the most relevant test for this PR, checking the security functions with the Block Protect bits.
These two messages are generated by the new code that was created by @vincent-d:
HOWEVER... I do not love the solution with the configuration (setting both Flag and Opcode) in drivers/mtd_spi_nor/mtd_spi_nor_configs.c. On one hand it's unnecessary duplication and on the other hand it is really easy to forget to set one of the two (ask me how I know...). And if you forget to set the correct opcodes, it results in false positive security behaviour. An undefined opcode is set to 0x00 (NOP) and that returns 0x00. This 0x00 is interpreted as "everything is fine" by the security code. My preferred option would be adding some Pseudomodules, similar to the AT24CXXX driver. However I'm not sure if it would make sense to do it the same way by allocating every namespace that begins with "M25F...", "M25R...", "IS25LP...", "IS25LE...", "SST25V..." etc. A better option would probably be something like "MTD_SPI_NOR_MX", "MTD_SPI_NOR_IS", "MTD_SPI_NOR_SST" for defining the opcode set and "MTD_SPI_NOR_SECURITY" and (in a future expansion) "MTD_SPI_NOR_ECC" for defining the features. Any other ideas about this are very welcome. |
8509a22
to
01d79f8
Compare
@benpicco You reviewed the original PR, could you take a quick look at this to give me some guidance on the previous question? You don't have to review this PR yet. |
cf373f9
to
c2ca283
Compare
I implemented the Pseudomodule solution and IMO it is way more elegant than the original idea and scales a lot better. So one thing to add to this function is trying to clear the flag manually perhaps after a certain number of tries. (As well as adding the timeout). Therefore it took a lot of testing and fiddling until the ISSI chip was running as expected and the solution with the Pseudomodules proved to be superior in this case because I was able to add the additional OpCodes conditionally. Unfortunately I did not separate the changes in separate commits because I did not intend them to become this involved... |
I think this PR is already quite extensive and beyond the originally intended scope, so I would like to postpone adding the timeout tasks for now to a separate PR. |
Using the CPU to delay a function with a down-counting loop is the most costly option in terms of power consumption and the least accurate option. It is good enough for a brief delay in a driver's init function to avoid having to depend on ztimer just for that - or when desperate (e.g. early in boot before a timer is available). IMO: When the |
b56d5f4
to
6e25daa
Compare
6e25daa
to
8f25a37
Compare
@maribu I absolutely agree with you. As I said, the delay was just a little help for me to catch the actual test output. However, considering that no other tests do it, it was better to delete it. @kfessel I think I got all of the comments in the incorrect styling and added some more information regarding the test procedure. Unrelated: I still don't love git 👀 |
If you really want to busy wait long you can always loop a busy_wait
also for many cpu counting down ( single_threaded / isr disabled ) isn't to bad in precision ( especially many avr-board that have no clock crystal but a cpu crystal and are very predictable counting) setting up a ztimer in up to millisecond is probably worse in precision on slow mcu |
@crasbe: do you want to tackle the missing timeouts (waiting on status change) (from your TODOs)? |
* USEMODULE += mtd_spi_nor | ||
* | ||
* For ISSI and Macronix devices, some security features are provided | ||
* to check if program or erase operations were successful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like something where we could also have a software fallback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, however reading back all the data that was written previously or doing a blank check after an erase operation would lead to a significant performance penalty.
That is something I would prefer to do in a second PR, separate from this.
if ((status & SPI_NOR_STATUS_WEL) == 0) { | ||
break; | ||
} | ||
thread_yield(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long does this typically take?
Maybe a proper sleep would be in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting for WEL usually only takes one cycle, because the majority of the waiting is done in the "wait_for_write_complete" function. The WIP and WEL flags are usually cleared together on a successful write.
Waiting for a write to complete depends a lot on the operation. For short writes it cycles only once, but a chip erase can take up to 400 seconds (6.5 minutes) for my big 1GBit Flash. But that waiting is not done in the wait_for_write_enable_cleared function.
drivers/mtd_spi_nor/mtd_spi_nor.c
Outdated
DEBUG("mtd_spi_nor_write: ERR: page program failed! scur = %02x\n", scur_reg); | ||
ret = -EIO; | ||
} | ||
#elif IS_USED(MODULE_MTD_SPI_NOR_ISSI_SECURITY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we know the vendor from the device ID, can we do this check at run-time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is absolutely possible, I decided against it to avoid having too much "dead" code. Considering how different Macronix and ISSI handle that stuff, ST, Microchip, Winbond, ... probably handle it different as well, so we would have many checks left in the final binary which are never needed in normal operation.
Considering so many other parameters of the Flash have to be known at compile time, it didn't make sense to me to check it on runtime.
HOWEVER now that I think about it, it is probably possible to add a .manufacturer variable to the mtd_spi_nor_params_t struct. That would get rid of the pseudomodules and the compiler can optimize the code at compile time, because the variable is already set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is absolutely possible, I decided against it to avoid having too much "dead" code.
looking at the code, it doesn't appear to be too much code, just one register read for Macronix and two for ISSI.
I don't mind being able to only enable what's needed, but it would be good being able to select multiple 'security' modules if there are board variants with different flash chips.
Considering so many other parameters of the Flash have to be known at compile time, it didn't make sense to me to check it on runtime.
Actually there aren't any critical ones - the timings that get set are just minimal values, the size is determined at run time. The idea is that you can just attach any flash chip as those often get swapped out between board revisions depending on availability.
@@ -622,19 +659,50 @@ static int mtd_spi_nor_write_page(mtd_dev_t *mtd, const void *src, uint32_t page | |||
/* write enable */ | |||
mtd_spi_cmd(dev, dev->params->opcode->wren); | |||
|
|||
/* Wait for WEL-Flag to be set */ | |||
wait_for_write_enable_set(dev); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this required for all chips / when integrity checking is not enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, on all Flash devices you have to wait for the WEL flag to be set before starting a Write/Erase operation. Otherwise the operation will be ignored.
https://www.macronix.com/Lists/Datasheet/Attachments/8868/MX25R6435F,%20Wide%20Range,%2064Mb,%20v1.6.pdf Page 34 has a block diagram. However now that I look at it, Macronix says to issue the WREN command again if the flag is not set. So the right thing to do would not be to wait for the flag to be set but trying to set it until it is set.
Maybe that would need a try-counter instead of a timeout. 🤔
return mtd_write_page_raw(mtd, buffer, page, off, size); | ||
int ret = mtd_write_page_raw(mtd, buffer, page, off, size); | ||
if (ret == -EIO) { | ||
ret = LFS_ERR_CORRUPT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since EIO
is already used by the driver for bus errors, better use something different for errors detected by the device. How about EFAULT
, EBADMSG
or EILSEQ
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This decision was made by the original author, I'll have to look up which return value is most suitable instead of EIO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the MAN page of write: https://man7.org/linux/man-pages/man2/write.2.html
EIO A low-level I/O error occurred while modifying the inode.
This error may relate to the write-back of data written by
an earlier write(), which may have been issued to a
different file descriptor on the same file.
EFAULT seems to be related to the input data and not to the write process:
EFAULT buf is outside your accessible address space.
For the other ones, only generic information is given in the errno documentation, but I don't think they fit better than EIO.
https://man7.org/linux/man-pages/man3/errno.3.html
EBADMSG Bad message (POSIX.1-2001)
EILSEQ Invalid or incomplete multibyte or wide character
(POSIX.1, C99).
The text shown here is the glibc error description; in
POSIX.1, this error is described as "Illegal byte
sequence".
However one thing I found that might fit if EIO is not good, would be EHWPOISON:
EHWPOISON
Memory page has hardware error.
This has the disadvantage though that "EHWPOISON" does not seem to be defined for the nRF52 and possibly for a lot of other platforms neither...
IMO the EIO error fits at this point.
I'm not sure what the best way to go forward is here. IMO the timeout features have to potential to be a big can of worms as well, so I would tend to do that in a separate PR as well. |
This took a bit longer than I anticipated to get back to it, but other things happened. I think we have an elegant solution now. The code introduces a new "SPI_NOR_F_CHECK_INTEGRITY" flag, that can be set in the mtd_spi_nor_params_t.flag field. The manufacturer is determined by reading the first byte of the JEDEC ID. As suggested, the naming has been changed to "integrity" and the function has been excluded into a dedicated helper function, which reduces duplication (especially with the added manufacturer check). When thinking about the best solution I noticed that the Pseudomodule approach would've made it impossible to use a Macronix and an ISSI device in one project. This rarely happens, but nevertheless it isn't nice to have such limitations imposed. Yet to do is the following:
|
I updated the initial message in this PR to reflect the new testing procedure, which (at this moment) requires patching the boards/nrf52840dk/include/board.h file to activate the data integrity checks. |
@mguetschow I think this needs some cleaning work (I know at least one |
Contribution description
This is a takeover of PR #14908, which has been abandoned. The goal is to implement the proposed changes in the original PR to make it mergeable.
The changes to make it mergeable after four years were necessary due to the following commits:
60eb75d
This commit removed the "mtd_write_page" function in favor of the "mtd_write_page_raw" function.
ea105d3
This commit removed the "mtd_spi_nor_write" function, which was changed in the original PR as well but these changes are not necessary anymore.
Open Tasks:
The review from @benpicco ( mtd_spi_nor | littlefs: improve reliability with corrupted flash #14908 (review) ) has to be included in this PR still.
My plan was to add a header file to the drivers/mtd_spi_nor folder which would probably be called
something like "drivers/mtd_spi_nor/include/mtd_spi_nor_security.h""drivers/mtd_spi_nor/include/mtd_spi_nor_defines.h".The security bits used by this PR are not universal to all manufacturers of Flash chips, specifically the Flash used by the original author @vincent-d is from Macronix and for example ISSI uses a different register and different bit positions.
I'll have to check if they are manufacturer specific or device specific.The Kconfig related changes
can probably bewere deleted, as the Kconfig subsystem has been removed from the mtd_spi_nor driver.There is no timeout for the added functions wait_for_write_enable_cleared and wait_for_write_enable_set, which would get the thread stuck if the chip does not answer.Addressed in [tracking] Improve failure tolerance in driver/mtd_spi_nor #20769The function wait_for_write_complete does not have a timeout either but it counts the attempts and how many times it yielded the thread. This can be used as a basis for a timeout.Addressed in [tracking] Improve failure tolerance in driver/mtd_spi_nor #20769Write tests for the Macronix and ISSI flash security features.
Add IS25LP128 and IS25LE01G as DuTs to the test (and possibly more, whatever the magic drawer might supply).
Testing procedure
The PR comes with new tests that utilize the Block Protect functions from NOR Flashs which triggers the Program Fail and Erase Fail flags. The Block Protect bits are not permanent and can be reset easily.
The nRF52840DK development board has a suitable flash chip on board and is covered by the test.
All tests should run successfully to verify the PR is working correctly.
Running the test on the nRF52840DK currently requires patching the boards/nrf52840dk/include/boards.h, the "SPI_NOR_F_CHECK_INTEGRITY" flag has to be added to the "NRF52840DK_NOR_FLAGS" define.
Otherwise the tests will run, but will not test the new data integrity features.
When enabling the ENABLE_DEBUG and ENABLE_TRACE flags in the drivers/mtd_spi_nor/mtd_spi_nor.c file, the debug output should show the following lines:
This indicates that the security related code path was triggered and if the tests still pass, the program and erase failures happened at the right location in the tests.
Issues/PRs references
This closes #14908.