Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[i2c] Missing ACQ queue entries in ChromeOS test #25965

Open
jesultra opened this issue Jan 21, 2025 · 9 comments
Open

[i2c] Missing ACQ queue entries in ChromeOS test #25965

jesultra opened this issue Jan 21, 2025 · 9 comments
Assignees
Labels

Comments

@jesultra
Copy link
Contributor

Reading TPM registers via I2C using the proprietary ChromeOS protocol is done by an I2C write of a single data byte (register address) followed by an I2C read of however many bytes the host wants to read from the specified register. Typically, the host on actual Chromebooks will issue an I2C stop condition between the two, (and furthermore, for historical reasons, wait for off-band signalling that data is ready).

In our testing, we want to ensure that the OT GSC can recover, even if the I2C host reads early, though. And during those tests, I see something odd.

We expect to pull the following sequence of entries from the ACQ queue: START(write), DATA, STOP/RESTART, START(read), ... , and that is what we in fact see, if the host does a STOP condition in between.

However, without the STOP condition on the bus, what we see is START(write), DATA and then nothing further, even as a logic analyzer shows that the restart was followed by an I2C address, which the OT chip is acknowledging, while stretching the clock, as software has not put any data into the TX FIFO.

Two things would be good here:

  1. When the I2C device is stretching the clock, waiting for software to put data in the TX FIFO, a timeout interrupt would be nice, in case the software state machine is broken, so that the controller could be reset, freeing up the bus.
  2. The RESTART and START(read) entries should be visible in the ACQ queue, even if the controller cannot release the clock of the final ACK bit on the address. Otherwise, our software has no chance of knowing that the write of one byte is complete, and was not merely the first of a multi-byte write (which would represent a TPM register write, rather than read.)
@jesultra jesultra self-assigned this Jan 21, 2025
@a-will
Copy link
Contributor

a-will commented Jan 21, 2025

When the I2C device is stretching the clock, waiting for software to put data in the TX FIFO, a timeout interrupt would be nice, in case the software state machine is broken, so that the controller could be reset, freeing up the bus.

Dealing with broken software is not the IP's problem. We generally do not add random timeout circuits to aid debugging at this level. Area is often precious, and there are other tools available to handle such a problem.

The RESTART and START(read) entries should be visible in the ACQ queue, even if the controller cannot release the clock of the final ACK bit on the address. Otherwise, our software has no chance of knowing that the write of one byte is complete, and was not merely the first of a multi-byte write (which would represent a TPM register write, rather than read.)

No. No data-containing symbols go into the ACQ FIFO without completing the ACK stage first. Unacknowledged bytes are available in ACQ_FIFO_NEXT_DATA.

You also have INTR_STATE.cmd_complete to determine that a stop or restart occurred, INTR_STATE.tx_stretch to determine that the stretching is due to lack of data in the TX FIFO, and TARGET_EVENTS.tx_pending for the start of any read transaction to hold things off until the TX FIFO is prepared (if CTRL.tx_stretch_ctrl_en is set).

It doesn't look to me like you are actually without the info. Am I missing something?

@jesultra
Copy link
Contributor Author

jesultra commented Jan 21, 2025

No data-containing symbols go into the ACQ FIFO without completing the ACK stage first. Unacknowledged bytes are available in ACQ_FIFO_NEXT_DATA.

I must be misunderstanding something. I am reading the above as if you say that (among other things) START(read) will not appear in the ACQ fifo until the controller has released the clock on the final ACK phase of the address byte, which it will do only once data is put into the TX FIFO.

Let's say I have configured my OpenTitan chip to respond to two different I2C addresses, and want to implement different functionality. My software needs to see which address the I2C host is attempting to read from, before it can know what data to put into the TX FIFO. Yet from what you say, that does not seem possible.

As far as I can tell ACQ_FIFO_NEXT_DATA does not seem to be meant for peeking at START/RESTART entries, as it is only 8-bit, and does not carry the additional signals.

You also have INTR_STATE.cmd_complete to determine that a stop or restart occurred, [...]

This almost works, but not quite. TPM register write operations are done by an I2C write of more than one byte. So e.g. writing to the status register followed by reading the same status register would be an I2C write of five bytes, then a write of one byte, and then a read of four bytes. With possible interrupt latency, our software handler may not run until the two writes have completed, and in that case, it will see the full five byte write transaction in the ACQ queue, followed by the one byte write transaction, but as described in the original post, it does not see STOP/RESTART after the sole data byte of the second write. And the fact that cmd_complete is set cannot be taken as meaning that the most recent write transaction is only one byte, as that would have been set and latched already when the first five-byte write completed.

Was the convention around RESTART changed between Z1 and A1? Our code seems to assume that RESTART does not carry any address, and will always be followed by a START(addr) entry, this does not match the A1 documentation, which states that RESTART carries an address. The latter makes sense, but may be how we lost the explicit signal in the ACQ queue that the current transfer has completed.

[You also have] INTR_STATE.tx_stretch to determine that the stretching is due to lack of data in the TX FIFO, [...]

I suppose that if my software sees a seemingly incomplete write transfer in the ACQ queue, while INTR_STATE.tx_stretch is flagged, then it can conclude that the write must have completed, and a read begun, and can act accordingly. (Since we do not juggle multiple I2C addresses, that works out.) It is still unfortunate if the multi-address functionality is crippled, and unfortunate that our state machine no longer can rely solely consuming available entries from the ACQ queue upon getting either "ACQ threshold" or "cmd_complete" interrupt.

@a-will
Copy link
Contributor

a-will commented Jan 21, 2025

No data-containing symbols go into the ACQ FIFO without completing the ACK stage first. Unacknowledged bytes are available in ACQ_FIFO_NEXT_DATA.

I must be misunderstanding something. I am reading the above as if you say that (among other things) START(read) will not appear in the ACQ fifo until the controller has released the clock on the final ACK phase of the address byte, which it will do only once data is put into the TX FIFO.

Let's say I have configured my OpenTitan chip to respond to two different I2C addresses, and want to implement different functionality. My software needs to see which address the I2C host is attempting to read from, before it can know what data to put into the TX FIFO. Yet from what you say, that does not seem possible.

As far as I can tell ACQ_FIFO_NEXT_DATA does not seem to be meant for peeking at START/RESTART entries, as it is only 8-bit, and does not carry the additional signals.

You're right; I forgot what I did here and should've looked at the code. ACQ_FIFO_NEXT_DATA is intended for choosing whether to ACK / NACK based on incoming data only (not the command byte).

So, instead, note that the only reason the target FSM would stretch on the (N)ACK phase of a command byte is because the ACQ FIFO doesn't have sufficient space. If it's "full" (INTR_STATE.ACQ_STRETCH or STATUS.ACQFULL), read the data out, and you'll get what you're looking for. As long as there is enough space for the command byte and a STOP, the FSM will always ACK its addresses.

I'd personally recommend setting CTRL.TX_STRETCH_CTRL_EN to 1 for the config, so the FSM doesn't immediately start sending whatever was in the TX FIFO at the time. This function causes the FSM to set TARGET_EVENTS.TX_PENDING and stretch the clock on the first bit of data, so you can choose whether to keep what is in the TX FIFO or replace it.

Was the convention around RESTART changed between Z1 and A1? Our code seems to assume that RESTART does not carry any address, and will always be followed by a START(addr) entry, this does not match the A1 documentation, which states that RESTART carries an address. The latter makes sense, but may be how we lost the explicit signal in the ACQ queue that the current transfer has completed.

Yes, the entries were changed substantially between Z1 and A1. The prior mechanism wasted space and had numerous bugs.

@jesultra
Copy link
Contributor Author

I believe I know what is going on, and how timing differences resulted in this issue not being detected on the CW310 FPGA used for Ti50 regression testing.

High latency in TockOS can result in a WRITE transfer being complete and the next one being part-way through, at the time when the cmd_complete interrupt is handled, and as a result the Ti50 software state machine uses the cmd_complete and acq_threshold interrupt only as a trigger to consumes and process all available entries from the ACQ FIFO, relying on STOP / RESTART in the sequence to know when a transfer is complete.

This worked fine on Z1, which always inserted either STOP or RESTART (without address) at the same time as cmd_complete was latched.

On A1, RESTART works differently, in that it is only inserted after receiving the next address byte, that is, after cmd_complete is latched. And this is the source of a race condition. When the host generated repeated start condition to complete a write transfer, and then goes on to send the address for a read transaction, if the read address happens to be inserted in ACQ FIFO before the Ti50 interrupt handler runs (which is more likely on the CW310, as it is running with reduced clock speed), then the Ti50 code would see the RESTART entry, and know that the write has completed. If on the other hand, the interrupt handler runs before the subsequent address byte is finished, then the interrupt handler would see a data byte as the last entry, and not be sure whether the transfer is complete. This is what I saw when reporting this issue.

I can work around this by making my code also process ACQ queue from the tx_stretch interrupt, which would ensure that it noticed RESTART with the read address. This works for us, even though it would be better if Ti50 could start preparing a response immediately on the repeated start condition, rather than waiting for the read address to be fully transferred. (The faster processing could still be achieved, if we can guarantee low latency processing of the cmd_complete interrupt, but so far, we have tried to stick to the TockOS way, and not rely on such guarantees. I understand that the previous RESTART semantics had issues, but it enabled us to achieve our desired behavior, without need for guaranteed interrupt latency)

@jesultra
Copy link
Contributor Author

I'd personally recommend setting CTRL.TX_STRETCH_CTRL_EN to 1 for the config, so the FSM doesn't immediately start sending whatever was in the TX FIFO at the time. This function causes the FSM to set TARGET_EVENTS.TX_PENDING and stretch the clock on the first bit of data, so you can choose whether to keep what is in the TX FIFO or replace it.

Actually, we like that we have the ability to pre-populate the TX FIFO as a one-byte write transfer completes (TPM register address), as that allows the I2C IP to immediately send response to the AP, rather than having to wait for computation on between read address and data on every TPM operation. Due to the clever design of the IP refusing to send data from the TX FIFO if the ACQ FIFO is nonempty, we are able to avoid sending "stale" data, in case the AP forgets that it has sent on TPM address, and sends a new one before immediately attempting to read the register contents.

@a-will
Copy link
Contributor

a-will commented Jan 22, 2025

I can work around this by making my code also process ACQ queue from the tx_stretch interrupt, which would ensure that it noticed RESTART with the read address. This works for us, even though it would be better if Ti50 could start preparing a response immediately on the repeated start condition, rather than waiting for the read address to be fully transferred. (The faster processing could still be achieved, if we can guarantee low latency processing of the cmd_complete interrupt, but so far, we have tried to stick to the TockOS way, and not rely on such guarantees. I understand that the previous RESTART semantics had issues, but it enabled us to achieve our desired behavior, without need for guaranteed interrupt latency)

Yeah, there were few options for dealing with wasted FIFO entries (area repurposing) and the synchronization problems that arose from having to indicate START conditions in two places, especially under the constraints at the time. This one eliminated the middle ground Ti50 enjoyed ("random" or average better performance). Instead, it asserts that you would either care about performance and ensure the cmd_complete interrupt is handled within ~10 us, or you don't care enough and can wait / do other things

Probably you could still achieve some of this improvement in software, though. If you process the ACQ FIFO and only one command was in it, then the cmd_complete interrupt was for the one command.

I'd personally recommend setting CTRL.TX_STRETCH_CTRL_EN to 1 for the config, so the FSM doesn't immediately start sending whatever was in the TX FIFO at the time. This function causes the FSM to set TARGET_EVENTS.TX_PENDING and stretch the clock on the first bit of data, so you can choose whether to keep what is in the TX FIFO or replace it.

Actually, we like that we have the ability to pre-populate the TX FIFO as a one-byte write transfer completes (TPM register address), as that allows the I2C IP to immediately send response to the AP, rather than having to wait for computation on between read address and data on every TPM operation. Due to the clever design of the IP refusing to send data from the TX FIFO if the ACQ FIFO is nonempty, we are able to avoid sending "stale" data, in case the AP forgets that it has sent on TPM address, and sends a new one before immediately attempting to read the register contents.

As long as you're only using one address (and/or one controller with reasonable firmware?), sounds good to me. 😄

@jesultra
Copy link
Contributor Author

Thanks for the explanation.

This is a bit unfortunate. The Ti50 team had hoped on being able to achieve good performance in the common case, without any guarantee on interrupt latency. (Of course if we regularly have large delays, it will affect performance, but we are guarding against the occasional flash page erasing, or things like that.) With our current proprietary protocol with the AP, it works out because we have the out-of-band ready signal, but if in the future we wanted to adhere to the standard for PM via I2C, then this issue could slightly impact performance.

I would like if we could consider adding one more bit to the STATUS register, stating whether the most recent transfer is complete, that is, it would be set at the same time as cmd_complete interrupt status is set, but then cleared at the same time as a new START entry is inserted into the ACQ FIFO.

Probably you could still achieve some of this improvement in software, though. If you process the ACQ FIFO and only one command was in it, then the cmd_complete interrupt was for the one command.

It would have to be "only one command was in it, and we have already processed cmd_complete from the previous command". You are right, there is a high probability that this would work well in almost all cases. The explicit status register bit would be nice to have still, though.

@a-will
Copy link
Contributor

a-will commented Jan 22, 2025

I would like if we could consider adding one more bit to the STATUS register, stating whether the most recent transfer is complete, that is, it would be set at the same time as cmd_complete interrupt status is set, but then cleared at the same time as a new START entry is inserted into the ACQ FIFO.

This idea sounds reasonable! I might just invert the logic a little, though: Something like "STATUS.pending_cmd" or similar. It would be set between the START and the (N)ACK of the command byte.

Hm... or we could just leave it set until the command completes, even (i.e. between START and the next START/STOP). Either way.

@jesultra
Copy link
Contributor Author

Hm... or we could just leave it set until the command completes, even (i.e. between START and the next START/STOP). Either way.

It would have to be this way, and not merely set during the command byte, in order for it to be useful to me. Assuming my code has just repeatedly retreived data bytes from the ACQ FIFO in a loop until ACQEMPTY became 1, I want it to be able to tell whether more bytes could later appear, still belonging to the same transfer, or whether a repeated START (or STOP) has been seen, and the transfer is therefore complete and can be processed. STOP would show up in the FIFO, but whether repeated START has been seen would be determined by the new pending_cmd, which should go back to zero only when the command completes, as you state. (If we do it this way, then I think that the STOP entry becomes superfluous in the ACQ FIFO, which could be worth exploring. I do not know how you handle the ACQ queue being full, by the time a STOP condition is seen on the bus, do you always reserve one entry in the queue for this? If we dropped the STOP entry, then one more byte of data could be accepted into the ACQ fifo, before clock stretching were necessary.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants