
Consistent segfaults on 1.10.34 #26980

Closed
codemonkey6969 opened this issue Aug 8, 2022 · 19 comments

@codemonkey6969
Contributor

I replicated the issue 3 different times:

First time:

Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809076] blockstore_22[18150]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb3f2373f8 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809080] blockstore_11[18089]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb46c74768 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809090] blockstore_5[18055]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb4ae95318 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809096] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809106] in solana-validator[5617915bd000+165000]
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809107] in solana-validator[5617915bd000+165000]
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809118] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809127] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00

Second time:

Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188157] show_signal_msg: 20 callbacks suppressed
Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188159] sol-rpc-el[3075]: segfault at 55d85fbf7aa0 ip 000055d85fbf7aa0 sp 00007f20461e0768 error 15 in solana-validator[55d85fb64000+165000]
Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188166] Code: 00 00 40 f2 17 5e d8 55 00 00 90 c3 18 5e d8 55 00 00 80 85 18 5e d8 55 00 00 a0 db 18 5e d8 55 00 00 70 b0 18 5e d8 55 00 00 <08> a0 87 37 d8 7f 00 00 60 70 0d 5f d8 55 00 00 00 00 00 00 00 00

Third time:

Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.988592] rocksdb:low[49990]: segfault at 5602332c8aa0 ip 00005602332c8aa0 sp 00007f5d7d45d018 error 15 in solana-validator[560233235000+165000]
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.988599] Code: 00 00 40 02 85 31 02 56 00 00 90 d3 85 31 02 56 00 00 80 95 85 31 02 56 00 00 a0 eb 85 31 02 56 00 00 70 c0 85 31 02 56 00 00 <08> 50 87 09 5e 7f 00 00 60 80 7a 32 02 56 00 00 00 00 00 00 00 00
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.991253] solana-window-i[51047]: segfault at 5602332c8aa0 ip 00005602332c8aa0 sp 00007e9c50bf97d8 error 15 in solana-validator[560233235000+165000]
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.991259] Code: 00 00 40 02 85 31 02 56 00 00 90 d3 85 31 02 56 00 00 80 95 85 31 02 56 00 00 a0 eb 85 31 02 56 00 00 70 c0 85 31 02 56 00 00 <08> 50 87 09 5e 7f 00 00 60 80 7a 32 02 56 00 00 00 00 00 00 00 00

This happens when the node receives a massive amount of RPC traffic for an extended period of time. The system COMPLETELY halts, with no chance of recovering on its own; it requires a full power reset of the associated server chassis.
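For anyone trying to reproduce: something along these lines can generate sustained JSON-RPC load against the node. This is only a rough sketch; it assumes the default RPC port 8899 and that jq is installed, and it is not necessarily representative of the production traffic that triggers the halt:

```shell
# Rough sketch only: hammer getBlock for a range of recent slots in parallel.
SLOT=$(curl -s http://localhost:8899 -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r '.result')

for i in $(seq 1 500); do
  curl -s http://localhost:8899 -X POST -H "Content-Type: application/json" \
    -d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"getBlock\",\"params\":[$((SLOT - i))]}" \
    > /dev/null &
done
wait
```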

@steviez
Contributor

steviez commented Aug 8, 2022

@codemonkey6969 - I had asked in Discord but it might have been lost: do you have any core dumps available? We definitely appreciate you reporting this; however, the segfault log lines unfortunately don't give us much to go on. Being able to poke around a core dump might be more illuminating.

One interesting note is that I had previously been a little suspicious of blockstore/rocksdb because of #25941. The threads that are segfaulting in this report all hold a handle to the blockstore. Moreover, one of the segfaults under your "Third time" section is in a thread spun up by rocksdb.

@yhchiang-sol - FYI for visibility, and maybe you can comment if there might be anything of value in rocksdb logs to inspect.

@codemonkey6969
Contributor Author

codemonkey6969 commented Aug 8, 2022

Just enabled apport to capture the core dump. It is only a matter of time before it happens again and can provide you with a core dump.

Re: #25941 that is super interesting. Definitely could see that as a potential issue. I will let you know as soon as I have a core dump to share! Thanks Steviez!

@yhchiang-sol
Contributor

I will be traveling within 24 hours, so I might not have time to take a deeper look, but I will do my best!

> Re: #25941 that is super interesting. Definitely could see that as a potential issue. I will let you know as soon as I have a core dump to share! Thanks Steviez!

If I remember correctly, the cause of #25941 is that rocksdb requires all of its child threads to be joined before it starts its shutdown, while #25933 tries to fix the tvu and tpu hang but could shut down rocksdb before joining all child threads.

@yhchiang-sol
Contributor

> Just enabled apport to capture the core dump. It is only a matter of time before it happens again and can provide you with a core dump.

@codemonkey6969, do you happen to have the rocksdb LOG file(s) associated with your previous validator crashes? The rocksdb LOG file is under the rocksdb directory inside your ledger directory; its file name is simply LOG. It might have some useful information.
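Concretely, something like this should surface it (this assumes the ledger directory is at ~/ledger; adjust the path for your setup):

```shell
# Assumes the ledger directory is ~/ledger -- adjust for your setup.
$ ls -lh ~/ledger/rocksdb/LOG*
$ tail -n 200 ~/ledger/rocksdb/LOG
```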

@codemonkey6969
Contributor Author

> Just enabled apport to capture the core dump. It is only a matter of time before it happens again and can provide you with a core dump.
>
> @codemonkey6969, do you happen to have the rocksdb LOG file(s) associated with your previous validator crashes? The rocksdb LOG file is under the rocksdb directory inside your ledger directory; its file name is simply LOG. It might have some useful information.

Here is the rocksdb log file: https://snapshots.nodemonkey.io/snapshots/LOG.txt

Any suggestions on enabling a coredump? I enabled apport but nothing ended up dumping. Thanks.

@steviez
Contributor

steviez commented Aug 8, 2022

> Any suggestions on enabling a coredump? I enabled apport but nothing ended up dumping. Thanks.

I personally use systemd-coredump, but I see no reason why apport shouldn't work. It sounds like you already know apport is running, so the other quick thing to check is:

```shell
$ ulimit --help
...
      -c	the maximum size of core files created
...
```

If this value is 0, no core dumps will be created; I have it set to unlimited on my dev machine. Otherwise, this SO post has a good walkthrough of things to check/try:
https://stackoverflow.com/questions/48178535/no-core-dump-in-var-crash
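For reference, here's roughly what I check and set on my machines; the values and paths below are just examples, so adjust them for however you launch the validator:

```shell
# Check the core file size limit in the shell/session that launches solana-validator.
$ ulimit -c
0

# Allow unlimited core file sizes for that session.
$ ulimit -c unlimited

# If the validator runs under systemd, set this in the unit file instead:
#   [Service]
#   LimitCORE=infinity

# See where the kernel hands core dumps off (apport pipe, systemd-coredump, or a plain file pattern).
$ cat /proc/sys/kernel/core_pattern
```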

@codemonkey6969
Contributor Author

I cannot create a core dump; nothing is triggering one. I want to reiterate how impactful this issue is to my RPC service. Every time it receives high traffic, the system completely halts. The service doesn't crash; the system simply cannot operate, so I have to manually reboot it.

@yhchiang-sol
Contributor

> Here is the rocksdb log file: https://snapshots.nodemonkey.io/snapshots/LOG.txt

I checked the above rocksdb log file, but there isn't anything abnormal.

@codemonkey6969: can you confirm whether this is the LOG file associated with the crash? The log file starts at 2022/08/08-10:37:18.391988, while your previously reported crashes all happened on Aug 7th. If you see a crash again, can you upload the rocksdb LOG file before you restart the instance? Thank you!

@codemonkey6969
Contributor Author

> Here is the rocksdb log file: https://snapshots.nodemonkey.io/snapshots/LOG.txt
>
> I checked the above rocksdb log file, but there isn't anything abnormal.
>
> @codemonkey6969: can you confirm whether this is the LOG file associated with the crash? The log file starts at 2022/08/08-10:37:18.391988, while your previously reported crashes all happened on Aug 7th. If you see a crash again, can you upload the rocksdb LOG file before you restart the instance? Thank you!

This is a consistent issue; it's super easy to replicate. Unfortunately, I cannot provide those logs prior to restarting the instance, as I have to hard power cycle to even get the machine back online.

@yhchiang-sol
Contributor

> I cannot create a core dump; nothing is triggering one. I want to reiterate how impactful this issue is to my RPC service. Every time it receives high traffic, the system completely halts. The service doesn't crash; the system simply cannot operate, so I have to manually reboot it.

cc @lijunwangs for RPC service-related issues and how to enable additional debug logs.
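In the meantime, one way to get more verbose logs is to raise the log level via RUST_LOG when starting the validator; the module names and levels below are only examples:

```shell
# Example only: keep most modules at info, turn selected ones up to debug,
# and write the output to a file via the validator's --log flag.
$ RUST_LOG=info,solana_ledger=debug,solana_rpc=debug \
    solana-validator --log ~/validator-debug.log  # ...plus your usual flags
```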

Btw, there's a recently merged PR that handles JsonRpcService failures more gracefully and thus will prevent rocksdb from running in an unstable situation (such as when the validator has already panicked). Please feel free to give it a try!

#27075

@codemonkey6969
Contributor Author

> I cannot create a core dump; nothing is triggering one. I want to reiterate how impactful this issue is to my RPC service. Every time it receives high traffic, the system completely halts. The service doesn't crash; the system simply cannot operate, so I have to manually reboot it.
>
> cc @lijunwangs for RPC service-related issues and how to enable additional debug logs.
>
> Btw, there's a recently merged PR that handles JsonRpcService failures more gracefully and thus will prevent rocksdb from running in an unstable situation (such as when the validator has already panicked). Please feel free to give it a try!
>
> #27075

I cannot use this PR on the 1.10.34 branch; it references undefined values.

@lijunwangs
Contributor

Have you tried temporarily renaming the ledger store directory to see if the issue reproduces? What is the full command you use to start the validator?

@lijunwangs
Contributor

Also, can you paste the output of "dmesg"?

@codemonkey6969
Contributor Author

> Have you tried temporarily renaming the ledger store directory to see if the issue reproduces? What is the full command you use to start the validator?

This has been replicated across several different machines with different hardware. I have wiped out the ledger and started from scratch. I can provide the full startup script via DM on Discord; Tim also has it.

@codemonkey6969
Contributor Author

> Also, can you paste the output of "dmesg"?

Attached: output.txt

@steviez
Contributor

steviez commented Aug 13, 2022

> Also, can you paste the output of "dmesg"?
>
> Attached: output.txt

Ahh dang; you did give us what was asked for; unfortunately, this output is from after the reboot and thus doesn't show any useful information from the crash. I think you can use journalctl to see logs from previous boots (others feel free to correct me).
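Something along these lines should pull the kernel log from the boot where the crash happened (requires persistent journaling, and the boot index depends on how many reboots have happened since):

```shell
# List the boots the journal knows about.
$ journalctl --list-boots

# Kernel messages (the dmesg equivalent) from the previous boot.
$ journalctl -k -b -1

# Or just the segfault lines from that boot.
$ journalctl -k -b -1 | grep -i segfault
```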

@codemonkey6969
Contributor Author

> Also, can you paste the output of "dmesg"?
>
> Attached: output.txt
>
> Ahh dang; you did give us what was asked for; unfortunately, this output is from after the reboot and thus doesn't show any useful information from the crash. I think you can use journalctl to see logs from previous boots (others feel free to correct me).

Added in discord.

yhchiang-sol added a commit that referenced this issue Aug 18, 2022

Fix a corner-case panic in get_entries_in_data_block() (#27195)

#### Problem
get_entries_in_data_block() panics when there is an inconsistency between slot_meta and data_shred.

However, as we don't lock on reads, reading across multiple column families is not atomic (especially for older slots) and thus does not guarantee consistency, as the background cleanup service could purge the slot in the middle. Such a panic was reported in #26980 when the validator serves a high load of RPC calls.

#### Summary of Changes
This PR makes get_entries_in_data_block() panic only when the inconsistency between slot-meta and data-shred happens on a slot older than lowest_cleanup_slot.
mergify bot pushed a commit that referenced this issue Aug 18, 2022

Fix a corner-case panic in get_entries_in_data_block() (#27195)
(cherry picked from commit 6d12bb6)

mergify bot pushed a commit that referenced this issue Aug 18, 2022

Fix a corner-case panic in get_entries_in_data_block() (#27195)
(cherry picked from commit 6d12bb6)

mergify bot added a commit that referenced this issue Aug 18, 2022

Fix a corner-case panic in get_entries_in_data_block() (#27195) (#27231)
(cherry picked from commit 6d12bb6)
Co-authored-by: Yueh-Hsuan Chiang <[email protected]>

mergify bot added a commit that referenced this issue Aug 19, 2022

Fix a corner-case panic in get_entries_in_data_block() (#27195) (#27232)
(cherry picked from commit 6d12bb6)
Co-authored-by: Yueh-Hsuan Chiang <[email protected]>

HaoranYi pushed a commit to HaoranYi/solana that referenced this issue Aug 21, 2022

Fix a corner-case panic in get_entries_in_data_block() (solana-labs#27195)
HaoranYi added a commit that referenced this issue Aug 22, 2022

(a large merge commit in HaoranYi's fork that pulls in many unrelated changes, including "Fix a corner-case panic in get_entries_in_data_block() (#27195)")
@codemonkey6969
Contributor Author

This resolved the issue. CC: @lijunwangs @yhchiang-sol

@yhchiang-sol
Contributor

@codemonkey6969: Thanks for confirming the PR resolved the issue.
