Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug when accessing string compiled into binary #247

Open
cmichi opened this issue Jan 7, 2025 · 45 comments
Open

Bug when accessing string compiled into binary #247

cmichi opened this issue Jan 7, 2025 · 45 comments

Comments

@cmichi
Copy link

cmichi commented Jan 7, 2025

I believe I found a bug in one of the polkavm components. I've put the reproducer here.

The reproducer program tries to output a string that is already present in its own blob (in the .data section, I think). But instead of printing the intended string ("Success" in the reproducer) we get an artifact from another string present in the blob. The length of the printed string is correct, the output is just a different part of another string from the blob.

cd examples/program-for-bug/
cargo +nightly build --release --no-default-features --target ../../crates/polkavm-linker/riscv64emac-unknown-none-polkavm.json -Zbuild-std="core,alloc" 

cd ../trigger-bug/
INPUT=../program-for-bug/target/riscv64emac-unknown-none-polkavm/release/program-for-bug cargo run

The output will be along the lines of

Calling into the guest program:
print: "heap: the nul"
print: "stack: "
print: "the nul"

This is wrong, it should be

Calling into the guest program:
print: "heap: Success"
print: "stack: Success"

The string the nul is present in the blob and part of a longer string. This can be checked via strings $INPUT. The length (7) corresponds to the length of "Success".

The bug also occurs if I use a different Rust version or don't use an allocator. If I don't include an allocator, the fragment that is output is just different.

Compiling program-for-bug without --release seems to mitigate the bug at least for the heap variant:

Calling into the guest program:
print: "heap: Success"
thread 'main' panicked at examples/trigger-bug/src/main.rs:48:10:
called `Result::unwrap()` on an `Err` value: Trap
@cmichi
Copy link
Author

cmichi commented Jan 13, 2025

Minimal reproducer:

#![no_std]
#![no_main]

extern crate core;

use polkavm_derive::polkavm_export;
use core::fmt::Write;

pub enum Foo {
    One,
    CalleeTrapped,
    Unknown,
}

impl ::core::fmt::Debug for Foo {
    #[inline]
    fn fmt(&self, f: &mut ::core::fmt::Formatter) -> ::core::fmt::Result {
        ::core::fmt::Formatter::write_str(
            f,
            match self {
                Foo::Success => "Success",
                Foo::CalleeTrapped => "CalleeTrapped",
                Foo::Unknown => "Unknown",
            },
        )
    }
}

struct Writer;
impl core::fmt::Write for Writer {
    fn write_str(&mut self, s: &str) -> core::fmt::Result {
        unsafe {
            crate::debug_message(s.as_ptr(), s.len() as u32);
        }
        Ok(())
    }
}

#[polkavm_derive::polkavm_import]
extern "C" {
    pub fn debug_message(str_ptr: *const u8, str_len: u32);
}

#[polkavm_export(abi = polkavm_derive::default_abi)]
pub fn deploy() {
    let mut m = Writer {};
    let _ = write!(&mut m, "{:?}", Foo::Success);
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    unsafe {
        core::arch::asm!("unimp");
        core::hint::unreachable_unchecked();
    }
}

compiled via

cargo +nightly build --release --no-default-features --target ../../crates/polkavm-linker/riscv64emac-unknown-none-polkavm.json -Zbuild-std="core,alloc"

And:

linker.define_typed(
    "debug_message",
    |caller: Caller<()>, buffer: u32, length: u32| {
        let buffer = caller.instance.read_memory(buffer, length).unwrap();
        eprintln!("print: {:?}", String::from_utf8(buffer).unwrap());
    },
).unwrap();

Would appreciate someone taking a look.

@athei
Copy link
Member

athei commented Jan 13, 2025

Have you checked that the value of the pointer you pass in the guest and u32 you receive in the host are the same? IMHO it looks like it should work, though.

@cmichi
Copy link
Author

cmichi commented Jan 13, 2025

Have you checked that the value of the pointer you pass in the guest and u32 you receive in the host are the same? IMHO it looks like it should work, though.

Yeah I had checked, that works fine. The more extensive bug report above explains that passing strings back and forth works, but accessing certain string literals from the binary is an issue.

@athei
Copy link
Member

athei commented Jan 14, 2025

@jarkkojs Can you try to reproduce and debug what is going on here?

@jarkkojs
Copy link
Collaborator

@jarkkojs Can you try to reproduce and debug what is going on here?

Yeah, I can check this out tomorrow. I just noticed this.

@athei
Copy link
Member

athei commented Jan 15, 2025

At first I thought the issue might be that the import is defined as pointer (64bit) while the host function definition is u32. However, it should not be a problem since all pointers will fit into u32. That said, I would start by checking the actual numerical value in the u32 on the host side. If the pointer area makes sense or of something was truncated in the wrong way.

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 16, 2025

$ git remote add use-ink https://github.com/use-ink/polkavm

~/work/github.com/paritytech/polkavm polkavm-test-data ⇣⇡
$ git fetch use-ink
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 12 (delta 2), reused 2 (delta 2), pack-reused 6 (from 1)
Unpacking objects: 100% (12/12), 3.81 KiB | 3.81 MiB/s, done.
From https://github.com/use-ink/polkavm
 * [new branch]      cmichi-reproduce-bug-jan-25 -> use-ink/cmichi-reproduce-bug-jan-25
 * [new branch]      master                      -> use-ink/master

~/work/github.com/paritytech/polkavm polkavm-test-data ⇣⇡
$ git checkout cmichi-reproduce-bug-jan-25 
branch 'cmichi-reproduce-bug-jan-25' set up to track 'use-ink/cmichi-reproduce-bug-jan-25' by rebasing.
Switched to a new branch 'cmichi-reproduce-bug-jan-25'

~/work/github.com/paritytech/polkavm cmichi-reproduce-bug-jan-25
$ cd examples/program-for-bug 

~/work/github.com/paritytech/polkavm/examples/program-for-bug cmichi-reproduce-bug-jan-25
$ cargo +nightly build --release --no-default-features --target ../../crates/polkavm-linker/riscv64emac-unknown-none-polkavm.json -Zbuild-std="core,alloc"

    Updating crates.io index
     Locking 13 packages to latest compatible versions
      Adding polkavm-derive v0.18.0 (available: v0.19.0)
  Downloaded lock_api v0.4.12
  Downloaded unicode-ident v1.0.14
  Downloaded polkavm-derive-impl-macro v0.18.0
  Downloaded polkavm-common v0.18.0
  Downloaded autocfg v1.4.0
  Downloaded proc-macro2 v1.0.93
  Downloaded spin v0.9.8
  Downloaded scopeguard v1.2.0
  Downloaded polkavm-derive-impl v0.18.1
  Downloaded quote v1.0.38
  Downloaded syn v2.0.96
  Downloaded polkavm-derive v0.18.0
  Downloaded talc v4.4.2
  Downloaded 13 crates (774.8 KB) in 1.34s
   Compiling compiler_builtins v0.1.140
   Compiling core v0.0.0 (/home/jarkko/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core)
   Compiling proc-macro2 v1.0.93
   Compiling unicode-ident v1.0.14
   Compiling autocfg v1.4.0
   Compiling polkavm-common v0.18.0
   Compiling lock_api v0.4.12
   Compiling quote v1.0.38
   Compiling syn v2.0.96
   Compiling polkavm-derive-impl v0.18.1
   Compiling polkavm-derive-impl-macro v0.18.0
   Compiling rustc-std-workspace-core v1.99.0 (/home/jarkko/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/rustc-std-workspace-core)
   Compiling alloc v0.0.0 (/home/jarkko/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc)
   Compiling scopeguard v1.2.0
   Compiling polkavm-derive v0.18.0
   Compiling spin v0.9.8
   Compiling talc v4.4.2
   Compiling program-for-bug v0.1.0 (/home/jarkko/work/github.com/paritytech/polkavm/examples/program-for-bug)
    Finished `release` profile [optimized] target(s) in 7.07s

~/work/github.com/paritytech/polkavm/examples/program-for-bug cmichi-reproduce-bug-jan-25* 7s
$ INPUT=../program-for-bug/target/riscv64emac-unknown-none-polkavm/release/program-for-bug cargo run 

error: failed to parse lock file at: /home/jarkko/work/github.com/paritytech/polkavm/examples/program-for-bug/Cargo.lock

Caused by:
  lock file version `4` was found, but this version of Cargo does not understand this lock file, perhaps Cargo needs to be updated?

@cmichi What am I missing here?

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 16, 2025

$ git remote add use-ink https://github.com/use-ink/polkavm

~/work/github.com/paritytech/polkavm polkavm-test-data ⇣⇡
$ git fetch use-ink
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 12 (delta 2), reused 2 (delta 2), pack-reused 6 (from 1)
Unpacking objects: 100% (12/12), 3.81 KiB | 3.81 MiB/s, done.
From https://github.com/use-ink/polkavm
 * [new branch]      cmichi-reproduce-bug-jan-25 -> use-ink/cmichi-reproduce-bug-jan-25
 * [new branch]      master                      -> use-ink/master

~/work/github.com/paritytech/polkavm polkavm-test-data ⇣⇡
$ git checkout cmichi-reproduce-bug-jan-25 
branch 'cmichi-reproduce-bug-jan-25' set up to track 'use-ink/cmichi-reproduce-bug-jan-25' by rebasing.
Switched to a new branch 'cmichi-reproduce-bug-jan-25'

~/work/github.com/paritytech/polkavm cmichi-reproduce-bug-jan-25
$ cd examples/program-for-bug 

~/work/github.com/paritytech/polkavm/examples/program-for-bug cmichi-reproduce-bug-jan-25
$ cargo +nightly build --release --no-default-features --target ../../crates/polkavm-linker/riscv64emac-unknown-none-polkavm.json -Zbuild-std="core,alloc"

    Updating crates.io index
     Locking 13 packages to latest compatible versions
      Adding polkavm-derive v0.18.0 (available: v0.19.0)
  Downloaded lock_api v0.4.12
  Downloaded unicode-ident v1.0.14
  Downloaded polkavm-derive-impl-macro v0.18.0
  Downloaded polkavm-common v0.18.0
  Downloaded autocfg v1.4.0
  Downloaded proc-macro2 v1.0.93
  Downloaded spin v0.9.8
  Downloaded scopeguard v1.2.0
  Downloaded polkavm-derive-impl v0.18.1
  Downloaded quote v1.0.38
  Downloaded syn v2.0.96
  Downloaded polkavm-derive v0.18.0
  Downloaded talc v4.4.2
  Downloaded 13 crates (774.8 KB) in 1.34s
   Compiling compiler_builtins v0.1.140
   Compiling core v0.0.0 (/home/jarkko/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core)
   Compiling proc-macro2 v1.0.93
   Compiling unicode-ident v1.0.14
   Compiling autocfg v1.4.0
   Compiling polkavm-common v0.18.0
   Compiling lock_api v0.4.12
   Compiling quote v1.0.38
   Compiling syn v2.0.96
   Compiling polkavm-derive-impl v0.18.1
   Compiling polkavm-derive-impl-macro v0.18.0
   Compiling rustc-std-workspace-core v1.99.0 (/home/jarkko/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/rustc-std-workspace-core)
   Compiling alloc v0.0.0 (/home/jarkko/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc)
   Compiling scopeguard v1.2.0
   Compiling polkavm-derive v0.18.0
   Compiling spin v0.9.8
   Compiling talc v4.4.2
   Compiling program-for-bug v0.1.0 (/home/jarkko/work/github.com/paritytech/polkavm/examples/program-for-bug)
    Finished `release` profile [optimized] target(s) in 7.07s

~/work/github.com/paritytech/polkavm/examples/program-for-bug cmichi-reproduce-bug-jan-25* 7s
$ INPUT=../program-for-bug/target/riscv64emac-unknown-none-polkavm/release/program-for-bug cargo run 

error: failed to parse lock file at: /home/jarkko/work/github.com/paritytech/polkavm/examples/program-for-bug/Cargo.lock

Caused by:
  lock file version `4` was found, but this version of Cargo does not understand this lock file, perhaps Cargo needs to be updated?

@cmichi What am I missing here?

False flag, I forgot to change the directory:

    Finished dev [unoptimized + debuginfo] target(s) in 3.95s
     Running `/home/jarkko/work/github.com/paritytech/polkavm/target/debug/trigger-bug`
Calling into the guest program:
print: "heap: the nul"
print: "stack: "
print: "the nul"

I.e. can reproduce with the 1:1 outcome on my side.

With set_optimize(false)

Calling into the guest program:
print: "heap: !\0\0\0\0\0\0"
print: "stack: "
print: "!\0\0\0\0\0\0"

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 17, 2025

Better to debug this with set_optimize(false) simply because it does less. I.e. what we can conclude at least is that optimizer is not triggering the bug, and thus using it will obfuscate it further. I.e. optimizer will process broken data to begin with...

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 17, 2025

@athei Just a sanity check: does the result which is not mangled by the optimizer ring any bells?

I created https://github.com/paritytech/polkavm/tree/bug-247 for debugging purposes. It is forked from @cmichi's commit with following differences:

  1. set_optimizer(false): We don't want extra spice on obfuscation.
  2. Remove env_logger: Strip of anything extra.

jarkkojs pushed a commit that referenced this issue Jan 17, 2025
@jarkkojs
Copy link
Collaborator

There could be two separate bugs (possibly), where the second is unmasked by the first. I.e. possibility of incorrect handling of strings which are just "\0" in the optimizer. I don't know details of that code. Just adding a remark.

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 17, 2025

@cmichi Just a remark. It is crucial to use exactly set_optimizer(false) and not set_optimizer(true) in your example as:

  1. Given that the bug triggers when it is off, we get invalid output, the optimizer will get already corrupted data as input.
  2. Thus, turning it on will further obfuscate the root cause.

Just wanted to make this remark to denote that even tho with both settings a bug is triggered, the way you set this knob in the reproduce does actually matter, that's all :-)

So we should find the needle exactly from this output (and not factor the difficulty):

Calling into the guest program:
print: "heap: !\0\0\0\0\0\0"
print: "stack: "
print: "!\0\0\0\0\0\0"

I'm setting up a branch for better compilation workflow...

jarkkojs pushed a commit that referenced this issue Jan 17, 2025
@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 17, 2025

@athei, @cmichi @koute https://asciinema.org/a/1OvMyRg8hO0NuRmeaYPr4Ixag

Check this out. I did not try to proactively fix the bug. I just integrated it better to the existing infrastructure and used the stuff in my in-progress PR [1]. Now it produces the expected output. This was unintentional from my side.

The first thing that comes to mind (not yet verified) that maybe it is just a version mismatch. I mean the reproducer enforced polkavm-derive 0.18, which could lead to problems given the opcode number changes.

You can checkout my branch yourself: https://github.com/paritytech/polkavm/tree/bug-247

[1] #240

@jarkkojs
Copy link
Collaborator

Image

The full GIF screencast for convenience.

@jarkkojs
Copy link
Collaborator

OK so this is not about polkavm-derive version. Even with 0.18.0 this goes now right. I think we should now focus on spotting the difference, which maps to how to make it fail.

Image

@cmichi
Copy link
Author

cmichi commented Jan 17, 2025

@jarkkojs I've updated my branch with the most minimal reproducer from my second comment here. The branch now also uses a relative path to the crates in the repo. I still get a malformed output:

Calling into the guest program:
print: "\0\0\0\0\0\0\0"

My personal guess is that something goes wrong during relocation.

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 17, 2025

@jarkkojs I've updated my branch with the most minimal reproducer from my second comment here. The branch now also uses a relative path to the crates in the repo. I still get a malformed output:

Calling into the guest program:
print: "\0\0\0\0\0\0\0"

My personal guess is that something goes wrong during relocation.

Yes, it is obvious that there is a glitch still but the above experiments narrow it down. We know now e.g. that if polkavm-derive is causing this bug, then it must be in both 0.18 and 0.19 version.

To summarize:

  1. We have a version that does not work.
  2. We have a version that works.
  3. The differences are somewhat cosmetic.

It's progress at least :-)

EDIT: I'll now try mix and match the right sauce to bring back the bug.

@athei
Copy link
Member

athei commented Jan 17, 2025

We have a version that does not work.
We have a version that works.

Which version works and which won't?

@jarkkojs
Copy link
Collaborator

We have a version that does not work.
We have a version that works.

Which version works and which won't?

The version in my bug-247 branch does not works (after cargo b --release run target/release/bug-247). This was unintentional.

So now I take @cmichi's original example and step by step move back to its implementation until I find the crack, and that way we can precisely determine the root cause.

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 17, 2025

In the original bug reproducer by @cmichi, this modification excludes the possibility of DWARF processing bug:

diff --git a/examples/trigger-bug/src/main.rs b/examples/trigger-bug/src/main.rs
index c2b7f50..e69f74b 100644
--- a/examples/trigger-bug/src/main.rs
+++ b/examples/trigger-bug/src/main.rs
@@ -10,6 +10,8 @@ fn main() {
     // link
     let mut config = polkavm_linker::Config::default();
     config.set_optimize(false);
+    /* Disable DWARF processing: */
+    config.set_strip(true);
 
     let orig = fs::read(input_path).expect("Failed to read {input_path:?}");
     let linked = polkavm_linker::program_from_elf(config, orig.as_ref())

6863067

I.e. the result is the same:

Calling into the guest program:
print: "\0\0\0\0\0\0\0"

jarkkojs pushed a commit that referenced this issue Jan 17, 2025
jarkkojs pushed a commit that referenced this issue Jan 17, 2025
jarkkojs pushed a commit that referenced this issue Jan 17, 2025
jarkkojs pushed a commit that referenced this issue Jan 17, 2025
jarkkojs pushed a commit that referenced this issue Jan 17, 2025
@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 17, 2025

Fix:

#!/usr/bin/env bash

TMP_DIR=$(mktemp -d -t polkavm-XXXXXX)
git clone https://github.com/use-ink/polkavm "$TMP_DIR"
cd "$TMP_DIR"
git switch cmichi-reproduce-bug-jan-25

cat << EOF > fix.patch
diff --git a/examples/program-for-bug/Cargo.toml b/examples/program-for-bug/Cargo.toml
index ffcb2f9..769c8fa 100755
--- a/examples/program-for-bug/Cargo.toml
+++ b/examples/program-for-bug/Cargo.toml
@@ -4,6 +4,9 @@ version = "0.1.0"
 authors = ["[your_name] <[your_email]>"]
 edition = "2021"

+[profile.release]
+lto = "fat"
+
 [dependencies]
 polkavm-derive = { path = "../../crates/polkavm-derive", default-features = false }
EOF

patch -p1 < fix.patch
cd examples/program-for-bug/
cargo +nightly build --release --no-default-features --target ../../crates/polkavm-linker/riscv64emac-unknown-none-polkavm.json -Zbuild-std="core,alloc"
cd ../trigger-bug/
INPUT=../program-for-bug/target/riscv64emac-unknown-none-polkavm/release/program-for-bug cargo run

Run:

Image

@jarkkojs
Copy link
Collaborator

@cmichi, @athei, @koute: In the above script, if I comment patch-command, i.e.

patch -p1 < fix.patch

Then the output becomes:

Calling into the guest program:
print: "\0\0\0\0\0\0\0"

If I have lto = "fat", then

Calling into the guest program:
print: "Success"

So it looks like polkavm-linker fails unless link-time optimizations are performed across the crates?

@jarkkojs
Copy link
Collaborator

@cmichi At least there is now a workaround, I'll wait feedback from @koute because he knows best the reason for having lto = "fat" in guest-programs/Cargo.toml, before taking any further action on this issue :-) Hope this helps for the moment.

@koute
Copy link
Collaborator

koute commented Jan 19, 2025

@cmichi At least there is now a workaround, I'll wait feedback from @koute because he knows best the reason for having lto = "fat" in guest-programs/Cargo.toml, before taking any further action on this issue :-) Hope this helps for the moment.

The behavior should always be the same with and without LTO. Optimizations should never change the program's behavior, and if they do then that's a bug. We need to know why this happens and properly fix it/prevent it from happening.

@jarkkojs
Copy link
Collaborator

@cmichi At least there is now a workaround, I'll wait feedback from @koute because he knows best the reason for having lto = "fat" in guest-programs/Cargo.toml, before taking any further action on this issue :-) Hope this helps for the moment.

The behavior should always be the same with and without LTO. Optimizations should never change the program's behavior, and if they do then that's a bug. We need to know why this happens and properly fix it/prevent it from happening.

Any ideas what look for, given that the symptom is triggered by disabling LTO?

@athei
Copy link
Member

athei commented Jan 20, 2025

I would start by diffing the two binaries (lot, non-lto) and look for something interesting. Cross module (crate) inlining is usually the biggest difference you will observe. In this case core or alloc functions being inlined into the program.

You probably want to minimize the reproducer even more to reduce the amount of code to sift through. Especially the fmt infrastructure adds a lot of code.

You can also reduce the amount of generated code by adding -Z build-std-features=panic_immediate_abort when building the guest program. This will prevent core to pull in the fmt infra in order to format panic messages.

@koute
Copy link
Collaborator

koute commented Jan 20, 2025

No need to diff anything here; the issue should be easily debuggable just by looking at the disassembly.


Anyway, to speed this up I took a look at this; @jarkkojs here's why/how it happens, I trust you can take it from here.

Looking at the disassembly of the buggy function (riscv64-elf-objdump -zrD program-for-bug/target/riscv64emac-unknown-none-polkavm/release/program-for-bug):

Disassembly of section .text._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E:

0000000000000000 <_ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E>:
   0:   00054503                lbu     a0,0(a0)
   4:   00351613                slli    a2,a0,0x3

0000000000000008 <.Lpcrel_hi1>:
   8:   00000697                auipc   a3,0x0
                        8: R_RISCV_PCREL_HI20   .Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E
   c:   00068693                mv      a3,a3
                        c: R_RISCV_PCREL_LO12_I .Lpcrel_hi1
  10:   050a                    slli    a0,a0,0x2

0000000000000012 <.Lpcrel_hi2>:
  12:   00000717                auipc   a4,0x0
                        12: R_RISCV_PCREL_HI20  .Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel
  16:   00070713                mv      a4,a4
                        16: R_RISCV_PCREL_LO12_I        .Lpcrel_hi2
  1a:   953a                    add     a0,a0,a4
  1c:   4108                    lw      a0,0(a0)
  1e:   9636                    add     a2,a2,a3
  20:   6210                    ld      a2,0(a2)
  22:   00a706b3                add     a3,a4,a0
  26:   852e                    mv      a0,a1
  28:   85b6                    mv      a1,a3
  2a:   00000317                auipc   t1,0x0
                        2a: R_RISCV_CALL_PLT    _ZN4core3fmt9Formatter9write_str17h5ca77ecf4017bb68E
  2e:   00030067                jr      t1 # 2a <.Lpcrel_hi2+0x18>

Which is roughly this in pseudo code:

    write_str(formatter, TABLE_2 + u32 *TABLE_2[*enum * 4], u64 *TABLE_1[*enum * 8]))

Where TABLE_1 is .Lswitch.table.ZN57$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E
and TABLE_2 is .Lswitch.table.ZN57$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel

So we have two lookup tables: one has the pointer, the other has the length. Let's verify their contents.

Disassembly of section .rodata..Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E:

0000000000000000 <.Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E>:
   0:   00000007                .insn   4, 0x0007
   4:   0000                    unimp
   6:   0000                    unimp
   8:   000d                    c.nop   3
   a:   0000                    unimp
   c:   0000                    unimp
   e:   0000                    unimp
  10:   00000007                .insn   4, 0x0007
  14:   0000                    unimp
  16:   0000                    unimp

Disassembly of section .rodata..Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel:

0000000000000000 <.Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel>:
   0:   0000                    unimp
                        0: R_RISCV_ADD32        .Lanon.cb76ee1d57f0804fce1a80e99f7c73f1.1
                        0: R_RISCV_SUB32        .Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel
   2:   0000                    unimp
   4:   0000                    unimp
                        4: R_RISCV_ADD32        .Lanon.cb76ee1d57f0804fce1a80e99f7c73f1.2
                        4: R_RISCV_SUB32        .Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel
   6:   0000                    unimp
   8:   0000                    unimp
                        8: R_RISCV_ADD32        .Lanon.cb76ee1d57f0804fce1a80e99f7c73f1.3
                        8: R_RISCV_SUB32        .Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel
   a:   0000                    unimp

So TABLE_1 indeed contains correct lengths: [0x7, 0xd, 0x7]
("Success" has 7, "CalleeTrapped" has 13/0xD and "Unknown" string has 7)

Looking at the program output it prints out 7 zeros, so the lookup from this length table is most likely correct. So what's probably broken is the pointer fetch. So looking at the disassembly dump we have three relocations in TABLE_2 (.Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel), and the first one's gonna be the one for the "Success" string:

0: R_RISCV_ADD32        .Lanon.cb76ee1d57f0804fce1a80e99f7c73f1.1
0: R_RISCV_SUB32        .Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel

So how does this work? Let's look at the code again:

TABLE_2 + u32 *TABLE_2[*enum * 4]

So this takes the address of the lookup table itself and adds to this table offset that it fetches from the table. And what do the relocations do? The first R_RISCV_ADD32 writes the address of the "Success" string into the table, and the R_RISCV_SUB32 then subtracts from that the address of the table, effectively leaving only the relative offset (from the beginning of the table) to the "Success" string (so adding the address of the table and this offset will give us the full address to the string, exactly like the code's doing).

So we can look at the linker logs to see where each section is in memory, here's the "Success" string:

[TRACE polkavm_linker::program_from_elf] Found section: '.rodata..Lanon.cb76ee1d57f0804fce1a80e99f7c73f1.1', original range = 0x0..0x7 (relocated to: 0x10030..0x10037), size = 0x7/0x7

and here's the switch table which should contain a pointer to it:

[TRACE polkavm_linker::program_from_elf] Found section: '.rodata..Lswitch.table._ZN57_$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E.1.rel', original range = 0x0..0xc (relocated to: 0x10090..0x1009c), size = 0xc/0xc

We can print out what's in those sections at the end of linking:

diff --git a/crates/polkavm-linker/src/program_from_elf.rs b/crates/polkavm-linker/src/program_from_elf.rs
index eb574f1..7640a02 100644
--- a/crates/polkavm-linker/src/program_from_elf.rs
+++ b/crates/polkavm-linker/src/program_from_elf.rs
@@ -9578,6 +9580,9 @@ where
     log::debug!("Built a program of {} bytes", raw_blob.len());
     let blob = ProgramBlob::parse(raw_blob[..].into())?;
 
+    println!("success string: {:?}", String::from_utf8_lossy(&blob.ro_data()[0x30..0x37]));
+    println!("table: {:?}", &blob.ro_data()[0x90..0x9c]);
+
     // Sanity check that our debug info was properly emitted and can be parsed.
     if cfg!(debug_assertions) && !config.strip {
         'outer: for (nth_instruction, locations) in locations_for_instruction.iter().enumerate() {

success string: "Success"
table: [96, 0, 0, 0, 89, 0, 0, 0, 76, 0, 0, 0]

So the string itself is correctly emitted. The offset table however contains 96 (0x60), which will make the code read address 0x10090 + 0x60 = 0x100f0. And what do we see in the logs during execution?

[TRACE polkavm::sandbox::linux] Reading memory: 0x100f0-0x100f7 (7 bytes)
print: "\0\0\0\0\0\0\0"

Yep, so it's indeed trying to read from that address.

So the offset inside of the table is incorrect; what we need is 0x10090 + $OFFSET = 0x10030, so with some basic algebra we can figure out that the table should contain.... -0x60 instead. Okay, so the offset is correct-ish, but it has the wrong sign.

@jarkkojs
Copy link
Collaborator

@koute Yeah, that will definitely help thanks! I'll look it in detail tomorrow.

@jarkkojs
Copy link
Collaborator

@koute Do we want also a CI test for this bug or only the fix? (non-opionated question)

@athei
Copy link
Member

athei commented Jan 25, 2025

Every bug fix should come with a regression test. So yes.

@jarkkojs
Copy link
Collaborator

Every bug fix should come with a regression test. So yes.

OK fair enough and this was my thinking too. I just save the blob for now and we can switch to dynamically generated later on (once polkavm-test-data compiles in Windows).

@jarkkojs
Copy link
Collaborator

@athei, @koute, Sanity checking. Doing integration to tests.rs but I see some random tests failing:

test tests::tracing_linux_dynamic_paging_basic ... FAILED
test tests::tracing_linux_access_memory_from_host ... ok
test tests::tracing_linux_dynamic_paging_change_program_counter_during_segfault ... FAILED
test tests::tracing_linux_dynamic_paging_cancel_segfault_by_changing_address ... FAILED
test tests::tracing_linux_dynamic_paging_freeing_pages ... FAILED
test tests::tracing_linux_dynamic_paging_initialize_multiple_pages ... FAILED
test tests::tracing_linux_dynamic_paging_preinitialize_pages ... FAILED
test tests::tracing_linux_dynamic_paging_reading_does_not_resolve_segfaults ... FAILED
test tests::tracing_linux_dynamic_paging_change_written_value_and_address_during_segfault ... FAILED
test tests::tracing_linux_dynamic_paging_read_at_page_boundary ... FAILED
test tests::tracing_linux_dynamic_paging_worker_recycle_during_segfault ... FAILED
test tests::tracing_linux_dynamic_paging_write_at_page_boundary_with_first_page ... FAILED
test tests::tracing_linux_dynamic_paging_worker_recycle_turn_dynamic_paging_on_and_off ... FAILED
test tests::tracing_linux_dynamic_paging_write_at_page_boundary_with_no_pages ... FAILED
test tests::tracing_linux_dynamic_paging_write_at_page_boundary_with_second_page ... FAILED
test tests::tracing_linux_basic_gas_metering_async ... ok
test tests::tracing_linux_memset_with_dynamic_paging ... FAILED
test tests::tracing_linux_module_cache ... ok
test tests::tracing_linux_dynamic_paging_stress_test ... FAILED

I tried to master branch too and they seem to fail also there. Can I safely ignore?

Also, please suggest name for "program-for-bug" (low priority, can be done also part of PR review later on).

@koute
Copy link
Collaborator

koute commented Jan 25, 2025

Doing integration to tests.rs but I see some random tests failing

They're not random, and if you'll read the error message it should tell you why it's failing and how to fix it. :P

@jarkkojs
Copy link
Collaborator

Doing integration to tests.rs but I see some random tests failing

They're not random, and if you'll read the error message it should tell you why it's failing and how to fix it. :P

I mapped your comment to action:

#259

Now I ignore them :-p

@jarkkojs
Copy link
Collaborator

Doing integration to tests.rs but I see some random tests failing

They're not random, and if you'll read the error message it should tell you why it's failing and how to fix it. :P

I mapped your comment to action:

#259

Now I ignore them :-p

LOL, ok :-) That was a quick fix.

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 25, 2025

@koute,

Let's assume for the sake of example I'll add debug_message() to TestInstance::new():

    linker.define_typed(
        "debug_message",
        |caller: Caller<()>, buffer: u32, length: u32| {
            let buffer = caller.instance.read_memory(buffer, length).unwrap();
            eprintln!("print: {:?}", String::from_utf8(buffer).unwrap());
        },
    ).unwrap();

And let's further assume we had a test:

fn test_program_for_bug(config: Config, optimize: bool, is_64_bit: bool) {
    let blob = if is_64_bit { PROGRAM_FOR_BUG_64_ELF_ZST } else { PROGRAM_FOR_BUG_32_ELF_ZST };
    let mut i = TestInstance::new(blob, &config, optimize, is_64_bit);
    i.call::<(), ()>("deploy", ()).unwrap();
}

So probably instead of printing I'd want to store the String somewhere and compare post executing the guest call. If I add a String field to TestInstance, can I acccess that somehow from the closure, or how should I wire this?

@jarkkojs
Copy link
Collaborator

So this works now:

$ cargo t -p polkavm tests::compiler_linux_optimized_64_test_program_for_bug -- --nocapture
warning: /home/jarkko/work/github.com/paritytech/polkavm/Cargo.toml: unused manifest key: workspace.lints.rust.unexpected_cfgs.check-cfg
    Finished test [unoptimized + debuginfo] target(s) in 0.02s
     Running unittests src/lib.rs (target/debug/deps/polkavm-b7e6b492dbae036e)

running 1 test
print: "\0\0\0\0\0\0\0"
test tests::compiler_linux_optimized_64_test_program_for_bug ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1543 filtered out; finished in 0.20s

     Running tests/integration_tests.rs (target/debug/deps/integration_tests-c78775a2d72333bb)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 2 filtered out; finished in 0.00s

And has all the variants:

test tests::compiler_linux_optimized_32_test_program_for_bug ... ok
test tests::compiler_linux_optimized_64_test_program_for_bug ... ok
test tests::compiler_linux_unoptimized_32_test_program_for_bug ... ok
test tests::compiler_linux_unoptimized_64_test_program_for_bug ... ok
test tests::interpreter_optimized_32_test_program_for_bug ... ok
test tests::interpreter_optimized_64_test_program_for_bug ... ok
test tests::interpreter_unoptimized_32_test_program_for_bug ... ok
test tests::interpreter_unoptimized_64_test_program_for_bug ... ok
test tests::tracing_linux_optimized_32_test_program_for_bug ... ok
test tests::tracing_linux_optimized_64_test_program_for_bug ... ok
test tests::tracing_linux_unoptimized_32_test_program_for_bug ... ok
test tests::tracing_linux_unoptimized_64_test_program_for_bug ... ok

But obviously it is all pass, given that I don't check the string.

@jarkkojs
Copy link
Collaborator

@koute,

Let's assume for the sake of example I'll add debug_message() to TestInstance::new():

linker.define_typed(
    "debug_message",
    |caller: Caller<()>, buffer: u32, length: u32| {
        let buffer = caller.instance.read_memory(buffer, length).unwrap();
        eprintln!("print: {:?}", String::from_utf8(buffer).unwrap());
    },
).unwrap();

And let's further assume we had a test:

fn test_program_for_bug(config: Config, optimize: bool, is_64_bit: bool) {
let blob = if is_64_bit { PROGRAM_FOR_BUG_64_ELF_ZST } else { PROGRAM_FOR_BUG_32_ELF_ZST };
let mut i = TestInstance::new(blob, &config, optimize, is_64_bit);
i.call::<(), ()>("deploy", ()).unwrap();
}
So probably instead of printing I'd want to store the String somewhere and compare post executing the guest call. If I add a String field to TestInstance, can I acccess that somehow from the closure, or how should I wire this?

So I guess one option is to replace eprintln() with (did not check compilation):

let s = String::from_utf8(buffer).unwrap();
assert_eq!(&s, "Success");

@jarkkojs
Copy link
Collaborator

@koute,
Let's assume for the sake of example I'll add debug_message() to TestInstance::new():

linker.define_typed(
    "debug_message",
    |caller: Caller<()>, buffer: u32, length: u32| {
        let buffer = caller.instance.read_memory(buffer, length).unwrap();
        eprintln!("print: {:?}", String::from_utf8(buffer).unwrap());
    },
).unwrap();

And let's further assume we had a test:
fn test_program_for_bug(config: Config, optimize: bool, is_64_bit: bool) {
let blob = if is_64_bit { PROGRAM_FOR_BUG_64_ELF_ZST } else { PROGRAM_FOR_BUG_32_ELF_ZST };
let mut i = TestInstance::new(blob, &config, optimize, is_64_bit);
i.call::<(), ()>("deploy", ()).unwrap();
}
So probably instead of printing I'd want to store the String somewhere and compare post executing the guest call. If I add a String field to TestInstance, can I acccess that somehow from the closure, or how should I wire this?

So I guess one option is to replace eprintln() with (did not check compilation):

let s = String::from_utf8(buffer).unwrap();
assert_eq!(&s, "Success");

It did the job:

$ cargo t -p polkavm tests::compiler_linux_optimized_64_test_program_for_bug -- --nocapture
warning: /home/jarkko/work/github.com/paritytech/polkavm/Cargo.toml: unused manifest key: workspace.lints.rust.unexpected_cfgs.check-cfg
   Compiling polkavm v0.19.0 (/home/jarkko/work/github.com/paritytech/polkavm/crates/polkavm)
    Finished test [unoptimized + debuginfo] target(s) in 1.73s
     Running unittests src/lib.rs (target/debug/deps/polkavm-b7e6b492dbae036e)

running 1 test
thread 'tests::compiler_linux_optimized_64_test_program_for_bug' panicked at crates/polkavm/src/tests.rs:2076:17:
assertion `left == right` failed
  left: "\0\0\0\0\0\0\0"
 right: "Success"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test tests::compiler_linux_optimized_64_test_program_for_bug ... FAILED

failures:

failures:
    tests::compiler_linux_optimized_64_test_program_for_bug

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 1543 filtered out; finished in 0.18s

error: test failed, to rerun pass `-p polkavm --lib`

@jarkkojs
Copy link
Collaborator

jarkkojs commented Jan 25, 2025

I can work with that and move on, can be refined during review. The current "without fix" result is:

test tests::compiler_linux_optimized_32_test_program_for_bug ... ok
test tests::compiler_linux_optimized_64_test_program_for_bug ... FAILED
test tests::compiler_linux_unoptimized_32_test_program_for_bug ... ok
test tests::interpreter_optimized_32_test_program_for_bug ... ok
test tests::interpreter_optimized_64_test_program_for_bug ... FAILED
test tests::interpreter_unoptimized_32_test_program_for_bug ... ok
test tests::interpreter_unoptimized_64_test_program_for_bug ... FAILED
test tests::compiler_linux_unoptimized_64_test_program_for_bug ... FAILED
test tests::tracing_linux_optimized_32_test_program_for_bug ... ok
test tests::tracing_linux_optimized_64_test_program_for_bug ... FAILED
test tests::tracing_linux_unoptimized_32_test_program_for_bug ... ok
test tests::tracing_linux_unoptimized_64_test_program_for_bug ... FAILED

@jarkkojs
Copy link
Collaborator

I'll commit fix tomorrow, squash the branch and open up a PR so on Monday forward feedback is welcome. It was useful to know that 32-bit does not have the issue. I keep the naming as it is up until review for the tests (and the check). Let's polish those details next week...

@jarkkojs
Copy link
Collaborator

The first knock of the ice to this code (I'm looking it for the first time so expect some misunderstandings):

diff --git a/crates/polkavm-linker/src/program_from_elf.rs b/crates/polkavm-linker/src/program_from_elf.rs
index eb574f1..4252bd8 100644
--- a/crates/polkavm-linker/src/program_from_elf.rs
+++ b/crates/polkavm-linker/src/program_from_elf.rs
@@ -7930,6 +7930,17 @@ where
                 );
                 continue;
             }
+            [(_, Kind::Mut(MutOp::Add, RelocationSize::U32, target_1)), (_, Kind::Mut(MutOp::Sub, RelocationSize::U32, target_2))] => {
+                relocations.insert(
+                    current_location,
+                    RelocationKind::Offset {
+                        origin: *target_1,
+                        target: *target_2,
+                        size: SizeRelocationSize::Generic(RelocationSize::U32),
+                    },
+                );
+                continue;
+            }
             [(_, Kind::Mut(MutOp::Add, size_1, target_1)), (_, Kind::Mut(MutOp::Sub, size_2, target_2))] if size_1 == size_2 => {
                 relocations.insert(
                     current_location,

This code is somewhat alien to me so have to start with something but based earlier comment (with objdump transcripts) get_relocation_target() is working correctly.

All tests pass. @koute, any ideas why 32-bit tests might have passed also with the wrong order (just out of top of your head if you have ideas, no need for in-depth analysis)?

Transcript:

$ cargo t -p polkavm|grep program_for_    
warning: /home/jarkko/work/github.com/paritytech/polkavm/Cargo.toml: unused manifest key: workspace.lints.rust.unexpected_cfgs.check-cfg
    Finished test [unoptimized + debuginfo] target(s) in 0.03s
     Running unittests src/lib.rs (target/debug/deps/polkavm-b7e6b492dbae036e)
test tests::compiler_linux_optimized_32_test_program_for_bug ... ok
test tests::compiler_linux_optimized_64_test_program_for_bug ... ok
test tests::compiler_linux_unoptimized_32_test_program_for_bug ... ok
test tests::interpreter_optimized_32_test_program_for_bug ... ok
test tests::interpreter_optimized_64_test_program_for_bug ... ok
test tests::compiler_linux_unoptimized_64_test_program_for_bug ... ok
test tests::interpreter_unoptimized_32_test_program_for_bug ... ok
test tests::interpreter_unoptimized_64_test_program_for_bug ... ok
test tests::tracing_linux_optimized_32_test_program_for_bug ... ok
test tests::tracing_linux_optimized_64_test_program_for_bug ... ok
test tests::tracing_linux_unoptimized_32_test_program_for_bug ... ok
test tests::tracing_linux_unoptimized_64_test_program_for_bug ... ok
     Running tests/integration_tests.rs (target/debug/deps/integration_tests-c78775a2d72333bb)
   Doc-tests polkavm

@jarkkojs
Copy link
Collaborator

All tests pass. @koute, any ideas why 32-bit tests might have passed also with the wrong order (just out of top of your head if you have ideas, no need for in-depth analysis)?

Never mind, obviously it just does not emit that type of relocation.

Similarly lto = fat made program-for-bug work because it neither generated that relocation. So all questions are answered and PR is out...

@jarkkojs
Copy link
Collaborator

Doing integration to tests.rs but I see some random tests failing

They're not random, and if you'll read the error message it should tell you why it's failing and how to fix it. :P

I did read the error message. There's nothing pointing in your comment to any error message that was not in the capture of the transcript that I captured.

@jarkkojs
Copy link
Collaborator

32

All tests pass. @koute, any ideas why 32-bit tests might have passed also with the wrong order (just out of top of your head if you have ideas, no need for in-depth analysis)?

Never mind, obviously it just does not emit that type of relocation.

Similarly lto = fat made program-for-bug work because it neither generated that relocation. So all questions are answered and PR is out...

32-bit uses instead R_RISCV_32 relocations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Minimal Feature Launch
Development

No branches or pull requests

4 participants