-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug when accessing string compiled into binary #247
Comments
Minimal reproducer: #![no_std]
#![no_main]
extern crate core;
use polkavm_derive::polkavm_export;
use core::fmt::Write;
pub enum Foo {
One,
CalleeTrapped,
Unknown,
}
impl ::core::fmt::Debug for Foo {
#[inline]
fn fmt(&self, f: &mut ::core::fmt::Formatter) -> ::core::fmt::Result {
::core::fmt::Formatter::write_str(
f,
match self {
Foo::Success => "Success",
Foo::CalleeTrapped => "CalleeTrapped",
Foo::Unknown => "Unknown",
},
)
}
}
struct Writer;
impl core::fmt::Write for Writer {
fn write_str(&mut self, s: &str) -> core::fmt::Result {
unsafe {
crate::debug_message(s.as_ptr(), s.len() as u32);
}
Ok(())
}
}
#[polkavm_derive::polkavm_import]
extern "C" {
pub fn debug_message(str_ptr: *const u8, str_len: u32);
}
#[polkavm_export(abi = polkavm_derive::default_abi)]
pub fn deploy() {
let mut m = Writer {};
let _ = write!(&mut m, "{:?}", Foo::Success);
}
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
unsafe {
core::arch::asm!("unimp");
core::hint::unreachable_unchecked();
}
} compiled via
And: linker.define_typed(
"debug_message",
|caller: Caller<()>, buffer: u32, length: u32| {
let buffer = caller.instance.read_memory(buffer, length).unwrap();
eprintln!("print: {:?}", String::from_utf8(buffer).unwrap());
},
).unwrap(); Would appreciate someone taking a look. |
Have you checked that the value of the pointer you pass in the guest and |
Yeah I had checked, that works fine. The more extensive bug report above explains that passing strings back and forth works, but accessing certain string literals from the binary is an issue. |
@jarkkojs Can you try to reproduce and debug what is going on here? |
Yeah, I can check this out tomorrow. I just noticed this. |
At first I thought the issue might be that the import is defined as pointer (64bit) while the host function definition is u32. However, it should not be a problem since all pointers will fit into |
@cmichi What am I missing here? |
False flag, I forgot to change the directory:
I.e. can reproduce with the 1:1 outcome on my side. With
|
Better to debug this with |
@athei Just a sanity check: does the result which is not mangled by the optimizer ring any bells? I created https://github.com/paritytech/polkavm/tree/bug-247 for debugging purposes. It is forked from @cmichi's commit with following differences:
|
There could be two separate bugs (possibly), where the second is unmasked by the first. I.e. possibility of incorrect handling of strings which are just "\0" in the optimizer. I don't know details of that code. Just adding a remark. |
@cmichi Just a remark. It is crucial to use exactly
Just wanted to make this remark to denote that even tho with both settings a bug is triggered, the way you set this knob in the reproduce does actually matter, that's all :-) So we should find the needle exactly from this output (and not factor the difficulty):
I'm setting up a branch for better compilation workflow... |
@athei, @cmichi @koute https://asciinema.org/a/1OvMyRg8hO0NuRmeaYPr4Ixag Check this out. I did not try to proactively fix the bug. I just integrated it better to the existing infrastructure and used the stuff in my in-progress PR [1]. Now it produces the expected output. This was unintentional from my side. The first thing that comes to mind (not yet verified) that maybe it is just a version mismatch. I mean the reproducer enforced polkavm-derive 0.18, which could lead to problems given the opcode number changes. You can checkout my branch yourself: https://github.com/paritytech/polkavm/tree/bug-247 [1] #240 |
@jarkkojs I've updated my branch with the most minimal reproducer from my second comment here. The branch now also uses a relative path to the crates in the repo. I still get a malformed output:
My personal guess is that something goes wrong during relocation. |
Yes, it is obvious that there is a glitch still but the above experiments narrow it down. We know now e.g. that if polkavm-derive is causing this bug, then it must be in both 0.18 and 0.19 version. To summarize:
It's progress at least :-) EDIT: I'll now try mix and match the right sauce to bring back the bug. |
Which version works and which won't? |
The version in my bug-247 branch does not works (after cargo b --release run target/release/bug-247). This was unintentional. So now I take @cmichi's original example and step by step move back to its implementation until I find the crack, and that way we can precisely determine the root cause. |
In the original bug reproducer by @cmichi, this modification excludes the possibility of DWARF processing bug:
I.e. the result is the same:
|
Fix:
Run: |
@cmichi, @athei, @koute: In the above script, if I comment patch-command, i.e.
Then the output becomes:
If I have
So it looks like polkavm-linker fails unless link-time optimizations are performed across the crates? |
The behavior should always be the same with and without LTO. Optimizations should never change the program's behavior, and if they do then that's a bug. We need to know why this happens and properly fix it/prevent it from happening. |
Any ideas what look for, given that the symptom is triggered by disabling LTO? |
I would start by diffing the two binaries (lot, non-lto) and look for something interesting. Cross module (crate) inlining is usually the biggest difference you will observe. In this case You probably want to minimize the reproducer even more to reduce the amount of code to sift through. Especially the You can also reduce the amount of generated code by adding |
No need to diff anything here; the issue should be easily debuggable just by looking at the disassembly. Anyway, to speed this up I took a look at this; @jarkkojs here's why/how it happens, I trust you can take it from here. Looking at the disassembly of the buggy function (
Which is roughly this in pseudo code:
Where TABLE_1 is .Lswitch.table.ZN57$LT$program_for_bug..Foo$u20$as$u20$core..fmt..Debug$GT$3fmt17h48080cab75666063E So we have two lookup tables: one has the pointer, the other has the length. Let's verify their contents.
So TABLE_1 indeed contains correct lengths: [0x7, 0xd, 0x7] Looking at the program output it prints out 7 zeros, so the lookup from this length table is most likely correct. So what's probably broken is the pointer fetch. So looking at the disassembly dump we have three relocations in TABLE_2 (
So how does this work? Let's look at the code again:
So this takes the address of the lookup table itself and adds to this table offset that it fetches from the table. And what do the relocations do? The first R_RISCV_ADD32 writes the address of the "Success" string into the table, and the R_RISCV_SUB32 then subtracts from that the address of the table, effectively leaving only the relative offset (from the beginning of the table) to the "Success" string (so adding the address of the table and this offset will give us the full address to the string, exactly like the code's doing). So we can look at the linker logs to see where each section is in memory, here's the "Success" string:
and here's the switch table which should contain a pointer to it:
We can print out what's in those sections at the end of linking:
So the string itself is correctly emitted. The offset table however contains
Yep, so it's indeed trying to read from that address. So the offset inside of the table is incorrect; what we need is |
@koute Yeah, that will definitely help thanks! I'll look it in detail tomorrow. |
@koute Do we want also a CI test for this bug or only the fix? (non-opionated question) |
Every bug fix should come with a regression test. So yes. |
OK fair enough and this was my thinking too. I just save the blob for now and we can switch to dynamically generated later on (once polkavm-test-data compiles in Windows). |
@athei, @koute, Sanity checking. Doing integration to tests.rs but I see some random tests failing:
I tried to master branch too and they seem to fail also there. Can I safely ignore? Also, please suggest name for "program-for-bug" (low priority, can be done also part of PR review later on). |
They're not random, and if you'll read the error message it should tell you why it's failing and how to fix it. :P |
I mapped your comment to action: Now I ignore them :-p |
LOL, ok :-) That was a quick fix. |
Let's assume for the sake of example I'll add debug_message() to TestInstance::new(): linker.define_typed(
"debug_message",
|caller: Caller<()>, buffer: u32, length: u32| {
let buffer = caller.instance.read_memory(buffer, length).unwrap();
eprintln!("print: {:?}", String::from_utf8(buffer).unwrap());
},
).unwrap(); And let's further assume we had a test: fn test_program_for_bug(config: Config, optimize: bool, is_64_bit: bool) {
let blob = if is_64_bit { PROGRAM_FOR_BUG_64_ELF_ZST } else { PROGRAM_FOR_BUG_32_ELF_ZST };
let mut i = TestInstance::new(blob, &config, optimize, is_64_bit);
i.call::<(), ()>("deploy", ()).unwrap();
} So probably instead of printing I'd want to store the String somewhere and compare post executing the guest call. If I add a String field to TestInstance, can I acccess that somehow from the closure, or how should I wire this? |
So this works now:
And has all the variants:
But obviously it is all pass, given that I don't check the string. |
So I guess one option is to replace eprintln() with (did not check compilation): let s = String::from_utf8(buffer).unwrap();
assert_eq!(&s, "Success"); |
It did the job:
|
I can work with that and move on, can be refined during review. The current "without fix" result is:
|
I'll commit fix tomorrow, squash the branch and open up a PR so on Monday forward feedback is welcome. It was useful to know that 32-bit does not have the issue. I keep the naming as it is up until review for the tests (and the check). Let's polish those details next week... |
The first knock of the ice to this code (I'm looking it for the first time so expect some misunderstandings):
This code is somewhat alien to me so have to start with something but based earlier comment (with objdump transcripts) get_relocation_target() is working correctly. All tests pass. @koute, any ideas why 32-bit tests might have passed also with the wrong order (just out of top of your head if you have ideas, no need for in-depth analysis)? Transcript:
|
Never mind, obviously it just does not emit that type of relocation. Similarly |
I did read the error message. There's nothing pointing in your comment to any error message that was not in the capture of the transcript that I captured. |
32
32-bit uses instead R_RISCV_32 relocations. |
I believe I found a bug in one of the
polkavm
components. I've put the reproducer here.The reproducer program tries to output a string that is already present in its own blob (in the
.data
section, I think). But instead of printing the intended string ("Success"
in the reproducer) we get an artifact from another string present in the blob. The length of the printed string is correct, the output is just a different part of another string from the blob.The output will be along the lines of
This is wrong, it should be
The string
the nul
is present in the blob and part of a longer string. This can be checked viastrings $INPUT
. The length (7
) corresponds to the length of"Success"
.The bug also occurs if I use a different Rust version or don't use an allocator. If I don't include an allocator, the fragment that is output is just different.
Compiling
program-for-bug
without--release
seems to mitigate the bug at least for the heap variant:The text was updated successfully, but these errors were encountered: