All Sphinx tests are incorrectly reported as failing #228
Comments
From a little bit of digging, maybe the problem is that the tests are not being run with enough verbosity for some reason, so the test harness is not picking up the PASS?
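To make the hypothesis above concrete, here is a minimal sketch of how a harness might read per-test statuses from pytest's short test summary, which lists passing tests only when pytest is run with -rA. The function name and both sample logs are hypothetical, and this is not the actual swebench log parser.

```python
import re

# Matches short-summary lines such as "PASSED tests/test_x.py::test_a".
STATUS_LINE = re.compile(r"^(PASSED|FAILED|ERROR|SKIPPED|XFAIL|XPASS)\s+(\S+)")


def parse_statuses(log):
    """Map test id -> status for every summary line found in the log."""
    statuses = {}
    for line in log.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            statuses[match.group(2)] = match.group(1)
    return statuses


# Hypothetical log excerpts: same run, with and without the -rA summary.
with_ra = """\
tests/test_x.py ..                                   [100%]
PASSED tests/test_x.py::test_a
PASSED tests/test_x.py::test_b
"""
without_ra = """\
tests/test_x.py ..                                   [100%]
"""

expected = ["tests/test_x.py::test_a", "tests/test_x.py::test_b"]
for label, log in [("with -rA", with_ra), ("without -rA", without_ra)]:
    found = parse_statuses(log)
    # Any expected test without an explicit PASSED line counts as failing.
    missing = [t for t in expected if found.get(t) != "PASSED"]
    print(label, "-> treated as failing:", missing)
```

Under this assumption, a run without the summary lines would report every expected test as failing even though the tests themselves passed.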
By comparing to successful run logs from the leaderboard, it looks like my pytest command is wrong! It's missing the theirs: I think the patch that is supposed to apply that is failing. From run_instance.log:
MY patch.diff DOES include adding -rA, which maybe is conflicting with the test harness trying to do the same thing?
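One quick way to check this is to scan the patch for added lines that mention the flag. The sketch below assumes a standard unified diff; the helper name and the sample diff are hypothetical and only illustrate the shape of such a change.

```python
def added_lines_with_flag(diff_text, flag="-rA"):
    """Return (file, added line) pairs where the added line contains `flag`."""
    hits = []
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/tox.ini" -> "tox.ini"
            path = line[4:].strip()
            current_file = path[2:] if path.startswith("b/") else path
        elif line.startswith("+") and flag in line[1:].split():
            hits.append((current_file, line[1:].strip()))
    return hits


# Hypothetical diff, not copied from any real run.
sample_diff = """\
diff --git a/tox.ini b/tox.ini
--- a/tox.ini
+++ b/tox.ini
@@ -1,3 +1,3 @@
 commands=
-    pytest --durations 25 {posargs}
+    pytest -rA --durations 25 {posargs}
"""

for path, text in added_lines_with_flag(sample_diff):
    print(path, "adds:", text)
```

If the flag already appears in the submitted patch, a later step that tries to make the same edit to the same file may fail to apply, which is the conflict suspected here.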
My hypothesis at this point is that my containers did not get cleaned up somehow, and that's what's causing problems...
OK, it's not the fault of dirty containers: those extra lines are coming from Sphinx's pre-install steps applied by SWE-agent! This bug affects SWE-agent, and you can see it in its official submissions: (My agent is a fork of SWE-agent.) The fix is either:
Please ignore the assignment. I will get to this issue in a timely manner.
Describe the bug
Every single FAIL_TO_PASS and PASS_TO_PASS test case is being marked as a failure in the report.json for every single task from the Sphinx repo, on swebench Verified. Looking at test_output.txt, it looks as if many are actually passing...
I also noticed that the patch.diff includes many changes to setup.py and tox.ini, even though for this test my agent only edited a single line (adding a comment). I'm not sure if that's intentional with the build steps for the sphinx task, or something could be wrong with my setup.
Steps/Code to Reproduce
My agent did nothing but view some files and add a comment:
Grading:
Expected Results
For a noop change like this, I would expect all of the PASS_TO_PASS tests to pass.
Actual Results
Report.json:
test_output.txt:
run_instance.log:
eval.sh:
System Information