
Failing benchmark instances #167

Open
aorwall opened this issue Jul 2, 2024 · 6 comments


aorwall commented Jul 2, 2024

Great job with the new containerized evaluation tool! I've run it a couple of times on the gold patches for SWE-bench Lite, and overall it gives more stable results than my swe-bench-docker setup. There are a few instances that fail intermittently, though. Some I recognize from tests in swe-bench-docker, and some are new. None of them fail in 100% of the runs.

Django instances

In all the failing Django instances I've checked, the tests seem to pass but are marked as failed because other log output is printed in the middle of the test results.

Here's an example of a test that is marked as failed:

test_annotation_with_nested_outerref (expressions.tests.BasicExpressionsTests) ... System check identified no issues (0 silenced).
ok

The same test in a successful test output log:

test_annotation_with_nested_outerref (expressions.tests.BasicExpressionsTests) ... ok
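
For illustration, here's a minimal sketch of a log parser that tolerates this kind of interleaved output (this is just the idea, not necessarily the harness's actual Django parser; the regex and the set of recognized verdicts are assumptions):

```python
import re

# Minimal sketch: tolerate extra output (e.g. "System check identified no
# issues (0 silenced).") appearing between a test's name and its verdict.
# Not the harness's actual parser; regex and verdict set are assumptions.
TEST_LINE = re.compile(r"^(test_\S+ \([^)]+\)) \.\.\.\s*(.*)$")
VERDICTS = {"ok", "FAIL", "ERROR"}

def parse_django_log(log: str) -> dict[str, str]:
    statuses: dict[str, str] = {}
    pending = None  # test whose verdict hasn't been seen yet
    for line in log.splitlines():
        m = TEST_LINE.match(line)
        if m:
            pending = m.group(1)
            line = m.group(2)  # the verdict may follow on the same line...
        if pending and line.strip() in VERDICTS:
            statuses[pending] = line.strip()  # ...or on a later line
            pending = None
    return statuses
```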

Other instances

In the following instances, different tests fail intermittently and I haven't found the root cause. I saw the same issues in swe-bench-docker with the matplotlib and sympy instances; I didn't have issues with psf__requests there, though.

  • matplotlib__matplotlib-23987
  • psf__requests-1963
  • psf__requests-2317
  • psf__requests-2674
  • sympy__sympy-13177
  • sympy__sympy-13146

Have you experienced the same issues? Would it also be possible for you to share your run_instance_logs somewhere so I can compare against your successful evaluation runs? Would be nice to nail this once and for all :)

I've run the benchmarks on Ubuntu 22 VMs with 16 cores on Azure (max_workers = 14).
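
For reference, the gold-patch runs above were invoked roughly like this (a sketch based on the containerized harness as of mid-2024; the module path and flags may have changed since, and the run_id is just an arbitrary label):

```python
import subprocess

# Rough sketch of how the gold-patch evaluation was launched; flags reflect
# the mid-2024 harness and may have changed. "gold-validation" is an
# arbitrary run label.
subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",
        "--predictions_path", "gold",   # evaluate the gold patches
        "--max_workers", "14",
        "--run_id", "gold-validation",
    ],
    check=True,
)
```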

ofirpress commented:

thanks for making an issue about this

@john-b-yang self-assigned this on Jul 2, 2024
john-b-yang commented:

Hi @aorwall! Thanks so much for the kind words, swe-bench-docker was a huge inspiration for our release. Really appreciate all the past and ongoing work w/ Moatless + SWE-bench evals 😄

Ok so regarding the issue...

  • Django test parsing issues: Yeah, agreed, we've seen this too. The Django log parsing was most recently updated in "Fix newline outputs for django's log parser" (#166), with a couple of changes before that as well. I think it should cover the case you're describing - I'll check.
  • Intermittently failing tests: We've been noticing this too. The approach we feel is reasonable is to simply remove flaky tests when they're P2P (pass-to-pass) ones, which has been the large majority so far. I'll take a look at these instances, run them 5x, and apply that clean-up.
  • run_instance_logs for lite: Linked here!

I'm actively working on item 2 and have gotten some help from Stanford folks on this as well - I think you can expect a dataset update that addresses these problems by the end of next week at the latest!

@john-b-yang added the "bug" and "in progress" labels on Jul 2, 2024

aorwall commented Jul 3, 2024

Looks like #166 fixed the Django issues 👍


aorwall commented Jul 7, 2024

I'm currently working on a hosted solution where I'm running benchmarks on virtual machines in Azure. I've gotten very stable results with no instances failing (except for sympy__sympy-13177 sometimes). It looks like you nailed it 💪

One thing that would be worth investigating is the performance of sympy__sympy-11870. The reason the Lite benchmark takes over 15 minutes to run seems to be that this single instance takes 16 minutes. It would be possible to run the benchmark in 10 minutes on a 16-core machine if it weren't for that one instance.
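
To make the arithmetic concrete, here's a small sketch showing that the wall-clock time of a worker pool can never drop below its longest single job; the per-instance runtimes are made up for illustration, apart from the ~16-minute figure mentioned above:

```python
import heapq

# Sketch: with any number of workers, the makespan is bounded below by the
# longest single job. Runtimes are hypothetical, except the ~16-minute
# sympy__sympy-11870 figure from the comment above.
def makespan(runtimes_min: list[float], workers: int = 16) -> float:
    loads = [0.0] * workers                       # minutes assigned per worker
    heapq.heapify(loads)
    for t in sorted(runtimes_min, reverse=True):  # longest jobs first
        heapq.heappush(loads, heapq.heappop(loads) + t)
    return max(loads)

hypothetical = [0.5] * 299 + [16.0]    # 300 Lite instances, one slow outlier
print(makespan(hypothetical))          # 16.0 - dominated by the outlier
print(makespan([0.5] * 299))           # ~9.5 - without the outlier
```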

And ping me on Discord if you'd like to try my solution. I might have something up and running in the next few days...


aorwall commented Jul 17, 2024

I think I was a bit too quick there. It seems like there are still some shaky tests in sympy/sympy and matplotlib/matplotlib. Dependency issues in astropy/astropy and pydata/xarray might be resolved by #184.

sympy__sympy-13146 almost always fails on the same line of code. I'm not sure whether that check gives false positives in the test, but removing the line from the test_patch makes the test more stable:

```diff
diff --git a/sympy/core/tests/test_evalf.py b/sympy/core/tests/test_evalf.py
index b7cb7abc08..014707aace 100644
--- a/sympy/core/tests/test_evalf.py
+++ b/sympy/core/tests/test_evalf.py
@@ -225,7 +225,9 @@ def test_evalf_bugs():
 
     #issue 5412
     assert ((oo*I).n() == S.Infinity*I)
-    assert ((oo+oo*I).n() == S.Infinity + S.Infinity*I)
+
+    #issue 11518
+    assert NS(2*x**2.5, 5) == '2.0000*x**2.5000'
 
 
 def test_evalf_integer_parts():
```

For the other sympy tests, I haven't found a solution; it's always recursion depth errors. I've tried increasing the recursion limit with sys.setrecursionlimit(...), but I'm not sure whether it helps.
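
For what it's worth, the mitigation looks roughly like this (the limit and stack size values are arbitrary guesses, and it's unconfirmed whether this actually stabilizes the flaky sympy instances):

```python
import sys
import threading

# Roughly the mitigation described above: raise the interpreter's recursion
# limit before running the sympy tests. Values are arbitrary guesses; it's
# unconfirmed whether this actually stabilizes the flaky instances.
sys.setrecursionlimit(10_000)           # CPython's default limit is 1000
# Deep recursion can also exhaust the C stack in worker threads, so a larger
# thread stack size may be needed too (also unconfirmed). Must be called
# before the worker threads are started.
threading.stack_size(64 * 1024 * 1024)  # 64 MiB per new thread
```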

So we can either comment out some flaky tests or try different dependency versions.

The psf/requests tests seem to be shaky, possibly because some of them hit httpbin.org, which sometimes doesn't respond. I set up my own httpbin instance, which gives more stable test results.
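
As a sketch, the local httpbin setup can look roughly like this (assuming Docker is available and that the requests test suite of that era reads the HTTPBIN_URL environment variable; worth verifying for each instance):

```python
import os
import subprocess

# Sketch of the workaround described above: run a local httpbin so the
# requests tests don't depend on the public httpbin.org being responsive.
# Assumptions: Docker is available and the test suite honors HTTPBIN_URL.
subprocess.run(
    ["docker", "run", "-d", "-p", "8080:80", "kennethreitz/httpbin"],
    check=True,
)
os.environ["HTTPBIN_URL"] = "http://localhost:8080/"  # picked up by the tests
```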

I've set up my hosted version for testing at eval.moatless.ai. Feel free to try it out. The report of the latest gold patch run can be found here.

john-b-yang commented:

Hmm ok yeah I'm definitely also noticing this recursion depth limit issue.

It does look like most of the shaky tests are pass-to-pass tests. I'm currently working with another team who've been able to identify these shaky tests by just running the gold patch multiple times and checking for any tests that don't consistently pass.
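
A minimal sketch of that detection step (the per-run {test_name: status} input shape is an assumption for illustration, not the team's actual tooling):

```python
from collections import defaultdict

# Run the gold patch N times, collect a {test_name: status} map per run,
# and flag any test whose outcome is not identical across every run.
# The input shape is an assumption for illustration.
def find_shaky_tests(runs: list[dict[str, str]]) -> dict[str, set[str]]:
    seen: dict[str, set[str]] = defaultdict(set)
    for run in runs:
        for test, status in run.items():
            seen[test].add(status)
    all_tests = set().union(*(run.keys() for run in runs)) if runs else set()
    return {
        test: seen[test]
        for test in all_tests
        # Shaky if the status varies, or if the test is missing in some run.
        if len(seen[test]) > 1 or any(test not in run for run in runs)
    }
```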

I think the resolution will likely just be eliminating the shaky pass-to-pass tests from the SWE-bench dataset. This hasn't happened yet, but I think we'll try to make it happen by the end of the month.

Also, the hosted testing version looks beautiful! We're working on something similar, so it's great to see a version of it already out 😄
