
Rework MRI test exclusions #3658

Merged (27 commits), Sep 27, 2024

Conversation

nirvdrum (Collaborator)

This is a massive overhaul of our MRI test tags. While dealing with a regression a while back, I realized we had an MRI test that would have caught it, but the relevant test was never untagged. I think that because retagging is so error-prone and so time-consuming, almost no one ever updates the tag list. Consequently, we had a lot of tests tagged that no longer needed to be.

To update the tags I took a significant pass over the retagging process. We already made an effort to extract a meaningful reason message for each tag, but that process was quite limited. And we still had a ton of tags that had "needs investigation" as the reason string. So, I added more extensive parsing to extract a useful message in almost all cases.

Retaining Tags

Because retagging deletes the original tag file and repopulates it with the latest test results, we'd often end up deleting tags that were added manually because a test only fails sporadically or only fails on a particular platform. If those tests didn't fail in the latest run, the retagger would think they passed. But they'd often fail in CI until someone went and re-added the missing tags. It was an incredibly slow and tedious process.

To solve this problem I changed the retagging process. The process now looks like:

  1. Read the old tag file into memory
  2. Delete the old tag file
  3. Scan the list of tags for any that match any of the MRI_TEST_RETAG_RETAIN_PATTERNS regexps
  4. Rewrite any such lines back to the tag file
  5. Run the tests
  6. Process the test log file to add any additional tags to the tag file
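The retain step (3–4) amounts to a simple filter over the old tag file. A minimal sketch, where the patterns are illustrative stand-ins rather than the actual MRI_TEST_RETAG_RETAIN_PATTERNS from the repository:

```ruby
# Illustrative retain patterns; the real list is MRI_TEST_RETAG_RETAIN_PATTERNS.
RETAIN_PATTERNS = [/sporadic/i, /\bslow\b/i, /RUBY_PLATFORM/].freeze

# Given the lines of the old tag file, keep only the manually retained tags;
# everything else gets regenerated from the fresh test run.
def retained_tags(old_lines)
  old_lines.select { |line| RETAIN_PATTERNS.any? { |pattern| line.match?(pattern) } }
end

old_lines = [
  'exclude :test_flaky_io, "sporadic failure on CI"',
  'exclude :test_big_sort, "slow"',
  "exclude :test_mac_only, 'fails' if RUBY_PLATFORM.include?('darwin')",
  'exclude :test_fixed_bug, "needs investigation"'
]
kept = retained_tags(old_lines)
# The first three lines survive; the generic "needs investigation" tag is
# dropped and recomputed from the new test run.
```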

Each of the retained tags will need to be manually deleted if we want to see whether a test passes without the tag present. The expectation is that developers working on a particular piece of functionality will check whether any of the relevant retained tags are still needed. However, having gone through the list of them, most are unlikely to ever be untagged because the test just doesn't make sense on TruffleRuby.

Retrying Tag Generation

The other major issue we had with retagging is that if the test process exited, the tag file was left in an incomplete state. The VM crashing is an obvious, but rare, example. A more common one is an exception that escapes the interpreter. Since we run the MRI tests with VM assertions enabled and we don't translate Java AssertionError instances to Ruby exceptions, any time an assertion failed the process would exit. When that happens, the tag file needs to be restored from git and the developer needs to manually add a tag for the test that caused the crash, then start the tagging process again. Without the ability to retain tags, that really means we used to have to modify the tests to remove the one that causes the crash. Suffice it to say, it was a very tedious and time-intensive process.

To solve this problem I now capture stderr to the MRI test log file. We didn't do that before because the log file parser didn't deal with those errors at all and we needed to present them to the developer. The log file parser now handles the following cases:

  • Exceptions that escape the interpreter
  • JVM crashes
  • Undefined symbols (Linux)
  • dyld errors (macOS)
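Classifying those four failure modes comes down to pattern-matching the captured stderr. A sketch with marker strings that are typical of each failure mode, but not taken verbatim from tool/parse_mri_errors.rb:

```ruby
# Hypothetical classifier for crash output captured on stderr.
CRASH_MARKERS = {
  escaped_exception: /Exception in thread "main"|java\.lang\.AssertionError/,
  jvm_crash:         /A fatal error has been detected by the Java Runtime Environment/,
  undefined_symbol:  /undefined symbol:/,                               # Linux dynamic linker
  dyld_error:        /\bdyld\b.*(Library not loaded|Symbol not found)/  # macOS
}.freeze

# Return the kind of crash found in the stderr text, or nil for a clean run.
def classify_crash(stderr_text)
  CRASH_MARKERS.each do |kind, pattern|
    return kind if stderr_text.match?(pattern)
  end
  nil
end
```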

Since the process crashes, the test log file is incomplete. To avoid having to manually re-run the tests, there's a new protocol:

  1. The log file parser writes a new excludes line for the last test that ran when the process exited. These tags match one of the patterns in MRI_TEST_RETAG_RETAIN_PATTERNS, so they will not be removed when retagging.
  2. The log file parser will return exit status 2 to indicate to jt that the test should be retried.
  3. jt, which is driving the retag process, will check the status code and if it's 2, then it restarts the tagging process for the current test (basically a while loop around steps 5 & 6 in the workflow for retaining tags).
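The driving loop in jt boils down to re-running the run-and-parse steps until the parser stops asking for a retry. A minimal sketch; the constant name mirrors tool/parse_mri_errors.rb's RETRY_EXIT_STATUS, and the attempt cap is an assumed safety net, not part of the protocol:

```ruby
RETRY_EXIT_STATUS = 2 # assumed to mirror the constant in tool/parse_mri_errors.rb

# Re-run the given block (run tests, then parse the log) until the parser
# stops returning the retry status. Each retry runs with the tags written so
# far retained, so the process makes forward progress past each crash.
def retag_with_retries(max_attempts: 50)
  max_attempts.times do
    status = yield
    return status unless status == RETRY_EXIT_STATUS
  end
  RETRY_EXIT_STATUS
end

statuses = [2, 2, 0] # two crashes tagged and retried, then a clean run
result = retag_with_retries { statuses.shift }
# result == 0 after three attempts
```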

The retry loop for retagging can take a long time. We have some tests that are just very slow. Some of the C extension tests will continually trip over missing symbols. But, critically, this process does not require manual intervention. A developer should be able to run jt retag on any test file and get an accurate set of tags without having to clean up after a crash (well, you'll need to manually clean up any hs_err logs).

Slow Tests

Some of the MRI tests are really slow. Anything that was previously manually tagged as slow will match MRI_TEST_RETAG_RETAIN_PATTERNS and be retained on retagging, meaning those tests won't be run at all. But, that process still requires someone to recognize a slow test and manually add a tag.

The MRI test log file has a line showing how long each test took. It's not entirely accurate because it shows the time the test took to run once the VM was booted, but it's better than nothing. I've augmented the log file parser to tag any test that takes 30s or more as "slow" and any test that takes 60s or more as "very slow". There's no behavioral difference between the two tags; it just makes it easier to grep for very slow tests.

The 30s and 60s thresholds can be changed easily in tool/parse_mri_errors.rb. In addition to tagging the tests as slow, I've added information about the testing environment to the tag. The TruffleRuby version, build configuration, platform, CPU info, and total number of vCPUs are logged. This, I hope, will make it easier to reevaluate slow tests later on. If someone runs the test in 15s and it used to take 45s, they can use that context to determine whether we've made TruffleRuby faster or it's just a difference in the underlying system.
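The thresholds and the environment context translate to a tiny classifier. A sketch, where the method names and the context string's format are illustrative (the real values live in tool/parse_mri_errors.rb):

```ruby
require 'etc'

SLOW_THRESHOLD      = 30 # seconds; both thresholds are easy to adjust
VERY_SLOW_THRESHOLD = 60

# Map a test's runtime to a slowness label, or nil if it's fast enough.
def slowness_label(seconds)
  return 'very slow' if seconds >= VERY_SLOW_THRESHOLD
  return 'slow'      if seconds >= SLOW_THRESHOLD
  nil
end

# Hypothetical environment context recorded alongside a slow tag, so the
# measurement can be judged against the machine that produced it.
def slow_tag_context
  "#{RUBY_ENGINE} #{RUBY_PLATFORM}, #{Etc.nprocessors} vCPUs"
end
```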

Tag Ordering

It irked me a bit that the order of tags was dependent on the order the tests ran in. Likewise, manually added tags were put in inconsistent locations. Now when the log file parser emits the tags, it sorts them alphabetically. Besides making the tag files easier to read, I expect this will make future diffs easier to review as well.
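The sort itself is a one-liner, but it makes the tag files deterministic regardless of test execution order. A minimal sketch with made-up exclude lines:

```ruby
# Exclude lines as the parser might emit them, in execution order.
emitted = [
  'exclude :test_zip, "timed out"',
  'exclude :test_each, "needs investigation"'
]

# Sorting before writing the tag file yields the same file no matter which
# test happened to run (and fail) first.
sorted = emitted.sort
```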

Regenerated Tags

I've gone through and regenerated every MRI tag file we have. This was an extremely long process, so I'd appreciate it if no one messes with the tags until this gets merged. I'm hoping to land this before the Ruby 3.3 stdlib merge so we can make use of it then.

I started work on this overhaul 1.5 years ago (possibly 2 years) and every time someone touched a tag file I had to deal with a conflict. I've run this set of tags on an Ubuntu 24.04 x86_64 system and a macOS 14.6.1 ARM64 system. Each platform passed the tests three times consecutively, so I think all of the sporadic tests have been tagged. It would be good to run on the CI machine to verify, however.

CI

I ended up removing a lot of tags when reprocessing each test. Naturally, that means jt test mri will now run more tests. Hopefully, that won't blow out our CI budget, but it's something to look out for. On the other hand, I added tags for slow tests so those won't run in CI. We may need to tweak that. My development workstation is likely faster than GitHub Actions so there are bound to be tests that run in a reasonable amount of time on my machine but are unbearably slow in CI.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Sep 11, 2024
@nirvdrum (Collaborator Author)

I'm going to consolidate some of these commits. Others I still like having separate since it makes it easier to track the changes. Once I've finished with that I'll mark the PR ready for review.

@nirvdrum nirvdrum changed the title Update mri test exclusions Rework MRI test exclusions Sep 11, 2024
@nirvdrum nirvdrum force-pushed the update-mri-test-exclusions branch 5 times, most recently from b3caecf to cc71b4e, September 16, 2024 03:55
@nirvdrum nirvdrum marked this pull request as ready for review September 16, 2024 03:55
@nirvdrum (Collaborator Author)

I've tidied up the PR, squashing commits where I thought it made sense to do so. The ones I've kept as individual commits are each logically encapsulated, and the commit message provides context on the change -- I think they're worth keeping.

There's some work I'd like to do around the jt retag command, but I'll save that for the future. Notably, now that we have tags that will be retained between invocations of jt retag, it might make sense to have a jt retag --force option that deletes all tags in a file, not just those that wouldn't otherwise be retained. Or, perhaps an option to include the tests that are tagged as slow since they should function.

Additionally, when I was trying to reprocess all MRI tests, I found that the change in cabb9b5 to make jt retag skip over tests that don't have a tag file was getting in the way. I ended up disabling that check so I could retag everything. I'm not sure we even need this check any longer, now that retained tags should prevent re-running tests that lock up. But if we want to keep the check, I think it'd be good to have a jt retag --include-all option or something to that effect.

@eregon (Member) left a comment:

Thank you for this PR, it looks great.
I haven't reviewed tool/parse_mri_errors.rb in detail; I'll let @andrykonchin do that.

@eregon (Member) commented Sep 16, 2024

I've run this set of tags on an Ubuntu 24.04 x86_64 system and a macOS 14.6.1 ARM64 system.

Did you run the tests in JVM mode or Native mode?
(it looks like JVM mode because some AssertionErrors are in there)
IIRC the CI only runs them in Native mode, because JVM mode is too slow for CI.
At some point I tried running a subset of "fast" MRI tests on JVM with assertions enabled, but that was still too slow.
Not really an issue, we'll see what the CI finds out anyway.

@nirvdrum (Collaborator Author)

Did you run the tests in JVM mode or Native mode? (it looks like JVM mode because some AssertionError are in there) IIRC the CI only runs them in Native mode, because JVM mode is too slow for CI. At some point I tried running a subset of "fast" MRI tests on JVM with assertions enabled, but that was still too slow. Not really an issue, we'll see what the CI finds out anyway.

I did the retagging with JVM so we could get the assertion errors. There were one or two new tags I needed when I ran on native, which were caught in CI when I opened the draft PR. After that CI failure I ran all the MRI tests with the new tags on both Native and JVM on Linux (x86_64) and macOS (ARM64). It was a little clearer before I squashed some commits.

puts '2. Parsing errors'
sh 'ruby', 'tool/parse_mri_errors.rb', output_file, continue_on_failure: true

process_tests = $LAST_SH_STATUS.exitstatus == 2
Member:

It would be helpful to add some context for this magic number (e.g. to make it a constant or to add a comment and mention tool/parse_mri_errors.rb's RETRY_EXIT_STATUS)

error_lines = error.split("\n")
index = 0

while should_skip_error?(error_lines[index])
Member:

subjective:

Suggested change
while should_skip_error?(error_lines[index])
while should_skip_error_line?(error_lines[index])


# Mismatched exception messages span three lines. It's much more useful if they're combined into one message.
elsif error_display =~ /but the message doesn't match/
error_display << ' ' + error_lines[index + 1..].join(' ')
Member:

just for consistency:

Suggested change
error_display << ' ' + error_lines[index + 1..].join(' ')
error_display << ' ' << error_lines[index + 1..].join(' ')

Collaborator Author:

That's fine. To me, I'm appending a new string made up from the parts on the right-hand side, but it doesn't really matter.

# Mismatched expectations span two lines. It's much more useful if they're combined into one message.
if error_display =~ /expected but was/
index += 1
error_display << ' ' + error_lines[index]
Member:

Suggested change
error_display << ' ' + error_lines[index]
error_display << ' ' << error_lines[index]

prefix = "exclude #{test_method.strip.to_sym.inspect}"
new_line = "#{prefix}, #{(REASON || error).inspect}\n"
platform_guard = platform ? " if RUBY_PLATFORM.include?('#{platform}')" : ''
new_line = "#{prefix} #{(REASON || error_display).inspect}#{platform_guard}\n"

FileUtils.mkdir_p(File.dirname(file))
lines = File.exist?(file) ? File.readlines(file) : []

# we need the ',' to handle a case when one test name is a substring of another test name
Member:

probably this comment is more relevant to the prefix = "exclude... line above.

end

require 'fileutils'
def exclude_test!(class_name, test_method, error_display, platform = nil)
Member:

It might make sense to pass platform as a keyword argument (it's passed only 1 or 2 times)

@@ -0,0 +1,3 @@
exclude :test_02_unknown, "<\"DRbTests::DRbEx::\"> expected but was <\"DRbTests::DRbEx::FooBar\">."
exclude :test_06_timeout, "[Timeout::Error] exception expected, not #<DRb::DRbConnError: execution expired>."
exclude :test_11_remote_no_method_error, "[DRb::DRbRemoteError] exception expected, not #<NoMethodError: undefined method `invoke_no_method' for #<DRbTests::DRbEx:0x4d8>>."
Member:

Wondering why these failing tests aren't already excluded (as well as TestBundlerGem.rb, TestBignum_Pack.rb, TestNetHTTPForceEncoding.rb etc). Does it mean they are not run on CI at all?
cc @eregon

Member:

That one at least is from test/mri/failing.exclude, which excludes test/mri/tests/drb/test_drb.rb entirely.

Maybe we should remove the drb tests from there then if they work?
One concern about DRb is DRb uses ObjectSpace._id2ref which can be really slow, so tests doing that should probably be tagged as slow/very slow, which might happen automatically.

@eregon (Member), Sep 16, 2024:

Not sure about TestBundlerGem I don't see anything excluding that. You could check in CI logs to see if that gets run in CI currently. If yes, then maybe there are extra failures when run by @nirvdrum compared to CI for some reason?

Collaborator Author:

I ran a for loop in my shell over each of the MRI test files and ran jt retag on them. It looks like we originally tagged these files differently. I think it's quite convenient being able to retag everything in a shell loop, so I'd like to keep the new tags. I didn't know about test/mri/failing.exclude.

I can either delete the entries there and keep the individual tags or I can update jt retag to compare the current test file against whatever's in failing.exclude. I think I'm slightly partial to the former since that would give us reasons for each exclusion. But, if we just don't support a particular feature, I could see a bulk block making sense.

Member:

I'm surprised jt test mri FILE when FILE is in failing.exclude would even run given on master it doesn't in that case:

$ jt test mri test/mri/tests/drb/test_drb.rb
Using Interpreted TruffleRuby: mxbuild/truffleruby-jvm
No test files! (probably filtered out by failing.exclude)

But I guess with this PR changes it behaves/behaved differently.

Anyway, if these can run it seems fine to remove them from failing.exclude, unless marked as too slow.
It's also OK to do nothing about these and just merge as it is, as it might be a rabbit hole to remove from this failing list.

…s can cause false positives.

We might process two tests named `test_rstrip` and `test_rstrip_bang`. Without the delimiter as part of the prefix, `test_rstrip` would look like a prefix of `test_rstrip_bang`, causing the tool to skip `test_rstrip_bang`, thinking it was already added. Depending on the order the tests were processed in, you could get two different sets of tags from the same test failures. This was causing us to lose some tags and get different results when retagging MRI tests with random seeds.
The old logic emitted the tags in the order the tests were executed. As a result, if tests are run with random seeds you can get two different files.
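The fix described in that commit message boils down to including the delimiter in the matched prefix. A sketch; the helper name is hypothetical:

```ruby
# Including the trailing ',' in the prefix stops :test_rstrip from matching
# an already-present tag for :test_rstrip_bang.
def already_tagged?(tag_lines, test_method)
  prefix = "exclude #{test_method.to_sym.inspect},"
  tag_lines.any? { |line| line.start_with?(prefix) }
end

tag_lines = ['exclude :test_rstrip_bang, "needs investigation"']
already_tagged?(tag_lines, 'test_rstrip')      # => false (no false positive)
already_tagged?(tag_lines, 'test_rstrip_bang') # => true
```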
… hang.

While this mode means we may continue to exclude a test for an issue that may already be fixed, the common problem is the test continues to hang. With the tag file completely cleared, we attempt to run that test again and it halts the entire process, requiring manual intervention to:

* Kill the stuck process
* Manually re-add the tag
* Re-run the tests and pipe into the error processing script without the help of `jt tag` in order to run the tests with the partially filled tag file
In some cases we have directories that match the search prefix and they will naturally not include the tagged test data.
Some tests deliberately insert broken bytes, or the test sets the encoding to something other than UTF-8. In those cases we may have strings with byte sequences that are invalid in UTF-8. Since these are limited to error messages, we can safely ignore them. What we really need is to split each line in the results file apart and extract key pieces of information. The method names, which are most important for setting up any test exclusions, must be made up of valid bytes.
Scrubbing isn't an entirely fool-proof method, but all of the class names thus far have been ASCII-only, and that's really the data we're after. Non-UTF-8 byte sequences elsewhere in the test will be mangled but we're not doing anything with them here. When we run the tests they'll execute with the correct encoding.
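Ruby's String#scrub covers this case directly. A minimal sketch, assuming a log line read as raw bytes with the method name we want in plain ASCII:

```ruby
# A log line containing a byte that is invalid in UTF-8. Forcing the encoding
# and scrubbing replaces only the broken byte; the ASCII class and method
# names we actually care about survive intact.
raw = "TestString#test_setbyte: bad byte \xFF here".dup.force_encoding(Encoding::UTF_8)
clean = raw.scrub('?')

clean.valid_encoding?          # => true
clean.include?('test_setbyte') # => true
```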
We tend not to retag the MRI tests very frequently. Having the Ruby version information there gives an idea of when the test was last run and what build type was used (e.g., native vs JVM). It's easy to ping-pong on retagging slow tests because they're necessarily dependent on the version of TruffleRuby and the system the tests were run on. Providing that context should help us keep track of what should be reevaluated and when a slow tag should be retained.
We currently run the MRI tests with assertions enabled. Those assertions all escape the interpreter and result in the process failing. That can result in the existing tag file getting deleted without being repopulated.

What we do here is capture stderr along with the test log and parse it to see if we have any escaped exceptions. If so, tag the corresponding test and then tell the retagger to rerun the test. The new tag will be retained so that test will not run again, allowing the testing process to make forward progress. This makes retagging more reliable and less frustrating.
…ed symbols.

When we encounter the usage of an undefined symbol the process exits, which means we don't end up running all of the tests. Much like we do with assertion errors, this will now tag the associated test and then restart the tagging operation, retaining the newly created tags so the associated tests will be skipped.
When a test times out, it often will not print a trace. The previous logic would try to extract one anyway, and that could result in misattribution since the data format doesn't match what the search logic is looking for. Rather than search for something that isn't there, we can now tag the test with a message indicating the test timed out and move on to processing the next failure or error.
…MRI test log file.

Also reorganized the code a bit since the comments are long and it was breaking up the flow of the code.
With enhancements to the retagging process we now have the ability to generate much more helpful reasons for each tag. This change replaces most "needs investigation" generic comments with something that makes it much easier to see why an MRI test failed. Additionally, we had many stale tags, which have now been deleted as part of the bulk retagging process.
Now that we have robust retry logic on errors that bubble out and stop the test runner process, we don't need to retain those tags; they can be safely recomputed. Other patterns are just obsolete and should have been removed when the last reference to them was removed in the tag files.
@andrykonchin (Member)

@nirvdrum The PR is approved and I am happy to merge it now as is. Please let me know if you are going to make any other changes so I can mirror the PR.

@nirvdrum (Collaborator Author)

If you want to merge now, I'll address some of your other feedback in a follow-up PR. I took care of the issue with patterns possibly matching against test names, which I think was the biggest issue.

@eregon (Member) commented Sep 25, 2024

Let's merge now to make sure we don't get conflicts with the Ruby 3.3 work

@andrykonchin andrykonchin added the in-ci The PR is being tested in CI. Do not push new commits. label Sep 25, 2024
@graalvmbot graalvmbot merged commit 6d05853 into oracle:master Sep 27, 2024
14 checks passed
Labels: in-ci, OCA Verified, shopify