Rework MRI test exclusions #3658
I'm going to consolidate some of these commits. Others I still like having separate since it makes it easier to track the changes. Once I've finished with that I'll mark the PR ready for review.
Force-pushed from b3caecf to cc71b4e
I've tidied up the PR, squashing commits where I think it made sense to do so. The ones I've kept as individual commits are each logically encapsulated and the commit message provides context on the change -- I think they're worth keeping. There's some work I'd like to do around the … Additionally, when I was trying to reprocess all MRI tests I found that the change in cabb9b5 to make …
Thank you for this PR, it looks great.
I haven't reviewed `tool/parse_mri_errors.rb` in detail, I'll let @andrykonchin do that.
Did you run the tests in JVM mode or Native mode?
I did the retagging with JVM so we could get the assertion errors. There were one or two new tags I needed when I ran on native, which were caught in CI when I opened the draft PR. After that CI failure I ran all the MRI tests with the new tags on both Native and JVM on Linux (x86_64) and macOS (ARM64). It was a little clearer before I squashed some commits.
```ruby
puts '2. Parsing errors'
sh 'ruby', 'tool/parse_mri_errors.rb', output_file, continue_on_failure: true

process_tests = $LAST_SH_STATUS.exitstatus == 2
```
It would be helpful to add some context for this magic number (e.g. to make it a constant or to add a comment and mention `tool/parse_mri_errors.rb`'s `RETRY_EXIT_STATUS`).
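A minimal sketch of what the suggestion might look like; `RETRY_EXIT_STATUS` mirrors the constant mentioned from `tool/parse_mri_errors.rb`, and the helper name `process_tests?` is purely illustrative (the real code checks `$LAST_SH_STATUS.exitstatus` inline):

```ruby
# Hypothetical sketch: name the magic exit code so the handshake between
# jt and the parser is self-documenting. The value 2 comes from the diff.
RETRY_EXIT_STATUS = 2

# Stand-in for the inline check of $LAST_SH_STATUS.exitstatus in jt.
def process_tests?(exitstatus)
  exitstatus == RETRY_EXIT_STATUS
end
```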
```ruby
error_lines = error.split("\n")
index = 0

while should_skip_error?(error_lines[index])
```
subjective:

```diff
- while should_skip_error?(error_lines[index])
+ while should_skip_error_line?(error_lines[index])
```
```ruby
# Mismatched exception messages span three lines. It's much more useful if they're combined into one message.
elsif error_display =~ /but the message doesn't match/
  error_display << ' ' + error_lines[index + 1..].join(' ')
```
just for consistency:

```diff
- error_display << ' ' + error_lines[index + 1..].join(' ')
+ error_display << ' ' << error_lines[index + 1..].join(' ')
```
That's fine. To me, I'm appending a new string made up from the parts on the right-hand side, but it doesn't really matter.
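As a small standalone illustration of the merging being discussed (the input lines here are invented, not real MRI log output):

```ruby
# A mismatched expectation spans two lines in the log; joining them
# produces one greppable message, mirroring the logic in the diff above.
error_lines = ['<"foo"> expected but was', '<"bar">.']
error_display = error_lines[0].dup
index = 0
if error_display =~ /expected but was/
  index += 1
  error_display << ' ' << error_lines[index]
end
# error_display now holds the combined one-line message.
```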
```ruby
# Mismatched expectations span two lines. It's much more useful if they're combined into one message.
if error_display =~ /expected but was/
  index += 1
  error_display << ' ' + error_lines[index]
```
```diff
- error_display << ' ' + error_lines[index]
+ error_display << ' ' << error_lines[index]
```
```ruby
prefix = "exclude #{test_method.strip.to_sym.inspect}"
new_line = "#{prefix}, #{(REASON || error).inspect}\n"
platform_guard = platform ? " if RUBY_PLATFORM.include?('#{platform}')" : ''
new_line = "#{prefix} #{(REASON || error_display).inspect}#{platform_guard}\n"

FileUtils.mkdir_p(File.dirname(file))
lines = File.exist?(file) ? File.readlines(file) : []

# we need the ',' to handle a case when one test name is a substring of another test name
```
probably this comment is more relevant to the `prefix = "exclude...` line above.
```ruby
end

require 'fileutils'
def exclude_test!(class_name, test_method, error_display, platform = nil)
```
It might make sense to pass `platform` as a keyword argument (it's passed only 1 or 2 times).
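A sketch of how the keyword-argument version might read. The method name `exclusion_line` and its reduced signature are hypothetical; the real `exclude_test!` also writes the line into the tag file:

```ruby
# Hypothetical simplified variant taking platform as a keyword argument
# instead of a trailing positional default.
def exclusion_line(test_method, error_display, platform: nil)
  platform_guard = platform ? " if RUBY_PLATFORM.include?('#{platform}')" : ''
  "exclude #{test_method.inspect}, #{error_display.inspect}#{platform_guard}\n"
end
```

Call sites then read naturally as `exclusion_line(:test_foo, 'assertion failed', platform: 'darwin')`.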
```ruby
exclude :test_02_unknown, "<\"DRbTests::DRbEx::\"> expected but was <\"DRbTests::DRbEx::FooBar\">."
exclude :test_06_timeout, "[Timeout::Error] exception expected, not #<DRb::DRbConnError: execution expired>."
exclude :test_11_remote_no_method_error, "[DRb::DRbRemoteError] exception expected, not #<NoMethodError: undefined method `invoke_no_method' for #<DRbTests::DRbEx:0x4d8>>."
```
Wondering why these failing tests aren't already excluded (as well as `TestBundlerGem.rb`, `TestBignum_Pack.rb`, `TestNetHTTPForceEncoding.rb`, etc). Does it mean they are not run on CI at all?
cc @eregon
That one at least is from `test/mri/failing.exclude`, which excludes `test/mri/tests/drb/test_drb.rb` entirely.
Maybe we should remove the drb tests from there then if they work?
One concern about DRb is that it uses `ObjectSpace._id2ref`, which can be really slow, so tests doing that should probably be tagged as slow/very slow, which might happen automatically.
Not sure about TestBundlerGem; I don't see anything excluding that. You could check the CI logs to see if that gets run in CI currently. If yes, then maybe there are extra failures when run by @nirvdrum compared to CI for some reason?
I ran a `for` loop in my shell over each of the MRI test files and ran `jt retag` on them. It looks like we originally tagged these files differently. I think it's quite convenient being able to retag everything in a shell loop, so I'd like to keep the new tags. I didn't know about `test/mri/failing.exclude`.
I can either delete the entries there and keep the individual tags or I can update `jt retag` to compare the current test file against whatever's in `failing.exclude`. I think I'm slightly partial to the former since that would give us reasons for each exclusion. But, if we just don't support a particular feature, I could see a bulk block making sense.
I'm surprised `jt test mri FILE` when FILE is in `failing.exclude` would even run, given on master it doesn't in that case:

```
$ jt test mri test/mri/tests/drb/test_drb.rb
Using Interpreted TruffleRuby: mxbuild/truffleruby-jvm
No test files! (probably filtered out by failing.exclude)
```

But I guess with this PR's changes it behaves/behaved differently.
Anyway, if these can run it seems fine to remove them from `failing.exclude`, unless marked as too `slow`.
It's also OK to do nothing about these and just merge as it is, as it might be a rabbit hole to remove from this failing list.
…s can cause false positives. We might process two tests named `test_rstrip` and `test_rstrip_bang`. Without the delimiter as part of the prefix, `test_rstrip` would look like a prefix for `test_rstrip_bang`, causing the tool to skip `test_rstrip_bang` thinking it was already added. Depending on the order the tests were processed in you could get two different sets of tags with the same test failures. This was causing us to lose some tags and get different results when retagging MRI tests with random seeds.
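A minimal illustration of the false positive described in the commit message, simplified from the actual matching in `tool/parse_mri_errors.rb` (the tag strings here are made up):

```ruby
# New entry being generated for test_rstrip_bang.
new_line = 'exclude :test_rstrip_bang, "assertion failed"'

# Existing exclusion prefix for a different test whose name is a
# prefix of the new test's name.
naive_existing = 'exclude :test_rstrip'   # no delimiter after the name
fixed_existing = 'exclude :test_rstrip,'  # comma delimits the full name

# Without the comma, test_rstrip looks like a prefix of the new entry,
# so the tool would wrongly skip test_rstrip_bang as already added.
already_tagged_naive = new_line.start_with?(naive_existing)
# With the comma the two test names are kept distinct.
already_tagged_fixed = new_line.start_with?(fixed_existing)
```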
The old logic emitted the tags in the order the tests were executed. As a result, if tests are run with random seeds you can get two different files.
… hang. While this mode means we may continue to exclude a test for an issue that may be fixed, the common problem is that the test continues to hang. With the tag file completely cleared, we attempt to run that test again and it halts the entire process, requiring manual intervention to:

* Kill the stuck process
* Manually re-add the tag
* Re-run the tests and pipe into the error processing script without the help of `jt tag` in order to run the tests with the partially filled tag file
In some cases we have directories that match the search prefix and they will naturally not include the tagged test data.
…e for a tagged test.
Some tests deliberately insert broken bytes or the test sets the encoding to something other than UTF-8. In those cases we may have strings with byte sequences in them that are invalid in UTF-8. Since these are limited to error messages, we can safely ignore them. What we really need to be able to do is split each line in the results file apart and extract key pieces of information. The method names, which are most important for setting up any test exclusions, must be made up of valid bytes.
Scrubbing isn't an entirely fool-proof method, but all of the class names thus far have been ASCII-only, and that's really the data we're after. Non-UTF-8 byte sequences elsewhere in the test will be mangled but we're not doing anything with them here. When we run the tests they'll execute with the correct encoding.
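The scrubbing idea can be sketched with Ruby's `String#scrub` (the log line below is invented for illustration):

```ruby
# A log line whose error message contains a byte invalid in UTF-8.
# The method/class names at the front are plain ASCII and must survive.
line = "test_broken_bytes(TestEncoding): expected \xFF but was ..."
line.valid_encoding?        # false: the \xFF byte is not valid UTF-8

# Replace invalid bytes rather than crashing while parsing the log.
scrubbed = line.scrub('?')
scrubbed.valid_encoding?    # true: only the error message was mangled
```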
…s raised but with the wrong message.
We tend not to retag the MRI tests very frequently. Having the test Ruby version information there gives an idea of when the test was last run and what build type was used (e.g., native vs JVM). It's easy to ping-pong on retagging slow tests because they're necessarily dependent on the version of TruffleRuby and the system the tests were run on. Providing that context should help us keep track of what should be reevaluated and when a slow tag should be retained.
We currently run the MRI tests with assertions enabled. Those assertions all escape the interpreter and result in the process failing. That can result in the existing tag file getting deleted without being repopulated. What we do here is capture stderr along with the test log and parse it to see if we have any escaped exceptions. If so, tag the corresponding test and then tell the retagger to rerun the test. The new tag will be retained so that test will not run again, allowing the testing process to make forward progress. This makes retagging more reliable and less frustrating.
…ed symbols. When we encounter the usage of an undefined symbol the process exits, which means we don't end up running all of the tests. Much like we do with assertion errors, this will now tag the associated test and then restart the tagging operation, retaining the newly created tags so the associated tests will be skipped.
…killed while running.
When a test times out it often will not print a trace. The previous logic would try to extract one anyway, and that could result in misattribution since the data format doesn't match what the search logic is looking for. Rather than search for something that isn't there, we can now tag the test with a message indicating the test timed out and move on with processing the next failure or error.
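The timeout branch might look roughly like this; the method name `tag_reason` and the message text are illustrative, not the parser's actual code:

```ruby
# When a test was killed by a timeout there is no backtrace to mine,
# so emit a fixed reason instead of misattributing unrelated log lines.
def tag_reason(error_lines, timed_out:)
  return 'test timed out' if timed_out
  error_lines.join(' ')
end
```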
…MRI test log file. Also reorganized the code a bit since the comments are long and it was breaking up the flow of the code.
Force-pushed from 0fd8968 to 7aec4c5
With enhancements to the retagging process we now have the ability to generate much more helpful reasons for each tag. This change replaces most "needs investigation" generic comments with something that makes it much easier to see why an MRI test failed. Additionally, we had many stale tags, which have now been deleted as part of the bulk retagging process.
Now that we have robust retry logic on errors that bubble out and stop the test runner process, we don't need to retain those tags; they can be safely recomputed. Other patterns are just obsolete and should have been removed when the last reference to them was removed in the tag files.
…inst the test name.
Force-pushed from 7aec4c5 to 6d05853
@nirvdrum The PR is approved and I am happy to merge it now as is. Please let me know if you are going to make any other changes so I can mirror the PR.
If you want to merge now, I'll address some of your other feedback in a follow-up PR. I took care of the issue with patterns possibly matching against test names, which I think was the biggest issue.
Let's merge now to make sure we don't get conflicts with the Ruby 3.3 work |
This is a massive overhaul of our MRI test tags. While dealing with a regression a while back I realized we had an MRI test that would have caught it but the relevant test was never untagged. I think because retagging is so error-prone and so time-consuming almost no one ever updates the tag list. Consequently, we had a lot of things tagged that we didn't need to.
To update the tags I took a significant pass over the retagging process. We already made an effort to extract a meaningful reason message for each tag, but that process was quite limited. And we still had a ton of tags that had `"needs investigation"` as the reason string. So, I added more extensive parsing to extract a useful message in almost all cases.

Retaining Tags
Because retagging deletes the original tag file and repopulates it with the latest test results, we'd often end up deleting tags that were added manually because the tests only fail sporadically or only fail on a particular platform. If they didn't fail in the latest run then the retagger would think the tests passed. But, they'd often fail in CI until someone went and re-added the missing tags. It was an incredibly slow and tedious process.
To solve this problem I changed the retagging process. The process now looks like:
* Tags matching the `MRI_TEST_RETAG_RETAIN_PATTERNS` regexps are retained.

Each of the retained tags will need to be manually deleted if we want to see if a test passes without the tag present. The expectation is that developers working on a particular piece of functionality will check if any of these relevant retained tags are still needed. However, having gone through the list of them, most of them are unlikely to ever be untagged because the test just doesn't make sense on TruffleRuby.
Retrying Tag Generation
Another major issue we had with retagging is that if the test process exited, the tag file was left in an incomplete state. The VM crashing is an obvious, but rare, example. A more common one is an exception that escapes the interpreter. Since we run the MRI tests with VM assertions enabled and we don't translate Java `AssertionError` instances to Ruby exceptions, any time an assertion failed the process would exit. When that happens the tag file needs to be restored from git and the developer needs to manually add a tag for the test that caused the crash, then start the tagging process again. Without the ability to retain tags, that really meant we used to have to modify the tests to remove the one that caused the crash. Suffice it to say, it was a very tedious and time-intensive process.

To solve this problem I now capture stderr to the MRI test log file. We didn't do that before because the log file parser didn't deal with those errors at all and we needed to present them to the developer. The log file parser now handles the following cases:

* `dyld` errors (macOS)

Since the process crashes, the test log file is incomplete. To avoid having to manually re-run the tests there's a new protocol:
* The new tags are matched by `MRI_TEST_RETAG_RETAIN_PATTERNS` so they will not be removed when retagging.
* The parser's exit status tells `jt` that the test should be retried.
* `jt`, which is driving the retag process, will check the status code and if it's `2`, then it restarts the tagging process for the current test (basically a `while` loop around steps 5 & 6 in the workflow for retaining tags).

The retry loop for retagging can take a long time. We have some tests that are just very slow. Some of the C extension tests will continually trip over missing symbols. But, critically, this process does not require manual intervention. A developer should be able to run `jt retag` on any test file and get an accurate set of tags without having to clean up after a crash (well, you'll need to manually clean up any hs_err logs).

Slow Tests
Some of the MRI tests are really slow. Anything that was previously manually tagged as slow will match `MRI_TEST_RETAG_RETAIN_PATTERNS` and be retained on retagging, meaning those tests won't be run at all. But, that process still requires someone to recognize a slow test and manually add a tag.

The MRI test log file has a line showing how long each test took. It's not entirely accurate because it's showing the time the test took to run once the VM was booted, but it's better than nothing. I've augmented the log file parser to tag any test that takes 30s+ as `"slow"` and any test that takes 60s+ as `"very slow"`. There's no behavioral difference between the two tags; it just makes it easier to grep for very slow tests.

The 30s and 60s thresholds can be changed easily in `tool/parse_mri_errors.rb`. In addition to tagging the tests as slow I've added information about the testing environment to the tag. The TruffleRuby version, build configuration, platform, CPU info, and total number of vCPUs are logged. This, I hope, will make it easier to reevaluate slow tests later on. If someone runs the test in 15s and it used to take 45s, they can use that context to hopefully determine if we've made TruffleRuby faster or if it's just a difference in the underlying system.

Tag Ordering
It irked me a bit that the order of tags was dependent on the order the tests ran in. Likewise, manually added tags were put in inconsistent locations. Now when the log file parser emits the tags, it sorts them alphabetically. Besides making the tag files easier to read, I expect this will make diffs in the future easier to review as well.
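The ordering change amounts to sorting entries before writing them out; a tiny sketch with made-up tag lines:

```ruby
# Tags used to be written in test-execution order; sorting makes the
# files stable across runs with random seeds. Entries are examples only.
tags = [
  'exclude :test_rstrip_bang, "slow"',
  'exclude :test_chomp, "needs investigation"',
  'exclude :test_lstrip, "assertion failed"',
]
sorted = tags.sort  # alphabetical, independent of execution order
```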
Regenerated Tags
I've gone through and regenerated every MRI tag file we have. This was an extremely long process, so I'd appreciate if no one messes with the tags until this gets merged. I'm hoping to land this before the Ruby 3.3 stdlib merge so we can make use of it then.
I started work on this overhaul 1.5 years ago (possibly 2 years) and every time someone touched a tag file I had to deal with a conflict. I've run this set of tags on an Ubuntu 24.04 x86_64 system and a macOS 14.6.1 ARM64 system. Each platform passed the tests three times consecutively, so I think all of the sporadic tests have been tagged. It would be good to run on the CI machine to verify, however.
CI
I ended up removing a lot of tags when reprocessing each test. Naturally, that means `jt test mri` will now run more tests. Hopefully, that won't blow out our CI budget, but it's something to look out for. On the other hand, I added tags for slow tests so those won't run in CI. We may need to tweak that. My development workstation is likely faster than GitHub Actions so there are bound to be tests that run in a reasonable amount of time on my machine but are unbearably slow in CI.