Rework MRI test exclusions #3658
I'm going to consolidate some of these commits. Others I still like having separate since it makes it easier to track the changes. Once I've finished with that I'll mark the PR ready for review.
Force-pushed from b3caecf to cc71b4e
I've tidied up the PR, squashing commits where I think it made sense to do so. The ones I've kept as individual commits are each logically encapsulated and the commit message provides context on the change -- I think they're worth keeping. There's some work I'd like to do around the … Additionally, when I was trying to reprocess all MRI tests I found that the change in cabb9b5 to make …
Thank you for this PR, it looks great.
I haven't reviewed `tool/parse_mri_errors.rb` in detail, I'll let @andrykonchin do that.
Did you run the tests in JVM mode or Native mode?
I did the retagging with JVM so we could get the assertion errors. There were one or two new tags I needed when I ran on native, which were caught in CI when I opened the draft PR. After that CI failure I ran all the MRI tests with the new tags on both Native and JVM on Linux (x86_64) and macOS (ARM64). It was a little clearer before I squashed some commits.
```ruby
puts '2. Parsing errors'
sh 'ruby', 'tool/parse_mri_errors.rb', output_file, continue_on_failure: true

process_tests = $LAST_SH_STATUS.exitstatus == 2
```
It would be helpful to add some context for this magic number (e.g. to make it a constant or to add a comment and mention `tool/parse_mri_errors.rb`'s `RETRY_EXIT_STATUS`).
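A minimal sketch of what the suggestion might look like; `RETRY_EXIT_STATUS` mirrors the constant mentioned from `tool/parse_mri_errors.rb`, and the helper name `process_tests?` is purely illustrative (the real code checks `$LAST_SH_STATUS.exitstatus` inline):

```ruby
# Hypothetical sketch: name the magic exit code so the handshake between
# jt and the parser is self-documenting. The value 2 comes from the diff.
RETRY_EXIT_STATUS = 2

# Stand-in for the inline check of $LAST_SH_STATUS.exitstatus in jt.
def process_tests?(exitstatus)
  exitstatus == RETRY_EXIT_STATUS
end
```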
```ruby
error_lines = error.split("\n")
index = 0

while should_skip_error?(error_lines[index])
```
subjective:

```diff
- while should_skip_error?(error_lines[index])
+ while should_skip_error_line?(error_lines[index])
```
```ruby
# Mismatched exception messages span three lines. It's much more useful if they're combined into one message.
elsif error_display =~ /but the message doesn't match/
  error_display << ' ' + error_lines[index + 1..].join(' ')
```
just for consistency:

```diff
- error_display << ' ' + error_lines[index + 1..].join(' ')
+ error_display << ' ' << error_lines[index + 1..].join(' ')
```
That's fine. To me, I'm appending a new string made up from the parts on the right-hand side, but it doesn't really matter.
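As a small standalone illustration of the merging being discussed (the input lines here are invented, not real MRI log output):

```ruby
# A mismatched expectation spans two lines in the log; joining them
# produces one greppable message, mirroring the logic in the diff above.
error_lines = ['<"foo"> expected but was', '<"bar">.']
error_display = error_lines[0].dup
index = 0
if error_display =~ /expected but was/
  index += 1
  error_display << ' ' << error_lines[index]
end
# error_display now holds the combined one-line message.
```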
```ruby
# Mismatched expectations span two lines. It's much more useful if they're combined into one message.
if error_display =~ /expected but was/
  index += 1
  error_display << ' ' + error_lines[index]
```
```diff
- error_display << ' ' + error_lines[index]
+ error_display << ' ' << error_lines[index]
```
```ruby
prefix = "exclude #{test_method.strip.to_sym.inspect}"
new_line = "#{prefix}, #{(REASON || error).inspect}\n"
platform_guard = platform ? " if RUBY_PLATFORM.include?('#{platform}')" : ''
new_line = "#{prefix} #{(REASON || error_display).inspect}#{platform_guard}\n"

FileUtils.mkdir_p(File.dirname(file))
lines = File.exist?(file) ? File.readlines(file) : []

# we need the ',' to handle a case when one test name is a substring of another test name
```
probably this comment is more relevant to the `prefix = "exclude...` line above.
```ruby
end

require 'fileutils'
def exclude_test!(class_name, test_method, error_display, platform = nil)
```
It might make sense to pass `platform` as a keyword argument (it's passed only 1 or 2 times).
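A sketch of how the keyword-argument version might read. The method name `exclusion_line` and its reduced signature are hypothetical; the real `exclude_test!` also writes the line into the tag file:

```ruby
# Hypothetical simplified variant taking platform as a keyword argument
# instead of a trailing positional default.
def exclusion_line(test_method, error_display, platform: nil)
  platform_guard = platform ? " if RUBY_PLATFORM.include?('#{platform}')" : ''
  "exclude #{test_method.inspect}, #{error_display.inspect}#{platform_guard}\n"
end
```

Call sites then read naturally as `exclusion_line(:test_foo, 'assertion failed', platform: 'darwin')`.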
```ruby
exclude :test_02_unknown, "<\"DRbTests::DRbEx::\"> expected but was <\"DRbTests::DRbEx::FooBar\">."
exclude :test_06_timeout, "[Timeout::Error] exception expected, not #<DRb::DRbConnError: execution expired>."
exclude :test_11_remote_no_method_error, "[DRb::DRbRemoteError] exception expected, not #<NoMethodError: undefined method `invoke_no_method' for #<DRbTests::DRbEx:0x4d8>>."
```
Wondering why these failing tests aren't already excluded (as well as `TestBundlerGem.rb`, `TestBignum_Pack.rb`, `TestNetHTTPForceEncoding.rb`, etc). Does it mean they are not run on CI at all?
cc @eregon
That one at least is from `test/mri/failing.exclude`, which excludes `test/mri/tests/drb/test_drb.rb` entirely.
Maybe we should remove the drb tests from there then if they work?
One concern about DRb is that it uses `ObjectSpace._id2ref`, which can be really slow, so tests doing that should probably be tagged as slow/very slow, which might happen automatically.
Not sure about TestBundlerGem; I don't see anything excluding that. You could check the CI logs to see if that gets run in CI currently. If yes, then maybe there are extra failures when run by @nirvdrum compared to CI for some reason?
I ran a `for` loop in my shell over each of the MRI test files and ran `jt retag` on them. It looks like we originally tagged these files differently. I think it's quite convenient being able to retag everything in a shell loop, so I'd like to keep the new tags. I didn't know about `test/mri/failing.exclude`.
I can either delete the entries there and keep the individual tags or I can update `jt retag` to compare the current test file against whatever's in `failing.exclude`. I think I'm slightly partial to the former since that would give us reasons for each exclusion. But, if we just don't support a particular feature, I could see a bulk block making sense.
I'm surprised `jt test mri FILE` when FILE is in `failing.exclude` would even run, given on master it doesn't in that case:

```
$ jt test mri test/mri/tests/drb/test_drb.rb
Using Interpreted TruffleRuby: mxbuild/truffleruby-jvm
No test files! (probably filtered out by failing.exclude)
```

But I guess with this PR's changes it behaves/behaved differently.
Anyway, if these can run it seems fine to remove them from `failing.exclude`, unless marked as too `slow`.
It's also OK to do nothing about these and just merge as it is, as it might be a rabbit hole to remove from this failing list.
…s can cause false positives. We might process two tests named `test_rstrip` and `test_rstrip_bang`. Without the delimiter as part of the prefix, `test_rstrip` would look like a prefix for `test_rstrip_bang`, causing the tool to skip `test_rstrip_bang` thinking it was already added. Depending on the order the tests were processed in you could get two different sets of tags with the same test failures. This was causing us to lose some tags and get different results when retagging MRI tests with random seeds.
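A minimal illustration of the false positive described in the commit message, simplified from the actual matching in `tool/parse_mri_errors.rb` (the tag strings here are made up):

```ruby
# New entry being generated for test_rstrip_bang.
new_line = 'exclude :test_rstrip_bang, "assertion failed"'

# Existing exclusion prefix for a different test whose name is a
# prefix of the new test's name.
naive_existing = 'exclude :test_rstrip'   # no delimiter after the name
fixed_existing = 'exclude :test_rstrip,'  # comma delimits the full name

# Without the comma, test_rstrip looks like a prefix of the new entry,
# so the tool would wrongly skip test_rstrip_bang as already added.
already_tagged_naive = new_line.start_with?(naive_existing)
# With the comma the two test names are kept distinct.
already_tagged_fixed = new_line.start_with?(fixed_existing)
```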
The old logic emitted the tags in the order the tests were executed. As a result, if tests are run with random seeds you can get two different files.
… hang. While this mode means we may continue to exclude a test for an issue that may be fixed, the common problem is that the test continues to hang. With the tag file completely cleared, we attempt to run that test again and it halts the entire process, requiring manual intervention to:

* Kill the stuck process
* Manually re-add the tag
* Re-run the tests and pipe into the error processing script without the help of `jt tag` in order to run the tests with the partially filled tag file
In some cases we have directories that match the search prefix and they will naturally not include the tagged test data.
…e for a tagged test.
Some tests deliberately insert broken bytes or the test sets the encoding to something other than UTF-8. In those cases we may have strings with byte sequences in them that are invalid in UTF-8. Since these are limited to error messages, we can safely ignore them. What we really need to be able to do is split each line in the results file apart and extract key pieces of information. The method names, which are most important for setting up any test exclusions, must be made up of valid bytes.
Scrubbing isn't an entirely fool-proof method, but all of the class names thus far have been ASCII-only, and that's really the data we're after. Non-UTF-8 byte sequences elsewhere in the test will be mangled but we're not doing anything with them here. When we run the tests they'll execute with the correct encoding.
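The scrubbing idea can be sketched with Ruby's `String#scrub` (the log line below is invented for illustration):

```ruby
# A log line whose error message contains a byte invalid in UTF-8.
# The method/class names at the front are plain ASCII and must survive.
line = "test_broken_bytes(TestEncoding): expected \xFF but was ..."
line.valid_encoding?        # false: the \xFF byte is not valid UTF-8

# Replace invalid bytes rather than crashing while parsing the log.
scrubbed = line.scrub('?')
scrubbed.valid_encoding?    # true: only the error message was mangled
```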
…s raised but with the wrong message.
We tend not to retag the MRI tests very frequently. Having the test Ruby version information there gives an idea of when the test was last run and what build type was used (e.g., native vs JVM). It's easy to ping-pong on retagging slow tests because they're necessarily dependent on the version of TruffleRuby and the system the tests were run on. Providing that context should help us keep track of what should be reevaluated and when a slow tag should be retained.
We currently run the MRI tests with assertions enabled. Those assertions all escape the interpreter and result in the process failing. That can result in the existing tag file getting deleted without being repopulated. What we do here is capture stderr along with the test log and parse it to see if we have any escaped exceptions. If so, tag the corresponding test and then tell the retagger to rerun the test. The new tag will be retained so that test will not run again, allowing the testing process to make forward progress. This makes retagging more reliable and less frustrating.
…ed symbols. When we encounter the usage of an undefined symbol the process exits, which means we don't end up running all of the tests. Much like we do with assertion errors, this will now tag the associated test and then restart the tagging operation, retaining the newly created tags so the associated tests will be skipped.
…killed while running.
When a test times out it often will not print a trace. The previous logic would try to extract one anyway, and that could result in misattribution since the data format doesn't match what the search logic is looking for. Rather than search for something that isn't there, we can now tag the test with a message indicating the test timed out and move on with processing the next failure or error.
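The timeout branch might look roughly like this; the method name `tag_reason` and the message text are illustrative, not the parser's actual code:

```ruby
# When a test was killed by a timeout there is no backtrace to mine,
# so emit a fixed reason instead of misattributing unrelated log lines.
def tag_reason(error_lines, timed_out:)
  return 'test timed out' if timed_out
  error_lines.join(' ')
end
```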
…MRI test log file. Also reorganized the code a bit since the comments are long and it was breaking up the flow of the code.
Force-pushed from 0fd8968 to 7aec4c5
With enhancements to the retagging process we now have the ability to generate much more helpful reasons for each tag. This change replaces most "needs investigation" generic comments with something that makes it much easier to see why an MRI test failed. Additionally, we had many stale tags, which have now been deleted as part of the bulk retagging process.
Now that we have robust retry logic on errors that bubble out and stop the test runner process, we don't need to retain those tags; they can be safely recomputed. Other patterns are just obsolete and should have been removed when the last reference to them was removed in the tag files.
…inst the test name.
Force-pushed from 7aec4c5 to 6d05853
@nirvdrum The PR is approved and I am happy to merge it now as is. Please let me know if you are going to make any other changes so I can mirror the PR.
If you want to merge now, I'll address some of your other feedback in a follow-up PR. I took care of the issue with patterns possibly matching against test names, which I think was the biggest issue.
Let's merge now to make sure we don't get conflicts with the Ruby 3.3 work |
This is a massive overhaul of our MRI test tags. While dealing with a regression a while back I realized we had an MRI test that would have caught it but the relevant test was never untagged. I think because retagging is so error-prone and so time-consuming almost no one ever updates the tag list. Consequently, we had a lot of things tagged that we didn't need to.
To update the tags I took a significant pass over the retagging process. We already made an effort to extract a meaningful reason message for each tag, but that process was quite limited. And we still had a ton of tags that had `"needs investigation"` as the reason string. So, I added more extensive parsing to extract a useful message in almost all cases.

Retaining Tags
Because retagging deletes the original tag file and repopulates it with the latest test results, we'd often end up deleting tags that were added manually because the tests only fail sporadically or only fail on a particular platform. If they didn't fail in the latest run then the retagger would think the tests passed. But, they'd often fail in CI until someone went and re-added the missing tags. It was an incredibly slow and tedious process.
To solve this problem I changed the retagging process. The process now looks like:
* Tags matching the `MRI_TEST_RETAG_RETAIN_PATTERNS` regexps are retained.

Each of the retained tags will need to be manually deleted if we want to see if a test passes without the tag present. The expectation is that developers working on a particular piece of functionality will check if any of these relevant retained tags are still needed. However, having gone through the list of them, most of them are unlikely to ever be untagged because the test just doesn't make sense on TruffleRuby.
Retrying Tag Generation
Another major issue we had with retagging is that if the test process exited, the tag file was left in an incomplete state. The VM crashing is an obvious, but rare, example. A more common one is an exception that escapes the interpreter. Since we run the MRI tests with VM assertions enabled and we don't translate Java `AssertionError` instances to Ruby exceptions, any time an assertion failed the process would exit. When that happens the tag file needs to be restored from git and the developer needs to manually add a tag for the test that caused the crash, then start the tagging process again. Without the ability to retain tags, that really meant we used to have to modify the tests to remove the one that caused the crash. Suffice it to say, it was a very tedious and time-intensive process.

To solve this problem I now capture stderr to the MRI test log file. We didn't do that before because the log file parser didn't deal with those errors at all and we needed to present them to the developer. The log file parser now handles the following cases:

* `dyld` errors (macOS)

Since the process crashes, the test log file is incomplete. To avoid having to manually re-run the tests there's a new protocol:
* The new tags are matched by `MRI_TEST_RETAG_RETAIN_PATTERNS` so they will not be removed when retagging.
* The parser's exit status tells `jt` that the test should be retried.
* `jt`, which is driving the retag process, will check the status code and if it's `2`, then it restarts the tagging process for the current test (basically a `while` loop around steps 5 & 6 in the workflow for retaining tags).

The retry loop for retagging can take a long time. We have some tests that are just very slow. Some of the C extension tests will continually trip over missing symbols. But, critically, this process does not require manual intervention. A developer should be able to run `jt retag` on any test file and get an accurate set of tags without having to clean up after a crash (well, you'll need to manually clean up any hs_err logs).

Slow Tests
Some of the MRI tests are really slow. Anything that was previously manually tagged as slow will match `MRI_TEST_RETAG_RETAIN_PATTERNS` and be retained on retagging, meaning those tests won't be run at all. But, that process still requires someone to recognize a slow test and manually add a tag.

The MRI test log file has a line showing how long each test took. It's not entirely accurate because it's showing the time the test took to run once the VM was booted, but it's better than nothing. I've augmented the log file parser to tag any test that takes 30s+ as `"slow"` and any test that takes 60s+ as `"very slow"`. There's no behavioral difference between the two tags; it just makes it easier to grep for very slow tests.

The 30s and 60s thresholds can be changed easily in `tool/parse_mri_errors.rb`. In addition to tagging the tests as slow I've added information about the testing environment to the tag. The TruffleRuby version, build configuration, platform, CPU info, and total number of vCPUs are logged. This, I hope, will make it easier to reevaluate slow tests later on. If someone runs the test in 15s and it used to take 45s, they can use that context to hopefully determine if we've made TruffleRuby faster or if it's just a difference in the underlying system.

Tag Ordering
It irked me a bit that the order of tags was dependent on the order the tests ran in. Likewise, manually added tags were put in inconsistent locations. Now when the log file parser emits the tags, it sorts them alphabetically. Besides making the tag files easier to read, I expect this will make diffs in the future easier to review as well.
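The ordering change amounts to sorting entries before writing them out; a tiny sketch with made-up tag lines:

```ruby
# Tags used to be written in test-execution order; sorting makes the
# files stable across runs with random seeds. Entries are examples only.
tags = [
  'exclude :test_rstrip_bang, "slow"',
  'exclude :test_chomp, "needs investigation"',
  'exclude :test_lstrip, "assertion failed"',
]
sorted = tags.sort  # alphabetical, independent of execution order
```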
Regenerated Tags
I've gone through and regenerated every MRI tag file we have. This was an extremely long process, so I'd appreciate if no one messes with the tags until this gets merged. I'm hoping to land this before the Ruby 3.3 stdlib merge so we can make use of it then.
I started work on this overhaul 1.5 years ago (possibly 2 years) and every time someone touched a tag file I had to deal with a conflict. I've run this set of tags on an Ubuntu 24.04 x86_64 system and a macOS 14.6.1 ARM64 system. Each platform passed the tests three times consecutively, so I think all of the sporadic tests have been tagged. It would be good to run on the CI machine to verify, however.
CI
I ended up removing a lot of tags when reprocessing each test. Naturally, that means `jt test mri` will now run more tests. Hopefully, that won't blow out our CI budget, but it's something to look out for. On the other hand, I added tags for slow tests so those won't run in CI. We may need to tweak that. My development workstation is likely faster than GitHub Actions so there are bound to be tests that run in a reasonable amount of time on my machine but are unbearably slow in CI.