Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General Protection Fault while looking up a file #16608

Open
snajpa opened this issue Oct 5, 2024 · 4 comments
Open

General Protection Fault while looking up a file #16608

snajpa opened this issue Oct 5, 2024 · 4 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@snajpa
Copy link
Contributor

snajpa commented Oct 5, 2024

System information

Type Version/Name
Distribution Name vpsAdminOS
Distribution Version staging
Kernel Version 6.10.12
Architecture x86_64
OpenZFS Version e8cbb59

Describe the problem you're observing

Turns out the problems logged in #16594 were always early after the machine has booted, but that wasn't the problem causing the crashes, this seems to be it. After we applied fix for the Bad Page State bug (b052035), this is what we're left with, at least it's without noise now :D

Something about this step_into rings a bell though, I think I've seen it this year already, grepping through the log archive now.

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Oops: general protection fault, probably for non-canonical address 0x7b954b40ed8e20d3: 0000 [#1] PREEMPT SMP NOPTI
CPU: 100 PID: 4140414 Comm: apache2 Not tainted 6.10.12 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7515/07PXPY, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.844/ct.6662/user-owned/lxc.payload.6662/system.slice/apache2.service
RIP: 0010:step_into+0x57e/0x760
Code: 83 38 00 0f 84 6b 01 00 00 48 c7 c6 8a 4e 38 8d bf a6 06 00 00 49 c7 c4 f3 ff ff ff e8 fb 27 df ff e9 8a fd ff ff 49 8b 45 30 <4c> 8b 78 08 f6 43 38 40 0f 84 ce 00 00 00 49 83 c6 10 4c 89 ee 31
RSP: 0018:ffffbc951d09fc48 EFLAGS: 00010246
RAX: 7b954b40ed8e20cb RBX: ffffbc951d09fce0 RCX: 000000006700b063
RDX: 0000000000000001 RSI: ffff9c19b3ed72e8 RDI: ffff9c1beb762900
RBP: ffffbc951d09fc98 R08: 0000000000000001 R09: 0000000000000065
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: ffff9c19b3ed72e8 R14: ffffbc951d09fd48 R15: 0000000000000000
FS:  00007fd7957f2bc0(0000) GS:ffff9c543fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005569e774408c CR3: 0000005febc76005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? die_addr+0x32/0x80
 ? exc_general_protection+0x1b9/0x3c0
 ? asm_exc_general_protection+0x22/0x30
 ? step_into+0x57e/0x760
 ? step_into+0x158/0x760
 ? srso_alias_return_thunk+0x5/0xfbef5
 path_lookupat+0x6a/0x1a0
 filename_lookup+0xdf/0x1d0
 vfs_statx+0x92/0x180
 __do_sys_newstat+0x66/0xb0
 do_syscall_64+0x9a/0x1a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fd79410f725
Code: 00 00 00 75 05 48 83 c4 18 c3 e8 d6 53 02 00 66 0f 1f 44 00 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 04 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 31 b7 2d 00 f7 d8 64 89
RSP: 002b:00007fff31054fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd79410f725
RDX: 00007fff31055070 RSI: 00007fff31055070 RDI: 00007fd7941b5108
RBP: 0000000000000000 R08: 000000000000000a R09: 00007fd79416ede0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd7941b5108
R13: 0000000000000001 R14: 00007fff31055420 R15: 0000000000007d6d
 </TASK>
---[ end trace 0000000000000000 ]---
RIP: 0010:step_into+0x57e/0x760
Code: 83 38 00 0f 84 6b 01 00 00 48 c7 c6 8a 4e 38 8d bf a6 06 00 00 49 c7 c4 f3 ff ff ff e8 fb 27 df ff e9 8a fd ff ff 49 8b 45 30 <4c> 8b 78 08 f6 43 38 40 0f 84 ce 00 00 00 49 83 c6 10 4c 89 ee 31
RSP: 0018:ffffbc951d09fc48 EFLAGS: 00010246
RAX: 7b954b40ed8e20cb RBX: ffffbc951d09fce0 RCX: 000000006700b063
RDX: 0000000000000001 RSI: ffff9c19b3ed72e8 RDI: ffff9c1beb762900
RBP: ffffbc951d09fc98 R08: 0000000000000001 R09: 0000000000000065
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: ffff9c19b3ed72e8 R14: ffffbc951d09fd48 R15: 0000000000000000
FS:  00007fd7957f2bc0(0000) GS:ffff9c543fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005569e774408c CR3: 0000005febc76005 CR4: 0000000000770ef0

BUG: kernel NULL pointer dereference, address: 0000000000000470
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: Oops: 0000 [#2] PREEMPT SMP NOPTI
CPU: 115 PID: 91359 Comm: php-cgi Tainted: G      D            6.10.12 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7515/07PXPY, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.777/ct.6175/user-owned/lxc.payload.6175/system.slice/apache2.service
RIP: 0010:link_path_walk.part.0.constprop.0+0x329/0x370
Code: ff ff 48 8b 54 24 10 4c 8b 6c 24 18 e9 ab fe ff ff ba ec ff ff ff e9 b1 fd ff ff 48 8b 43 30 48 8b 3c 24 8b 50 04 48 8b 40 38 <48> 8b b0 70 04 00 00 e8 ab d3 03 00 31 d2 89 83 e0 00 00 00 48 8b
RSP: 0018:ffffbc94cbde3c48 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffbc94cbde3ce0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 61c8864680b583eb RDI: ffffffff8db8dee0
RBP: 8080808080808080 R08: ffffffffb3bebdbb R09: 0000000000000000
R10: ffff9cae862e4cc0 R11: 0000000000000002 R12: fefefefefefefeff
R13: ffff9cb8d78e706a R14: d0d0d0d0d0d0d0d0 R15: 2f2f2f2f2f2f2f2f
FS:  00007f914b140500(0000) GS:ffff9cd3becc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000470 CR3: 000000622b43c003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die+0x1f/0x60
 ? page_fault_oops+0x17d/0x550
 ? exc_page_fault+0x67/0x140
 ? asm_exc_page_fault+0x22/0x30
 ? link_path_walk.part.0.constprop.0+0x329/0x370
 ? link_path_walk.part.0.constprop.0+0xaa/0x370
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? path_init+0x304/0x420
 path_lookupat+0x3e/0x1a0
 filename_lookup+0xdf/0x1d0
 vfs_statx+0x92/0x180
 __do_sys_newlstat+0x63/0xb0
 do_syscall_64+0x9a/0x1a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f914ad16a26
Code: 34 0e 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 41 89 f8 48 89 f7 48 89 d6 41 83 f8 01 77 29 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 c3 90 48 8b 15 39 34 0e 00 f7 d8 64 89 02
RSP: 002b:00007ffe45056dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007f914ad16a26
RDX: 00007ffe45056e30 RSI: 00007ffe45056e30 RDI: 00007ffe450570f0
RBP: 00007ffe45056f00 R08: 0000000000000001 R09: 0000000000000001
R10: 000000000000004a R11: 0000000000000246 R12: 00007ffe450570f0
R13: 0000000000000001 R14: 0000000000000001 R15: 00007ffe45056f20
 </TASK>
CR2: 0000000000000470
---[ end trace 0000000000000000 ]---
RIP: 0010:step_into+0x57e/0x760
Code: 83 38 00 0f 84 6b 01 00 00 48 c7 c6 8a 4e 38 8d bf a6 06 00 00 49 c7 c4 f3 ff ff ff e8 fb 27 df ff e9 8a fd ff ff 49 8b 45 30 <4c> 8b 78 08 f6 43 38 40 0f 84 ce 00 00 00 49 83 c6 10 4c 89 ee 31
RSP: 0018:ffffbc951d09fc48 EFLAGS: 00010246
RAX: 7b954b40ed8e20cb RBX: ffffbc951d09fce0 RCX: 000000006700b063
RDX: 0000000000000001 RSI: ffff9c19b3ed72e8 RDI: ffff9c1beb762900
RBP: ffffbc951d09fc98 R08: 0000000000000001 R09: 0000000000000065
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: ffff9c19b3ed72e8 R14: ffffbc951d09fd48 R15: 0000000000000000
FS:  00007f914b140500(0000) GS:ffff9cd3becc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000470 CR3: 000000622b43c003 CR4: 0000000000770ef0
PKRU: 55555554
@snajpa snajpa added the Type: Defect Incorrect behavior (e.g. crash, hang) label Oct 5, 2024
@snajpa
Copy link
Contributor Author

snajpa commented Oct 5, 2024

(gdb) l *step_into+0x57e
0xffffffff813b9bfe is in step_into (./include/linux/audit.h:375).
370     }
371     static inline void audit_inode(struct filename *name,
372                                     const struct dentry *dentry,
373                                     unsigned int aflags) {
374             if (unlikely(!audit_dummy_context()))
375                     __audit_inode(name, dentry, aflags);
376     }
377     static inline void audit_file(struct file *file)
378     {
379             if (unlikely(!audit_dummy_context()))

step_into() fs/namei.c#L1874 -> pick_link() fs/namei.c#L1775 -> may_follow_link() fs/namei.c#L1125 -> audit_inode() include/linux/audit.h#L375

@snajpa
Copy link
Contributor Author

snajpa commented Oct 5, 2024

OK, this really is not the first time we're seeing the issue, but it was harder to correlate after a myriad of config changes, like turning off selinux in the kernel config.

This was with OpenZFS @ 8f2f6cd:

step_into ends a bit earlier b/c there's selinux in the mix

general protection fault, probably for non-canonical address 0x11c5fee1a757ad96: 0000 [#5037] PREEMPT SMP NOPTI
CPU: 120 PID: 1957892 Comm: systemctl Tainted: G      D    OE      6.6.21 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.0.3 01/15/2021
In memory cgroup /osctl/pool.tank/group.default/user.3639/ct.24657/user-owned/lxc.payload.24657/system.slice/snapd.service
RIP: 0010:selinux_inode_follow_link+0x96/0xd0
Code: 00 00 8b 5c 01 04 e8 49 fd ff ff 85 c0 75 4c 48 8b 45 38 48 85 c0 74 12 48 63 15 ad af d7 00 48 01 d0 48 3d 00 f0 ff ff 77 16 <0f> b7 50 20 8b 70 1c 49 89 e0 b9 02 00 00 00 89 df e8 04 6e ff ff
RSP: 0018:ffff9cb5b2633be0 EFLAGS: 00010203
RAX: 11c5fee1a757ad76 RBX: 0000000000000001 RCX: ffff8a654183dbe0
RDX: 0000000000000010 RSI: 0000000000000000 RDI: ffff8956997b3218
RBP: ffff8956997b3218 R08: 0000000000000051 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000002 R12: ffff8956997b3218
R13: ffff8982b8f0a600 R14: ffff9cb5b2633e18 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8a8c7ee00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe4fb3a0270 CR3: 00000121fc2b4000 CR4: 0000000000350ee0
Call Trace:
 <TASK>
 ? die_addr+0x36/0x90
 ? exc_general_protection+0x1af/0x3d0
 ? asm_exc_general_protection+0x26/0x30
 ? selinux_inode_follow_link+0x96/0xd0
 security_inode_follow_link+0x45/0x70
 step_into+0x157/0x760
 ? srso_return_thunk+0x5/0x10
 link_path_walk.part.0.constprop.0+0x246/0x380
 ? srso_return_thunk+0x5/0x10
 ? path_init+0x290/0x3c0
 path_openat+0xa7/0x1160
 ? srso_return_thunk+0x5/0x10
 do_filp_open+0xb3/0x160 
 do_sys_openat2+0xab/0xe0
 __x64_sys_openat+0x6e/0xa0
 do_syscall_64+0x3e/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fe4fb694b18
Code: f9 41 89 f0 41 83 e2 40 75 30 89 f0 25 00 00 41 00 3d 00 00 41 00 74 22 44 89 c2 4c 89 ce bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 c3 0f 1f 80 00 00 00 00 48 8d 44 24 08 c7
RSP: 002b:00007ffd37c03348 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007fe4fb694b18
RDX: 0000000000080000 RSI: 00007ffd37c033b0 RDI: 00000000ffffff9c
RBP: 00007ffd37c033a0 R08: 0000000000080000 R09: 00007ffd37c033b0
R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffd37c035b7
R13: 00007ffd37c035d0 R14: 00007ffd37c033b0 R15: 00007fe4fb6a92e0
 </TASK>
---[ end trace 0000000000000000 ]---

@snajpa
Copy link
Contributor Author

snajpa commented Oct 5, 2024

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 0 P4D 0
Oops: 0010 [#1] PREEMPT SMP NOPTI
CPU: 77 PID: 2451139 Comm: ps Tainted: G           OE K    6.8.8-2.1 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7525/0PYVT1, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.859/ct.6786/user-owned/lxc.payload.6786/system.slice/cron.service
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffffb94a7f353ca0 EFLAGS: 00010246
RAX: ffffffffc18c9b40 RBX: ffffb94a7f353db0 RCX: 0000000000000007
RDX: ffffb94a7f353e28 RSI: ffffa18cd24ed028 RDI: 0000000000000000
RBP: ffffb94a7f353cf8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000000
R13: ffffa18cd24ed028 R14: ffffb94a7f353e28 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffa102bf940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000008a842c000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 ? __die+0x23/0x70
 ? page_fault_oops+0x181/0x490
 ? exc_page_fault+0x68/0x150
 ? asm_exc_page_fault+0x26/0x30
 step_into+0x5be/0x760
 path_openat+0x132/0x1180
 do_filp_open+0xb3/0x160
 do_sys_openat2+0xab/0xe0
 __x64_sys_openat+0x6e/0xa0
 do_syscall_64+0xab/0x1b0
 entry_SYSCALL_64_after_hwframe+0x79/0x81
RIP: 0033:0x7fe4cb378b18
Code: f9 41 89 f0 41 83 e2 40 75 30 89 f0 25 00 00 41 00 3d 00 00 41 00 74 22 44 89 c2 4c 89 ce bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 c3 0f 1f 80 00 00 00 00 48 8d 44 24 08 c7
RSP: 002b:00007ffd845bc7d8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe4cb378b18
RDX: 0000000000080000 RSI: 00007fe4cb351b30 RDI: 00000000ffffff9c
RBP: 00007ffd845bc830 R08: 0000000000080000 R09: 00007fe4cb351b30
R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffd845bc897
R13: 00007ffd845bc8b0 R14: 00007fe4cb351b30 R15: 00007fe4cb38c040
 </TASK>
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
(gdb) l *step_into+0x5be
0xffffffff813aebae is in step_into (fs/namei.c:1804).
1799            if (!res) {
1800                    const char * (*get)(struct dentry *, struct inode *,
1801                                    struct delayed_call *);
1802                    get = inode->i_op->get_link;
1803                    if (nd->flags & LOOKUP_RCU) {
1804                            res = get(NULL, inode, &last->done);
1805                            if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
1806                                    res = get(link->dentry, inode, &last->done);
1807                    } else {
1808                            res = get(link->dentry, inode, &last->done);

@snajpa
Copy link
Contributor Author

snajpa commented Oct 6, 2024

Interestingly, while this happens, arc_evict and arc_prune are both active at 70-100% cpu each.

This is a list of processes in D state with their stacks after a few processes died with ^ similar oopses:
node6-dstakz.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

1 participant