Debugging riddle of the day
One of our services failed to start on a test system (Ubuntu 12.04 on amd64).
The stdout/stderr log streams contained only the string “Permission denied”,
which was less than helpful. strace showed that the service tried to create a
file under /run, which it doesn't have write permissions to. This caused it to
bail out:
open("/run/some_service", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = -1 EACCES (Permission denied)
Grepping the source code and configuration files for /run didn't turn up
anything that could explain this open() call. Debugging with gdb gave further
hints:
Breakpoint 2, 0x00007ffff73e3ea0 in open64 () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff73e3ea0 in open64 () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff7bd69bf in shm_open () from /lib/x86_64-linux-gnu/librt.so.1
#2 0x0000000000400948 in daemonize () at service.cpp:93
#3 0x00000000004009ac in main () at main.cpp:24
(gdb) p (char*)$rdi
$1 = 0x7fffffffe550 "/run/some_service"
(gdb) frame 2
#2 0x0000000000400948 in daemonize () at service.cpp:93
93 int fd = shm_open(fname.c_str(), O_RDWR | O_CREAT, 0644);
(gdb) p fname
$2 = {...., _M_p = 0x602028 "/some_service"}}
The open("/run/some_service", ...) was caused by an shm_open("/some_service", ...).
This code works on other machines, so why does it fail on this particular
one? Can you figure it out? Bonus points if you can explain why it is trying to
access /run and not some other directory. You might find the shm_open() man
page and source code helpful.
I'll be waiting for you.
The solution is pretty evident after examining the Linux version of
shm_open(). By default, it tries to create shared memory files under /dev/shm.
If that doesn't exist, it will pick the first tmpfs mount point from
/proc/mounts.
In Ubuntu 12.04, /dev/shm is a symlink to /run/shm. On this machine the symlink
was missing, which caused shm_open() to go hunting for a tmpfs filesystem, and
/run happened to be the first one in /proc/mounts.
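To make the fallback concrete, here is a rough sketch of the "first tmpfs mount point" lookup described above. It is not the C library's actual code, which does more than this. The function name first_tmpfs_mount is made up for illustration, and the input is assumed to follow the usual /proc/mounts layout of "device mountpoint fstype options ...".

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of any library): scan text in /proc/mounts
// format and return the mount point of the first entry whose filesystem type
// is "tmpfs". Returns an empty string if no such entry exists.
std::string first_tmpfs_mount(std::istream& mounts) {
    std::string line;
    while (std::getline(mounts, line)) {
        std::istringstream fields(line);
        std::string device, mount_point, fs_type;
        // Each line starts with: device, mount point, filesystem type.
        if (fields >> device >> mount_point >> fs_type && fs_type == "tmpfs") {
            return mount_point;
        }
    }
    return std::string();
}
```

Fed an std::ifstream opened on /proc/mounts, this would return /run on a machine where /run is the first tmpfs entry, which matches the path seen in the strace output.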
Re-creating the symlink solved the problem. Why it was missing in the first place is still unclear. In the aftermath, we're also improving the error messages in this part of the code to make such issues easier to diagnose.
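As an idea of what a more helpful error message could look like, here is a minimal sketch. It is not the code we ended up with. The helper name describe_shm_open_failure and the wording of the message are assumptions made for illustration.

```cpp
#include <cerrno>
#include <cstring>
#include <string>
#include <sys/stat.h>

// Hypothetical helper: build a diagnostic for a failed shm_open() call.
// `name` is the name passed to shm_open(), `err` is the errno value captured
// immediately after the call failed.
std::string describe_shm_open_failure(const std::string& name, int err) {
    std::string msg =
        "shm_open(\"" + name + "\") failed: " + std::strerror(err);
    // stat() follows symlinks, so this also catches a dangling /dev/shm link.
    struct stat st;
    if (stat("/dev/shm", &st) != 0) {
        msg += " (/dev/shm is missing or inaccessible; the C library may have"
               " fallen back to another tmpfs mount)";
    }
    return msg;
}
```

The intended use is to call it right after shm_open() returns -1, passing the errno value saved before any other library call can overwrite it, and to log the result instead of a bare “Permission denied”.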