With all the talk about Debian choosing a default init system (link, link), I’ve decided to share with the world a little project I’ve been working on to help me understand /sbin/init, a.k.a. PID 1.
In this blog post I will go step by step showing what an init system must do to be functional. I will ignore all the legacy SysVinit stuff, and technical nuances, and just concentrate on what’s really important.
Introduction
First of all, what is ‘init’? In essence it’s a process that must be running at all times; if this process ends, the kernel panics, after which you cannot do anything except reboot.
This process doesn’t need to do anything special; you can use /bin/sh as your init, or even /bin/yes (although the latter wouldn’t be very useful).
So let’s write our very first init.
#!/usr/bin/ruby
Process.spawn('agetty', 'tty1')
sleep
Believe it or not, this is actually a rather useful init. How useful it is depends on how your kernel was compiled, your partitioning scheme, and whether your root file-system is mounted rw or not. But either way, it covers the basics: rule #1, always keep running no matter what.
This is almost enough, except that we need to listen for SIGCHLD; otherwise processes that exit would linger as zombies instead of being cleaned up, so:
Signal.trap(:SIGCHLD) do
  loop do
    begin
      status = Process.wait(-1, Process::WNOHANG)
      break if status == nil
    rescue Errno::ECHILD
      break
    end
  end
end
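To see what the reaping loop collects, here is a minimal sketch of the same idea as a plain method. The name reap_children and the returned hash are my own additions for illustration, not part of the init itself; it uses Process.wait2 so the exit code is visible too.

```ruby
# Reap every child that has already exited, without blocking.
# Returns a hash of pid => exit code (nil if the child was killed by a signal).
# "reap_children" is a hypothetical helper name used only in this sketch.
def reap_children
  reaped = {}
  loop do
    # With WNOHANG, wait2 returns nil when children exist but none has exited.
    pid, status = Process.wait2(-1, Process::WNOHANG)
    break if pid.nil?
    reaped[pid] = status.exitstatus
  end
  reaped
rescue Errno::ECHILD
  # Raised when there are no children left at all.
  reaped
end

# Inside an init it would be hooked up roughly like this:
#   Signal.trap(:SIGCHLD) { reap_children }
```

The loop matters because several children can exit before the handler runs once; a single wait call per signal could leave zombies behind.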
Reboot
Now that we have running indefinitely under control, it’s time to stop running (only when requested), but in order to do that we need some kind of IPC with the running process. There are many ways to achieve this, but I chose UNIX sockets.
So instead of sleeping forever, we listen for commands issued to /run/initctl:
require 'socket'

begin
  server = UNIXServer.open('/run/initctl')
rescue Errno::EADDRINUSE
  # A stale socket from a previous run: remove it and try again.
  File.delete('/run/initctl')
  retry
end
loop do
  ctl = server.accept
  cmd = ctl.readline.chomp.to_sym
  # do stuff
end
And when the user is calling us with arguments, we pass those commands through /run/initctl.
def do_cmd(*cmd)
  ctl = UNIXSocket.open('/run/initctl')
  ctl.puts(cmd.join(' '))
  puts(ctl.readline.chomp)
  exit
end
case ARGV[0]
when 'poweroff', 'restart', 'halt'
  do_cmd(ARGV[0].to_sym)
end
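The two halves above can be tried together without touching /run. This is only a sketch: the helper names serve_once and send_cmd, and the "ok"/"unknown" reply format, are assumptions of mine; the real server decides what to answer in its "do stuff" part.

```ruby
require 'socket'

# Server side: accept one connection, read one command line, answer it.
# The reply format is hypothetical, chosen only for this sketch.
def serve_once(server)
  ctl = server.accept
  cmd = ctl.readline.chomp
  known = %w[poweroff restart halt].include?(cmd)
  ctl.puts(known ? "ok #{cmd}" : "unknown #{cmd}")
  ctl.close
  cmd.to_sym
end

# Client side: send one command line and return the one-line reply.
def send_cmd(path, *cmd)
  UNIXSocket.open(path) do |ctl|
    ctl.puts(cmd.join(' '))
    ctl.readline.chomp
  end
end
```

Running serve_once in a thread against a socket in a temporary directory and calling send_cmd(path, :poweroff) from the main thread gives a complete round trip of the line-based protocol.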
So we can issue the command init poweroff to turn off the machine, but in order to do that we need to tell the kernel:
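def sys_reboot(cmd)
  # Third argument of reboot(2): the action to perform.
  map = { poweroff: 0x4321fedc, restart: 0x01234567, halt: 0xcdef0123 }
  # 169 is the reboot syscall number on x86-64; the next two values are the
  # magic numbers the kernel checks before it honours the request.
  syscall(169, 0xfee1dead, 537993216, map[cmd])
end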
The exact numbers are not important; what is important is that the kernel understands them, and with this we actually turn off the machine (or halt, or reboot).
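For the curious, the numbers do have names in the kernel’s linux/reboot.h. The sketch below only builds the argument list and never calls the kernel; the module name Reboot and the method syscall_args are made up for this illustration, and the syscall numbers listed are just the two architectures I am sure about.

```ruby
# Named versions of the magic values passed to reboot(2).
# "Reboot" and "syscall_args" are hypothetical names used only in this sketch.
module Reboot
  MAGIC1  = 0xfee1dead # LINUX_REBOOT_MAGIC1
  MAGIC2C = 0x20112000 # LINUX_REBOOT_MAGIC2C, i.e. 537993216 in decimal

  COMMANDS = {
    poweroff: 0x4321fedc, # LINUX_REBOOT_CMD_POWER_OFF
    restart:  0x01234567, # LINUX_REBOOT_CMD_RESTART
    halt:     0xcdef0123  # LINUX_REBOOT_CMD_HALT
  }.freeze

  # The syscall number depends on the architecture.
  SYSCALL_NUMBER = { x86_64: 169, i386: 88 }.freeze

  # Returns the arguments one would hand to Kernel#syscall.
  # fetch raises KeyError for an unknown command or architecture,
  # which is safer than silently passing nil to the kernel.
  def self.syscall_args(cmd, arch: :x86_64)
    [SYSCALL_NUMBER.fetch(arch), MAGIC1, MAGIC2C, COMMANDS.fetch(cmd)]
  end
end
```

With this, sys_reboot(cmd) could be written as syscall(*Reboot.syscall_args(cmd)), which behaves the same on x86-64 but fails loudly on a typo.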
Tread carefully
Obviously it would be tedious to type a bunch of commands each time the machine starts, so we need to actually do stuff after booting. However, if we do something wrong, we might render the system unusable. A simple way to solve this is to use scripts: fork a shell and let it run them, so if there’s something wrong with the scripts, the shell dies, but not PID 1, and the system remains usable, which again is rule #1.
Fortunately Ruby has exceptions, so we can run code with a safety net that catches ordinary errors (StandardError and its subclasses), and there’s no need to fork, which would waste precious booting time.
def action(name)
  print(name)
  begin
    yield
  rescue => e
    print(' (error: %s)' % e)
  end
  puts
end
With this helper, we can safely run chunks of code, and if they fail, the error is reported to the user.
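A small sketch of what the helper does and does not catch. The out: keyword is something I added here so the output can be captured; it is not in the original helper. A bare rescue only covers StandardError, so something like NotImplementedError (a ScriptError) would still escape.

```ruby
# Same helper as above, with a hypothetical out: parameter so the
# output can go somewhere other than the console.
def action(name, out: $stdout)
  out.print(name)
  begin
    yield
  rescue => e # StandardError and its subclasses only
    out.print(' (error: %s)' % e)
  end
  out.puts
end

# A failing step is reported on the same line, and execution continues:
#   action('Setting hostname') { File.read('/nonexistent/hostname') }
#   action('Next step') { }
```

Whether PID 1 should go further and rescue Exception is a judgment call; that would also swallow things like SystemExit, so I would treat it as an option to weigh rather than an obvious improvement.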
Initialization
This is the bulk of the code: the instructions you don’t want to type every time. This is mostly tedious stuff; you can skim or skip this section safely.
def mount(type, device, dir, opts)
  Dir.mkdir(dir) unless File.directory?(dir)
  system('mount', '-t', type, device, dir, '-o', opts)
end
action 'Mounting virtual file-systems' do
  mount('proc', 'proc', '/proc', 'nosuid,noexec,nodev')
  mount('sysfs', 'sys', '/sys', 'nosuid,noexec,nodev')
  mount('tmpfs', 'run', '/run', 'mode=0755,nosuid,nodev')
  mount('devtmpfs', 'dev', '/dev', 'mode=0755,nosuid')
  mount('devpts', 'devpts', '/dev/pts', 'mode=0620,gid=5,nosuid,noexec')
  mount('tmpfs', 'shm', '/dev/shm', 'mode=1777,nosuid,nodev')
end
And set the hostname.
action 'Setting hostname' do
  hostname = File.read('/etc/hostname').chomp
  File.write('/proc/sys/kernel/hostname', hostname)
end
Notice that many things can go wrong; for example the file ‘/etc/hostname’ might not exist. That would raise an exception, but our init would report it and continue just fine.
Another thing we would want to do on shutdown is kill all the processes, otherwise we might not be able to unmount the file-systems. We could do killall5, but we wouldn’t have much control over the processes, and that would require a fork. Instead we can rely on the kernel to do the right thing: sending a signal to PID -1 delivers it to every process except init itself, and all we have to do is wait for the results.
def killall
  def allgone?()
    Dir.glob('/proc/*').each do |e|
      pid = File.basename(e).to_i
      begin
        next if pid < 2
        # Is it a kernel process?
        next if File.read('/proc/%i/cmdline' % pid).empty?
      rescue Errno::ENOENT
      end
      return false
    end
    return true
  end
  def wait_until(timeout = 2, interval = 0.25)
    start = Time.now
    begin
      break true if yield
      sleep(interval)
    end while (Time.now - start) < timeout
  end
  ok = false
  action 'Sending SIGTERM to processes' do
    Process.kill(:SIGTERM, -1)
    ok = wait_until(10) { allgone? }
    raise 'Failed' unless ok
  end
  return if ok
  action 'Sending SIGKILL to processes' do
    Process.kill(:SIGKILL, -1)
    ok = wait_until(15) { allgone? }
    raise 'Failed' unless ok
  end
end
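The /proc scan in allgone? can be looked at on its own. This sketch returns the list of remaining user-space PIDs instead of a boolean; the name userspace_pids and the proc_root parameter are mine, added so the scan can be pointed at a fake directory. Note that an empty cmdline is how kernel threads show up, but zombies have an empty cmdline as well, so both are skipped.

```ruby
# List PIDs (other than 1) that still have a non-empty command line.
# "userspace_pids" and "proc_root" are hypothetical, for illustration only.
def userspace_pids(proc_root = '/proc')
  pids = []
  Dir.children(proc_root).each do |entry|
    next unless entry.match?(/\A\d+\z/) # skip /proc/self, /proc/sys, ...
    pid = entry.to_i
    next if pid < 2                     # never count init itself
    begin
      cmdline = File.read(File.join(proc_root, entry, 'cmdline'))
    rescue Errno::ENOENT, Errno::ESRCH
      next                              # the process exited while we were scanning
    end
    next if cmdline.empty?              # kernel thread (or zombie)
    pids << pid
  end
  pids.sort
end
```

Having the list rather than a yes/no answer makes it easy to print which processes ignored SIGTERM before moving on to SIGKILL.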
Time to mount real file-systems:
NETFS = %w[nfs nfs4 smbfs cifs codafs ncpfs shfs fuse fuseblk glusterfs davfs fuse.glusterfs]
VIRTFS = %w[proc sysfs tmpfs devtmpfs devpts]
action 'Mounting local filesystems' do
  except = NETFS.map { |e| 'no' + e }.join(',')
  system('mount', '-a', '-t', except, '-O', 'no_netdev')
end
# On shutdown
action 'Unmounting real filesystems' do
  except = (NETFS + VIRTFS).map { |e| 'no' + e }.join(',')
  system('umount', '-a', '-t', except, '-O', 'no_netdev')
end
If you are using a modern distribution, chances are your /run and /tmp directories are cleared on every boot, so many files and directories need to be re-created. We could do this by hand, but we could also use the systemd-tmpfiles utility, which uses the configuration already provided by your distribution in tmpfiles.d directories.
action 'Manage temporary files' do
  system('systemd-tmpfiles', '--create', '--remove', '--clean')
end
begin
  File.delete('/run/nologin')
rescue Errno::ENOENT
end
Unless you are using a custom kernel with modules built-in, chances are you are going to need udev, so fire it up:
action 'Starting udev daemon' do
  system('/usr/lib/systemd/systemd-udevd', '--daemon')
end
action 'Triggering udev uevents' do
  system('udevadm', 'trigger', '--action=add', '--type=subsystems')
  system('udevadm', 'trigger', '--action=add', '--type=devices')
end
action 'Waiting for udev uevents to be processed' do
  system('udevadm', 'settle')
end
# On shutdown
action 'Shutting down udev' do
  system('udevadm', 'control', '--exit')
end
Finally
After all this initialization stuff, your system is most likely very usable already, and in fact I was able to start a display manager (SLiM) at this point, which was my main goal while writing this. But we are just getting started.
In control
Another thing init should do is keep track of launched daemons. Each time we launch one we store its PID, and when the child exits, we remove it from the list.
def start(id, cmd)
  $daemons[id] = Process.spawn(*cmd)
end
start('agetty1', %w[agetty tty1])
# On SIGCHLD
key = $daemons.key(status)
$daemons.delete(key) if key
Once we have this it’s trivial to report their status (e.g. init status agetty1).
ctl.puts($daemons[args.first] ? 'ok' : 'dead')
At this point we actually have a feature that SysVinit doesn’t have. Not bad for 200 lines of code!
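Put together, the bookkeeping looks roughly like this. It is a sketch, not the code from my init: the class name DaemonTable is invented here, and it keeps the table in an instance variable instead of the $daemons global.

```ruby
# Minimal daemon bookkeeping: id => pid while running, removed on exit.
# "DaemonTable" is a hypothetical name used only for this sketch.
class DaemonTable
  def initialize
    @pids = {}
  end

  # Launch a command and remember its PID under the given id.
  def start(id, cmd)
    @pids[id] = Process.spawn(*cmd)
  end

  # Meant to be called from the SIGCHLD path: collect every exited
  # child and drop it from the table if we were tracking it.
  def reap
    loop do
      pid = Process.wait(-1, Process::WNOHANG)
      break if pid.nil?
      id = @pids.key(pid)
      @pids.delete(id) if id
    end
  rescue Errno::ECHILD
    nil # no children at all
  end

  # 'ok' while the daemon is tracked, 'dead' once it has been reaped.
  def status(id)
    @pids[id] ? 'ok' : 'dead'
  end
end
```

Children that were never registered (for example orphans re-parented to PID 1) are still reaped; they simply have no entry to delete.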
cgroups
cgroups is a feature that is often misunderstood, probably because there are no good tools to make use of them, but they are not that hard. Lennart Poettering went to a lot of trouble trying to explain exactly what systemd does with them and what it does not, but I don’t think he did a very good job of clarifying anything. Normally systemd is not doing anything with them (by default), simply labeling processes so you can see how they are grouped by using visualization tools like systemd-cgls, but that’s it.
The single most important way you can take advantage of cgroups is for scheduling purposes. For example, if your web browser is in one control group and your heavy compilation is in another, the Linux scheduler keeps the two groups from stealing resources from each other without the need to adjust the nice level. Basically, with cgroups there’s no need for nice (although you can use it alongside).
But you don’t have to lift a finger to get this benefit; the kernel already does that if you have CONFIG_SCHED_AUTOGROUP, which you should. Then cgroups would be created for each session in the system. If you don’t know what sessions are, you can run ‘ps f -eo pid,sid,cmd’ to find out which session id each process belongs to.
To prove this I wrote a little script that finds out the auto-grouping as reported by the Linux kernel, and you can find groups like:
------------------------------------------------------------------------------
503 slim -nodaemon
895 /bin/sh /etc/xdg/xfce4/xinitrc -- /etc/X11/xinit/xserverrc
901 dbus-launch --sh-syntax --exit-with-session
938 xfce4-session
948 xfwm4
952 xfce4-panel
954 Thunar --daemon
956 xfdesktop
958 conky -q
964 nm-applet
------------------------------------------------------------------------------
This is exactly what you would expect, the session leader (SLiM) starts a bunch of processes, and all of them belong to the same session, and if I compile a Linux kernel, I get:
------------------------------------------------------------------------------
14584 zsh
17920 make
20610 make -f scripts/Makefile.build obj=arch/x86
20661 make -f scripts/Makefile.build obj=kernel
20715 make -f scripts/Makefile.build obj=mm
20734 make -f scripts/Makefile.build obj=arch/x86/kernel
20736 make -f scripts/Makefile.build obj=fs
20750 make -f scripts/Makefile.build obj=arch/x86/kvm
20758 make -f scripts/Makefile.build obj=arch/x86/mm
21245 make -f scripts/Makefile.build obj=ipc
21274 make -f scripts/Makefile.build obj=security
21281 make -f scripts/Makefile.build obj=security/keys
21376 /bin/sh -c set -e; echo ' CC mm/mmu_context.o'; ...
21378 gcc -Wp,-MD,mm/.mmu_context.o.d ...
21387 /bin/sh -c set -e; echo ' CC ipc/msg.o'; ...
21390 gcc -Wp,-MD,ipc/.msg.o.d ...
21395 /bin/sh -c set -e; echo ' CC kernel/extable.o'; ...
21399 /bin/sh -c set -e; echo ' CC [M] arch/x86/kvm/pmu.o'; ...
21400 gcc -Wp,-MD,kernel/.extable.o.d ...
21403 gcc -Wp,-MD,arch/x86/kvm/.pmu.o.d .
21405 /bin/sh -c set -e; echo ' CC arch/x86/kernel/probe_roms.o'; ...
21407 gcc -Wp,-MD,arch/x86/kernel/.probe_roms.o.d ...
21413 /bin/sh -c set -e; echo ' CC fs/inode.o'; ...
21415 /bin/sh -c set -e; echo ' CC arch/x86/mm/srat.o'; ...
21418 /bin/sh -c set -e; echo ' CC security/keys/keyctl.o'; ...
------------------------------------------------------------------------------
This group will contain a lot of processes that take a lot of resources, but the scheduler knows they belong to the same group. If somebody logs in to my machine and starts running folding@home we would have two cgroups trying to use 100% of the CPU, so the scheduler would assign 50% to one, and 50% to the other, even though the first one has many more processes. Without the grouping, the scheduler would be unfair against folding@home, giving it as much time as it gives each one of the compilation processes.
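If you want to reproduce listings like the ones above, the kernel exposes the auto-group of each process in /proc/PID/autogroup (a line such as "/autogroup-20 nice 0") when CONFIG_SCHED_AUTOGROUP is enabled. The following is a minimal sketch of that idea and not the script I used; the name autogroups and the proc_root parameter are assumptions made for the example.

```ruby
# Group PIDs by the auto-group name reported by the kernel.
# "autogroups" and "proc_root" are hypothetical, for illustration only.
def autogroups(proc_root = '/proc')
  groups = Hash.new { |hash, key| hash[key] = [] }
  Dir.children(proc_root).grep(/\A\d+\z/).each do |entry|
    begin
      line = File.read(File.join(proc_root, entry, 'autogroup'))
    rescue SystemCallError
      next # process vanished, or the kernel lacks autogroup support
    end
    name = line.split.first # e.g. "/autogroup-20"
    groups[name] << entry.to_i if name
  end
  # Return a plain hash with the PIDs of each group in ascending order.
  groups.transform_values(&:sort)
end
```

Printing each group with the matching command lines from /proc/PID/cmdline gives output in the same spirit as the listings above.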
All this without you moving a finger. Well, almost.
def start(id, cmd)
  pid = fork do
    Process.setsid()
    exec(*cmd)
  end
  $daemons[id] = pid
end
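The only new ingredient is Process.setsid, which makes each daemon the leader of its own session, and sessions are what the auto-grouping works on. Here is a small sketch that makes the effect visible; spawn_in_new_session is a name I made up, and the child only reports its session id back through a pipe instead of exec-ing a daemon.

```ruby
# Fork a child, give it its own session, and report the session id it sees.
# "spawn_in_new_session" is a hypothetical helper for this sketch.
def spawn_in_new_session
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    Process.setsid                # become leader of a brand new session
    writer.puts(Process.getsid)   # session id as seen by the child
    writer.close
    exit!(0)
  end
  writer.close
  child_sid = reader.gets.to_i
  reader.close
  Process.wait(pid)
  [pid, child_sid]
end
```

The returned session id equals the child’s PID and differs from the parent’s session, which is exactly the separation the scheduler needs.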
Socket activation
systemd has made a lot of fuss about socket activation, and how it’s the best thing since sliced bread. I agree it’s a great idea, but the idea didn’t come from systemd; AFAIK it came from OS X. But do we need systemd to get the same in Linux?
def start_with_socket(id, stream, cmd)
  server = TCPServer.new(stream)
  Thread.new do
    loop do
      socket = server.accept
      system(*cmd, :in => socket, :out => socket)
      # Drop our copy so the peer sees EOF once the handler exits.
      socket.close
    end
  end
end
start_with_socket('sshd', 22, %w[/usr/bin/sshd -i])
Believe it or not, this simple code achieves socket activation. We create a socket and a new thread that waits for connections. If nobody connects, nothing happens and we just have an idle thread; each time somebody connects, we launch sshd -i, which as far as I can tell is the same thing xinetd and systemd do.
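You can try the same pattern without root by letting the caller pass in the listening socket and using a harmless handler such as cat. This is a variation and not the code above: the name serve_inetd_style is invented, and it uses Process.spawn instead of system so a slow client does not block the next connection.

```ruby
require 'socket'

# inetd-style activation: one handler process per connection, with the
# connection wired to the handler's stdin and stdout.
# "serve_inetd_style" is a hypothetical name used only in this sketch.
def serve_inetd_style(server, cmd)
  Thread.new do
    loop do
      socket = server.accept
      pid = Process.spawn(*cmd, in: socket, out: socket)
      socket.close         # the handler has its own copy now
      Process.detach(pid)  # reap it in the background (init would use SIGCHLD)
    end
  end
end

# Example, assuming an unprivileged port and cat as an echo handler:
#   server = TCPServer.new('127.0.0.1', 0)
#   serve_inetd_style(server, %w[cat])
```

With cat as the handler, whatever a client sends comes straight back, which is a quick way to confirm the plumbing before pointing it at a real daemon.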
But hey, this is the simple socket activation, it’s not the really fancy one.
Thread.new do
  if managed
    # Wait for the first connection, then hand the listening socket over.
    IO.select([server])
    pid = fork do
      env = {}
      env['LISTEN_PID'] = $$.to_s
      env['LISTEN_FDS'] = 1.to_s
      Process.setsid()
      exec(env, *cmd, 3 => server)
    end
    $daemons[id] = pid
  else
    loop do
      socket = server.accept
      system(*cmd, :in => socket, :out => socket)
      socket.close
    end
  end
end
There, this does exactly the same thing as systemd (at least for one socket, multiple ones are easy too), so yeah, we have socket activation.
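For completeness, this is roughly what the daemon on the receiving end has to do with those two variables: check that LISTEN_PID is its own PID, then treat file descriptors 3, 4, ... up to the count in LISTEN_FDS as already-listening sockets. A minimal sketch; the method name listen_fds and its parameters are mine, chosen so it can be exercised with a fake environment.

```ruby
require 'socket'

# First file descriptor handed over by the activation protocol.
LISTEN_FDS_START = 3

# Return the inherited listening descriptors, or [] when the
# variables are missing or were meant for another process.
# "listen_fds" is a hypothetical helper name for this sketch.
def listen_fds(env = ENV, pid = Process.pid)
  return [] unless env['LISTEN_PID'].to_i == pid
  count = env['LISTEN_FDS'].to_i
  (LISTEN_FDS_START...(LISTEN_FDS_START + count)).to_a
end

# A daemon written in Ruby could then pick up its socket like this:
#   fd = listen_fds.first
#   server = fd ? TCPServer.for_fd(fd) : TCPServer.new(2222)
```

The PID check is there so that a grandchild inheriting the environment by accident does not start poking at descriptors that were never meant for it.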
But wait, there’s more
Hopefully this covers the basics of what an init system should do, and how it’s not rocket science, nor voodoo. It is actually something very straightforward: start the system, keep it running, simple. Of course there are many other things an operating system should do, but those things don’t belong to the init system; don’t let anyone tell you otherwise.
I have more changes on top of this that bring my little toy init system almost on par with Arch Linux’s initscripts, which is what they used before moving to systemd, so chances are if you use my init, you would have little to no problems in your own system.
Unlike systemd and others, this code is actually very readable, so you can add and remove code as you like very easily, and of course, the less code you have, the faster you boot.
Personally when I hear somebody saying “Oh! but OpenRC doesn’t have socket activation, we need systemd!”, I just roll my eyes.
If you want to give it a try, get the code from GitHub:
https://github.com/felipec/finit
Cheers.
I think Lennart always cites SMF from Solaris as his inspiration for systemd. My quick search couldn’t quite determine if SMF has socket activation, but I’m guessing it has.
systemd does resource control, not just labeling
By default? Not on my system.
@Jon Jahren I don’t know about that, but OSX does have socket activation in their launchd.
Any chances to see more posts on the topic? I would love to read more about that or similar topics (reimplementing something in simple manner in ruby or any other language for educational purpose).
@gonzih I don’t know about that. What kind of software do you have in mind?
@FelipeC well I would be interested in services implementation for your init system. How would you handle start, reload, stop of services, maybe optional monitoring (memory usage and restarts for example), service dependencies.
I have some questions about this approach:
Why is additional functionality like running services, socket activation and so on tied to PID 1? Isn’t PID 1 just: a) wait for children and orphans, b) read the ctl file, c) maybe simple supervision like in sysv5-init? All the other stuff can be done effectively in other processes; the same goes for cgroups management/labeling and socket activation.
Trying this on a 32-bit system reveals some pitfalls concerning the syscalls:
The syscall number (first argument) is 88 instead of 169 on 32-bit Linux.
Also, ruby converts some of the magic number literals to Bignums earlier than on a 64bit system, and Kernel::syscall does not accept Bignums. Thus you’ll have to use negative integer literals (resulting in Fixnums) to get the proper bits across:
def sys_reboot(cmd)
  map = { poweroff: 0x4321fedc, restart: 0x01234567, halt: -839974621 }
  syscall(88, -18751827, 537993216, map[cmd])
end
Thank you for your article. Can you give a Dockerfile or tell me how to reproduce your development environment? I want to follow your article step by step, but I don’t know where to begin.
Reblogged this on UNIX init system news and commented:
Nice example and proof of concept code in Ruby.