Demystifying the init system (PID 1)

With all the talk about Debian choosing a default init system (link, link), I’ve decided to share with the world a little project I’ve been working on to help me understand /sbin/init, a.k.a. PID 1.

In this blog post I will go step by step through what an init system must do to be functional. I will ignore all the legacy SysVinit stuff and the technical nuances, and concentrate on what really matters.

Introduction

First of all, what is ‘init‘? In its essence, it’s a process that must be running at all times; if this process ends, the kernel panics, and after that you cannot do anything except reboot.

This process doesn’t need to do anything special, you can use /bin/sh as your init, or even /bin/yes (although the latter wouldn’t be very useful).

So let’s write our very first init.

#!/usr/bin/ruby
# Spawn a login prompt on tty1, then do nothing, forever
Process.spawn('agetty', 'tty1')
sleep

Believe it or not, this is actually a rather useful init. How useful it is depends on how your kernel was compiled, your partitioning scheme, and whether your root file-system is mounted read-write or not. But either way, it covers the basics: rule #1, always keep running, no matter what.

This is almost true, except that we need to be listening for SIGCHLD to reap our children; otherwise zombie processes would accumulate:

# Reap every child that has exited; as PID 1 we also inherit
# (and must reap) all orphaned processes
Signal.trap(:SIGCHLD) do
  loop do
    begin
      status = Process.wait(-1, Process::WNOHANG)
      break if status == nil
    rescue Errno::ECHILD
      break # no children left
    end
  end
end

Reboot

Now that we have running indefinitely under control, it’s time to stop running (only when requested), but in order to do that we need some kind of IPC with the running process. There are many ways to achieve this, but I chose UNIX sockets.

So instead of sleeping forever, we listen for commands issued to /run/initctl:

begin
  server = UNIXServer.open('/run/initctl')
rescue Errno::EADDRINUSE
  File.delete('/run/initctl')
  retry
end

loop do
  ctl = server.accept
  cmd = ctl.readline.chomp.to_sym
  # do stuff
end
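The ‘# do stuff’ placeholder boils down to reading the command, dispatching on it, and writing one reply line back. Here’s a minimal sketch; the handler procs are hypothetical stand-ins for the real actions (e.g. sys_reboot), and it’s exercised over a socketpair instead of /run/initctl:

```ruby
require 'socket'

# Dispatch one command from a control connection; handlers maps
# command symbols to procs (stand-ins for the real actions)
def handle_command(ctl, handlers)
  cmd = ctl.readline.chomp.to_sym
  if handlers.key?(cmd)
    handlers[cmd].call
    ctl.puts('ok')
  else
    ctl.puts('unknown command: %s' % cmd)
  end
ensure
  ctl.close
end

# Exercise it over a socketpair instead of a real /run/initctl
client, server_side = UNIXSocket.pair
client.puts('poweroff')
handle_command(server_side, { poweroff: -> { puts 'powering off' } })
puts client.readline  # "ok"
```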

And when the user calls us with arguments, we pass those commands through /run/initctl.

def do_cmd(*cmd)
  ctl = UNIXSocket.open('/run/initctl')
  ctl.puts(cmd.join(' '))
  puts(ctl.readline.chomp)
  exit
end

case ARGV[0]
when 'poweroff', 'restart', 'halt'
  do_cmd(ARGV[0].to_sym)
end

So we can issue the command init poweroff to turn off the machine, but in order to do that we need to tell the kernel:

def sys_reboot(cmd)
  map = { poweroff: 0x4321fedc, restart: 0x01234567, halt: 0xcdef0123 }
  syscall(169, 0xfee1dead, 537993216, map[cmd])
end

These numbers are not important in themselves (0xfee1dead is LINUX_REBOOT_MAGIC1, 537993216 is 0x20112000, one of the accepted LINUX_REBOOT_MAGIC2 variants, 169 is the reboot syscall number on x86-64, and the map holds the poweroff/restart/halt command codes); what is important is that the kernel understands them, and with this we can actually turn off the machine (or halt, or reboot).

Tread carefully

Obviously it would be tedious to type a bunch of commands each time the machine starts, so we need to actually do stuff after booting. However, if we do something wrong, we might render the system unusable. A simple way to solve this is to use scripts: fork a shell and let it run them, so that if there’s something wrong with a script, the shell dies, but not PID 1, and the system remains usable, which again is rule #1.

Fortunately Ruby has exceptions, so we can run code with a safety net that catches all exceptions, and there’s no need to fork, which would waste precious booting time.

def action(name)
  print(name)
  begin
    yield
  rescue => e
    print(' (error: %s)' % e)
  end
  puts
end

With this helper, we can safely run chunks of code, and if they fail, the error is reported to the user.
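For example, a failing action is reported without taking down the whole script (the helper is repeated here so the snippet runs standalone, and the file path is deliberately bogus):

```ruby
# Same helper as above, repeated so this snippet is self-contained
def action(name)
  print(name)
  begin
    yield
  rescue => e
    print(' (error: %s)' % e)
  end
  puts
end

action('Reading a config that does not exist') { File.read('/nonexistent/config') }
action('Doing nothing') { }
# The first line reports the ENOENT error; the second just prints its name
```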

Initialization

This is the bulk of the code: the instructions you don’t want to type every time. It’s mostly tedious stuff, so you can skim or skip this section safely.

def mount(type, device, dir, opts)
  Dir.mkdir(dir) unless File.directory?(dir)
  system('mount', '-t', type, device, dir, '-o', opts)
end

action 'Mounting virtual file-systems' do
  mount('proc', 'proc', '/proc', 'nosuid,noexec,nodev')
  mount('sysfs', 'sys', '/sys', 'nosuid,noexec,nodev')
  mount('tmpfs', 'run', '/run', 'mode=0755,nosuid,nodev')
  mount('devtmpfs', 'dev', '/dev', 'mode=0755,nosuid')
  mount('devpts', 'devpts', '/dev/pts', 'mode=0620,gid=5,nosuid,noexec')
  mount('tmpfs', 'shm', '/dev/shm', 'mode=1777,nosuid,nodev')
end

And set the hostname.

action 'Setting hostname' do
  hostname = File.read('/etc/hostname').chomp
  File.write('/proc/sys/kernel/hostname', hostname)
end

Notice that many things can go wrong; for example, the file ‘/etc/hostname’ might not exist. That would raise an exception, but our init would continue just fine.

Another thing we want to do is kill all the processes; otherwise we might not be able to unmount the file-systems. We could use killall5, but we wouldn’t have much control over the processes, and it would require a fork. Instead we can rely on the kernel to do the right thing, and all we have to do is wait for the results.

def killall

  def allgone?()
    Dir.glob('/proc/*').each do |e|
      pid = File.basename(e).to_i
      begin
        next if pid < 2
        # Is it a kernel process?
        next if File.read('/proc/%i/cmdline' % pid).empty?
      rescue Errno::ENOENT
        next # the process already exited
      end
      return false
    end
    return true
  end

  def wait_until(timeout = 2, interval = 0.25)
    start = Time.now
    begin
      break true if yield
      sleep(interval)
    end while (Time.now - start) < timeout
  end

  ok = false

  action 'Sending SIGTERM to processes' do
    Process.kill(:SIGTERM, -1)
    ok = wait_until(10) { allgone? }
    raise 'Failed' unless ok
  end

  return if ok

  action 'Sending SIGKILL to processes' do
    Process.kill(:SIGKILL, -1)
    ok = wait_until(15) { allgone? }
    raise 'Failed' unless ok
  end

end
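One subtlety in wait_until is worth spelling out: the begin…end while form runs the body at least once, so we poll immediately; the break returns true as soon as the block succeeds, and on timeout the loop simply ends, yielding nil. A standalone check of that contract:

```ruby
# Same helper as above, repeated so this snippet is self-contained
def wait_until(timeout = 2, interval = 0.25)
  start = Time.now
  begin
    break true if yield     # success: return true right away
    sleep(interval)
  end while (Time.now - start) < timeout
end                         # on timeout, the loop's value is nil

deadline = Time.now + 0.3
puts wait_until(5, 0.05) { Time.now >= deadline }   # true
puts wait_until(0.2, 0.05) { false }.inspect        # nil
```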

Time to mount real file-systems:

NETFS = %w[nfs nfs4 smbfs cifs codafs ncpfs shfs fuse fuseblk glusterfs davfs fuse.glusterfs]
VIRTFS = %w[proc sysfs tmpfs devtmpfs devpts]

action 'Mounting local filesystems' do
  except = NETFS.map { |e| 'no' + e }.join(',')
  system('mount', '-a', '-t', except, '-O', 'no_netdev')
end

# On shutdown

action 'Unmounting real filesystems' do
  except = (NETFS + VIRTFS).map { |e| 'no' + e }.join(',')
  system('umount', '-a', '-t', except, '-O', 'no_netdev')
end

If you are using a modern distribution, chances are your /run and /tmp directories are cleared up on every boot, so many files and directories need to be re-created. We could do this by hand, but we could also use the systemd-tmpfiles utility which uses the configuration already provided by your distribution in tmpfiles.d directories.

action 'Managing temporary files' do
  system('systemd-tmpfiles', '--create', '--remove', '--clean')
end

begin
  File.delete('/run/nologin')
rescue Errno::ENOENT
end

Unless you are using a custom kernel with modules built-in, chances are you are going to need udev, so fire it up:

action 'Starting udev daemon' do
  system('/usr/lib/systemd/systemd-udevd', '--daemon')
end

action 'Triggering udev uevents' do
  system('udevadm', 'trigger', '--action=add', '--type=subsystems')
  system('udevadm', 'trigger', '--action=add', '--type=devices')
end

action 'Waiting for udev uevents to be processed' do
  system('udevadm', 'settle')
end

# On shutdown

action 'Shutting down udev' do
  system('udevadm', 'control', '--exit')
end

Finally

After all this initialization stuff, your system is most likely very usable already, and in fact I was able to start a display manager (SLiM) at this point, which was my main goal while writing this. But we are just getting started.

In control

Another thing init should do is keep track of launched daemons. Each time we launch one, we store its PID, and when the child exits, we remove it from the list.

def start(id, cmd)
  $daemons[id] = Process.spawn(*cmd)
end

start('agetty1', %w[agetty tty1])

# In the SIGCHLD handler, after reaping a child (status holds its PID)
key = $daemons.key(status)
$daemons.delete(key) if key
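Combined, the bookkeeping looks like this as a self-contained sketch; a short-lived sleep 0.2 stands in for a real daemon, and $daemons is purged once it exits:

```ruby
$daemons = {}

def start(id, cmd)
  $daemons[id] = Process.spawn(*cmd)
end

# Reap children and forget any daemon that has exited
Signal.trap(:SIGCHLD) do
  loop do
    begin
      status = Process.wait(-1, Process::WNOHANG)
      break if status == nil
      key = $daemons.key(status)
      $daemons.delete(key) if key
    rescue Errno::ECHILD
      break
    end
  end
end

start('fake1', %w[sleep 0.2])
# Poll until the handler has purged the dead daemon
40.times { break if $daemons.empty?; sleep(0.05) }
puts $daemons.inspect  # {}
```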

Once we have this, it’s trivial to report their status (e.g. init status agetty1).

ctl.puts($daemons[args.first] ? 'ok' : 'dead')

At this point we actually have a feature that SysVinit doesn’t have. Not bad for 200 lines of code!

cgroups

cgroups are a feature that is often misunderstood, probably because there are no good tools to make use of them, but they are not that hard. Lennart Poettering went to a lot of trouble trying to explain exactly what systemd does and doesn’t do with them, but I don’t think he did a very good job of clarifying anything. Normally systemd is not doing anything with them (by default), simply labeling processes so you can see how they are grouped using visualization tools like systemd-cgls, but that’s it.

The single most important way you can take advantage of cgroups is for scheduling purposes: if your web browser is in one control group and your heavy compilation is in another, the Linux scheduler will keep the two from stealing resources from each other, without any need to adjust nice levels. Basically, with cgroups there’s no need for nice (although you can still use it alongside them).

But you don’t have to lift a finger to get this benefit; the kernel already does it if you have CONFIG_SCHED_AUTOGROUP enabled, which you should. Then a cgroup is created for each session in the system; if you don’t know what sessions are, you can run ‘ps f -eo pid,sid,cmd‘ to find out which session ID each process belongs to.
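You can also read the sessions straight out of /proc instead of using ps; a sketch (the session ID is the field right after state, ppid and pgrp in /proc/&lt;pid&gt;/stat, counting from after the parenthesized command name):

```ruby
# Group every process by its session ID, read from /proc/<pid>/stat
def sessions
  groups = Hash.new { |h, k| h[k] = [] }
  Dir.glob('/proc/[0-9]*/stat').each do |path|
    begin
      stat = File.read(path)
    rescue Errno::ENOENT
      next # the process exited while we were scanning
    end
    pid = File.basename(File.dirname(path)).to_i
    # Skip past "pid (comm) ", where comm may itself contain spaces
    fields = stat[(stat.rindex(')') + 2)..-1].split
    groups[fields[3].to_i] << pid   # state ppid pgrp session -> index 3
  end
  groups
end

sessions.each do |sid, pids|
  puts '%d: %s' % [sid, pids.sort.join(' ')]
end
```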

To prove this I wrote a little script that prints the auto-grouping as reported by the Linux kernel, and you can find groups like:

------------------------------------------------------------------------------
503	slim -nodaemon
895	/bin/sh /etc/xdg/xfce4/xinitrc -- /etc/X11/xinit/xserverrc
901	dbus-launch --sh-syntax --exit-with-session
938	xfce4-session
948	xfwm4
952	xfce4-panel
954	Thunar --daemon
956	xfdesktop
958	conky -q
964	nm-applet
------------------------------------------------------------------------------

This is exactly what you would expect: the session leader (SLiM) starts a bunch of processes, and all of them belong to the same session. And if I compile a Linux kernel, I get:

------------------------------------------------------------------------------
14584	zsh
17920	make
20610	make -f scripts/Makefile.build obj=arch/x86
20661	make -f scripts/Makefile.build obj=kernel
20715	make -f scripts/Makefile.build obj=mm
20734	make -f scripts/Makefile.build obj=arch/x86/kernel
20736	make -f scripts/Makefile.build obj=fs
20750	make -f scripts/Makefile.build obj=arch/x86/kvm
20758	make -f scripts/Makefile.build obj=arch/x86/mm
21245	make -f scripts/Makefile.build obj=ipc
21274	make -f scripts/Makefile.build obj=security
21281	make -f scripts/Makefile.build obj=security/keys
21376	/bin/sh -c set -e; 	   echo '  CC      mm/mmu_context.o'; ...
21378	gcc -Wp,-MD,mm/.mmu_context.o.d ...
21387	/bin/sh -c set -e; 	   echo '  CC      ipc/msg.o'; ...
21390	gcc -Wp,-MD,ipc/.msg.o.d ...
21395	/bin/sh -c set -e; 	   echo '  CC      kernel/extable.o'; ...
21399	/bin/sh -c set -e; 	   echo '  CC [M]  arch/x86/kvm/pmu.o'; ...
21400	gcc -Wp,-MD,kernel/.extable.o.d ...
21403	gcc -Wp,-MD,arch/x86/kvm/.pmu.o.d .
21405	/bin/sh -c set -e; 	   echo '  CC      arch/x86/kernel/probe_roms.o'; ...
21407	gcc -Wp,-MD,arch/x86/kernel/.probe_roms.o.d ...
21413	/bin/sh -c set -e; 	   echo '  CC      fs/inode.o'; ...
21415	/bin/sh -c set -e; 	   echo '  CC      arch/x86/mm/srat.o'; ...
21418	/bin/sh -c set -e; 	   echo '  CC      security/keys/keyctl.o'; ...
------------------------------------------------------------------------------

This group contains a lot of processes that take a lot of resources, but the scheduler knows they belong to the same group. If somebody logs in to my machine and starts running folding@home, we would have two cgroups trying to use 100% of the CPU, so the scheduler would assign 50% to one and 50% to the other, even though the first one has many more processes. Without the grouping, the scheduler would be unfair to folding@home, giving it only as much time as it gives each one of the compilation processes.

All this without you lifting a finger. Well, almost.

def start(id, cmd)
  pid = fork do
    # Detach into a new session, so each daemon gets its own autogroup
    Process.setsid()
    exec(*cmd)
  end
  $daemons[id] = pid
end
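It’s easy to verify that this really detaches the daemon: setsid(2) makes the caller the leader of a brand-new session whose ID equals its own PID. A quick standalone check:

```ruby
r, w = IO.pipe

pid = fork do
  r.close
  # In the child: create a new session and report its ID to the parent
  w.puts(Process.setsid)
  w.close
  exit!(0)
end

w.close
child_sid = r.readline.to_i
Process.wait(pid)
puts child_sid == pid  # true: the daemon now leads its own session
```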

Socket activation

systemd has made a lot of fuss about socket activation, and how it’s the best thing since sliced bread. I agree it’s a great idea, but the idea didn’t come from systemd; AFAIK it came from OS X. But do we need systemd to get the same thing on Linux?

def start_with_socket(id, port, cmd)

  server = TCPServer.new(port)

  Thread.new do
    loop do
      socket = server.accept
      system(*cmd, :in => socket, :out => socket)
    end
  end

end

start_with_socket('sshd', 22, %w[/usr/bin/sshd -i])

Believe it or not, this simple code achieves socket activation. We create a socket and a new thread that waits for connections. If nobody connects, nothing happens; we just have an idle thread. Each time somebody connects, we launch sshd -i, which as far as I can tell is the same thing xinetd does, and systemd too.
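You don’t need sshd to watch this work; the same accept loop with an ephemeral port and cat standing in for the service echoes back whatever a client sends:

```ruby
require 'socket'

server = TCPServer.new('127.0.0.1', 0)  # port 0: let the kernel pick a free port
port = server.addr[1]

Thread.new do
  loop do
    socket = server.accept
    # The service inherits the connection as stdin/stdout, like sshd -i would
    system('cat', :in => socket, :out => socket)
    socket.close
  end
end

client = TCPSocket.new('127.0.0.1', port)
client.puts('ping')
client.close_write        # send EOF so cat terminates
puts client.readline      # "ping"
```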

But hey, this is the simple socket activation, it’s not the really fancy one.

Thread.new do
  if managed
    IO.select([server])
    pid = fork do
      env = {}
      env['LISTEN_PID'] = $$.to_s
      env['LISTEN_FDS'] = 1.to_s
      Process.setsid()
      exec(env, *cmd, 3 => server)
    end
    $daemons[id] = pid
  else
    loop do
      socket = server.accept
      system(*cmd, :in => socket, :out => socket)
    end
  end
end

There, this does exactly the same thing as systemd (at least for one socket; multiple ones are easy too), so yes, we have socket activation.
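For completeness, here’s the other half of the protocol: the daemon side just picks up fd 3 when LISTEN_FDS is set. A self-contained round-trip, with a tiny inline Ruby script standing in for a real socket-activated service:

```ruby
require 'socket'
require 'rbconfig'

server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]

# A stand-in daemon: it reconstructs the listening socket from fd 3,
# serves a single connection, and echoes the request upcased
daemon_code = <<~'RUBY'
  require 'socket'
  raise 'not socket-activated' unless ENV['LISTEN_FDS'] == '1'
  passed = TCPServer.for_fd(3)
  client = passed.accept
  client.write(client.readline.upcase)
  client.close
RUBY

pid = fork do
  # $$ is already the child's PID here, as in the snippet above
  env = { 'LISTEN_PID' => $$.to_s, 'LISTEN_FDS' => '1' }
  exec(env, RbConfig.ruby, '-e', daemon_code, 3 => server)
end

client = TCPSocket.new('127.0.0.1', port)
client.puts('hello')
reply = client.readline
Process.wait(pid)
puts reply  # "HELLO"
```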

But wait, there’s more

Hopefully this covers the basics of what an init system should do, and shows that it’s not rocket science, nor voodoo. It is actually something very straightforward: start the system, keep it running. Simple. Of course there are many other things an operating system should do, but those things don’t belong in the init system; don’t let anyone tell you otherwise.

I have more changes on top of this that bring my little toy init system almost up to par with Arch Linux’s initscripts, which is what Arch used before moving to systemd, so chances are that if you use my init, you would have few to no problems on your own system.

Unlike systemd and others, this code is actually very readable, so you can add and remove code as you like very easily, and of course, the less code you have, the faster you boot.

Personally when I hear somebody saying “Oh! but OpenRC doesn’t have socket activation, we need systemd!”, I just roll my eyes.

If you want to give it a try, get the code from GitHub:

https://github.com/felipec/finit

Cheers.


