Demystifying the init system (PID 1)

With all the talk about Debian choosing a default init system (link, link), I’ve decided to share with the world a little project I’ve been working on to help me understand /sbin/init, a.k.a. PID 1.

In this blog post I will go step by step through what an init system must do to be functional. I will ignore all the legacy SysVinit stuff and the technical nuances, and just concentrate on what’s really important.

Introduction

First of all, what is ‘init’? In its essence, it’s a process that must be running at all times; if this process ends, the kernel panics, after which you cannot do anything else except reboot.

This process doesn’t need to do anything special; you can use /bin/sh as your init (e.g. by passing init=/bin/sh on the kernel command line), or even /bin/yes (although the latter wouldn’t be very useful).

So let’s write our very first init.

#!/usr/bin/ruby
Process.spawn('agetty', 'tty1')
sleep

Believe it or not, this is actually a rather useful init. How useful it is depends on how your kernel was compiled, your partitioning scheme, and whether your root file-system is mounted rw or not. But either way, it covers the basics, rule #1: always keep running no matter what.

This is almost true, except that we also need to listen for SIGCHLD, otherwise dead children wouldn’t be reaped properly and would linger as zombies, so:

Signal.trap(:SIGCHLD) do
  # Reap every child that has exited (as PID 1 we inherit all orphaned processes).
  loop do
    begin
      status = Process.wait(-1, Process::WNOHANG)
      break if status == nil
    rescue Errno::ECHILD
      break
    end
  end
end

Reboot

Now that we have running indefinitely under control, it’s time to stop running (but only when requested). In order to do that we need some kind of IPC with the running process; there are many ways to achieve this, but I chose UNIX sockets.

So instead of sleeping forever, we listen for commands issued to /run/initctl:

require 'socket'

begin
  server = UNIXServer.open('/run/initctl')
rescue Errno::EADDRINUSE
  # A stale socket from a previous run; remove it and try again.
  File.delete('/run/initctl')
  retry
end

loop do
  ctl = server.accept
  cmd = ctl.readline.chomp.to_sym
  # do stuff
end

And when the user calls us with arguments, we pass those commands through /run/initctl.

def do_cmd(*cmd)
  ctl = UNIXSocket.open('/run/initctl')
  ctl.puts(cmd.join(' '))
  puts(ctl.readline.chomp)
  exit
end

case ARGV[0]
when 'poweroff', 'restart', 'halt'
  do_cmd(ARGV[0].to_sym)
end

So we can issue the command init poweroff to turn off the machine, but in order to actually do that we need to tell the kernel:

def sys_reboot(cmd)
  map = { poweroff: 0x4321fedc, restart: 0x01234567, halt: 0xcdef0123 }
  syscall(169, 0xfee1dead, 537993216, map[cmd])
end

These magic numbers are not important in themselves (169 is the reboot syscall number on x86-64, and the other values are the constants reboot(2) expects); what is important is that the kernel understands them, and with this we can actually turn off the machine (or halt, or reboot it).
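
Putting the pieces together, the ‘# do stuff’ part of the control loop can now dispatch on the command and reply to the client. A minimal sketch of how that might look (my reconstruction, not the author’s exact code; a real shutdown would also run the kill/unmount actions shown later before calling sys_reboot):

loop do
  ctl = server.accept
  # e.g. "poweroff", or later "status agetty1"
  cmd, *args = ctl.readline.chomp.split(' ')
  case cmd.to_sym
  when :poweroff, :restart, :halt
    ctl.puts('ok') # reply so the client's readline returns
    ctl.close
    sys_reboot(cmd.to_sym)
  else
    ctl.puts('unknown command')
    ctl.close
  end
end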

Tread carefully

Obviously it would be tedious to type a bunch of commands each time the machine starts, so we need to actually do stuff after booting. However, if we do something wrong, we might render the system unusable. A simple way to solve this is to use scripts: fork a shell and let it run them, so if there’s something wrong with a script, the shell dies, but not PID 1, and the system remains usable, which, again, is rule #1.

Fortunately Ruby has exceptions, so we can run code with a safety net that catches all exceptions, and there’s no need to fork, which would waste precious booting time.

def action(name)
  print(name)
  begin
    yield
  rescue => e
    print(' (error: %s)' % e)
  end
  puts
end

With this helper, we can safely run chunks of code, and if they fail, the error is reported to the user.

Initialization

This is the bulk of the code; the instructions you don’t want to type every time. This is mostly tedious stuff, you can skim or skip this section safely.

def mount(type, device, dir, opts)
  Dir.mkdir(dir) unless File.directory?(dir)
  system('mount', '-t', type, device, dir, '-o', opts)
end

action 'Mounting virtual file-systems' do
  mount('proc', 'proc', '/proc', 'nosuid,noexec,nodev')
  mount('sysfs', 'sys', '/sys', 'nosuid,noexec,nodev')
  mount('tmpfs', 'run', '/run', 'mode=0755,nosuid,nodev')
  mount('devtmpfs', 'dev', '/dev', 'mode=0755,nosuid')
  mount('devpts', 'devpts', '/dev/pts', 'mode=0620,gid=5,nosuid,noexec')
  mount('tmpfs', 'shm', '/dev/shm', 'mode=1777,nosuid,nodev')
end

And set the hostname.

action 'Setting hostname' do
  hostname = File.read('/etc/hostname').chomp
  File.write('/proc/sys/kernel/hostname', hostname)
end

Notice that many things can go wrong; for example, the file ‘/etc/hostname’ might not exist. That would raise an exception, but the action helper catches it, and our init continues just fine.
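
For example, if /etc/hostname is missing, the boot output would just show something along these lines (the exact wording is Ruby’s Errno::ENOENT message) and carry on:

Setting hostname (error: No such file or directory @ rb_sysopen - /etc/hostname)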

Another thing we would want to do is kill all the processes, otherwise we might not be able to unmount the file-systems. We could use killall5, but we wouldn’t have much control over the processes, and it would require a fork. Instead we can rely on the kernel to do the right thing (a signal sent to -1 goes to every process except PID 1 and the sender), and all we have to do is wait for the results.

def killall

  def allgone?()
    Dir.glob('/proc/*').each do |e|
      pid = File.basename(e).to_i
      begin
        next if pid < 2
        # Is it a kernel process?
        next if File.read('/proc/%i/cmdline' % pid).empty?
      rescue Errno::ENOENT
        # The process vanished while we were checking; treat it as gone.
        next
      end
      return false
    end
    return true
  end

  def wait_until(timeout = 2, interval = 0.25)
    start = Time.now
    begin
      break true if yield
      sleep(interval)
    end while (Time.now - start) < timeout
  end

  ok = false

  action 'Sending SIGTERM to processes' do
    Process.kill(:SIGTERM, -1)
    ok = wait_until(10) { allgone? }
    raise 'Failed' unless ok
  end

  return if ok

  action 'Sending SIGKILL to processes' do
    Process.kill(:SIGKILL, -1)
    ok = wait_until(15) { allgone? }
    raise 'Failed' unless ok
  end

end

Time to mount real file-systems:

NETFS = %w[nfs nfs4 smbfs cifs codafs ncpfs shfs fuse fuseblk glusterfs davfs fuse.glusterfs]
VIRTFS = %w[proc sysfs tmpfs devtmpfs devpts]

action 'Mounting local filesystems' do
  except = NETFS.map { |e| 'no' + e }.join(',')
  system('mount', '-a', '-t', except, '-O', 'no_netdev')
end

# On shutdown

action 'Unmounting real filesystems' do
  except = (NETFS + VIRTFS).map { |e| 'no' + e }.join(',')
  system('umount', '-a', '-t', except, '-O', 'no_netdev')
end

If you are using a modern distribution, chances are your /run and /tmp directories are cleared up on every boot, so many files and directories need to be re-created. We could do this by hand, but we could also use the systemd-tmpfiles utility which uses the configuration already provided by your distribution in tmpfiles.d directories.

action 'Manage temporary files' do
  system('systemd-tmpfiles', '--create', '--remove', '--clean')
end

begin
  File.delete('/run/nologin')
rescue Errno::ENOENT
end

Unless you are using a custom kernel with modules built-in, chances are you are going to need udev, so fire it up:

action 'Starting udev daemon' do
  system('/usr/lib/systemd/systemd-udevd', '--daemon')
end

action 'Triggering udev uevents' do
  system('udevadm', 'trigger', '--action=add', '--type=subsystems')
  system('udevadm', 'trigger', '--action=add', '--type=devices')
end

action 'Waiting for udev uevents to be processed' do
  system('udevadm', 'settle')
end

# On shutdown

action 'Shutting down udev' do
  system('udevadm', 'control', '--exit')
end

Finally

After all this initialization stuff, your system is most likely very usable already, and in fact I was able to start a display manager (SLiM) at this point, which was my main goal while writing this. But we are just getting started.

In control

Another thing init should do is keep track of launched daemons. Each time we launch one we store its PID, and when the child exits, we remove it from the list.

# Keep track of launched daemons by their PID.
$daemons = {}

def start(id, cmd)
  $daemons[id] = Process.spawn(*cmd)
end

start('agetty1', %w[agetty tty1])

# On SIGCHLD ('status' is the PID returned by Process.wait)
key = $daemons.key(status)
$daemons.delete(key) if key

Once we have this it’s trivial to report their status (e.g. init status agetty1).

ctl.puts($daemons[args.first] ? 'ok' : 'dead')
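
For that one-liner to work end to end, the client needs to let the extra command through, and the dispatcher sketched earlier needs one more branch; something along these lines (again a sketch, not the author’s exact code):

# Client side: forward the whole command line, e.g. "status agetty1".
case ARGV[0]
when 'poweroff', 'restart', 'halt', 'status'
  do_cmd(*ARGV)
end

# Server side: an extra branch in the dispatcher's case statement.
when :status
  ctl.puts($daemons[args.first] ? 'ok' : 'dead')
  ctl.close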

At this point we actually have a feature that SysVinit doesn’t have. Not bad for 200 lines of code!

cgroups

cgroups are a feature that is often misunderstood, probably because there are no good tools to make use of them, but they are not that hard. Lennart Poettering went to a lot of trouble trying to explain exactly what systemd does and doesn’t do with them, but I don’t think he did a very good job of clarifying anything. Basically, systemd is not doing anything with them by default; it simply labels processes so you can see how they are grouped using visualization tools like systemd-cgls, but that’s it.

The single most important way you can take advantage of cgroups is for scheduling purposes. For example, if your web browser is in one control group and your heavy compilation is in another, the Linux scheduler will keep the two groups from stealing resources from each other, without the need to adjust nice levels. Basically, with cgroups there’s no need for nice (although you can still use it alongside).

But you don’t have to move a finger to get this benefit; the kernel already does it if you have CONFIG_SCHED_AUTOGROUP enabled, which you should. Then cgroups are created for each session in the system; if you don’t know what sessions are, you can run ‘ps f -eo pid,sid,cmd’ to find out which session id each process belongs to.

To prove this I wrote a little script that finds out the auto-grouping as reported by the Linux kernel, and you can find groups like:

------------------------------------------------------------------------------
503	slim -nodaemon
895	/bin/sh /etc/xdg/xfce4/xinitrc -- /etc/X11/xinit/xserverrc
901	dbus-launch --sh-syntax --exit-with-session
938	xfce4-session
948	xfwm4
952	xfce4-panel
954	Thunar --daemon
956	xfdesktop
958	conky -q
964	nm-applet
------------------------------------------------------------------------------

This is exactly what you would expect: the session leader (SLiM) starts a bunch of processes, and all of them belong to the same session. And if I compile a Linux kernel, I get:

------------------------------------------------------------------------------
14584	zsh
17920	make
20610	make -f scripts/Makefile.build obj=arch/x86
20661	make -f scripts/Makefile.build obj=kernel
20715	make -f scripts/Makefile.build obj=mm
20734	make -f scripts/Makefile.build obj=arch/x86/kernel
20736	make -f scripts/Makefile.build obj=fs
20750	make -f scripts/Makefile.build obj=arch/x86/kvm
20758	make -f scripts/Makefile.build obj=arch/x86/mm
21245	make -f scripts/Makefile.build obj=ipc
21274	make -f scripts/Makefile.build obj=security
21281	make -f scripts/Makefile.build obj=security/keys
21376	/bin/sh -c set -e; 	   echo '  CC      mm/mmu_context.o'; ...
21378	gcc -Wp,-MD,mm/.mmu_context.o.d ...
21387	/bin/sh -c set -e; 	   echo '  CC      ipc/msg.o'; ...
21390	gcc -Wp,-MD,ipc/.msg.o.d ...
21395	/bin/sh -c set -e; 	   echo '  CC      kernel/extable.o'; ...
21399	/bin/sh -c set -e; 	   echo '  CC [M]  arch/x86/kvm/pmu.o'; ...
21400	gcc -Wp,-MD,kernel/.extable.o.d ...
21403	gcc -Wp,-MD,arch/x86/kvm/.pmu.o.d .
21405	/bin/sh -c set -e; 	   echo '  CC      arch/x86/kernel/probe_roms.o'; ...
21407	gcc -Wp,-MD,arch/x86/kernel/.probe_roms.o.d ...
21413	/bin/sh -c set -e; 	   echo '  CC      fs/inode.o'; ...
21415	/bin/sh -c set -e; 	   echo '  CC      arch/x86/mm/srat.o'; ...
21418	/bin/sh -c set -e; 	   echo '  CC      security/keys/keyctl.o'; ...
------------------------------------------------------------------------------

This group will contain a lot of processes that take a lot of resources, but the scheduler knows they belong to the same group. If somebody logs in to my machine and starts running folding@home, we would have two cgroups trying to use 100% of the CPU, so the scheduler would assign 50% to one and 50% to the other, even though the first one has many more processes. Without the grouping, the scheduler would be unfair to folding@home, giving it only as much time as it gives each one of the compilation processes.

All this without you moving a finger. Well, almost.

def start(id, cmd)
  pid = fork do
    # Each daemon gets its own session, and thus its own autogroup.
    Process.setsid()
    exec(*cmd)
  end
  $daemons[id] = pid
end
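
As an aside, the little script mentioned earlier (the one that prints the autogroups) is easy to reconstruct. Here is a minimal sketch that walks /proc and groups processes by /proc/<pid>/autogroup (my own reconstruction, not the author’s actual code):

groups = Hash.new { |h, k| h[k] = [] }

Dir.glob('/proc/[0-9]*').each do |dir|
  begin
    cmdline = File.read("#{dir}/cmdline").split("\0").join(' ')
    next if cmdline.empty? # kernel thread
    group = File.read("#{dir}/autogroup")[/autogroup-\d+/]
    groups[group] << [File.basename(dir).to_i, cmdline]
  rescue Errno::ENOENT, Errno::EACCES
    next # the process went away, or we are not allowed to look
  end
end

groups.each_value do |procs|
  puts '-' * 78
  procs.sort.each { |pid, cmd| puts "#{pid}\t#{cmd}" }
end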

Socket activation

systemd has made a lot of fuss about socket activation, and how it’s the best thing since sliced bread. I agree it’s a great idea, but the idea didn’t come from systemd; AFAIK it came from OS X (launchd). But do we need systemd to get the same thing on Linux?

def start_with_socket(id, stream, cmd)

  server = TCPServer.new(stream)

  Thread.new do
    loop do
      socket = server.accept
      system(*cmd, :in => socket, :out => socket)
      socket.close # don't keep the connection open in init
    end
  end

end

start_with_socket('sshd', 22, %w[/usr/bin/sshd -i])

Believe it or not, this simple code achieves socket activation. We create a socket and a new thread that waits for connections; if nobody connects, nothing happens, we just have an idle thread. Each time somebody connects, we launch sshd -i, which as far as I can tell is the same thing xinetd does, and the same thing systemd does.

But hey, this is the simple socket activation, it’s not the really fancy one.

Thread.new do
  if managed
    IO.select([server])
    pid = fork do
      env = {}
      env['LISTEN_PID'] = $$.to_s
      env['LISTEN_FDS'] = 1.to_s
      Process.setsid()
      exec(env, *cmd, 3 => server)
    end
    $daemons[id] = pid
  else
    loop do
      socket = server.accept
      system(*cmd, :in => socket, :out => socket)
      socket.close
    end
  end
end

There, this does exactly the same thing as systemd (at least for one socket, multiple ones are easy too), so yeah, we have socket activation.
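
For reference, the daemon side of this protocol is equally simple: if LISTEN_PID matches its own PID, then fd 3 (and up to LISTEN_FDS of them) is already a listening socket. A minimal sketch of a daemon consuming it (my illustration; the fallback port and the greeting are made up):

require 'socket'

if ENV['LISTEN_PID'] == Process.pid.to_s && ENV['LISTEN_FDS'].to_i >= 1
  # Socket-activated: adopt the already-listening socket on fd 3.
  server = TCPServer.for_fd(3)
else
  # Started by hand: listen ourselves.
  server = TCPServer.new(2222)
end

loop do
  client = server.accept
  client.puts('hello')
  client.close
end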

But wait, there’s more

Hopefully this covers the basics of what an init system should do, and shows that it’s not rocket science, nor voodoo. It is actually something very straightforward: start the system, keep it running; simple. Of course there are many other things an operating system should do, but those things don’t belong in the init system; don’t let anyone tell you otherwise.

I have more changes on top of this that bring my little toy init system almost up to par with Arch Linux’s initscripts, which is what they used before moving to systemd, so chances are that if you use my init, you would have little to no problems on your own system.

Unlike systemd and others, this code is actually very readable, so you can add and remove code as you like very easily, and of course, the less code you have, the faster you boot.

Personally when I hear somebody saying “Oh! but OpenRC doesn’t have socket activation, we need systemd!”, I just roll my eyes.

If you want to give it a try, get the code from GitHub:

https://github.com/felipec/finit

Cheers.


gst-av 0.6 released; more reliable

gst-av is a GStreamer plug-in that provides support for libav (a fork of FFmpeg). It is similar to gst-ffmpeg, but without the GStreamer politics, which means all libav codecs are supported even if there are native GStreamer alternatives: VP8, MP3, Ogg, Vorbis, AAC, etc.

This release takes care of a few corner-cases, and has support for more versions of FFmpeg.

Here are the goods:
http://code.google.com/p/gst-av/downloads/list

And here’s the short-log:

Felipe Contreras (19):
      adec: flush buffer on EOS
      adec: improve timestamp reset
      adec: avoid deprecated av_get_bits_per_sample_fmt()
      adec: avoid FF_API_GET_BITS_PER_SAMPLE_FMT
      vdec: properly initialize input buffer
      parse: add more H.264 parsing checks
      parse: fix H.264 parsing for bitstream format
      get_bits: add show_bits function
      build: set runpath for libav
      vdec: fix potential leaks
      vdec: use libav pts stuff
      vdec: get delayed pictures on eos
      build: trivial improvements
      parse: trivial fix
      h264enc: fix static function
      vdec: add support for old reordered_opaque
      adec: add support for old sample_fmt
      adec: add support for really old bps()
      adec: add support for all MPEG-1 audio

Mark Nauwelaerts (1):
      parse: be less picky regarding some reserved value

Android vs. Maemo power management: static vs. dynamic

Some of you might have heard about the Google Android team’s proposal to introduce wakelocks (aka suspend blockers) to the Linux kernel. While there was a real issue being solved on the kernel side, the benefits on the user-space side were dubious at best, and after a huge discussion, they ultimately didn’t get in.

During these discussions, dynamic and static power management were described and discussed at length, and there was a bit of talk about the Maemo (MeeGo) vs. Android approaches to power management, but there was so much more than that.

Some people have problems with the battery on Android devices, for some people it’s just fine; some people have problems with Maemo, others don’t; so in general, your mileage may vary. But given the extremely different approaches, it’s easy to see in which cases you would have better luck with Maemo, and in which with Android (although I do think it’s obvious which approach is superior; then again, I am biased).

An interesting achievement was shared by Thiago Macieira, who managed to get ‘5 days and a couple of minutes’ out of the Nokia N9 while traveling, and actually using it; and let’s remember this is a 1450 mAh battery. Some Android users might have trouble believing this, but hopefully this blog post will explain what’s going on.

So let’s go ahead and explore the two approaches.

Dynamic Power Management

Perhaps the simplest way to imagine dynamic power management is the gears of a manual transmission car: you go up and down depending on how much power the system actually needs.

In some hardware, such as OMAP chips, it’s possible to have quite fine control over the power states of a lot of devices, plus different operating power points on the different cores. For example, it might be possible to have some devices, such as the display, on and active; other devices partially off, like the speaker; and others completely off, like USB. And based on the status of the whole system, whole blocks can be powered off, others run at lower voltage levels, and so on.

Linux has a framework to deal properly with this kind of hardware: runtime power management. It originally came from the embedded world, and to a large extent from OMAP development, but it is now available to everyone.

The idea is very simple; devices should sleep as much as possible. This means that if you have a sound device that needs chunks of 100ms, and the system is not doing anything else but playing sound, then most of the devices go to sleep, even the CPU, and the CPU is only woken up when it needs to write data for the audio device. Even better is to configure the sound device for chunks of 1 second, so the system can sleep even more.

Obviously, some co-operation between kernel and user-space is needed. Say, if you have an instant messenger program that needs to do work every minute, and a mail program that is configured to check mail every 10 minutes, you would want them to do work at the same time when they align at every 10 minutes. This is sometimes called IP heartbeat; the system wakes up for a beat, and then immediately goes back to sleep. And there are kernel facilities as well, such as range timers.

All this is possible thanks to the very small latencies required for devices to go to sleep and wake up, and to the intermediate modes available (e.g. on, inactive, retention, off); so, for example, a device might be able to go to inactive mode in 1ms, retention in 2ms, and off in 5ms (totally invented numbers). Again, the more sleep, the better. Obviously, this is impossible on x86 chips, which have huge latencies (at least right now, and it’s something Intel is surely trying hard to improve). All these latencies are known by the runtime pm framework in the kernel, and based on them and the usage, it figures out the lowest power state possible without breaking things.

Note that I’m not a power management expert, but you can watch a colleague of mine explain OMAP 3 power management in this video:

Advanced Power Management for OMAP3

And there are plenty more resources.

Update: That was the wrong link, here are the resources.

Static Power Management

Static power management has two modes: on and off. That’s it.

OK, that’s not exactly the case in general, but it is in the Android context: the system is either suspended or active, and it’s easy to know which mode you are in; if the screen is on, it’s active, and if it’s off, it is suspended (ideally).

There’s really not much more to it than that. The complexity comes from the problem of when to suspend; you might have turned off the display, but there might be a system service that still needs to do work, so this service grabs a suspend blocker, which, as the name suggests, prevents the system from suspending (until the lock is released). This introduces a problem: a rogue program might grab a suspend blocker and never release it, which means your phone will never suspend, and thus the battery will drain rather quickly. So some security mechanisms are introduced to grant the permission to use suspend blockers selectively.

And moreover, Android developers found race conditions in the suspend sequence in certain situations that were rather nasty (e.g. the system tries to suspend at the same time the user clicks a button, and the system never wakes up again); these were acknowledged as real issues that would happen on all systems (including PCs and servers, albeit rarely, because they don’t suspend so often), and got fixed (or at least they tried).

Versus

First of all, it’s important to know that if you have dynamic pm working perfectly, you reach exactly the same power usage as static pm, so in the ideal case they both behave exactly the same.

The problem is that it’s really hard for dynamic pm to reach that ideal case; in reality systems are able to sleep for certain periods of time, after which they are woken up, oftentimes unnecessarily, and as I already explained, that’s not good. So the goal of a dynamic pm system is to increase those periods of time as much as possible, thus maximizing battery life. But there’s a point of diminishing returns (read this or this for expert explanations): if the system manages to sleep 1s on average, there’s really not much more to gain by sleeping 2s, or even 10s. These goals were quite difficult to achieve in the past (not with these exact numbers, which I invented), but not so much any more, thanks to several mechanisms that have been introduced and implemented through the years. So it’s fair to say that the sweet spot of dynamic pm has been achieved.

This means that today a system that has been fine-tuned for dynamic pm can reach a decent battery life compared to one that uses static pm in most circumstances. But for some use-cases, say, you leave your phone on your desk and don’t use it at all, static pm would allow it to stay alive for weeks, or even months, while dynamic pm can’t possibly achieve that any time soon. Hopefully you would agree that nobody cares about those use-cases where you are not actually using the device.

And of course, you only need one application behaving badly and waking up the system constantly, and the battery life is screwed. So in essence, dynamic pm is hard to achieve.

Android developers argued that this was one of the main reasons to go for static pm: it’s easier to achieve, especially if you want to support a lot of third-party applications (Android Market) without compromising battery life. While this makes sense, I wasn’t convinced by this argument; you can still have one application that behaves badly (grabbing suspend blockers the whole time), and while permissions should help, the application might still request the permission, and the user grant it (who reads incomprehensible warnings before clicking ‘Yes’ anyway?).

So, both can get screwed by bad apps (although it’s likely that it’s harder in the static pm case, albeit not that much).

But actually, you can have both static and dynamic power management, and in fact, Android does. But that doesn’t mean Android automatically wins; as I explained, the system needs to be fine-tuned for dynamic pm, and that has never been a focus of Android (there are no APIs or frameworks for that, etc.). So, for example, a Nokia N9 phone might be able to sleep 1s on average, while an Android phone sleeps 100ms (when not suspended). This means that when you are actually using the device (the screen is on), chances are a system fine-tuned for dynamic pm (the Nokia N9) will consume less battery than an Android device.

That is the main difference between the two. tl;dr: dynamic pm is better for active usage.

So, if Android developers want to improve the battery usage during active usage (which I assume is what the users want), they need to fine-tune the system for dynamic pm, and thus sleep as much as possible, hopefully reaching the sweet spot. But wait a second… if Android is using dynamic pm anyway, and they tune the system to the point of diminishing returns, then there is no need for static pm. Right? Well, that’s my thinking, but I didn’t manage to make Android developers realize that in the discussion.

Plus, there’s a bunch of other reasons why static pm is not palatable for non-Android systems (a.k.a. typical Linux systems), but I won’t go into those details.

Nokia’s bet was on dynamic, Google’s bet was on static, and in the end we agreed to disagree, but I think it’s clear from the outcome in real-world situations who was right: N9 users experience more than one day of normal usage, and some even more than two. Sadly, only Nokia N9 users get to experience dynamic pm in its full glory (at the moment).

Upstream

But not all is lost, and this, in my opinion, is the most important aspect. Dynamic pm lives on in the Linux kernel mainline through the runtime power management API. This is not a Nokia invention that will die with the Nokia N9; it’s a collaborative effort where Nokia, TI, and other companies worked together, and it now benefits not only OMAP, but other embedded chips, and even non-embedded ones. Being upstream means it’s good: it has been blessed by many parties, has gone through many iterations, and by now it probably doesn’t look like the original OMAP SRF code at all. Sooner or later your phone (if you don’t have an N9) will benefit from this effort, and it might even be an Android phone, your netbook (if it’s not already benefiting in some way), and even your PC.

Android’s suspend blockers are not upstream, and it’s quite unlikely that they ever will be, or that you will see them used in any system other than Android, and there are good reasons for that. Matthew Garrett did an excellent job of summarizing what went wrong with the story of suspend blockers in his presentation ‘Android/Linux kernel: Lessons learned’, but unfortunately the Linux Foundation seems to be doing a poor job of providing those videos (I cannot watch them any more, even though I registered, and they haven’t been helpful through email conversations).

Update: I managed to get the video from the Linux Foundation and pushed it to YouTube:
http://www.youtube.com/watch?v=A4psPP67YMU

Here is part of the discussion on LKML, if you want to read it for some strange reason. WARNING: it’s huge.

A tale of just another Linux kernel bug

As part of a bigger effort to get my Nokia N900 in good shape for development, I decided to track down an issue with the keyboard; I could type ‘a’, but not ‘A’ or any special characters, so no ‘shift’ or ‘ctrl’ or anything special. Trying to figure out what was going on took me through an unexpected journey, which is not remarkable, but I think it’s a good example of what many kernel developers (and low level developers) constantly go through, and as such, might be interesting for some people to read.

Keycodes

So, first things first. I recently had an issue with a PS/2 keyboard on my PC, so I had an idea of how to debug this, and the first thing I did was to run showkey, which shows messages like these:

keycode  30 press
keycode  30 release
keycode  30 press
keycode  30 release
keycode  29 press

Then these keycodes are supposed to be converted to real characters somehow through a key map, and apparently X has a map of its own.

However, I saw keycodes being generated when I pressed ‘shift’, so I concluded that neither the hardware nor the input driver was the problem; could it be the mapping? The MeeGo project provided a mapping file that I have been using for some time, and it clearly shows “keycode 42 = Shift”, and I was getting keycode 42, so the mapping seemed correct, but was it being applied properly? I found out this can be checked using dumpkeys, and indeed, the mapping was correct.

Everything was fine up to this level. Next.

Note: all these tools (showkey, loadkeys, and dumpkeys) are provided by the ‘kbd’ package.

Virtual terminal

Now I stumbled into a problem; I’ve no idea what it is that I am interacting with on the framebuffer console (I’m not using X). So, I try to fill that gap in my knowledge by going to the ##linux IRC channel on freenode.net and asking the question: what is it that converts the keycodes to characters on a framebuffer console? As is typical when I ask these sorts of tricky questions, I get useless responses, like ‘what distribution are you using?’. I knew that didn’t matter, so I set out to investigate myself.

I thought, well, what is this thing that I have to add to my inittab to get the console working? 1:2345:respawn:/sbin/getty 9600 tty1. I tried to read the documentation about getty and to try different options, like specifying vt100 as an argument, but to no avail.

Maybe it’s done in the kernel? I thought. So I quickly went through my kernel config, and I find this gem:

CONFIG_VT

If you say Y here, you will get support for terminal devices with
display and keyboard devices. These are called "virtual" because you
can run several virtual terminals (also called virtual consoles) on
one physical terminal. This is rather useful, for example one
virtual terminal can collect system messages and warnings, another
one can be used for a text-mode user session, and a third could run
an X session, all in parallel. Switching between virtual terminals
is done with certain key combinations, usually Alt-<function key>.

The setterm command ("man setterm") can be used to change the
properties (such as colors or beeping) of a virtual terminal. The
man page console_codes(4) ("man console_codes") contains the special
character sequences that can be used to change those properties
directly. The fonts used on virtual terminals can be changed with
the setfont ("man setfont") command and the key bindings are defined
with the loadkeys ("man loadkeys") command.

You need at least one virtual terminal device in order to make use
of your keyboard and monitor. Therefore, only people configuring an
embedded system would want to say N here in order to save some
memory; the only way to log into such a system is then via a serial
or network connection.

If unsure, say Y, or else you won't be able to do much with your new
shiny Linux system :-)

That makes sense, now we are getting somewhere. So, in the past, there was this notion of ‘terminals’, which are specialized pieces of hardware that send and receive ASCII (or some codes), so they are the ones that have all the needed stuff to control the keyboard, screen, and so on. I never had the need to use one of these terminals, therefore I never really knew what a “virtual” terminal was. So a virtual terminal does the job of a real terminal: it needs an input driver and a display driver, and it puts them together to do something useful.

Interesting theory, but is it true? I quickly looked at the code inside drivers/tty/vt, and I found code to control the keyboard, screen, and also that contains the keyboard mappings. Excellent! So we found the thing that uses these keyboard mappings. Now what?

Before moving on, the fact that this code resides in tty is also helpful: basically, serial consoles, virtual terminals (the framebuffer console), and real terminals all operate through tty’s (teleprinters), which are the means of communication between hosts and these “devices”. So, getty really gets a tty to be used by any of these.

All right, so now the knowledge gap seems to be filled, what next? Well, clearly there’s something wrong with this virtual terminal, but what? I immediately started looking at the code in drivers/tty/vt/keyboard.c, and I noticed something interesting: kbdmode can have different values, like RAW, and in this mode, certain things are not handled, like ‘shift’. That looked promising, but how to change that mode?

I looked for tools to control the virtual terminal and found an interesting one, setterm, which was not available in my minimal system; I couldn’t find how to get it, and anyway it didn’t have any option that I wanted. terminfo and such looked interesting, but didn’t seem relevant to the issue at hand. Then I found kbd_mode, which obviously did what I wanted, and I already had it :). I couldn’t even type kbd_mode on my keyboard, so I had to write shortcuts on my PC (kbda, kbdb, etc.). Unfortunately, I found out that initially the mode was set to ‘unicode’, which seemed OK, and changing it to ‘ascii’ didn’t change anything.

So, it didn’t seem to be a user-space configuration of any sort. Next.

Getting our hands dirty

Time to actually type some code. I modified the code in ‘drivers/tty/vt/keyboard.c’, especially in kbd_keycode(), to find out the true kbdmode, the keycodes coming down, and how they were being interpreted.

I quickly found out that the keycodes and mode were indeed correct, but each and every key press was immediately followed by a key release, so shift+a was interpreted as shift, a. Now we are getting somewhere; the problem has to be in the input driver.

Maybe I chose the wrong one, or maybe I’m missing some configuration. I see CONFIG_KEYBOARD_GPIO and CONFIG_KEYBOARD_TWL4030, and it looks like I should be using TWL4030, as that’s the chip the Nokia N900 has, but I’m not sure, so it’s time to look at the N900 schematics. Well, it seems TWL4030 is indeed the right one, and there’s nothing more to it; either you have it or you don’t.

Maybe some recent change broke it… But there’s nothing recent that I can see that could do that. So it’s time to take a look at the actual code: drivers/input/keyboard/twl4030_keypad.c. After adding a few prints here and there I realize the problem starts with this code ret = twl4030_kpread(kp, &reg, KEYP_ISR1, 1) which returns 0 after a key press (that returns 1). So, time to read the TRM.

It took me some time to find the right document, and then understand what all the configuration options were actually doing (more or less), and then play around with them. After making a lot of more or less random changes I notice no difference in this particular problem of getting an extra ‘0’. So I think to myself; maybe the problem is the interrupt.

So I abandon the configuration of the keyboard, and look at the code to request the interrupt:
request_threaded_irq(kp->irq, NULL, do_kp_irq, 0, pdev->name, kp);

I had some interrupt issues before (in fact on this very keyboard), so I knew a few hacks I could try, like specifying the IRQF_ONESHOT flag. That didn’t help, so I tried to do that in the parent interrupt on ‘drivers/mfd/twl4030-irq.c’, because I saw a patch from Neil Brown on the linux-omap mailing list that fixed another issue, but that didn’t help either. Then I realized I saw one patch that affected this interrupt request (see here). So I revert the patch, and voilà, no more 0’s afterwards, and the keyboard works properly.

However, there’s a nasty warning about interrupts being enabled, which is probably the reason why the original patch happened, so I try a few random things to get rid of it, but nothing helps. So then I wonder, maybe the reason the keyboard driver worked before that patch is just pure luck, and these extra interrupts were not being detected properly.

Enough fooling around

Since I really want to fix this issue properly, I push myself to really understand what’s going on in the driver. So I slowly read all the documentation, and all the registers, and try to set different values to see what’s going on. While doing that, I noticed one function was calculating the times wrongly, and was telling the driver to use values twice as big as originally intended. It took me some time to figure out why the author chose 31 << (x + 1) instead of 2 << (x + 1) * 31, which is what a direct function conversion would return, until I realized that a shift basically means multiplying by two, and x << 1 is basically the same as (2 << 0) * x, but the author missed that x = ptv + 1, and so it should be 31 << (ptv + 2). Anyway, after being confident of these timeout values, I could set big timeouts on the range of seconds without overflowing the registers and see exactly what they were doing in timescales I could notice.

So yeah, these registers were doing what I thought they should be doing. Nice try.

Time to go back to the interrupt handling. After reading the code of twl4030-irq, which is supposed to fire the irqs that the keyboard driver eventually gets, and then reading some kernel core code as well, it was not really clear to me how these were all woven together, so I added some printfs.

Before going forward, I’ll briefly explain what the TWL4030 is. To my understanding it’s basically an integrated circuit that has many functions, one of them being a keyboard controller. So there is no dedicated line for the keyboard interrupt; the TWL4030 has one level interrupt, and from it the right module IRQ is demultiplexed. OK.

I was hoping there was a chain of actions like pih -> sih (from twl-core) -> keyboard, but no, there was only one action in the chain. Fortunately, when I reverted the patch I got a warning with a backtrace where I could see something like twl4030_irq_thread->generic_handle_irq->handle_simple_irq. It was really straightforward, but I couldn’t see why. There was a lot of code in twl4030-irq for a “secondary interrupt handler”, but it didn’t seem to be called at all, and I didn’t see how generic_handle_irq was calling handle_simple_irq.

Time to step back for a second. I remembered a patch from Neil Brown trying to fix something regarding how this sih stuff was called (here) because it was not called at all, but these patches are supposed to be fixes for other patches that are not applied at this point, so they really wouldn’t help.

At this point I’m pretty much stuck, as I’ve no idea how all this code is connected. You might think that all this is way over my head, and it might be, even if just a little, but that has never deterred me. I know that as long as I have the code and a way to run it, I can figure out how it works, so I just keep trying.

After reading the code more carefully, I noticed that only some twl “modules” had a sih setup, and the keyboard was not one of them, so that explained the sih part. And then I noticed a part of the irq handling that dealt with “chips”, and one interesting function that set up the chip’s “handler”. Wait a second… So there’s a chip handler and a client handler; a few keystrokes later and I find this gem: irq_set_chip_and_handler(i, &twl4030_irq_chip, handle_simple_irq). Finally! So I can now see the whole chain of events, but alas, I still have no idea why things are failing.

I thought maybe some states were miscalculated, so I found a function that is used to print everything related to an irq, but it was hidden in some internal code, so I had to do a few tricks to use it in the keyboard driver. All the states seemed to be OK, except one: the irq count, which showed something like 0, 0, 0, 1, 2, 2, 3 where it should have shown 0, 0, 1, 1, 2, 2. I looked at all the places where this counter is modified, and I found an interesting function, handle_nested_irq, so I replaced handle_simple_irq with it just for fun, and…

BAM!

Everything worked perfectly.

Why, Why, Why

So, now I found a fix, but why does it work? If I want to fix this in a way that other people will not get bitten, I have to find the proper fix, so “This seems to make things work here” usually doesn’t convince many kernel hackers.

Using handle_nested_irq makes the compiler throw a warning right away, because the number of arguments is different, so I looked at how other people use this function, and it’s indeed used very differently: it looks like people call handle_nested_irq directly, instead of generic_handle_irq, which would eventually call handle_simple_irq, because that’s the chip’s configured handler. It all boils down to the difference between handle_nested_irq and handle_simple_irq, and looking at them it’s clear that handle_nested_irq does much less; specifically, it directly calls the action->thread_fn() callback, which is the keyboard interrupt handler, rather than waking up the thread that runs the same function (but in a specialized thread).

And this finally shines some light: the reason the interrupt counter is wrong is that the thread is woken after the irq has been processed, so the code that is supposed to clear the interrupt hasn’t been called yet, and a second, spurious interrupt is generated. The relevant code is supposed to read a register, and the hardware clears the interrupt on a read (or write) operation. The reason handle_nested_irq works is that it doesn’t bother with the interrupt thread at all; it calls the function right away on the same thread, and thus the interrupt is marked as handled at the right time (after it has really been handled).

And this also explains why other people call handle_nested_irq instead of generic_handle_irq; you are supposed to call it inside an interrupt thread (which is configured with request_threaded_irq), but there’s a thread lying around that nobody uses. Indeed, Neil Brown noticed the same (here), so irq_set_nested_thread() will make sure that no extra thread is created.

So both irq handlers must be threaded, the one for the twl-core and the one for the keyboard, not just the one for the keyboard as it is right now. Fortunately, Felipe Balbi already fixed that (see this patch series), but that broke things badly, and then Neil Brown fixed them (see this). Interestingly enough, I had already tested those patches for other reasons, but at this point I wasn’t sure if they would fix the keyboard problems; I just assumed they would.

Good good, it looks like we finally have the whole picture, but the job’s not done yet.

Getting it “done”

Now the question is: how to fix this? One limitation of the ‘stable’ kernel trees (at least AFAIK) is that the patches must be in Linus’s latest tree, so whatever solution is found should be in the main tree.

Ideally, the whole stack of irq’s should be reorganized to use the nested/threaded API, or none at all. So this patch should have been applied only after this patch series, and not two years beforehand. I pondered for quite some time whether a middle ground could be found, either with workarounds in the core irq handler, or in the keyboard one; eventually I came to the conclusion that since the core is supposed to set irq_set_nested_thread() anyway (just trust me), the keyboard one is the one that should check whether it’s nested or not, and either set the thread handler, or a normal one, and this code should work both before and after the patches applied for 3.2. Unfortunately, the ‘nested’ flag is supposed to be internal to the kernel’s thread handling code.

So there’s really no way around it: either the original patch is reverted (along with possible similar ones for other twl modules), or the twl core irq handler gets all the new code from 3.2 backported, and that code isn’t even fixed yet (as of v3.2-rc5) (and maybe it never will be, the way things are progressing).

Then the question arises: is all that new code in v3.2 working properly? I went ahead and tried it, and lo and behold, the keyboard works as expected. I then backported all those patches to v3.1, and they all worked too.

But does it really make sense to apply all those patches to all the kernels after v2.6.33? Or does it make sense to revert a few of them? That’s not for me to decide, so after doing all that work I asked the community in this mail.

Sometimes the result of a week’s work ends up being a one line patch, sometimes it’s just one character, and in this case, there was no patch (at least from me), but at least the issue is clearly identified, as well as the fixes. And that means nobody has to do this work again.

Well, the fun is not over… We still need to synchronize the maintainers, so the right patches land on Linus’ tree for v3.2, and they are back-ported to the relevant linux-stable trees.

Kernel development is hard, let’s go shopping

What are you talking about? This isn’t even kernel development, it’s just some legwork 😉 I’m going to be Topper for a second and say “That’s nothing!”; there are way more complicated and challenging issues that kernel developers confront all the time. I thought this was interesting because of all the steps I had to go through, and because it involved things I had no idea about.

Was this worth my time? Well, I learned what a virtual terminal actually is, for real, and not as some vague notion, like when common people say “My browser? Yeah, I use Google… No?”. I also learned a bit more about the IRQ handling API in the Linux kernel, and BTW, all this threaded interrupt API started because of real issues and the removal of IRQF_DISABLED (nice LWN article about it here); that thread was fun to read years ago, even when I didn’t understand most of it, so maybe it’s time to read it again 🙂

For me, the best is the satisfaction of knowing that I really “got it”. I mean, I really understand what causes the problem, many of the possible solutions, including hacky and proper ones, and exactly what to do if I get bitten by a similar problem regarding threaded IRQ’s.

And now that the keyboard is fixed, on with the next problem on the N900 (which I had stopped working on, because I thought a functional keyboard would help to debug it, as opposed to shutting down, removing the MMC, plugging it into a PC, writing a script with the commands I want to run, unmounting, plugging it back, and booting [I can’t use USB networking for this issue to appear]).

Scrobbler for Maemo, now both on N900, and N9

Version 2.0 finally moved to Fremantle stable, so everybody can start using it 🙂

If you are not familiar with it, this package will see what music you are listening to on Maemo devices, and scrobble it to your favorite service: either last.fm, libre.fm, or both.

I already explained the features in an earlier blog entry, along with an explanation of how to make use of the “love” feature.

But now I also managed to port this to Harmattan, and it works perfectly on my Nokia N9. Interestingly enough, the new UI has a “favorite” feature directly integrated, it took me some time, as it’s not publicly documented, but I finally managed to hook into it, so everything works seamlessly 🙂

I was rather impressed by how easy it was to port: I was able to leave all the GLib bits intact, even libsoup is still supported, and libconio, so I only had to make changes for the new qmafw. Thanks to the Qt guys for using GLib’s mainloop by default; it certainly made things easier for me 🙂

Update

You can find a debian package here: maemo-scrobbler 2.0-2.

Then, you would need to create a file ~/.config/scrobbler like this:

[lastfm]
username=foo
password=bar
            
[librefm]
username=foo
password=bar

That’s it 🙂

N9 Swipe undocumented feature; activate sane behavior

Ed Page recently blogged about his idea to improve the Swipe UI. Fortunately for him, a bunch of people and I had the same idea inside Nokia 🙂

Update: apparently this is not present in the images distributed with the Nokia N950’s, it was introduced after 25-3.

If you open ~/.config/mcompositor/mcompositor.conf, you’ll see a bunch of swipe-action-foo configs, all of them set to “away” (by default).

You can, however, change them to something like:
swipe-action-up: switcher
swipe-action-down: close
swipe-action-left: events
swipe-action-right: launcher

And voilà! Now the action that happens depends on which direction you swipe. You need to kill mcompositor for the changes to take effect (send it SIGTERM or SIGINT, I don’t remember which one).

This was possible because I got fed up arguing about the benefits of swiping down to close applications, and decided to implement the thing myself. While doing that, I also decided to implement an idea that was flying around, which was to do something different depending on the direction of the swipe. It took me about two days to do, including making it configurable, mostly spent familiarizing myself with the code, which was split into multiple packages, and wrapping my head around C++, which I have avoided as much as possible. That was much less time than the time spent discussing, before and after I implemented the patches. Fortunately, after seeing the thing in action, many people jumped on the wagon, and at least swipe-down-to-close is available to the masses, which people seem to like. Guerrilla design 😉

However, after a lot of thought, I realized the configuration I mentioned above is ideal. And this is my rationale:

Rationale

The current design is inconsistent: when you swipe away, you never know in which desktop view you are going to end up, especially if you just unlocked the device. You might end up in ‘switcher’, or ‘launcher’, or ‘events’. Same action, different behaviour equals inconsistency.

There’s a simple solution: utilize the simple mental model required to navigate the desktop views.

Swipe mental model

In this way, it’s easy to know exactly what will happen when swiping in each direction, decreasing the number of actions needed from the user. The same action always produces the same behavior.

It is unclear if this mode is ever going to be officially supported on the device, despite the fact that I think it’s obviously superior (and Ed Page seems to agree, as he came up with the exact same idea by himself); maybe some people are worried about updating manuals and whatnot, but at least it’s trivial to activate, and anybody can create a 3rd-party app for that 😉

BTW, this is yet another reason why I distrust people telling me that I’m a “geek” and that “designers” know better when it comes to design. In the words of Elaine Morgan: yes, they can all be wrong; history is strewn with occasions when they all got it wrong.

Enjoy 🙂