The Linux way; never ever break user experience

Through the years it has become more and more obvious to me that there’s two camps in open source development, and one camp is not even aware of how the other camp works (or succeeds, rather), often to their own detriment. This was blatantly obvious in Miguel de Icaza’s blog post What Killed the Linux Desktop, in which he accused Linus Torvalds for setting the attitude of breaking API’s to the developer’s heart content without even realizing that they (Linux kernel developers) do the exact opposite of what he claimed; Linux never ever breaks user-space API. This triggered a classic example of many thorned discussions between the two camps, which illustrates how one side doesn’t have a clue of how the other side operates, or even; that there’s entirely different way of doing things. I will name the camps “The Linux guys” (even though they don’t work strictly on the Linux kernel), and “The user-space guys”, which is people that work on user-space, GNOME being one of the peak examples.

This is not an attempt to put people in two black-and-white categories, there’s a spectrum of behaviors, and surely you can find people in a group with the mentality of the other camp. Ultimately it’s you the one that decides if there’s a divide in attitudes or not.

The point of this post is to explain the number one rule of Linux kernel development; never ever break user experience, why that is important, and how far off the user-space camp is.

When I say “Linux” I’m referring to the Linux kernel, because that’s the name of the kernel project.

The Linux way

There are many exhaustive details of what makes the Linux kernel development work on a day-to-day basis, and many reasons for why it is the way it is, and why it’s good and desirable to work in that way. But for now I will simply concentrate on what it is.

Never ever break user experience. This point cannot be stressed enough. Forget about development practices, forget about netiquette, forget about technical competence… what is important is the users, and to never break their expectations. Let me quote Linus Torvalds:

The biggest thing any program can do is not the technical details of the program itself; it’s how useful the program is to users.

So any time any program (like the kernel or any other project), breaks the user experience, to me, that’s the absolute worst failure that a software project can make.

This is a point that is so obvious, and yet many projects (especially big ones), often forget; that they are nothing without their user-base. If you start a small project all by yourself, you are painfully aware that in order for your project to succeed, you need users. Without users you cannot get more developers, and your project could very well disappear from the face of the Earth, and nobody would notice. But once your project is big enough, one or two users complaining about something start being less of an issue, in fact, hundreds of them might be insignificant, and at some point, you loose any measure of what percentage of users are complaining, this problem might grow to the point that developers say “users don’t know what they want”, in order to ignore the importance of users, and their needs.

But that’s not the Linux way; it doesn’t matter if you have one user, or ten, or millions, your project still succeeds (or fails) because of the same reason; it’s useful to some people (or not). And if you break user experience, you risk that usefulness, and you risk your project being irrelevant for your users. That is not good.

Of course, there are compromises, sometimes you can do a bit of risk analysis: OK; this change might affect 1% of our current users, and the change would be kind of annoying, but it would make the code so much more maintainable; let’s go for it. And it’s all about to where you draw the line. Sometimes it might be OK to break user experience, if you have good reasons for it, but you should really try to avoid it, and if you go forward, provide an adjustment period, a configuration for the old behavior, and even involve your users in the whole process to make sure the change is indeed needed, and their discomfort is minimized.

At the end of the day it’s all about trust. I use project X not only because it works for me, but because I trust that it will keep working for me in the years to come. If for some reason I expected it to break next year, I might be better off looking for something else right now that I trust I could keep relying on indefinitely, than having project X break on me while I’m on the middle of a deadline, and I don’t have time for their shenanigans.

Obvious stuff, yet many project don’t realize that. One example is when the udidks2 project felt they should change the address of the mount directories from `/media/foo`, to `/run/media/$user/foo`. What?! I’m in the middle of something important, and all of a sudden I can’t find my disks’ content in /media? I had to spend a considerable amount of time until I found the reason; no, udisks2 didn’t had a bug; they introduced this change willingly and knowingly. They didn’t give any deprecation warning while they moved to the new location, they didn’t have an option to keep the old behavior, they just moved it, with no explanation, in one single commit (here), from one version to the next. Am I going to keep using their project? No. Why would I? Who knows when would be the next time they decide to break some user experience unilaterally without deprecation warnings or anything? The trust is broken, and many others agree.

How about the Linux kernel? When was the last time your Linux kernel failed you in some way that it was not a bug, but that the developers knowingly and willingly broke things for you? Can’t think of any? Me neither. In fact, people often forget about the Linux kernel, because it just works. The external drivers (like NVIDIA or AMD) is not a problem of the kernel, but the drivers themselves, and I will explain later on. You have people bitching about all kinds of projects, and threatening forks, and complaining about the leadership, and whatnot. None of that happens with the Linux kernel. Why? Because it just works. Not for me, not for 90% of the users, for everybody (or 99.99% of everybody).

Because they never ever break user experience. Ever. Period.

The deniers

Miguel de Icaza, after accusing Linus not maintaining a stable ABI for drivers, went on arguing that it was kernel developers’ fault of spreading attitudes like:

We deprecated APIs, because there was a better way. We removed functionality because “that approach is broken”, for degrees of broken from “it is a security hole” all the way to “it does not conform to the new style we are using”.

What part of “never ever break user experience” didn’t Icaza understand? It seems he only mentions the internal API, which does change all the time in the Linux kernel, and which has never had any resemblance of a promise that it wouldn’t (thus the “internal” part), and ignoring the public user-space API, which does indeed never break, which is why you, as a user, don’t have to worry about your user-space not working on Linux v3.0, or Linux v4.0. How can he not see that? Is Icaza blind?

Torvalds:

The gnome people claiming that I set the “attitude” that causes them problems is laughable.

One of the core kernel rules has always been that we never ever break any external interfaces. That rule has been there since day one, although it’s gotten much more explicit only in the last few years. The fact that we break internal interfaces that are not visible to userland is totally irrelevant, and a total red herring.

I wish the gnome people had understood the real rules inside the kernel. Like “you never break external interfaces” – and “we need to do that to improve things” is not an excuse.

Even after Linus Torvalds and Alan Cox explained to him how the Linux kernel actually works in a Google+ thread, he didn’t accept anything.

Lennart Poettering being face to face with both (Torvalds and Cox), argued that this mantra (never break user experience) wasn’t actually followed (video here). Yet at the same time his software (the systemd+udev beast) recently was criticized for knowingly and willingly breaking user experience by making the boot hang for 30s per device that needed firmware. Linus’ reply was priceless (link):

Kay, you are so full of sh*t that it’s not funny. You’re refusing to
acknowledge your bugs, you refuse to fix them even when a patch is
sent to you, and then you make excuses for the fact that we have to
work around *your* bugs, and say that we should have done so from the
very beginning.

Yes, doing it in the kernel is “more robust”. But don’t play games,
and stop the lying. It’s more robust because we have maintainers that
care, and because we know that regressions are not something we can
play fast and loose with. If something breaks, and we don’t know what
the right fix for that breakage is, we *revert* the thing that broke.

So yes, we’re clearly better off doing it in the kernel.

Not because firmware loading cannot be done in user space. But simply
because udev maintenance since Greg gave it up has gone downhill.

So you see, it’s not that GNOME developers understand the Linux way and simply disagree that’s the way they want to go, it’s that they don’t even understand it, even when it’s explained to them directly, clearly, face to face. This behavior is not exclusive to GNOME developers, udisks2 is another example, and there’s many more, but probably not as extreme.

More examples

Linus Torvalds gave Kay a pretty hard time for knowingly and willingly introducing regressions, but does Linux fares better? As an example I can think of a regression I found with Wine, after realizing the problem was in the kernel, I bisected the commit that introduced the problem and notified Linux developers. If this was udev, or GNOME, or any other crappy user-space software, I know what their answer would be: Wine is doing something wrong, Wine needs to be fixed, it’s Wine’s problem, not ours. But that’s not Linux, Linux has a contract with user-space and they never break user experience, so what they did is to revert the change, even though it made things less-than-ideal on the kernel side, that’s what was required so you, the user, doesn’t experience any breakage. The LKML thread is here.

Another example is what happened when Linux moved to 3.0; some programs expected a 2.x version, or even 2.6.x, these programs were clearly buggy, as they should check that the version is greater than 2.x, however, the bugs were already there, and people didn’t want to recompile their binaries, and they might not even be able to do that. It would be stupid for Linux to report 2.6.x, when in fact it’s 3.x, but that’s exactly what they did. They added an option so the kernel would report a 2.6.x version, so the users would have the option to keep running these old buggy binaries. Link here.

Now compare the switch to Linux 3.0 which was transparent and as painless as possible, to the move to GNOME 3. There couldn’t be a more perfect example of a blatant disregard to current user experience. If your workflow doesn’t work correctly in GNOME 3… you have to change your workflow. If GNOME 3 behaves almost as you would expect, but only need a tiny configuration… too bad. If you want to use GNOME 3 technology, but you would like a grace period while you are able to use the old interface, while you adjust to the new one… sucks to be you. In fact, it’s really hard to think of any way in which they could have increased the pain of moving to GNOME 3. And when users reported their user experience broken, the talking points were not surprising: “users don’t know what they want”, “users hate change”, “they will stop whining in a couple of months”. Boy, they sure value their users. And now they are going after the middle-click copy.

If you have more examples of projects breaking user experience, or keeping it. Feel free to mention them in the comments.

No, seriously, no regressions

Sometimes even Linux maintainers don’t realize how important this rule is, and in such cases, Linus doesn’t shy away from explaining it to them (link):

Mauro, SHUT THE FUCK UP!

It’s a bug alright – in the kernel. How long have you been a
maintainer? And you *still* haven’t learnt the first rule of kernel
maintenance?

If a change results in user programs breaking, it’s a bug in the
kernel. We never EVER blame the user programs. How hard can this be to
understand?

> So, on a first glance, this doesn’t sound like a regression,
> but, instead, it looks tha pulseaudio/tumbleweed has some serious
> bugs and/or regressions.

Shut up, Mauro. And I don’t _ever_ want to hear that kind of obvious
garbage and idiocy from a kernel maintainer again. Seriously.

I’d wait for Rafael’s patch to go through you, but I have another
error report in my mailbox of all KDE media applications being broken
by v3.8-rc1, and I bet it’s the same kernel bug. And you’ve shown
yourself to not be competent in this issue, so I’ll apply it directly
and immediately myself.

WE DO NOT BREAK USERSPACE!

The fact that you then try to make *excuses* for breaking user space,
and blaming some external program that *used* to work, is just
shameful. It’s not how we work.

Fix your f*cking “compliance tool”, because it is obviously broken.
And fix your approach to kernel programming.

And if you think that was an isolated incident (link):

Rafael, please don’t *ever* write that crap again.

We revert stuff whether it “fixed” something else or not. The rule is
“NO REGRESSIONS”. It doesn’t matter one whit if something “fixes”
something else or not – if it breaks an old case, it gets reverted.

Seriously. Why do I even have to mention this? Why do I have to
explain this to somebody pretty much *every* f*cking merge window?

This is not a new rule.

There is no excuse for regressions, and “it is a fix” is actually the
_least_ valid of all reasons.

A commit that causes a regression is – by definition – not a “fix”. So
please don’t *ever* say something that stupid again.

Things that used to work are simply a million times more important
than things that historically didn’t work.

So this had better get fixed asap, and I need to feel like people are
working on it. Otherwise we start reverting.

And no amount “but it’s a fix” matters one whit. In fact, it just
makes me feel like I need to start reverting early, because the
maintainer doesn’t seem to understand how serious a regression is.

Compare and contrast

Now that we have a good dose of examples it should be clear how the difference in attitudes from the two camps couldn’t be more different.

In the GNOME/PulseAudio/udev/etc. camp, if a change in API causes a regression on the receiving end of that API, the problem is in the client, and the “fix” is not reverted, it stays, and the application needs to change, if the user suffers as a result of this, too bad, the client application is to blame.

In the Linux camp, if a change in API causes a regression, Linux has a problem, the change is not a “fix”, it’s a regression and it must be reverted (or otherwise fixed), so the client application doesn’t need to change (even though it probably should), and the user never suffers as a result. To even hint otherwise is cause for harsh public shaming.

Do you see the difference? Which of the two approaches do you think is better?

What about the external API?

Linux doesn’t support external modules, if you use an external module, you are own your own. They have good reasons for this; all modules can and should be part of the kernel, this makes maintenance easy for everybody.

Each time an internal API needs to be changed, the person that does the change can do it for all the modules that are using that API. So if you are a company, let’s say Texas Instruments, and you manage to get your module into the Linux mainline, you don’t have to worry about API changes, because they (Linux developers), would do the updates for you. This allows the internal API to always be clean, consistent, relevant, and useful. As an example of a recent change, Russell King (the ARM maintainer), introduced a new API to set the DMA mask, and in the process updated all users of dma_set_mask() to use the new function dma_set_mask_and_coherent(), and by doing that found potential bugs in many instances. So companies like Intel, NVIDIA, and Texas Instruments, benefit from cleaner and more robust code without moving a finger, Rusell did it all in his 51 patch series.

In addition, by having all the modules on the same source tree, when a generic API is to be added, it’s easy to consider all possible use-cases, because the code is readily available. An example of this is the preliminary Common Display Framework, which takes into consideration drivers from Renesas, NVIDIA, Samsung, Texas Instruments, and other Linaro companies. After this framework is done, all existing display drivers would benefit, but things would be specially easier for future generations of drivers. It’s only because of this refactoring that the amount of drivers supported by Linux can grow without the amount of code exploding uncontrollably, which is one of the advantages Linux has over Windows, OSX, and other operating systems’ kernels.

If companies don’t play along in this collaborate effort, like is the case with NVIDIA’s and AMD’s proprietary drivers, is to their own detriment, and there’s nobody to blame but those companies. Whenever you load one of these drivers, Linux goes immediately into a tainted mode, which means that if you find problems with the Linux kernel, Linux developers cannot help you. It’s not that they don’t want to help, but it’s that they might be physically incapable. If a closed-source module has a bug and corrupts memory on the kernel side, there is no way to find that out, and might show as some other module, or even the core itself crashing. So if a Linux developer sees a crash say, on a wireless driver, but the kernel is tainted, there is only so much he can do before deciding it’s not worth his time to investigate this issue which has a good chance of being caused by a proprietary driver.

Thus if a Linux update broke your NVIDIA driver, blame NVIDIA. Or even better, don’t use the proprietary driver, use noveau.

Conclusion

Hopefully after reading this article it would be clear to you what is the number one rule of Linux kernel development, why it is a good rule, and why other projects should follow it.

Unfortunately it should also be clear that other projects, particularly those related to GNOME, don’t follow it, and why that causes such backlash, controversy, and forks.

In my opinion there’s no hope in GNOME, or any other user-space project, being nearly as successful as Linux if they don’t follow the simplest most important rule. Linux will always keep growing in importance and development power, and these others are forever doomed to forks, nearly identical alternatives, and their developers jumping ship after their trust gets broken. If only they would follow this simple rule, or at least understand it.

Bonus video

Tux

Advertisements