I'm thirty seven years old, which is like ninety nine in programmer years. I'm old enough to remember the earliest days of the public Internet and the first boutique Internet service providers. My first online account was via one called Internet Access Cincinnati (IAC). It provided dialup modem access to a Sun SparcStation 10 where users could run such venerable old terminal applications as elm (a mail client), emacs, lynx (text-mode web browser), and of course IRC.
Later they added the ability to dial into a CSLIP (predecessor to PPP) terminal server and connect your own Linux or Trumpet WinSock equipped Windows system directly to the Internet with a real bona-fide IP address.
But back to that SparcStation. It had two CPUs that ran at a whopping 33mhz (0.033ghz) and could hold up to 512mb of RAM, though I somehow doubt that IAC's SparcStation had this much. RAM was expensive back then. With these meagre resources it served 50-100 concurrent active users, provided e-mail for tens of thousands, ran an IRC chat service, served early HTTP 1.0 via NCSA HTTPd, and served as a voluntary open FTP mirror for Slackware Linux. It handled the load pretty well most of the time and routinely clocked uptimes of 1-2 months.
I bet you feel a rant about software bloat coming on right about now. If you do, you're right. But unlike most such rants, this one puts forward a logical hypothesis that I think might explain a whole lot of bloat as the consequence of a very simple early design decision gone awry.
I brought up the SparcStation because I'd like to get started by asking a stupid question: why do we need virtualization? Or containers for that matter? How did we end up with the nested complexity explosion of OS->VM->containers->... instead of the simplicity of multi-user operating systems?
Virtualization is Expensive
ZeroTier (my startup, plug plug) runs on a sprawling cloud infrastructure across many data centers, providers, and continents. Most of the nodes that make up its cloud presence are there to provide stable anchor points on the Internet and last resort relaying. They push packets around. They use bandwidth and a bit of CPU but very little memory or storage. Taking one of its "TCP fallback relays" as an example: it pushes about 5-10 megabits of traffic most of the time but the software it runs only needs about 10mb of RAM and less than a megabyte (!!!) of disk space. It also utilizes less than 1% of available CPU.
The virtual machine it runs on, however, occupies about 8gb of disk and at least 768mb of RAM. All that space is to store an entire CentOS Linux base installation, an entire compliment of standard applications and tools, system management services like systemd and cron, and an OpenSSH server for remote access. The big RAM footprint is a direct consequence of virtualization a.k.a. "a hack to trick a kernel into thinking it's on its own hardware." All that memory has to be available because kernels expect to be on stand-alone systems with contiguous local memory and the hypervisor has to comply. (Modern hypervisors can overbook memory to some extent, but doing this too much risks crashes or awful performance under spike conditions.)
Containers are a bit more efficient than hypervisors. They skip on the whole Linux kernel and its memory reservation requirements and hypervisor overhead, but they still generally bundle at least a subset of an entire Linux installation to accomplish little more than the execution of one or a few programs. A container to run a 10mb app might clock in at a few gigabytes, and the management of containers requires its own big fat heap of software.
Ask anyone why we need all this and they will answer with some combination of security, isolation, and scalability, and they would be correct.
Back in the old days before virtualization and containers, a company like ZeroTier would have to colocate its own hardware in data centers. That's inconvenient and not very economical. Virtualization lets you take a single huge machine (my 768mb VMs probably run on 16-24 core Xeon beasts with 256gb+ RAM) and serve hundreds of customers with it.
Hundreds of customers like... umm... that old 33mhz SparcStation...?
Today's software is orders of magnitude larger and more complex than the stuff that old IAC shell server ran, and while some of that is the result of abstraction and bloat much of it stems from actual advances in capability and huge increases in data set size. I am not arguing that we should be able to cram hundreds of today's typical customer workloads on a coffee machine. But I will argue that we ought to be able to cram quite a few Wordpress sites (one example of a common work load you'd host on a small VM) on a Raspberry Pi, a machine with roughly 100 times the CPU power, several times the RAM, and 10-20 times the disk storage.
A Pi costs $30 and uses less than 15 watts of power. Care to speculate on what a single monstrous VM host node costs and uses?
Let's do a short little thought experiment where we try to set up a Pi as a shell server for a group of users who want to run Wordpress blogs. A web server and small database together use maybe 10-20mb of RAM and our Pi has 1024, so it ought to be possible to host at least 50 small to moderately sized sites. (In reality a lot of that RAM is redundant or inactive so with swap and KSM a Pi could probably host a few hundred, but let's be conservative.)
First we install Linux. Linux, a flavor of Unix, is a multi-user operating system (right?), so we create 50 accounts. Each of these users can now log in. An incoming SSH session and a shell uses only about a megabyte or two of memory and our Pi has over a thousand, so at this point everything should be going great.
Yeah, yeah, ignore security for now. We'll touch on that later. But if none of our users are badly behaved they should all now be downloading and unpacking Wordpress. The problems don't begin until they get to the installation instructions.
Step one: set up a MySQL database.
Ahh... that's easy! Type "sudo apt-get install..."
Wait... sudo? Giving sudo rights means you might as well not have separate users at all.
Turns out you can install MySQL in your own home directory. If you're willing to compile it or manually unpack the Debian package and edit some config files you can run it yourself without any special privileges. You'll have to make it use folders under /home/yourname/... but eventually you'll have your own local MySQL server running in your own local user-space.
Step two: configure your web server to run PHP.
Grumble... "sudo apt-get install" is out once again so out we go to download all the requisite components and put them together and get them up and running. Turns out this too can be done and after a bit more yak shaving you've got your own web server with PHP ready to go.
Then you get something like "bind: permission denied". Huh? After searching a bit you discover that port 80, the default web port, is a privileged port. You have to be root to bind to it.
No matter. Just change it. But that now means everyone has to enter your site's web address with a stupid :### at the end of the host name to connect to a non-standard port.
If you're British you might be saying this word right now. You probably just realized that without the privileged ports restriction everything meaningful on a Unix system would not have to run as root and you could at the very least do a much better job of securing system services. But that's not why I titled this section bollocks. I titled it bollocks because in order to understand the wider implications of the privileged ports restriction we are going to digress for a minute and talk about bollocks.
The testicles are a pretty vital organ. Locating them in a little bag outside a male's body is stupid, not to mention rather silly looking. The females of the species are more logically built: their ovaries are deep inside their abdomens where they're much better shielded from kicks, punches, swords, fence posts, and mountain lions.
Path dependence, a concept from economics and evolutionary biology, attempts to explain why such bizarre design decisions happen. In a nutshell: decisions made in the past can constrain what decisions can easily be made in the present.
A male's testes are outside his body because certain enzymes in human sperm operate best at temperatures slightly below the normal human core body temperature. That "decision" was made long ago, probably (according to the simplest and thus most likely hypothesis) back when we were either not warm blooded at all or ran at a lower body temperature. Reproductive organs are what evolutionary biologists would call a highly conserved system, meaning they don't tend to change often. That's because any mutation that mucks around with reproduction has a high probability of evicting said organism from the gene pool. As a result, it was "easier" (in an evolutionary probabilistic graph traversal sense) for a man's bollocks to end up in a nutty looking sack between his legs than it was to go back and effectively re-engineer sperm to develop and keep well at a higher temperature.
Even if the mutational path to sperm able to tolerate higher temperatures was just as likely as the one that led to the slang phrase "hanging out," maybe it just didn't happen. Maybe those particular mutations just by chance never occurred, leaving the evolutionary learning system with no choice but to take the... uhh... low road. Okay, okay, I'm done.
Technological evolution is guided by intelligent (or so we imagine) agents, but in many ways it's a lot like biological evolution. Once a decision is made, future decisions are more likely to assume it and continue down the chosen path. Sometimes that's because changing the old decision is more costly than living with it, while other times it's just because of simple inertia.
Take the weird little round 12-volt DC plugs in cars. Originally they were for little electric heating element cigarette lighters. When portable electronics got popular, engineers were faced with the question of how to make it so you could plug them into your car. Getting auto makers to install a plug or coaxing users into having one installed themselves by a mechanic is hard, probably a non-starter, but since nearly all cars had cigarette lighters... well... the answer was obvious. Make a plug that fits into the lighter hole. Now cars often don't even come with a cigarette lighter, but they still have the plug the lighter fit into because where else would you plug in your smart phone?
(USB is slowly replacing the lighter plug, but most cars still have them.)
The Path Not Taken
At this point you probably understand where we're going next. Unix was originally designed as a multi-user time sharing operating system, and toward this end a number of features were added to allow all the resources of a system to be shared. It has a file permission system with users, groups, and on newer systems finer grained ACLs. It has quotas to allow per-user memory, disk, and CPU usage to be limited to prevent one user from totally monopolizing the system. It has process isolation, memory protection, and so on. But for whatever reason nobody ever thought to adapt its networking layer to accommodate the needs of network service multi-tenancy. That's probably because during the time that such needs arose privileged ports were still meaningful and relevant. It might also be because nobody thought of it.
Privileged ports were originally a security feature. Back when computers were all owned by institutions and managed by designated system administrators and networks were closed to third party devices (to the extent such devices even existed), a range of ports that could only be bound by root allowed any system service to know whether or not an incoming connection originated from a program "blessed" by an institution's system administrators.
Back in the early 1990s when the Internet started to go public, many systems and networks still looked like this and many systems still relied on privileged ports as a critical authentication mechanism. So when web sites started to proliferate and everyone started wanting to run their own web servers on their own IP addresses, it was more conceptually obvious and less disruptive to implement operating system virtualization and leave the rest of the stack untouched.
Operating system virtualization is useful for other reasons such as testing and debugging of system software or the support of legacy software that must run against older (or different) OSes and OS kernel versions, and this added additional fuel to the fire. In the end instead of adapting Unix networking to the needs of multi-tenancy we just put boxes into boxes and moved on.
... at great cost in hardware and energy. I'm going to hazard a guess that the same 16-24 core Xeon with ~256gb of RAM that probably hosts less than a hundred 768mb VMs could probably host thousands of user workloads if those workloads ran directly on the same kernel without the cost of a hypervisor or the bloat of containers. How much CO2 might we not be emitting into the atmosphere if every data center could be cut down to ~1/10th of its current size?
Container solutions like Docker get us part of the way there. In a sense you could argue that containerization is precisely the multi-tenancy solution I'm heading toward, except that it borrows heavily from the legacy path of virtualization by treating system images like giant statically linked binaries. It's also a step backward in usability. I still need root to launch a container. I can't (easily) log into one container and launch another like I can do with processes on a simple multi-tenant Unix machine. Instead we have to build massive central management consoles and "orchestration" tools.
So what might have been done? What might the path not taken look like?
In some cases path dependence arises because so many new decisions have built upon an old decision that it's too costly to go back and change it. I don't think that's the case here. In this case I think a few simple changes could rewind 20 years of devops history and set us on a path toward a much simpler and radically more efficient design. It wouldn't solve every problem but it would eliminate the need for quite a bit of complexity for the majority of common use cases.
Step one: eliminate privileged ports. I'm guessing this would be a 1-2 line change, probably the removal of an "if" statement. Anything still relying on a non-cryptographic credential like a port number for security is broken, since such things are trivial to spoof.
Step two: extend user and group permissions and ownership into the realm of network resources, allowing UID/GID read/write/bind (bind in place of execute) permission masks to be set for devices and IPs. By default a bind() call that does not specify an address (0.0.0.0 or ::0) will listen for packets or connections on all interfaces that the current user has the appropriate permission to bind and outgoing connections will default to the first interface the user owns.
Step three: It would probably be a good idea to allow user space creation of virtual network devices (tun/tap) for the same reason users can create their own processes and files. This isn't essential but would be nice to have. The permissions around things like tcpdump/pcap would also have to be modified to respect the new network resource permission model.
Now any service can be run as a normal user. Root access is only needed for core system upgrades, hardware or driver changes, and user management.
The Road Not Taken is Overgrown
Since we chose the path of virtualization and containerization we've allowed the multi-tenancy facilities in Unix to atrophy and it would take a little bit of work to bring them back into form. But I think it might be worth it, both in simplicity and overhead.
We'd have to go back and harden its user-mode security. I don't think this is as hard as people think. Virtualization doesn't offer as little surface area as people think, and attacks like row hammering show that it isn't a panacea. We've invested incredible amounts of developer time in making virtualization robust and secure and in building tool chains for it, not to mention how much is now being invested in the container ecosystem. I think a fraction of that effort could harden the user-space of a minimal host Linux system against privilege escalation and leakage.
We'd have to tighten up other aspects of isolation and implement options to limit what users can see via commands like ps and netstat to their own resources, alter package managers to allow installation of packages into a subtree of the user's home directory if the user is not root, etc. Changes would also probably be needed to system elements like the dynamic linker to make it easier for user binaries to prefer shared libraries in their own local "arena" over system libraries if such were present. It would be good if system init services supported user-configured user services too so users didn't have to improvise watcher scripts and cron jobs and other hacks.
The end result would end up looking a little bit like containerization but without the clunkiness, bloat, wheel-reinvention, and inconvenience. You'd be able to keep deployments in git and deploy them with git checkout and git pull using ssh with orchestration issues handled either locally or in a peer to peer manner, and you'd be able to just log into a machine and launch something without having to go through some kind of complicated container management infrastructure. You'd also have lighter weight deployments since unless there were a compelling reason to do so (like library compatibility issues) most software could just use the core shared libraries (libc, stdc++, etc.) and standard toolchain on the host system.
This is all probably water under the bridge. Chances are the path forward will be to develop true secure container multi-tenancy and to achieve with containers what should have been achieved by extending the Unix permission model to networking in user space.
The purpose of this post is to show how small decisions that nobody really thinks about can have dramatic effects on the future evolution of technology (and society). The 1970s decision to use port numbers as an in-band signaling mechanism to implement cross-system security validation might have been, in retrospect, a trillion dollar mistake that pushed the evolution of the Unix platform down a path of significantly greater complexity, resource use, and cost.
But hey, maybe it's not a done deal yet. There's over a dozen Linux distributions and most of them are doing more or less the same things with a slightly different spin. Implementing something like this would be an interesting way for one of them to differentiate. The first step would be to implement networking permissions something like what was discussed above and to propose it as a kernel patch. For backward compatibility you could make it something enabled via a sysctl setting, or maybe a module (if modules can make changes that deep).