Phylobinary Trees

by Scott Schneider

A fundamental tenet of modern biology is that life happened once. In principle, we can trace the lineage of everything alive today back to that occurrence. And, as a consequence, everything alive is related.

We can infer how related everything is by comparing the genomes of different species. Because the genomes for humans and chimps are “less different” than the genomes for humans and gorillas, we conclude that humans and chimps are “more related.” In this context, “more related” means that as we go backward in time, we encounter the most recent common ancestor between humans and chimps before we encounter the most recent common ancestor between humans and gorillas.

The generalization of this concept is the phylogenetic tree. Given a set of genomes, we can visualize the relatedness of the set of species by placing “more related” species fewer hops away in the tree from “less related” species. The convergence point for different species in the tree is their most recent common ancestor. The root of the phylogenetic tree is that first occurrence of life, the ancestor to everything alive.

Let’s create an analogy where biological species map to compilers, and individual organisms within a species map to programs created by a particular compiler. We’ll call this a phylobinary tree as “binaries” is common shorthand to mean “compiled, executable programs.” (This is not to be confused with the binary tree data structure.)

The full phylobinary tree would contain all existing compilers as nodes. If a compiler is an internal node in the tree, it means that that compiler was used to create another compiler—compilers are, after all, just programs like any other. For example, the gc compiler for Go is implemented in C. This means that every compilation of gc will be a child of the C compiler that created it. Note that this is a mechanical relationship, as opposed to a conceptual one. There are many attempts to show conceptual relationships between programming languages, where a parent-child relationship means that ideas in the parent language influenced similar ideas in the child language. Instead, phylobinary trees are about actual, compiled programs. The aforementioned gc compiler will have many entries in the phylobinary tree: one for each distinct binary.

“Distinct binary” does not mean that we count copies of a binary multiple times; copies are indistinguishable. If two binaries are identical, they map to the same place in the phylobinary tree. But, each compiler version, for each different architecture, for each different operating system, will have distinct entries in the phylobinary tree.

We, obviously, have not done the work necessary to construct the full phylobinary tree. (And I am deliberately saying “the” full phylobinary tree, not “a” full phylobinary tree; whatever it looks like, there is only one full phylobinary tree at any given time.) However, we can still reason about its properties.

Roots and Forests

First, unlike the full phylogenetic tree, the full phylobinary tree will not have a single root. A root in the full phylobinary tree is a compiler that was not compiled by another compiler; it was created by hand in assembler. (We could extend the definition of our phylobinary tree to include assemblers, but the tree would be much messier, and I don’t think we would gain any extra insights.) Certainly, many compilers were created by hand—decades ago, it was a necessity. And it is very likely that many programs that exist today can be traced back to distinct compilers that were created by hand. An interesting question, then, is how many roots does the full phylobinary tree for this moment in time contain? We can’t answer this question without constructing the full tree.

In fact, if we look at all programs in current use now, the “tree” will actually be a forest with many disjoint trees. Any program that is not a compiler and written directly in assembly will be a completely disjoint node, with no other connections. (Note that this is necessarily true by how we have constructed the tree: only compilers have children, so a non-compiler cannot have children.) And there will likely be many compiler trees with no connections to each other.

My intuition is that the largest trees will involve C compilers. But how many trees, disjoint nodes, and total roots will the entire phylobinary graph have? I do not know. My guess is that the vast majority of running programs will exist on only a dozen or so trees, and trace back to about that many roots.

Tree Construction

While the easiest way to construct a phylobinary tree would be to keep records external to the binaries themselves of which compiler produced which executable program, it is not necessary. Genes are often compared to computer code. But we can go the other way, and analyze binaries in a similar way to how we analyze genes. The same compiler will tend to produce similar segments of code, even for different programs. In fact, the security community already takes advantage of this fact when analyzing malware. They commonly identify not just the source language from malware binaries, but even the particular compiler that produced it. When analyzing Duqu, Kaspersky Labs concluded that some of the code was C++, compiled by Microsoft Visual Studio C++. However, the majority of the binary code was something they had never seen before. They asked the community for help, and the conclusion was that it was an object-oriented dialect of C, also compiled by Microsoft Visual Studio C++.

If static analysis of binaries is akin to genetic analysis, then observing runtime behaviors of programs is akin to phenotype analysis. (A phenotype is some observable characteristic of an organism, which can be either a physical trait or a behavior.) There are academic projects which attempt to characterize the runtime behavior of malware, such as the BitBlaze project at UC Berekley. (I found their papers on extracting reusable sections of binaries, inferring protocols, and formal methods particularly interesting.)

I am unaware of anyone using a combination of these techniques to create phylobinary trees. I think it is clearly possible.

Cycles

The analysis by Kaspersky Labs raises another wrinkle: not all programs are the product of a single compiler. Many programs are linked together into a single binary from many different binary object files. Such binary object files can come from distinct compilers (and those compilers may even be for different languages). This wrinkle means that some programs will have multiple parents in the phylobinary tree.

Note that cycles are technically possible, so phylobinary trees are not strictly trees, but general graphs. Cycles can happen in one of two ways: multiple compilers producing a single compiler, and trickery. If compiler A produces compilers B0 and B1, and then both B0 and B1 are used to separately compile compiler C, then we have a cycle. This situation is unlikey, but not unrealistic.

The trickery way involves crafting programs to deliberately create cycles. Quines are programs that reproduce their own source code. But quines are about source code; we want a program that when it’s run, generates a copy of itself—not the source code, but an actual reproduction of the binary. I have only found one such example. That program, then, will be a cycle back onto itself. And, if 50 language quine-relays are possible, then it is likely that such quine binary relays are also possible. So we will not be limited to just single nodes that cycle back on themselves, but it should also be possible to have arbitrarily long chains that eventually end in a cycle.

It’s also interesting to note something that does not introduce new roots into trees: cross compilation. It’s tempting to assume that every disjoint tree in the phylobinary forest necessarily represents a single computer architecture, but that is not true. A compiler running on an x86 processor can still compile programs for an ARM processor. An x86 compiler can even compile an ARM compiler. No new roots are introduced by this process.

What might be true is that disjoint trees are very likely to contain programs from a single operating system. Note that in this binary-centric view of computing, source code similarity does not matter. If I port a compiler from one operating system to another, the resultant compiler will be a child of the compiler used to produce it, not the original compiler. That’s potentially confusing, so let’s be precise. Let’s assume I have a compiler for language Foo, which runs on Linux. The Foo compiler is implemented in C, and on Linux, I use a version of gcc. Now, I port the Foo compiler to Windows, and I use the Microsoft Visual Studio C++ compiler to compile my Foo compiler for Windows. My Foo compiler under Windows has no phylobinary relationship to the Foo compiler (or the gcc compiler) on Linux.

What is necessary to bridge such operating system disjoint trees in our phylobinary forest are cross operating system compiles. Such instances will be common with embedded devices, including iOS and Android. (And if you think that is interesting, you must look at what a Canadian Cross is.)

Conclusion

While I find the concept of phylobinary trees interesting, I’m not sure what use they are. Certainly some of the ideas are important for security, as previously mentioned—but they seem to get by just fine without a formalized framework. Perhaps it would be useful to be able to say to each other, “There is no known phylobinary root for this program” instead of “I do not know what compiler produced this program.” But perhaps not. The use could be less scientific, and more historical. For example, in a hundred years, will running programs be in a phylobinary tree with compilers that exist now? I think that’s likely. And I think it would be interesting if we actually had the information in a hundred years to answer that question.

Thanks to Ben DeVane, Dan Gackle and Will Slade for reviewing drafts of this essay.