TreePPL

TreePPL is a probabilistic programming language for phylogenetics. Phylogenetics is concerned with inferring the long-term (millions of years) evolutionary relationships between populations of organisms. It can be contrasted with its sister field, population genetics, which studies changes in allele frequencies within populations over more recent time scales (thousands of years). Little to no genetic data is available over long time scales; the only direct evidence comes from paleontological finds, which tell us something about the morphology (shape) of the organisms but not about their genetic makeup. There is, however, an abundance of genomic sequence data for extant taxa (species). The statistical models involved therefore often have a large input dataset but also a large hidden space to explore. Historically, the field progressed from maximum-parsimony models to maximum-likelihood models to Bayesian models; the last are increasingly preferred because they accommodate uncertainty and facilitate explainability.

The premise of TreePPL is to speed up the development of Bayesian models by computational biologists. It builds upon the idea of probabilistic programming, which separates the definition of a model from the implementation of inference. Our vision is to enable practitioners to focus on domain modeling while piggybacking on state-of-the-art inference methods (variations of Markov chain Monte Carlo and sequential Monte Carlo). We want to provide (a) an ergonomic interface for model specification and (b) a set of efficient pre-built inference runtimes that can be applied to the user-supplied model.
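To make the separation of model and inference concrete, here is a minimal sketch in plain Python. It is not TreePPL syntax and does not reflect how the TreePPL compiler works; every name in it (`Trace`, `coin_model`, `importance_sampling`) is hypothetical. For brevity it uses simple importance sampling from the prior rather than the MCMC or SMC methods mentioned above, and a biased-coin model rather than a phylogenetic one.

```python
import math
import random


class Trace:
    """Hypothetical runtime object handed to the model by the inference engine."""

    def __init__(self, rng):
        self.rng = rng
        self.log_weight = 0.0  # accumulated log-likelihood of the observations

    def sample_uniform(self, lo, hi):
        # Draw a latent variable from its prior.
        return self.rng.uniform(lo, hi)

    def observe_bernoulli(self, value, p):
        # Score one observed coin flip under success probability p.
        prob = p if value else 1.0 - p
        self.log_weight += math.log(prob) if prob > 0.0 else float("-inf")


def coin_model(trace, data):
    """Model definition only: a prior plus observations, no inference logic."""
    p = trace.sample_uniform(0.0, 1.0)
    for flip in data:
        trace.observe_bernoulli(flip, p)
    return p


def importance_sampling(model, data, n_particles, seed=0):
    """Generic inference routine: works for any model written against Trace."""
    rng = random.Random(seed)
    values, log_weights = [], []
    for _ in range(n_particles):
        trace = Trace(rng)
        values.append(model(trace, data))
        log_weights.append(trace.log_weight)
    # Normalize the weights in a numerically stable way.
    max_lw = max(log_weights)
    weights = [math.exp(lw - max_lw) for lw in log_weights]
    total = sum(weights)
    # Weighted average of the returned values estimates the posterior mean.
    return sum(v * w for v, w in zip(values, weights)) / total


data = [True, True, True, False, True, True, False, True]  # 6 heads, 2 tails
posterior_mean = importance_sampling(coin_model, data, n_particles=20000)
print(posterior_mean)
```

With a uniform prior and 6 heads out of 8 flips, the exact posterior is Beta(7, 3) with mean 0.7, so the printed estimate should land close to that value. The point of the sketch is that `coin_model` contains no inference code: swapping in a different inference routine would not require touching the model.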

As a postdoc at the Ronquist Lab, I was responsible for designing the TreePPL language, leading the engineering effort to produce the minimum viable version, and carrying out pilot studies. As with OpenBiodiv, my responsibilities were heterogeneous. While I contributed significantly to the design specification and later to the compiler implementation, key parts of my job included facilitating cooperation between the people working on the software components, working with practitioners on real-world use cases, teaching the language, and popularizing probabilistic programming at conferences; see, for example, my keynote in Galway or the 2024 Miking workshop.

What I learned from this experience is that designing a novel programming language is an extremely challenging and complex task that goes beyond academic research when the primary purpose is not to investigate new language paradigms but to create added value for users by leveraging novel compiler techniques. I think it built upon my previous work with OpenBiodiv and improved my skills in leading both software and research projects.

References

Feel free to contact:

Fredrik Ronquist, professor of phylogenetics, Swedish Museum of Natural History (project PI)
David Broman, professor of computer science, KTH Royal Institute of Technology (project PI)
Jan Kudlicka, professor of data science, BI Norwegian Business School (co-lead)
Daniel Lundén, senior developer at Oracle (key contributor to the code-base)