PyData Seattle 2023

Python in Bioinformatics
04-28, 14:15–15:00 (America/Los_Angeles), Kodiak Theatre

Python is used all over the place in Bioinformatics. In this talk, I'll highlight three areas of interest:

  1. Informatic Jobs: how does raw sequencing data turn into variant calls? There's often some Python shepherding the underlying tools (often CLI tools).
  2. ML Models: large language models are getting very good in general, e.g. Codex. Similar models are making nascent progress in biology, e.g. ESM by Meta.
  3. Munging: the lovely tasks that are relevant to every field, though how to do them is normally tacit knowledge within the field. The same is true for Bioinformatics.

Bioinformatics is an exciting field, and there are lots of opportunities for people with Python skills... and some resolve to learn a challenging domain. I've been using Python in Bioinformatics professionally for about 7 years, and I hope to share a few ways it is used in the field to pique interest.

While it's not a complete list, Python is used heavily for Informatic Jobs, ML Modeling, and general Data Munging.

For the first, Informatic Jobs, it's often the case that a CLI tool already exists to do the underlying job: it was written in C and "works" according to the subject matter expert. The programmer's task in this case is to productionize the job in whatever way is relevant for the organization. I'm most familiar with doing this in AWS, so I'll provide an example of the infrastructure and zoom into the actual Python code that interacts with the CLI tool.
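As a rough illustration of what "Python shepherding a CLI tool" can look like, here is a minimal sketch using the standard library's `subprocess` module. It is not the code from the talk. The function name `run_cli` and the stand-in command are assumptions made for the example, and the Python interpreter plays the role of the real tool so the snippet runs without any bioinformatics software installed.

```python
import subprocess
import sys
from typing import Sequence


def run_cli(command: Sequence[str], timeout_s: float = 60.0) -> str:
    """Run a command-line tool and return its standard output as text.

    Raises subprocess.CalledProcessError when the tool exits non-zero,
    so a failed job is not mistaken for a successful one.
    """
    completed = subprocess.run(
        list(command),
        check=True,           # raise on a non-zero exit code
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout_s,    # give up if the tool hangs
    )
    return completed.stdout


# Hypothetical stand-in for a real tool: the Python interpreter itself.
STAND_IN_TOOL = [sys.executable, "-c", "print('done')"]

if __name__ == "__main__":
    print(run_cli(STAND_IN_TOOL), end="")
```

In a production job the same wrapper would typically sit between the step that downloads inputs and the step that uploads results, with the raised exception used to mark the job as failed.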

For the second, ML Models, you'd be shocked to learn that people took a model applied to a sequence of tokens in one setting and applied it to another. While this has already happened numerous times, HMMs being an important example, we'll talk about how the current class of large language models (LLMs) is being applied to biology. I'll talk about two use cases. First, using a trained model for information retrieval (think document search, but for amino acids). And second, fine-tuning a model for predictive tasks (is this a better nuclease than that one?).
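To make the retrieval use case a little more concrete, here is a minimal sketch of embedding-based search: rank stored sequences by cosine similarity to a query vector. Everything in it is an assumption for illustration. In practice the vectors would come from a protein language model, while here they are tiny hand-written placeholders, and the names `cosine_similarity`, `search`, and `TOY_INDEX` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder embeddings, NOT real model output.
TOY_INDEX: Dict[str, Sequence[float]] = {
    "protein_a": [1.0, 0.0, 0.0],
    "protein_b": [0.9, 0.1, 0.0],
    "protein_c": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for name, score in search([1.0, 0.0, 0.0], TOY_INDEX):
        print(f"{name}\t{score:.3f}")
```

A brute-force scan like this is fine for small collections. Larger ones usually swap in an approximate nearest-neighbor index while keeping the same interface.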

Finally, because I want to end the talk with a bang, I'll highlight some of the libraries used for data munging in the field. Biopython is perhaps the most common, given its utility for reading and writing the field's idiosyncratic file types. PyRanges is another useful one when working with sequences. For example, if I have an annotation that spans [4, 10], how can I easily determine which other annotations overlap it, which ones are close to it, and so on? These problems are common, and being able to solve them in Python is very helpful.
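The overlap question can be sketched in plain Python. This is not the PyRanges API, just the underlying interval logic, and it assumes closed (inclusive) coordinates as the [4, 10] notation suggests. The `Annotation` type, the function names, and the example annotations are all made up for illustration.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    name: str
    start: int  # inclusive
    end: int    # inclusive


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True when two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def gap(a: Annotation, b: Annotation) -> int:
    """Positions strictly between two annotations (0 if they overlap or touch)."""
    if overlaps(a, b):
        return 0
    return max(a.start, b.start) - min(a.end, b.end) - 1


def find_overlapping(query: Annotation, others: List[Annotation]) -> List[Annotation]:
    return [other for other in others if overlaps(query, other)]


def find_nearby(query: Annotation, others: List[Annotation], max_gap: int) -> List[Annotation]:
    """Annotations that do not overlap the query but lie within max_gap positions."""
    return [
        other for other in others
        if not overlaps(query, other) and gap(query, other) <= max_gap
    ]


# Made-up annotations for illustration only.
QUERY = Annotation("query", 4, 10)
OTHERS = [
    Annotation("gene_x", 1, 5),
    Annotation("gene_y", 10, 12),
    Annotation("gene_z", 13, 20),
    Annotation("gene_w", 40, 50),
]

if __name__ == "__main__":
    print([a.name for a in find_overlapping(QUERY, OTHERS)])
    print([a.name for a in find_nearby(QUERY, OTHERS, max_gap=5)])
```

The linear scan is only meant to show the logic. Interval libraries exist because real annotation sets are large enough that sorted or tree-based lookups matter.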

Prior Knowledge Expected

No previous knowledge expected

Trent Hauck has been using Python for going on 13 years for various data-related endeavors -- in fact, he spoke at PyData 2015 about Latent Dirichlet Allocation before Transformers were around!

Now he works in biotech and develops software and consults as part of his company, WHERE TRUE Technologies (