What is bioinformatics?

Bioinformatics is a new, computationally oriented Life Science domain. Its primary goal is to make sense of the information stored within living organisms. Bioinformatics relies on and combines concepts and approaches from biology, computer science, and data analysis. Bioinformaticians evaluate and define their success primarily in terms of the new insights they produce about biological processes by digitally parsing genomic information.

Bioinformatics is a data science that investigates how information is stored within and processed by living organisms.

How has bioinformatics changed?

In its early days (perhaps until the beginning of the 2000s), bioinformatics was synonymous with sequence analysis. Scientists typically obtained just a few DNA sequences and then analyzed them for various properties. Today, sequence analysis is still central to the work of bioinformaticians, but the field has grown well beyond it.

In the mid-2000s, the so-called next-generation, high-throughput sequencing instruments (such as the Illumina HiSeq) made it possible to measure the full genomic content of a cell in a single experimental run. With that, the quantity of data shot up immensely as scientists were able to capture a snapshot of everything that is DNA-related.

These new technologies have transformed bioinformatics into an entirely new field of data science that builds on “classical bioinformatics” to process, investigate, and summarize massive data sets of extraordinary complexity.

What subfields of bioinformatics exist?

DNA sequencing was initially valued for revealing the DNA content of a cell. It may come as a surprise to many, however, that the most significant promise for the future of bioinformatics might lie in other applications. In general, most bioinformatics problems fall under one of four categories:

Classification: determining the species composition of a population of organisms
Assembly: establishing the nucleotide composition of genomes
Resequencing: identifying mutations and variations in genomes
Quantification: using DNA sequencing to measure the functional characteristics of a cell

The Human Genome Project fell squarely in the assembly category. Since its completion, scientists have assembled the genomes of thousands of other species. The genomes of many millions of species, however, remain entirely unknown.

Studies that attempt to identify changes relative to known genomes fall into the resequencing field of study. DNA mutations and variants may cause phenotypic changes like emerging diseases, changing fitness, different survival rates, and many others. For example, there are several ongoing efforts to compile all variants present in the human genome; these efforts fall into the resequencing category. Thanks to the work of bioinformaticians, massive computing efforts are underway to produce clinically valuable information from the knowledge gained through resequencing.

Living microorganisms surround us, and we coexist with them in complex collectives that can only survive by maintaining interdependent harmony. Classifying these mostly unknown species of microorganisms by their genetic material is a fast-growing subfield of bioinformatics.

Finally, and perhaps most unexpectedly, bioinformatics methods can help us better understand biological processes, like gene expression, through quantification. In these protocols, the sequencing procedures are used to determine the relative abundances of various DNA fragments that were made to correlate with other biological processes.

Over the decades, biologists have become experts at manipulating DNA and are now able to co-opt many naturally occurring molecular processes to copy, translate, and reproduce DNA molecules and to connect these actions to biological processes. Sequencing has opened a new window into this world, and new methods and sequence manipulations are continuously being discovered. The various methods are typically named Something-Seq (for example RNA-Seq, ChIP-Seq, and RAD-Seq) to reflect the mechanism that was captured by or connected to sequencing.

RNA-Seq, for instance, reveals the abundance of RNA by turning it into DNA via reverse transcription. Sequencing this construct allows for simultaneously measuring the expression levels of all genes of a cell. RAD-Seq, in turn, uses restriction enzymes to cleave DNA at specific locations, and only the fragments around these locations are then sequenced. This method produces very high coverage around these sites and is thus suited for population genetics studies.
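To make the idea of quantification a little more concrete, here is a minimal sketch of turning fragment counts into relative abundances. The gene names and counts are made up for illustration, and the function name is our own. Real RNA-Seq analyses also correct for factors such as gene length and sequencing depth, which this toy example ignores.

```python
def relative_abundance(read_counts):
    """Convert raw per-gene read counts into fractions of the total."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were counted")
    return {gene: count / total for gene, count in read_counts.items()}


# Hypothetical counts of sequenced fragments assigned to three genes.
counts = {"geneA": 50, "geneB": 30, "geneC": 20}

for gene, fraction in relative_abundance(counts).items():
    print(f"{gene}\t{fraction:.2f}")
```

The fractions always sum to one, which is why such measurements describe relative rather than absolute abundance.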

Is there a list of functional assays used in bioinformatics?

In the Life Sciences, an assay is an investigative procedure used to assess or measure the presence, amount, or function of some target (like a DNA fragment). Dr. Lior Pachter, professor of Mathematics at Caltech, maintains a list of “functional genomics” assay technologies on the page called Star-Seq.

All of these techniques fall into the quantification category. Each assay uses DNA sequencing to quantify another measure, and many are examples of connecting DNA abundances to various biological processes.

Notably, the list now contains nearly 100 technologies. Many people, us included, believe that these applications of sequencing are of greater importance and impact than identifying the base composition of genomes.

Below are some examples of the assay technologies on Dr. Pachter’s list:

But what is bioinformatics, really?

So now that you know what bioinformatics is all about, you’re probably wondering what it’s like to practice it day in, day out as a bioinformatician. The truth is, it’s not easy. Just take a look at this “Biostar Quote of the Day” from Brent Pedersen in Very Bad Things:

I’ve been doing bioinformatics for about 10 years now. I used to joke with a friend of mine that most of our work was converting between file formats. We don’t joke about that anymore.

Jokes aside, modern bioinformatics relies heavily on file and data processing. The data sets are large and contain complex interconnected information. A bioinformatician’s job is to simplify massive datasets and search them for the information that is relevant to the given study. Essentially, bioinformatics is the art of finding the needle in the haystack.

Are bioinformaticians data janitors?

Oh, yes. But then, make no mistake, all data scientists are. It is not a unique feature of this particular field.

I used to get worked up about the problems with data, but then I read the New York Times opinion piece: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights, where they state:

[…] Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets […]

You will have problems like that all the time. For example, you’ll download a file from an official database like NCBI. It turns out to be a large file with 1.5 million entries. But you won’t be able to use it: programs that ought to work fine with data of this size crash spectacularly. You’ll scratch your head and dive in, and a few hours later it turns out that one single line, out of the 1.5 million, has the wrong number of fields. One line has the wrong format! Now you’ll need to find a way to fix or get rid of that single line, and then everything works again.
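A check like the following can save those few hours. It is only a sketch under our own assumptions: the data is tab-delimited, the "right" number of fields is whatever most lines have, and the function name and sample records are invented for illustration.

```python
from collections import Counter


def find_malformed_lines(lines, delimiter="\t"):
    """Return (line_number, field_count) for lines whose field count
    differs from the most common field count in the input."""
    counts = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text:
            continue  # ignore blank lines
        counts.append((number, len(text.split(delimiter))))
    if not counts:
        return []
    # Assume the most frequent field count is the intended one.
    expected, _ = Counter(n for _, n in counts).most_common(1)[0]
    return [(number, n) for number, n in counts if n != expected]


# Made-up records; the third one is missing a field.
records = [
    "chr1\t100\t200\tgeneA\n",
    "chr1\t300\t400\tgeneB\n",
    "chr2\t500\tgeneC\n",
    "chr2\t700\t800\tgeneD\n",
]
print(find_malformed_lines(records))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the list.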

It helps to be prepared and know what to expect.

Is creativity required to succeed?

Bioinformatics requires a dynamic, creative approach. Protocols should be viewed as guidelines, not as rules that guarantee success. Following protocols to the letter is usually entirely counterproductive. At best, doing so leads to sub-optimal outcomes; at worst, it can produce misinformation that spells the end of a research project.

Living organisms operate in immense complexity. Bioinformaticians need to recognize this complexity, respond dynamically to variations, and understand when methods and protocols are not suited to a data set. The myriad complexities and challenges of venturing into the frontiers of scientific knowledge always require creativity, sensitivity, and imagination. Bioinformatics is no exception.

Unfortunately, the misconception that bioinformatics is a procedural skill that anyone can quickly add to their toolkit rather than a scientific domain in its own right can lead some people to underestimate the value of a bioinformatician’s contributions to the success of a project.

As observed in the Nature paper Core services: Reward bioinformaticians:

Biological data will continue to pile up unless those who analyze it are recognized as creative collaborators in need of career paths.

Bioinformatics requires multiple skill sets, extensive practice, and familiarity with multiple analytical frameworks. Proper training, a solid foundation and an in-depth understanding of concepts are required of anyone who wishes to develop the particular creativity needed to succeed in this field.

This need for creativity and the necessity for a bioinformatician to think “outside the box” is what this Handbook aims to teach. We don’t just want to list instructions: “do this, do that.” We want to help you establish that robust and reliable foundation that allows you to be creative when (not if) that time comes.

Are analyses all alike?

Most bioinformatics projects start with a “standardized” plan, like the ones you’ll find in this Handbook. However, these plans are never set in stone. Depending on the types and features of the observations and on the results of earlier analyses, additional tasks inevitably arise that deviate from the original plan to account for variances observed in the data. Frequently, the studies need substantive customization.

Again, the authors of Core services: Reward bioinformaticians note the following:

“No project was identical, and we were surprised at how common one-off requests were. There were a few routine procedures that many people wanted, such as finding genes expressed in a disease. But 79% of techniques were applied to fewer than 20% of the projects. In other words, most researchers came to the bioinformatics core seeking customized analysis, not a standardized package.”

In summary, almost no two analyses are precisely the same. Also, it is quite common for projects to deviate from the standardized workflow substantially.

Should life scientists know bioinformatics?

Yes!

The results of bioinformatic analyses are relevant for most areas of study in the life sciences. Even if a scientist isn’t performing the analysis themselves, they need to be familiar with how bioinformatics operates so they can accurately interpret and incorporate the findings of bioinformaticians into their work. All scientists informing their research with bioinformatic insights should understand how the field works by studying its principles, methods, and limitations, most of which are covered in this Handbook.

We believe that this book is of great utility even for those who don’t plan to run the analysis themselves.

What type of computer is required?

All tools and methods presented in this book have been tested and will run on all three major operating systems: macOS, Linux, and Windows 10. See the Computer Setup page.

For best results, Windows 10 users will need to join the Windows Insider program (a free service offered by Microsoft), which will allow them to install the newest release of the “Windows Subsystem for Linux (WSL)”.

Is there data with the book?

Yes, we have a separate data site at http://data.biostarhandbook.com. Various chapters will refer to content distributed from this site.

Who is the book for?

The Biostar Handbook provides training and practical instructions for students and scientists interested in data analysis methodologies of genome-related studies. Our goal is to enable readers to perform analyses on data obtained from high-throughput DNA sequencing instruments.

All of the Handbook’s content is designed to be simple, brief, and geared towards practical application.

Is bioinformatics hard to learn?

Bioinformatics engages the distinct fields of biology, computer science, and statistical data analysis. Practitioners must navigate the various philosophies, terminologies, and research priorities of these three domains of science while keeping up with the ongoing advances of each.

Its position at the intersection of these fields might make bioinformatics more challenging than other scientific subdisciplines, but it also means that you’re exploring the frontiers of scientific knowledge, and few things are more rewarding than that!

Can I learn bioinformatics from this book?

Yes, you can!

The questions and answers in the Handbook have been carefully selected to provide you with steady, progressive, accumulating levels of knowledge. Think of each question/answer pair as a small, well-defined unit of instruction that builds on the previous ones.

Reading this book will teach you what bioinformatics is all about.
Running the code will demonstrate the skills you need to perform the analyses.

How long will it take me to learn bioinformatics?

About 100 hours.

Of course, a more accurate answer depends on your background preparation, and each person is different. Prior training in at least one of the three fields that bioinformatics builds upon (biology, computer science, and data analysis) is recommended. The time required to master all skills also depends on how you plan to use them. Solving larger and more complex data problems will require more advanced skills, which need more time to develop fully.

That being said, based on several years of evaluating trainees in the field, we have come to believe that an active student would be able to perform publication-quality analyses after dedicating about 100 hours of study.

This is what this book is really about: helping you put those 100 hours to good use.