Introduction to Bioinformatics

UCTM Sofia - April 2025

Viktor Senderov

Bioinformatics

Application of statistics and computer science to genomic/biological data.

Purpose: decipher and manage genomic/biological information

Key objectives of bioinformatics

  1. Genome sequencing & annotation
  2. Protein structure prediction
  3. Comparative genomics
  4. Transcriptomics and proteomics
  5. Drug discovery and design
  6. Metagenomics
  7. Systems biology
  8. Personalized medicine

Need for a separate scientific field

Some famous applications

  • Human Genome Project
  • development of mRNA vaccines against COVID-19
  • gene cloning for the pharmaceutical industry

Structure of the lecture

  1. Theory - 45-50 min
  2. Practice - 45-50 min

Theory

Chromosomes, genes, alleles

Homologs

Similar genes that have a shared ancestry.

Orthologs

Homologs separated by a speciation event.

Paralogs

Homologs separated by a gene duplication event.

Analogs

Genes that are similar due to convergent evolution, and not due to shared ancestry.

Central dogma of molecular biology

DNA → RNA → protein, as stated by Watson (1970).

Important

Watson is a known racist:

  • “their intelligence is [not really] the same as ours” (referring to Black people)
  • “some anti-Semitism is justified”
  • having more women in science is “more fun” but “less effective”
  • received Nobel prize together with Crick for work that was in major part done by Rosalind Franklin
  • Watson sold his Nobel prize in 2014 for $4.1 million

Central dogma of molecular biology

Crick (1970)

Transcription

mRNA Processing

Alternative splicing A

Alternative splicing B

Translation

Short video on translation (YouTube)

Codons

By Mouagip - Codons aminoacids table.png, Public Domain, https://commons.wikimedia.org/w/index.php?curid=5986132
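The two steps of the central dogma can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `transcribe` and `translate` are hypothetical, the input is taken to be the coding (sense) strand read in frame from its first base, and the standard genetic code is assumed (bases ordered T, C, A, G, with `*` marking a stop codon).

```python
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA spelled by the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna):
    """Translate codon by codon from the first base, stopping at a stop codon."""
    seq = coding_dna.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCGTGCTAA")
protein = translate("ATGGCGTGCTAA")
```

Real transcripts also carry untranslated regions and may be spliced, so this sketch applies only to a clean coding sequence.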

Sequence variations

Nomenclature sensu Human Genome Variation Society (2007):

  • Mutations: disease-causing
  • Polymorphisms: non-disease-causing

Level:

  • genomic
  • RNA
  • protein
  • other

Sequence variation types

  • Substitution: 76A>C
  • Deletion: 76_78del
  • Insertion: 83_84insTG
  • indels, duplications, inversions, translocations, etc.
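To make the notation concrete, here is a minimal sketch that applies the three simple variant types above to a sequence. The function name `apply_variant` is hypothetical, positions are assumed to be 1-based, and only the three patterns shown in the list are handled.

```python
import re


def apply_variant(seq, variant):
    """Apply one simple variant description (1-based positions) to seq."""
    # Substitution, e.g. 76A>C: the reference base at position 76 becomes C.
    m = re.fullmatch(r"(\d+)([ACGT])>([ACGT])", variant)
    if m:
        pos, ref, alt = int(m.group(1)), m.group(2), m.group(3)
        if seq[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}")
        return seq[:pos - 1] + alt + seq[pos:]

    # Deletion, e.g. 76_78del: positions 76 to 78 inclusive are removed.
    m = re.fullmatch(r"(\d+)_(\d+)del", variant)
    if m:
        start, end = int(m.group(1)), int(m.group(2))
        return seq[:start - 1] + seq[end:]

    # Insertion, e.g. 83_84insTG: TG is placed between positions 83 and 84.
    m = re.fullmatch(r"(\d+)_(\d+)ins([ACGT]+)", variant)
    if m:
        left, right, inserted = int(m.group(1)), int(m.group(2)), m.group(3)
        if right != left + 1:
            raise ValueError("insertion must be between adjacent positions")
        return seq[:left] + inserted + seq[left:]

    raise ValueError(f"unsupported variant description: {variant}")


example = "ACGTACGT"
substituted = apply_variant(example, "2C>T")
deleted = apply_variant(example, "3_4del")
inserted = apply_variant(example, "4_5insTG")
```

The full nomenclature also covers duplications, inversions and other cases, which this sketch deliberately leaves out.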

Sequence alignment

Pair-wise alignment: compare two sequences.

Seq1 5' ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA 3'
        |||||||||||    |||||||  |||||||||||||| |||||||
Seq2 5' ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA 3'
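The alignment above can be summarized by counting its columns. The sketch below is illustrative only; `alignment_summary` and `match_line` are hypothetical helper names, and `-` is assumed to be the gap character.

```python
def alignment_summary(row1, row2):
    """Count matches, mismatches and gap columns of two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def match_line(row1, row2):
    """Build the middle line: '|' under identical bases, space elsewhere."""
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(row1, row2))


seq1 = "ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA"
seq2 = "ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA"
matches, mismatches, gaps = alignment_summary(seq1, seq2)
identity = matches / len(seq1)  # fraction of identical columns
```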

Multiple-sequence alignment: Seq1, Seq2, Seq3, …

Global and local alignment

Seq1 5' ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA 3'
        |||||||||||    |||||||  |||||||||||||| |||||||
Seq2 5' ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA 3'

Global alignment: aligns the two sequences end to end; useful for closely related sequences of similar length, such as homologous genes.

5' ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA 3'
            ||||| |||||| |||||||||||||||
         5' TTACTCACGGATGAGGTACTTTAGAGGC 3'

Local alignment: aligns only the best-matching subsequences; useful for divergent sequences, for example to find conserved patterns in DNA or members of gene families.

Algorithms and tools for alignment

Global alignment uses the Needleman-Wunsch algorithm:

  • EMBOSS needle
  • specialized BLAST
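As a rough illustration of the idea behind Needleman-Wunsch, the sketch below fills the dynamic-programming table and traces back one optimal global alignment. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name `needleman_wunsch` are assumptions chosen for the example; tools such as EMBOSS needle use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


global_score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
```

Both time and memory grow with the product of the two sequence lengths, which is why this exact approach is mostly used for pairs of genes rather than whole genomes.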

Local alignment uses the Smith-Waterman algorithm:

  • BLAST (a faster heuristic that approximates Smith-Waterman)
  • EMBOSS water
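The local variant differs from the global one in two ways: scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. A minimal sketch follows; the scoring values (match +2, mismatch -1, linear gap penalty -2) and the name `smith_waterman` are assumptions for illustration, not the settings of any particular tool.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                      # start a new local alignment
                          h[i - 1][j - 1] + s,    # extend diagonally
                          h[i - 1][j] + gap,      # gap in b
                          h[i][j - 1] + gap)      # gap in a
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


local_score, local_a, local_b = smith_waterman("GGGACGTCCC", "TTACGTAA")
```

In this small example only the shared core of the two sequences is reported, and the unrelated flanks are ignored.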

Genome assembly using a toy example

The sequencing machine returned the following reads:

GTG, GCG, GCA, ATG, TGG, TGC, GGC, CGT, CAA, AAT

The actual sequence is circular and reads as follows:

ATGGCGTGCA

Using a Hamiltonian cycle

Using an Eulerian cycle
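The Eulerian approach can be sketched as follows. The example assumes that the reads are exactly the ten 3-mers of the circular sequence ATGGCGTGCA (so the read set contains GCG), that each read is error-free, and that the resulting graph is connected and balanced. The function names are hypothetical, and the cycle search is a plain implementation of Hierholzer's algorithm.

```python
from collections import defaultdict

READS = ["GTG", "GCG", "GCA", "ATG", "TGG", "TGC", "GGC", "CGT", "CAA", "AAT"]


def de_bruijn_graph(kmers):
    """Nodes are (k-1)-mers; every k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_cycle(graph):
    """Hierholzer's algorithm: return a closed walk that uses every edge once."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    start = next(iter(remaining))
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            cycle.append(stack.pop())            # dead end: node is finished
    return cycle[::-1]                           # first node equals last node


def spell_circular(cycle):
    """Each step of the closed walk contributes one base of the circular genome."""
    return "".join(node[0] for node in cycle[:-1])


def circular_kmers(genome, k):
    """All k-mers of a circular sequence, wrapping around the end."""
    doubled = genome + genome[:k - 1]
    return [doubled[i:i + k] for i in range(len(genome))]


genome = spell_circular(eulerian_cycle(de_bruijn_graph(READS)))
```

Note that this graph has more than one Eulerian cycle (for example, ATGGCGTGCA and ATGCGTGGCA both use every 3-mer exactly once), so the result is a sequence consistent with the reads and not necessarily the original one.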

Genome assembly

According to Medvedev and Pop (2021), this picture is misleading.

  • assemblers do not reconstruct the whole genome from the reads: in real-world scenarios the reconstruction is not unique
  • instead, algorithms produce contigs: long, unambiguous segments of the genome
  • finding the set of all possible contigs is a polynomial-time problem (unitig algorithm), regardless of whether the genome is modeled as a Hamiltonian or an Eulerian cycle
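One simple way to obtain unambiguous segments is to report unitigs, that is, maximal non-branching paths in the de Bruijn graph. The sketch below is a simplified illustration and not the exact procedure of Medvedev and Pop (2021): the function name `unitigs` is hypothetical, the reads are assumed to be the ten error-free 3-mers of the circular sequence ATGGCGTGCA, and isolated cycles without any branching node are not reported.

```python
from collections import defaultdict


def unitigs(kmers):
    """Return the maximal non-branching paths of the de Bruijn graph, spelled out."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in set(kmers):
        out_edges[kmer[:-1]].append(kmer[1:])
        in_degree[kmer[1:]] += 1
    nodes = set(out_edges) | set(in_degree)

    def is_simple(node):
        # Exactly one way in and one way out: the path through it is unambiguous.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, [])) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths start only at branching nodes or path ends
        for nxt in out_edges.get(node, []):
            path = node + nxt[-1]
            while is_simple(nxt):          # extend while there is no choice
                nxt = out_edges[nxt][0]
                path += nxt[-1]
            contigs.append(path)
    return sorted(contigs)


reads = ["GTG", "GCG", "GCA", "ATG", "TGG", "TGC", "GGC", "CGT", "CAA", "AAT"]
contigs = unitigs(reads)
```

In the toy graph the nodes TG and GC each have two incoming and two outgoing edges, so every unitig starts and ends at one of these two branching points.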

Main theorem (informal), Medvedev and Pop (2021)

The following problems are equivalent and solvable in linear time:

  • Find an Eulerian cycle in the de Bruijn graph where the edges correspond to k-mers in the reads.

  • Find a Hamiltonian cycle in the de Bruijn graph where the edges correspond to all the possible (k+1)-mers that can be obtained from the reads’ k-mers.

Practical: Working with Genomic Data

Some cat phenotypes

Figure 1 (Lyons 2015): (a) wild type (b) non-Agouti allele (c) Inhibitor allele

The task of the lecture

ATGAATATCCTCCGCCTACTCCTGGCCACCCTGCTGGTCTGCCTGTGCCTCCTCACTGCCTACAGTCAC
CTGGCACCTGAGGAAAAACCCAGAGATGACAGGAACCTGAGGAGCAACTCCTCTGAACATGTTGGATCT
CTCTTCTGTCTCTATTGTAGCGCTGAACAAGAAATCCAAAAAGATCAGCAGAAAAGAGGCGGAAAAGAA
GAGATCTTCCAAGAAAAAGGCTTCGATGAAGAATGTTGCTCAGCCTCGGCGGCCCCGGCCTCCGCCGCC
CGCCCCCTGCGTGGCCACTCGTGACAGCTGCAAGCCGCCGGCGCCCGCCTGCTGCGACCCGTGCGCCTC
CTGCCAGTGCCGCTTCTTCCGCAGCTCCTGCTCCTGCCGAGTGCTCAACCCCACCTGCTGA

Working with literature

[Google Scholar](https://scholar.google.com)

Genome browsers

Genome browsers let researchers navigate the genome much as a web browser lets them navigate the internet (Furey 2006).

They provide

  • access to and
  • visualization of

genomic sequences and annotations.

Genome browsers

Popular genome browsers include:

NCBI Nucleotide Database

https://www.ncbi.nlm.nih.gov/nucleotide/

NCBI - 2 (mRNA transcript)

https://www.ncbi.nlm.nih.gov/nuccore/NM_001009190.1

https://www.ncbi.nlm.nih.gov/nuccore/NM_000799.4 (UTR)

NCBI - 3 (protein sequence)

https://www.ncbi.nlm.nih.gov/protein/NP_001009190.1

Download files

A Note on Running Windows

Tutorial on how to install WSL on Windows

Examining the files with UNIX CLI

Examining the files with Aliview

Comparing the variant to the wild-type

  • deletion at positions 123 and 124 (123_124del)

  • refer to Lyons (2015) to infer absence of bands
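The comparison can also be done programmatically. The sketch below locates a single deleted block by taking the longest common prefix of the two sequences and checking that the remainder matches. The function name `find_deletion` is hypothetical, positions are 1-based, and the variant is assumed to differ from the wild type by exactly one contiguous deletion.

```python
def find_deletion(wild_type, variant):
    """Return (start, end), 1-based inclusive, of one deleted block, or None."""
    missing = len(wild_type) - len(variant)
    if missing <= 0:
        return None  # the variant is not shorter, so this is not a pure deletion

    # Length of the longest common prefix of the two sequences.
    prefix = 0
    while prefix < len(variant) and wild_type[prefix] == variant[prefix]:
        prefix += 1

    # Removing `missing` bases right after the prefix must give the variant.
    if wild_type[:prefix] + wild_type[prefix + missing:] != variant:
        return None
    return prefix + 1, prefix + missing


toy_wild_type = "ACGTACGT"
toy_variant = "ACGCGT"  # toy example: bases 4 and 5 (TA) removed
deletion = find_deletion(toy_wild_type, toy_variant)
```

When the deleted bases sit inside a repeat, several placements describe the same variant; taking the longest common prefix reports the placement furthest towards the 3' end.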

More topics - Primer

Primer-BLAST

Project

You will be given a genetic sequence. Use the tools described today to determine the phenotype of the organism it belongs to.

If you want to do the assignment, send an email to

vsenderov@gmail.com

Further resources: Online

Introduction to genome browsers using Ensembl (video)

Data File Formats (UCSC FAQ)

FASTA Format for Nucleotide Sequences

Nomenclature for the description of mutations

Recombinant DNA, Cloning, & Editing

Bibliography

Crick, Francis. 1970. “Central Dogma of Molecular Biology.” Nature 227 (5258): 561–63.
Eizirik, Eduardo, Naoya Yuhki, Warren E Johnson, Marilyn Menotti-Raymond, Steven S Hannah, and Stephen J O’Brien. 2003. “Molecular Genetics and Evolution of Melanism in the Cat Family.” Current Biology 13 (5): 448–53. https://www.cell.com/current-biology/fulltext/S0960-9822(03)00128-3.
Furey, Terrence S. 2006. “Comparison of Human (and Other) Genome Browsers.” Human Genomics 2: 1–5. https://pmc.ncbi.nlm.nih.gov/articles/PMC3525149/.
Lyons, Leslie A. 2015. “DNA Mutations of the Cat: The Good, the Bad and the Ugly.” Journal of Feline Medicine and Surgery 17 (3): 203–19. https://journals.sagepub.com/doi/full/10.1177/1098612X15571878.
Medvedev, Paul, and Mihai Pop. 2021. “What Do Eulerian and Hamiltonian Cycles Have to Do with Genome Assembly?” PLoS Computational Biology 17 (5): e1008928. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008928#pcbi-1008928-g001.
Human Genome Variation Society. 2007. “Nomenclature for the Description of Sequence Variations.” https://www.hgmd.cf.ac.uk/docs/mut_nom.html#intro.
Watson, JD. 1970. “Molecular Biology of the Gene.”