2014 CSHL Undergraduate Research Program in Bioinformatics

Searching for GATTACA

In this class we explored the problem of finding exact occurrences of a query sequence in a large genome or database of sequences. We started by analyzing the brute force approach, which introduced the concepts of algorithm, complexity analysis, and E-values. Next we discussed suffix arrays as an index for accelerating the search, and analyzed the performance of binary search. We also compared two traditional sorting algorithms, Selection Sort and QuickSort, and their relative performance.

In the second half of the class we discussed finding approximate occurrences of a short query sequence in a large genome or database of sequences. We first defined the problem by considering metrics for an approximate occurrence, such as Hamming distance and edit distance. We then considered different methods for computing inexact alignments, including brute force global and local alignment and seed-and-extend algorithms. Finally, we discussed Bowtie, a short read mapping algorithm based on the Burrows-Wheeler transform, for discovering alignments to a reference genome.
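
As a rough illustration of the ideas summarized above, the sketch below builds a suffix array, uses binary search to find exact occurrences of a query, and computes Hamming and edit distance. It is not the code used in class. The function names (build_suffix_array, find_occurrences, hamming_distance, edit_distance) are made up for this example, and the suffix array is built by plain sorting, which is fine for short teaching sequences but far too slow and memory hungry for a real genome.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order."""
    # Naive construction: sort positions by the suffix that starts there.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, query):
    """Return sorted start positions where query occurs exactly in text."""
    m = len(query)

    def prefix(rank):
        # First m characters of the suffix at this rank in the array.
        start = suffix_array[rank]
        return text[start:start + m]

    # Binary search for the first suffix whose prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for the first suffix whose prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix ranked in [first, lo) starts with query.
    return sorted(suffix_array[first:lo])


def hamming_distance(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(1 for x, y in zip(a, b) if x != y)


def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

For example, with genome = "TGATTACAGATTACC", find_occurrences(genome, build_suffix_array(genome), "GATTACA") returns [1]. Each binary search makes on the order of log n comparisons over n suffixes, compared with the n starting positions checked by a brute force scan.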

Python & Bioinformatics

Python Class 1

Introduction to Python, variables, lists, conditions, loops

Python Class 2

Brute force search, dictionaries, motif finding
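
A minimal sketch of the kind of code these topics lead to is shown below: a brute force exact search over every start position, and a dictionary that counts k-mers as a first step toward motif finding. The function names brute_force_search and count_kmers are hypothetical and are not taken from the class materials.

```python
def brute_force_search(genome, query):
    """Return every start position where query occurs exactly in genome."""
    hits = []
    # Try each possible start position and compare the full substring.
    for start in range(len(genome) - len(query) + 1):
        if genome[start:start + len(query)] == query:
            hits.append(start)
    return hits


def count_kmers(sequence, k):
    """Count how often each length-k substring occurs, using a dictionary."""
    counts = {}
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts
```

For example, brute_force_search("TGATTACAGATTACC", "GATTAC") returns [1, 8], and count_kmers("ATATA", 2) returns {"AT": 2, "TA": 2}. Frequently occurring k-mers in the dictionary are natural candidates to inspect as motifs.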

IPython Notebooks for Probability & Statistics

  1. Rolling a die (Uniform Random Probability)
  2. Flipping a coin (Binomial & Normal Distributions)
  3. Throwing Marbles into Jars (Poisson Distribution)
  4. Throwing Darts (Exponential Distribution)
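
The notebooks themselves are not reproduced here. As a hedged sketch of the style of experiment the titles above suggest, the code below simulates rolling a die and throwing marbles into jars with Python's standard random module. The function names roll_die and marbles_in_jars, and the choice to pass in a seeded random.Random object, are assumptions made for this example.

```python
import random
from collections import Counter


def roll_die(n_rolls, rng):
    """Roll a fair six-sided die n_rolls times; return counts per face."""
    return Counter(rng.randint(1, 6) for _ in range(n_rolls))


def marbles_in_jars(n_marbles, n_jars, rng):
    """Throw each marble into a uniformly random jar.

    Returns a Counter mapping k to the number of jars holding k marbles.
    With many jars, this histogram is approximately Poisson with
    mean n_marbles / n_jars.
    """
    jars = [0] * n_jars
    for _ in range(n_marbles):
        jars[rng.randrange(n_jars)] += 1
    return Counter(jars)


if __name__ == "__main__":
    rng = random.Random(2014)  # fixed seed so repeated runs agree
    print(roll_die(6000, rng))
    print(marbles_in_jars(1000, 1000, rng))
```

With a fair die, each face should come up in roughly one sixth of the rolls, and the agreement tightens as the number of rolls grows.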

We also used the exercises at Rosalind throughout the course.

Special topics

Talk by Anne Churchland on balancing work and life.