Consed or not consed…that is the question.

Pink DNA sequencer at the Genomics Institute in St. Louis.

Once upon a time, in my grad school years, I used to do a lot of microscopy of the fancier kind. At some point, that involved taking care of a deconvolution microscope and the associated Silicon Graphics computer running Unix. I wanted to be prepared to handle that beast, so I talked to a fellow nerd, who recommended learning Linux. For a few months, I battled with Linux: I installed Red Hat, typed into terminal windows, and learned several commands. It helped a bit when the big machine arrived, but it was not really necessary, so I stopped trying. However, I did notice that it was a skill that impressed certain people.

That pride came rushing back today when we were introduced to Consed, one of the programs dedicated to sequence finishing. It has to be started by typing commands in a terminal window, so we received a short intro to Unix, and I felt pleased that I already knew how to do it. It is one of those pieces of knowledge acquired along the way of life.

Today we officially started the second part of the workshop. In the morning we had some free time to practice annotation, which was good. While there is still a long way to go, I could feel that I was getting more comfortable with the thinking behind the analysis, and my speed was hampered only by not remembering which link I had to click on to get the DNA sequence, or the predicted sequence, or whatever piece of information I needed. The consensus of the group was that students will need some intense and extensive time to learn the system.

Consed screenshot.

Consed was introduced after we were given a tour of the Genomics Institute, an amazing building with amazing machinery and even more amazing science. We saw the famous pink sequencer, dedicated to the breast cancer genome project. We learned about all the projects the center is involved with, and the gigabytes of information generated weekly through next generation sequencing. Then we were introduced to Consed, and a group of young wizards gave a tutorial about how they improve the raw sequences.

It was a whirlwind of windows to open, alignments, tracings, tags, comparisons, and decisions to make. However, the program is so visual that after a while it became almost pleasing to solve the problems with the DNA sequences. I am not sure why, maybe because it is more straightforward, but most of us in the group said we felt better today than yesterday.

The day flew by again. I get some exercise through a quick morning run in Forest Park (a joy); otherwise the day is dedicated to work, work, work, with lots of food to sustain us, though we joke about the amount of glucose that we must spend on brain work in the cold, cold air-conditioned computer room! And it is time to go to sleep again: tomorrow is the last day!

Sunrise over Forest Park.

Don’t believe everything you see or Blast

The Gateway Arch, according to my camera.

I was too tired last night to update the blog. After another intense day of annotation, discussions, and lectures, we did the must-do touristy thing in St. Louis: visiting the Gateway Arch. It was an impressive sight, and then we had some fun in the group squeezing into the small trams resembling space capsules. From the top, the view was amazing, but a bit of a letdown as the windows were so small and the glass not so clean, so the pictures did not turn out stellar. While waiting in line, I played with the stitching feature of my new Nexus 4 phone. I love my phone, and I love the camera, but trying to pan across a geometrically defined and narrow structure with unsteady hands while standing in line was not easy. The picture you see here is one of my attempts to capture the magnificence of the structure.

The resulting picture connects directly to the message conveyed in the second day of the workshop, still dedicated to annotation: don’t believe everything the programs tell you.

We kept working on our sandbox projects, getting them ready for submission. In the meantime, we listened to presentations by GEP students, GEP faculty, and the usual lunch lecture by Sally Elgin. Every time we have a lecture, the computers in the room become shared, stopping us from using them in the meantime. While this system probably helps people concentrate on the presentation at hand instead of working with the sequence, it does limit sharing and updating. I keep my laptop open while taking notes in Endnote, and know better than to open any page other than the GEP, Flybase, or Blast 🙂

With some practice, the mechanics of annotation become easier, and eventually it becomes almost like a game. We were warned several times not to forget, or let students forget, the science when looking at the results. We had a good lecture explaining how Blast works, and all the possible pitfalls of believing Blast too much, as the algorithm can get things wrong. The value of RNA ref sequences was also discussed (they may be real or may be noise, depending on the quality of the work, which end of the sequence they belong to, or simply the kind of sequence they are). We got to know another alignment tool, Blat, and we were ushered back to look really closely at the sequence and THINK about what those sequences actually mean in the evolutionary context.

The information overload continued with Sally Elgin’s lecture about Drosophila chromosome 4 and its unique characteristic of having 80 genes (in melanogaster) in spite of its prominently heterochromatic nature. There was an optional activity regarding chimpanzee sequences called Chimp Chunks, but I preferred to stay in the computer lab and fight my way through another sequence with multiple isoforms. The majority of participants I talked to said they felt better about annotation, but they thought they would dedicate some more time to annotating in the sandbox before moving on.

Today we get started with finishing, which seems rather intimidating. I hope to update with more info and links later on, but I need to rush for morning practice now. Until next time!

Annotation Day

A screenshot from today’s annotation.

Well, it is almost 11 pm and I am catching up with emails and blog posting in the hotel lobby. As Sally promised, the day was indeed intense. Once inside the computer lab of the Biology building, we sat down facing shiny large Apple displays (all analysis takes place on Macintosh computers, to some chagrin of the PC-only crowd) and started to crank. Actually, we started with an intro lecture about annotation of Drosophila species, which is the first PowerPoint here. This particular project is based on Drosophila, a model organism that I have never touched in my life except for illustration of some classic genetics concepts and experiments. But the way the project is designed, the workflow is applicable to other systems. We were given a basic annotation workflow suggestion:

  1. Identify the likely ortholog in D. melanogaster
  2. Determine the gene structure of the ortholog
  3. Map each exon of ortholog to the project sequence
  4. Use BLASTX to identify conserved region
    • note position and frame
  5. Use these data to construct a gene model
    • Identify exact start and stop base position for each CDS
  6. Use the Gene Model Checker to verify the gene model
  7. For each additional isoform, repeat steps 2-5
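To give a feel for what "verifying a gene model" can involve, here is a minimal sketch in Python. It is not the GEP Gene Model Checker and does not reproduce its rules; the function name `check_gene_model`, the coordinate convention (1-based, inclusive, plus strand only), and the assumption that the stop codon is included in the last CDS are all mine, chosen only for illustration.

```python
# Hypothetical sanity checks on a gene model; NOT the GEP Gene Model Checker.
# Assumptions: plus strand only, CDS coordinates are 1-based and inclusive,
# listed in order, and the stop codon is part of the last CDS.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(seq, cds):
    """Return a list of problems found in the model (empty list = none found)."""
    seq = seq.upper()
    # Join the CDS segments into one coding sequence.
    coding = "".join(seq[start - 1:end] for start, end in cds)
    # Split into complete codons, ignoring any leftover bases at the end.
    codons = [coding[i:i + 3] for i in range(0, len(coding) - 2, 3)]

    problems = []
    if len(coding) % 3 != 0:
        problems.append("coding length is not a multiple of 3")
    if not coding.startswith("ATG"):
        problems.append("coding sequence does not start with ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("coding sequence does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


# Toy sequence: exon "ATGG", a 10-base intron, then exon "CCTAA".
toy = "ATGG" + "GTAAGTTTAG" + "CCTAA"
good_model = [(1, 4), (15, 19)]   # joins to ATG GCC TAA
bad_model = [(1, 4), (16, 19)]    # one base short, so the frame is broken

print(check_gene_model(toy, good_model))
print(check_gene_model(toy, bad_model))
```

The point of the toy example is that shifting a single CDS boundary by one base is enough to break the frame, which is why step 5 asks for exact start and stop positions.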

So we claimed fosmids containing genomic sequences from different Drosophila species, and started to play. The browser we use is a mirror of the UCSC Genome Browser; once we get the alignment, we start to map the genes.

For me, newbie as I am to the whole business, it was amazing how real things become once you are actually using the information to achieve a goal. For example, the realization that a piece of DNA can indeed be read in 6 ways, depending on the reading frame. That introns indeed exist, and that each one has to begin with the donor (GT) and end with the acceptor (AG) sequence. That codons can be interrupted by introns, and one has to check, zooming in deep to the nucleotide level, that the codon is completed afterward. But as many said, the gamelike feeling of the beginning should be promptly corrected with the deep science behind those sequences. The biology, the conservation, the function of genes and the proteins they encode, are all aspects to be considered when annotating DNA.
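For readers who like to see such things spelled out, here is a small Python sketch of the three realizations above: the six reading frames, the GT...AG ends of an intron, and a codon that is only completed after the intron is removed. The function names and the toy sequence are made up for illustration and are not part of any GEP tool.

```python
# Illustrative helpers; names and toy data are assumptions, not GEP software.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a sequence into codons in all six frames: +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept; trailing bases are dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


def has_canonical_splice_sites(intron):
    """True if the intron begins with the donor GT and ends with the acceptor AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice_out(seq, start, end):
    """Remove one intron given as a 0-based, half-open interval [start, end)."""
    return seq[:start] + seq[end:]


# The codon GCC is split by the intron: "G" ends exon 1, "CC" starts exon 2.
exon1, intron, exon2 = "ATGG", "GTAAGTTTAG", "CCTAA"
genomic = exon1 + intron + exon2

print(has_canonical_splice_sites(intron))
mrna = splice_out(genomic, len(exon1), len(exon1) + len(intron))
print(six_frames(mrna)["+1"])
```

In frame +1 the spliced sequence reads as complete codons, whereas the same frame on the unspliced genomic sequence runs straight into the intron.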

I felt profound empathy with my students today. How many times, after going over some deep and complicated topic, have I given them 5 minutes to discuss and practice, and then moved on to the next topic? I realize now how important it is to pause, and give students time to go through the motions, to practice and just be able to process the information. Luckily we had breaks, and sometimes I just had to politely shoo away the helpful TAs. I needed time for myself, to try to do it on my own, and then ask questions.

There were discussions of many aspects of annotation today: how to handle gene families, repeats, and even transposable elements. I took notes, and asked questions. Luckily, I did not beat myself up by choosing an overly complicated assignment; I chose a relatively easy one, which gave me some satisfaction. I guess after some years of rubbing shoulders with teaching faculty I learned about scaffolding 🙂 Tomorrow will be another day…

The night ended (after dinner) with a discussion of possible ways of implementation. Many ideas were offered, from research courses to research retreats, or even modules in lower and upper level classes. My reflection was about how to incorporate this into a molecular bio or even gen bio class from day 0: start looking at real sequences from the moment the concepts are introduced. Maybe too steep a learning curve? I will have to think about a good strategy to implement. On the other hand, it could be part of an arc of application to the different levels of biological concepts from gene to protein.

And this is all for today, dear readers…am fading. Good night!

“No rest for the wicked…

and the righteous don’t need it!”


Washington University in St. Louis

With those words ended Sally Elgin’s presentation last night kicking off the GEP workshop. After a few hours of sleep, getting up at 4 am, and flying to St. Louis, my brain foggily registered that some very intense days were in front of the group. However, it is such an exciting possibility, especially the combination of learning some cutting edge research techniques at such a prestigious university, together with a group of educators coming from a variety of schools, mostly small liberal arts colleges, community colleges, and like myself, a private non-profit. Even after such a short time we discussed similar issues: few resources, lots of courses to teach, and a strong desire to share with our students the experience and joy of research.

I am writing this at the breakfast table, in a hurry. We are heading for a full day of computer work with annotation, so I am not sure how much and how quickly I will be able to write. But for now, visit the course material page, where most of the materials we will be using are located.

Will be back soon!

Prepping for Bioinformatics: basics

Thrill of the edge: the North Rim of the Grand Canyon and me

In a post last January I shared how excited I was to become a participant in the Genomics Education Partnership project. In less than 2 weeks I am bound for St. Louis, Missouri, to learn about bioinformatics (BI) as a project for crowdsourced research in education. I hope to document a bit of my transition from a very superficial knowledge of BI to a deeper one, hopefully one adding a new dimension not only to my teaching, but also to my research.

My interest in BI came from my student’s interest in an obscure microbe and its even more obscure metabolic pathway. It has been a recurring pattern in my life that some of the most interesting things I learned (bringing me new avenues) came in a mysterious serendipitous way. So once he expressed his interest in having me as a thesis advisor (last Fall), I bought a basic BI book, started following people doing BI on Twitter, and began filing articles away on Mendeley and Endnote. However, it was not until some days ago that I found enough space in my brain to actually start reading.

To clarify: if you know BI, you probably won’t find this interesting. On the other hand, I know that a lot of people do not learn certain things because they look intimidating, so they don’t even try. For years I avoided BI because I thought I could be a perfectly happy biologist without delving too deep into it…but the way things are shaping up, the next generation of biologists will need it. And as an educator of biologists, I should be teaching it then. Ergo, I need to learn more about it.

Back to the workshop materials: we were instructed to read the short article by Webber and Ponting for the definitions of the “-ogy” words, such as homology, analogy, orthology, paralogy, and xenology. While it is easy to say that orthologs arise from speciation and paralogs from gene duplication, additional events such as deletions, duplications, conversions, and horizontal gene transfers (causing xenology) can make the picture quite complicated. Sequence identity does not mean homology, although statistics (such as an E value lower than 10⁻³, provided by BLAST) and certain structural features strongly suggest homology.

Next, we were asked to work our way through some materials on the GEP website, such as BLAST tutorials and an introduction to Consed. I feel ok with Blast, although I will review the tutorials, but my next big leap will be to learn about finishing and sequence improvement.

Not today though.


