proteome zone

Friday, February 14, 2014

G-to-T transversions account for Szybalski's Rule

Previously, I described a data-mining experiment showing that the purine content of coding regions of DNA increases in concert with the total A+T content of DNA, not only for bacterial chromosomes (see graph below) but also for mitochondrial DNA and eukaryotic DNA. The general finding of excess purines on message strands of DNA is known as Szybalski's Rule. What I found is that the accumulation of purines on message strands in accordance with Szybalski's Rule is not random; it depends, in a predictable way, on the overall A+T content of the genome. For genomes in which A+T content is below 35% (G+C above 65%), purines actually accumulate on the transcribed (or antisense) strand of DNA, rather than the message strand.

Purine content versus A+T content for coding-region DNA in 1363 bacterial species.

Codon analysis shows unambiguously that as genes become richer in A+T content (or as G+C content goes down), the excess of purines on the message strand becomes larger and larger.

The increase in purine content can be accounted for almost exactly by G-to-T transversions. That is to say, essentially all of the excess adenine on the message strand can be accounted for by loss of guanines on the transcribed strand.

An example will make this clear. Collectively, the coding regions of Streptomyces cattleya contain bases in the following relative amounts:

A: 13.59
G: 35.30
C: 37.85
T: 13.23

S. cattleya happens to fall exactly on the regression line in the above graph, at A+T = 0.2682 and A+G = 0.4889. The spirochete Borrelia burgdorferi (strain 118a) also falls on the regression line, at A+T = 0.7117 and A+G = 0.5536. It has coding-region base contents of:

A: 38.71
G: 16.65
C: 12.17
T: 32.46
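As a quick check on the numbers quoted for the two organisms, here is a minimal Python sketch that recomputes A+T and A+G from the base percentages. It also draws a straight line through just these two points and estimates where that line crosses A+G = 0.5, the point at which purines stop being in excess on the message strand. The two-point line is only a stand-in for the regression over all 1363 species (an assumption justified only by the statement that both organisms fall on the regression line), and the function and variable names are made up for illustration.

```python
# Coding-region base composition in percent, copied from the text above.
S_CATTLEYA = {"A": 13.59, "G": 35.30, "C": 37.85, "T": 13.23}
B_BURGDORFERI = {"A": 38.71, "G": 16.65, "C": 12.17, "T": 32.46}


def at_and_ag(comp):
    """Return (A+T, A+G) as fractions, rounded to four places."""
    at = (comp["A"] + comp["T"]) / 100
    ag = (comp["A"] + comp["G"]) / 100
    return round(at, 4), round(ag, 4)


sc_at, sc_ag = at_and_ag(S_CATTLEYA)
bb_at, bb_ag = at_and_ag(B_BURGDORFERI)

# Straight line through the two points (NOT the full regression).
slope = (bb_ag - sc_ag) / (bb_at - sc_at)

# A+T value at which the two-point line reaches A+G = 0.5.
crossover_at = sc_at + (0.5 - sc_ag) / slope

print(f"S. cattleya:    A+T = {sc_at}, A+G = {sc_ag}")
print(f"B. burgdorferi: A+T = {bb_at}, A+G = {bb_ag}")
print(f"two-point slope = {slope:.4f}")
print(f"A+G reaches 0.5 near A+T = {crossover_at:.3f}")
```

With these two points alone, the line crosses A+G = 0.5 at an A+T content of roughly one third, so a genome with lower A+T than that sits below the 50% purine mark on its message strand.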

The transcribed strand of S. cattleya can be inferred to have an average guanine content of 37.85%, since the message strand has a cytosine content of 37.85%. In B. burgdorferi, the transcribed-strand guanine has dropped to 12.17%. The difference between the two is 25.68%. On the message strand, adenine content goes from 13.59% for S. cattleya to 38.71% for B. burgdorferi, a difference of 25.12%. The implication is that if organisms evolve along the general path of the regression line in the above graph, all of the increase in message-strand adenine content can be accounted for by the loss in transcribed-strand guanine content. (The loss of guanine, in this example, was 25.68%, which is comparable to the increase of adenine, 25.12%, differing by only two parts per hundred.) Other organisms show a similar pattern of the change in transcribed-strand guanine equaling the change in message-strand adenine.
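The arithmetic in the paragraph above can be reproduced with a few lines of Python. This is only a sketch of the bookkeeping, using the percentages already given; the variable names are illustrative, and the one modeling assumption (stated in the text) is that transcribed-strand guanine equals message-strand cytosine.

```python
# Message-strand base composition in percent, from the text above.
s_cattleya = {"A": 13.59, "G": 35.30, "C": 37.85, "T": 13.23}
b_burgdorferi = {"A": 38.71, "G": 16.65, "C": 12.17, "T": 32.46}

# Guanine on the transcribed strand pairs with cytosine on the message
# strand, so message-strand C stands in for transcribed-strand G.
guanine_loss = s_cattleya["C"] - b_burgdorferi["C"]

# Gain of adenine on the message strand.
adenine_gain = b_burgdorferi["A"] - s_cattleya["A"]

# Relative mismatch between the two changes.
relative_gap = (guanine_loss - adenine_gain) / guanine_loss

print(f"transcribed-strand G lost: {guanine_loss:.2f}")
print(f"message-strand A gained:   {adenine_gain:.2f}")
print(f"relative gap:              {relative_gap:.1%}")
```

The gap works out to a little over 2%, which is the "two parts per hundred" mentioned above.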

These numbers imply that guanines on one strand can become adenines on the other strand. That is exactly what happens in G-to-T transversion mutations, which occur through a well-known mechanism: guanine is oxidized to 8-oxo-guanine, which mispairs with adenine during replication, and the 8-oxo-guanine is ultimately replaced by thymine.
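To make the strand bookkeeping concrete, here is a toy Python sketch of a single G-to-T transversion on the transcribed strand and its effect on the message strand. The nine-base sequence is invented for illustration, the strands are written position-aligned (not reversed) so the same index refers to the same base pair, and the oxidation and repair steps are collapsed into one substitution.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base complement, kept in the same order so indices line up."""
    return strand.translate(COMPLEMENT)


def g_to_t_on_transcribed(message, position):
    """Apply a G-to-T change on the transcribed strand; return the new message strand."""
    transcribed = complement(message)
    if transcribed[position] != "G":
        raise ValueError("no guanine at this position on the transcribed strand")
    mutated = transcribed[:position] + "T" + transcribed[position + 1:]
    return complement(mutated)


def purine_count(strand):
    """Number of A and G bases in a strand."""
    return strand.count("A") + strand.count("G")


message = "ATGCCGCTA"          # hypothetical message-strand fragment
new_message = g_to_t_on_transcribed(message, 3)

print(message, purine_count(message))
print(new_message, purine_count(new_message))
```

The cytosine at index 3 of the message strand becomes an adenine, so the message strand gains one purine while the transcribed strand loses one guanine.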

Of the four bases in DNA, guanine is well known to be the base most vulnerable to oxidation. Accordingly, oxidation-driven G-to-T transversions are the most common type of transversion. Mutations of this type can cause a shift in overall DNA G+C content (toward higher A+T content). If G-to-T mutations occur preferentially on one DNA strand, the result will be accumulation of adenine on the opposite strand. This is what happens in nature, apparently. Differential repair of DNA strands at transcription time drives the accumulation of purines on the message strand. (See this post for additional discussion, with data, of how the unique repairosome of obligate anaerobes affects differential strand buildup of purines.)

Substantial work has shown that an AT mutational bias (a tendency for G:C pairs to become A:T pairs) exists in bacteria, even for organisms at the extremes of genome G+C content. This is usually taken to mean that GC-to-AT transition mutations are more common than AT-to-GC transitions. Such discussions need to include GC-to-TA transversion mutations, as well. The most common form of DNA damage is oxidation of guanine to 8-oxo-guanine. This strongly suggests that G-to-T transversions are an important driver of changes in genomic G+C content; and combined with asymmetric strand repair, the predominance of such mutations provides a theoretical basis (which has heretofore been lacking) for Szybalski's Rule.

Monday, October 18, 2010

Private midtown-Manhattan supercomputer sets protein-folding record

A specially designed supercomputer named Anton has simulated conformational changes in a protein's three-dimensional structure over a period of a millisecond, a timescale more than two orders of magnitude greater than the previous record. See the story in NatureNews for details.

Congratulations to the D.E. Shaw Research team!

Tuesday, November 24, 2009

Hadoop in Bioinformatics

Protein Alignment - Paul Brown from Cloudera on Vimeo.

Friday, September 4, 2009

FEBS Letters issue on Protein Misfolding and Disease

I just learned that the 20 August 2009 issue of FEBS Letters, devoted to "Protein Folding, Misfolding and Disease," is open-access, free for download.

At the moment, I'm reading The role of molecular chaperones in human misfolding diseases [PDF] by Sarah A. Broadley and F. Ulrich Hartl of the Max Planck Institute of Biochemistry. It's a good overview of the subject, with 107 references, mostly from 2002 on.

I'm also anxious to look at several other papers, in particular:

Bridging the gap: From protein misfolding to protein misfolding diseases
by Leila M. Luheshi, Christopher M. Dobson
http://www.febsletters.org/article/S0014-5793%2809%2900464-5/abstract?source=aemf

Structure–activity relationship of amyloid fibrils
by Samir K. Maji, Lei Wang, Jason Greenwald, Roland Riek
http://www.febsletters.org/article/S0014-5793%2809%2900528-6/abstract?source=aemf

The GroEL/GroES cis cavity as a passive anti-aggregation device
by Arthur L. Horwich, Adrian C. Apetri, Wayne A. Fenton
http://www.febsletters.org/article/S0014-5793%2809%2900510-9/abstract?source=aemf

Cells and prions: A license to replicate
by Mario Nuvolone, Adriano Aguzzi, Mathias Heikenwalder
http://www.febsletters.org/article/S0014-5793%2809%2900460-8/abstract?source=aemf

Sunday, August 30, 2009

GroEL complex in Chlamydiae

According to Karunakaran et al. (Journal of Bacteriology, March 2003, 185(6):1958-1966), the genomes of all known Chlamydia species contain three groEL-like genes (groEL1, groEL2, and groEL3). Moreover, "Phylogenetic analysis of groEL1, groEL2, and groEL3 indicates that these genes are likely to have been present in chlamydiae since the beginning of the lineage."

It is known also (Tan et al., Journal of Bacteriology, December 1996; 178(23):6983-90) that the chlamydiae have dnaK; and regulation of the dnaK and groE heat shock operons of Chlamydia trachomatis resembles that of the same operons of Bacillus subtilis and Clostridium acetobutylicum.

Thus, even the smallest of intracellular parasites comes equipped with its own robust heat-shock-protein system.

At 1.04 million base pairs in size, the genome of C. muridarum is not anywhere near as small as the genome of Mycoplasma genitalium, whose hsp system I wrote about before, but it is still pretty small. (M. genitalium has a genome of ~580K base pairs.)

The circumstantial evidence is compelling that a functioning multiple-operon hsp system is absolutely essential to all cytoplasm-producing life forms, including those that live inside other cells.

Monday, November 17, 2008

Biological Informatics Subject-Tracer Blog

Marcus P. Zillman's Biological Informatics Subject-Tracer Blog contains a long list of web resources. Some of the links are stale, but there's still a lot of good stuff there.

Friday, November 14, 2008

Amino acid "seqlets"

IBM publishes bio-dictionaries of 'seqlets' for a number of organisms. What's that? you ask. According to IBM: "In a number of publications, we have presented and discussed the idea of the Bio-Dictionary: the latter is a collection of recurrent amino acid combinations (='seqlets') which completely cover the sequence space defined by the biggest possible collection of amino acid sequences. Normally, we recompute the contents of the Bio-Dictionary on a regular basis, typically once a year."

Bio-dictionaries for a handful of Archaeal genomes and a dozen or so bacterial genomes are available for download here. Note that the files are in .Z format (produced by the Unix compress utility), so don't expect to view them in your browser.