scientific comment

Acta Crystallographica Section D
Biological Crystallography
ISSN 0907-4449

WWWWhy does nature stutter? A survey of strands of repeated amino acids

Edgar F. Meyer* and W. John Tollett Jr²

Human stuttering is a simple example of the repetition of sounds or symbols, sometimes associated with single letters, and may be used to illustrate the amazing repetition of amino acids (symbolized by a letter, e.g. W) in proteins. A survey of available databases with highly improbable strings of single amino acids is tabulated. This paper concludes with a challenge to the crystallographic community to probe the structural origins of the structure-function relationship in this neglected area. When nature stutters, we should pay attention.

Current address: A&M Consolidated High School, College Station, TX 77840, USA.

Introduction

That 34 virus structures were detected suggests that this model may be overly simplistic and that cross-correlations may occur, but our purpose here is to report a finding and encourage others to explore its implications, be they probabilistic, statistical, genetic, functional or structural.

International Union of Crystallography. Printed in Denmark - all rights reserved

As gene, protein and structural databases were searched, who would have guessed that 67 consecutive threonines would be found in Cryptosporidium parvum (Barnes et al., 1998)? The probability of 67 repeats in a random sequence at a specific site is approximately 1 in $20^{67} \approx 1.5 \times 10^{87}$ events; the difference in probabilities is exponentially significant. Even though this statistical approximation begs for a more rigorous treatment, it is amazing. WWWWhat is nature telling us? Long consecutive strands of positively or negatively charged amino acids must carry electrostatic penalties, yet these too abound.
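The order of magnitude quoted for the 67-threonine run can be checked with a few lines of arithmetic. The sketch below assumes, as the text's approximation does, that all 20 amino acids are equally likely and that positions are independent; the function names are illustrative and not part of the original survey.

```python
import math


def odds_against_run(run_length: int, alphabet_size: int = 20) -> int:
    """Return N such that a specific residue repeated `run_length` times
    at a fixed site has probability 1/N, assuming equal, independent
    residue frequencies (an assumption, not a measured composition)."""
    return alphabet_size ** run_length


def log10_run_probability(run_length: int, alphabet_size: int = 20) -> float:
    """Base-10 logarithm of the same probability, convenient for long runs."""
    return -run_length * math.log10(alphabet_size)


if __name__ == "__main__":
    n = odds_against_run(67)
    # Python integers are exact, so 20**67 is computed without overflow.
    print(f"1 in {float(n):.1e}")                      # order of 10^87
    print(f"log10(p) = {log10_run_probability(67):.2f}")
```

A more rigorous treatment would use observed amino-acid frequencies and account for the number of sites searched, which this sketch deliberately ignores.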
In a nuclear transport protein (PDB code 1qbk), polyaspartate is augmented by two glutamates to create a startling exposed strand of 14 consecutive negatively charged residues. Intuitively, one could assume that uncharged amino acids would be more likely to occur repetitively, but polymethionine also has a relatively low occurrence (7). Because of pronounced peptide backbone angular constraints, proline was considered to be a 'helix breaker', but polyPro actually forms a left-handed helix (1jvr). In HIV-1 reverse transcriptase (1c9r; residues 315-326), an extended polyAla strand is parallel to an α-helix that is also rich in Ala. Conversely, a 12-Ala repeat forms a cluster of three α-helices at the tip of a tumor necrosis factor receptor (1czz). At this stage, it appears that while polyPro may be structurally conserved, polyAla is not. PolyCys is one of the few repeat sequences which is generally buried, forming a tight trimer knot in a spider toxin (1qdp), a triple S-S knot (1ag8), and a tight buried loop central to an amazing chain of seven S-S linkages in the ferric hydroxamate uptake receptor (1cw3, 1a4z). These searches reveal a wide range of structures, populations and probabilities, summarized by abbreviated tables [tables also ... =$key&id=1); the related Chime links will make the structural results more readily accessible to a broader audience]. While some entries of gene sequences are deposited without comment and/or literature citation (Table 1), many protein sequence entries (e.g. PIR, SwissProt, EMBL) are cited (Table 2) and infer functional roles.

Acta Cryst. (2001). D57, 181-186

Table 1. GenBank results, 23 June 2000.
Although smallest in size, the Protein Data Bank (Bernstein et al., 1977; Meyer, 1997; 0 Amino acid Alanine 0 Residues 129±148 129±148 497±517 497±517 241±260 241±260 241±260 241±260 241±260 13±42 138±187 24±69 720±768 266±311 50±95 777±822 285±325 11±33 1856±1900 362±402 152±191 58±95 58±95 0 GenBank ID# GBINV:DMJ001164 GBINV:AE003814 GBINV:DMU11383 GBINV:DMOVO GBPRI:AF117979 GBPRI:D82344 GBROD:MMPHOX2B GBROD:AB015672 GBPRI:AB015671 GBPRI:HUMFMR1 GBINV:DDU38197 GBINV:AF019981 GBINV:DDI238883 GBINV:AF104350 GBINV:AE001416 GBINV:AE001418 GBPLN:F11A17 GBPRI:HSU63332 GBINV:AF153362 GBVRT:CCJ002238 GBPRI:HSU80741 GBPRI:HUMTFIIDA GBPRI:HS191N21 0 Arginine Asparagine 0 GBPRI:HUMTFIID GBINV:AF024654 GBINV:AE003446 GBROD:MMJ225123 GBROD:AF028737 GBPLN:SCYBR289W GBPLN:SCDPB3 GBPLN:YSCSNF5 GBINV:AE003536 GBPLN:ATF17C15 GBPLN:ATF23E13 GBPLN:ATCHRIV85 GBPRI:HUMARB GBPRI:L29496 GBPRI:HSU16371 GBPLN:ATAC011708 GBINV:AE003451 GBINV:AE003430 GBINV:DMSEG0007 GBVRL:AF169823 GBINV:CELC15C7 GBSYN:AF025672 0 Meyer & Tollett 0 WWWWhy does nature stutter? 0 Acta C 0 ANALYTICAL BIOCHEMISTRY 0 Effects of relative humidity and buffer additives on the contact printing of microarrays by quill pins 1 Mark K. McQuain,a Kevin Seale,b Joel Peek,b Shawn Levy,c and Frederick R. Haseltona,* 0 Abstract DNA microarrays printed with quill pins exhibit significant variation in probe DNA spots. Interspot variations and nonuniform distribution of probe within spots are major sources of experimental uncertainty in microarray analysis. To gain better insight into the sources of variation, we analyzed 450 consecutive depositions printed at relative humidities between 40 and 80% using three print buffers. Increasing relative humidity improved printing performance by delaying pin failure but did not reduce the variability in spot characteristics. Adding either betaine or dimethyl sulfoxide (DMSO) to the print buffer also improved quill pin performance. 
The least interspot variation was observed with the DMSO additive printed at 80% relative humidity, but this additive also resulted in the greatest intraspot variation. The least intraspot variation was observed with 1.5 M betaine printed at 60% relative humidity, but these conditions produced microarrays with high interspot variability. Evaporation of printing solution from the quill reservoir appeared to be the primary cause of interspot and intraspot variations. Our studies indicate that relative humidity and printing solution additives reduce evaporation. Based on the spot variability requirements for a particular application, humidity and additives may be chosen to optimize either inter- or intraspot variability. © 2003 Elsevier Science (USA). All rights reserved.

Keywords: DNA microarrays; Microfluidics

DNA microarrays are important tools for obtaining high-throughput genetic information and are often used for expression profiling, gene copy estimation, and polymorphism analysis [1-11]. Though they have been applied successfully in many research applications, there are significant problems which limit their use to qualitative analysis of large signal changes. To compensate for experimental variability, almost all current microarray analyses rely on differential measurement techniques that assess results compared to a reference [12]. Analysis is often focused on the most reliable and repeatable portions of the data [13]. The difficulty in interpreting the remaining data is usually attributed to a variety of factors, including inter- and intraspot variations [14,15].

Abbreviations used: SSC, standard saline citrate; DMSO, dimethyl sulfoxide; R.H., relative humidity; RFU, relative fluorescence unit.

interest to be captured and stored electronically. Length calibration was achieved using a laser-etched reference grid positioned to achieve sharp focus at the same height as the point of pin contact with the printing surface.
Scanning of multiple spots printed manually or robotically

For manual printing, the video microscope apparatus described above was used. Depositions of a freshly loaded pin were recorded over the course of a 10-min period at the rate of one deposition every 3 s. For robotic printing, a commercial robot (designed by

Comparative effects of levosulpiride and cisapride on gastric emptying and symptoms in patients with functional dyspepsia and gastroparesis

Background: The efficacy of several prokinetic drugs on dyspeptic symptoms and on gastric emptying rates is well established in patients with functional dyspepsia, but formal studies comparing different prokinetic drugs are lacking.

Aim: To compare the effects of chronic oral administration of cisapride and levosulpiride in patients with functional dyspepsia and delayed gastric emptying.

Methods: In a double-blind crossover comparison, the effects of a 4-week administration of levosulpiride (25 mg t.d.s.) and cisapride (10 mg t.d.s.) on the gastric emptying rate and on symptoms were evaluated in 30 dyspeptic patients with functional gastroparesis. At the beginning of the study and after levosulpiride or cisapride treatment, the gastric emptying time of a standard meal was measured by 13C-octanoic acid breath test. Gastrointestinal symptom scores were also evaluated.

Results: The efficacy of levosulpiride was similar to that of cisapride in significantly shortening (P < 0.001) the t1/2 of gastric emptying. No significant differences were observed between the two treatments with regard to improvements in total symptom scores. However, levosulpiride was significantly more effective (P < 0.01) than cisapride in improving the impact of symptoms on the patients' every-day activities and in improving individual symptoms such as nausea, vomiting and early postprandial satiety.

Conclusion: The efficacy of levosulpiride and cisapride in reducing gastric emptying times with no relevant side-effects is similar.
The impact of symptoms on patients' every-day activities and the improvement of some symptoms such as nausea, vomiting and early satiety were more evident with levosulpiride than cisapride.

Prokinetic drugs have been extensively tested in the treatment of functional dyspepsia. This is because gastrointestinal motor abnormalities and, in particular, delayed gastric emptying have been frequently reported in patients suffering from this common syndrome.1-6

These abnormalities are regarded as a likely source of symptoms even if no clear cause-effect relationship between severity of symptoms and degree of delay in gastric emptying has been proven to date.7 Among prokinetic drugs, several placebo-controlled trials have provided evidence on the efficacy of cisapride and dopamine receptor antagonists such as metoclopramide, domperidone, and recently levosulpiride in the treatment of functional dyspepsia.8-28 Metoclopramide, domperidone and levosulpiride have both antiemetic and prokinetic properties because they antagonize dopamine receptors in the central nervous system as

© 2000 Blackwell Science Ltd, Aliment Pharmacol Ther 14, 561-569

MATERIALS AND METHODS

impact on every-day activities was scored as: 0, not at all bothersome; 1, a little bit bothersome; 2, moderately bothersome; 3, quite a bit bothersome; 4, extremely bothersome. The cut-off values of symptom scores for inclusion in the study were established on the basis of the data obtained by the same questionnaires filled in by 200 healthy volunteers (84 males, 116 females, aged 42 ± 4 years). A score decrease of at least 50% was defined as a 'symptom improvement'. The reproducibility of the symptom questionnaire had previously been validated in 40 patients with functional dyspepsia. The score evaluation of their symptoms was performed by the patients themselves on two separate occasions (2-4 weeks apart).
The calculated κ-values were 0.84 for total severity scores, whereas scores for frequency, duration and impact were 0.72, 0.69, and 0.87, respectively.

Gastric emptying studies

Gastric emptying time was measured by means of 13C-octanoic acid breath test as previously described.34 This test was performed during the run-in period and at the end of each treatment. Patients were given a standard test meal consisting of one egg with 5 g of butter, two slices of white bread and 150 mL of water; 100 mg 13C-octanoic acid (Cortex Italia, Milan, Italy) was incorporated into the homogenized egg yolk, which was baked separately from the egg white. For practical reasons, the test meal was given at 13.00 hours, after an overnight fast, and eaten in 10 min. In order to interfere as little as possible with the subjects' normal eating habits, they were allowed to eat a light breakfast restricted to 100 mL of milk alone with 10 g of sugar at 07.00/08.00 hours. Females were studied during the first 10 days of the menstrual cycle. Breath samples were collected just before, and every 15 min after the test meal for 6 h; 13CO2 measurements were performed with an isotope ratio mass spectrometer

THE THERMODYNAMICS OF DNA STRUCTURAL MOTIFS

John SantaLucia,1,2 and Donald Hicks2

Key Words: secondary structure, prediction, hybridization, oligonucleotides, nucleic acid folding

Abstract: DNA secondary structure plays an important role in biology, genotyping diagnostics, a variety of molecular biology techniques, in vitro-selected DNA catalysts, nanotechnology, and DNA-based computing. Accurate prediction of DNA secondary structure and hybridization using dynamic programming algorithms requires a database of thermodynamic parameters for several motifs including Watson-Crick base pairs, internal mismatches, terminal mismatches, terminal dangling ends, hairpins, bulges, internal loops, and multibranched loops.
To make the database useful for predictions under a variety of salt conditions, empirical equations for monovalent and magnesium dependence of thermodynamics have been developed. Bimolecular hybridization is often inhibited by competing unimolecular folding of a target or probe DNA. Powerful numerical methods have been developed to solve multistate-coupled equilibria in bimolecular and higher-order complexes. This review presents the current parameter set available for making accurate DNA structure predictions and also points to future directions for improvement.

Loop Database
Hairpin Loops
Internal Loops
Bulges
Coaxial Stacking Parameters
Multibranched Loops
QUALITY OF SECONDARY STRUCTURE PREDICTIONS
MULTISTATE MODELING OF DNA FOLDING AND HYBRIDIZATION
FUTURE DIRECTIONS

INTRODUCTION

Biological Importance of DNA Secondary Structure

Molecular Biology and Biotechnology Applications of DNA Secondary Structure

THERMODYNAMICS OF DNA MOTIFS

of biotechnology techniques that exploit the three-dimensional folding potential of DNA have also been demonstrated including DNA nanotechnology (75) and DNA computing (21).
The DNA Folding Problem

Similar to the protein and RNA folding problems, there is a corresponding "DNA folding problem" in which it is desired to predict the structure and folding energy of the DNA given its sequence. Fortunately, several features of DNA and RNA make them especially amenable to structure prediction. Notably, DNA and RNA secondary structures result from strong Watson-Crick pairing interactions, and tertiary interactions are a weaker second-order effect (81). Thus, to an excellent approximation, tertiary interactions may be neglected and accurate secondary structure prediction is possible. The strong pairing rules also allow for the DNA secondary structure to be reduced to discrete interactions in which two positions in a sequence are either paired or not. Even with the neglect of tertiary interactions such as pseudoknots, however, the number of possible secondary structures is approximately $1.8^N$, where N is the sequence length (95). Fortunately, with the discrete pairing approximation, DNA and RNA are suitable for powerful dynamic programming algorithms, which were described in a previous review (83). Dynamic programming algorithms guarantee that for a given set of rules, the minimum energy (i.e., optimal) structure will be found in computation time of order $N^3$ with memory of order $N^2$, thereby allowing predictions of sequences with fewer than 10,000 nucleotides with currently available computers. Dynamic programming algorithms also predict suboptimal structures within user-defined energy and distance windows (94). This is important because the energy rules are not perfect and tertiary interactions are neglected (as are interactions with proteins and the specific interactions with magnesium or other cofactors). Thus, one of the few structures near the free-energy minimum is likely to be correct. Selected functional sequences and random sequences of DNA or RNA differ in an important way.
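The cubic-time, quadratic-memory behaviour of the dynamic programming approach described above can be illustrated with a deliberately simplified recursion. The sketch below is a Nussinov-style base-pair maximization, not the nearest-neighbor free-energy minimization that the review refers to: it counts Watson-Crick pairs instead of summing thermodynamic parameters, and the function name and the minimum hairpin-loop size of 3 are assumptions made for illustration.

```python
def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested Watson-Crick pairs in a DNA sequence.

    Simplified stand-in for energy minimization: every A-T or G-C pair
    scores 1, pseudoknots are excluded, and a hairpin must enclose at
    least `min_loop` unpaired bases. Runs in O(N^3) time, O(N^2) memory.
    """
    pairs = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]          # case 1: position j is unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    # case 2: j pairs with k, splitting the problem in two
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAACCC"))  # a three-pair hairpin stem
```

Replacing the pair count with stacking and loop energies from a thermodynamic database changes the scoring rules but not the overall table-filling structure.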
Random sequences have a low probability of folding into compact three-dimensional structures stabilized by tertiary interactions; thus random sequences are most amenable to secondary structure prediction because the neglect of tertiary interactions is appropriate. On the other hand, selected sequences (selected either by evolution or by in vitro selection, or rationally designed) are more likely to contain tertiary interactions, which compromise the reliability of the secondary structure prediction algorithms. This difference makes DNA folding much easier to predict (for random sequences) than corresponding biologically selected RNAs. Note that dynamic programming algorithms also neglect kinetically trapped structures and assume structures are populated according to an equilibrium Boltzmann distribution; thus the structures close to minimum free energy are most probable. Recently, we have also extended the dynamic programming algorithm to predict bimolecular optimal and suboptimal structures so that match and mismatch hybridizations of a short probe to long-target DNA may be readily identified on

Overview of the DNA Thermodynamic Database

Dynamic programming algorithms for DNA secondary structure predicti

Articles

Nearest-Neighbor Thermodynamics and NMR of DNA Sequences with Internal A·A, C·C, G·G, and T·T Mismatches

Nicolas Peyret, P. Ananda Seneviratne, Hatim T. Allawi, and John SantaLucia*

ABSTRACT: Thermodynamic measurements are reported for 51 DNA duplexes with A·A, C·C, G·G, and T·T single mismatches in all possible Watson-Crick contexts. These measurements were used to test the applicability of the nearest-neighbor model and to calculate the 16 unique nearest-neighbor parameters for the 4 single like with like base mismatches next to a Watson-Crick pair. The observed trend in stabilities of mismatches at 37 °C is G·G > T·T ≈ A·A > C·C.
The observed stability trend for the closing Watson-Crick pair on the 5′ side of the mismatch is G·C ≥ C·G ≥ A·T ≥ T·A. The mismatch contribution to duplex stability ranges from -2.22 kcal/mol for GGC·GGC to +2.66 kcal/mol for ACT·ACT. The mismatch nearest-neighbor parameters predict the measured thermodynamics with average deviations of ΔG°37 = 3.3%, ΔH° = 7.4%, ΔS° = 8.1%, and TM = 1.1 °C. The imino proton region of 1-D NMR spectra shows that G·G and T·T mismatches form hydrogen-bonded structures that vary depending on the Watson-Crick context. The data reported here combined with our previous work provide for the first time a complete set of thermodynamic parameters for molecular recognition of DNA by DNA with or without single internal mismatches. The results are useful for primer design and understanding the mechanism of triplet repeat diseases.

DNA mismatches occur in vivo due to misincorporation of bases during replication (1), heteroduplex formation during homologous recombination (2), mutagenic chemicals (3, 4), ionizing radiation (5), and spontaneous deamination (6). Knowledge of the thermodynamics of DNA mismatches will be useful for elucidating the mechanisms of polymerase fidelity and mismatch repair efficiency. Moreover, thermodynamic parameters for mismatch formation are important for DNA secondary structure prediction (see http://sun2.science.wayne.edu/jslsun2 and http://mfold1.wustl.edu/mfold/dna/form1.cgi). Recent work has shown that triplet repeat sequences form transiently stable hairpins that contain like with like base mismatches (7-14). The formation of these secondary structures can induce genome expansion or deletion during replication (15, 16) resulting in at least 11 different human diseases (17-19).
Mismatch thermodynamics is also important for molecular biological techniques such as PCR (20), Southern blotting (21), single-stranded conformational polymorphism (SSCP) (22-24), sequencing by hybridization (25, 26), antigene targeting (27), Kunkel site-directed mutagenesis (28), and optimization of DNA chip arrays for diagnostics (29). These techniques require optimization of sequence, temperature, and solution conditions to avoid detection or amplification of wrong sequences. Previous work from our laboratory has shown that a nearest-neighbor (NN) model is valid to describe the thermodynamics of DNA structures involving canonical A·T and G·C base pairs (30-32) as well as G·T (31), G·A (33), C·T (34), and A·C (35) mismatches. We hypothesized that the nearest-neighbor model is also applicable to single A·A, C·C, G·G, and T·T mismatches. To test this hypothesis, thermodynamic measurements of 45 sequences combined with 6 from the literature (36, 37) were used to derive NN parameters for like with like base mismatches. 1-D NMR and CD studies were used to qualitatively probe the structures formed by the mismatches. These data combined with our previous results provide a complete thermodynamic database for DNA molecular recognition by DNA with or without single internal mismatches.

MATERIALS AND METHODS

DNA Synthesis and Purification. Oligonucleotides were graciously provided by Hitachi Chemical Research and were synthesized on solid support using standard phosphoramidite chemistry (38). Oligonucleotides were detached from the

Abbreviations: Na2EDTA, disodium ethylenediaminetetraacetate; eu, entropy unit; MES, 2-(4-morpholino)ethane sulfonate; NMR, nuclear magnetic resonance; NN, nearest-neighbor; SVD, singular value decomposition; TLC, thin-layer chromatography; UV, ultraviolet.

$$Y^\circ_{\text{total}} = Y^\circ_{\text{initiation}} + Y^\circ_{\text{sym}} + 2Y^\circ(\text{GG/CC}) + 2Y^\circ(\text{GA/CT}) + 2Y^\circ(\text{AG/TC}) + 2Y^\circ(\text{GT/CT}) \tag{2}$$

The notation GT/CT refers to a 5′GT3′ dimer hydrogen bonded to a 3′CT5′ dimer with the mismatch underlined.
The mismatch contribution to duplex stability is given by rearranging eq 2:

$$2Y^\circ(\text{GT/CT}) = Y^\circ_{\text{total}} - Y^\circ_{\text{initiation}} - Y^\circ_{\text{sym}} - 2Y^\circ(\text{GG/CC}) - 2Y^\circ(\text{GA/CT}) - 2Y^\circ(\text{AG/TC}) \tag{3}$$

Thus, the mismatch contribution is calculated by subtracting the initiation, symmetry, and Watson-Crick nearest-neighbor increments (31) from the total experimental value.

Number of Linearly Independent Parameters. In our previous studies of G·T, G·A, A·C, and C·T single mismatches, we showed that it is impossible to uniquely solve for eight dimer nearest neighbors from a data set of oligomers containing only single internal mismatches (31). Instead, within the limits of the nearest-neighbor model, only seven linearly independent trimers are sufficient to accurately predict internal mismatch thermodynamics. In the case of single like with like base mismatches (i.e., A·A, C·C, G·G, and T·T), however, symmetry allows for a unique solution of four internal nearest-neighbor dimers to be found. In particular, the dimer nearest neighbors can be uniquely solved from sequences that contain these trimers:

where X = A, C, G, or T. According to the nearest-neighbor model, any sequence with an internal X·X mismatch can be determined from linear combinations of eqs 4a-d. It should be noted, however, that even though it is possible to uniquely solve for the X·X dimer nearest-neighbor parameters from a set of oligonucleotides with only internal mismatches, these parameters cannot be used to accurately predict the thermodynamics of duplexes with terminal mismatches. As we found earlier (31), terminal mismatches always make favorable contributions to dup

REVIEW ARTICLE

The marks, mechanisms and memory of epigenetic states in mammals

Vardhman K. RAKYAN, Jost PREIS, Hugh D. MORGAN and Emma WHITELAW1

It is well recognized that there is a surprising degree of phenotypic variation among genetically identical individuals, even when the environmental influences, in the strict sense of the word, are identical.
Genetic textbooks acknowledge this fact and use different terms, such as 'intangible variation' or 'developmental noise', to describe it. We believe that this intangible variation results from the stochastic establishment of epigenetic modifications to the DNA nucleotide sequence. These modifications, which may involve cytosine methylation and chromatin remodelling, result in alterations in gene expression which, in turn, affect the phenotype of the organism. Recent evidence, from our work and that of others in mice, suggests that these epigenetic modifications, which in the past were thought to be cleared and reset on passage through the germline, may sometimes be inherited by the next generation. This is termed epigenetic inheritance, and while this process has been well recognized in plants, the recent findings in mice force us to consider the implications of this type of inheritance in mammals. At this stage we do not know how extensive this phenomenon is in humans, but it may well turn out to be the explanation for some diseases which appear to be sporadic or show only weak genetic linkage.

Key words: chromatin, inheritance, methylation.

The various cell types in a multicellular organism are genotypically identical and yet phenotypically different. This is due to differences in the patterns of gene expression that exist between the different cell groups. The stable maintenance of these differences is thought to be due to epigenetic control of gene expression. This involves physically 'marking' the DNA, without altering the nucleotide sequence, either by the addition of methyl groups to certain cytosine bases and/or the packaging of the DNA into a highly condensed state. These modifications interfere with the DNA-protein interactions that facilitate transcription, resulting in transcriptional silencing of the epigenetically modified allele. Epigenetic modifications can, therefore, cause phenotypic variation in the absence of genetic differences.
It is well recognized that 'silenced' alleles can be inherited through many rounds of DNA replication, and therefore epigenetic modifications or 'marks' can be maintained through mitotic cell divisions. Generally, however, it has been assumed that these marks are erased and reset at some stage during gametogenesis or early embryogenesis to reinstate the totipotency of the developing embryo. There is now an increasing body of evidence which suggests that epigenetic marks at some mammalian alleles are not completely erased from one generation to the next, resulting in complex patterns of inheritance that do not conform to Mendelian principles. Therefore not only can phenotype vary in the absence of genetic and environmental factors, described by some as 'intangible variation' [1] or 'developmental noise' [2], but these phenotypic differences can also be inherited by the offspring. This review will present a brief overview of the role of methylation and chromatin remodelling in epigenetic regulation of gene expression, followed by examples of classic epigenetic phenomena in mammals. We will then discuss the evidence available for epigenetic inheritance through the germline, with an emphasis on murine models, which suggest that this form of inheritance may be occurring at a number of mammalian loci.

EPIGENETIC MODIFICATIONS OF DNA

The two mechanisms by which DNA is epigenetically marked, although there may be others yet to be discovered, are methylation and chromatin condensation. Both of these mechanisms are associated with gene silencing, and recent evidence, discussed below, suggests that these two mechanisms are not mutually exclusive, but instead act in concert to silence gene expression in mammalian cells.

DNA methylation

Methylation involves the enzymic transfer of a methyl group to the 5-position of the pyrimidine ring of a cytosine residue [3-5].
This usually occurs at cytosine bases that are immediately followed by a guanine, known as CpG dinucleotides [6,7]. In mammalian genomes, the CpG dinucleotide is greatly underrepresented due to the increased spontaneous deamination rate of 5-methylcytosine into thymine. Of the CpGs present, approx. 70% are methylated [8], whereas the majority of unmethylated CpGs occur in small clusters known as CpG islands, which are ordinarily found within or near promoters or first exons of 'housekeeping' genes [9,10]. Methylation is catalysed by DNA methyltransferases (Dnmts) and four mammalian Dnmts have been identified so far, Dnmt1, Dnmt2 [12], Dnmt3A and Dnmt3B [13], although our understanding of how these enzymes function is sketchy at best. Dnmt1 is probably involved in maintaining methylation patterns through mitosis [14]. Following DNA replication, the two double-stranded daughter molecules initially contain a hemi-methylated CpG pattern, which is recognized and converted into the fully methylated parental pattern by Dnmt1 [15]. However, it has been found that the error rate of replication of methylation patterns of an artificially methylated DNA sequence transfected into cell lines is significantly higher than that observed for DNA replication [16,17].

the vicinity and reassociating with the newly assembled chromatin following DNA replication. Evidence for this mechanism comes from the observation that some HATs form part of a complex that remains associated with its target DNA throughout the cell cycle [42-44]. A second mechanism may involve targeting the HATs and HDACs to regions of methylated DNA, so that pre-existing acetylation patterns are propagated along with methylation patterns during DNA replication. Indeed, it has recently been discovered that the maintenance methylase, Dnmt1, can interact with a histone deacetylase [45-47].
In addition, a later study [18] showed that clonal populations of histologically homogenous cells did not have homologous methylation patterns. These findings have been confirmed by more recent work, using the highly sensitive bisulphite conversion method to analyse methylation patterns in vivo [19,20]. Therefore the infidelity of replication of methylation patterns has the potential to generate phenotypic diversity among genetically identical cells of the same lineage. Dnmt2 may play a role in epigenetic control of centromere function [21], and Dnmt3A and 3B are thought to be de novo methylases which set up the initial patterns of methylation during embryogenesis [22]. However, data suggest that Dnmts have overlapping functions [23,24], and the precise role of any particular Dnmt is determined by the cellular context. During mammalian development, there are 'waves' of extensive demethylation of the genome in the primordial germ cell stage and pre-implantation embryo [25-28]. A mammalian protein with specific demethylase activity for CpG dinucleotides has been reported [29,30], although it remains to be fully characterized biochemically.

Epigenetic regulation of transcription

The precise mechanisms by which methylation and chromatin compaction regulate transcription are unclear, although several studies suggest that these two mechanisms are linked. MECP2 (methyl-CpG binding protein 2) is a transcriptional repressor that selectively recognizes methylated CpG dinucleotides [48,49]. MECP2, and other methyl-CpG binding proteins, associate with co-repressor complexes that include HDACs [50-53]. This directs the formation of stable repressive chromatin structures [54]. Recent findings [51,52] link the four different methyl-CpG binding domain (MBD) proteins, MECP2, MBD1, MBD2 and MBD3, with the chromatin-remodelling machinery, providing further evidence for the association between methylation and chromatin remodelling.
Therefore it seems that methylation acts through histone deacetylation to establish a repressive chromatin state that blocks the access of the transcription machinery, although at present we do not know how the initial patterns of methylation are set up de novo. However, for certain organisms, e.g. Drosophila, methylation is observed only in very early embryogenesis [55] (for decades it was believed that DNA methylation was non-existent in Drosophila), and others, like the yeast Schizosaccharomyces pombe, do not methylate their DNA at all. Therefore in some eukaryotic organisms chromatin-mediated mechanisms alone may be sufficient to mediate epigenetic regulation of gene expression. 0 Chromatin packaging 0 In the nucleus, DNA exists as a nucleoprotein complex termed chromatin. Chromatin is assembled from arrays of nucleosomes, each of which is approx. 200 bp of linear DNA wrapped around an octamer of histone proteins. Two distinct types of chromatin are known, heterochromatin and euchromatin. Heterochromatin is believed to represent regions of DNA-protein complexes that are in a tightly packed conformation [31,32]. Constitutive heterochromatin is usually found at the centromeric and subtelomeric regions of chromosomes 0 Spot shape modelling and data transformations for microarrays 1 Claus Thorn Ekstrom1, Soren Bak2, Charlotte Kristensen2, and Mats Rudemo1 0 Department 0 Present address: Poalis A/S, Buelowsvej 25, 1870 Frederiksberg C, Denmark 0 In order to study lowly expressed genes in microarray experiments, it is useful to increase the photometric gain in the scanning. However, a large gain may cause some pixels for highly expressed genes to become saturated, i.e. the registered pixel values become censored at the upper limit, which with 16-bit precision is $2^{16} - 1 = 65535$.
Techniques for adjustment of highly expressed signal intensities are given in Wit and McClure (2003) based on a small set of available spot summaries, such as spot mean, spot median and spot variance. As mentioned in Wit and McClure (2003), it should be possible to get more accurate adjustments when all pixel values are available. In the present paper, we study spatial statistical models for pixel values that should enable such adjustments. A convenient type of modelling is to transform data to become approximately Gaussian distributed with a mean value function determined by gene intensities and spot shapes and a corresponding covariance function. For such models, censored pixel values can be estimated optimally. We investigate several types of transformations on the pixel level such as the logarithmic transformation, the Box-Cox family (Box and Cox, 1964) and the inverse hyperbolic sine transformation (Huber et al., 2002; Durbin et al., 2002), also called the generalized logarithm (Rocke and Durbin, 2003). The inverse hyperbolic sine transformation has proven useful for analyzing microarray spot intensities, but here we apply it at the pixel level. The Box-Cox transformation with exponent 0.5, i.e. a square root transformation optimal for Poisson distributed counts, has been used for pixel-level analysis of microarray data by Glasbey and Ghazal (2003). The spot shapes studied include three types suggested by Wierling et al. (2002): (i) a cylindric plateau spot distribution, (ii) an isotropic two-dimensional (2D) Gaussian distribution and (iii) a crater spot distribution consisting of a difference between two scaled isotropic 2D Gaussian distributions. These models do not seem to provide a satisfactory description for the dataset considered, and we introduce a new class of models with polynomial-hyperbolic spot shape. With a second degree polynomial we get a considerably improved performance.
This spot shape may be regarded as a generalization of the cylindric plateau spot shape. 0 Spot shape models and transformations 0 The models are applied to a dataset obtained with a specially designed spotted 50mer oligonucleotide microarray. Here, the expression of 452 selected genes in transgenic Arabidopsis plants is compared with that of the corresponding genes in wildtype plants. Data include scans with different photometric gains ranging from no saturation to heavy saturation. 0 where 1 > 0, and an inverse hyperbolic sine transformation 0 DATA, TRANSFORMATIONS AND EXPLORATORY ANALYSIS Materials 0 Y = k arsinh 0 SPOT SHAPE MODELS 0 Based on empirical observations of spot intensity profiles as seen in Figure 1 as well as in Duggan et al. (1999) (Fig. 2) and Glasbey and Ghazal (2003) (Fig. 1), we desire a spatial spot shape model to have the following three properties: (i) it should be isotropic, i.e. the average intensity at a pixel x should depend only on the distance from x to the spot centre and not on the direction from the centre; (ii) it should allow for spot shapes resembling both `volcanos/craters/donuts' and `plateaus', since spot intensities are often highest near the edge of the spot and smaller near the spot centre, making the resulting spot shape resemble a volcano (middle panel of Fig. 1); and (iii) it should allow for spatial correlation, i.e. pixels close together and with the same distance from the spot centre should be more correlated than pixels further apart. 0 Let Z = Z(x) denote the intensity of a pixel x. Here, Z is a 16-bit integer, i.e. $0 \le Z \le 2^{16} - 1 = 65535$. Let Y(x) denote a transformation of Z(x), Y (x) = f (Z(x), ), (1) 0 where f (·, ) is a family of transformations depending on the parameter vector . In the following, we shall consider three transformations: A logarithmic transformation Y = k log(Z + 1 ), (2) 0 C.T.Ekstrom et al.
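The three pixel-level transformations named above (logarithmic, Box-Cox and inverse hyperbolic sine) can be sketched as follows. This is only an illustration under stated assumptions: the parameter symbols did not survive in the text, so the names `k`, `shift`, `lam` and `scale`, and the exact way the offset and scale enter each formula, are hypothetical choices and not necessarily the authors' parameterization. The censoring helper simply mimics saturation at the 16-bit limit.

```python
import math

MAX_PIXEL = 2**16 - 1  # 65535, upper limit of a 16-bit pixel value


def censor(z):
    """Mimic scanner saturation: values above the 16-bit limit are censored."""
    return min(z, MAX_PIXEL)


def log_transform(z, k=1.0, shift=1.0):
    """Logarithmic transformation; 'shift' is a hypothetical positive offset."""
    return k * math.log(z + shift)


def box_cox(z, lam, shift=1.0):
    """Standard Box-Cox form; lam = 0.5 gives a square-root-type transform."""
    u = z + shift
    if lam == 0:
        return math.log(u)  # limiting case of the family
    return (u**lam - 1.0) / lam


def arsinh_transform(z, k=1.0, scale=1.0):
    """Inverse hyperbolic sine (generalized logarithm); 'scale' is hypothetical."""
    return k * math.asinh(z / scale)


# For large intensities arsinh(z) behaves like log(2z), which is why it is
# also called a generalized logarithm; near zero it is approximately linear.
z = censor(70000)
print(z, log_transform(z), box_cox(z, 0.5), arsinh_transform(z))
```

With these forms, the Box-Cox family approaches the logarithm as `lam` tends to zero, so the logarithmic transformation can be viewed as a limiting member of that family.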
0 January 2003 0 The Importance of Thermodynamic Equilibrium for High Throughput Gene Expression Arrays 1 Gyan Bhanot,* Yoram Louzoun,y Jianhua Zhu,z and Charles DeLisiz 0 ABSTRACT We present an analysis of physical chemical constraints on the accuracy of DNA micro-arrays under equilibrium and nonequilibrium conditions. At the beginning of the article we describe an algorithm for choosing a probe set with high specificity for targeted genes under equilibrium conditions. The algorithm as well as existing methods is used to select probes from the full Saccharomyces cerevisiae genome, and these probe sets, along with a randomly selected set, are used to simulate array experiments and identify sources of error. Inasmuch as specificity and sensitivity are maximum at thermodynamic equilibrium, we are particularly interested in the factors that affect the approach to equilibrium. These are analyzed later in the article, where we develop and apply a rapidly executable method to simulate the kinetics of hybridization on a solid phase support. Although the difference between solution phase and solid phase hybridization is of little consequence for specificity and sensitivity when equilibrium is achieved, the kinetics of hybridization has a pronounced effect on both. We first use the model to estimate the effects of diffusion, crosshybridization, relaxation time, and target concentration on the hybridization kinetics, and then investigate the effects of the most important kinetic parameters on specificity. We find even when using probe sets that have high specificity at equilibrium that substantial crosshybridization is present under nonequilibrium conditions. Although those complexes that differ from perfect complementarity by more than a single base do not contribute to sources of error at equilibrium, they slow the approach to equilibrium dramatically and confound interpretation of the data when they dissociate on a time scale comparable to the time of the experiment. 
For the best probe set, our simulation shows that steady-state behavior is obtained in a relaxation time of ~12-15 h for experimental target concentrations of ~($10^{-13}$-$10^{-14}$) M, but the time is greater for lower target concentrations in the range ($10^{-15}$-$10^{-16}$) M. The result points to an asymmetry in the accuracy with which up- and downregulated genes are identified. 0 INTRODUCTION Single assay characterization of the response of thousands of genes to environmental perturbations is altering the research paradigm in biomolecular science. Applications are increasing explosively in areas as wide ranging as gene expression and regulation (Lashkari et al., 1997), genotyping and resequencing, and drug discovery and disease stratification (Eisen et al., 1998). The potential impact of micro-arrays on basic and applied biology is so important that an entire industry has been spawned, using any of dozens of variants of two generic methods to fabricate arrays--either direct deposition of probes (Schena et al., 1998; DeRisi et al., 1996; Duggan et al., 1999) or covalent attachment by in situ synthesis (Hughes et al., 2001; LeProust et al., 2000; Lipshutz et al., 1999; Singh-Gasson et al., 1999). The former method allows a wide range of substances such as presynthesized oligomers, proteins, cloned DNA, etc., to be used as probes. The latter is generally restricted to oligonucleotides but offers higher specificity. The central theme of this article is the physical chemical limits of specificity, i.e., the conditions that allow the best specificity. We consider mainly, though not exclusively, arrays of probes 20-30 nucleotides long, manufactured by in situ synthesis. These conditions minimize false hybridizations resulting from the slow equilibration that is characteristic of long probes, and avoid competition between surface-bound and solubilized probes.
Typically an array of tens to hundreds of thousands of different pixels, each consisting of a homogeneous set of 1-10 million oligonucleotide probes, is used to determine the expression levels of genes of known sequence. The molecules to be assayed, e.g., cDNA, are hybridized, during a 12-15 h incubation, with probes chosen to be their reverse complements. The most common detection method relies on fluorescence. Usually molecules from the target and reference cells are labeled with red and green dyes respectively; pixels are then scanned at the two distinct wavelengths to determine expression changes. Genes that are up- or downregulated in response to drugs, hormones, or other environmental influences are thus quickly identified. Although micro-array assays are high throughput in the sense that in excess of 10,000 genes at a time are probed, the number of false-positives is high, even for arrays prepared by in situ synthesis. Increased specificity is typically achieved by sacrificing sensitivity: only genes with a pronounced change in expression level, typically in the fifth percentile, are scored as having changed. The screened set, or a select 0 Gene Array Thermodynamics 0 group of the screened set, is then investigated further using traditional methods such as Northern blotting. Increased throughput is generally achieved by increased array density. However, as the above remarks imply, a substantial increase in throughput can be achieved by a well validated, high-specificity system. To increase specificity by rational design procedures, it is helpful to have a clear understanding of the physical limitations of the assay. This includes understanding the conditions that will provide the best specificity, the robustness to deviations from optimal conditions, the relation of optimal conditions to those prevalent in the most common experimental procedures, and strategies for optimization. This article is divided into two broad components: equilibrium and kinetic.
In the first section, we outline the thermodynamics of hybridization. Specificity and sensitivity are maximum when equilibrium has been achieved, but even under this ideal condition the method used to select probes affects the formation of crosshybrids, and thus it affects specificity. Probe selection is a large optimization problem. We discuss this below, and present a new probe selection method. Further below, we use this method to select probes for the full set of yeast genes and compare the specificities obtained at equilibrium where both specificity and sensitivity are maximum. This has particular implications for long probes inasmuch as length substantially reduces the rate at which equilibrium is approached, and consequently increases false-positives if equilibrium is not achieved. 0 melting temperature is easily obtained. Define b as the equilibrium constant for bimolecular nucleation (formation of the first bond) in units of inverse concentration, and let K be the (dimensionless) equilibrium constant for the formation of the remainder of the helix. For a helix with n bases, there will be n-1 stacking interactions. We write the sum of the standard Gibbs free energies for the n-1 stacks as $\Delta H - T\Delta S$, so that the corresponding intramolecular equilibrium constant is $K = e^{-(\Delta H - T\Delta S)/RT}$, where $\Delta H$ and $\Delta S$ are the sums of the standard enthalpies and entropies for base stacking, in accordance with the base sequence. The free energy of the nucleation event also, to some extent, depends on the basepairs that nucleate dimerization. If A is the free strand concentration and B the concentration of hybrids, and we assume the molecules are either fully hybridized or completely separated, then $B = bA^2K$. (1) 0 If $c_T$ is the total strand concentration, then by conservation $c_T = 2B + A$. In addition, at the melting temperature $T_m$ we have by definition $2B = A$.
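As a numerical illustration of the two-state relations just stated (hybrid concentration B = bA²K with K set by the stacking enthalpy and entropy, conservation of strands, and 2B = A at the melting temperature), the sketch below solves for the free strand concentration at each temperature and locates the melting temperature by bisection. The thermodynamic values, the nucleation constant and the strand concentration are hypothetical placeholders chosen only to give a plausible 25-mer-like answer; they are not taken from the article. Setting 2B = A in the relations gives b·cT·K = 1, which the code uses as a cross-check.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)


def stacking_constant(dH, dS, T):
    """K = exp(-(dH - T*dS)/(R*T)) for the summed stacking interactions."""
    return math.exp(-(dH - T * dS) / (R * T))


def hybridized_fraction(dH, dS, b, c_total, T):
    """Fraction of strands in hybrids, 2B/c_total, in the two-state model."""
    a = 2.0 * b * stacking_constant(dH, dS, T)
    # Conservation c_total = 2B + A with B = b*A^2*K gives a*A^2 + A - c_total = 0.
    # The root is written in a form that avoids cancellation when a is tiny.
    free = 2.0 * c_total / (1.0 + math.sqrt(1.0 + 4.0 * a * c_total))
    return 1.0 - free / c_total


def melting_temperature(dH, dS, b, c_total, lo=250.0, hi=450.0):
    """Bisection for the temperature at which half the strands are free."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if hybridized_fraction(dH, dS, b, c_total, mid) > 0.5:
            lo = mid  # still mostly hybridized, so Tm lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical inputs: dH in kcal/mol, dS in kcal/(mol K), b in 1/M, c_total in M.
dH, dS, b, c_total = -180.0, -0.5, 1e-3, 1e-9
tm_numeric = melting_temperature(dH, dS, b, c_total)
tm_closed = dH / (R * math.log(b * c_total) + dS)  # from b*c_total*K = 1
print(round(tm_numeric, 3), round(tm_closed, 3))
```

The bisection and the closed-form value agree, which is a useful sanity check that the conservation relation and the definition of the melting temperature have been combined consistently.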
Substituting these relations in the equation for B, and utilizing the definition of K, we have that $T_m = \dfrac{\Delta H}{R\log(b c_T) + \Delta S}$. (2) 0 The presence of a surface 0 Thermodynamics of hybridization 0 Melting profiles 0 As temperature is increased, an initially fully intact hybrid will gradually destabilize, and at high enough temperature, the strands will separate. Approximately 90% of the transition occurs over a temperature range of ~10-15 degrees for 25-mers, with the range narrowing as length increases. The so-called melting curve, determined under equilibrium conditions, is cooperative and has an inflection point which is referred to as the melting temperature, $T_m$. The melting temperature is defined as the temperature at which half the total number of strands are free (i.e., not hybridized). In general the population of hybridized strands will have a distribution of intact basepairs, and the arrangement of a given number of pairs will also be distributed. The common practice of neglecting partially hybridized states reduces a very complex multistage model to a two state model, eliminates the physical basis for cooperativity, and broadens the melting profile. For short chains, however, it has little effect on the midpoint of the transition, introducing an error that is within the error caused by experimental uncertainty in the stacking free energy. For this two-state model in which partially hybridized states are neglected, a sequence-dependent expression for the 0 The formation of a DNA hybrid consists of a bimolecular nucleation event followed by formation of a double 1 Arnold Vainrub B.
Montgomery Pettitt 0 Surface Electrostatic Effects in Oligonucleotide Microarrays: Control and Optimization of Binding Thermodynamics 0 retical analysis of the surface electrostatic effects,6 which is in accord with recent experiments,7 we describe here the effect of the surface charge density on the melting curve and match/mismatch discrimination ratio for surface hybridization, and predict possible substantial improvements in several properties for microarrays. The surface material, dielectric or metal, 0 Vainrub and Pettitt 0 and the surface electrostatic conditions are shown to be critically important because they strongly determine the yield of the nucleic acid target hybridization to the surface-immobilized oligonucleotide probes. We propose to use these properties for control and enhancement of sensitivity during surface hybridization. In particular, an equal sensitivity of the probes with different base-pair composition may be achieved by adjustment of their specific linker molecule length or the local surface charge. Further, we suggest enhancement of the match/mismatch discrimination by narrowing the melting curve by optimizing the surface charge. Finally, we discuss a new microarray design using hybridization at low salt where the duplex stability is achieved by the positive surface charge. Under these conditions the target's secondary structure is melted, allowing hybridization to most of the target's nucleotides and increasing the sequencing information up to tenfold. 0 RESULTS AND DISCUSSION Statistical Thermodynamics of Hybridization 0 THEORETICAL MODEL AND CALCULATION METHODS 0 where n is the fraction of the hybridized probes in equilibrium, C0 is the concentration of the targets, and G is the molar Gibbs free energy of the probe:target duplex formation. Equation (1) is valid under the condition that the target concentration is constant. 
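Because the displayed form of Eq. (1) is not legible here, the following sketch assumes the usual Langmuir-type expression for the hybridized probe fraction at constant target concentration, namely 1 / (1 + exp(ΔG/RT) / C0), with ΔG the molar free energy of duplex formation. Both the functional form and the numerical values are assumptions made for illustration; the optional `dV` argument is a hypothetical extra free-energy term (for example a surface contribution) that simply adds to ΔG.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)


def hybridized_probe_fraction(c0, dG, T, dV=0.0):
    """Langmuir-type isotherm (assumed form): fraction of probes hybridized.

    c0 : target concentration in solution (M), assumed constant
    dG : molar Gibbs free energy of duplex formation (kcal/mol, negative if stable)
    T  : absolute temperature (K)
    dV : hypothetical additional free-energy term added to dG (kcal/mol)
    """
    return 1.0 / (1.0 + math.exp((dG + dV) / (R * T)) / c0)


# Illustrative values only: a fairly stable duplex at room temperature.
T = 298.0
dG = -15.0
for c0 in (1e-12, 1e-11, 1e-10, 1e-9):
    print(c0, round(hybridized_probe_fraction(c0, dG, T), 4))
```

In this form half of the probes are hybridized when the target concentration equals exp(ΔG/RT), and a negative `dV` shifts the isotherm toward more hybridization at a given concentration.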
For brevity, we omit a straightforward derivation for a general case when targets are depleted because of hybridization. Note that at constant temperature Eq. (1) corresponds to the well-known Langmuir adsorption isotherm equation, which is often used to interpret microarray experiments.3 For discussing the mechanism of the interaction below, we introduce here the interaction Gibbs free energy with the surface for the probe $V_p$, target $V_t$, and duplex $V_d$. This interaction impacts the hybridization equilibrium and therefore the parameters in Eq. (1) in several ways. First, the target concentrations on the surface $C_s$ and in solution $C_0$ vary according to the Boltzmann distribution formula $C_s = C_0 \exp(-V_t/RT)$ (2) 0 Second, the Gibbs free energy differences of the duplex formation on the surface $\Delta G_s$ and in solution $\Delta G$ differ by the change of the interaction energy after and before hybridization, $(V_d - V_p - V_t)$. Thus $\Delta G_s = \Delta G + V_d - V_p - V_t$ (3) 0 Equations (2) and (3) account for the target concentration and duplex binding strength changes near the surface, respectively. Substitution of Eqs. (2) and (3) in Eq. (1) gives the formula $n_s = 1/\{1 + C_0^{-1} \exp[(\Delta G + V_d - V_p)/RT]\}$, (4) 0 Surface Electrostatic Effects 0 which describes the effect of surface interactions on the hybridization equilibrium. This equation differs from Eq. (1) for hybridization in bulk by addition of $(V_d - V_p)$ to the hybrid formation free energy. Hence, if duplex and probe are attracted to the surface ($V_d < 0$ and $V_p < 0$), the stronger attraction of the duplex for the surface, $V_d < V_p$, promotes duplex formation. In contrast, a stronger surface repulsion of the duplex than the probe shifts the hybridization equilibrium toward melting of duplexes into single strand targets and probes. This approach can also be used out of thermodynamic equilibrium when the target's concentration on the surface $C_s$ is determined not by the Boltzmann distribution Eq. (2), but rather by some steady state transport process. The corresponding $C_s$ and Eq.
(3) should be substituted in Eq. (1) to obtain the equilibrium yield of the duplexes in surface hybridization, ns. This is relevant to electronic DNA chips where the assayed nucleic acid is transported by electrokinetic drag13,14 and flow-through biochips.15 0 Surface Electrostatic Interaction 0 In order to evaluate the hybridization with the surface-tethered probes, one needs to know the probe Vp and duplex Vd interaction energies in Eq. (4). Recently, we calculated the oligonucleotide-surface interaction in electrolyte solution.6 We assumed the electrostatic interaction to be dominant since in microarray applications typically the oligonucleotide is tethered to the surface through a sufficiently long linker molecule, making the short-range van der Waals forces weak and therefore their effect small. The electrostatic Gibbs free energy was shown to be a sum of two components, V1 and V2. As depicted in Figure 1, V1 corresponds to the direct electrostatic interaction with the surface charge and is attractive (repulsive) for the positively (negatively) charged surface because of the negative charge of the nucleic acid target. V2 is the target's electrostatic free e 0 BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data 1 By ANNE-METTE K. HEIN 0 Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London W2 1PG, UK 0 Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London W2 1PG, UK 1 HELEN C. CAUSTON 0 Microarray Centre, MRC Clinical Sciences Centre, Imperial College, Hammersmith Hospital, London W12 0NN, UK 1 GRAEME K. AMBLER and PETER J. GREEN 0 Some key words: Bayesian, Affymetrix, GeneChip, probe-level analysis, gene expression, differential expression, MCMC 0 Introduction Microarrays are one of the new technologies that have developed in line with the sequencing of the human and other genomes and developments in miniaturization and robotics. They permit 0 A.K. Hein et al.
0 the expression profiles of tens of thousands of genes to be measured in a single experiment and promise to revolutionize the biomedical and life sciences. This is partly because the gene expression profiles obtained form a `signature' -- a molecular phenotype -- that can be used to characterize the type, age, disease state and growth conditions of an organism. Affymetrix are one of the leading manufacturers of microarrays (Affymetrix gene expression arrays are also referred to as `GeneChips') and these are widely used. They differ from many other array types in that a single labelled extract is hybridized to each array and because they contain multiple `match' and `mismatch' sequences for each transcript. This presents particular challenges for low-level data analysis including the integration of data from the multiple probes representing each transcript on an array to provide a measure that represents gene expression and its inherent uncertainty, and the bringing into par (`normalization') of data from different arrays. 0 Affymetrix Oligonucleotide arrays The oligonucleotide array technology exploits two fundamental biological properties: (a) mRNA is an intermediate product between genes encoded in DNA and their protein products, so mRNA abundance can be used as a measure of gene expression, and (b) single stranded RNA molecules have a high affinity to form double stranded structures. Pairing between RNA strands is highly specific and complementary strands have particularly high binding affinities. Oligonucleotide arrays contain hundreds of thousands of features. A feature is a small rectangular area, containing a large number of identical oligonucleotides. In general, a different oligonucleotide sequence is represented at each feature. The features on oligonucleotide arrays are referred to as probes. 
A measure of the abundance of a particular transcript RNA in a biological sample can be obtained by going through the following procedure: isolating RNA, making a labelled representation of it, fragmenting the sample, hybridizing the labelled, fragmented RNA to an array, washing off the material that has not hybridized and scanning the array to obtain fluorescence intensities at each probe (Schena et al., 1995). The abundance of a transcript is related to the intensity measured at the features representing the complementary RNA sequence. On GeneChip arrays oligonucleotides of length 25 are used. However, many genes are similar, sharing common motifs or subsequences, and cannot, in general, be uniquely identified by a single sequence of length 25. Therefore each gene is represented by a probe set, consisting of a number of probe pairs. A probe pair consists of a perfect match probe (PM) and a mismatch probe (MM). At each perfect match probe, an oligonucleotide which perfectly matches part of the transcript is represented. The detection of transcripts at the PMs of a probe set indicates that the gene is expressed, and the level of detection indicates the degree of expression. However, although complementary RNA sequences have particularly high affinities, sequences that are complementary over only part of the length of the sequence, or shorter sequence fragments, may also hybridize. We refer to the hybridization of non-complementary transcripts to the probes as non-specific hybridization. This is the motivation for including MM probes. The oligonucleotides represented at an MM probe are identical to those at the corresponding PM probe, except that the middle nucleotide is that of the complementary base. The intention is that, since PM and MM probes are almost identical, equal amounts of non-specific hybridization will occur at these probes. 
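The relationship between a perfect match probe and its mismatch partner described above (identical 25-mers except that the middle nucleotide is replaced by its complementary base) can be sketched in a few lines. The example sequence is made up, and the DNA alphabet with A/T and C/G pairing is an assumption of this sketch; the function name is hypothetical.

```python
# Complementary base for each nucleotide (DNA alphabet assumed for the probes).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
PROBE_LENGTH = 25


def mismatch_probe(pm):
    """Return the MM sequence for a 25-base PM sequence.

    The MM probe is identical to the PM probe except at the middle position
    (index 12, i.e. the 13th base), which is replaced by its complement.
    """
    if len(pm) != PROBE_LENGTH:
        raise ValueError("expected a probe of length 25")
    middle = PROBE_LENGTH // 2
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]


# Made-up example sequence, for illustration only.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
differing = [i for i in range(PROBE_LENGTH) if pm[i] != mm[i]]
print(pm)
print(mm)
print(differing)
```

Because only one of 25 positions differs, the two probes are expected to pick up similar amounts of non-specific signal, which is the rationale given in the text for pairing them.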
Excess hybridization to the PM probe, relative to the MM probe will be due to specific hybridization, that is, the hybridization of complementary transcripts. A probe set for a gene typically consists of 11-20 PM and MM probe pairs, and these represent the information available about the expression of the gene. 0 BGX: a new gene expression index 1.2. Gene expression experiments and analysis 0 The generation of gene expression data is a multi-step process, and variability (from different sources) may be introduced at a number of experimental stages. The variability of interest is that of biological origin, e.g., variability in gene expression between experimental conditions, individuals or tissue types. Variability of non-biological origin may arise due to differences in the preparation of the biological samples to be hybridized, in the manufacture of the arrays, or in the process of scanning the arrays (see Hartemink et al. (2001) for a more detailed discussion). The replicability of raw gene expression data is low and gene expression data is notoriously noisy. This can be clearly demonstrated by hybridizing two technical replicates of the same biological sample on two arrays. The intensities obtained will often be found to differ (Figure 1). FIGURE 1 ABOUT HERE The analysis of gene expression data is usually treated as a multi-step process. The individual steps often consist of correcting the intensities for background noise, estimation of gene expression indices, normalization between samples, assessment of which genes are differentially expressed and clustering of genes or conditions with similar expression profiles or patterns. The focus of this paper is on the steps leading to the estimation of gene expression and on detection of differential expression. A drawback of splitting up the analysis of gene expression data into separate steps that are dealt with independently is that the error associated with each step is ignored in the downstream analysis. 
In assessing differential expression, it is clearly of interest to know how reliable the expression index of a gene is. In turn, in the estimation of the gene expression index, it is of interest to quantify the variability in the background corrected intensities, on which the estimation is based. A primary aim of the work presented here is to develop a statistically coherent framework for the analysis of Affymetrix GeneChip arrays, in which the splitting up of the analysis into separate steps is avoided. 1.3. Bayesian hierarchical modelling of Affymetrix gene expression data In this paper we present Bayesian hierarchical models for the analysis of gene expression data, where all steps in the process, and thus the associated errors, are modelled simultaneously. For clarity, we first set out a model for estimating the expression of genes using data obtained from a single array. In the model, background correction for non-specific hybridization and calculation of gene expression indices are considered simultaneously. We base the inference on the full posterior distributions for the parameters, so that, in addition to point estimates of gene expression levels we obtain their credibility intervals. Next, we extend the model to encompass the more commonly encountered situation, in which different experimental conditions are considered, and where replicate arrays may be available under some or all of the conditions. Here all information is used simultaneously to make the relevant inferences: where replicate arrays are available, measures of the expression of genes are obtained from a simultaneous consideration of the probe sets for the genes on the arrays. 
When experimental conditions are compared it is often of interest to identify genes that are differentially expressed, and to rank the genes according to their degre 0 Short Technical Reports 0 Analysis of DNA Microarrays by Non-Destructive Fluorescent Staining Using SYBR® Green II 0 ABSTRACT A simple, non-destructive procedure is described to determine the quality of DNA arrays before they are used. It consists of a preliminary staining step of the DNA microarray by using SYBR® Green II, a fluorophore with specific affinity for ssDNA, followed by a laser scan analysis. The surface quality, integrity and homogeneity of each DNA spot of the array can thus be assessed. After this preliminary control, which may avoid further analytical steps that lead to the waste of precious biological samples, a fully reversible staining procedure is performed that produces an array ready for subsequent use. 0 INTRODUCTION The use of microarrays is growing exponentially (5). The technology consists of dense arrays of DNA spots deposited on suitably prepared surfaces, mainly glass. Several formats have been 0 BioTechniques 0 plate and primers used. A portion of each PCR amplification product (5 µL) was examined by agarose gel electrophoresis, followed by ethidium bromide staining. Only PCR products showing a clear and strong band on UV transillumination were recovered by ethanol precipitation and resuspension in 15 µL 3x standard saline citrate (SSC) (450 mM NaCl, 45 mM sodium citrate, pH 7.0). The DNA concentration was determined using PicoGreen® reagent (Molecular Probes), a fluorescent nucleic acid stain useful for quantitating dsDNA in solution. The final concentration of DNA averaged 50 ng/µL. Samples were transferred into 96-well plates, which were sealed and stored at -20°C until used.
Preparation of Polylysine-Coated Glass Slides Standard glass microscope slides (Sigma Aldrich) were pre-cleaned by immersion for at least 2 h in an alkaline wash solution consisting of 10% (w/v) NaOH and 57% (v/v) ethanol, followed by rinsing five times in double-distilled 0 water. The slides were then gently shaken for 1 h in a coating solution consisting of 35 mL Poly-L-Lysine (Sigma Aldrich; 0.1% w/v in water), 35 mL filtered PBS and 280 mL double-distilled water. Coated slides were extensively washed with double-distilled water, centrifuged at low speed (80x g), dried in a vacuum drying oven at 45°C for 10 min and then stored at room temperature in a tightly sealed slide box. Slides were used after at least two weeks to produce a sufficiently hydrophobic surface. This aging process is a key step in obtaining a suitable surface for array preparation. Printing of DNA Microarrays Target DNA samples in 3x SSC were spotted on the glass slides using a piezoelectric pipet (Nanoplotter System™, Gesim GmbH, Germany). The pipet was programmed to release about 10 nL DNA solution for each DNA spot. Spots were arrayed in a 20 x 20 arrangement (400 spots in a 1.5 x 1.5-cm square with a center-to-center spacing between spots of approximately 750 µm) or a 30 x 30 arrangement (900 spots in a 1.5 x 1.5-cm square with a center-to-center spacing of 500 µm). After deposition, arrayed DNA spots were completely dried by overnight incubation at room temperature in a covered box. Printed slides were rehydrated (DNA side down) in a plastic humid chamber (Sigma Aldrich) until spots glistened and then snap-dried at 100°C.
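The printing figures quoted above can be checked with a little arithmetic. The sketch below uses only numbers stated in the protocol (spot counts, center-to-center spacing, about 10 nL per spot, and an average DNA concentration of 50 ng/µL); the function names are hypothetical, and the per-spot DNA mass is a derived estimate rather than a value reported in the text.

```python
def grid_footprint_um(spots_per_side, pitch_um):
    """Return (span between outermost spot centers, pitch-based footprint) in µm."""
    center_span = (spots_per_side - 1) * pitch_um
    footprint = spots_per_side * pitch_um
    return center_span, footprint


def dna_per_spot_ng(volume_nl, conc_ng_per_ul):
    """DNA mass deposited per spot: volume (nL) x concentration (ng/µL) / 1000."""
    return volume_nl * conc_ng_per_ul / 1000


SIDE_UM = 15000  # 1.5 cm expressed in micrometres

for n, pitch in ((20, 750), (30, 500)):
    span, footprint = grid_footprint_um(n, pitch)
    print(n * n, "spots:", span, "µm between outer centers,", footprint, "µm footprint")

print(dna_per_spot_ng(10, 50), "ng DNA per spot")   # derived estimate
print(400 * 10 / 1000, "µL total for the 20 x 20 array")
```

Both layouts give a pitch-based footprint of exactly 1.5 cm per side, which is consistent with the stated square size, and roughly 0.5 ng of DNA lands in each spot under these assumptions.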
0 BMC Bioinformatics 0 Methodology article 0 BioMed Central 0 Open Access 0 In silico microdissection of microarray data from heterogeneous cell populations 1 Harri Laehdesmaeki1, Ilya Shmulevich2, Valerie Dunmire2, Olli Yli-Harja1 and Wei Zhang*2 0 Background: Very few analytical approaches have been reported to resolve the variability in microarray measurements stemming from sample heterogeneity. For example, tissue samples used in cancer studies are usually contaminated with the surrounding or infiltrating cell types. This heterogeneity in the sample preparation hinders further statistical analysis, significantly so if different samples contain different proportions of these cell types. Thus, sample heterogeneity can result in the identification of differentially expressed genes that may be unrelated to the biological question being studied. Similarly, irrelevant gene combinations can be discovered in the case of gene expression based classification. Results: We propose a computational framework for removing the effects of sample heterogeneity by "microdissecting" microarray data in silico. The computational method provides estimates of the expression values of the pure (non-heterogeneous) cell samples. The inversion of the sample heterogeneity can be facilitated by providing accurate estimates of the mixing percentages of different cell types in each measurement. For those cases where no such information is available, we develop an optimization-based method for joint estimation of the mixing percentages and the expression values of the pure cell samples. We also consider the problem of selecting the correct number of cell types. Conclusion: The efficiency of the proposed methods is illustrated by applying them to carefully controlled cDNA microarray data obtained from heterogeneous samples.
The results demonstrate that the methods are capable of reconstructing both the sample and cell type specific expression values from heterogeneous mixtures and that the mixing percentages of different cell types can also be estimated. Furthermore, a general purpose model selection method can be used to select the correct number of cell types.

Table 3: The measured mixing percentages (RKO/normal) in the five heterogeneous samples.
sample #1: RKO 100, normal 0

Recent developments in high-throughput genomic technologies have revolutionized the approaches aimed at understanding biological systems and emphasized the need for computational and systems biology research. Microarray analysis, for instance, can provide massive amounts of information about a biological sample by simultaneously measuring thousands of transcript levels. Application of such methodologies has already yielded important molecular insight into cellular phenotypes under various experimental conditions [1] and provided new knowledge about the development and treatment of human diseases, such as cancers [2-4]. During the last several years, microarray technology has undergone continued improvement with better quality control in the overall measurement process, ranging from hybridization conditions to image processing techniques [5]. Nevertheless, to fully harness the power of the microarray technology to study biological materials such as cancer tissues, one has to deal with a source of measurement variability that comes from the biological materials themselves, which rarely consist of homogeneous cell populations. For example, except for a few types of immune-privileged tissues such as the brain, most solid tumor tissues contain infiltrating lymphocytes as a result of the immune response.
Most tumor tissues also contain endothelial cells as part of the necessary vasculature systems that provide nutrients for the tumor cells. The complexity of this problem is that different tumor tissues contain different proportions of these non-tumor cells. Therefore, if tumor tissues are used without consideration of such a mixing phenomenon, measurement of differential gene expression will certainly be confounded by the heterogeneous cell populations. In some studies [6], pathologists carefully evaluated the tissues and only selected tissues with more than a certain percentage of tumor cells. This prescreening step, however, results in the exclusion of many tumor tissues from the study and contributes to the small sample size problem in some of the studies. Alternatively, laser capture microdissection (LCM) technology can be used to purify the tumor cells from mixed populations [7]. This approach has been very successful in DNA-based studies because of the relatively high stability of DNA. However, for microarray studies, which require less stable RNA, LCM has seen limited success because it is much more challenging to maintain RNA stability during the microdissection process. Other drawbacks of LCM are that such procedures are time-consuming and yield insufficient quantities of RNA, thus requiring multiple amplification steps that may confound quantitative inferences from gene expression data.

A recent paper by Ghosh [8] introduced a mixture model based framework for determining differential expression in the presence of mixed cell populations. In this study, we aim at reconstructing the actual expression values of the pure cell types from the heterogeneous mixtures. That is, we develop a computational method for removing the effect of mixing from heterogeneous samples and for microdissecting microarray data in silico. Similar analytical approaches have been previously proposed by Lu et al. [9], Stuart et al. [10] and Venet et al. [11]. Lu et al.
focused on estimating the fraction of cells in different phases of the cell cycle whereas Stuart et al. considered the problem of estimating the cell type specific expression patterns over all samples. Here we focus on estimating both the sample and cell type specific expression values using carefully controlled microarray experiments. The inversion of the 'cell mixing effect' can be made appreciably easier by providing estimates of the mixing percentages of different cell types in each measurement, which can be measured by an experienced pathologist. The entire process does not hinge upon such measurements, however, as the mixing percentages can be estimated within the modeling framework. Venet et al. [11] introduced some preliminary methods and results for tackling the same problem as we consider here. In particular, they used a regression-based framework similar to that in [10] and to ours. We also consider the problem of selecting the correct number of cell types using the cross-validation model selection framework.

The microarray data to which we apply our computational methods consist of five different heterogeneous mixtures of lymph node and colon cancer samples, hereafter abbreviated as normal and RKO, respectively. For more details, see the Materials and methods section. Each heterogeneous mixture consists of different fractions of different cell samples; see Table 3.

Inversion of sample heterogeneity

The first goal is to invert the mixing effect caused by sample heterogeneity. We apply the linear model developed in the Materials and methods section to the heterogeneous microarray data. The obtained results are presented below.

The PCA plot clearly shows that the heterogeneous samples ('m1' through 'm5') are located almost on a straight line in the 2-dimensional PCA space.
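To make the idea of inverting a linear mixing effect concrete, here is a minimal sketch for two cell types with known mixing fractions. It is not the authors' implementation: it assumes each measured expression value is the fraction-weighted sum of two pure profiles, solves the resulting least-squares problem gene by gene through the 2 × 2 normal equations, and uses made-up fractions and expression values throughout.

```python
# Sketch of inverting a two-cell-type linear mixing model by least squares.
# Assumption: measured expression = fraction-weighted sum of pure profiles.
# The mixing fractions and expression values below are made-up toy numbers.

def unmix_two_types(fractions, mixtures):
    """Estimate the two pure expression profiles for every gene.

    fractions: list of (a_i, b_i) mixing fractions, one pair per sample.
    mixtures:  list of per-sample expression lists (samples x genes).
    Solves the 2x2 normal equations (A^T A) s = A^T x gene by gene.
    """
    saa = sum(a * a for a, _ in fractions)
    sab = sum(a * b for a, b in fractions)
    sbb = sum(b * b for _, b in fractions)
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("mixing fractions do not separate the two cell types")
    n_genes = len(mixtures[0])
    pure_a, pure_b = [], []
    for g in range(n_genes):
        ra = sum(a * row[g] for (a, _), row in zip(fractions, mixtures))
        rb = sum(b * row[g] for (_, b), row in zip(fractions, mixtures))
        pure_a.append((sbb * ra - sab * rb) / det)
        pure_b.append((saa * rb - sab * ra) / det)
    return pure_a, pure_b

# Toy check: build noiseless mixtures from known profiles, then recover them.
true_a = [8.0, 2.0, 5.0]   # hypothetical pure profile of cell type A
true_b = [1.0, 6.0, 5.0]   # hypothetical pure profile of cell type B
fractions = [(1.0, 0.0), (0.75, 0.25), (0.5, 0.5), (0.25, 0.75), (0.0, 1.0)]
mixtures = [[a * x + b * y for x, y in zip(true_a, true_b)]
            for a, b in fractions]

est_a, est_b = unmix_two_types(fractions, mixtures)
print([round(v, 6) for v in est_a])
print([round(v, 6) for v in est_b])
```

With noiseless toy data the estimates reproduce the input profiles up to rounding; with real measurements the same equations give the least-squares fit, and the estimate degrades as fewer or less diverse mixtures are available.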
Furthermore, the line on which the heterogeneous samples lie is parallel to the first principal component, suggesting that the most significant variation in the data is due to the linear mixing effect. The estimated expression profiles of the pure colon cancer cells and lymphocytes are close to samples #1 and #5, respectively, indicating that the inversion of the mixing phenomenon produces reasonable results. The results are more easily appreciated when only the most significant PCA component is shown. As discussed above, the variation in the most significant PCA component is due to the mixing effect. The results in Figure 2 (a) are as in Figure 1, but now shown in one dimension in order to facilitate the interpretation. Results in Figure 2 (b), in turn, are as in Figure 2 (a) except that the inversion was done using only the samples #2, #3, and #4. This represents a more difficult and realistic case, since fewer mixtures are available. When comparing Figure 2 (a) with Figure 2 (b), one can conclude that the method performs slightly better when more samples are used to estimate the true expression profiles, a result that was expected. Overall performance, however, is good in both cases. The est

BMC Bioinformatics
BioMed Central
Open Access

ProbeMaker: an extensible framework for design of sets of oligonucleotide probes

Johan Stenberg*, Mats Nilsson and Ulf Landegren

Background: Procedures for genetic analyses based on oligonucleotide probes are powerful tools that can allow highly parallel investigations of genetic material. Such procedures require the design of large sets of probes using application-specific design constraints.

Results: ProbeMaker is a software framework for computer-assisted design and analysis of sets of oligonucleotide probe sequences. The tool assists in the design of probes for sets of target sequences, incorporating sequence motifs for purposes such as amplification, visualization, or identification.
An extension system allows the framework to be equipped with application-specific components for evaluation of probe sequences, and provides the possibility to include support for importing sequence data from a variety of file formats.

Conclusion: ProbeMaker is a suitable tool for many different oligonucleotide design and analysis tasks, including the design of probe sets for various types of parallel genetic analyses, experimental validation of design parameters, and in silico testing of probe sequence evaluation algorithms.

Increasing numbers of methods are being developed for parallel nucleic acid analyses for different purposes. Many of these methods employ sets of oligonucleotide probes or probe pairs that hybridize to the sequences targeted for analysis, allowing the probe sequences to be acted upon by one or more enzymes, creating new molecular species that reflect the presence or nature of the different target sequences. The reaction products generally contain identifying sequences or other features that allow the separation of signals originating from different targets. This is the case in methods such as the multiplex oligonucleotide ligation assay (OLA) [1], the multiplex ligation-dependent probe amplification assay (MLPA) [2], the RNA- and cDNA-mediated annealing, selection, extension and ligation assays (RASL, DASL) [3,4], the GoldenGate genotyping assay [5], multiplex minisequencing [6], and the padlock or molecular inversion probe assay [7,8]. The latter method has been used to genotype more than 10,000 single nucleotide polymorphisms (SNPs) in multiplex. Another method that utilizes sets of oligonucleotide probes for multiplex processing of nucleic acid molecules is the selector amplification technique.
This technique uses partially double-stranded oligonucleotides, called selectors, to circularize a selection of restriction fragments from total genomic DNA, and it incorporates a general sequence motif that allows parallel amplification of all circularized fragments using a single primer pair [9].

With molecular solutions to many tasks of highly parallel genetic analysis now at hand, other factors become limiting, such as the design and the synthesis of reagents. In the work presented here, we address the problem of large-scale probe design. When large numbers of probes are combined, the risk of unintended interactions between probes and targets must be considered. This risk places strict requirements on the design of sets of probes to be used together. In particular, it is important that probes do not contain sequences that result in the production of detectable signal from any probe in the absence of its cognate target molecule, or that otherwise interfere with the activity of other probes in the set. Due to these and other constraints and the many possible alternative probe sequences to evaluate, the difficulty of designing probe sets increases rapidly with the size of the probe sets.

Many computer programs exist for the design of oligonucleotide probes such as PCR primers [10-12], microarray probes [13,14], and more [15]. These programs define algorithms to evaluate the risk of primer or probe sequences being involved in undesired interactions such as probe homo- or heterodimer formation, cross-hybridization, false priming, etc. However, the available programs are generally limited in scope, and are not applicable to the task of designing sets of complex probes containing multiple sequence elements. The ProbeMaker software presented herein is a framework for computer-assisted design and analysis of sets of oligonucleotide probe sequences composed of several functional sequence elements.
As the composition of probes and the constraints imposed on sets of probes vary between applications, this framework has been constructed to support the design of different types of probes using application-specific constraints, as defined by the user. ProbeMaker takes as input a set of target sequences and a number of sets of so-called 'tag' sequences. These tag sequences may serve as targets for restriction digestion, as binding sites for amplification primers or fluorescent detection probes, or as identification codes for individual amplification products that are decoded by hybridization to oligonucleotide arrays [16]. Probes are designed for each target by construction of target-specific sequences and addition of tag sequences according to rules specified by the user. Different combinations of sequence elements are evaluated for each probe, and a set of probe sequences is created that satisfies user-defined criteria.

The main objectives in the development of ProbeMaker were to provide a framework that is flexible, in the sense that it should support design of oligonucleotide probes for different purposes, and extensible, in that it should be possible to add support for designing new types of probes and to add new types of design constraints. Furthermore, the software should be adaptable to new applications, and it should have the potential to import sequence data from a variety of sources.

The flexibility is provided by the target and probe sequence data structures used. Each target defines two template sequences that are used to construct target-specific sequences (TSSs) to use in the corresponding probe. Each probe is made up of two such TSSs and a number of tag sequences, which may be located 5' of, between, or 3' of the TSSs. As TSSs may be of zero length, this system allows the design of many different types of probes. Support for more than two TSSs per probe was not deemed necessary as this is not used in any current methods. Furthermore, targets may be grouped, allowing the program to perform selection of tag sequences based on the relations of target sequences, for example variants of the same polymorphic sequence.

The extensibility is realized by using an extension mechanism for much of the functionality. Extensions are constructed in the form of Java classes that implement defined interfaces and may be loaded into the framework at run-time. This mechanism allows the addition of new target types and support for different formats for sequence input and output, as well as design constraints and acceptor schemes, the function of which will be described below.

ProbeMaker may be run through a graphical user interface or from the command line. For the graphical user interface, a set of target sequences and sets of tag sequences are provided as input by the user. Application-specific parameters for probe design and evaluation are set through the user interface. When running ProbeMaker from the command line, a project file defining all sequences and parameters is used as input.

The potential for supporting different file formats is provided by using the sequence input system of the MolTools Java library [17]. A combination of components for sequence file parsing, sequence notation conversion, and post-import modifications is used to allow creation of sets of any type of target from a variety of sequence file formats, with the possibility to carry out other operations on the imported data, such as selecting which position within the target sequence to design probes for, or grouping or sorting sequences based on some particular property.

For a given set of targets, and a number of sets of tag sequences, ProbeMaker performs two tasks (Figure 1A). Firstly, TSSs are constructed for each target as determined by the target type in use, forming the basis for a probe for that target. Secondly, tag sequences are added to each probe sequentially in a pattern specified by the user.
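The probe layout described above (two TSSs, either of which may be empty, with tags placed 5' of, between, or 3' of them) can be illustrated with a small data structure. This is a hypothetical sketch in Python, not ProbeMaker's API: ProbeMaker itself is a Java framework, and every class, field and function name below, as well as the GC-content check, is invented for illustration.

```python
# Illustrative data structure for a probe built from two target-specific
# sequences (TSSs) plus tag sequences. This is NOT ProbeMaker's API; all
# names and sequences here are hypothetical.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Probe:
    tss5: str                                               # 5' TSS (may be empty)
    tss3: str                                               # 3' TSS (may be empty)
    tags_5prime: List[str] = field(default_factory=list)    # tags 5' of both TSSs
    tags_between: List[str] = field(default_factory=list)   # tags between the TSSs
    tags_3prime: List[str] = field(default_factory=list)    # tags 3' of both TSSs

    def sequence(self) -> str:
        """Concatenate all elements in 5'->3' order."""
        parts = [*self.tags_5prime, self.tss5, *self.tags_between,
                 self.tss3, *self.tags_3prime]
        return "".join(parts)


def gc_fraction(seq: str) -> float:
    """Toy stand-in for a design constraint: fraction of G/C bases."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


# Layout with both TSSs at the ends and two tags in between.
padlock = Probe(tss5="ACGTACGTAC", tss3="TTGACCGGTA",
                tags_between=["GGATCC", "AACCGGTT"])
# Zero-length 5' TSS: a primer-like probe carrying a single 5' tag.
primer = Probe(tss5="", tss3="CAGTCAGTCA", tags_5prime=["GCGTAATACG"])

for probe in (padlock, primer):
    seq = probe.sequence()
    print(len(seq), round(gc_fraction(seq), 2), seq)
```

Allowing empty TSSs in one structure, rather than defining a separate class per probe type, mirrors the flexibility argument made in the text: the same container can describe quite different probe designs.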
BMC Genomics
Research article
BioMed Central
Open Access

A generic approach for the design of whole-genome oligoarrays, validated for genomotyping, deletion mapping and gene expression analysis on Staphylococcus aureus

Yvan Charbonnier*1,2, Brian Gettler1,2, Patrice Francois1, Manuela Bento1, Adriana Renzoni3, Pierre Vaudaux3, Werner Schlegel2 and Jacques Schrenzel1,4

Background: DNA microarray technology is widely used to determine the expression levels of thousands of genes in a single experiment, for a broad range of organisms. Optimal design of immobilized nucleic acids has a direct impact on the reliability of microarray results. However, despite small genome size and complexity, prokaryotic organisms are not frequently studied to validate selected bioinformatics approaches. Relying on parameters shown to affect the hybridization of nucleic acids, we designed freely available software and experimentally validated its performance on the bacterial pathogen Staphylococcus aureus.

Results: We describe an efficient procedure for selecting 40-60 mer oligonucleotide probes combining optimal thermodynamic properties with high target specificity, suitable for genomic studies of microbial species. The algorithm for filtering probes from extensive oligonucleotide libraries fitting standard thermodynamic criteria includes positional information of predicted target-probe binding regions. This algorithm efficiently selected probes recognizing homologous gene targets across three different sequenced genomes of Staphylococcus aureus. BLAST analysis of the final selection of 5,427 probes yielded >97%, 93%, and 81% of Staphylococcus aureus genome coverage in strains N315, Mu50, and COL, respectively.
A manufactured oligoarray including a subset of control Escherichia coli probes was validated for applications in the fields of comparative genomics and molecular epidemiology, mapping of deletion mutations and transcription profiling.

Conclusion: This generic chip-design process merging sequence information from several related genomes improves genome coverage even in conserved regions.

Current hybridization technologies allow assaying thousands of nucleic acid sequences in a single reaction on a solid substrate. Such massively parallel systems offer unprecedented opportunities for basic research and diagnostic applications, including gene sequencing [1], detection of genetic polymorphisms [2], genome-composition analysis [3,4] and measurement of gene expression profiles in prokaryotes [5,6] or cancer cells [7]. Oligonucleotide probes (up to 70-mer) offer more flexibility than cDNA probes since they can be tailored according to optimal in silico physico-chemical and specificity properties, and applied to any sequence data. Early probe design software identified sets of probes sharing homogeneous thermodynamic properties for probe-target hybridization [8]. More elaborate software tools include cross-homology testing of probes against a reference database by BLAST (Basic Local Alignment Search Tool) [9,10] or incorporate prediction of secondary structures into the thermodynamically based approach [11-14]. A frequent drawback of some of these algorithms is that they yield an excessive number of unprocessed BLAST outputs, which complicates final selection of the most specific probes. Furthermore, these approaches do not take into consideration probe interaction with the microarray surface, in particular the impact of the position of mismatches between the target and probes, as shown by Hughes et al. [15].
Designing reliable oligonucleotide probes with available software is quite difficult for bacterial genomes with low GC content [16], low complexity in sequence composition, or frequent conserved repeats leading to erroneous target identification by cross-hybridization. The reported method (OliCheck) implements an algorithm for filtering oligonucleotide probe libraries sharing homogeneous thermodynamic properties by using positional information of predicted target-probe binding regions. An additional characteristic of OliCheck is that it annotates probes recognizing highly conserved targets shared by different genomes. Staphylococcus aureus (S. aureus) was selected as a model organism for implementing and experimentally validating this approach. The choice of this clinically important pathogen for fundamental and applied genomic studies is prompted by the availability of several fully or partially sequenced strain genomes [16-18]. A set of feature elements was designed by OliCheck to yield extensive S. aureus genome coverage. This S. aureus specific probe set together with control probes was used to manufacture an oligoarray that was extensively validated for comparative genomics, molecular epidemiology, mapping of deletion mutations, and transcription profiling applications. The specificity, signal-response linearity, and influence of hybridization temperatures for transcript profiling are also described.

Further genomic oligoarrays of several distinct microbial species have been successfully designed using this generic methodological approach.

In silico properties of the S. aureus oligoarray and manufacturing of StaphChip

The final set of 5,335 S. aureus OliCheck-filtered probes recognized 97.5, 93.0, and 81.0% of N315, Mu50, and COL ORFs, respectively.
The low residual percentage of

[Figure: Steps A-D. Panel labels: N315 (2,593 ORFs); BLAST probes; Hybridization intensities prediction (%); Surface end; Solution end; Probe A; Probe B.]

BMC Genomics 2002, 3
BioMed Central
Methodology article
Open Access

Optimization and evaluation of T7 based RNA linear amplification protocols for cDNA microarray analysis

Hongjuan Zhao1, Trevor Hastie2, Michael L Whitfield3, Anne-Lise Borresen-Dale4 and Stefanie S Jeffrey*1

Background: T7 based linear amplification of RNA is used to obtain sufficient antisense RNA for microarray expression profiling. We optimized and systematically evaluated the fidelity and reproducibility of different amplification protocols using total RNA obtained from primary human breast carcinomas and high-density cDNA microarrays.

Results: Using an optimized protocol, the average correlation coefficient of gene expression of 11,123 cDNA clones between amplified and unamplified samples is 0.82 (0.85 when a virtual array was created using repeatedly amplified samples to minimize experimental variation). Fewer than 4% of genes show changes in expression level of 2-fold or greater after amplification compared to unamplified samples. Most changes due to amplification are not systematic, either within one tumor sample or between different tumors. Amplification appears to dampen the variation of gene expression for some genes when compared to unamplified poly(A)+ RNA. The reproducibility between repeatedly amplified samples is 0.97 when performed on the same day, but drops to 0.90 when performed weeks apart. The fidelity and reproducibility of amplification are not affected by decreasing the amount of input total RNA in the 0.3-3 µg range.
Adding template-switching primer, DNA ligase, or column purification of double-stranded cDNA does not improve the fidelity of amplification. The correlation coefficient between amplified and unamplified samples is higher when total RNA is used as template for both experimental and reference RNA amplification.

Conclusion: T7 based linear amplification reproducibly generates amplified RNA that closely approximates the original sample for gene expression profiling using cDNA microarrays.

Gene expression profiling using complementary DNA (cDNA) microarrays is being applied for multiple purposes such as defining the taxonomy of different molecular subtypes of human breast and other cancers [1-10] and discovering biomarkers and therapeutic targets [11,12]. A limitation of the use of this technology is that small specimens of human tissue, such as obtained by core needle or fine needle aspiration (FNA) biopsies, may not be sufficient for microarray hybridization using direct labelling protocols. Typical microarray labelling procedures require 2-4 µg poly(A)+ RNA or 25-50 µg total RNA per cDNA microarray. This amount of poly(A)+ RNA or total RNA can be obtained from samples of human tissue that weigh greater than 50-100 mg. However, core needle biopsies of breast cancers, for example, weigh in the 10-25 mg range and yield only 3-15 µg of total RNA. Small tumors identified using early detection strategies may thus be too small to excise a specimen with enough RNA for microarray analysis. A pilot study by Assersohn et al. [13] showed that only 15% of FNA samples from human breast cancers produced sufficient mRNA for expression array analysis. One approach to low specimen RNA input has been to use indirect labelling techniques to increase fluorescence signal intensity, such as with aminoallyl nucleotides.
Although these techniques are less expensive, we and other colleagues have found that they are not always as reliable as direct labelling methods. For valuable tumor specimens, reliability is paramount. A very recent report used amino C6dT-modified random hexamers to prime cDNA synthesis in conjunction with aminoallyl-dUTP and increased fluorescence intensity enough that as little as 1 µg of total RNA from cell lines gave sufficient signal for cDNA microarray hybridization [14]. The reliability of this method with human tumor specimens warrants further testing.

RNA amplification techniques have been developed to address the need for sufficient RNA from tiny specimens for microarray hybridization. Other examples of specimens requiring amplification for genome-wide characterization of gene expression include purified populations of cells obtained by flow cytometry, laser capture microdissection, breast ductal or bronchial lavage, or microendoscopy. Although one group has used unamplified total RNA extracted from ~2 × 10^4 microdissected cells for hybridization on 5000 clone membrane-based arrays [15], most groups perform RNA amplification for this purpose [16-18], especially when using high-density slide-based arrays.

The most commonly used mechanism for RNA amplification is a T7 based linear amplification method first developed by Van Gelder, Eberwine and coworkers [19-21]. This method utilizes a synthetic oligo(dT) primer containing the phage T7 RNA polymerase promoter to prime synthesis of first strand cDNA by reverse transcription of the poly(A)+ RNA component of total RNA. Second strand cDNA is synthesized by degrading the poly(A)+ RNA strand with RNase H, followed by second strand synthesis with E. coli DNA polymerase I.
Amplified antisense RNA (aRNA) is obtained from in vitro transcription of the double-stranded cDNA (ds cDNA) template using T7 RNA

Table 1: Correlation coefficients of amplified and unamplified expression levels of 14,044 genes selected according to the described criteria. Amplifications with or without TS primer and with two different ds cDNA cleanup protocols were performed on BC91 total RNA.
[Table headers: Column for ds cDNA cleanup; Reference RNA amplified (Total RNA, Poly(A)+ RNA); Virtual; Average.]

Microarray probe selection strategies

Stefan Tomiuk and Kay Hofmann

Stefan Tomiuk is a member of the bioinformatics group at MEMOREC, a Cologne-based biotechnology company focusing on gene discovery and expression profiling by SAGE and cDNA microarrays. He participates in building up the company's cDNA collection and is responsible for the selection of DNA fragments suitable for microarray application. Kay Hofmann is head of the bioinformatics group at MEMOREC.

Keywords: cDNA microarray, expression profiling, high throughput, clustering, hybridisation

During recent years, DNA microarrays have become the method of choice to monitor the expression level of a large number of genes. Depending on the focus of the study and the method of microarray fabrication, a number of different strategies for probe selection may be most appropriate. One consideration concerns the length of the probe, ranging from some 25 residues used for oligonucleotide arrays to complete cDNAs. Unless resources are truly unlimited, an important decision to be made is the amount of effort to be put into the selection of genes and gene fragments. While high-throughput cDNA arraying projects usually will select from a collection of existing cDNA clones, smaller projects focusing on a number of selected genes can afford to selectively amplify fragments optimised for that purpose.
This paper discusses the full scope of probe selection strategies, highlighting the problems that may be encountered in the various systems.

DNA microarrays are made up of a collection of distinct nucleic acid samples, arranged in a regular lattice of spots on a solid support generally made of coated glass. Arrays intended to monitor changes in the expression level of various genes use cDNA samples or synthetic oligonucleotides derived from cDNA sequences.1,2 Other possible array applications include the detection of mutations or copy number changes on the genome level3-5 and thus use samples derived from genomic DNA. The successful application of each DNA microarray technique requires particular conditions and prerequisites, which impose certain criteria for selecting appropriate DNA probes. The following paragraphs focus on probe selection strategies for the more widely used expression arrays of both the oligonucleotide- and cDNA-using variety. Nevertheless, some of these criteria are also valid for mutation-detection arrays.

GENERAL CONSIDERATIONS

When monitoring the expression level of a large number of genes, sufficient sensitivity and specificity of an array, as well as the broad coverage of all relevant genes, are of crucial importance. In addition, the quality of the array should guarantee the reproducibility of the results to ensure their statistical significance. A further prerequisite for a successful interpretation of the array results is a correct assignment and annotation of the DNA probes, providing an unambiguous link to the corresponding entries in gene and literature databases. Some aspects of probe design, including the fragment length, are influenced by the manufacturing process of the arrays.
Photolithographic procedures allow a massively parallel production of oligonucleotide arrays, but are restricted to an oligonucleotide length of 20-25 nucleotides due to the high error rate of each extension cycle.6-8 Alternative methods for in situ oligonucleotide synthesis, employing high-precision delivery of chemical

[Margin note: Physical properties of the probe influence hybridisation kinetics]
[Margin note: High coverage but poor sample annotation in high density arrays]
[Margin note: Short vs. long array probes]

reliable hybridisation properties but the increased viscosity might complicate the array manufacturing process. In addition, increasing the fragment length raises the danger of non-specific cross-hybridisation events. If fragments of very heterogeneous length are used, the comparability of the investigated genes and the robustness of the array might suffer from the different hybridisation kinetics. Oligonucleotide probes with a length of 50-60 nucleotides may not be suitable for reliably distinguishing single base mismatches, but show an improved specificity and sensitivity compared to shorter oligonucleotides.9,30

The most appropriate probe selection strategy depends primarily on the objective of the experiment. As summarised in Figure 1, there is a whole spectrum of different approaches, differing in aspects of throughput, accuracy and the necessary effort before and after the microarray experiment. In situations where little prior information on relevant genes is available, or where the prime motivation is an unbiased overview of global changes in gene expression patterns, the high-density method is the appropriate choice. Typically, samples are selected from a preexisting collection of cDNA sequences or fragments, or they are synthesised by a method amenable to high throughput. The downside of this approach is a general lack of reliable sample annotation, shifting some of the necessary work to the post-hybridisation phase.
These high-density microarrays, which aim to cover the complete transcriptome of a biological system,2,7 are in contrast to small but specialised arrays that are designed with a focus on defined subject areas such as, for example, genes relevant to a particular metabolic pathway or a particular tissue type.31,32 The limited number of DNA fragments on these low-density arrays allows a more thorough selection and annotation protocol. Obviously, there also exists a whole range of intermediates
Microarray probe selection strategies
The quality of EST-based arrays depends on the reliability of the library used
Spotting without prior sequencing
PCR-amplification is the most reliable but most expensive probe generating method
between ultrahigh-density and high-accuracy arrays. In the following paragraphs, some common strategies for probe selection are discussed. The easiest and cheapest method consists of the spotting of clones from a library without prior sequencing. Only those clones that show differential expression after hybridisation are submitted to sequencing and further analysis. This strategy is particularly useful for arrays produced in small editions, since only a small fraction of presumably interesting genes must be annotated. The more frequently a particular array set-up is used, the less efficient the deferment of the sequence analysis becomes. Typical applications include high-throughput screens for potential new drug targets,33,34 or the analysis of `exotic' biological systems without any available sequence information. Owing to the frequent representation bias of some genes, a normalisation of the library used is strongly recommended for reaching a more equal distribution.35 A somewhat more refined strategy relies on available collections of sequenced cDNA clones.
Most of the available clones have the status of ESTs (expressed sequence tags36 ), and their corresponding sequences are collected in the dbEST database.37 Access to the physical clones of most animal ESTs is provided by the IMAGE consortium (Integrated Molecular Analysis of Genomes and their Expression),38 and by 0 several distributors. Since clones from this exhaustive collection are also available in large sets, they are a valuable and widely used source for microarray probes. For plants and other organisms, similar sources exist. A comm 0 Research Update 0 Genome Analysis 0 Eubacterial phylogeny based on translational apparatus proteins 1 Celine Brochier, Eric Bapteste, David Moreira and Herve Philippe 0 Lateral gene transfers are frequent among prokaryotes, although their detection remains difficult. If all genes are equally affected, this questions the very existence of an organismal phylogeny. The complexity hypothesis postulates the existence of a core of genes (those involved in numerous interactions) that are unaffected by transfers. To test the hypothesis, we studied all the proteins involved in translation from 45 eubacterial taxa, and developed a new phylogenetic method to detect transfers. Few of the genes studied show evidence for transfer. The phylogeny based on the genes devoid of transfer is very consistent with the ribosomal RNA tree, suggesting that an eubacterial phylogeny does exist. 0 The completion of many genome sequence projects has revealed the fundamental importance of lateral gene transfers 0 species and that have no (or very few) duplicated copies. We concatenated the sequences of the 57 genes into a large fusion (~ 9000 amino acid positions). The phylogeny based on this fusion is very similar to that inferred from rRNA and gene content. 
Detailed analysis revealed that 13 out of the 57 gene phylogenies were INCONGRUENT (see Glossary) with the phylogeny based on the fusion of the 57 genes, either due to methodological tree-reconstruction problems or to a few recent LGTs. A true organismal phylogeny for Bacteria seems to exist, which could be fully resolved by the analysis of a core group of very rarely transferred genes.
Phylogenetic analysis of a large protein fusion
For our analysis, we retrieved from the public databanks and from ongoing
Congruence and incongruence: Congruence is the agreement between phylogenies obtained using different datasets or different reconstruction methods. Trees are perfectly congruent if they display the same topology; that is, they reflect the same evolutionary history. By contrast, incongruent trees show conflicting robust nodes, which could be due to different evolutionary histories (e.g. lateral gene transfers) or tree reconstruction problems.
Γ law: Traditional models of sequence evolution assume that all positions in the sequences are equally likely to undergo a substitution, which reduces the complexity of these models. However, in reality, positions in sequences are more or less `free' to vary; that is, they have different probabilities of undergoing substitutions. This limits the biological realism of traditional models and their efficiency for phylogenetic reconstruction. The variation of substitution rates is commonly approximated using a gamma distribution, also known as a Γ law, which has a shape parameter α that specifies the range of rate variation [a]. Small values of α result in an L-shaped distribution with extreme variation of rates (most sites are invariable, but a few have very high substitution rates). As α gets larger, the range of variation diminishes, until α approaches infinity and all sites have the same substitution rate.
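The effect of the gamma shape parameter described in the glossary entry above can be illustrated numerically. The following is a minimal sketch, not taken from the article: it assumes the common convention of a gamma distribution with mean 1 (shape α, scale 1/α), and the function names, the number of sites and the 0.01 threshold for "nearly invariable" sites are illustrative choices only.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # shape = alpha, scale = 1/alpha  ->  mean 1, variance 1/alpha
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def summarize(alpha, n_sites=20000):
    """Return (mean rate, standard deviation, fraction of nearly invariable sites)."""
    rates = sample_site_rates(alpha, n_sites)
    nearly_invariable = sum(r < 0.01 for r in rates) / n_sites
    return statistics.mean(rates), statistics.pstdev(rates), nearly_invariable


if __name__ == "__main__":
    for alpha in (0.2, 1.0, 5.0, 100.0):
        mean, sd, frac = summarize(alpha)
        print(f"alpha={alpha:6.1f}  mean={mean:.2f}  sd={sd:.2f}  "
              f"fraction with rate < 0.01: {frac:.2f}")
```

Under these assumptions, a small shape value should give many near-zero rates together with a few large ones, whereas a large shape value should concentrate all rates close to 1, which is the behaviour the glossary describes.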
HKY model: The Hasegawa, Kishino and Yano [b] model of sequence evolution is a merger of the Felsenstein [c] and the Kimura two-parameter models [d], which allow base frequencies to be unequal and transitions and transversions to occur at different rates, respectively.
Jack-knife analysis: A statistical method to evaluate the robustness of an inference. It is based on the construction of random sub-samples of the original alignment by taking a fraction of the positions without replacement (in contrast to the bootstrap method, which allows replacement). Usually, trees are reconstructed with the random sub-samples and the robustness of each node is estimated as the number of its occurrences among these trees [e].
Log-Det method: A method for estimating evolutionary distances that remains consistent for sequences with different nucleotide or amino acid composition [f]. This approach is required because other methods tend to group sequences on the basis of their composition, irrespective of their evolutionary history.
Kishino-Hasegawa test: A test used for the estimation of incompatibility between alternative tree topologies with the same taxonomic sampling but obtained using different datasets [g]. Two tree topologies are significantly different if the difference between their likelihood values (expressed as lnL, where L is the likelihood) is larger than 1.96 standard errors in the estimation of the likelihood. For a recent criticism of this test see Ref. [h].
Principal component analysis (PCA): This involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
Principal components are obtained by projecting the multivariate data vectors on the space spanned by the eigenvectors.
[Figure residue: phylogenetic trees. Recoverable clade labels: Proteobacteria, Spirochetes, Green sulfur, Chlamydiales, Mycoplasmas (Low G+C Gram positives), D. radiodurans, Low G+C Gram positives, High G+C Gram positives, Thermotogales, Aquificales.]
TRENDS in Genetics
genome projects sequences homologous to all Escherichia coli proteins classified as involved in translation in the Cluster of Orthologous Genes (COG) database [7], as well as the 16S and 23S rRNAs. We aligned 76 proteins from 45 bacterial species, having eliminated any proteins that are present only in a restricted sample of phyla (see http://sorex.snv.jussieu.fr/translation/translation.html). In addition, as a sample of transferred genes, we used the tRNA synthetases (tRS), most of which are known to have undergone numerous LGTs (perhaps related to antibiotic resistance [8,9]). The 76 genes were analysed individually, and 19 of them were excluded from further analyses because they were: (1) difficult to align reliably, (2) present in fewer than 42 of the 45 species, or (3) present in more than one copy in certain phyla (indicating possible ancient duplications and losses, and/or LGTs). The remaining 57 genes, after elimination of ambiguously aligned regions (alignments available on our website), were concatenated for the 45 bacterial species into a large fusion of 8857 amino acids (fusion P1). Most of the best-known bacterial phyla were represented, with a broad taxonomic sampling for Proteobacteria and Gram-positive bacteria. We do not use Archa
Robustness, Flexibility, and the Role of Lateral Inhibition in the Neurogenic Network
Summary Background: Many gene networks used by developing organisms have been conserved over long periods of evolutionary time. Why is that?
We showed previously that a model of the segment polarity network in Drosophila is robust to parameter variation and is likely to act as a semiautonomous patterning module. Is this true of other networks as well? Results: We present a model of the core neurogenic network in Drosophila. Our model exhibits at least three related pattern-resolving behaviors that the real neurogenic network accomplishes during embryogenesis in Drosophila. Furthermore, we find that it exhibits these behaviors across a wide range of parameter values, with most of its parameters able to vary more than an order of magnitude while it still successfully forms our test patterns. With a single set of parameters, different initial conditions (prepatterns) can select between different behaviors in the network's repertoire. We introduce two new measures for quantifying network robustness that mimic recombination and allelic divergence and use these to reveal the shape of the domain in the parameter space in which the model functions. We show that lateral inhibition yields robustness to changes in prepatterns and suggest a reconciliation of two divergent sets of experimental results. Finally, we show that, for this model, robustness confers functional flexibility. Conclusions: The neurogenic network is robust to changes in parameter values, which gives it the flexibility to make new patterns. Our model also offers a possible resolution of a debate on the role of lateral inhibition in cell fate specification. Introduction In this paper, we use a computer model to explore the properties of the neurogenic network, originally characterized in Drosophila melanogaster. This is but one example of the many networks of cross-regulatory genes at work in complex organisms. Other familiar examples include the networks of segment polarity genes, of cell cycle genes, of circadian clock genes, and so on. 
Each of these seems to have remained more or less intact through long periods of evolutionary time and across 0 Robustness in the Neurogenic Network 779 0 embryos and imaginal disks. Figure 1 shows our summary of the core genes, their products, and their interactions. In crafting Figure 1, we approached the modelbuilding process as a biochemist approaches in vitro reconstitution; by adding to the system piece by piece, we hope to figure out how each design feature contributes to the function of the essential core network. We rationalize our choice of this diagram in the Supplementary Material available with this article online, with a synopsis as follows (Below, "ac" and "Ac" refer to the real achaete gene and its protein product, whereas "ac" and "AC" refer to corresponding nodes in the model): Delta (Dl) is a ligand for the receptor Notch (N). When Dl activates N, a cleaved-off cytoplasmic piece of N binds to the transcription factor Suppressor of Hairless (Su(H)), and that heterodimer activates Enhancer of split (E(spl)) complex genes. The proneural genes achaete (ac) and scute (sc) encode transcription factors that actually specify neural fate. Both Ac and Sc are autoactivating and cross-activating: they promote their own, and each others', transcription. Thus, the proneural genes constitute a bistable switch at the heart of the neurogenic network. They also activate transcription of E(spl) and Dl. E(spl) in turn represses transcription of ac and sc. Thus, the loop works as follows: something activates ac and/or sc in the neural-competent cluster. They upregulate Dl, whose product activates N in neighboring cells, which, through Su(H), activates E(spl). E(spl) represses ac and sc in those neighboring cells. To achieve a neural fate, a cell must upregulate ac and sc enough that their autoactivation overwhelms E(spl)-mediated repression due to neighboring cells signaling through N. 
We constructed three different models of the network in Figure 1, which we call "augmented", "standard", and "reduced". The standard network includes all components and interactions shown in Figure 1, except for cis-negative regulation of N activity by Dl and E(spl) autorepression (Figure 1 without red or blue connections). Experimental evidence for each of the latter interactions exists (see the Supplementary Material), but the literature has not given them much attention. Neither did we initially, but our results below regarding the aug- 0 mented network (which adds the red connections) indicate that these may indeed be important. Our reduced network eliminates intracellular negative feedback from AC and/or SC to suppress ac and sc transcription (blue connections replacing red and green connections and their E(spl) hub). Such a simplified network could have functioned in a precursor to the Drosophila network since the similar process of anchor cell specification in the worm Caenorhabditis elegans appears to take place without E(spl)-like genes or function (X. Karp and I. Greenwald, personal 0 Involvement of Putative SNF2 Chromatin Remodeling Protein DRD1 in RNA-Directed DNA Methylation 0 Current Biology 802 0 eling protein CHR35 (At2g16390) [15], which is a member of a previously uncharacterized SNF2-like protein subfamily that is unique to plants. The DRD1 subfamily can be defined by four ProDom [16] domains (Figure 5). These overlap with matches to the functional signatures SNF2_N and HELICc, which together constitute the SWI/ SNF ATPase domain essential for chromatin remodeling activity [17]. The drd1-1 mutation consists of a G-to-R change in the putative Mg2 binding site of SNF2_N. Five additional drd1 alleles (drd1-2, drd1-3, drd1-4, drd1-5, and drd1-6) were identified and sequenced. They all 0 contained a mutation in strongly conserved or functionally implicated regions of the SWI/SNF ATPase domain (Figure 5). 
The DRD1 subfamily comprises six additional members, including a clear DRD1 homolog in rice (BAC84084) (Figure S2). CHR34 (At2g21450), which still shares all six ProDom domains, is the Arabidopsis protein most similar to DRD1. Another rice protein (AAM15781) is highly similar to DRD1 and also contains all six domains. The remaining three members [At1g05480, T25N20.14 (Q9ZVY9, similar to CHR31), and CHR40 (At3g24340)] have only four of the six ProDom domains in common 0 SNF2 Protein DRD1 and RNA-Directed DNA Methylation 803 0 The stability of proteins in extreme environments Rainer Jaenicke* and Gerald Boehm 0 Three complete genome sequences of thermophilic bacteria provide a wealth of information challenging current ideas concerning phylogeny and evolution, as well as the determinants of protein stability. Considering known protein structures from extremophiles, it becomes clear that no general conclusions can be drawn regarding adaptive mechanisms to extremes of physical conditions. Proteins are individuals that accumulate increments of stabilization; in thermophiles these come from charge clusters, networks of hydrogen bonds, optimization of packing and hydrophobic interactions, each in its own way. Recent examples indicate ways for the rational design of ultrastable proteins. 0 been isolated -- thousands of microbes were isolated from the first samples collected from the Challenger Deep at 110 MPa [2], but very few of them were truly barophilic [3·]. Their proteins are still terra incognita. 0 Limits of stability and growth 0 Proteins, independent of their mesophilic or extremophilic origin, consist exclusively of the 20 canonical natural amino acids. In the multicomponent system of the cytosol, these are known to undergo covalent modifications at extremes of temperature, pH and pressure (deamidation, elimination, disulfide interchange, oxidation, Maillard reactions, hydrolysis, etc. [4]). 
Extremophiles must compensate for amino acid degradation either by using compatible protectants or by enhanced synthesis and repair. Little is known about the chemistry involved, for example, in the hydrothermal decomposition of proteins, and even less is known about protection and repair. Applying temperatures beyond 100°C, the thermal stabilities of the common amino acids are (Val,Leu)>Ile>Tyr>Lys>His>Met>Thr>Ser>Trp>(Asp,Glu, Arg,Cys). In many cases, the half-lives of the degradation reactions are significantly shorter than the generation time of hyperthermophilic microorganisms [5]; to this limit, biomolecules could still be resynthesized at biologically feasible rates. The temperature at which ATP hydrolysis becomes the limiting factor for viability lies between 110 and 140°C [6]. This temperature limit coincides with the temperature range at which the hydrophobic hydration of proteins vanishes and water becomes an `ordinary solvent' [1]. Apparently, both the integrity of the natural amino acids and the formation of the hydrophobic core upon protein folding are essential for viability. Extrinsic factors and compatible solutes may enhance the stability and shift the limits of growth of prokaryotes as well as eukaryotes [7]. 0 Life on earth exhibits an enormous adaptive capacity. Except for centers of volcanic activity, the surface of our planet is `biosphere'. 
In quantitative terms, the limits of the biologically relevant physical variables are -40 to +115°C (in the stratosphere and hydrothermal vents, respectively), 120 MPa (for hydrostatic pressures in the deep sea), aw 0.6 (for the activity of water in salt lakes) and 1 (arbitrary units)
[Figure residue. Panel titles: GCN4 - Average sensitivity; GCN4 - Average specificity. Axis labels: Oligonucleotide length (nt); Specific / non specific; Formamide (%); Non-specific, specific intensity; Tiling start position (nt).]
Data extraction from composite oligonucleotide microarrays
Ilya Shmulevich*, Jaakko Astola1, David Cogdell, Stanley R. Hamilton and Wei Zhang
ABSTRACT Microarray or DNA chip technology is revolutionizing biology by empowering researchers in the collection of broad-scope gene information. It is well known that microarray-based measurements exhibit a substantial amount of variability due to a number of possible sources, ranging from hybridization conditions to image capture and analysis. In order to make reliable inferences and carry out quantitative analysis with microarray data, it is generally advisable to have more than one measurement of each gene. The availability of both between-array and within-array replicate measurements is essential for this purpose. Although statistical considerations call for increasing the number of replicates of both types, the latter is particularly challenging in practice due to a number of limiting factors, especially for in-house spotting facilities. We propose a novel approach to design so-called composite microarrays, which allow more replicates to be obtained without increasing the number of printed spots.
INTRODUCTION Oligonucleotide arrays (1,2), both synthesized and spotted, enjoy several advantages over cDNA-based arrays (3,4), such as simpler methodology to obtain DNA and better quality control, options to select high-specificity sequences to avoid cross-hybridization, and the potential to detect alternative spliced variants of genes (5). It is known that microarray gene expression measurements exhibit both between-slide and within-slide variability (6) and that apart from making efforts to improve the technology, having replicate measurements is essential for improving the reliability of subsequent quantitative analysis. Dealing with between-slide variability involves repeating entire microarray experiments. There exist some limitations, however, such as availability of RNA as well as cost factors. To address within-slide variability, the typical approach entails printing replicate spots on the same slide. However, spotting robots typically have a limitation on the number of spots that can be reliably printed. Thus, increasing 0 PAGE 2 OF 5 0 each well were resuspended in 1 ml of 50% DMSO array buffer (50 mM for each oligo). Spotting Oligos were spotted onto poly-L-lysine glass slides by a G3 solid pin spotter (Genomic Solutions, Ann Arbor, MI, USA), baked at 65°C for 90 min, and crosslinked with 65 mJ of ultraviolet radiation. Probe labeling, hybridization and quantification 0 or more oligos into the same spot. The challenge then is to recover the individual gene intensities by observing the intensities of the mixtures. This is, in fact, conceptually simpler than the blind source separation problem because we know exactly which genes are present in which spots and because intensities are simply scalars and not time-varying signals. In addition, the contributions from the mixed oligos are expected to be mutually independent, as they are designed to be non-homologous to each other, which is a fundamental assumption of all oligonucleotide microarrays. 
The obvious benefit of this approach is that each gene is given an opportunity to make several contributions in different spots, each time with a different partner, which is therefore also a type of replication. The question is whether the original gene expressions can be reliably recovered from such mixtures.
The microarray experiments were performed as described previously (13). Briefly, triplicate reverse transcription reactions using 100 µg of total RNA from RKO cells incorporated Cy3 d-CTP into cDNA. After G50 column purification, replicates were combined for uniformity and distributed to three identical microarray slides. Each slide was hybridized overnight at 60°C in a humid incubator, then washed at 37°C with increasing stringency until 0.1× SSC was used. Slides were scanned on a LSIV laser scanner (Genomic Solutions, Ann Arbor, MI, USA) and quantified using ArrayVision software (Imaging Research, Inc, St Catherine's, Ontario, Canada).
RESULTS
Our experiment consisted of designing a spotted microarray containing 30 genes, represented by 50 bp oligos, that are expressed at different levels in RKO colon cancer cells based on our prior experiments. Those genes were spotted individually five times each, as well as in mixtures of all possible pairs of genes, for a total of (30 × 29) / 2 = 435 pairs. Thus, each of the 30 genes appeared 29 times with different partner genes. Finally, each mixture was replicated five times to facilitate statistical analysis. Total RNA was isolated from RKO colon cancer cells and used for microarray experiments. As a first step, we proceeded to discover how the intensities of signals of the mixtures are related to signal intensities of the individual genes. Prior to any experimentation, it was expected that the intensity of the mixture should be at least an increasing function of the individual intensities. In other words, the higher the expression of the two genes, the higher is the signal from their mixture.
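The pairing design described above (30 genes, every unordered pair spotted as a mixture) can be made concrete with a short sketch. The code below is illustrative only and is not the authors' software: it enumerates the 435 pairs, and then, under the explicit assumption that a mixture's intensity is a common scale factor times the sum of its two members' intensities plus noise, recovers the scaled individual intensities by ordinary least squares. Function names, zero-based gene indices, the simulated intensities and the noise level are all hypothetical. Because the scale factor cannot be separated from the intensities using mixtures alone, the sketch estimates their product.

```python
from itertools import combinations
import random


def pair_design(n_genes):
    """All unordered gene pairs in lexicographic order: (0,1), (0,2), ..."""
    return list(combinations(range(n_genes), 2))


def simulate_mixtures(x, pairs, scale=1.0, noise_sd=0.0, seed=0):
    """Assumed model: mixture = scale * (x_i + x_j) + Gaussian noise."""
    rng = random.Random(seed)
    return [scale * (x[i] + x[j]) + rng.gauss(0.0, noise_sd) for i, j in pairs]


def recover_scaled_intensities(y, pairs, n_genes):
    """Least-squares estimate of z_i = scale * x_i from all-pairs mixtures.

    For the all-pairs design the normal equations reduce to
    (n - 2) * z_i + sum(z) = s_i, where s_i sums the mixtures containing
    gene i, so they can be solved in closed form (requires n_genes >= 3).
    """
    s = [0.0] * n_genes
    for (i, j), value in zip(pairs, y):
        s[i] += value
        s[j] += value
    total = sum(s) / (2 * n_genes - 2)  # estimate of sum(z)
    return [(s_i - total) / (n_genes - 2) for s_i in s]


if __name__ == "__main__":
    n = 30
    pairs = pair_design(n)
    appearances = [sum(g in p for p in pairs) for g in range(n)]
    print(len(pairs), "pairs; each gene appears", set(appearances), "times")

    rng = random.Random(1)
    x_true = [rng.uniform(100.0, 1000.0) for _ in range(n)]  # made-up intensities
    y = simulate_mixtures(x_true, pairs, noise_sd=1.0)
    z_hat = recover_scaled_intensities(y, pairs, n)
    print("largest absolute error:",
          max(abs(a - b) for a, b in zip(z_hat, x_true)))
```

The closed-form solution is used here only to keep the sketch free of external dependencies; a general least-squares solver applied to the binary gene-by-mixture design matrix would give the same estimate.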
It was further anticipated that the mixture would be a linear combination of the individual gene intensities. That is, if $x_i$ is the individual intensity of gene $i$, $x_j$ is the intensity of gene $j \neq i$, and $y_{k(i,j)}$ is the intensity of the mixture of genes $i$ and $j$, then
$$y_{k(i,j)} = a\,(x_i + x_j) + n, \qquad i, j = 1, \ldots, 30,$$
for some scalar $a$ and additive error component $n$. Here, $k(i,j)$ is simply an index that counts from 1 to 435, so $k(1,2) = 1$, $k(1,3) = 2$, ..., $k(29,30) = 435$. Note that since genes are simply mixed in equal proportions, there is no notion of `first' or `second' gene and thus, we would not expect different weights $a_i$ and $a_j$ for genes $x_i$ and $x_j$. Also, for the least-squares approach that we use below, no statistical description of the error component $n$ is required. Rewriting the above relationship in vector-matrix notation, we have:
$$\mathbf{y} = a\,\mathbf{A}\mathbf{x} + \mathbf{n}$$
where $\mathbf{y}$ is a $435 \times 1$ vector of mixtures, $\mathbf{x}$ is a $30 \times 1$ vector of individual gene intensities, $\mathbf{A}$ is a binary matrix of size $435 \times 30$ in which row $k(i,j)$ contains ones in the $i$th and $j$th positions
MATERIALS AND METHODS
Oligonucleotide design
For the proof-of-principle experiments, we
A novel sensitive microarray approach for differential screening using probes labelled with two different radioelements
H. Salin, T. Vujasinovic, A. Mazurie, S. Maitrejean1, C. Menini, J. Mallet and S. Dumas*
LGN, UMR 7091, CNRS, Batiment CERVI, 5eme Etage, Hopital Pitie Salpetriere, 83 boulevard de l'Hopital, F-75013 Paris, France and 1Biospace Mesures, 10 rue Mercoeur, F-75011 Paris, France
ABSTRACT We have developed a novel microarray approach for differential screening using probes labelled with two different radioelements. The complementary DNAs from the reverse transcription of mRNAs from two different biological samples were labelled with radioelements of significantly different energies (3H and 35S or 33P).
Radioactive images corresponding to the expressed genes were acquired with a MicroImager, a real time, high resolution digital autoradiography system. An algorithm was used to process the data such that the initially acquired radioactive image was filtered into two subimages, each representative of the hybridisation result specific for one probe. The simultaneous screening of gene expression in two different biological samples requires <100 ng mRNA without any amplification. In such conditions, the technique is sensitive enough to directly quantify the amount of mRNA even when present in small amounts: 10^7 molecules in the probe as assessed with an added control sequence and 2 × 10^5 molecules with an endogenous tyrosine hydroxylase mRNA. This novel technique of double radioactive labelling on a microarray is thus suitable for the comparison of gene expression in two different biological samples available in only small quantities. Consequently, it has great potential for various biological fields, such as neuroscience.
INTRODUCTION
DNA array technology is increasingly used for large-scale screening of gene expression. The availability of laser devices that can differentiate between several fluorescent dyes has led to most development efforts being concentrated on fluorescent labelling of probes to be hybridised onto DNA arrays (the immobilised nucleic acid is called the `target' and the free nucleic acid is called the `probe'). The use of two different fluorescent dyes, one to label probes from a control tissue and one to label probes from a tissue of interest, allows normalised quantification of gene expression. For example, standard high
of starting material required for radioactive labelling is only 2-400 ng mRNA to detect 2 × 10^7 molecules (12). Previously, such analyses were possible only for one mRNA sample at a time.
A technique comparing several mRNA samples on the same high density array but attaining the sensitivity discussed above would be of great value. For example, the results could be normalized, each RNA sample being used as a control for the other, on each target of the microarray, as is possible with double fluorescent labelling (2). These considerations led us to develop a technique for simultaneous hybridisation of two differently labelled radioactive probes on the same glass support microarray and detection of the hybridisation result for each probe separately. The development of this procedure required a device for detection of radioactive emission that could discriminate between different radioactive emission spectra and also with a spatial discrimination appropriate for the microarray density. The MicroImager has these properties. We have previously shown the potential of this device in the discrimination of the radioactive emissions of two different radioelements for in situ hybridisation of two probes on a single tissue section (13,14). Here we describe methods of labelling and hybridisation allowing work with two radioactive probes simultaneously on a single glass support microarray. The sensitivity of this method was analysed and we demonstrate the potential of this novel approach in cases where only small samples are available. MATERIALS AND METHODS Gene array PCR products 300-1500 bp long were purified using the concert nucleic acid purification system and then spotted with an arrayer (Genetix) onto polylysine-coated slides (15). The cDNA clones used were obtained from adult rat brains by RT-PCR, from a positive and exogenous control luciferase cDNA sequence (572 bp insert) in the pGEM-T easy vector (Promega, France) and from a negative and exogenous control neomycin phosphotransferase cDNA sequence (738 bp insert) in the pGEM-T easy vector (Promega). A total of 384 clones were spotted onto the microarray. 
The microarray plan was made up of four blocks of four rows and 24 columns (as shown in Fig. 2). This plan was in duplicate on every microarray. Preparation of the luciferase RNA The luciferase RNA was prepared from the luciferase cDNA described above using the riboprobe combination system T7 (Promega). RNA extraction mRNA was directly isolated from crude extracts of rat brain tissues on magnetic beads [oligo(dT)25 Dynabeads; Dynal]. All experimental procedures were carried out in accordance with the European Communities Council Directive (24.xi.1986) and with the guidelines of the CNRS and the French Agricultural and Forestry Ministry (decree 87848, licence number A91429). All efforts were made to minimise animal suffering and to use only the number of animals necessary to produce reliable scientific data. 0 Sample preparation for hybridisation Aliquots of 100 ng mRNA were mixed with 0.1 µg random hexamers from a Superscript First-Strand Synthesis System for RT-PCR (Life Technologies, France), heated to 70°C for 10 min and cooled on ice. Probe synthesis and labelling were then performed in the presence of 5 mM MgCl2, 1x reverse transcription buffer (Life Technologies), 10 mM dithiothreitol, 100 U RNaseOUT RNase inhibitor (Life Technologies), 0.05 mM ddTTP, 0.5 mM dGTP and dTTP, 100 U Superscript II reverse transcriptase (Life Technologies) and 10 µCi [35S]dATP (Amersham) and 0.5 mM dCTP or 20 µCi [3H]dCTP (Amersham) and 0.5 mM dATP for the phosphorylated and tritiated probes, respectively, by incubation of the mixtures at 42°C for 50 min. RNA was eliminated by heating at 70°C for 15 min and treatment with 2 U RNase H (Life Technologies) at 37°C for 20 min. Unincorporated nucleotides were removed by passage through a P10 column (Bio-Rad). Hybridisation The probes were added to the hybridisation buffer (3.5x SSC, 0.3% SDS), heated to 95°C for 2 min, cooled to room temperature and then put on the microarray under parafilm (Fuji). 
Hybridisation was performed in a cassette chamber (Telechem) submerged in a water bath at 60°C for 16-17 h. Following hybridisation, arrays were rinsed at room temperature in 2x SSC, 0.1% SDS, then 2x SSC, then 0.2x SSC, each washing step lasting 2 min. Acquisition of radioactive images with a MicroImager (Biospace Mesures, Paris, France) A thin foil of scintillating paper was placed in contact with the microarrays. -Particles emitted by the hybridised probes were identified by acquisition of the light spot emissions in the scinti 0 Sensitivity and Specificity of Photoaptamer Probes* 1 Drew Smith§, Brian D. Collins, James Heil, and Tad H. Koch¶ 0 Proteomics, the study of protein expression at the scale of cell, tissue, or organism (1, 2), has been defined by a single technology: two-dimensional gel separation followed by mass spectrometric analysis (3, 4). Although this technology is mature, powerful, and wonderfully sophisticated, it suffers from evident limitations in speed and sensitivity. Several days are required to process a single sample, and only 1000 of the most abundant proteins can be detected (5). The ideal proteomic technology would process samples in minutes or hours and be able to quantify even the most weakly expressed proteins. Two-dimensional gels and chromatographic methods separate and identify proteins on the basis of their physical characteristics. An alternative approach is to identify proteins by specific recognition. The potential advantage of this approach is that proteins that have similar size and charge but which 0 The abbreviations used are: SELEX, systematic evolution of ligands by exponential enrichment; A, aptamer; aFGF, acidic fibroblast growth factor; bFGF, basic fibroblast growth factor; NHS, N-hydroxysuccinimide; PDGF, platelet-derived growth factor; T, target protein; HIV, human immunodeficiency virus. 
0 Molecular & Cellular Proteomics 2.1 0 under the harshest and most stringent conditions necessary to reduce background and improve signal. What is not established is the effect of photocross-linking on the specificity of the capture step. We set out to characterize, systematically and quantitatively, a set of photocross-linking aptamers, photoaptamers, with regard to their sensitivity and specificity. The photoreactive unit incorporated into our photoaptamers is 5-bromodeoxyuridine (BrdUrd), used for decades in protein-nucleic acid cross-linking studies. Rather than use short wave (254 or 266 nm) UV light for cross-linking, however, we irradiate at 308 nm using a XeCl excimer laser. This technique was developed by Koch and colleagues (12-16) and has been shown to result in specific and high yield cross-linking reactions. Light at 308 nm induces photoelectron transfer from a nearby electron donor to the bromouracil base via either excitation of the BrdUrd, excitation of the electron donor, or excitation of a BrdUrd-electron donor charge transfer state (17, 18). Amino acid residues that can serve as electron donors in BrdUrd photocross-linking include Tyr, Trp, His, Phe, Cys, Cys-Cys, and Met, of which only Tyr and Trp are excited at 308 nm (16-20). Cross-linking results from subsequent reaction of the resulting radical ion pair. In the absence of an electron donor the BrdUrd efficiently relaxes back to ground state (17). We hypothesized that photocross-linking via photoelectron transfer would actually enhance the specificity of the aptamer-protein capture reaction: although a protein might bind an aptamer nonspecifically, the probability that an appropriate amino acid would be positioned to cross-link with a BrdUrd residue would be low.
Some evidence for this view has been presented by Golden and co-workers (9), who showed that basic fibroblast growth factor (bFGF) photoaptamers could cross-link picomolar concentrations of target in the presence of serum with very little nonspecific cross-linking. Using these bFGF photoaptamers and a new photoaptamer raised against the HIV coat protein gp120MN we evaluated both the equilibrium binding constant and the relative rate of cross-linking to target proteins. We then compared these values to the values for a set of non-target proteins. These non-target proteins were chosen to provide an exacting test of specificity: 1) aFGF and gp120SF2 are the commercially available proteins most closely related to the target proteins; 2) platelet-derived growth factor (PDGF) is a highly basic heparin-binding growth factor that is notorious for its nonspecific DNA binding; and 3) thrombin is another heparin-binding protein. These experiments confirm the specificity of the photocross-linking reaction in the solution phase. We extend these results to microarray format by measuring cross-linking of immobilized photoaptamers to target protein. We find that the sensitivity and specificity of photocross-linking are maintained in this format: target proteins can be detected at subnanomolar concentrations in buffer and at nanomolar concentrations when spiked into serum. 0 EXPERIMENTAL PROCEDURES 0 Revealing Global Regulatory Features of Mammalian Alternative Splicing Using a Quantitative Microarray Platform 0 Molecular Cell 0 sive use of the latter approach was the application of "exon-junction" microarrays for the discovery of exon skipping events in human tissues and cell lines (Johnson et al., 2003). These authors used custom microarrays containing oligonucleotide probes complementary to mapped exon-exon junction sequences in RefSeq genes for the main purpose of discovering new AS events in human transcripts.
Despite the progress described above, a system has not yet been described that permits the large-scale quantitative profiling of alternative splicing in mammalian cell and tissue sources. This is primarily due to limitations stemming from the design of existing microarrays and the lack of suitable algorithms for data analysis. In this paper, we describe a microarray platform that permits the simultaneous quantification of the levels of thousands of alternative exons in mammalian cell and tissue sources. We have applied this system to the analysis of the regulation of 3126 sequence-verified AS events in diverse mouse tissues. The resulting data have generated hundreds of new inferences for functional roles of tissue-specific AS, insights into how the evolutionary origins of alternative exons relate to their inclusion levels in normal tissues, and information on global features of AS that underlie tissue-type specificity. This study therefore demonstrates the utility of a quantitative microarray platform for generating fundamental new insights into the global regulation of alternative splicing in mammals. Results A Custom Microarray for Quantitative Profiling of AS in Mammalian Cells In order to perform large-scale quantitative analyses of functionally diverse AS events in mammalian tissues, we developed a custom microarray to represent sequence-validated AS events mined from mouse cDNA and EST sequence databases (refer to Experimental Procedures). To minimize representation of possible splicing errors or relatively low-abundance transcripts, we selected "cassette-type" AS events with the highest numbers of supporting cDNA and EST sequences from different cell and tissue sources.
To enhance the sensitivity of detection and quantification of inclusion/exclusion levels of alternative exons, each AS event was measured by using six different oligonucleotide probes: one body probe for each exon sequence, designated as "C1, A and C2" probes (C, constitutive; A, alternative), and one junction probe for each of the three splice-junction sequences generated by AS, designated as "C1-A, A-C2 and C1-C2" probes (Figure 1A). In addition, a control probe specific to each intron sequence (located between C1 and A) was included to permit detection of unspliced pre-mRNA and/or contaminating genomic DNA in the hybridizations. From an initial starting set of 4892 AS events in our database, 3126 AS events were selected for monitoring on a single ink-jet printed microarray, manufactured by Agilent Technologies (Figure 1B). The vast majority of the AS events correspond to cassette-type alternative exons, and additional events may correspond to mutually exclusive alternative exons. The 3126 AS events are represented by 2647 distinct genes, with 413 of the genes containing two or more AS events. In addition, 54 of the AS events represented on the microarray are duplicates and were monitored by sets of probes that in some cases are complementary to different sequences within the same exons. These served as reproducibility controls (see below). The 2647 AS genes represented on the microarray are associated with 1118 distinct Gene Ontology Biological Process (GO-BP) categories among a total set of 2362 GO-BP categories assigned to 10,361 Mouse Gene Informatics (MGI) markers (refer to Experimental Procedures; see below). This indicates that the AS genes represented on the microarray are associated with a diverse range of biological functions in mammalian cells.
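The probe layout described above (three exon-body probes, three junction probes and one intron control per AS event) can be sketched as a small record type. The inclusion estimate below is only a naive illustration that assumes junction-probe intensities are directly comparable; it is not the analysis method of the paper, and the names `ASEventProbes` and `naive_percent_inclusion` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ASEventProbes:
    """Intensities for the seven probes that monitor one cassette-type AS event."""
    c1: float      # body probe, upstream constitutive exon
    a: float       # body probe, alternative exon
    c2: float      # body probe, downstream constitutive exon
    c1_a: float    # junction probe, exon-included isoform
    a_c2: float    # junction probe, exon-included isoform
    c1_c2: float   # junction probe, exon-skipped isoform
    intron: float  # control probe for unspliced pre-mRNA or genomic DNA


def naive_percent_inclusion(p: ASEventProbes) -> float:
    """Assumed estimator: included-junction signal over total junction signal."""
    included = mean([p.c1_a, p.a_c2])
    total = included + p.c1_c2
    if total <= 0:
        raise ValueError("no junction signal to estimate inclusion from")
    return 100.0 * included / total


event = ASEventProbes(c1=500, a=250, c2=500, c1_a=300, a_c2=100, c1_c2=200, intron=5)
print(naive_percent_inclusion(event))
```

A real analysis would also need to model probe-specific affinities and background, which this sketch ignores.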
Quantitative Microarray Profiling of Alternative Splicing in Mouse Tissues In order to assess the performance of our microarray system and to reveal global properties of alternative splicing in mammalian tissues, we hybridized 0 Molecular Cancer Therapeutics 0 Transcriptome analysis of endometrial cancer identifies peroxisome proliferator-activated receptors as potential therapeutic targets 1 Cathrine M. Holland,1,2 Samir A. Saidi,2 Amanda L. Evans,1 Andrew M. Sharkey,1 John A. Latimer,2 Robin A.F. Crawford,2 D. Stephen Charnock-Jones,2 Cristin G. Print,1 and Stephen K. Smith1,2 0 Endometrial carcinoma is the most common gynecologic malignancy and comprises 97% of all uterine cancers (1). There is a peak incidence between ages 55 and 65 years, with <5% of endometrial cancers occurring below age 40 years (2). The majority are of an endometrioid histologic subtype and display an association with obesity and diabetes mellitus (2). There is a pressing need to better understand the molecular basis for this disease, as 25% of women present with extrauterine disease with 5-year survival rates of ~31% and 10% for Federation Internationale des Gynaecologistes et Obstetristes stages 3 and 4 disease, respectively (2). An improved understanding of events at a molecular level is essential in the development of targeted therapy, with a view to improving survival and cure rates. There are increasing efforts to gain a more global view of the multiple, interrelated molecular changes that occur during tumorigenesis (3-6). The gene microarray is a high-throughput technology able to interrogate multiple genetic changes within tissues and cells (7-9). Consequently, there has been a marked increase in the use of microarrays to interrogate cancers at the genomic level. In addition to screening for candidate genes, microarrays may provide molecular diagnoses, thus avoiding some of the weaknesses of conventional diagnostic techniques (4, 10).
Despite the increasing use of microarray technology in cancer research, there have been difficulties obtaining meaningful biological information. The cost of genome-wide, commercially available arrays may prohibit large experimental samples, and there are multiple sources of variation in experimental results complicating data analysis and interpretation (11). Large-scale gene expression analyses of endometrial cancer have mostly been confined to small sample sets and cell lines (12, 13) and have employed genome-wide, commercially available microarray systems (12). Previous microarray studies in endometrial cancer have highlighted differences in the abundance of individual genes between benign and malignant tissues (12, 13), although there has been little advance in the understanding of pathway-specific alterations that may contribute to endometrial tumorigenesis. Independent component analysis (ICA) is a sophisticated statistical method that aims to identify patterns of coregulated genes rather than individual transcript changes (14). We previously have applied high-density cDNA microarrays to determine gene transcript abundance in epithelial ovarian cancer (14). 0 Materials and Methods 0 Tumor Samples and RNA Preparation Twenty frozen endometrial carcinoma tissues, three atypical complex hyperplasias, and eight postmenopausal benign endometrial control tissues (four atrophic and four 0 PPARα Is a Molecular Target in Endometrial Cancer 0 quantitative, real-time PCR experiments were done in the ABI PRISM 7700 Sequence Detector (Applied Biosystems) according to the manufacturer's instructions and were done in triplicate. The resultant data were averaged for each sample. No-template controls were included in each experiment. Specific oligonucleotide primers and probes were used.
These were designed for each of five genes [cyclooxygenase-2 (COX-2), vascular endothelial growth factor-B (VEGF-B), PPARα, PPARγ, and retinoid X receptor β (RXRβ)] using Primer Express 1.5 software (Applied Biosystems). Sequences are given below:
(a) COX-2: 5′-TGATCCCCAGGGCTCAAA-3′ (forward primer), 5′-ATCTGTCTTGAAAAACTGATGCGT-3′ (reverse primer), 5′-6FAM-TGATGTTTGCATTCTTTGCCCAGCACT-TAMRA-3′ (probe);
(b) VEGF-B: 5′-AGCACCAAGTCCGGATG-3′ (forward primer), 5′-GTCTGGCTTCACAGCACTG-3′ (reverse primer), 5′-6FAM-AGATCCTCATGATCCGGTACCCGT-TAMRA-3′ (probe);
(c) PPARα: 5′-GACGTGCTTCCTGCTTCATAGA-3′ (forward primer), 5′-CACCATCGCGACCAGATG-3′ (reverse primer), 5′-6FAM-TGGAGCTCGGCGCACAACCA-TAMRA-3′ (probe);
(d) PPARγ: 5′-CAGAGCAAAGAGGTGGCCAT-3′ (forward primer), 5′-GCTTTTGGCATACTCTGTGATCTC-3′ (reverse primer), 5′-6FAM-CATCTTTCAGGGCTGCCAGTTTCGC-TAMRA-3′ (probe);
(e) RXRβ: 5′-CCATCCGCAAAGACCTTACATAC-3′ (forward primer), 5′-GTTCCGCTGGCGCTTG-3′ (reverse primer), 5′-6FAM-TGCCGGGACAACAAAGACTGCACA-TAMRA-3′ (probe).
Results for gene abundance in each sample were normalized to abundance of an endogenous control gene. 18S rRNA was used as an endogenous control for all genes, with the exception of VEGF-B for which β-actin was used. Preliminary experiments to determine tha 0 Patterns of Temperature Adaptation in Proteins from Methanococcus and Bacillus 1 John H. McDonald,* Alicia M. Grasso,* and Lidia K. Rejto 0 Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization 1 Paul T. Spellman,* Gavin Sherlock,* Michael Q. Zhang, Vishwanath R. Iyer,§ Kirk Anders,* Michael B. Eisen,* Patrick O. Brown,§ David Botstein,*¶ and Bruce Futcher 0 INTRODUCTION In 1981 Hereford and coworkers discovered that yeast histone mRNAs oscillate in abundance during the cell division cycle (Hereford et al., 1981).
To date 104 messages that are cell cycle regulated have been identified using traditional methods, and it was estimated that some 250 cell cycle-regulated genes might exist (Price et al., 1991). There are several reasons why genes might be regulated in a periodic manner coincident with the cell cycle. Such regulation might be required for the proper functioning of mechanisms that maintain order during cell division. Alternatively, regulation of these genes could simply allow conservation of resources. Much of the literature has focused on the posttranscriptional mechanisms that control the basic timing of the cell cycle. However, there is also clear evidence that trans-acting factors play a critical role in the regulation of the abundance of many cell cycle-regulated transcripts. Most identified cell cycle controls that exert influence over mRNA levels do so at the level of transcription. Three major types of cell cycle transcription factors are known in yeast, the MBF and SBF factors, Mcm1p-containing factors, and Swi5p/Ace2p (Table 1). Many genes expressed at about the G1/S transition contain MCB or SCB elements in their promoters to which MBF and SBF bind respectively (for review, see Koch and Nasmyth, 1994). It is now apparent that SBF is not as specific for SCBs as was originally thought but, rather, can bind, at least in some cases, to motifs more closely matching the MCB consensus (Partridge et al., 1997). MBF and SBF are activated posttranslationally by Cln3p-Cdc28p, and SBF, at least, is inactivated by Clb2p-Cdc28p (Amon et al., 1993). 0 by The American Society for Cell Biology 0
Table 1. Transcription factors that regulate the cell cycle
Complex | Composition | Site name | Site | Reference
SBF | Swi6p, Swi4p | SCB | CACGAAA | Nasmyth, 1985; Andrews and Herskowitz, 1989
MBF | Swi6p, Mbp1p | MCB | ACGCGT | Lowndes et al., 1991; McIntosh et al., 1991; Koch et al., 1993
Mcm1p | Mcm1p | MCM1 | TTACCNAATTNGGTAA | Acton et al., 1997
SFF | SFF | SFF | GTMAACAA | Althoefer et al., 1995
Ace2p | Ace2p | SWI5 | ACCAGC | Dohrmann et al., 1996
Swi5p | Swi5p | SWI5 | ACCAGC | Knapp et al., 1996
0 It is this cyclin-dependent activation and inactivation that causes MBF- and SBF-mediated transcription to be cell cycle regulated. Mcm1p can bind with other DNA binding proteins to mediate a specific biological effect. In cooperation with Ste12p, Mcm1p directs the cell cycle expression of some genes in early G1 phase (Oehlen et al., 1996). In cooperation with an uncloned factor called "Swi five factor" (SFF), it induces the expression of CLB1, CLB2, BUD4, and SWI5 in M (Lydall et al., 1991; Sanders and Herskowitz, 1996). Finally, possibly acting without a partner, it induces transcription of CLN3, SWI4, and CDC6 at the M/G1 boundary (McInerny et al., 1997). The Mcm1p-SFF combination is interesting, because it is somehow activated by Clb2p-Cdc28p, and Mcm1p-SFF then induces further transcription of CLB2. Thus, Mcm1p is part of a positive feedback loop for CLB2 transcription. Finally, Swi5p and Ace2p, which are transcriptionally controlled by Mcm1p and SFF, are responsible for the expression of many genes in M and M/G1 (Kovacech et al., 1996). Some of these genes are responsible for inactivating Clb2p and promoting cytokinesis, thus allowing exit from mitosis, and allowing the cycle to begin anew. Many cell cycle-regulated genes are involved in processes that occur only once per cell cycle. Such processes include DNA synthesis, budding, and cytokinesis.
Additionally many of these genes are involved in controlling the cell cycle itself, although in most cases it is unclear whether their regulated transcription is absolutely required. The cell division cycle is thus a complex self-regulating program, such that 0 Strains used in this study are shown in Table 2. 0 Media and Growth Conditions 0 YEP medium (Sherman, 1991) was used in all experiments, supplemented with an appropriate carbon source. Carbon sources are indicated in the descriptions of each experiment and were used at a 0 Molecular Biology of the Cell 0 Microarray Manufacture 0 Yeast ORFs were amplified using gene PAIRS primers (Research Genetics, Huntsville, AL). One hundred-microliter PCR reactions were performed in 96-well PCR plates using each primer pair with the following reagents: 1 µM each primer, 200 µM each dATP, dCTP, dTTP, and dGTP, 1x PCR buffer (Perkin Elmer-Cetus, Norwalk, CT), 2 mM MgCl2, and 2 U of Taq DNA polymerase (Perkin Elmer-Cetus). Thermalcycling was performed in Perkin Elmer-Cetus 9600 thermalcyclers with a 5-min denaturation step at 94°C, followed by 30 cycles with melting, annealing, and extension temperatures and times of 94°C, 30 s; 54°C, 45 s; and 72°C, 3 min 30 s, respectively. Production of the correct PCR product was verified by gel electrophoresis. Products deemed to have failed were reamplified either by repeating the PCR reaction with the gene PAIRS primers, ordering custom primers, or using the yeast ORF DNA (Research Genetics) as a template. Reamplification of failed PCRs used the same protocol as initial amplification. DNAs were prepared and printed onto microarrays as described previously (Shalon et al., 1996; DeRisi et al., 1997 [http://cmgm.stanford.edu/pbrown/]; Eisen and Brown, 1999) with 190-µm spacing between the centers of each element.
Each microarray was visually inspected, and all microarrays used in this study were estimated to be missing 1% of all elements except for arrays used in the cdc15 experiments, which were missing 3% of all elements. 0 Size-based Synchronization 0 Nine l 0 DNA Microarrays of the Complex Human Cytomegalovirus Genome: Profiling Kinetic Class with Drug Sensitivity of Viral Gene Expression 1 JAMES CHAMBERS,1 ANA ANGULO,2 DHAMMIKA AMARATUNGA,1 HONGQING GUO,1 YING JIANG,1 JACKSON S. WAN,1 ANTON BITTNER,1 KLAUS FRUEH,1 MICHAEL R. JACKSON,1 PER A. PETERSON,1 MARK G. ERLANDER,1 AND PETER GHAZAL2* Departments of Immunology and Molecular Biology, Division of Virology, The Scripps Research Institute, La Jolla, California 92037,2 and The R. W. Johnson Pharmaceutical Research Institute, San Diego, California 921211 0 MATERIALS AND METHODS Selection and synthesis of oligonucleotides for DNA microarrays. The complete set of ORFs from the HCMV genome was analyzed with a custom se- 0 J. VIROL. 0 GTACCGTTGTACGCATTACAC-3′) and 18120 (5′-GACGAAGATGCCGATGTGTGAC-3′). The resulting PCR fragments were isolated from agarose gels and then radiolabelled with [α-32P]dATP by the random-primed labelling method (Boehringer, Mannheim, Germany) according to the manufacturer's protocol. For TRL8-IRL8, TRL9-IRL9, UL15, UL31, UL48, UL66, and UL73, the corresponding oligonucleotides shown in Fig. 1 were used as probes, after being [γ-32P]ATP end labelled with polynucleotide kinase (Stratagene). Oligonucleotide probes were hybridized to the filters for 1 h at 45°C by using Quick Hybridization solutions (Stratagene) under conditions recommended by the manufacturer. PCR-generated probes were hybridized with the filters for 12 h at 65°C in 1x Denhardt's solution, 6x SSC, and 100 µg of denatured salmon sperm DNA/ml.
Filters were washed to a stringency of 0.1% sodium dodecyl sulfate (SDS) at 60°C or 1% SDS at 42°C depending on whether PCR-generated DNA fragments or oligonucleotides, respectively, were used during the hybridization. Hybridization signals were quantitated by using a Molecular Dynamics PhosphorImager system with ImageQuant software. MEME analysis of the upstream noncoding DNA sequences. The computer program Multiple EM for Motif Elicitation (MEME) was used to search for sequence motifs in 500 bp of noncoding sequences upstream of the initiation codon. MEME analysis was performed by using the sequence of strain AD169 of HCMV. The 5′ noncoding regions were categorized according to class of expression as follows: E (TRL4-IRL4, UL104-5, UL11, UL112, UL124, UL13, UL16-7, UL24, UL26-7, UL35, UL4-5, UL45, UL53-7, UL77-9, US8-14, US16-7, US19, US23-4, US26, US28, and US30), early-late (E-L) (TRL-IRL6, TRL-IRL10, TRL-IRL12, TRL-IRL13, UL1, UL106, UL130, UL40, UL44, UL46-7, UL49, UL72, UL83-5, UL95-8, US6-7, and US29), and L (TRL-IRL8, TRL-IRL11, TRL-IRL14, UL100, UL103, UL111A, UL117, UL119, UL131, UL14, UL18, UL2-3, UL7, UL9, UL25, UL29, UL32-3, UL43, UL48, UL52, UL59, UL67, UL73, UL80, UL82, UL91-3, UL99, US18, and US27). By using MEME, 30 motifs (10 of 8 bases in length, 10 of 10 bases in length or longer, and 10 of 12 bases in length or longer) were derived from each gene set. The distribution of the combined 90 patterns was identified, allowing for 10% mismatch. MEME is available on the World Wide Web (20a). The resulting motifs that developed a significant polarized distribution pattern are summarized in Table 2. In addition, the transcription factor database (TFD) was used to search for known regulatory sequences. The TFD was downloaded from the National Center for Biotechnology Information. 0 quence analysis program that selected a 75-base sequence to be used as a microarray deposition target.
The analysis preferentially selects unique sequences with a 3′ gene bias and a G-C content of 40 to 60% and rejects sequences that contain homopolymeric stretches and potential hairpin structures. The 3′ gene bias is preferred, as fluorescently labelled cDNA prepared for hybridization is generated by using oligo(dT) to prime poly(A) tails of mRNA. The selected target sequences were synthesized by using a PE Perseptive BioSystem (Framingham, Mass.) Expedite MOSS DNA synthesizer with membrane columns. Synthesized gene target oligonucleotides were cleaved, deprotected, and purified by standard procedures. Target oligonucleotides were transferred in triplicate to 96-well master plates at a concentration of 1 µg/µl (in 3x SSC [1x SSC is 0.15 M NaCl plus 0.015 M sodium citrate]) for robotic deposition. The sequence of oligonucleotides comprising the deposited HCMV ORF microarray is shown in Fig. 1. The small ORF UL48/49 (8) and the UL74 ORF described by Huber and Compton (13) were not included in the present chip design. Also shown in Fig. 1 is a subset of cellular genes that were included as internal controls for normalization between chips, as follows: elongation factor 1-alpha (accession no. M29548), human acidic ribosomal phosphoprotein (RiboPO; accession no. M17885), alpha tubulin (accession no. K00558), glyceraldehyde-3-phosphate deh 0 Accounting Units in DNA 1 S. J. BELL AND D. R. FORSDYKE* 0 Chargaff's first parity rule (%A = %T and %G = %C) is explained by the Watson-Crick model for duplex DNA in which complementary base pairs form individual accounting units. Chargaff's second parity rule is that the first rule also applies to single strands of DNA. The limits of accounting units in single strands were examined by moving windows of various sizes along sequences and counting the relative proportions of A and T (the W bases), and of C and G (the S bases). Shuffled sequences account, on average, over shorter regions than the corresponding natural sequence. For an E.
coli segment, S base accounting is, on average, contained within a region of 10 kb, whereas W base accounting requires regions in excess of 100 kb. Accounting requires the entire genome (190 kb) in the case of Vaccinia virus, which has an overall "Chargaff difference" of only 0.086% (i.e. only one in 1162 bases does not have a potential pairing partner in the same strand). Among the chromosomes of Saccharomyces cerevisiae, the total Chargaff differences for the W bases and for the S bases are usually correlated. In general, Chargaff differences for a natural sequence and its shuffled counterpart diverge maximally when 1 kb sequence windows are employed. This should be the optimum window size for examining correlations between Chargaff differences and sequence features which have arisen through natural selection. We propose that Chargaff's second parity rule reflects the evolution of genome-wide stem-loop potential as part of short- and long-range accounting processes which work together to sustain the integrity of various levels of information in DNA. 0 Academic Press 0 Introduction When the base composition of natural duplex DNA is determined it is found that the quantities of A and T are equal and the quantities of C and G are equal. This is Chargaff's famous first parity rule (Chargaff, 1951). If a long DNA duplex is cut into two and the base composition of each part determined, the rule is found to hold precisely for the two parts, as for the duplex of origin. This division of the duplex can be continued down to individual bases (pairing with their complementary bases on the opposite strand of the duplex). Again Chargaff's parity rule is obeyed precisely (Watson & Crick, 1953). Disregarding nearest-neighbour influences (Turner, 1996), single base pairs can be regarded as fundamental "accounting units". The summation of these individual accounting units results in the precise A = T and C = G equivalences of duplex DNA sequences.
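The moving-window accounting described in the abstract can be sketched as follows. This is a minimal illustration that assumes the Chargaff difference for a window is the number of unpaired W bases (|A - T|) or S bases (|C - G|) divided by the number of bases in the window, which is consistent with the "one in 1162 bases" reading of 0.086% but is not taken from the paper's methods. The function names are hypothetical.

```python
from collections import Counter


def chargaff_differences(seq):
    """Return (W difference, S difference) for a single strand, as fractions."""
    counts = Counter(seq.upper())
    n = sum(counts[base] for base in "ACGT")
    if n == 0:
        raise ValueError("sequence contains no A, C, G or T")
    w_diff = abs(counts["A"] - counts["T"]) / n  # unpaired W bases
    s_diff = abs(counts["C"] - counts["G"]) / n  # unpaired S bases
    return w_diff, s_diff


def windowed_differences(seq, window, step):
    """Yield (start, W difference, S difference) for each full window."""
    for start in range(0, len(seq) - window + 1, step):
        w_diff, s_diff = chargaff_differences(seq[start:start + window])
        yield start, w_diff, s_diff


for row in windowed_differences("AAATCCCG", window=4, step=4):
    print(row)
```

Running the same function on a shuffled copy of the sequence (for example with `random.sample(seq, len(seq))`) gives the natural-versus-shuffled comparison the abstract refers to.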
That the equivalences have arisen, and are maintained, because they are of adaptive value to an 0 Academic Press 0 expected to resemble that resulting from the tossing of a biased coin for which heads (A or C) would be slightly favoured/disfavoured over tails (T or G), respectively, depending on their relative proportions in the total segment. The base composition o 0 Review: Proteins with Repeated Sequence--Structural Prediction and Modeling 1 Andrey V. Kajava 0 The relationship between the amino acid sequence and the three-dimensional structure of proteins with internal repeats is discussed. In particular, correlations between the amino acid composition and the ability to fold in a unique structure, as well as classification of the structures based on their repeat length, are described. This analysis suggests rules that can be used for the structural prediction of repeat-containing proteins. The paper is focused on prediction and modeling of solenoid-like proteins with the repeat length ranging between 5 and 40 residues. The models of leucine-rich repeat proteins and bacterial proteins with pentapeptide repeats are examined in light of the recently solved structures of the related molecules. © 2001 Academic Press Key Words: classification; molecular modeling; prediction; tandem repeats; structural bioinformatics. 0 Copyright © 2001 by Academic Press All rights of reproduction in any form reserved. 0 REVIEW: STRUCTURAL PREDICTION OF REPEAT-CONTAINING PROTEINS 0 their number has grown to about 40 since then (Groves and Barford, 1999; Kobe and Kajava, 2000). Despite this progress, these proteins are still underrepresented in the structural databases (about 0.5% of all structures), compared with sequence databases (about 5%). This lack of structural information is explained by the fact that the large molecular weight and the elongated shape of these molecules hamper X-ray and NMR studies. These difficulties add importance to the theoretical approaches. 
In this article, molecular modeling of several solenoid-like proteins will be described and some rules will be formulated for the theoretical prediction and modeling of these types of repetitive proteins. 0 IS A PROTEIN WITH REPEATS STRUCTURED OR UNSTRUCTURED? 0 This is the first question to answer when approaching a repetitive protein to predict its 3D structure. Most protein molecules fold into only one particular conformation determined by their amino acid sequence. This is especially correct for proteins with aperiodic sequences that fold into globular structures. Unstructured fragments of globular proteins, if any, represent only a minor part of the molecules and are located in loops or connections between stable structural domains. In contrast, proteins with repeats frequently do not have unique stable 3D structures. For example, experimental studies have failed to demonstrate the presence of a unique 3D structure for elastin (Urry et al., 1995), small proline-rich proteins of cell envelopes (Steinert et al., 1999), the circumsporozoite protein of Plasmodium falciparum (Esposito et al., 1989; Dyson et al., 1990), glutenin from wheat (Van Dijk et al., 1997), the serine-rich domain of rtoA protein from Dictyostelium discoideum (Brazill et al., 2000), histidine-proline-rich glycoprotein (Borza et al., 1996), and H1 histones (Hartman et al., 1977). The elastin molecules containing a set of repeats, e.g., VGVAPG and GFGVGAGVP, are unstructured and covalently cross-linked to generate an elastic meshwork that enables tissues such as arteries and lungs to deform and stretch without damage (Urry et al., 1995). The small proline-rich 3 protein of the human cell envelope having GxTKVPEP repeats (here and further in the text, "x" indicates a position with any residue) adopts a loose structure with some regions of protein occasionally folding in β-turn conformations (Steinert et al., 1999). The circumsporozoite protein from P.
falciparum, an agent of malaria, comprises a long tandem array of NANP repeats. This repetitive region can be elongated and flexible and may function similarly to the outer cell carbohydrates. The H1 histone molecules are thought to be responsible for pulling chromatin nucleosomes 0 The Comparative Genomics of Polyglutamine Repeats: Extreme Difference in the Codon Organization of Repeat-Encoding Regions Between Mammals and Drosophila 1 M. Mar Albà,1 Mauro F. Santibáñez-Koref,2 John M. Hancock2,* 0 Abstract. Polyglutamine repeats within proteins are common in eukaryotes and are associated with neurological diseases in humans. Many are encoded by tandem repeats of the codon CAG that are likely to mutate primarily by replication slippage. However, a recent study in the yeast Saccharomyces cerevisiae has indicated that many others are encoded by mixtures of CAG and CAA which are less likely to undergo slippage. Here we attempt to estimate the proportions of polyglutamine repeats encoded by slippage-prone structures in species currently the subject of genome sequencing projects. We find a general excess over random expectation of polyglutamine repeats encoded by tandem repeats of codons. We nevertheless find many repeats encoded by nontandem codon structures. Mammals and Drosophila display extreme opposite patterns. Drosophila contains many proteins with polyglutamine tracts but these are generally encoded by interrupted structures. These structures may have been selected to be resistant to slippage. In contrast, mammals (humans and mice) have a high proportion of proteins in which repeats are encoded by tandem codon structures. In humans, these include most of the triplet expansion disease genes. 0 Key words: Glutamine repeats -- Replication slippage -- Comparative genome analysis -- Repeat evolution -- Triplet expansion diseases -- Triplet repeats -- Genome evolution 0 quences encoding polyglutamine repeats in the yeast genome (Alba et al.
1999a) indicated that the majority does not consist of long runs of single codons, suggesting that in yeast point mutation is an important process in generating polyglutamine repeats. These observations raise the question to what extent the contribution of point mutation and slippage to the evolution of these structures differs in different evolutionary lineages. To study this we have analyzed large protein data sets from a further four model organisms that are currently the subjects of genome sequencing projects (Escherichia coli, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster) and compared them with S. cerevisiae, Mus musculus, and Homo sapiens repeats. The results show similarities and differences between species. For most of the eukaryotic species there is an overrepresentation of tracts encoded by long CAG tandem repeats, supporting the idea that recent slippage has been involved in the generation of a significant proportion of the tracts. However, on average about 70% of the tracts do not show evidence of recent slippage, and in D. melanogaster there is no clear evidence of a strong contribution from slippage. Furthermore, in the two mammalian species about one-third of the tracts are exclusively encoded by CAG and the length of the tracts is on average much longer than in other species. This suggests that slippage has played a more important role in the evolution of polyglutamine regions in mammals than in other taxa. Methods Database Searches 0 BLASTP (Altschul et al. 1990) at the NCBI was used to find all GenBank entries which contained genes encoding long polyglutamine tracts ( 6 glutamines) from E. coli, S. cerevisiae, C. elegans, A. thaliana, D. melanogaster, M. musculus, and H. sapiens. Redundancy in the primary data sets was eliminated by running FASTA within the GCG package (Pearson and Lipman 1988; GCG 1997). 
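The two quantities this passage relies on, polyglutamine tracts in a protein and the codon organization of the region that encodes them, can be sketched with the standard library. The comparison symbol before "6 glutamines" is lost in this copy, so the minimum tract length of six used below is an assumption, and the function names are hypothetical rather than the authors' software.

```python
import re
from itertools import groupby


def find_polyq_tracts(protein, min_len=6):
    """Return (start, end) index pairs for runs of at least min_len glutamines."""
    pattern = re.compile("Q{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(protein)]


def codons(dna):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def longest_codon_run(dna):
    """Return (codon, length) for the longest uninterrupted run of a single codon."""
    best = ("", 0)
    for codon, group in groupby(codons(dna)):
        run = len(list(group))
        if run > best[1]:
            best = (codon, run)
    return best


def is_pure_tract(dna, codon="CAG"):
    """True if the region is encoded exclusively by the given codon."""
    found = codons(dna)
    return bool(found) and set(found) == {codon}


print(find_polyq_tracts("MAQQQQQQQKLQQ"))
print(longest_codon_run("CAGCAGCAACAGCAGCAG"))
```

A tract encoded as CAG CAG CAA CAG CAG CAG is interrupted by CAA, so its longest homogeneous run is three codons even though the protein shows six consecutive glutamines.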
Sequences with 95% identity were considered redundant, and only one representative sequence was used in the subsequent analysis. Where there was a discrepancy in the length of the polyglutamine tract in nearly identical sequences, we took the sequence with the longest tract. 0 Analysis of Codon Repeats 0 We used statistical analysis to analyze two properties of polyglutamine repeat-encoding regions. The first was the extent of deviation of the codon organization within these regions from random. This was measured by considering the deviation of the length of the longest run of each codon type from chance expectation (Alba et al. 1999a,b). The second property was the over- or underrepresentation of tandem codon repeats of a particular length in the whole set of polyglutamine-coding regions in a given species. Length of the Longest Homogeneous Run. As described previously (Alba et al. 1999a,b) the organizational homogeneity or otherwise of a region encoding a polyglutamine repeat has to be considered in the 0 Table 1. Polyglutamine tracts in different species (length of polyglutamine tract; CAG relative frequency; pure codon tracts)
Species            CAG relative frequency (a)     Pure codon tracts
                   Genome      Tracts             CAG       CAA
S. cerevisiae      0.307       0.450*             4.7%      5.4%
C. elegans         0.331       0.430*             2.2%      5.8%
A. thaliana        0.442       0.465 (NS)         2.2%      11.3%
D. melanogaster    0.716       0.728 (NS)         7.3%      0%
M. musculus        0.743       0.824*             37.2%     0%
H. sapiens         0.674       0.830*             26.2%     0%
0 Chi-square test of the r 0 Tendency for Local Repetitiveness in Amino Acid Usages in Modern Proteins 1 Kazuhisa Nishizawa1*, Manami Nishizawa1 and Ki Seok Kim2 0 Systematic analyses of human proteins show that neural and immune system-specific, and therefore, relatively "modern" proteins have a tendency for repetitive use of amino acids at a local scale (~1-20 residues), while ancient proteins (human homologues of Escherichia coli proteins) do not. Those protein subsegments which are unique based on homology search account for the repetitiveness. 
Simulation shows that such repetitiveness can be maintained by frequent duplication on a very short scale (one to two codons) in the presence of substitutive point mutation, while the latter tends to mitigate the repetitiveness. DNA analyses also show the presence of cryptic (i.e. "out of the codon frame") repetitiveness, which cannot fully be explained by features in protein sequences. Simulative modification of the amino acid sequences of immune system-specific proteins estimates that 2.4 duplication events occur during the period equivalent to ten events of substitution mutation. It is also suggested that the repetitiveness leads to longitudinal unevenness within a given peptide domain. Those peptide motifs which contain similarly charged residues are likely to be generated more frequently in the presence of the tendency for repetitiveness than in its absence. Therefore, the neutral propensity of DNA for duplication, which can also tend to generate repetitiveness in amino acid sequences, seems to be manifested primarily when the constraints on amino acid sequences are relatively weak, and yet may be positively contributing to generation of unevenness in modern proteins. 0 Academic Press 0 Keywords: microsatellite; coding regions; peptide motif; triplet repeat 0 Repetitive Use of Amino Acids 0 Results and Discussion 0 Identifying Differentially Expressed Genes in cDNA Microarray Experiments 0 ABSTRACT A major goal of microarray experiments is to determine which genes are differentially expressed between samples. Differential expression has been assessed by taking ratios of expression levels of different samples at a spot on the array and flagging spots (genes) where the magnitude of the fold difference exceeds some threshold. More recent work has attempted to incorporate the fact that the variability of these ratios is not constant. Most methods are variants of Student's t-test. 
These variants standardize the ratios by dividing by an estimate of the standard deviation of that ratio; spots with large standardized values are flagged. Estimating these standard deviations requires replication of the measurements, either within a slide or between slides, or the use of a model describing what the standard deviation should be. Starting from considerations of the kinetics driving microarray hybridization, we derive models for the intensity of a replicated spot, when replication is performed within and between arrays. Replication within slides leads to a beta-binomial model, and replication between slides leads to a gamma-Poisson model. These models predict how the variance of a log ratio changes with the total intensity of the signal at the spot, independent of the identity of the gene. Ratios for genes with a small amount of total signal are highly variable, whereas ratios for genes with a large amount of total signal are fairly stable. Log ratios are scaled by the standard deviations given by these functions, giving model-based versions of Studentization. An example is given. Key words: beta-binomial model, microarray replication. 0 BAGGERLY ET AL. 0 INTRODUCTION 0 The human biological system is under the control of perhaps 40,000 genes. Genes are the encoded blueprints for the proteins that perform cellular functions. In going from genes to proteins, there is an intermediate step in which DNA is transcribed to single-stranded messenger RNA (mRNA). It is through mRNA that genes produce protein. Most of the time, the levels of mRNA reflect the abundance of the corresponding proteins in the cell. Perturbations of the cellular environment by such factors as radiation, heat, food intake, or genetic mutation lead to altered expression in a specific group of genes. A goal of functional genomics is to apply high-throughput technologies to identify, from the vast number of genes, the few genetic and molecular changes associated with a defined phenotype. 
Identification of these genes can help us diagnose disease, identify targets for specific therapeutic intervention, or simply understand the basis of the underlying biological processes. A primary tool for functional genomics is the complementary DNA (cDNA) microarray, which is commonly used to measure the relative expression levels of thousands of genes in a given cell population. Using this approach, researchers have successfully found disease-related genes (Bittner et al., 2000; Clark et al., 2000; Fuller et al., 1999), and have developed new molecular classification schemes for cancers (Bittner et al., 2000; Golub et al., 1999). Microarrays are produced in a laboratory by placing thousands of different cDNA clones onto a solid surface: a nylon membrane or a chemically coated glass microscopy slide. For example, in a typical experiment we print 4,800 spots in a 4 × 12 format of patches, where each patch contains 100 different spots arranged in a 10 × 10 grid. At each spot, approximately 2 nanograms of a specific gene are deposited by a robotic arrayer. Once on the slide, the originally double-stranded DNA is denatured so that it splits into single strands which are bound to the surface. These single strands are then available to serve as specific attractants to the complementary single-stranded DNA molecules, a process called hybridization. To assess the expression levels of the genes in a given cell population, the cells are broken apart chemically (lysed) and total RNA is isolated according to a standard procedure. Then reverse transcriptase is used to convert the mRNA back into single-stranded complementary DNA, which is more stable than RNA. During the process of reverse transcription, fluorescent dyes or radioactively labeled nucleotides can be incorporated, providing a signal that can be monitored by detectors. 
Further, two or more different fluorescent dyes can be used to label different samples, thus allowing simultaneous monitoring of two samples on the same microarray. After the labeled cDNA in a solution is obtained, it is placed onto the microarray surface and incubated to allow specific binding to the different DNA molecules bound to the array. We customarily call the immobilized DNA on the microarray "probe" and the labeled DNA in solution "target." (This target/probe dichotomy is, unfortunately, not set; the literature contains both this usage and the converse. We have chosen to follow the definition adopted in the January 1999 supplement to Nature Genetics, "The Chipping Forecast.") The amount of probe on the array is assumed to be vastly in excess of the amount of target, so that the amount binding to the probe is a function of the target copy number in the mixture. After washing to remove the nonspecific binding, the hybridized microarray is scanned using a laser scanner (for fluorescence) or a phosphorimager (for radioactive labels). We will focus on fluorescent labeling on glass slides in this paper, but the model proposed also holds for radioactive labeling, since the hybridization kinetics are similar. Both scanners produce computer images of the entire array whose pixel values are processed to estimate the rough amounts associated with individual spots. Unfortunately, these measurements do not correspond perfectly to the true expression levels. Reverse transcription and label incorporation work with different efficiencies for different mRNA sequences, so the relative expression levels of different genes within a sample cannot be measured reliably. However, the relative expression levels of the same gene in two different samples can be measured, as the reverse transcription efficiencies should be about the same. Comparing the images introduces two types of offset that must be corrected for. 
First, there is a multiplicative offset, a normalization factor, associated with scans being made using different gain settings or using different amounts of raw material in the two samples. Second, there is a background level associated with the nonspot portions of the image, which must be subtracted before comparisons are made. Estimating and correcting for these offsets introduces variation, which we shall address below. For more detailed descriptions of the experimental protocols used in microarray preparation, the reader is referred to some of the papers addressing protocols (Eisen and Brown, 1999; Hedge et al., 2000). 0 Thus, cDNA microarrays allow us to compare genetic profiles of different samples (Schena et al., 1995, 1996). We may be able to use these profiles to identify genetic markers associated with various diseases by contrasting diseased and healthy tissue. Further, we may arrive at a more objective method of pathology that allows us to identify molecularly distinct subcategories of diseases, paving the way for more focused treatments. Some of this potential is beginning to be realized (Alizadeh et al., 1999; Alon et al., 1999; DeRisi et al., 1997; Eisen and Brown, 1999; Golub et al., 1999; Hughes et al., 2000b; Lee et al., 2000; Pollack et al., 1999; Ross et al., 2000; Scherf et al., 2000). Books on the methodology (Schena, 1999, 2000) are beginning to appear. From a statistical point of view, the initial question to be addressed in comparing relative expression levels is whether an observed difference corresponds to a real difference or simply a statistical fluctuation: How do we assess significance? Early papers (Schena et al., 1995, 1996; DeRisi et al., 1996) focused on sets of genes exhibiting more than a k-fold difference in expression level between samples, where the value of k was chosen more or less arbitrarily. 
Focusing on fold differences reduces to focusing on ratios, or equivalently log ratios, of expression levels. We prefer log ratios because they visually emphasize the equal importance of ratios of k and 1/k; on the log scale these have the same magnitude and differ only in sign. 0 Assessing significance: Historical background 0 In the first statistical attack on the problem of assessing when a log ratio is "significant" (Chen et al., 1997), the use of a fixed fold-difference is restated by assuming that the coefficient of variation associated with each signal is constant, but the fold multiple for significance thresholding is chosen in a less ad hoc fashion. The authors assess the overall level of variability associated with the log ratio measurements for a few "housekeeping" genes whose level of expression is assumed to be constant across samples an 0 General nonlinear framework for the analysis of gene interaction via multivariate expression arrays 1 Seungchan Kim Edward R. Dougherty 1 Michael L. Bittner Yidong Chen 0 National Institutes for Health National Human Genome Research Institute Laboratory for Cancer Genetics 1 Krishnamoorthy Sivakumar 1 Paul Meltzer Jeffrey M. Trent 0 National Institutes for Health National Human Genome Research Institute Laboratory for Cancer Genetics 0 Abstract. A cDNA microarray is a complex biochemical-optical system whose purpose is the simultaneous measurement of gene expression for thousands of genes. In this paper we propose a general statistical approach to finding associations between the expression patterns of genes via the coefficient of determination. This coefficient measures the degree to which the transcriptional levels of an observed gene set can be used to improve the prediction of the transcriptional state of a target gene relative to the best possible prediction in the absence of observations. 
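The fixed k-fold rule used in those early papers can be sketched in a few lines. This is a minimal illustration, not code from any of the cited studies; the function name, the tuple layout, and the default k = 2 are assumptions, and the intensities are assumed to be positive values that have already been normalized and background-subtracted.

```python
def flag_k_fold(spots, k=2.0):
    """Return (gene, ratio) for spots differing by more than k-fold either way.

    `spots` is an iterable of (gene, intensity_a, intensity_b) tuples; the
    layout and names are hypothetical.
    """
    flagged = []
    for gene, a, b in spots:
        ratio = a / b
        # "More than k-fold" in either direction: ratio above k or below 1/k.
        if ratio > k or ratio < 1.0 / k:
            flagged.append((gene, ratio))
    return flagged


spots = [("g1", 400.0, 100.0), ("g2", 100.0, 400.0), ("g3", 120.0, 100.0)]
print(flag_k_fold(spots, k=2.0))
```

Because the cutoff is the same for every spot, the rule takes no account of how variable a ratio is at low total intensity, which is the weakness the later model-based methods address.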
The method allows incorporation of knowledge of other conditions relevant to the prediction, such as the application of particular stimuli or the presence of inactivating gene mutations, as predictive elements affecting the expression level of a given gene. Various aspects of the method are discussed: prediction quantification, unconstrained prediction, constrained prediction using ternary perceptrons, and design of predictors given small numbers of replicated microarrays. The method is applied to a set of genes undergoing genotoxic stress for validation according to the manner in which it points toward previously known and unknown relationships. The entire procedure is supported by software that can be applied to large gene sets, has a number of facilities to simplify data analysis, and provides graphics for visualizing experimental data, multiple gene interaction, and prediction logic. © 2000 Society of Photo-Optical Instrumentation 0 Sequences and clones for over a million expressed sequenced tagged sites ESTs are currently widely available. Characterization of these genes lies behind the ability to collect them. Only 14% of identified clusters contain genes even tenuously associated with a known functionality. One way of gaining insight into a gene's role in cellular activity is to study its expression pattern in a variety of circumstances and contexts, as it responds to its environment and to the action of other genes. Recent methods facilitate large scale surveys of gene expression in which transcript levels can be determined for thousands of genes simultaneously. 
In particular, cDNA microarrays result from a complex biochemical-optical system incorporating robotic spotting and computer image formation and analysis.1-5 Since transcription control is accomplished by a method which interprets a variety of inputs,6-8 we require analytical tools for expression profile data that can detect the types of multivariate influences on decision making produced by complex genetic networks. In this paper we discuss a statistical-operational framework for finding associations between expression patterns of genes by determining whether knowledge of the transcriptional levels of a small 0 gene set can be used to predict the transcriptional state of another gene. A feature of the method is that it allows one to incorporate knowledge of other conditions, such as the application of particular stimuli or the presence of inactivating gene mutations, as predictive elements, thereby broadening the classes of information that can be simultaneously evaluated in modeling biological decision making. Our focus is on a general framework: the determination-prediction paradigm for analysis of gene interaction, comparison of constrained and unconstrained prediction in the face of limited microarray replications, estimation of the degree of determination given limited replications, interpretation of the results, and software to assist interpretation. Experimental results will be given for the purposes of explanation and verification. A particular instance of the general methodology has been applied in a separate biological paper see Sec. 4 .9 A methodological perspective is important for appreciating the range of applicability of the proposed framework, which is not limited to cDNA microarrays, but can be used for studying interaction in the context of other kinds of arrays. The mechanism of intergene association is not a factor in statistical prediction. The only factor is the ability to predict the target level from the predictor levels. 
The predictor genes may be upstream or downstream from the target gene in the 0 SPIE 0 October 2000 0 actual genetic network, some may be upstream and some downstream, or they may be distributed about the network in such a way that their relation to the target gene is based on chains of interaction among various intermediate genes. Whatever the relationship of the predicting genes to the predicted, if knowledge of their states allows us to better predict the expression level of the target gene, then we infer there is some relationship--the better the prediction, the stronger the relation. As the first step in carrying out nonlinear genomic prediction on gene expression profiles, data complexity is reduced by thresholding the changes in transcript level into ternary expression data: -1 (down regulated), +1 (up regulated), or 0 (invariant). This simplification is motivated by the way in which analysis is carried out on cDNA microarrays and by the need to collect many samples where gene expression levels vary due to altered cellular states. To find connections between genes, enough conditions must be sampled to detect the independent functioning of different genetic networks. This amount of sampling requires data from numerous arrays. When viewed across many arrays, the absolute intensity of signal detected by each element of the detector in this hybridization based assay can be seen to vary based both on the process of preparing and printing the EST elements, and the processes of preparing and labeling the cDNA representations of the RNA pools. This problem is solved via internal standardization. 
An algorithm that first calibrates the data internally to each microarray and statistically determines whether the data justify the conclusion that expression is up regulated or down regulated with 99% confidence is used to detect significant changes in the transcript level.10 Requiring a high confidence level ensures that the logical values -1 and +1 represent significant down and up regulation, and do not result from experimental variability. 0 Nonlinear Multivariate Prediction 0 The purpose of nonlinear multivariate prediction (filtering) is to predict (estimate) the output of a nonlinear system. Consider a system S having inputs X_1, X_2, ..., X_m to be observed and measured, along with other inputs, which we may have no way of measuring, and may not even be able to identify (Figure 1). We do not assume a known mechanism by which the output is determined, nor is there an assumption of causality. The prediction problem is to estimate the output of S given only the inputs X_1, X_2, ..., X_m. As indicated in Figure 1, we view X_1, X_2, ..., X_m as input variables to a logical system L that yields a logical value Y_pred that best predicts the value Y that S would provide, given the knowledge of the inputs X_1, X_2, ..., X_m. Statistical training uses only the fact that X_1, X_2, ..., X_m are among the inputs to S, the output Y of S can be measured, and a logical system L can be constructed whose output Y_pred statistically approximates Y. The underlying scientific assumption is that the full system S is beyond the reach of current technology and our knowledge of S is derived from its effect on observable input variables. The logic of L represents an operational model of our understanding. 
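The idea of measuring how much the observed inputs improve prediction of a ternary target, relative to the best prediction made without them, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' estimator: it uses misclassification rate as the error measure, a majority-vote lookup table as the unconstrained predictor, and a plug-in (resubstitution) estimate computed on the same samples used for training. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def majority_error(values):
    """Error rate of always predicting the most common value."""
    counts = Counter(values)
    return 1.0 - counts.most_common(1)[0][1] / len(values)


def relative_improvement(predictors, target):
    """(error without observations - error with observations) / error without.

    `predictors` is a list of tuples of ternary values (-1, 0, +1), one tuple
    per array; `target` holds the ternary value of the target gene per array.
    """
    eps_without = majority_error(target)
    if eps_without == 0.0:
        return 0.0  # target is constant: nothing left to improve (a convention chosen here)
    groups = defaultdict(list)
    for x, y in zip(predictors, target):
        groups[tuple(x)].append(y)
    # For each observed input pattern, predict the most common target value.
    wrong = sum(len(ys) - Counter(ys).most_common(1)[0][1] for ys in groups.values())
    eps_with = wrong / len(target)
    return (eps_without - eps_with) / eps_without


predictors = [(1,), (1,), (-1,), (-1,), (0,), (0,)]
target = [1, 1, -1, -1, 0, 1]
print(relative_improvement(predictors, target))
```

With few arrays a resubstitution estimate of this kind tends to be optimistic, which is one reason constrained predictors and careful error estimation matter when replicates are scarce.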
It is crucial to recognize that this operational model is contingent on existing technology, which determines the inputs that can be observed, the manner in which the inputs are 0 A Comprehensive View of Regulation of Gene Expression by Double-stranded RNA-mediated Cell Signaling* 1 Gary Geiss§, Ge Jin§¶, Jinjiao Guo¶, Roger Bumgarner, Michael G. Katze, and Ganes C. Sen¶ 0 Double-stranded (ds) RNA, a common component of virus-infected cells, is a potent inducer of the type I interferon and other cellular genes. For identifying the full repertoire of human dsRNA-regulated genes, a cDNA microarray hybridization screening was conducted using mRNA from dsRNA-treated GRE cells. Because these cells lack all type I interferon genes, the possibility of gene induction by autocrine actions of interferon was eliminated. Our screen identified 175 dsRNA-stimulated genes (DSG) and 95 dsRNA-repressed genes. A subset of the DSGs was also induced by different inflammatory cytokines and viruses demonstrating interconnections among disparate signaling pathways. Functionally, the DSGs encode proteins involved in signaling, apoptosis, RNA synthesis, protein synthesis and processing, cell metabolism, transport, and structure. Induction of such a diverse family of genes by dsRNA has major implications in host-virus interactions and in the use of RNAi technology for functional ablation of specific genes. 0 Double-stranded (ds)1 RNA is not a major constituent of mammalian cells, but many viruses produce it during their replication cycle as either an essential intermediate for RNA synthesis or a byproduct generated by annealing of complementary mRNAs encoded by the opposite strands of a DNA virus genome (1). In addition, some viruses encode RNA species, such as VA RNA or EBER RNA, which have considerable ds structures. 
Virtually nothing is known about how dsRNA affects viral and cellular gene expression and functions in a virally infected cell, although the role of PKR, the dsRNA-activated protein kinase, in inhibiting protein synthesis has been studied in cells infected with a variety of viruses (2). In the host-virus interaction context, dsRNA is closely associated with the interferon (IFN) system. dsRNA is a potent inducer of type I IFN synthesis and is believed to be the primary viral gene product that causes IFN production by infected cells (3). dsRNA has important roles in IFN actions as well. It is the obligatory activator of two classes of IFN-induced enzymes: PKR, the IFN-induced protein kinase, and 2-5(A) synthetases, whose products activate the latent ribonuclease, RNaseL. Moreover, transcription of some IFN-stimulated genes (ISGs) is also induced by dsRNA (4). That this induction is direct and not mediated by induced IFN was convincingly demonstrated in IFN unresponsive cells and in cells that are devoid of the IFN gene locus (5, 6). Direct induction of some ISGs by dsRNA suggests that the encoded proteins will be induced in virally infected cells without any involvement of IFNs. Thus regulation of viral gene expression by these proteins is relevant for all infected cells, even in the absence of IFN treatment. Several transcription factors, such as NF-κB, IRF-3, and ATF-1, are known to be activated by dsRNA (7). Their activation is mediated by protein kinases including PKR, p38, JNK2, and IKK (7, 8), although the pathways of activation are not completely understood. For genes that are induced by either IFN or dsRNA, the same cis-element regulates their induction by both reagents. But entirely different signaling pathways and transcription factors are used by the two inducers (5). There has not been any attempt to systematically define the full repertoire of dsRNA-regulated genes. 
Identification of these genes is required not only for revealing the nature of all signaling pathways used by dsRNA but also for defining the set of proteins that are induced by dsRNA or virus infection. In the current study, we started this investigation using a cDNA microarray hybridization analysis of RNA isolated from dsRNA-treated and -untreated GRE cells that are devoid of the type I IFN locus and cannot synthesize IFNs. Using this approach, we have identified more than a hundred DSGs, only a few of which were previously known to be dsRNA-inducible. Furthermore, we also identified multiple down-regulated genes. These genes were induced or repressed by dsRNA strongly, rapidly, and transiently. The encoded proteins are involved in a broad range of cellular functions and metabolic pathways. 0 EXPERIMENTAL PROCEDURES 0 dsRNA-regulated Gene Expression 0 Identification of dsRNA-regulated Genes (DRGs)--For undertaking a systematic analysis of human DRGs, we chose to use the glioma cell line, GRE (5). These cells lack the type I IFN locus and hence cannot synthesize IFN-β or any of the multiple IFN-α species in response to dsRNA or other stimuli. Because dsRNA treatment of GRE cells cannot induce IFNs, the possibility of secondary induction of the IFN-stimulated genes by autocrine actions of IFNs was eliminated. This consideration was highly pertinent because dsRNA is known to be a potent inducer of IFNs, and several DSGs are known to be induced by IFN as well. GRE cells were treated with the dsRNA, poly(I)·poly(C), for 6 h and poly(A) RNA was isolated from treated and untreated cells. We chose the length of treatment to be 6 h, because our previous studies have shown that this is the optimum time for induction of 561 mRNA that encodes the 56 kDa protein, P56 (5). The two sets of 0 Copyright 1997 by the American Chemical Society 0 The Efficiency of Light-Directed Synthesis of DNA Arrays on Glass Substrates 1 Glenn H. McGall,* Anthony D. 
Barone, Martin Diggelmann, Stephen P. A. Fodor, Erik Gentalen, and Nam Ngo 0 building blocks in combination with polymeric semiconductor photoresist films as the photoimageable component.3 The development of chemistry and processes for DNA array 0 American Chemical Society 0 McGall et al. Scheme 1 0 (acetic anhydride/1-methylimidazole/2,6-lutidine/THF) and oxidation (I2/pyridine-H2O).7 After removing the acyl protecting groups from the bound fluorescein, relative densities of hydroxyl groups in different regions of the support could then be determined from surface fluorescence intensities. 0 For the purpose of this study, it was not necessary to achieve an absolute measure of the amount of bound fluorescein in any given region of the substrate, although the photon-counting capability of the fluorescence microscope would, in principle, enable one to do so. Instead, differences in surface fluorescence were used to obtain relatiVe values for surface density, providing a simple, internally consistent method for measuring chemical and photochemical efficiencies. 0 Beaucage, S. L. In Protocols for Oligonucleotides and Analogs; Agrawal, S., Ed.; Humana Press: Totowa, New Jersey, 1993; pp 33-61. 0 Light-Directed Synthesis of DNA Arrays on Glass Scheme 2 0 One potential source of interference with this kind of analysis is fluorescence quenching due to energy transfer interactions between adjacent fluorophores on the surface. The initial density of surface functional groups on the silanated glass substrates that were used in this work have been estimated to be in the range of 10-30 pmol/cm2.6 Assuming that the initial silanation of the support g 0 AAAI Press 0 The value of prior knowledge in discovering motifs with MEME 1 Timothy L. Bailey and Charles Elkan 0 MEME is a tool for discovering motifs in sets of protein or DNA sequences. 
This paper describes several extensions to MEME which increase its ability to find motifs in a totally unsupervised fashion, but which also allow it to benefit when prior knowledge is available. When no background knowledge is asserted, MEME obtains increased robustness from a method for determining motif widths automatically, and from probabilistic models that allow motifs to be absent in some input sequences. On the other hand, MEME can exploit prior knowledge about a motif being present in all input sequences, about the length of a motif and whether it is a palindrome, and (using Dirichlet mixtures) about expected patterns in individual motif positions. Extensive experiments are reported which support the claim that MEME benefits from, but does not require, background knowledge. The experiments use seven previously studied DNA and protein sequence families and 75 of the protein families documented in the Prosite database of sites and patterns, Release 11.1. 0 The new sequence model type allows each each sequence in the training set to have exactly zero or one occurrences of each motif. This type of model is ideally suited to discovering multiple motifs in the majority of cases encountered in practice. The motif-width heuristic allows MEME to automatically discover several motifs of differing, unknown widths in a single DNA or protein dataset. We also describe an improved method of finding multiple, different motifs in a single dataset. 0 Overview of MEME 0 The principal input to MEME is a set of DNA or protein sequences. Its principal output is a series of probabilistic sequence models, each corresponding to one motif, whose parameters have been estimated by expectation maximization (Dempster, Laird, & Rubin 1977). 
In a nutshell, MEME's algorithm is a combination of expectation maximization (EM), 0 OOPS, ZOOPS, and TCM models 0 The different types of sequence model supported by MEME make differing assumptions about how and where motif occurrences appear in the dataset. We call the simplest model type OOPS since it assumes that there is exactly one occurrence per sequence of the motif in the dataset. This type of model was introduced by Lawrence & Reilly (1990). This paper describes for the first time a generalization of OOPS, called ZOOPS, which assumes zero or one motif occurrences per dataset sequence. Finally, TCM (two-component mixture) models assume that there 0 Supported by NIH Genome Analysis Pre-Doctoral Training Grant No. HG00005. 0 MEME is an unsupervised learning algorithm for discovering motifs in sets of protein or DNA sequences. This paper describes the third version of MEME. Earlier versions were described previously (Bailey & Elkan 1994), (Bailey & Elkan 1995a). The MEME extensions on which this paper focuses are methods of incorporating background knowledge, or coping with its lack. For incorporating background knowledge, these innovations include automatic detection of inverse-complement palindromes in DNA sequence datasets, and using Dirichlet mixture priors with protein sequence datasets. Dirichlet mixture priors bring information about which amino acids share common properties and thus are likely to be interchangeable in a given position in a protein motif. This paper also describes a new type of sequence model and a new heuristic for automatically determining the width of a motif which remove the need for the user to provide two types of information. 0 an EM-based heuristic for choosing the starting point for EM, a maximum likelihood ratio-based (LRT-based) heuristic for determining the best number of model free parameters, multistart for searching over possible motif widths, and greedy search for finding multiple motifs. 0 for . 
The last column is an inverted version of the first column, the second to last column is an inverted version of the second column, and so on. As will be described below, MEME automatically chooses whether or not to enforce the palindrome constraint, doing so only if it improves the value of the LRT-based objective function. 0 Expectation maximization 0 Consider searching for a single motif in a set of sequences by fitting one of the three sequence model types to it. The dataset of $n$ sequences, each of length $L$, will be referred to as $X = \{X_1, \ldots, X_n\}$. There are $m = L - W + 1$ possible starting positions for a motif occurrence of width $W$ in each sequence. The starting point(s) of the occurrence(s) of the motif, if any, in each of the sequences are unknown and are represented by the variables $Z = \{Z_{i,j}\}$ (called the "missing information"), where $Z_{i,j} = 1$ if a motif occurrence starts in position $j$ in sequence $X_i$, and $Z_{i,j} = 0$ otherwise. The user selects one of the three types of model and MEME attempts to maximize the likelihood function of a model of that type given the data, $\Pr(X \mid \phi)$, where $\phi$ is a vector containing all the parameters of the model. MEME does this by using EM to maximize the expectation of the joint likelihood of the model given the data and the missing information, $\Pr(X, Z \mid \phi)$. This is done iteratively by repeating the following two steps, in order, until a convergence criterion is met. E-step: compute 0 $Z^{(t)} = E_{(Z \mid X, \phi^{(t)})}[Z]$ 0 M-step: solve 0 $\phi^{(t+1)} = \arg\max_{\phi} E_{(Z \mid X, \phi^{(t)})}[\log \Pr(X, Z \mid \phi)]$ 0 DNA palindromes 0 where $\phi$ is a vector containing all the parameters of the model. This process is known to converge (Dempster, Laird, & Rubin 1977) to a local maximum of the likelihood function $\Pr(X \mid \phi)$. Joint likelihood functions. MEME assumes each sequence in the training set is an independent sample from a member of either the OOPS, ZOOPS or TCM model families and uses EM to maximize one of the following likelihood functions. 
The logarithm of the joint likelihood for models 0 It is not necessary that all of the sequences be of the same length, but this assumption will be made in what follows in order to simplify the exposition of the algorithm. In particular, under this assumption, . 0 That is, 0 A DNA palindrome is a sequence whose inverse complement is the same as the original sequence. DNA binding sites for proteins are often palindromes. MEME models a DNA palindrome by constraining the parameters of corresponding columns of a motif to be the same: 0 Here, is the probability of letter occurring at either a background position (I ) or at position of a motif occurrence (Q ), is the parameters of the background component of the sequence model, and is the parameters of the motif component. Formally, the parameters of an OOPS model are the letter frequencies for the background and each column of the motif, and the width of the motif. The ZOOPS model type adds a new parameter, , which is the prior probability of a sequence containing a motif occurrence. A TCM model, which allows any number of (non-overlapping) motif occurrences to exist within a sequence, replaces with , where is the prior probability that any position in a sequence is the start of a motif occurrence. 0 rGFd 0 are zero or more non-overlapping occurrences of the motif in each sequence in the dataset, as described by Bailey & Elkan (1994). Each of these types of sequence model consists of two components which model, respectively, the motif and nonmotif ("background") positions in sequences. A motif is modeled by a sequence of discrete random variables whose parameters give the probabilities of each of the different letters (4 in the case of DNA, 20 in the case of proteins) occurring in each of the different positions in an occurrence of the motif. The background positions in the sequences are modeled by a single discrete random variable. 
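The palindrome definition and the column constraint described above can be illustrated with a short sketch. It is not MEME's code: the function names are hypothetical, the motif is represented as a plain list of per-column letter-probability dictionaries, and averaging each column with its inverse-complement partner is only one simple way (assumed here) of forcing corresponding columns to agree.

```python
# Complement of each DNA letter.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def inverse_complement(seq):
    """Return the inverse (reverse) complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_dna_palindrome(seq):
    """True if the sequence equals its own inverse complement."""
    return seq.upper() == inverse_complement(seq)


def symmetrize_motif(columns):
    """Force a motif to satisfy the palindrome constraint.

    columns: list of dicts mapping 'A', 'C', 'G', 'T' to probabilities.
    Column j is averaged with the complemented mirror column W-1-j
    (0-based), so that p[a][j] == p[complement(a)][W-1-j] afterwards.
    Averaging is an assumption made for this illustration.
    """
    width = len(columns)
    result = []
    for j, col in enumerate(columns):
        mirror = columns[width - 1 - j]
        result.append(
            {b: 0.5 * (col[b] + mirror[COMPLEMENT[b]]) for b in "ACGT"}
        )
    return result


if __name__ == "__main__":
    print(is_dna_palindrome("GAATTC"))   # EcoRI site
    print(is_dna_palindrome("GATTACA"))
```

Because each input column sums to one, every averaged column also sums to one, so the result remains a valid set of letter distributions.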
If the width of the motif is , and the alphabet for sequences is , we can describe the parameters of the two components of each of the three model types in the same way as 0 For a ZOOPS model, the joint log likelihood is 0 For a ZOOPS model, 0 For a TCM model, 0 The M-step. The M-step of EM in MEME reestimates using the following formula for models of all three types: 0 if otherwise. 0 Finding multiple motifs 0 All three sequence model types supported by MEME model sequences containing a single motif (albeit a TCM model can describe sequences with multiple occurrences of the same motif). To find multiple, non-overlapping, different motifs in a single dataset, MEME uses greedy search. It incorporates information about the motifs already discovered into the current model to avoid rediscovering the same motif. The process of discovering one motif is called a pass of 0 The conditional probability of a lengthsubsequence generated according to the background or motif component of a TCM model is defined to be 0 is a vector-valued indicator variable of lengt 0 New topical antiandrogenic formulations can stimulate hair growth in human bald scalp grafted onto mice 1 Amnon Sintov a,*, Sima Serafimovich b, Amos Gilhar b 0 Keywords: Androgenetic alopecia; Flutamide; Finasteride; Topical drug delivery; Skin permeation; Mice 0 Introduction Testosterone metabolites exert a significant hormonal influence on hair growth by interacting with receptors at the follicular papilla. It has long been known that an increased susceptibility of 0 scalp follicles to these androgens is the main cause of androgenetic alopecia (or male-pattern baldness) in genetically predisposed individuals (Imperato-McGinley et al., 1974; Ebling et al., 1991). In this type of alopecia, scalp follicles exhibit increased levels and activity of scalp 5a-reductase isoenzyme, which converts testosterone (T) to dihydrotestosterone (DHT) (Bingham and Shaw, 1973; Schweikert and Wilson, 1974). 
Taken together, increased conversion of T to DHT and increased DHT binding capacity in bald scalp as compared to hairy scalp (Sawaya et al., 1989) provide a mechanistic explanation for androgenetic alopecia. DHT shortens the hair cycle and progressively miniaturizes scalp follicles. The miniaturized follicles all remain present and thus the possibility of reversal by re-enlargement exists. It is reasonable, therefore, to suppose that by administration of 5α-reductase inhibitors and/or non-steroidal antiandrogens, this reversal should occur. Finasteride, a 4-azasteroid inhibitor of 5α-reductase, was introduced by Merck in 1989. Finasteride is known to inhibit the prostate 5α-reductase isoenzyme type 2 more effectively than the type 1 isoenzyme predominantly found in the skin of the scalp. However, while type 1 isoenzyme is located in the sebaceous glands, there is still significant activity of type 2 isoenzyme in the hair follicles (Sawaya and Price, 1997). This is, therefore, the reason why finasteride decreased the level of DHT in bald scalps after long-term oral administration (Diani et al., 1992; Dallob et al., 1994); it also provides the justification for the topical mode of delivery. It should be emphasized that oral finasteride has already been introduced as an effective hair growth treatment, with only minor systemic adverse effects. Nevertheless, systemic therapy for a disorder such as male-pattern baldness is obviously not the treatment of choice if the option of topical delivery is available. Another agent with hair growth potential is the nonsteroidal anti-androgen flutamide. This drug, produced by Schering-Plough, was introduced as a new potent compound for treatment of prostatic carcinoma (Martindale, 1993). The systemic administration of flutamide causes several unwanted side effects, such as reducing libido and impairing spermatogenesis in men and feminizing male fetuses in pregnant women.
Topical administration, therefore, is an important goal for such a drug, especially if indicated for skin disorders. In a comparative study, Chen et al. (1995) showed that topical administration of finasteride (in ethanol/propylene glycol vehicle) caused local inhibition of androgen-controlled sebaceous gland growth in hamster flank organ and that it had a similar action to that of the same doses of flutamide. To date, clinical studies have not been performed for testing the efficacy of topical flutamide in male-pattern baldness. It is likely that the success (i.e. efficacy with minimal systemic exposure) of this drug would be dependent on a well-designed vehicle that would increase skin accumulation and decrease percutaneous absorption. In this paper, we present a new topical base formulation for finasteride and flutamide (representing two anti-DHT categories). We studied the effect of the topical preparations of these two compounds on the growth of human hair in a murine transplantation model. The effect was monitored in scalp skin biopsies taken from bald subjects before plastic surgery procedures. This model, which has been described previously by Gilhar et al. (1988), Van Neste (1996) and De Brouwer et al. (1997), is specific to male-pattern baldness, in which hairs of the bald skin graft do not re-enlarge after transplantation, while the hair of grafts taken from patients with alopecia areata (an auto-immune problem) begins to grow shortly after transplantation (Gilhar and Krueger, 1987). To correlate the pharmacological efficacy of the new drug-vehicle system with its cutaneous penetration properties, topical preparations containing flutamide were tested in vitro using excised hairless mouse skin. 0 Materials and methods 0 Formulation 0 Gel preparations containing 1% of flutamide (Eulexin, Schering-Plough Lab., Belgium) or finasteride (Proscar, Merck Sharp & Dohme, UK) were produced as follows.
The drug was dissolved in ethyl alcohol (30% w/w in the final gel for flutamide, and 58% w/w in the final gel for finasteride); then 1% glyceryl oleate (as an enhancer) and distilled water were added gradually with mixing. The solutions were finally gelled by adding 4% hydroxypropyl methylcellulose (for flutamide) or ethylcellulose (for finasteride). A vehicle corresponding to the flutamide formulation but containing no drugs was prepared for the purpose of in vivo comparison. In addition, a 1% flutamide formulation without enhancer was prepared and tested in vitro together with the formulation containing the enhancer (as described above), and a hydroalcoholic formulation (1:1 ethanol-water). 0 Animals 0 Severe combined immune deficient mice (male Prkdc SCID-Charles River, UK), 2-3 months of age, were used in this study. The mice were grown in a pathogen-free animal facility. 0 Skin grafting 0 Punch grafts, 0.5 mm², obtained from scalp skin of five bald men were used for transplantation to the SCID mice (three grafts per mouse). The transplantation procedure was performed as previously described (Gilhar et al., 1988). Each graft was inserted, through an incision in the skin, into the subcutaneous tissue over the lateral thoracic cage of each mouse, and covered with a standard band aid dressing. The dressing was removed on day 7, and the grafts, which were located at the surface, were treated from day 8 for 60 days as described below. The procedure protocol related to animals was reviewed and approved by the Institutional Animal Care and Use Committee. 0 Treatment 0 Specimens of each topical preparation, 20-30 mg, were spread gently over each transplanted 0 Table 1 Distribution of the histological hair structures in the treated grafts

                     Anagen (%)   Catagen (%)   Telogen (%)
Before treatment                  35.7          64.2
Finasteride          30.4         22.8          46.8
Flutamide            47.0         26.5          26.5
Vehicle (control)    10.5         24.6          64.9

[Table comparing T and DHT for finasteride, flutamide and vehicle (control); values not recoverable.] a No difference between groups was found for T or DHT (P > 0.05). 0 scopically in the horizontal sections with the aid of a calibrated ocular micrometer. Hair structures in the histological specimens were counted. 0 In vitro permeation testing 0 The in vitro diffusion of a topical drug through skin (in which the flux of the drug molecules through human cadaver or animal skin is determined) was performed basically according to the FDA guidelines (Skelly et al., 1987). Bas 0 Ecdysone-regulated puff genes 2000 1 C.S. Thummel 0 Keywords: Ecdysone; Drosophila metamorphosis; Gene regulation 0 these hormones could act directly on the nucleus, triggering a complex regulatory cascade of gene expression (Yamamoto and Alberts, 1976). Through a series of detailed and elegant studies, Ashburner and co-workers proposed a model for the regulation of gene expression by 20-hydroxyecdysone (referred to hereafter as ecdysone) (Fig. 1). Briefly, this model proposed that ecdysone, bound to its specific receptor, directly induces the expression of a small set of early regulatory genes. The protein products of these genes, in turn, repress their own expression and induce a much larger set of late target genes. It was assumed that these late genes would function as effectors that directly or indirectly control the appropriate biological responses to the pulse of ecdysone. Ashburner and colleagues also determined that the late puffs could be divided into two classes, based on their regulation by ecdysone (Ashburner and Richards, 1976). The early-late puffs are induced relatively rapidly after the addition of hormone and require the continuous presence of ecdysone for their activity, much like the early puffs. The late-late puffs, in contrast, are induced at later times and are prematurely induced upon ecdysone withdrawal.
This latter result was interpreted to mean that the ecdy- 0 E63-1: an ecdysone-inducible calcium binding protein that can regulate salivary gland glue secretion 0 Molecular analysis of the 63F early puff provided the first evidence that not all early puffs encode transcriptional regulators. This work identified a pair of divergently transcribed ecdysone-inducible genes: E63-1 and E63-2 (Andres and Thummel, 1995). E63-2 produces a single 1.2 kb mRNA with no extended open reading frames. Genetic studies indicate that this gene has no essential functions during development, suggesting that it may only be expressed due to its proximity to E63-1 (Vaskova et al., 2000). In contrast, E63-1 encodes a calcium-binding protein with four EF hands, most closely related to calmodulin. The regulation of E63-1 provides a further departure from prior studies of early puff genes, in that it is induced by ecdysone in a tissue-specific manner. Low to moderate levels of E63-1 are widely expressed in third instar larvae, prior to the late larval ecdysone pulse. Only in the salivary gland is E63-1 transcription rapidly and directly induced by the hormone at puparium formation (Andres and Thummel, 1995). This restricted pattern of induction, combined with the known role of calcium-binding proteins in regulating secretion, led to the proposal that E63-1 might contribute to the physiology of the salivary gland by regulating ecdysone-induced secretion. Although loss-of-function mutants provide an ideal means of testing this model, inactivation of the E63-1 gene has no detectable effect on viability or reproduction (Vaskova et al., 2000). In retrospect, this is not surprising, given that other calcium-binding proteins are encoded by the Drosophila genome. Consistent with possible functional redundancy in this pathway, recent studies have shown that salivary glands compromised for both calmodulin and E63-1 are defective in glue secretion (T.V. Do and A.J. Andres, personal communication).
In addition, ectopic expression of E63-1 in transgenic animals is sufficient to trigger glue secretion if the intracellular calcium levels are elevated (Biyasheva et al., 2001). Moreover, ecdysone alone can lead to increased levels of intracellular calcium in larval salivary glands, with a detectable increase after 2 h of exposure. Ecdysone thus leads to two responses that can synergistically trigger salivary gland glue secretion: increased levels of E63-1 expression and increased cytoplasmic calcium levels (Fig. 2). Although the time frame for calcium elevation suggests that this is a secondary response to the hormone, the mechanism by which calcium levels are affected remains to be determined. E63-1 protein shows dynamic changes in its subcellular distribution as the salivary glands secrete glue, providing further evidence of a possible role in glue secretion (Vaskova et al., 2000). Initially, before the glue is secreted, E63-1 is localized to cell membranes, in the 0 The E23 early puff gene may regulate ecdysone responses by controlling intracellular hormone concentrations 0 The 23E ecdysone-inducible puff is among the last early puffs described by Ashburner to be 0 Special Feature 0 Signalling by CD95 and TNF receptors: Not only life and death 0 Walter and Eliza Hall Institute of Medical Research, Royal Melbourne Hospital, Parkville, Victoria, Australia 0 Summary Members of the TNF family of receptors play important roles in normal physiology and in defence. The recent rapid progress in the understanding of the mechanisms of apoptosis has been accompanied by assumptions that TNF family receptors such as CD95 (Fas/APO-1) only have a role in regulating cell survival. While regulation of cell death is one important function of TNF family receptors, they are capable of activating signal transduction pathways that have many other effects.
The present review will focus on signalling of some TNF family receptors in the immune system, not only for apoptosis, but also for survival or activation. Key words: apoptosis, CD95, NF-κB, signal transduction, TNF receptors. 0 TNF receptor family 0 The tumour necrosis factor receptor (TNFR)/nerve growth factor receptor (NGFR) family of molecules regulates a number of biological functions, such as growth, differentiation and apoptosis in multiple cell types. In the immune system, members of this receptor family are involved in the development of peripheral lymphoid organs, regulation of induced inflammatory responses and removal of cells at the end of an immune response. The TNFR family consists of more than 15 different molecules. Most are type I membrane proteins which resemble each other largely in their extracellular regions, which all contain 2-6 characteristic cysteine-rich domains.1 The TNF family receptors are activated upon binding of their cognate ligands, most of which are trimers with a structure similar to TNF. Sometimes the ligands are cell-bound type II membrane proteins, but several are cleaved off and appear as soluble trimers. Induction of trimers or higher order complexes of the TNF family of receptors allows their cytoplasmic domains to aggregate intracytoplasmic signalling molecules. 0 so-called because it is required for these receptors to transmit apoptotic signals. The DD is a protein-protein interaction motif consisting of six alpha helices that allow two proteins with DD to bind to each other.
Structurally the DD is related to two other homotypic interaction domains, the death effector domain (DED), and the caspase recruitment domain (CARD).2 0 Death domain adaptors: TRADD, FADD, RIP and RAIDD 0 Binding of TNF to TNFR1 induces recruitment of the DD-containing protein TRADD to the DD of TNFR1.3 Overexpression of TRADD alone also induces the TNF-regulated responses apoptosis and activation of the transcription factors NF-κB and Jun kinase (JNK), presumably because TRADD provides docking sites for downstream signalling proteins to the receptor complex.4 Two of the proteins that TRADD recruits to the signalling complex also bear death domains. One of these, RIP, has an N-terminal kinase domain and a C-terminal DD. Knockout studies have shown that RIP is required for induction of NF-κB by TNF.5 The other, Fas-associated protein with death domain (FADD), has a C-terminal DD, and an N-terminal DED. FADD is required for cell death signalling by TNFR1 and also by CD95, to which it binds directly via its death domain.6-8 The DED of FADD allows it to bind to the DED in the pro-domain of caspase 8. Through these interactions, ligation of TNFR1 or CD95 can result in the formation of a death-inducing signalling complex, which leads to activation of caspase 8, a cell death effector protease. Once activated, caspase 8 cleaves and activates downstream caspases, such as caspase 3, ultimately leading to cell death.
Because cells from mice lacking caspase 8 are resistant to death induced by TNF receptors, CD95 and DR3, apoptosis triggered by all of these receptors must converge on this caspase.9 However, FADD must have other functions because FADD knockout mice die during embryogenesis, and lymphocytes from FADD-dominant negative transgenic mice do not proliferate normally in response to T cell mitogens in vitro.10-12 0 Signalling pathways controlled by TNF receptors 0 The cytoplasmic domains of the TNFR family, which are more diverse than the extracellular portions, do not have any intrinsic enzymatic activity, hence they signal by inducing aggregation of intracellular adaptor molecules (Fig. 1). 0 Death domains 0 The cytoplasmic domains of TNFR1 (p55), CD95 (Fas/APO-1), NGFR (p75), death receptor (DR) 3, TRAIL-R1 and TRAIL-R2 all bear a motif termed a `death domain' (DD), 0 1 C Magnusson and DL Vaux 0 The group of TNF receptor-associated factors (TRAF) interact with members of the TNFR family. Six TRAF proteins have been identified to date: TRAF1, TRAF2, TRAF3 (CRAF, LAP-1, CD40-bp), TRAF4 (CART1), TRAF5 and TRAF6 (review18). With the exception of TRAF4, TRAF proteins interact with receptor molecules either directly, or indirectly through binding to other TRAF, or through binding to TRADD. The TNFR2 (p75), CD40, CD30 and lymphotoxin-β receptor (LTβR) contain conserved, cytoplasmic TRAF binding motifs and are able to bind directly to TRAF proteins. Because TRAF2 can bind to TRADD, which in turn can associate with TNFR1, TRAF2 can indirectly participate in signalling from this receptor as well. The TRAF molecules share similar C-terminal domains, designated the TRAF domain, which is involved in protein-protein interactions. TRAF2, TRAF3, TRAF5 and TRAF6 also bear an N-terminal RING finger, a zinc binding motif found in several types of intracellular proteins.19-23 TNF receptor-associated factor proteins interact as homodimers or in heterodimeric complexes.
For example, TRAF2 binds to TRADD, the TNFR2, LTβR, CD40 or CD30 via its C-terminal TRAF domain, probably as a heterodimeric complex with TRAF1 or TRAF5, or as a homodimer.18,19 It has also been shown that TRAF proteins may signal from other receptors in addition to TNFR family molecules. TRAF6, which binds to CD40, is also involved in IL-1 receptor signalling through interaction with IRAK, a serine/threonine kinase that also has a DD.24 Studies of TRAF2 and TRAF3 knockout mice have shown that TRAF proteins are required for activation of Jun/AP-1 signalling by TNF receptors, and have important roles for normal development, since these mice die during early life.25,26 0 RIP is an adaptor protein with a C-terminal death domain that can associate with the DD in the cytoplasmic domain of CD95. Via TRADD, RIP can also associate with the TNFR1.4 Cells from RIP knockout mice show increased susceptibility to TNF-mediated killing and fail to activate NF-κB in response to TNF.5 This indicates that RIP is required for NF-κB activation by TNF. Because RIP is a serine/threonine kinase, it is likely to phosphorylate, and thereby activate, kinases that phosphorylate the inhibitor of NF-κB, IκB.13 Interestingly, RIP knockout mice also have abnormal development of lymph nodes, similar to those in lymphotoxin β (LTβ) receptor-deficient mice.14,15 Therefore it is possible that RIP also takes part in signalling from these receptors. However, because the LTβR lacks a DD, if it does signal via RIP then it must do so indirectly (see following). Another DD-bearing adaptor molecule implicated in TNF signalling of apoptosis is `RIP-associated ICH-1/CED-3-homologous protein with a death domain' (RAIDD). In addition to the DD, RAIDD has a CARD which allows it to bind to the CARD of procaspase 2.16 Overexpression of RAIDD in vitro induces apoptosis, suggesting that this interaction is functional.
However, the significance of this pathway for induction of cell death is uncertain because neither CD95 ligand (CD95L) nor TNF is able to induce apoptosis in mice lacking FADD or caspase 8. In these mice, RAIDD and caspase 2 would presumably be able to function normally. Furthermore, TNF-α was still able to induce cell death in the absence of caspase 2.17 0 Inhibitor-of-apoptosis proteins 0 In some cell types in vitro, ligation of CD95 is able to activate the JNK/SAPK pathway. A candidate for mediating this activity is the CD95 `death domain-associated protein' Daxx, which was identified in yeast two-hybrid 0 Springer-Verlag 1997 1 Russell L. Margolis · Meena R. Abraham · Shawn B. Gatchell · Shi-Hua Li · Arif S. Kidwai · Theresa S. Breschel · O. Colin Stine · Colleen Callahan · Melvin G. McInnis · Christopher A. Ross 0 cDNAs with long CAG trinucleotide repeats from human brain 0 Trinucleotide repeat expansion mutation is now known to cause 12 diseases, most with neuropsychiatric features (Linblad and Schalling 1996; Paulson and Fischbeck 1996; Ross 1995; Zoghbi 1996). Seven of these are known as the type 1 disorders - spinocerebellar ataxia type 1 (SCA1, Orr et al. 1993), SCA2 (Imbert et al. 1996; Pulst et al. 1996; Sanpei et al. 1996), Machado-Joseph disease (MJD or SCA3, Kawaguchi et al. 1994), SCA6 (Zhuchenko et al. 1997), dentatorubral pallidoluysian atrophy (DRPLA, Koide et al. 1994; Nagafuchi et al. 1994), Huntington's disease (HD, Huntington's Disease Collaborative Research Group 1993), and spinal and bulbar muscular atrophy (SBMA, La Spada et al. 1991). Each is caused by a (CAG)n expansion in an open reading frame, resulting in an expanded glutamine repeat. The properties of the repeats in the other (type 2) expansion mutation diseases vary widely. Myotonic dystrophy is caused by a 3′ untranslated (CTG)n expansion (Brook et al. 1992; Fu et al. 1992; Mahadevan et al. 1992), the A and E forms of fragile X syndrome (Fu et al.
1991; Knight et al. 1993; Kremer et al. 1991; Verkerk et al. 1991) and some cases of Jacobsen's syndrome (Jones et al. 1995) result from 5′ untranslated region (CCG)n expansions, and Friedreich's ataxia is caused by an intronic (GAA)n expansion (Campuzano et al. 1996). Expandable trinucleotide repeats therefore are found in translated, transcribed but untranslated, and intronic regions; they may be G-C or A-T rich and range from minimal to highly variable in length in the normal population. At least four lines of evidence indicate that additional disorders may arise from trinucleotide repeat expansion mutations. First, an antibody (IC2) that specifically recognizes expanded glutamine repeats detects an expansion segregating with SCA7 (Trottier et al. 1995). Second, indirect evidence of CAG expansion has been detected using rapid expansion detection (RED, Schalling et al. 1993) in a pedigree with SCA7, and less clearly in heterogeneous populations of patients with bipolar affective disorder and schizophrenia (Linblad et al. 1996; Linblad and Schalling 1996; O'Donovan et al. 1995). Third, several neurodegenerative disorders, including SCA4, SCA5, SCA7, and familial Parkinson disease, are phenotypically similar to the type 1 expansion mutation disorders. Fourth, anticipation, the phenomenon of increasing phenotypic severity or decreasing age of onset in successive generations affected by a disease (McInnis 1996; Ross et al. 1993), is found in most of the expansion mutation diseases. Anticipation has been detected in a disparate group of other diseases, including affective disorder (Engstrom et al. 1995; McInnis et al. 1993; Nylander et al. 1994), schizophrenia (Chotai et al. 1995; Gorwood et al. 1996; Stober et al. 1995; Thibaut et al. 1995), autism (Stine 1993), familial Parkinsonism (Bonifati et al. 1995; Markopoulou et al. 1995; Payami et al. 1995; Plante-Bordeneuve et al. 1995), familial leukemias (Horwitz et al. 1996), Crohn's disease (Polito et al.
1996), Meniere's disease (Morrison 1995), torsion dystonia (LaBuda et al. 1993), rheumatoid arthritis (McDermott et al. 1996), facioscapulohumeral muscular dystrophy (Tawil et al. 1996), Holt-Oram syndrome (Newbury-Ecob et al. 1996), and familial spastic paraplegia (Raskind et al. 1997). We have sought to identify candidate genes for these disorders by screening cDNA libraries for the presence of DNA fragments containing CAG, CCG, CCA, and AAT trinucleotide repeats (Li et al. 1993; Margolis et al. 1995a, b). Our description of CTG-B37, a cDNA fragment with a highly polymorphic CAG repeat located within an open reading frame on chromosome 12, directly led to the finding that an expansion mutation within the CTG-B37 repeat causes DRPLA (Koide et al. 1994; Nagafuchi et al. 1994). This same strategy of screening cDNA libraries for trinucleotide repeats was later employed to identify the MJD gene (Kawaguchi et al. 1994) and the SCA6 gene (Zhuchenko et al. 1997). Screening genomic contigs for trinucleotide repeats was used to clone the gene for SCA2 (Pulst et al. 1996). Based on the repeats that expand to cause disease, repeats with the highest likelihood of undergoing expansion mutation consist of at least six consecutive CAG or CTG triplets in the transcribed portions of genes expressed in brain. To identify genes with these features, we have screened human adult frontal cortex and fetal brain cDNA libraries at high stringency for the presence of CAG or CTG repeats. We now report the identification and mapping of 19 of these cDNA fragments.
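The repeat criterion stated above (at least six consecutive CAG or CTG triplets) can be expressed as a short sequence scan. This is only an in silico illustration of the criterion; the screening reported here was done on cDNA libraries at high stringency, not computationally. The function name, its parameters and the example sequence are hypothetical.

```python
import re


def find_triplet_repeats(seq, min_units=6, triplets=("CAG", "CTG")):
    """Find uninterrupted runs of a triplet repeated at least min_units times.

    Returns a sorted list of (start_index, triplet, number_of_units),
    with start_index 0-based on the given strand only.
    """
    seq = seq.upper()
    hits = []
    for triplet in triplets:
        # e.g. (?:CAG){6,} matches six or more consecutive CAG units.
        pattern = re.compile("(?:%s){%d,}" % (triplet, min_units))
        for match in pattern.finditer(seq):
            units = (match.end() - match.start()) // 3
            hits.append((match.start(), triplet, units))
    return sorted(hits)


if __name__ == "__main__":
    # Made-up sequence: 7 CAG units (reported) and 5 CTG units (too short).
    example = "AT" + "CAG" * 7 + "TT" + "CTG" * 5
    print(find_triplet_repeats(example))
```

The scan covers only the strand supplied; a CAG run on one strand appears as a CTG run on the other, which is why both triplets are searched by default.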
0 Materials and methods 0 cDNA cloning Adult human 0 Effects of a motilin receptor agonist (ABT-229) on upper gastrointestinal symptoms in type 1 diabetes mellitus: a randomised, double blind, placebo controlled trial 1 N J Talley, M Verlinden, D J Geenen, R B Hogan, D Riff, R W McCallum, R J Mack 0 Motilin is a 22 amino acid peptide hormone that is expressed throughout the gut.1 Motilin stimulates interdigestive antral contractions promoting gastric emptying; the receptor has recently been identified.2 Erythromycin is a potent motilin agonist, inducing phase 3 of the migrating motor complex1; it accelerates gastric emptying in healthy volunteers as well as in patients with diabetic gastroparesis or those post-vagotomy.3 4 Dyspepsia is a common problem in patients with diabetes mellitus.5 6 Between 27% and 58% of type 1 diabetics are reported to have gastroparesis, usually affecting solids but less often liquids.7 8 Symptoms of diabetic gastroparesis include postprandial distress, early satiety, bloating, fullness, and nausea and vomiting, but while gastroparesis is common, only a minority have overt symptomatology.7 8 Moreover, these symptoms also occur frequently in diabetics who do not have objective evidence of gastroparesis.6 The underlying mechanisms remain in dispute but disturbed vagal parasympathetic function and poor glycaemic control may both be important.8 9 In addition, increased levels of motilin have been observed in diabetic gastroparesis, which is likely to be a compensatory mechanism as motilin levels decreased with the introduction of a prokinetic.10 A prokinetic agent in diabetic gastroparesis has the potential to increase gastric emptying, improve dyspepsia, and better control plasma glucose levels. There has therefore been considerable interest in developing new prokinetics for gastroparesis, including motilin agonists that lack antibiotic activity.
ABT-229 has potent motilin agonist activity with essentially no antibiotic action.11 12 It dose dependently accelerates gastric emptying, and has a half life of 20 hours.11 12 Multidose studies have shown that the maximally effective dose was 5 mg twice daily for accelerating gastric emptying and 2.5 mg twice daily retained a modest but significant prokinetic effect.12 We aimed to test the hypothesis that ABT-229 would relieve postprandial symptoms in patients with diabetes mellitus. We further hypothesised that the maximum therapeutic gain over placebo would be observed in patients with diabetic gastroparesis on higher doses of ABT-229. To test these hypotheses, we conducted a randomised, placebo controlled, dose ranging trial in North American patients with type 1 diabetes mellitus. 0 Abbreviations used in this paper: HbA1c, glycated haemoglobin. 0 Methods 0 The trial was approved by the local institutional review boards, and all patients gave informed consent. 0 PATIENT SELECTION 0 Ambulatory patients at least 18 years of age with documented type 1 diabetes were eligible to be enrolled. All patients were by definition insulin dependent. A minimum three month history of chronic upper abdominal discomfort (that is, one or more of postprandial fullness, bloating, epigastric discomfort, early satiety, belching after meals, postprandial nausea, vomiting, or epigastric pain) was required. A total of 383 patients were screened (by 33 investigators in the USA and three in Canada between June 1997 and August 1998) (fig 1). Patients were required to have a normal upper endoscopy (that is, no ulcers or erosions in the oesophagus and gastroduodenum) in the three months before randomisation.
Furthermore, during the baseline evaluation over 14 days, patients had to have experienced one or more symptoms of postprandial upper abdominal discomfort on three or more days per week and on average have sufficiently severe symptoms (defined as an upper abdominal discomfort severity score of >149 mm and a postprandial fullness severity score of >29 mm on visual analogue scales, as described below). Patients were only enrolled if there were no serious comorbid illnesses and screening laboratory values were normal. Excluded were patients with gastro-oesophageal reflux disease, based on a normal endoscopy (only erythema was permitted), and 0 [Figure 1: Patient flow. Patients screened, n = 383; screening failures, n = 113; patients randomised, n = 270; patient did not receive study drug, n = 1; intent to treat patients, n = 269; prematurely discontinued, n = 15; completed trial, n = 254.] 0 Each site was supplied with separate sets of study drug for the gastric emptying strata (normal and delayed); to ensure random assignment, patients in each stratum were given a number in sequential order from a separate computer generated randomisation list. A total of 270 patients were randomised but one was lost to follow up after the drug was dispensed and this patient was excluded. Patients treated (n=269) were randomly assigned to receive ABT-229 1.25 mg (n=55), 2.5 mg (n=58), 5 mg (n=53), 10 mg (n=55), or placebo (n=48) twice daily before breakfast and dinner for four weeks. These four doses were chosen based on the gastrokinetic effects of ABT-229 administered in healthy subjects.12 The 2.5 mg twice daily dose was only marginally significantly superior to placebo as it accelerated gastric emptying of the evening meal only. The maximally effective dose in healthy subjects was 5 mg twice daily. As the gastrokinetic effects of ABT-229 were largest in those with slower gastric emptying, a 1.25 mg dose was included in the trial.
To account for the possibility that patients with diabetic gastroparesis might be more resistant to therapy and require a higher dose, 10 mg was also included. Overall, 15 patients prematurely discontinued; the reasons were adverse events (n=10), treatment failure (n=2), lost to follow up (n=1), or other reasons (n=2), and the distribution was similar in each arm (fig 1). In total, 254 patients completed the trial. 0 The placebo was identical in appearance to active therapy. All medication was supplied in double blinded multidose bottles. An administrative blind break occurred for one patient. 0 Compliance, measured by a tablet count at week 4, was excellent. A minimum of 97% of patients in each treatment arm were at least 75% compli 0 Quality Indicators Increase the Reliability of Microarray Data 1 Wolfgang Raffelsberger,1 Doulaye Dembele,1 Mike G. Neubauer,2 Marco M. Gottardis,3 and Hinrich Gronemeyer1,* 0 Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS/INSERM/ULP, B.P. 10142, F-67404 Illkirch Cedex, C. U. de Strasbourg, France Departments of 2Applied Genomics and 3Oncology Drug Discovery, Bristol-Myers Squibb Pharmaceutical Research Institute, Princeton, New Jersey 08543-4000, USA 0 Large-scale gene expression profiling with DNA microarrays opens new dimensions to molecular biology but still lacks the overall precision of traditional low-scale techniques. We developed a novel strategy of data processing linking search stringency to quality indicators for efficient detection of low-level, regulated genes. Using retinoid-induced differentiation of NB-4 promyelocytic cells, the variation of expression profiles between biological duplicates was studied and compared with the changes induced by all-trans retinoic acid (atRA) treatment. An analysis of 4320 genes showed that retinoic acid has mainly gene-activating function in NB-4 cells. 
Treatment with atRA for 18 hours induced metabolic genes that may be associated with cell differentiation and signaling factors triggering later events leading to apoptosis; cytokine genes were among the highest stimulated by atRA. Notably, we identified a regulatory loop inhibiting MYC action: as MYC was downregulated, a cognate repressor of MYC was upregulated. Key Words: retinoic acid, cell differentiation, gene expression profiling, biostatistics 0 Until recently only a limited number of genes were accessible to gene expression profiling, as northern blot, RT-PCR, and ribonuclease protection assays are designed for single genes or small groups of genes at a time. During the course of the human genome project, comprehensive cDNA libraries became available allowing the development of techniques for massive parallel expression profiling. Two types of microarrays emerged either using oligonucleotides directly synthesized on a chip surface (Affymetrix) [reviewed in 1,2] or depositing cDNA PCR products on glass slides [reviewed in 1,3]. In parallel, clustering algorithms for data analysis have been developed [4-7]. High-density microarrays allowed genome-wide screening programs for identification of target genes or expression profiles in disease and cancer [reviewed in 8-10]. Large amounts of data have been generated quickly, but several types of problems encourage the development of novel concepts for data evaluation. Large data sets with intrinsic variation ("noisy data") have to be interpreted by recognizing and excluding outlier data from subsequent analysis in an automated and highly reliable way. 0 Edge Effect and Normalization The microarrays used had a considerable edge effect: spots located close to the edge of a slide displayed lower fluorescence signals than duplicate spots in the center of the slide. For each column a correction factor was introduced minimizing the normalized differences of spot-duplicate (left/right). 
As low spot intensity values have 3- to 10-fold elevated deviation (Fig. 1A), only the 60% most intense spot pairs were used. Spots at saturation were excluded. All normalizations between replicate slides or subsequently between different samples were based on the assumption that there are no major changes in expression levels for the bulk part of the genes tested. This was a valid assumption--it is supported by near-identical shapes of cumulative frequency histograms of fluorescence intensities for different slides after median normalization (Fig. 1B). Comparison with Quantitative RT-PCR and Previous Results Obtained with Affymetrix GeneChips From preliminary experiments 18 genes were selected and their atRA-induced expression was assessed by real-time PCR. In general, most results were in agreement with the 0 arrays revealed upregul 0 Assessing the Drosophila melanogaster and Anopheles gambiae Genome Annotations Using Genome-Wide Sequence Comparisons 1 Olivier Jaillon,1 Carole Dossat,1 Ralph Eckenberg,1 Karin Eiglmeier,2 Beatrice Segurens,1 Jean-Marc Aury,1 Charles W. Roth,2 Claude Scarpelli,1 ´ Paul T. Brey,2 Jean Weissenbach,1 and Patrick Wincker1,3 0 Genoscope/Centre National de Sequencage and CNRS UMR 8030, 91057 Evry Cedex, France; 2Unite de Biochimie ´ ¸ ´ et Biologie Moleculaire des Insectes, Institut Pasteur, Paris 75724 Cedex 15, France ´ We performed genome-wide sequence comparisons at the protein coding level between the genome sequences of Drosophila melanogaster and Anopheles gambiae. Such comparisons detect evolutionarily conserved regions (ecores) that can be used for a qualitative and quantitative evaluation of the available annotations of both genomes. They also provide novel candidate features for annotation. The percentage of ecores mapping outside annotations in the A. gambiae genome is about fourfold higher than in D. melanogaster. The A. 
gambiae genome assembly also contains a high proportion of duplicated ecores, possibly resulting from artefactual sequence duplications in the genome assembly. The occurrence of 4063 ecores in the D. melanogaster genome outside annotations suggests that some genes are not yet or only partially annotated. The present work illustrates the power of comparative genomics approaches towards an exhaustive and accurate establishment of gene models and gene catalogues in insect genomes. 0 nome annotations. We therefore carried out this type of global comparison between these two insect genomes. 0 RESULTS AND DISCUSSION 0 The Drosophila Annotation 0 Genome Research 0 Jaillon et al. 0 Ecores 47,134 n.d. 46,742 n.d. 0 Genes 13,468 n.d. 13,666 n.d. 0 Exons 54,771 n.d. 61,085 n.d. 0 Ecores/ gene 3.17 n.d. 3.2 n.d. 0 Genes and exons stand for annotated genes and exons in the corresponding versions. 0 Drosophila/Anopheles Genomes Comparison 0 Several explanations that are not mutually exclusive may account for this observation. The high number of ecores could be the consequence of (1) an increased coding capacity in the genome of Anopheles, or (2) a larger number of pseudogenes or unmasked transposable elements in Anopheles, or (3) problems in the sequence assembly. Explanations (1) and (2) were not supported by a previous comparative analysis (Zdobnov et al. 2002). The presence of at least two different haplotypes in the A. gambiae strain sequenced is known to have int 0 How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach 1 Wei Pan*, Jizhen Lin and Chap T Le* 0 comment reviews 0 deposited research refereed research interactions 0 Microarrays are used to measure the (relative) expression levels of thousands of genes (or expressed sequence tags). 
A comparison of gene expression in cells or tissues from two conditions may provide useful information on important biological processes or functions [1,2]. The challenge now is how to detect those genuine changes from noisy data. It is now known that simply using fold changes, as in the earlier days, is unreliable and inefficient [3,4]. More sophisticated statistical methods are called for. Many proposals have appeared in the literature [3-10]. In particular, it has been noticed that it may be necessary to design an experiment that uses multiple arrays (or multiple spots on each array) containing multiple measurements for each gene under each condition. One reason is that because of a high noise-to-signal ratio, a single array may not provide enough information that can be reliably extracted [11]. More important, multiple measurements from each gene make it possible to assess the potentially different variability of genes. The problem then seems to fall within the traditional two-sample comparison in statistics. Two of the best known two-sample statistical tests are the two-sample t-test and the Wilcoxon test (or equivalently, Mann-Whitney test). The t-test is parametric and is based on the assumption that the gene-expression levels have normal distributions. In contrast, the Wilcoxon test is nonparametric and is based on the ranks of observed gene-expression levels. Although the t-test is robust to departures from normality and the Wilcoxon test 0 Genome Biology 0 Results and discussion 0 A statistical model 0 We consider a generic situation in which, for each gene i, i = 1,2,...,N, we have (relative) expression levels X1i,..., Xmi from m microarrays under condition 1, and Y1i,..., Ymi from m arrays under condition 2. We need to assume that m is an even integer. 
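The two-sample comparison described above can be sketched for a single gene as follows. This is a minimal illustration only: the unequal-variance (Welch) form of the t statistic and the pairwise-count form of the Mann-Whitney U statistic are assumptions chosen for brevity, the function names are hypothetical, and the expression values are invented for the example rather than taken from the paper.

```python
import math
from statistics import mean, variance


def t_statistic(x, y):
    """Two-sample t statistic, unequal-variance (Welch) form (an assumption)."""
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def rank_sum_u(x, y):
    """Mann-Whitney U for x: count of (x, y) pairs with x > y, ties count 1/2."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Hypothetical expression levels for one gene, m = 4 arrays per condition.
x = [2.1, 2.4, 1.9, 2.2]   # condition 1: X_1i, ..., X_mi
y = [3.0, 3.3, 2.8, 3.1]   # condition 2: Y_1i, ..., Y_mi

print("t statistic:", t_statistic(x, y))
print("U statistic:", rank_sum_u(x, y))
```

The t statistic uses the observed values directly, whereas U depends only on their ordering, which is the parametric versus rank-based distinction drawn in the text.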
A general statistical model is assumed for gene expression data: $X_{ji} = \mu_{(1),i} + \varepsilon_{ji}$ and $Y_{li} = \mu_{(2),i} + e_{li}$, where $\mu_{(1),i}$ and $\mu_{(2),i}$ are the mean expression levels for gene $i$ under the two conditions respectively, and $\varepsilon_{ji}$ and $e_{li}$ are independent random errors with means and variances $E(\varepsilon_{ji}) = E(e_{li}) = 0$, $\mathrm{Var}(\varepsilon_{ji}) = \sigma^2_{(1),i}$ [...] depend on the mean expression $\mu_{(c),i}$. Also, we do not even need to assume that $\sigma^2_{(1),i} = \sigma^2_{(2),i}$ unless $\mu_{(1),i} = \mu_{(2),i}$. A goal is to detect all genes with $\mu_{(1),i} \neq \mu_{(2),i}$. This can be accomplished through statistical hypothesis testing. 0 nonparametrically. T 0 Copyright 2004 by the Genetics Society of America DOI: 10.1534/genetics.104.026658 0 The DrosDel Collection: A Set of P-Element Insertions for Generating Custom Chromosomal Aberrations in Drosophila melanogaster 1 Edward Ryder,* Fiona Blows,* Michael Ashburner,* Rosa Bautista-Llacer,* Darin Coulson,* Jenny Drummond,* Jane Webster,* David Gubb,* Nicola Gunton,* Glynnis Johnson,* Cahir J. O'Kane,* David Huen,* Punita Sharma,* Zoltan Asztalos,* Heiko Baisch, Janet Schulze, Maria Kube, Kathrin Kittlaus, Gunter Reuter, Peter Maroy, Janos Szidonya, Åsa Rasmuson-Lestander,§ Karin Ekström,§ Barry Dickson,** Christoph Hugentobler, Hugo Stocker, Ernst Hafen, Jean Antoine Lepesant, Gert Pflugfelder,§§ Martin Heisenberg,*** Bernard Mechler, Florenci Serras, Montserrat Corominas, Stephan Schneuwly,§§§ Thomas Preat,**** John Roote* and Steven Russell*,1 0 GENETICALLY tractable model organisms are valuable research tools for uncovering basic biological principles that are conserved through evolution. Many molecular pathways, such as signaling cascades, gene regulatory pathways, and cell cycle control circuits, were first characterized genetically in model systems. The subsequent molecular cloning of the genes involved in such pathways has shown how evolution has utilized basic molecular building blocks to control a wide variety of biological processes. 
Key to the success of such approaches has been the ability to carry out genetic screens 0 for components that function in particular pathways and characterize how individual genes participate in such pathways. The fruit fly, Drosophila melanogaster, is one such tractable model that has been used extensively to elucidate many conserved genetic hierarchies. One particularly powerful approach with Drosophila is the ability to rapidly carry out focused genome-wide screens for pathway components by identifying loci that modify specific phenotypes (see St. Johnston 2002 for review). In this approach, a sensitized genetic background, most commonly exhibiting an easily scored adult phenotype such as rough eyes or a wing defect, is used to search for mutations in genes that make the phenotype more severe (enhancer) or more like wild type (suppressor). Mutation-bearing chromosomes are introduced into the 0 E. Ryder et al. 0 specific recombinase (FRT site) placed within intron one. In the case of RS3, a second FRT site is placed upstream of the first of the mini-white exons; in the case of RS5 the second FRT site is located downstream of the mini-white exons. Golic and Golic demonstrated how a pair of RS3 and RS5 elements can be used to generate chromosome rearrangements by design. These chromosome rearrangements include both deficiencies and duplications (Figure 6). Since the insertion site of any P element can be precisely mapped to the genomic sequence, the end points of any chromosome aberration derived from a pair of these RS elements can be determined with single-base-pair resolution. The problem of genetic background heterogeneity is less easily overcome. Powerful genetic methods are available with D. melanogaster to construct "isogenic" lines and we have used these methods in our current screen (Ashburner 1989). 
However, in the absence of practical methods to preserve these lines cryogenically, there is no way to prevent the slow, but inevitable, divergence of these lines in subsequent years. While this may be a drawback in the long term, there can be no doubt that, in the medium term, a deficiency kit in a homogeneous genetic background will be of considerable utility in genome-scale analysis of Drosophila. We describe here the construction of a set of isogenic lines that form the basis for a mobilization screen with RS elements. We describe the isolation and mapping of 3000 new P-element-insertion lines on this background and demonstrate their utility for generating deletions precisely mapped onto the genome sequence. This work is a prelude to an ongoing effort to generate a precisely mapped deletion kit that will cover as much of the genome of D. melanogaster as is possible. In addition, we have constructed a genetic and computational toolkit that allows individual researchers to design and synthesize deletions in regions of particular interest. The materials we have generated are all publicly available. 0 MATERIALS AND METHODS Genetic nomenclature is according to FlyBase (2003). The FM7 balancer stocks were ob 0 Steroid signaling in plants and insects--common themes, different pathways 1 Carl S. Thummel1 and Joanne Chory2,3 0 Outside of mammals, two model systems have been the focus of intensive genetic studies aimed at defining the molecular mechanisms of steroid hormone action--the flowering plant, Arabidopsis thaliana, and the fruit fly, Drosophila melanogaster. Studies in Arabidopsis have benefited from a detailed description of the brassinosteroid (BR) biosynthetic pathway, allowing the effects of mutations to be linked to specific enzymatic steps. 
More recently, the signaling cascade that functions downstream from BR production has been defined, revealing for the first time how the hormone can exert its effects on gene expression through a cell surface receptor and phosphorylation cascade. In contrast, studies of steroid hormone action in Drosophila began in the nucleus, with a detailed description of the transcription puffs activated by the steroid hormone 20-hydroxyecdysone (20E) in the giant polytene chromosomes. Subsequent genetic studies have revealed that these effects are exerted through nuclear receptors, much like mammalian hormone signaling. Most recently, genetic studies have begun to elucidate the ecdysteroid biosynthetic pathway which, until recently, remained largely undefined. Our current understanding of steroid hormone signaling in Arabidopsis and Drosophila provides a number of intriguing parallels as well as distinct differences. At least some of these differences, however, appear to be due to deficiencies in our understanding of these pathways. Below we discuss recent breakthroughs in defining the molecular mechanisms of BR biosynthesis and signaling in plants, and we compare and contrast this pathway with what is known about the mechanisms of ecdysteroid action in Drosophila. We raise some current questions in these fields, the answers to which may reveal other similarities in steroid signaling in plants and animals. Brassinosteroid biosynthesis and homeostasis Although plants and animals diverged more than 1 billion years ago, it is remarkable that polyhydroxylated 0 steroidal molecules are used as hormones in both of these kingdoms, as well as in algae and fungi. Brassinosteroids (BRs), a class of plant-specific steroid hormones, control many of the same developmental and physiological processes as their animal and fly counterparts, including regulation of gene expression, cell division and expansion, differentiation, programmed cell death, and homeostasis. 
The regulation of these processes by BRs, acting together with other plant hormones, leads to the promotion of stem elongation and pollen tube growth, leaf bending and epinasty, root growth inhibition, proton-pump activation, and xylem differentiation (Mandava 1988; Clouse and Sasse 1998). In addition, useful agricultural applications have been found such as increasing yield and improving stress resistance of several major crop plants (Ikebawa and Zhao 1981; Cutler et al. 1991). Although the existence and biological activity of these plant steroids had been described in a large body of literature, they only found their way into the mainstream of plant hormone biology a few years ago, when the available biochemical and physiological data were complemented by the identification of BR-deficient mutants of Arabidopsis (Clouse et al. 1996; Kauschmann et al. 1996; Li et al. 1996; Szekeres et al. 1996), pea (Nomura et al. 1999), and tomato (Bishop et al. 1999; Koka et al. 2000). Mutations in 8 loci of Arabidopsis and several additional loci in tomato and pea result in plants with reduced levels of BR biosynthetic intermediates and lead to distinct phenotypes (Bishop et al. 1996; Li et al. 1996; Szekeres et al. 1996; Choe et al. 1998a,b, 1999a,b, 2000; Klahre et al. 1998; Nomura et al. 1999; Kang et al. 2001). In Arabidopsis, loss-of-function mutations in these genes have pleiotropic effects on development. In the dark, the mutants are short, have thick hypocotyls and open, expanded cotyledons, develop primary leaf buds, and inappropriately express light-regulated genes. In the light, these mutants are dark green dwarfs, have reduced apical dominance and male fertility, display altered photoperiodic responses, show delayed chloroplast and leaf senescence, have reduced xylem content, and respond improperly to fluctuations in their light environment 0 Thummel and Chory 0 (Chory et al. 1991, 1994; Millar et al. 1995; Szekeres et al. 1996; Fig. 1). 
Such phenotypic differences between BRdeficient mutants and wild-type Arabidopsis plants indicate that these genes (and by inference, BRs) play an important role throughout Arabidopsis development. Exogenous application of brassinolide (BL, the most active BR, and generally thought to be the endpoint of the biosynthetic pathway) leads to the normalization of their phenotypes. A biosynthetic pathway derived solely from biochemical studies provided an excellent framework for the characterization of these mutants, and was in turn confirmed and refined by their analysis (for review, see Clouse and Sasse 1998; Noguchi et al. 2000; Friedrichsen and Chory 2001; Fig. 1). Because of their striking mutant phenotypes, which led to the identification of most BR biosynthetic genes, considerable progress has been made in understanding the mechanisms of BR homeostasis. Multiple control mechanisms for regulating the levels of BRs in plants have been identified, including regulation of biosynthesis, inactivation, and feedback regulation from the signaling pathway. BR-deficient mutants have helped to determine that BL is not synthesized via a simple linear biosynthetic pathway. Recently, two pathways, the early C-6 oxidation and late C-6 oxidation pathways, were proposed for the biosynthesis of BL (Choi et al. 1996, 1997). In the early C-6 oxidation pathway, hydroxylation of the side chain occurs after C6 oxidation, whereas in the late C-6 oxidation pathway the hydroxylation of the side chain occurs before position 6 of the B-ring is oxidized. Feeding experiments with intermediates of both path- 0 ways provided strong genetic evidence that both pathways operate in Arabidopsis (Fujioka et al. 1997; Choe et al. 1998a). A study with dwf4 mutants suggests that 6-deoxo-cathasterone is a starting point for a new subpathway as this compound is able to rescue dwf4 mutations (Choe et al. 1998a). 
Of note, DWF4, a C-22 hydroxylase, appears to be the major rate-limiting step in the BR biosynthetic pathway based on feeding studies and overexpression of DWF4 in transgenic plants (Choe et al. 2001). Similarly, 6-6 -hydroxycampestanol could also be a starting point for a different subpathway whose intermediates act as "bridging molecules" between the early and late C-6 oxidation pathways. One simple explanation for plants having multiple pathways of BL biosynthesis is that these subpathways might be differentially regulated by various environmental or developmental signals. A possible point for light-regulation of BR biosynthesis has very recently been identified and is indicated in red in Figure 1 (Kang et al. 2001). In addition, feeding experiments using det2 and dwf4 mutants have shown that BRs in the late C-6 oxidation pathway are more effective in rescuing light phenotypes, whereas the BRs in the early C-6 oxidation pathways show stronger activity in promoting hypocotyl elongation of darkgrown seedlings (Fujioka et al. 1997; Choe et al. 1998a). Endogenous levels of BRs are increased in BR-signaling mutants, such as Arabidopsis bri1 and its orthologous mutants in tomato, pea, and rice (discussed below; Noguchi et al. 1999; Yamamuro et al. 2000; Bishop and Yokota 2001). These BR-insensitive mutants show the largest increases in the early C-6 oxidation BRs. In Ara- 0 GENES & DEVELOPMENT 0 Steroid hormone signaling 1 Fredj Tekaia a,*, Edouard Yeramian b, Bernard Dujon a 0 Keywords: Hyperthermophiles; Mesophiles; Thermostability; Amino acid composition; Evolution; Multivariate analyses 0 Introduction One major aim of large-scale genomic projects is to reach a global understanding of the physiological functioning of living organisms. 
Such understanding must encompass the puzzling discovery that certain organisms live in extreme conditions of temperature, pressure, and salinity, which were originally thought to be incompatible with life (for a recent review see Rothschild and Mancinelli, 2001, and references therein). With the genomic sequences of these organisms becoming available, it is rather surprising that no striking genomic counterparts seem to be associated with such extreme lifestyles. For example, at the DNA level, an 0 GENERAL AND COMPARATIVE 0 Yolk steroid hormones and sex determination in reptiles with TSD 0 Abstract In reptiles with temperature-dependent sex determination (TSD), the temperature at which the eggs are incubated determines the sex of the offspring. The molecular switch responsible for determining sex in these species has not yet been elucidated. We have examined the dynamics of yolk steroid hormones during embryonic development in the snapping turtle, Chelydra serpentina, and the alligator, Alligator mississippiensis, and have found that yolk estradiol (E2) responds differentially to incubation temperature in both of these reptiles. Based upon recently reported roles for E2 in modulation of steroidogenic factor 1, a transcription factor known to be significant in the sex differentiation process, we hypothesize that yolk E2 is a link between temperature and the gene expression pathway responsible for sex determination and differentiation in at least some of these species. Here we review the evidence that supports our hypothesis. © 2003 Elsevier Science (USA). All rights reserved. 0 Temperature-dependent sex determination Sex determination is thought to occur in two basically different modes. There is genetic sex determination (GSD), in which sex chromosomes determine the sex of the individual, and environmental sex determination (ESD), in which environmental factors determine sex. 
In one form of ESD, temperature-dependent sex determination (TSD), the temperature at which the eggs are incubated determines the sex of the hatchlings. There are three different patterns or temperature profiles that have been described for TSD species, male-female (MF), female-male (FM), and female-male-female (FMF). In the MF pattern, low temperatures produce a majority of males, high temperatures produce mostly females, and intermediate temperatures produce a ratio of males to females. The intermediate temperature that produces a 1:1 ratio of males to females is referred to as the pivotal temperature for the species. Several turtle species have been reported to show this profile, including the painted turtle, Chrysemys picta and the red-eared slider turtle, Trachemys scripta (Ewert et al., 1994). In the FM pattern, the temperature regimen is reversed, with high 0 temperatures producing mainly males, low temperatures producing primarily females, and again, intermediate temperatures producing ratios of males to females. This pattern has been reported for some lizards (Viets et al., 1994), including the skink, Eulamprus tympanum, the only viviparous TSD lizard reported to date (Robert and Thompson, 2001). In the third TSD pattern, FMF, females are produced at low temperatures, a majority of males are produced at an intermediate temperature, and predominantly females are produced again at high temperatures. In this system there are two pivotal temperatures at which ratios of males to females are produced. This pattern is displayed in all the crocodilians studied to date, including the American alligator, Alligator mississippiensis (Lang and Andrews, 1994). 
In the snapping turtle, Chelydra serpentina, the usual TSD pattern is FMF (Ewert et al., 1994), however, the TSD pattern in some populations of snapping turtles varies slightly from that described, being MF, with males predominating at lower temperatures, females at higher temperatures, and a single pivotal temperature range. The period of development during which sex is determined, the thermosensitive period (TSP), falls within the middle one-third to one half of the total incubation time (Wibbels et al., 1991a), and temperature influences the rate of development as well as the sex of the hatchling. 0 Temperature is apparently not the only factor influencing sex determination, at least in some of these species. There are reports of large variations in the ratios of males to females produced among clutches of eggs laid by different females at the pivotal temperature where one would expect to see a 1:1 ratio (Rhen and Lang, 1998, Fig. 1). This would indicate that other factors, perhaps some maternal contribution could influence the outcome of the sex determining process. Clutch identity or ``clutch effects'' have also been reported to influence other aspects of offspring fitness, including residual yolk mass, fat body mass and total mass of hatchling snapping turtles (Rhen and Lang, 1999). Moreover, studies of post-hatch growth of snapping turtles showed significant clutch effects in growth rates that were independent of egg mass (Rhen and Lang, 1995). These differences could also be due to differential hormone deposition in yolk, as has been reported in some avian species (Frank et al., 1991; Schwabl, 1996; Schwabl et al., 1997). 0 Gene expression patterns during sex differentiation of TSD reptiles What is known about the sex differentiation process in reptiles with TSD? The gene expression pattern that leads to sex determination and subsequent testis or ovary differentiation, has been defined best in mammalian species, which utilize GSD. 
SRY (Sex-determining region of the Y chromosome) is thought to be the primary determinant of testis differentiation in mouse and human systems (reviewed by Koopman et al., 2001), but there is no known homologue of SRY in TSD reptiles. There are a number of candidate genes that are present 0 but since the embryonic adrenal gland is extremely active, these results do not accurately reflect activity of the gonad alone (T. Wibbels, personal communication). Since in mammalian species SF-1 works in conjunction with SOX9 to up-regulate AMH for male differentiation, SF-1 must participate in completely different interactions in chickens and alligators, where it is upregulated in females. Recent reports indicate that DAX1, an orphan nuclear receptor, inhibits the expression of genes in the male differentiation pathway possibly by modulating the activity of SF-1 (reviewed by Parker and Schimmer, 2002). DAX1 also has reported interactions with estrogen receptors and is thought to act as a corepressor, so could play a role in estrogen signaling pathways (Zhang et al., 2000). Cytochrome P450 aromatase expression, a 0 FEBS 23893 0 Gene expression data analysis 1 Alvis Brazma*, Jaak Vilo 0 what are the functional roles of different genes and in what cellular processes do they participate; how are genes regulated, how do genes and gene products interact, what are these interaction networks; how does gene expression level differ in various cell types and states, how is gene expression changed by various diseases or compound treatments. 0 Knowing the gene transcript abundance in various tissues, developmental stages and under various conditions is important for attacking these questions. Although mRNA is not the ultimate product of a gene, transcription is the first step in gene regulation, and information about the transcript levels is needed for understanding gene regulatory networks. 
Moreover, the measurement of mRNA levels currently is considerably cheaper and can be done in a more high-throughput way than direct measurements of the protein levels. The correlation between the mRNA and protein abundance in the cell may not be straightforward; nevertheless, the absence of mRNA in a cell is likely to imply a not very high level of the respective protein, and thus at least qualitative estimates about the proteome can be based on the transcriptome information. The mRNA and protein level correlation studies are under way (see [1]). Monitoring gene expression at the transcript level has become possible due to the advent of DNA microarray technologies (see [2]). A microarray is a glass slide, onto which single-stranded DNA molecules are attached at fixed locations (spots). There may be tens of thousands of spots on an array, each related to a single gene. Microarrays exploit the preferential binding of complementary single-stranded nucleic acid sequences. There are several variations of microarray technologies, each used in a specific way. One of the most popular experimental platforms is used for comparing mRNA abundance in two different samples (or a sample and a control). RNA from the sample and control cells is extracted and labeled with two different fluorescent labels, e.g. a red dye for the RNA from the sample population and a green dye for that from the control population. Both extracts are washed over the microarray. Gene sequences from the extracts hybridize to their complementary sequences in the spots. To measure the relative abundance of the hybridized RNA the array is excited by a laser. If the RNA from the sample population is in abundance, the spot will be red; if the RNA from the control population is in abundance, it will be green. If sample and control bind equally, the spot will be yellow, while if neither binds, it will not fluoresce and will appear black. 
Thus, from the fluorescence intensities and colors for each spot, the relative expression levels of the genes in the sample and control populations can be estimated. By measuring transcription levels of genes in an organism under various conditions, at different developmental stages and in different tissues, we can build up `gene expression profiles' which characterize the dynamic functioning of each gene in the genome. We can imagine the expression data represented in a matrix with rows representing genes, columns representing samples (e.g. various tissues, developmental stages and treatments), and each cell containing a number characterizing the expression level of the particular gene in the particular sample. We will call such a table a gene expression matrix. Building up a database of such matrices will help us to understand gene regulation, metabolic and signaling pathways, the genetic mechanisms of disease, and the response to drug treatments. For instance, if overexpression of certain genes is correlated with a certain cancer, we can explore which other conditions affect the expression of these genes and which other genes have similar expression profiles. We can also investigate which compounds (potential drugs) lower the expression level of these genes. 0 2. From raw data to gene expression matrix 0 Like many experimental technologies, microarrays measure the target quantity (i.e. relative or absolute mRNA abundance) indirectly by measuring another physical quantity: the intensity of the fluorescence of the spots on the array for each fluorescent dye, i.e. for each optical wavelength (so-called channel). Therefore the raw data produced by microarrays are in fact monochrome images (Fig. 1).
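The gene expression matrix introduced above (rows are genes, columns are samples) can be sketched as follows. This is an illustrative assumption rather than the authors' procedure: the function name `build_expression_matrix`, the gene and sample labels, and the choice of the log2 sample/control ratio as the number stored in each cell are all hypothetical.

```python
import math


def build_expression_matrix(measurements, genes, samples):
    """Return a list of rows (one per gene), each with one value per sample.

    measurements maps (gene, sample) -> (sample_intensity, control_intensity).
    Each cell holds log2(sample / control), an assumed expression measure.
    """
    matrix = []
    for gene in genes:
        row = []
        for sample in samples:
            sample_int, control_int = measurements[(gene, sample)]
            row.append(math.log2(sample_int / control_int))
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    genes = ["geneA", "geneB"]            # hypothetical gene names
    samples = ["tissue1", "tissue2"]      # hypothetical sample names
    measurements = {
        ("geneA", "tissue1"): (800.0, 200.0),
        ("geneA", "tissue2"): (500.0, 500.0),
        ("geneB", "tissue1"): (100.0, 400.0),
        ("geneB", "tissue2"): (300.0, 300.0),
    }
    for gene, row in zip(genes, build_expression_matrix(measurements, genes, samples)):
        print(gene, row)
```

A log ratio is used here only because it treats over- and underexpression symmetrically; any other number characterizing the expression level could fill the cells.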
Transforming these images into the gene expression matrix is a nontrivial process: the spots corresponding to genes on the microarray should be identified, their boundaries determined, the fluorescence intensity from each spot measured and compared to the background intensity and to these intensities for other channels. The software for this initial image processing is often provided with the image scanner, since it will depend on particular properties of the hardware. Often laborious manual adjustment of the grid for spots is used. We will not discuss the raw data processing in detail in this paper; a survey of image analysis software can be found at http://cmpteam4.unil.ch/biocomputing/array/software/MicroArray_Software.html. In any physical experiment it is important to know not only the value of the measurement, but also the standard error or 0 Nutrient control of gene expression in Drosophila: microarray analysis of starvation and sugar-dependent response 1 Ingo Zinke, Christina S. Schütz, Jörg D. Katzenberger, Matthias Bauer and Michael J. Pankratz1 0 Institut für Genetik, Forschungszentrum Karlsruhe, Postfach 3640, D-76021 Karlsruhe, Germany 0 We have identified genes regulated by starvation and sugar signals in Drosophila larvae using whole-genome microarrays. Based on expression profiles in the two nutrient conditions, they were organized into different categories that reflect distinct physiological pathways mediating sugar and fat metabolism, and cell growth. In the category of genes regulated in sugar-fed, but not in starved, animals, there is an upregulation of genes encoding key enzymes of the fat biosynthesis pathway and a downregulation of genes encoding lipases. The highest and earliest activated gene upon sugar ingestion is sugarbabe, a zinc finger protein that is induced in the gut and the fat body.
Identification of potential targets using microarrays suggests that sugarbabe functions to repress genes involved in dietary fat breakdown and absorption. The current analysis provides a basis for studying the genetic mechanisms underlying nutrient signalling. Keywords: fat/feeding/microarrays/starvation/sugar 0 Halaas, 1998). Malfunctioning of physiological pathways underlying nutrient signalling and energy homeostasis can have major consequences for human health, and the modern society is facing ever increasing cases of physiological disturbances such as eating disorders, diabetes and obesity. As the dietary requirement for sugars, fats and amino acids is essentially universal, many aspects of the basic logic of nutrient signalling should be conserved. The finding that both Drosophila and Caenorhabditis elegans possess components of insulin signalling supports this view (Lehner, 1999; Brogiolo et al., 2001; Gems and Partridge, 2001). As part of our analysis of Drosophila larval feeding behaviour, we previously identified lipase 3 (lip3) and phosphoenolpyruvate carboxykinase (pepck) as being upregulated upon starvation (Zinke et al., 1999). Upon addition of sugar, this upregulation was completely suppressed for lip3, but not for pepck. These results demonstrated that different nutrient conditions can have very specific effects on gene expression patterns in Drosophila larvae. We have now used Affymetrix microarrays to identify genes regulated by starvation and by sugar in order to study the mechanisms underlying nutrient signalling. Based on the pattern of response to different nutrient conditions and on existing knowledge of metabolic pathways, we could categorize the identified genes into groups that reflect distinct physiological functions. We have further characterized a zinc finger transcription factor that is one of the earliest and highest upregulated genes upon sugar ingestion. 
Identification of potential target genes indicates that this transcription factor functions to repress genes involved in dietary fat breakdown and absorption. 0 Drosophila larvae are continuous feeders and show large growth in a relatively short time period. About 5 days after egg laying (AEL), they stop feeding, leave the food to enter the wandering stage and pupariate shortly thereafter (Figure 1A). Within this normal developmental progression, there are several notable variations that become apparent under different environmental conditions. One intriguing observation was made by Beadle et al. (1938). When larvae are starved before 70 h AEL, they die within several days, whereas if they are starved after this time point, they do not grow, but still survive and differentiate to give rise to small adult flies. The authors concluded that some `organizational change occurs in larvae at about 70 h' and termed this the `70 h change' (Beadle et al., 1938). This survival after the 70 h change period is independent of whether the larvae are starved or placed on sugar; however, before the 70 h, larvae placed in sugar live for much longer than those under starvation conditions (over a week as compared with ~2 days; see also Britton and Edgar, 1998; Zinke et al., 1999). Clearly, there is a difference in the metabolic programme that becomes activated across this point upon change in nutrient status. As the period before 70 h is critical for survival, we decided to perform the experiments prior to this point. For each time and nutrient condition, two chips were used with each chip being hybridized to the samples collected independently (Figure 1B). 0 © European Molecular Biology Organization
0 Categorization of nutrient-dependent genes 0 Mechanisms for differences in monozygous twins 1 Paul Gringrasa,*, Wai Chenb,c 0 Keywords: Twin; Monozygous; Genetic mechanisms 0 Introduction Over 200 pairs of twins are assessed each year at the Multiple Births Foundation, London. Despite often appearing indistinguishable to strangers, no `identical' twins assessed are so alike that their mothers fail to distinguish them accurately. Physical differences may be as subtle as one small mole, or a differently positioned hair crown; 0 but still, they exist and are unmistakable once identified. Many parents can also differentiate their `identical' twins by their personalities, some even claim from a very early age. Physical similarities between MZ twins are well recognised; and these similarities have long formed the basis of many instruments and clinical methods designed to classify zygosity, such as questionnaires and physical examinations. Even the most experienced practitioners can, however, `misclassify' zygosity in about 6% of cases [1], and molecular genetic methods are now the preferred method for establishing zygosity [2]. The term `identical'--although frequently used--is not synonymous with `monozygous' (MZ). Most MZ twins are phenotypically very similar, yet there are significant numbers of MZ pairs who are neither phenotypically nor genotypically identical. Even if one assumes a completely equal `apportioning' of genetic endowment when twinning occurs, the twin pair will only remain identical if post-zygotic genetic, post-zygotic epi-genetic and post-zygotic environmental factors affect each twin equally. Given the extent of these influences and many potential opportunities for disruption during the long and complex intrauterine development, it is perhaps surprising that so many MZ twins do turn out to be so alike. Nevertheless, it is these anomalous cases of discordant twins that have taught us much about human genetics, development and twinning in the past. 
It is likely that they will continue to do so when new technologies are applied to future research in this area. This review summarises some past findings of well established studies, and also some from more recent exploratory studies using more experimental techniques and designs. We will first consider the ante-natal environmental factors and their effects, and then the genetic factors that contribute to discordance in MZ twins. Some examples of discordancy do not necessarily fit into the above neat categories. For convenience, they have been grouped together and discussed in the final section on `discordancies of unknown origin'. 0 Timing of monozygous twinning Monozygous (MZ) twinning occurs when one single fertilised egg gives rise to two separate embryos. The timing of this division can be an important contributory factor in determining the post-zygotic discordance in MZ twins. This timing can be characterised by the differences in amniotic sac, chorionic and placental anatomical formation [3]. In principle, the earlier twinning occurs, the less the twins will share common supportive structures; and the later, the more. The extreme example of late twinning are conjoint twins who even share some somatic organs. If twinning takes place prior to the first 4 days after conception, two separate placentas and sets of membranes are formed: that is, one set for each embryo. Such twins are called dichorionic (DC) MZ twins, and they account for about one third of all MZ twins. After the `fourth' day, the progenitor cells of the placenta become separated from the inner cell mass of the embryo. As a result, for twinning occurring after this, only one single placenta will develop. 
This single monochorionic (MC) placenta serves both embryos, and in the majority of cases, contains anastomoses of blood vessels that connect the embryos. After about the eighth day, the MC MZ pair will share a common amniotic sac, in addition to the common MC placenta [4]. About 5% of MZ twins are monochorionic (MC) and monoamniotic (MA). Twinning after the second week results in the very rare phenomenon of conjoined twins (see Table 1). All MC twins are MZ by definition, and this is still the `gold standard' when defining monozygosity. Although often seen in animals, vascular communications in dichorionic placentae in man are extremely rare [5]. The combination of monochorionicity and arterioarterial anastomoses is a better proof of monozygosity than any genetic test currently available. If placentation has not already been established by ultrasound in the first trimester, it relies on placental examination by pathologists; unfortunately, this still has not become routine clinical practice in most hospitals, despite numerous pleas in the literature [6,7]. 0 Table 1 0 Amnionicity Diamniotic Diamniotic Monoamniotic 0 Chorionicity Dichorionic Monochorionic Monochorionic twins 0 Frequency One-third of monozygous twins Approximately two-thirds monozygous twins Five percent of monozygous twins Conjoined twins 0 Timing for conjoint twins is theoretical and only suggested by animal models. 0 Ante-natal environmental factors 0 3.1. Chorionicity, twin-twin transfusion syndrome and discordant birth weight 0 Anastomotic connections between foetal circulations are present in around 90% of MC placentas. These anastomoses can result in the `twin to twin transfusion syndrome' (TTTS) [8]. This can result either in a chronic ante-partum transfusion or acute intrapartum transfusion. In the former event, growth discordance occurs and there are risks for both the donor and recipient.
These include the possibility of the donor becoming malnourished and growth retarded, while the recipient is at risk of cardiac hypertrophy, polycythaemia and hydramnios. In general, the mortality and morbidity rate for both twins in this situation is high without intervention [9]. The acute transfusion syndrome occurs intrapartum and causes increased mortality and morbidity, through both hypovolaemia and hypotension in one twin, and polycythaemia in the other. Even without TTTS, discordant birth weight in MZ twins remains common as a result of: (1) unequal in-utero blood supply, and hence growth; and perhaps (2) in theory, unequal division of inner cell mass at twinning. Although such differences may diminish 0 with age, there is a growing body of evidence that significant discrepancy in birth weight may lead to long-lasting physiological changes in both twins. The concept of `foetal programming' proposes that intrauterine growth affects long-term growth and metabolism in later life. Epidemiological studies linking low birth weight with hypertension and coronary artery disease in adult life suggest that undernutrition before birth `programmes' later cardiovascular outcome [10]. Associations between `small for dates' babies with later insulin resistance and cardiovascular disease are consistent with the hypothesis that late gestation may be a window of sensitivity to nutrition in terms of its influence on later cardiovascular disease. In twins discordant for the development of non-insulin dependant diabetes (NIDDM), birth weight has been found to be lower in the affected twin [11]. Investigators continue to use twins with discordant birth weight as a means to test the `foetal programming' hypothesis, while assuming the twin pair would share common confounding variables such as social class, genetic endowment and post-natal environments. 
Two teams have recently reported the importance of birth weight in twins, independent of genetic differences, in influencing their blood pressure as adults [12]. Evidence for `foetal programming' has even been found in early infancy: in a small cohort of MZ twins, where a twin - twin transfusion had occurred, differences in arterial distensibility were found in the donor twin when compared to the recipient [13]. Appealing though the findings from twin studies may be, the extent to which they are generalisable to singleton population is un 0 Genome-wide identification of in vivo Drosophila Engrailed-binding DNA fragments and related target genes 1 Pascal Jean Solano1,*, Bruno Mugat1,*, David Martin2, Franck Girard1, Jean-Marc Huibant1, Conchita Ferraz1, Bernard Jacq2, Jacques Demaille1 and Florence Maschat1, 0 1Institut de Genetique Humaine (UPR 1142). 141 rue de la Cardonille, 34396 Montpellier, France 2Laboratoire de Genetique et Physiologie du Developpement (UMR 6545), IBDM, Parc Scientifique 0 de Luminy, 13288 Marseille, 0 Cedex 9, France 0 SUMMARY Chromatin immunoprecipitation after UV crosslinking of DNA/protein interactions was used to construct a library enriched in genomic sequences that bind to the Engrailed transcription factor in Drosophila embryos. Sequencing of the clones led to the identification of 203 Engrailed-binding fragments localized in intergenic or intronic regions. Genes lying near these fragments, which are considered as potential Engrailed target genes, are involved in different developmental pathways, such as anteroposterior patterning, muscle development, tracheal pathfinding or axon guidance. We validated this approach by in vitro and in vivo tests performed on a subset of Engrailed potential targets involved in these various pathways. Finally, we present strong evidence showing that an immunoprecipitated genomic DNA fragment corresponds to a promoter region involved in the direct regulation of frizzled2 expression by engrailed in vivo. 
0 Key words: Engrailed, Chromatin immunoprecipitation, In vivo targets, Drosophila 0 INTRODUCTION Identification of target genes that are directly regulated by transcription factors is a key issue in developmental biology, and has been the purpose of several recent studies. Indeed, the genome-wide location of DNA-binding proteins using genomic microarrays has been performed in yeast (Iyer et al., 2001; Lieb et al., 2001; Ren et al., 2000). In mammalian cells, CpG island microarrays have allowed the identification of promoter regions capable of binding to the E2F transcription factor (Weinmann et al., 2002). Recently, whole-genome microarray assays associated with bioinformatic methods have also been successfully performed to identify direct target genes of the Dorsal transcription factor in Drosophila (Markstein et al., 2002; Stathopoulos et al., 2002). Identifying the genes that are directly regulated by transcription factors, rather than merely in the downstream pathways, remains essential for understanding gene function (Liang and Biggin, 1998; Mannervik, 1999; Furlong et al., 2001; Egger et al., 2002). Homeodomain transcription factors play key roles during development by coordinating the behavior of most cells within their domains of expression (Garcia-Bellido, 1975; Lawrence and Morata, 1992), and identifying their target genes is challenging (Biggin and McGinnis, 1997). Interestingly, whereas homeodomain proteins recognize closely related binding sites, they are involved in specific genetic pathways and their absence produces very specific phenotypic effects 0 P. J. Solano and others Weinmann et al., 2001; Weinmann et al., 2002). However, UV light is believed to be more efficient in fixing proteins that are directly bound to DNA (Toth and Biggin, 2000). In the present report, we constructed a library enriched in genomic sequences that bind Engrailed protein in Drosophila embryos, by using UV crosslinking and chromatin immunoprecipitation (UV-X-ChIP). 
Systematic sequencing of the recovered clones led to the identification of 203 potential direct targets of engrailed and evidence is presented to show that some of them represent bona fide engrailed targets. MATERIALS AND METHODS 0 Tissue-Specific Gene Expression and Ecdysone-Regulated Genomic Networks in Drosophila 0 Developmental Cell 60 0 midgut, larval epidermal cells and adult epidermal progenitor cells (midgut imaginal islands), respond in opposite ways to ecdysone. The larval epidermal cells initiate the process of programmed cell death, while the imaginal cells proliferate and form the adult midgut. These diverse responses to a single hormone offer an opportunity to study tissue-specific genomic activity during a developmental process that is coordinately regulated throughout the animal. We define the complements of genes expressed during the process of metamorphosis in specific tissues. We show that computational analysis of genome-wide gene expression patterns can facilitate the identification of cis-regulatory elements and a cognate transcription factor. We also show that the network that controls metamorphosis can be extended beyond the ecdysone-regulatory cascade to include components of other well-studied signaling pathways. 0 Results Identification of Transcripts Enriched in Different Tissues and Organs Delineating networks on a genome-wide scale requires a catalog of gene expression patterns in each tissue or organ. Of particular interest are those genes that have high levels of expression in only certain tissues or times during development. We isolated five different organs and tissues from the Drosophila melanogaster Canton-S strain (Figure 1A). Samples were collected in triplicate approximately 18 hr before puparium formation (BPF), when larvae are at the end of their feeding and growing phase but have not yet begun metamorphosis (Riddiford, 1993). 
We compared RNA isolated from each organ or tissue to a common reference RNA sample taken from identically staged whole animals. The use of a linear amplification protocol enabled small amounts of sample 0 BMC Bioinformatics 0 BioMed Central 0 Open Access 0 Array-A-Lizer: A serial DNA microarray quality analyzer 1 Andreas Petri*, Jan Fleckner and Mads Wichmann Matthiessen 0 Petri et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. 0 Background: The proliferation of DNA microarray results has made it necessary to implement a uniform and quick quality control of experimental results to ensure the consistency of data across multiple experiments prior to actual data analysis. Results: Array-A-Lizer is a small and convenient stand-alone tool providing the necessary initial analysis of hybridization quality of an unlimited number of microarray experiments. The experiments are analyzed for even hybridization across the slide and between fluorescent dyes in two-color experiments in spotted DNA microarrays. Conclusions: Array-A-Lizer allows the expedient determination of the quality of multiple DNA microarray experiments allowing for a rapid initial screening of results before progressing to further data analysis. Array-A-Lizer is directed towards speed and ease-of-use allowing both the expert and non-expert microarray researcher to rapidly assess the quality of multiple microarray hybridizations. Array-A-Lizer is available from the Internet as both source code and as a binary installation package. 0 The ongoing development of DNA microarray analysis equipment has diminished both the price and workload associated with microarray experiments, leading to the generation of data at a tremendous rate.
It is not unusual for a group of researchers to be able to produce and scan 50-100 microarray slides per week. The processing of such large amounts of experimental data first requires verification of the overall quality of the experiments. Array-A-Lizer employs two tests to monitor the quality of the hybridization with respect to uniformity across the slide as well as relative intensity of the fluorescent dyes in two-color experiments: 1) spectrum analysis of the signal across the microarray slide and 2) comparison of the two dyes that are used in two-color experiments (for instance Cy3 and Cy5). 0 The Array-A-Lizer graphical user interface (GUI) is created in Borland Delphi and the statistical calculations are carried out in the R-project statistical scripting language [1]. Array-A-Lizer includes a microdistribution of the R-project and contains options for specifying the graphical output type as either bitmaps or postscript. Array-A-Lizer supports experiment files from GenePixPro and Spotfinder through an open architecture, which can be extended to include other file formats. Array-A-Lizer runs on the Microsoft Windows platform. 0 BMC Bioinformatics 2004, 5 0 Results and discussion 0 Array-A-Lizer is an application for rapid quality control of large DNA microarray experiments. The program consists of a collection of scripts that are contained and accessed through a GUI to ease their use (figure 1). The main advantage of the program is the rapid processing of an unlimited number of experiments. Array-A-Lizer generates reports with a graphical analysis of each experiment, providing the researcher with a rapid survey of the quality of experiments (figures 2 and 3). Additionally, the program returns an overview of the results in the system browser with hyperlinks to each analysis report (figure 4). Array-A-Lizer facilitates the generation of several plots that detail the quality of the experiments.
Two different analysis modes can be chosen, resulting in either a set of diagnostic plots or a spatial representation of the data. In comparison to existing analysis packages, Array-A-Lizer is both quick and easy to use. It is a stand-alone application that can be installed on any desktop computer running MS Windows. It is intended for easy visualization of microarray data allowing both the expert and non-expert microarray researcher to assess the quality of multiple microarray hybridizations. 0 Diagnostic report 0 In this mode, the experimental data are used to generate several diagnostic plots (figure 2) as well as statistics on the identified spots. The Array-A-Lizer diagnostic report includes both MvA plots (figure 2A left) [2] and red/green scatter plots (figure 2A right), both of which show spot intensities after local background subtraction. MvA plots display the log intensity ratio $M = \log_2(R/G)$ versus the mean log intensity $A = \log_2\sqrt{RG} = \tfrac{1}{2}\log_2(RG)$. This plot type is widely used to visualize array data because it directly displays the red to green ratios, which are often the quantities of interest in most experiments. Furthermore, MvA plots make it easy to identify intensity dependent biases in the data (i.e. curvature or 'banana shape'). In scatter plots, the intensities from the green channel are plotted against the red channel after log2 transformation. Genes displaying difference in signal intensities in the two channels are plotted off the diagonal and genes showing similar intensities are plotted close to the diagonal. A common source of variation in microarray data acquisition is incorrectly balanced photomultiplier tube (PMT) settings during scanning.
This results in overall differences in signal intensities obtained from either channel and a shift of the data from the x-axis (M = 0) or the diagonal (red = green) of the ideal MvA and scatterplot respectively (figure 2B). Finally, the diagnostic analysis generates histograms of the log2 transformed data for comparison of the distribution of intensities between the two channels. The histograms display the signal intensities across the slide (figure 2C). Overamplified channels (PMT levels are set too high) will result in many saturated spots, which is revealed as an over representation of high intensity values (figure 2D). The diagnostic report includes information on which files were used for the analysis, the number of saturated spots, and the number of negative values, i.e. the number of spots where the background intensity was higher than the foreground intensity. 0 Spatial report 0 The spatial analysis results in a graphical representation of microarray data according to the location on the slide (figure 3). From each channel, three different plots are generated showing the log2 transformed foreground intensities, the background intensities, and a plot showing the location of negative values (background higher than foreground). This analysis method can be used to identify spatial effects on the hybridized arrays such as fading or illumination at the edges due to cover-slip effects (figure 3A and 3B) or scratches and artifacts resulting from inadequate washing of slides (figure 3C and 3D). The cut-off values on the background plot can be set from the GUI prior to starting the analysis.
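The diagnostic quantities described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Array-A-Lizer implementation (which the text says is written in R and Delphi): the names `mva` and `diagnose`, the tuple layout of a spot, and the 16-bit saturation ceiling of 65535 are all hypothetical.

```python
import math
from statistics import median

SATURATION = 65535  # assumed 16-bit scanner ceiling, not stated in the text


def mva(red, green):
    """Return (M, A) for background-subtracted intensities red and green."""
    m = math.log2(red / green)          # log intensity ratio
    a = 0.5 * math.log2(red * green)    # mean log intensity
    return m, a


def diagnose(spots):
    """Summarise spots given as (red_fg, red_bg, green_fg, green_bg) tuples."""
    saturated = 0
    negative = 0
    m_values = []
    for red_fg, red_bg, green_fg, green_bg in spots:
        if red_fg >= SATURATION or green_fg >= SATURATION:
            saturated += 1
        red, green = red_fg - red_bg, green_fg - green_bg
        if red < 0 or green < 0:
            negative += 1               # background higher than foreground
        if red <= 0 or green <= 0:
            continue                    # log ratio undefined for this spot
        m_values.append(mva(red, green)[0])
    return {
        "saturated": saturated,
        "negative": negative,
        # A median M far from 0 hints at unbalanced PMT settings.
        "median_M": median(m_values) if m_values else None,
    }


if __name__ == "__main__":
    spots = [(900, 100, 300, 100), (65535, 100, 500, 100), (50, 100, 400, 100)]
    print(diagnose(spots))
```

Using the median of M as a balance indicator is a design choice of this sketch; the report itself conveys the same information graphically through the shift of the point cloud away from M = 0.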
Keeping these limits fixed will allow easy detection of pronounced fluctuations in background intensities both between and within slides. 0 With the reduced cost and labor of DNA m 0 TECHNICAL REPORTS 0 CA). Touchdown PCR amplifications were performed as recommended18. Cycle sequencing protocols were used with ABI sequencers at the Hutchinson Center Biotechnology Facility. DHPLC. Mutation detection was performed using the Transgenomic WAVE system. Following PCR amplification, the Pfu polymerase was inactivated, and the DNA samples were heated and cooled to form heteroduplexes18. For most fragments, the predicted WAVE (v.3.5) melting temperatures and separation gradients were used19. 0 We thank Bruce Draper for helpful discussions. This work was supported by grant RO1 GM29009 (to S.H.) from the National Institutes of Health. S.H. is an investigator of the Howard Hughes Medical Foundation, which also provided support for Karen Wolfe of the James Roberts lab, whom we thank for helping us with the screen. 0 High-fidelity mRNA amplification for gene profiling 1 Ena Wang1,3, Lance D. Miller2,3, Galen A. Ohnmacht1, Edison T. Liu2, and Francesco M. Marincola1* 0 QUANTITATIVE TRAIT LOCI IN DROSOPHILA 1 Trudy F. C. Mackay 0 Phenotypic variation for quantitative traits results from the simultaneous segregation of alleles at multiple quantitative trait loci. Understanding the genetic architecture of quantitative traits begins with mapping quantitative trait loci to broad genomic regions and ends with the molecular definition of quantitative trait loci alleles. This has been accomplished for some quantitative trait loci in Drosophila. Drosophila quantitative trait loci have sex-, environment- and genotype-specific effects, and are often associated with molecular polymorphisms in non-coding regions of candidate genes. These observations offer valuable lessons to those seeking to understand quantitative traits in other organisms, including humans.
0 Transfer of genetic material from one strain to another by repeated backcrosses. With marker-assisted introgression, markers that distinguish the parental strains are used to track the desired interval and select against the undesired genotype. 0 The ease with which Mendelian and quantitative traits give up their genetic secrets is inversely proportional to the relative importance of the two classes of trait for human health, agriculture, evolution and even functional genomics. Although devastating to the possessor, highly deleterious alleles that cause inborn errors of metabolism and other single gene disorders are rare in the general population. By contrast, susceptibility to common diseases such as atherosclerosis, arthritis, diabetes, hypertension and schizophrenia is affected by multiple genetic factors and by the environment. These diseases are therefore quantitative traits (FIG. 1), and affect a large proportion of the human population. Similarly, individuals vary quantitatively in their response to drug therapy. There is great excitement in the human genetics community and the pharmaceutical industry that susceptibility loci for common diseases and individual variation in drug response can be identified and the molecular basis for this variation determined. This knowledge will herald a new era of personalized medicine in which environment-specific risk factors for common diseases are assessed for individual genotypes (and hopefully avoided by the patient) and pharmaceutical treatment is genotype dependent. Similar arguments apply to the agriculture industry, in which most characters of economic importance in domestic animal and crop species are quantitative. There is a long history of success in improving productivity traits 0 by selective breeding for favourable phenotypes. 
Knowledge of the allelic status at each locus affecting these traits will greatly facilitate this process, and will enable INTROGRESSION of favourable alleles from other strains, while simultaneously eliminating deleterious alleles. Variation for quantitative traits is the raw material on which the forces of evolution act to produce phenotypic diversity and adaptation. Major research efforts in evolutionary quantitative genetics are aiming to determine how genetic variation for adaptive quantitative traits is maintained in natural populations; whether the loci at which variation occurs within a population are the same as those that cause divergence between populations and species; and how the answers to these questions depend on the relationship of the trait to the ultimate quantitative trait -- reproductive fitness. So a comprehensive understanding of the evolutionary process is contingent on a detailed description of the molecular genetic basis of variation for quantitative traits in natural populations. The complete genome sequences of the yeast Saccharomyces cerevisiae1, the nematode Caenorhabditis elegans2 and the fruitfly Drosophila melanogaster3 reveal that a large fraction of these genomes is uncharted phenotypic territory. In Drosophila, for example, only 2,500 of the 13,600 genes and predicted genes (18%) have been characterized by classic genetic and molecular methods3. An important challenge for the future is to devise ways of determining the phenotypic effects of 0 NATURE REVIEWS | GENETICS 0 Macmillan Magazines Ltd 0 A1A1 Phenotype Phenotype A1A2 A2A2 A1A1 A1A2 A2A2 Phenotype Frequency A1A1 A1A2 A2A2 0 Phenotypic value 0 No GEI Parallel reaction norms 0 GEI Reaction norms cross 0 GEI Change of variance 0 ANTAGONISTIC PLEIOTROPY 0 Alternative homozygous genotypes (A1A1, A2A2) have opposite phenotypic effects under different conditions. 
0 CONDITIONAL NEUTRALITY 0 The difference between quantitative trait loci genotypes is only expressed under some conditions. 0 A statistic to quantify dispersion about the mean. In quantitative genetics, the phenotypic variance, VP , is the observed variation of the trait in a population. VP is partitioned into components due to variation in the additive (VA) dominance (VD ) and epistatic (VI ) genetic variance, the variance attributable to the environment (VE ), and gene-environment correlations and interactions. 0 uncharacterized and predicted genes. Conventional screens for mutations with large phenotypic effects can lead to the identification of function for a biased sample of genes -- mutating one gene in a pathway in which there is functional redundancy might not cause a major effect on the phenotype. Furthermore, homozygous lethal mutations define loci that are essential for viability, but less severe mutations at these loci may have unknown and unexpected pleiotropic effects on morphology, physiology and behaviour. So, genetic screens for mutations with subtle, quantitative effects and genetic analysis of naturally occurring variation for quantitative traits will be important components of the functional genomics tool kit. Until very recently, the genetic basis of variation for quantitative traits was inferred solely from statistical estimates of correlations between relatives, response to artificial selection and changes of mean and VARIANCE of the trait on inbreeding and crossing4,5. To reap the benefits of a thorough understanding of quantitative traits, we must lift this statistical fog6 and describe quantitative genetic variation in terms of complex genetics (FIG. 1). Specifically, a full understanding of the genetic architecture of a quantitative trait will require answers to the following questions. What are the loci at which mutational variation affecting the trait occurs? What are the spontaneous mutation rates at these loci? 
What loci affect naturally occurring variation within and between populations of a single species, and between species? What are the homozygous and heterozygous effects of alleles at these loci? Are the effects of the individual loci on the final phenotype independent (additive), or is the effect of multiple loci on the phenotype nonlinear (epistasis)? What is the effect of quantitative trait locus (QTL) alleles on multiple quantitative traits, including reproductive fitness (pleiotropy)? How do the homozygous, heterozygous, epistatic and pleiotropic QTL effects vary between the sexes and in a range of ecologically relevant environments? What defines a QTL allele at the molecular level? What are QTL allele frequencies within and between populations? At present, detailed genetic dissection of quantitative traits is most feasible in genetically tractable and well-characterized model systems. Drosophila melanogaster is one of the model organisms that provides us with all the tools necessary for identifying QTL and characterizing them at the molecular level7 (FIG. 2). Over eight decades of research on this organism have provided us with a library of stocks that bear mutations at single loci and deficiency chromosomes that cover around 70% of the genome. The P transposable element has been harnessed as a transformation vector and modified for efficient insertional mutagenesis, analysis of tissue-specific expression patterns, general and targeted overexpression, and, most recently, homologous rec 0 review 0 In control: systematic assessment of microarray performance 1 Harm van Bakel & Frank C.P. Holstege+ 0 Expression profiling using DNA microarrays is a powerful technique that is widely used in the life sciences. How reliable are microarray-derived measurements? The assessment of performance is challenging because of the complicated nature of microarray experiments and the many different technology platforms.
There is a mounting call for standards to be introduced, and this review addresses some of the issues that are involved. Two important characteristics of performance are accuracy and precision. The assessment of these factors can be either for the purpose of technology optimization or for the evaluation of individual microarray hybridizations. Microarray performance has been evaluated by at least four approaches in the past. Here, we argue that external RNA controls offer the most versatile system for determining performance and describe how such standards could be implemented. Other uses of external controls are discussed, along with the importance of probe sequence availability and the quantification of labelled material. Keywords: expression profiling; external controls; microarray; performance; quality; spikes 0 DNA microarrays are universal tools that can be applied throughout the life sciences (Brown & Botstein, 1999; Lockhart & Winzeler, 2000; Young, 2000). mRNA-expression profiling is the most frequent application. Such microarray hybridizations determine changes in mRNA levels between two samples or result in an absolute quantification that is correlated to mRNA levels. How reliable are these measurements? Given the widespread interest, it is surprising that there have been relatively few systematic analyses of microarray performance. One reason for this lack of assessment is the complicated nature of microarray technology; there is no single `microarray technology', but rather a collection of different technology platforms. Established platforms include Affymetrix GeneChips (Santa Clara, CA, USA), PCR-product-based cDNA arrays and long oligomer arrays that are manufactured in-house or by Agilent (Palo Alto, CA, USA). New platforms are still being introduced, such as the Illumina Beadarray 0 (San Diego, CA, USA; Fan et al, 2004) or the Universal Hexamer Array from Agilix (New Haven, CT, USA; Roth et al, 2004). 
To complicate matters further, many technical alternatives are possible within each platform for each of the numerous steps between sample preparation and data analysis. These include diverse methods of generating labelled material, various hybridization conditions, different microarray scanners and settings, a range of imagequantification techniques, and several approaches for determining statistically and biologically significant differential gene expression. Microarray technology is therefore an amalgamation of many different techniques, even within individual technology platforms. This complexity makes the need for comparing performance even stronger, whilst confounding such comparisons. Determining reliability is a complicated undertaking if all aspects are to be assessed in a non-arbitrary way across the different platforms and their variants. In addition, reliability is a sensitive issue for those groups that provide the technology. Finally, not every application requires reliable estimates of mRNA level changes. This should be interpreted as an indication of the power of microarray technology, as even lower quality data can yield important results. Improved performance would nevertheless benefit all applications. A high degree of reliability is a requirement if certain fields, such as systems biology (Ideker et al, 2001) or diagnostic mRNAexpression profiling (van de Vijver et al, 2002) are to mature. A strong argument can be made for investigating how the technology can be systematically assessed, given its increased usage, the costs that are involved and the fact that the aim is to determine the mRNA levels of all genes, including those that are expressed at nearly zero levels. Here, we describe approaches for determining microarray performance and propose that the use of external control RNAs is a versatile and robust method for achieving this goal. 0 Accuracy and precision 0 Which performance parameters should be assessed? 
The two main characteristics of data quality are accuracy and precision. Whereas accuracy refers to how close a measurement is to the real value, precision indicates how often a measurement yields the same result (Fig 1). When microarray data are discussed, the focus is often on precision; that is, reproducibility rather than accuracy. Reproducibility is easier to assess, by taking repeated measurements. Previous reviews have discussed the pitfalls that are involved in determining reproducibility, such as the confusion between 0 EUROPEAN MOLECULAR BIOLOGY ORGANIZATION 0 Controlling microarray performance H. van Bakel & F.C.P. Holstege 0 [Fig 1 residue: axes labelled 'Measured mean' and 'Real value'] 0 mized. Confounding artefacts are still being uncovered (Diehl et al, 2001; Ramdas et al, 2001; Chuaqui et al, 2002; Fare et al, 2003; Martinez et al, 2003; Raghavachari et al, 2003; 't Hoen et al, 2003; Lyng et al, 2004). Therefore, monitoring quality would benefit individual hybridizations and projects. This could also aid in analyses of the data that are now being collected in public databases (Edgar et al, 2002; Brazma et al, 2003). In these cases, internal quality control would allow the refinement of decisions about which data to use, depending on the requirement for different quality parameters. 0 Approaches to determining performance 0 One method that can be used to optimize protocols is to measure and increase the signal intensity (Rickman et al, 2003; Wrobel et al, 2003). The underlying assumption is that increased signal-to-noise ratios will yield better quality hybridizations. However, an increase in signal might be nonspecific; for example, owing to increased cross-hybridization or the nonspecific binding of fluorophores to nucleic-acid probes (Chuaqui et al, 2002). It is therefore risky to optimize signal-to-noise ratios without knowing whether specificity is being maintained.
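The distinction between accuracy and precision drawn above can be made concrete with a small sketch. It assumes a control whose real value is known (for example an external RNA control at a defined ratio); the function name and the numbers are hypothetical and are not taken from the review.

```python
import statistics

def accuracy_and_precision(measurements, real_value):
    """Return (bias, spread) for repeated measurements of one known quantity."""
    mean = statistics.fmean(measurements)
    # Accuracy: how far the measured mean lies from the real value.
    bias = mean - real_value
    # Precision: how tightly the repeats scatter around their own mean.
    spread = statistics.stdev(measurements)
    return bias, spread

# Hypothetical log2 ratios for a control whose real log2 ratio is 1.0.
precise_but_inaccurate = [0.52, 0.50, 0.49, 0.51]
accurate_but_imprecise = [0.40, 1.60, 0.70, 1.30]

for name, data in [("precise, inaccurate", precise_but_inaccurate),
                   ("accurate, imprecise", accurate_but_imprecise)]:
    bias, spread = accuracy_and_precision(data, real_value=1.0)
    print(f"{name}: bias={bias:+.3f}, spread={spread:.3f}")
```

In this illustration the first series is highly reproducible yet reports roughly half of the real change, which repeated measurements alone would not reveal.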
A second approach is to determine the correlation between new methods and an approach that is already in use. Different amplification and labelling techniques are usually assessed by comparison to a standard cDNA-synthesis protocol (Mahadevappa & Warrington, 1999; Manduchi et al, 2002; Gupta et al, 2003; t Hoen et al, 2003; Kenzelmann et al, 2004). A correlation coefficient only shows how similarly two protocols behave; it does not give information on their individual accuracy. A high correlation (Barczak et al, 2003) might therefore mean that the technologies that are being compared both suffer from the same error. Moreover, a low correlation (Tan et al, 2003) still begs the question of which technique is better. Another use of correlation is to monitor reproducibility; for example, between the two dye channels of cDNA arrays. The drawback is that the technology is being optimized for yielding identical intensities, rather than for accurately reporting what most users are interested in: differences in mRNA levels. Perfectly tight same-versus-same scatter plots, which are often touted in publications or advertisements as proof of superior performance, should be treated with caution. Optimization that is based on achieving tight scatter plots can lead to a decreased ability to report changes in mRNA levels. Ideally, optimization should focus on reporting relative or absolute mRNA levels and should take into account the entire range of expression levels. A third method for performance evaluation is to use an established cell-culture experiment in which changes in mRNA levels are verified by other means, such as northern blotting analysis or quantitative reverse transcription (RT)-PCR (Taniguchi et al, 2001; Yuen et al, 2002; Polacek et al, 2003; Loguinov et al, 2004; Roth et al, 2004). Using such established differentials is a good method because it optimizes the reporting of differences in expression, which is the goal of most microarray hybridizations. 
One disadvantage is that verification and optimization are driven by the differences that are reported by the microarrays, rather than by all of the mRNA-level differences that are present in the experimental system. There is no test for false-negative differentials unless RT-PCR, for example, is carried out on many hundreds of genes that are not reported as being differentially expressed in the microarray experiment. A further drawback is that this method, similar to those described above, does not lend itself to the routine assessment of each individual microarray hybridization before optimization. 0 Genome-Wide Location and Function of DNA Binding Proteins 1 Bing Ren,1* François Robert,1* John J. Wyrick,1,2* Oscar Aparicio,2,4 Ezra G. Jennings,1,2 Itamar Simon,1 Julia Zeitlinger,1 Jörg Schreiber,1 Nancy Hannett,1 Elenita Kanin,1 Thomas L. Volkert,1 Christopher J. Wilson,5 Stephen P. Bell,2,3 Richard A. Young1,2 0 Understanding how DNA binding proteins control global gene expression and chromosomal maintenance requires knowledge of the chromosomal locations at which these proteins function in vivo. We developed a microarray method that reveals the genome-wide location of DNA-bound proteins and used this method to monitor binding of gene-specific transcription activators in yeast. A combination of location and expression profiles was used to identify genes whose expression is directly controlled by Gal4 and Ste12 as cells respond to changes in carbon source and mating pheromone, respectively. The results identify pathways that are coordinately regulated by each of the two activators and reveal previously unknown functions for Gal4 and Ste12. Genome-wide location analysis will facilitate investigation of gene regulatory networks, gene function, and genome maintenance. Many proteins bind to specific sites in the genome to regulate genome expression and maintenance.
Transcriptional activators, for example, bind to specific promoter sequences and recruit chromatin modifying complexes and the transcription apparatus to initiate RNA synthesis (1-3). The reprogramming of gene expression that occurs as cells move through the cell cycle, or when cells sense changes in their environment, is effected in part by changes in the DNA binding status of transcriptional activators. Distinct DNA binding proteins are also associated with origins of DNA replication, centromeres, telomeres, and other sites, where they regulate chromosome replication, condensation, cohesion, and other aspects of genome maintenance (4, 5). Our understanding of these proteins and their functions is limited by our knowledge of their binding sites in the genome. The genome-wide location analysis method we have developed allows protein-DNA interactions to be monitored across the entire yeast genome (6). The method combines a modified chromatin immunoprecipitation (ChIP) procedure, which has been previously used to study protein-DNA interactions at a small number of 0 in galactose using our analysis criteria (Fig. 2A). These included seven genes previously reported to be regulated by Gal4 (GAL1, GAL2, GAL3, GAL7, GAL10, GAL80, and GCY1). The MTH1, PCL10, and FUR4 genes were also bound by Gal4 and activated in galactose. Each of these results was confirmed by conventional ChIP analysis (Fig. 2B) (6), and MTH1, PCL10, and FUR4 activation in galactose was found to be dependent on Gal4 (Fig. 2C). Both microarray and conventional ChIP showed that Gal4 binds to GAL1, GAL2, GAL3, and GAL10 promoters under glucose and galactose conditions, but the binding was generally weaker in 0 specific DNA sites (7), with DNA microarray analysis. Briefly, cells were fixed with formaldehyde, harvested, and disrupted by sonication. The DNA fragments cross-linked to a protein of interest were enriched by immunoprecipitation with a specific antibody. 
After reversal of the cross-links, the enriched DNA was amplified and labeled with a fluorescent dye (Cy5) with the use of ligation-mediated-polymerase chain reaction (LM-PCR). A sample of DNA that was not enriched by immunoprecipitation was subjected to LM-PCR in the presence of a different fluorophore (Cy3), and both immunoprecipitation (IP)-enriched and -unenriched pools of labeled DNA were hybridized to a single DNA microarray containing all yeast intergenic sequences (Fig. 1). A single-array error model (8) was adopted to handle noise associated with low-intensity spots and to permit a confidence estimate for binding (P value). When independent samples of 1 ng of genomic DNA were amplified with the LM-PCR method, signals for greater than 99.8% of genes were essentially identical within the error range (P value ≤ 10^-3). The IP-enriched/unenriched ratio of fluorescence intensity obtained from three independent experiments was used with a weighted average analysis method to calculate the relative binding of the protein of interest to each sequence represented on the array. To investigate the accuracy of the genome-wide location analysis method, we used it to identify sites bound by the transcriptional activator Gal4 in the yeast genome. Gal4 activates genes necessary for galactose metabolism and is among the best characterized transcriptional activators (1, 9). We found 10 genes to be bound by Gal4 (P value ≤ 0.001) and induced 0 glucose (6). The consensus Gal4 binding sequence that occurs in the promoters of these genes (CGGN11CCG) can also be found at many sites throughout the yeast genome where Gal4 binding is not detected; therefore, sequence alone is not sufficient to account for the specificity of Gal4 binding in vivo. Previous studies of Gal4-DNA binding have suggested that additional factors such as chromatin structure contribute to specificity in vivo (10, 11).
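The point that the consensus sequence alone does not predict binding can be explored by listing every consensus match in a sequence and comparing the list with the bound regions. The following is a minimal sketch, not the authors' procedure: the function name is hypothetical and the test fragment is invented rather than taken from the yeast genome. Because CGG-N11-CCG is its own reverse complement, scanning one strand is enough for this particular motif.

```python
import re

# CGG, any 11 bases, then CCG. Wrapping the motif in a lookahead lets
# overlapping matches be reported as well.
GAL4_SITE = re.compile(r"(?=(CGG[ACGT]{11}CCG))")

def find_gal4_sites(sequence):
    """Return (0-based start, matched 17-mer) for every consensus match."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in GAL4_SITE.finditer(seq)]

# Invented fragment used only to exercise the function.
fragment = "TTACGGATTAGAAGCTACCGAGTT"
print(find_gal4_sites(fragment))  # [(3, 'CGGATTAGAAGCTACCG')]
```

Sites returned by such a scan that lie outside the regions enriched in the location analysis would be the candidates where additional factors, such as chromatin structure, might prevent binding.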
The identification of MTH1, PCL10, and FUR4 as Gal4-regulated genes reveals previously unknown functions for Gal4 and explains how regulation of several different metabolic pathways can be coordinated (Fig. 2D). MTH1 encodes a transcriptional repressor of certain HXT genes involved in hexose transport (12). Our results suggest that the cell responds to galactose by increasing the concentration of its galactose transporter at the expense of other transporters. In other words, while Gal4 activates expression of the galactose transporter gene GAL2, Gal4 induction of the MTH1 repressor gene leads to reduced levels of glucose transporter expression. The Pcl10 cyclin associates with Pho85p and appears to repress the formation of glycogen (13). Thus, the observation that PCL10 is Gal4-activated suggests that reduced glycogenesis occurs to maximize the energy obtained from galactose metabolism. FUR4 encodes a uracil permease (14), and its induction by Gal4 may reflect a need to increase intracellular pools of pyrimidines to permit efficient uridine 5′-diphosphate (UDP) addition to galactose catalyzed by Gal7. We next investigated the genome-wide binding profile of the transcription activator Ste12, which functions in the response of haploid yeast to mating pheromones (15). Activation of the pheromone-response pathway by mating pheromones causes cell cycle arrest and transcriptional activation of more than 200 genes in a Ste12-dependent fashion (8, 15). However, it is not clear which of these genes are directly regulated by Ste12 and which are regulated by other ancillary factors. The genome-wide binding profile of epitope-tagged Ste12, determined before and after pheromone treatment in three independent experiments, indicates that 29 pheromone-induced genes are regulated directly by Ste12. Figure 3A lists the yeast genes whose promoter regions are bound by Ste12 at the 99.5% confidence level (i.e., P value ≤ 0.005) and whose expression is induced by α factor.
These 29 genes are likely to be directly regulated by Ste12 because (i) all have promoter regions bound by Ste12, (ii) exposure to pheromone causes an increase in their transcription, and (iii) pheromone induction of transcription is dependent on Ste12. Of the genes that are directly regulated by Ste12, 11 are already known to participate in various steps of the mating process (Fig. 3B). FUS3 and STE12 encode components of the signal transduction pathway involved in the response to pheromone (16); AFR1 and GIC2 are required for the formation of mating projections (17-19); FIG2, AGA1, FIG1, and FUS1 are involved in cell fusion (20-23); and CIK1 0 The End of the Microarray Tower of Babel: Will Universal Standards Lead the Way? 1 Ernest S. Kawasaki 0 NCI Advanced Technology Center, Bethesda, MD 0 The probe size, number of probe sets and the total number of probes per array are indicated. 0 A Proliferation of Microarray Platforms and Associated Technologies 0 Table 1 gives a list of sources for obtaining whole genome arrays, which are defined as arrays that have approximately the entire gene complement of the genome represented on one slide or chip. You will note that there are large differences in the size of the probes, the number of probe sets, and the total number of probes per array. This and many other technological differences found in these platforms will be enumerated, with pointers as to how or why these differences can cause discordant results between platforms. The nomenclature convention followed here is that the "probe" is the gene sequence arrayed on the chip, and the "target" is the RNA sequence to be labeled and hybridized to the probes. Probe manufacture.
The probes for the arrays may be made in situ by photolithographic or ink-jet methods, or by standard oligonucleotide synthesis protocols followed by attachment to various substrates.3 Because the methods are so varied, it is difficult to estimate the purity of the probes or their true sizes, and large differences in these parameters can have a great influence on signal intensi- 0 A decade of microarray publications. The number of publications per year derived from PubMed using the terms "microarray" or "microarrays" is shown. 0 for detecting mRNAs of low abundance than the long probe arrays. Thus, probe size can be a confounding factor when comparing the same genes across many platforms (Table 1). Probe element size and concentration. The element or spot size diameters range from 11 microns to ~200 microns in the different platforms. The size of the array elements (spots), their area in µm², and concentration in the number of molecules per spot are given in Table 2. There is also a large difference in the number of probe molecules per spot, with estimates from several million to hundreds of millions of molecules. This can heavily influence the kinetics of hybridization, signal quantification, and signal intensities of the probes, and these important factors will vary from platform to platform. Probe number per array. The number of probe sets may vary from 30,000 to 54,000, but the total number of probes per array actually ranges from about 30,000 to greater than 500,000 (Table 2). Microarrays may contain one probe per gene or up to twenty probes per gene. This fact alone can make it difficult to directly compare the data from platforms with such a wide range of the number of probes per mRNA sequence. Proper probe annotation. This is an intense area of investigation.6-8 The sequence databases for expressed genes are still in a state of flux, such that probe sequences derived from older databases may be dramatically different from the latest version.
It has been found that some probe sequences no longer exist in the database, or were not annotated properly and now have different IDs or names. Thus, platforms may have probe sequences that do not exist in the genome or have the incorrect designation, and this has been an important source of confusion in the analysis of array data. Target preparation. There is no standard way of isolating RNA for target labeling, although almost all microarray experimentalists follow the rule of analyzing the integrity of their RNA samples before beginning labeling steps. Many expression profiling experiments in the past were uninterpretable simply because of poor RNA quality. A common method to test RNA integrity is through the use of an Agilent 2100 Bioanalyzer, which provides an electrophoretic tracing and a RNA integrity number (RIN) for judging RNA quality.9 Target synthesis. Targets are commonly synthesized via cDNA reactions on total RNA or by in vitro synthesis of linearly amplified RNA using T7 RNA polymerase technologies.10 The cDNA targets are thought to faithfully represent the original concentrations of the mRNA in the sample, but linearly amplif 0 BIOINFORMATICS APPLICATIONS NOTE 0 arrayMagic: two-colour cDNA microarray quality control and preprocessing 1 Andreas Buness, Wolfgang Huber, Klaus Steiner, Holger Sueltmann and Annemarie Poustka 0 that can at any time be re-run or extended. The compendium technology (Gentleman, 2004) can be used to produce distributable objects containing the data as well as revivable documents reporting the processing. We aimed to integrate normalization methods, quality scores and visualizations that had been reported previously. In addition, we provide tools for dealing with different microarray layouts within one experiment and for merging data from replicate probes or hybridizations. The researcher obtains an instant overview on the quality of the experiment. 
0 Normalization strategies for two-colour microarrays can be divided into two groups: adjustment of the colour channels or of the log-ratios. Moreover, depending on the experimental design and the objectives, either a single-channel intensity or a log-ratio-based analysis might be more appropriate. The tool offers log-ratio-based normalization by means of the loess method (Yang et al., 2002) and direct intensity-based normalization by means of vsn (Huber et al., 2002) and quantile normalization (Bolstad et al., 2003) methods. We will also use the terms `log-ratios' and `log-transformed intensities' for the data resulting from the vsn method. Groups of hybridizations, subsets of spots, e.g. by grid, print-tip or PCR plate, as well as colour channels can be normalized separately. Plots characterizing the distributions of the log-ratios and colour channels before and after normalization were generated (Fig. 1b). 0 Two-colour cDNA microarray technology has evolved into a routine laboratory procedure. Our motivation in implementing arrayMagic was to deal with the large amount of data generated by microarray projects in an efficient, reliable and reproducible manner. We focused on preprocessing and quality assurance, leaving out high-level analysis, which has to be addressed specifically. The main design goal was to allow for the rapid construction of customized quality assessment and control (QA/QC) and preprocessing pipelines for such projects from a small set of building blocks. arrayMagic bridges the gap between the image quantification software and subsequent statistical and exploratory analyses like testing for differential expression or classification.
It simplifies the task of building processing pipelines that are reproducible, which means that even for idiosyncratic experimental designs and non-trivial combinations and selections of the data the whole procedure from raw data to normalized, quality-controlled, annotated and summarized data is documented in a not too verbose script 0 QUALITY CONTROL AND ASSESSMENT 0 Quality-assured data are a prerequisite for any reliable high-level analysis. In addition, quality control makes it possible to monitor and improve the laboratory procedures. The quality of hybridizations is best assessed in the context of normalization. In a model-based approach like vsn, the model is a summary of past experience and our expectations on the data. Thus, it can be used to identify hybridizations or groups of measurements that do not fit. Other methods like loess or quantile normalization place more emphasis on making the data conform in any situation. In these cases, statistics of the data distribution can be calculated (e.g. location and scale of the distribution of normalized log-ratios) and compared against expectations. Moreover, as long as the majority of the data are assumed to be acceptable, outlier detection methods can be used for quality control. Visual inspection of the data is supported by spatial false-colour representations of foreground and background intensities and the log-ratios. This makes it possible to detect scratches and artefacts (Fig. 1a). Most notably, the spatial plots of the normalized data are useful for assessing the necessity of background correction and for assuring spatial homogeneity of the data. Several quality scores are calculated, stored in a report file and are visualized in part. These scores include spot replicate concordance, the correlation of the two colour channels and a robust measure of noise W for each hybridization. W is defined as the median absolute deviation of the normalized log-ratios $q_i$, i.e.
$W = \mathrm{mad}_i(q_i) = \mathrm{median}_i\left(\lvert q_i - \mathrm{median}_j(q_j)\rvert\right)$. A minority of differentially expressed genes should not disturb $W$. We do not find it practical to define universally applicable thresholds on quality scores. They should be evaluated not on the level of a single hybridization, but in the context of all data in the experiment. In our experience this has been very useful in detecting outliers in large-scale experiments. In particular, a global view on all pairwise similarities between all hybridizations shown in Figure 1c has proved to be useful. For two arrays $a$ and $b$, we define a similarity score $S_{ab} = \mathrm{mad}_i(x_{ia} - x_{ib})$, where $x_{ia}$ can be the log-ratio of the $i$-th probe on the $a$-th array, or the log-transformed normalized intensity of an individual colour channel. Especially in the case of biologically related samples, this is an informative measure of similarity. 0 The open source software tool arrayMagic facilitates the analysis of two-colour cDNA microarray data. It aims to provide quality-assured and normalized data. The script-based pipeline supports reproducible batch-like processing. The workflow starts with quantified image scan result files. Several quality scores and diagnostics are calculated and visualized, which offer a broad view. The processed data can be exported as an HTML file or as a tab-delimited file with spot and sample annotation and may serve as input for follow-up analysis in commonly used tools of choice. Naturally, high-level follow-up analysis in the framework of R and Bioconductor is supported by adequate representation of the data. Documentation of all functionality and a step-by-step example following a typical workflow are part of the package. 0 Gentleman,R. (2004) Reproducible research: a bioinformatics case study. Stat. Appl. Genet. Mol. Biol., 3. Gentleman,R., Carey,V.J., Bates,D.J., Bolstad,B.M., Dettling,M., Dudoit,S., Ellis,B., Gautier,L., Ge,Y., Gentry,J. et al.
Normalization of microarray data using a spatial mixed model analysis which includes splines

David Baird, Peter Johnstone and Theresa Wilson

AgResearch,

Microarray technology has been used extensively to survey patterns of gene expression in a range of biological models. Using our own collection of bovine expressed sequence tags (ESTs) we have constructed large cDNA arrays (up to 22 000 ESTs) for use in several of our research projects. For such large arrays it is essential to identify sources of variation and correct for them to allow for robust use of this technology. Through normalization procedures, such variations can be identified and removed to obtain data for follow-on research. The analysis of the microarrays is a two-step analysis: a within-slide analysis aimed at normalization and, if required, standardization, and then a between-slide analysis to estimate the differences between targets and their consistency. Various techniques for normalization have been suggested, including linear regression (Hedenfalk et al., 2001), ratio statistics (Chen et al., 1997), local smoothing (Yang et al., 2002) and analysis of variance (Kerr et al., 2000; Chu et al., 2002). Yang et al. (2002) compared these approaches and suggested a method which allows for differences induced by different print tips. We extend this idea to model the rows and columns over the whole slide and within the print tips, and also autocorrelation in the printing order. This differs from other methodology in that we are able to correct unwanted variation arising from unevenness of the slide surface and scanning efficiency. The usual statistical modelling approach is taken, where all possible sources of noise are jointly fitted in one model, with the need for each term being assessed using the statistical significance of the reduction in remaining unexplained variation. Model terms can be added or removed as required. The fitted model then indicates where useful modification of our protocols and equipment would help minimize variation in future experiments.

METHODS

Amplification of ESTs

… C, washed for 5 min each in (1) 2× SSC, 0.1% SDS, (2) 1× SSC and (3) 0.1× SSC, centrifuged at 500 g for 5 min, dried and scanned.

Allocation of probes to slides

Randomization is a well-known device used to ensure the valid application of significance tests and confidence intervals (Fisher, 1951). Randomization also disarms critics who suggest that an allocation of experimental units has been chosen which is favourable to an author's hypothesis (Cox, 1992). Because of these properties, it is routine in traditional experiments to randomly allocate treatments to the experimental units. In microarray experiments the physical constraints imposed by the storage of probes in 96-well plates and by the microarray printing robots mean that a fully randomized layout is not possible. However, printing the 96-well plates in random order is possible and is justified in that some randomization is better than the alternative of no randomization.

ANALYSIS

Measure of differential expression in probes

The value of M is expected to be randomly distributed around a mean value of 0. Other approaches to handling values close to or below background can be used. One option is to make no background correction, which will shrink all values of M towards zero, with large reductions for spots of low intensity and minimal reductions for spots of high intensity. This has the advantage of reducing the variation of low-intensity spots, but the disadvantage of reducing sensitivity for identifying differentially expressed ESTs with low expression levels.
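To make the effect of omitting background correction concrete, the following minimal Python sketch compares log ratios computed with and without background subtraction. It assumes that M is the base-2 log ratio of the two channel intensities. The function name `log_ratio`, the spot intensities and the shared background of 100 are hypothetical values chosen for illustration, not data from the arrays described here. Returning a missing value whenever either corrected intensity is non-positive is a simplification and is not the treatment of low intensities used in this paper.

```python
import math


def log_ratio(red, green, red_bg=0.0, green_bg=0.0):
    """Return M = log2(corrected red / corrected green).

    Returns None (a missing value) when either background-corrected
    intensity is not positive. This is a simplifying assumption.
    """
    r = red - red_bg
    g = green - green_bg
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)


# Hypothetical spots: (red signal, green signal), each with a true
# two-fold difference, measured on top of a shared background of 100.
background = 100.0
spots = {"low intensity": (50.0, 25.0), "high intensity": (20000.0, 10000.0)}

for name, (red_signal, green_signal) in spots.items():
    red = red_signal + background      # observed foreground, red channel
    green = green_signal + background  # observed foreground, green channel
    corrected = log_ratio(red, green, background, background)
    uncorrected = log_ratio(red, green)  # no background correction
    print(f"{name}: corrected M = {corrected:.3f}, uncorrected M = {uncorrected:.3f}")
```

Under these assumed values both spots have M = 1 after correction, while the uncorrected M is roughly 0.26 for the low-intensity spot and roughly 0.99 for the high-intensity spot, in line with the larger shrinkage expected at low intensity.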
Any spatial trends not eliminated in the log ratios due to trends in the background can be estimated and removed as part of the spatial model, as explained later in this paper. Another alternative, suggested by Durbin and Rocke (2003) in the context of transforming a single channel's expression, is to add a constant to all values in each channel as part of a more complex transformation. The constant to be added in the Durbin and Rocke approach is estimated as that giving the best stabilized error variance. For large expression values, these approaches have virtually no effect on the log ratio, but for values just below and above the minimum cut-off, the relative differences between the approaches may be substantial. The advantage of using logs over more complicated transforms is that the resulting values are more naturally interpreted by the experimenter. Which approach is best, in terms of giving unbiased results, can only be ascertained by a uniform study, which is not available in our current datasets.

We have used a value of 0.5 for k, but have tried values between 0.1 and 1.0. The value of k controls how much the information on the probe is down-weighted, with larger values reducing the value of M towards 0. If both dyes have negative corrected intensities, then there is no information in the probe, and M is set to be a missing value. It is expected that the majority of probes in the sample will show no differential expression.

Within slide dye bias

It is typically found that the mean of M at a certain level of log-intensity depends on the level of intensity of the probe. If we define A as the average log-intensity of the probe, then a plot of M versus A [an MA plot (Dudoit et al., 2002)] often shows a departure from the zero reference line. It is expected that the level of differential expression is independent of the brightness of the probes. Figure 1 shows the MA plot for one of our microarrays. It can be seen that the m