scientific comment
Acta Crystallographica Section D
Biological Crystallography
ISSN 0907-4449

WWWWhy does nature stutter? A survey of strands of repeated amino acids
Edgar F. Meyer* and W. John Tollett Jr†

Human stuttering is a simple example of the repetition of sounds or symbols, sometimes associated with single letters, and may be used to illustrate the amazing repetition of amino acids (symbolized by a letter, e.g. W) in proteins. A survey of available databases with highly improbable strings of single amino acids is tabulated. This paper concludes with a challenge to the crystallographic community to probe the structural origins of the structure–function relationship in this neglected area. When nature stutters, we should pay attention.

† Current address: A&M Consolidated High School, College Station, TX 77840, USA.

Introduction

That 34 virus structures were detected suggests that this model may be overly simplistic and that crosscorrelations may occur, but our purpose here is to report a finding and encourage others to explore its implications, be they probabilistic, statistical, genetic, functional or structural.

International Union of Crystallography. Printed in Denmark – all rights reserved

As gene, protein and structural databases were searched, who would have guessed that 67 consecutive threonines would be found in Cryptosporidium parvum (Barnes et al., 1998)? The probability of 67 repeats in a random sequence at a specific site is approximately 1 in $20^{67}$, i.e. $1/(1.5 \times 10^{87})$ events; the difference in probabilities is exponentially significant. Even though this statistical approximation begs for a more rigorous treatment, it is amazing. WWWWhat is nature telling us? Long consecutive strands of positively or negatively charged amino acids must carry electrostatic penalties, yet these too abound.
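The order of magnitude quoted above can be checked with a short calculation. This is only a sketch under the same simplifying assumption as the text's approximation, namely 20 equally likely and independent amino acids at every position; the function name is illustrative and does not come from the paper.

```python
import math


def log10_odds_of_repeat(run_length: int, alphabet_size: int = 20) -> float:
    """Return log10 of alphabet_size**run_length.

    Assumes every residue is drawn independently and uniformly from the
    alphabet, which is the naive model used for the estimate in the text.
    """
    return run_length * math.log10(alphabet_size)


# 67 consecutive threonines at one specific site
log10_odds = log10_odds_of_repeat(67)
exponent = math.floor(log10_odds)
mantissa = 10 ** (log10_odds - exponent)
print(f"about 1 in {mantissa:.1f} x 10^{exponent}")
```

Working in log10 keeps the numbers readable; Python integers could also evaluate `20**67` exactly if the full value were needed.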
In a nuclear transport protein (PDB code 1qbk), polyaspartate is augmented by two glutamates to create a startling exposed strand of 14 consecutive negatively charged residues. Intuitively, one could assume that uncharged amino acids would be more likely to occur repetitively, but polymethionine also has a relatively low occurrence (7). Because of pronounced peptide backbone angular constraints, proline was considered to be a 'helix breaker', but polyPro actually forms a left-handed helix (1jvr). In HIV-1 reverse transcriptase (1c9r; residues 315–326), an extended polyAla strand is parallel to an α-helix that is also rich in Ala. Conversely, a 12-Ala repeat forms a cluster of three α-helices at the tip of a tumor necrosis factor receptor (1czz). At this stage, it appears that while polyPro may be structurally conserved, polyAla is not. PolyCys is one of the few repeat sequences which is generally buried, forming a tight trimer knot in a spider toxin (1qdp), a triple S–S knot (1ag8), and a tight buried loop central to an amazing chain of seven S–S linkages in the ferric hydroxamate uptake receptor (1cw3, 1a4z). These searches reveal a wide range of structures, populations and probabilities, summarized by abbreviated tables [tables also

Acta Cryst. (2001). D57, 181–186

Table 1
GenBank results, 23 June 2000.

=$key&id=1); the related Chime links will make the structural results more readily accessible to a broader audience]. While some entries of gene sequences are deposited without comment and/or literature citation (Table 1), many protein sequence entries (e.g. PIR, SwissProt, EMBL) are cited (Table 2) and imply functional roles.
Although smallest in size, the Protein Data Bank (Bernstein et al., 1977; Meyer, 1997;

Amino acid
Alanine

Residues
129–148 129–148 497–517 497–517 241–260 241–260 241–260 241–260 241–260 13–42 138–187 24–69 720–768 266–311 50–95 777–822 285–325 11–33 1856–1900 362–402 152–191 58–95 58–95

GenBank ID#
GBINV:DMJ001164 GBINV:AE003814 GBINV:DMU11383 GBINV:DMOVO GBPRI:AF117979 GBPRI:D82344 GBROD:MMPHOX2B GBROD:AB015672 GBPRI:AB015671 GBPRI:HUMFMR1 GBINV:DDU38197 GBINV:AF019981 GBINV:DDI238883 GBINV:AF104350 GBINV:AE001416 GBINV:AE001418 GBPLN:F11A17 GBPRI:HSU63332 GBINV:AF153362 GBVRT:CCJ002238 GBPRI:HSU80741 GBPRI:HUMTFIIDA GBPRI:HS191N21

Arginine Asparagine

GBPRI:HUMTFIID GBINV:AF024654 GBINV:AE003446 GBROD:MMJ225123 GBROD:AF028737 GBPLN:SCYBR289W GBPLN:SCDPB3 GBPLN:YSCSNF5 GBINV:AE003536 GBPLN:ATF17C15 GBPLN:ATF23E13 GBPLN:ATCHRIV85 GBPRI:HUMARB GBPRI:L29496 GBPRI:HSU16371 GBPLN:ATAC011708 GBINV:AE003451 GBINV:AE003430 GBINV:DMSEG0007 GBVRL:AF169823 GBINV:CELC15C7 GBSYN:AF025672

ANALYTICAL BIOCHEMISTRY

Effects of relative humidity and buffer additives on the contact printing of microarrays by quill pins
Mark K. McQuain,a Kevin Seale,b Joel Peek,b Shawn Levy,c and Frederick R. Haseltona,*

Abstract
DNA microarrays printed with quill pins exhibit significant variation in probe DNA spots. Interspot variations and nonuniform distribution of probe within spots are major sources of experimental uncertainty in microarray analysis. To gain better insight into the sources of variation, we analyzed 450 consecutive depositions printed at relative humidities between 40 and 80% using three print buffers. Increasing relative humidity improved printing performance by delaying pin failure but did not reduce the variability in spot characteristics. Adding either betaine or dimethyl sulfoxide (DMSO) to the print buffer also improved quill pin performance.
Least interspot variation was observed with the DMSO additive printed at 80% relative humidity, but this additive also resulted in the greatest intraspot variation. Least intraspot variation was observed with 1.5 M betaine printed at 60% relative humidity, but these conditions produced microarrays with high interspot variability. Evaporation of printing solution from the quill reservoir appeared to be the primary cause of interspot and intraspot variations. Our studies indicate that relative humidity and printing solution additives reduce evaporation. Based on the spot variability requirements for a particular application, humidity and additives may be chosen to optimize either inter- or intraspot variability. © 2003 Elsevier Science (USA). All rights reserved.

Keywords: DNA microarrays; Microfluidics

DNA microarrays are important tools for obtaining high-throughput genetic information and are often used for expression profiling, gene copy estimation, and polymorphism analysis [1-11]. Though they have been applied successfully in many research applications, there are significant problems which limit their use to qualitative analysis of large signal changes. To compensate for experimental variability, almost all current microarray analyses rely on differential measurement techniques that assess results compared to a reference [12]. Analysis is often focused on the most reliable and repeatable portions of the data [13]. The difficulty in interpreting the remaining data is usually attributed to a variety of factors, including inter- and intraspot variations [14,15].

Abbreviations used: SSC, standard saline citrate; DMSO, dimethyl sulfoxide; R.H., relative humidity; RFU, relative fluorescence unit.

interest to be captured and stored electronically. Length calibration was achieved using a laser-etched reference grid positioned to achieve sharp focus at the same height as the point of pin contact with the printing surface.
Scanning of multiple spots printed manually or robotically

For manual printing, the video microscope apparatus described above was used. Depositions of a freshly loaded pin were recorded over the course of a 10-min period at the rate of one deposition every 3 s. For robotic printing, a commercial robot (designed by

Comparative effects of levosulpiride and cisapride on gastric emptying and symptoms in patients with functional dyspepsia and gastroparesis

Background: The efficacy of several prokinetic drugs on dyspeptic symptoms and on gastric emptying rates is well-established in patients with functional dyspepsia, but formal studies comparing different prokinetic drugs are lacking.
Aim: To compare the effects of chronic oral administration of cisapride and levosulpiride in patients with functional dyspepsia and delayed gastric emptying.
Methods: In a double-blind crossover comparison, the effects of a 4-week administration of levosulpiride (25 mg t.d.s.) and cisapride (10 mg t.d.s.) on the gastric emptying rate and on symptoms were evaluated in 30 dyspeptic patients with functional gastroparesis. At the beginning of the study and after levosulpiride or cisapride treatment, the gastric emptying time of a standard meal was measured by 13C-octanoic acid breath test. Gastrointestinal symptom scores were also evaluated.
Results: The efficacy of levosulpiride was similar to that of cisapride in significantly shortening (P < 0.001) the t1/2 of gastric emptying. No significant differences were observed between the two treatments with regard to improvements in total symptom scores. However, levosulpiride was significantly more effective (P < 0.01) than cisapride in improving the impact of symptoms on the patients' every-day activities and in improving individual symptoms such as nausea, vomiting and early postprandial satiety.
Conclusion: The efficacy of levosulpiride and cisapride in reducing gastric emptying times with no relevant side-effects is similar.
The impact of symptoms on patients' every-day activities and the improvement of some symptoms such as nausea, vomiting and early satiety were more evident with levosulpiride than cisapride.

Prokinetic drugs have been extensively tested in the treatment of functional dyspepsia. This is because gastrointestinal motor abnormalities and, in particular, delayed gastric emptying have been frequently reported in patients suffering from this common syndrome.1–6 These abnormalities are regarded as a likely source of symptoms even if no clear cause–effect relationship between severity of symptoms and degree of delay in gastric emptying has been proven to date.7 Among prokinetic drugs, several placebo-controlled trials have provided evidence on the efficacy of cisapride and dopamine receptor antagonists such as metoclopramide, domperidone, and recently levosulpiride in the treatment of functional dyspepsia.8–28 Metoclopramide, domperidone and levosulpiride have both antiemetic and prokinetic properties because they antagonize dopamine receptors in the central nervous system as

C. MANSI et al.

© 2000 Blackwell Science Ltd, Aliment Pharmacol Ther 14, 561–569

MATERIALS AND METHODS

LEVOSULPIRIDE AND CISAPRIDE IN FUNCTIONAL DYSPEPSIA

impact on every-day activities was scored as: 0, not at all bothersome; 1, a little bit bothersome; 2, moderately bothersome; 3, quite a bit bothersome; 4, extremely bothersome. The cut-off values of symptom scores for inclusion in the study were established on the basis of the data obtained by the same questionnaires filled in by 200 healthy volunteers (84 males, 116 females, aged 42 ± 4 years). A score decrease of at least 50% was defined as a 'symptom improvement'. The reproducibility of the symptom questionnaire had previously been validated in 40 patients with functional dyspepsia. The score evaluation of their symptoms was performed by the patients themselves on two separate occasions (2–4 weeks apart).
The calculated κ-values were 0.84 for total severity scores, whereas scores for frequency, duration and impact were 0.72, 0.69, and 0.87, respectively.

Gastric emptying studies

Gastric emptying time was measured by means of 13C-octanoic acid breath test as previously described.34 This test was performed during the run-in period and at the end of each treatment. Patients were given a standard test meal consisting of one egg with 5 g of butter, two slices of white bread and 150 mL of water; 100 mg 13C-octanoic acid (Cortex Italia, Milan, Italy) was incorporated into the homogenized egg yolk, which was baked separately from the egg white. For practical reasons, the test meal was given at 13.00 hours, after an overnight fast, and eaten in 10 min. In order to interfere as little as possible with the subjects' normal eating habits, they were allowed to eat a light breakfast restricted to 100 mL of milk alone with 10 g of sugar at 07.00/08.00 hours. Females were studied during the first 10 days of the menstrual cycle. Breath samples were collected just before, and every 15 min after the test meal for 6 h; 13CO2 measurements were performed with an isotope ratio mass spectrometer

THE THERMODYNAMICS OF DNA STRUCTURAL MOTIFS
John SantaLucia, Jr.1,2 and Donald Hicks2

Key Words: secondary structure, prediction, hybridization, oligonucleotides, nucleic acid folding

Abstract
DNA secondary structure plays an important role in biology, genotyping diagnostics, a variety of molecular biology techniques, in vitro-selected DNA catalysts, nanotechnology, and DNA-based computing. Accurate prediction of DNA secondary structure and hybridization using dynamic programming algorithms requires a database of thermodynamic parameters for several motifs including Watson-Crick base pairs, internal mismatches, terminal mismatches, terminal dangling ends, hairpins, bulges, internal loops, and multibranched loops.
To make the database useful for predictions under a variety of salt conditions, empirical equations for monovalent and magnesium dependence of thermodynamics have been developed. Bimolecular hybridization is often inhibited by competing unimolecular folding of a target or probe DNA. Powerful numerical methods have been developed to solve multistate-coupled equilibria in bimolecular and higher-order complexes. This review presents the current parameter set available for making accurate DNA structure predictions and also points to future directions for improvement.

Loop Database
Hairpin Loops
Internal Loops
Bulges
Coaxial Stacking Parameters
Multibranched Loops
QUALITY OF SECONDARY STRUCTURE PREDICTIONS
MULTISTATE MODELING OF DNA FOLDING AND HYBRIDIZATION
FUTURE DIRECTIONS

INTRODUCTION

Biological Importance of DNA Secondary Structure

Molecular Biology and Biotechnology Applications of DNA Secondary Structure

THERMODYNAMICS OF DNA MOTIFS

of biotechnology techniques that exploit the three-dimensional folding potential of DNA have also been demonstrated including DNA nanotechnology (75) and DNA computing (21).

The DNA Folding Problem

Similar to the protein and RNA folding problems, there is a corresponding "DNA folding problem" in which it is desired to predict the structure and folding energy of the DNA given its sequence. Fortunately, several features of DNA and RNA make them especially amenable to structure prediction. Notably, DNA and RNA secondary structures result from strong Watson-Crick pairing interactions, and tertiary interactions are a weaker second-order effect (81). Thus, to an excellent approximation, tertiary interactions may be neglected and accurate secondary structure prediction is possible. The strong pairing rules also allow for the DNA secondary structure to be reduced to discrete interactions in which two positions in a sequence are either paired or not. Even with the neglect of tertiary interactions such as pseudoknots, however, the number of possible secondary structures is approximately $1.8^N$, where N is the sequence length (95). Fortunately, with the discrete pairing approximation, DNA and RNA are suitable for powerful dynamic programming algorithms, which were described in a previous review (83). Dynamic programming algorithms guarantee that for a given set of rules, the minimum energy structure (i.e., optimal) will be found in computation time of order $N^3$ with memory of order $N^2$, thereby allowing predictions of sequences with fewer than 10,000 nucleotides with currently available computers. Dynamic programming algorithms also predict suboptimal structures within user-defined energy and distance windows (94). This is important because the energy rules are not perfect and tertiary interactions are neglected (as are interactions with proteins and the specific interactions with magnesium or other cofactors). Thus, one of the few structures near the free-energy minimum is likely to be correct. It is important to note the difference between selected functional sequences and random sequences of DNA or RNA.
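The cubic-time, quadratic-memory scaling described above can be illustrated with a deliberately simplified dynamic programme. The sketch below is a Nussinov-style base-pair maximization, not the free-energy minimization with nearest-neighbor parameters that the review discusses. The scoring rule (one point per Watson-Crick pair), the minimum hairpin loop of three unpaired bases, and all names are assumptions made purely for illustration.

```python
# Assumed scoring: only canonical Watson-Crick DNA pairs count, one point each.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested Watson-Crick pairs in seq (no pseudoknots).

    best[i][j] holds the optimum for the subsequence seq[i..j], so the table
    needs O(N^2) memory; filling each cell scans O(N) split points, which
    gives O(N^3) time overall.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):      # span = j - i
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, leaving at least
            # min_loop unpaired bases between k and j.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in WATSON_CRICK:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # a three-pair hairpin with an AAA loop -> 3
```

Replacing the pair count with motif free energies, and keeping a traceback, is what turns this skeleton into an energy-based predictor; the table shape and loop structure stay the same.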
Random sequences have a low probability of folding into compact three-dimensional structures stabilized by tertiary interactions; thus random sequences are most amenable to secondary structure prediction because the neglect of tertiary interactions is appropriate. On the other hand, selected sequences (selected either by evolution or by in vitro selection, or rationally designed) are more likely to contain tertiary interactions, which compromise the reliability of the secondary structure prediction algorithms. This difference makes DNA folding much easier to predict (for random sequences) than corresponding biologically selected RNAs. Note that dynamic programming algorithms also neglect kinetically trapped structures and assume structures are populated according to an equilibrium Boltzmann distribution; thus the structures close to minimum free energy are most probable. Recently, we have also extended the dynamic programming algorithm to predict bimolecular optimal and suboptimal structures so that match and mismatch hybridizations of a short probe to long-target DNA may be readily identified on

Overview of the DNA Thermodynamic Database

Dynamic programming algorithms for DNA secondary structure predicti

Articles

Nearest-Neighbor Thermodynamics and NMR of DNA Sequences with Internal A·A, C·C, G·G, and T·T Mismatches
Nicolas Peyret, P. Ananda Seneviratne, Hatim T. Allawi, and John SantaLucia, Jr.*

ABSTRACT: Thermodynamic measurements are reported for 51 DNA duplexes with A·A, C·C, G·G, and T·T single mismatches in all possible Watson-Crick contexts. These measurements were used to test the applicability of the nearest-neighbor model and to calculate the 16 unique nearest-neighbor parameters for the 4 single like-with-like base mismatches next to a Watson-Crick pair. The observed trend in stabilities of mismatches at 37 °C is G·G > T·T ≈ A·A > C·C.
The observed stability trend for the closing Watson-Crick pair on the 5′ side of the mismatch is G·C ≥ C·G ≥ A·T ≥ T·A. The mismatch contribution to duplex stability ranges from -2.22 kcal/mol for GGC·GGC to +2.66 kcal/mol for ACT·ACT. The mismatch nearest-neighbor parameters predict the measured thermodynamics with average deviations of ΔG°37 = 3.3%, ΔH° = 7.4%, ΔS° = 8.1%, and TM = 1.1 °C. The imino proton region of 1-D NMR spectra shows that G·G and T·T mismatches form hydrogen-bonded structures that vary depending on the Watson-Crick context. The data reported here combined with our previous work provide for the first time a complete set of thermodynamic parameters for molecular recognition of DNA by DNA with or without single internal mismatches. The results are useful for primer design and understanding the mechanism of triplet repeat diseases.

DNA mismatches occur in vivo due to misincorporation of bases during replication (1), heteroduplex formation during homologous recombination (2), mutagenic chemicals (3, 4), ionizing radiation (5), and spontaneous deamination (6). Knowledge of the thermodynamics of DNA mismatches will be useful for elucidating the mechanisms of polymerase fidelity and mismatch repair efficiency. Moreover, thermodynamic parameters for mismatch formation are important for DNA secondary structure prediction (see http://sun2.science.wayne.edu/jslsun2 and http://mfold1.wustl.edu/mfold/dna/form1.cgi). Recent work has shown that triplet repeat sequences form transiently stable hairpins that contain like-with-like base mismatches (7-14). The formation of these secondary structures can induce genome expansion or deletion during replication (15, 16) resulting in at least 11 different human diseases (17-19).
Mismatch thermodynamics is also important for molecular biological techniques such as PCR (20), Southern blotting (21), single-stranded conformational polymorphism (SSCP) (22-24), sequencing by hybridization (25, 26), antigene targeting (27), Kunkel site-directed mutagenesis (28), and optimization of DNA chip arrays for diagnostics (29). These techniques require optimization of sequence, temperature, and solution conditions to avoid detection or amplification of wrong sequences. Previous work from our laboratory has shown that an NN model is valid to describe the thermodynamics of DNA structures involving canonical A·T and G·C base pairs (30-32) as well as G·T (31), G·A (33), C·T (34), and A·C (35) mismatches. We hypothesized that the nearest-neighbor model is also applicable to single A·A, C·C, G·G, and T·T mismatches. To test this hypothesis, thermodynamic measurements of 45 sequences combined with 6 from the literature (36, 37) were used to derive NN parameters for like-with-like base mismatches. 1-D NMR and CD studies were used to qualitatively probe the structures formed by the mismatches. These data combined with our previous results provide a complete thermodynamic database for DNA molecular recognition by DNA with or without single internal mismatches.

MATERIALS AND METHODS

DNA Synthesis and Purification. Oligonucleotides were graciously provided by Hitachi Chemical Research and were synthesized on solid support using standard phosphoramidite chemistry (38). Oligonucleotides were detached from the

Abbreviations: Na2EDTA, disodium ethylenediaminetetraacetate; eu, entropy unit; MES, 2-(4-morpholino)ethane sulfonate; NMR, nuclear magnetic resonance; NN, nearest-neighbor; SVD, singular value decomposition; TLC, thin-layer chromatography; UV, ultraviolet.

$$Y^\circ_{\text{total}} = Y^\circ_{\text{initiation}} + Y^\circ_{\text{sym}} + 2Y^\circ(\text{GG/CC}) + 2Y^\circ(\text{GA/CT}) + 2Y^\circ(\text{AG/TC}) + 2Y^\circ(\text{GT/CT}) \quad (2)$$

The notation GT/CT refers to a 5′GT3′ dimer hydrogen bonded to a 3′CT5′ dimer with the mismatch underlined.
The mismatch contribution to duplex stability is given by rearranging eq 2:

$$2Y^\circ(\text{GT/CT}) = Y^\circ_{\text{total}} - Y^\circ_{\text{initiation}} - Y^\circ_{\text{sym}} - 2Y^\circ(\text{GG/CC}) - 2Y^\circ(\text{GA/CT}) - 2Y^\circ(\text{AG/TC}) \quad (3)$$

Thus, the mismatch contribution is calculated by subtracting the initiation, symmetry, and Watson-Crick nearest-neighbor increments (31) from the total experimental value.

Number of Linearly Independent Parameters. In our previous studies of G·T, G·A, A·C, and C·T single mismatches, we showed that it is impossible to uniquely solve for eight dimer nearest neighbors from a data set of oligomers containing only single internal mismatches (31). Instead, within the limits of the nearest-neighbor model, only seven linearly independent trimers are sufficient to accurately predict internal mismatch thermodynamics. In the case of single like-with-like base mismatches (i.e., A·A, C·C, G·G, and T·T), however, symmetry allows for a unique solution of four internal nearest-neighbor dimers to be found. In particular, the dimer nearest neighbors can be uniquely solved from sequences that contain these trimers:

where X = A, C, G, or T. According to the nearest-neighbor model, any sequence with an internal X·X mismatch can be determined from linear combinations of eqs 4a-d. It should be noted, however, that even though it is possible to uniquely solve for the X·X dimer nearest-neighbor parameters from a set of oligonucleotides with only internal mismatches, these parameters cannot be used to accurately predict the thermodynamics of duplexes with terminal mismatches. As we found earlier (31), terminal mismatches always make favorable contributions to dup

REVIEW ARTICLE

The marks, mechanisms and memory of epigenetic states in mammals
Vardhman K. RAKYAN, Jost PREIS, Hugh D. MORGAN and Emma WHITELAW1

It is well recognized that there is a surprising degree of phenotypic variation among genetically identical individuals, even when the environmental influences, in the strict sense of the word, are identical.
Genetic textbooks acknowledge this fact and use different terms, such as 'intangible variation' or 'developmental noise', to describe it. We believe that this intangible variation results from the stochastic establishment of epigenetic modifications to the DNA nucleotide sequence. These modifications, which may involve cytosine methylation and chromatin remodelling, result in alterations in gene expression which, in turn, affect the phenotype of the organism. Recent evidence, from our work and that of others in mice, suggests that these epigenetic modifications, which in the past were thought to be cleared and reset on passage through the germline, may sometimes be transmitted to the next generation. This is termed epigenetic inheritance, and while this process has been well recognized in plants, the recent findings in mice force us to consider the implications of this type of inheritance in mammals. At this stage we do not know how extensive this phenomenon is in humans, but it may well turn out to be the explanation for some diseases which appear to be sporadic or show only weak genetic linkage.

Key words: chromatin, inheritance, methylation.

The various cell types in a multicellular organism are genotypically identical and yet phenotypically different. This is due to differences in the patterns of gene expression that exist between the different cell groups. The stable maintenance of these differences is thought to be due to epigenetic control of gene expression. This involves physically 'marking' the DNA, without altering the nucleotide sequence, either by the addition of methyl groups to certain cytosine bases and/or the packaging of the DNA into a highly condensed state. These modifications interfere with the DNA-protein interactions that facilitate transcription, resulting in transcriptional silencing of the epigenetically modified allele. Epigenetic modifications can, therefore, cause phenotypic variation in the absence of genetic differences.
It is well recognized that 'silenced' alleles can be inherited through many rounds of DNA replication, and therefore epigenetic modifications or 'marks' can be maintained through mitotic cell divisions. Generally, however, it has been assumed that these marks are erased and reset at some stage during gametogenesis or early embryogenesis to reinstate the totipotency of the developing embryo. There is now an increasing body of evidence which suggests that epigenetic marks at some mammalian alleles are not completely erased from one generation to the next, resulting in complex patterns of inheritance that do not conform to Mendelian principles. Therefore not only can phenotype vary in the absence of genetic and environmental factors, described by some as 'intangible variation' [1] or 'developmental noise' [2], but these phenotypic differences can also be inherited by the offspring. This review will present a brief overview of the role of methylation and chromatin remodelling in epigenetic regulation of gene expression, followed by examples of classic epigenetic phenomena in mammals. We will then discuss the evidence available for epigenetic inheritance through the germline, with an emphasis on murine models, which suggest that this form of inheritance may be occurring at a number of mammalian loci.

EPIGENETIC MODIFICATIONS OF DNA

The two mechanisms by which DNA is epigenetically marked, although there may be others yet to be discovered, are methylation and chromatin condensation. Both of these mechanisms are associated with gene silencing, and recent evidence, discussed below, suggests that these two mechanisms are not mutually exclusive, but instead act in concert to silence gene expression in mammalian cells.

DNA methylation

Methylation involves the enzymic transfer of a methyl group to the 5-position of the pyrimidine ring of a cytosine residue [3-5].
This usually occurs at cytosine bases that are immediately followed by a guanine, known as CpG dinucleotides [6,7]. In mammalian genomes, the CpG dinucleotide is greatly underrepresented due to the increased spontaneous deamination rate of 5-methylcytosine into thymine. Of the CpGs present, approx. 70% are methylated [8], whereas the majority of unmethylated CpGs occur in small clusters known as CpG islands, which are ordinarily found within or near promoters or first exons of 'housekeeping' genes [9,10]. Methylation is catalysed by DNA methyltransferases (Dnmts) and four mammalian Dnmts have been identified so far, Dnmt1

the vicinity and reassociating with the newly assembled chromatin following DNA replication. Evidence for this mechanism comes from the observation that some HATs form part of a complex that remains associated with its target DNA throughout the cell cycle [42-44]. A second mechanism may involve targeting the HATs and HDACs to regions of methylated DNA, so that preexisting acetylation patterns are propagated along with methylation patterns during DNA replication. Indeed, it has recently been discovered that the maintenance methylase, Dnmt1, can interact with a histone deacetylase [45-47].

Dnmt2 [12], Dnmt3A and Dnmt3B [13], although our understanding of how these enzymes function is sketchy at best. Dnmt1 is probably involved in maintaining methylation patterns through mitosis [14]. Following DNA replication, the two double-stranded daughter molecules initially contain a hemi-methylated CpG pattern, which is recognized and converted into the fully methylated parental pattern by Dnmt1 [15]. However, it has been found that the error rate of replication of methylation patterns of an artificially methylated DNA sequence transfected into cell lines is significantly higher than that observed for DNA replication [16,17].
In addition, a later study [18] showed that clonal populations of histologically homogenous cells did not have homologous methylation patterns. These findings have been confirmed by more recent work, using the highly sensitive bisulphite conversion method to analyse methylation patterns in vivo [19,20]. Therefore the infidelity of replication of methylation patterns has the potential to generate phenotypic diversity among genetically identical cells of the same lineage. Dnmt2 may play a role in epigenetic control of centromere function [21], and Dnmt3A and 3B are thought to be de novo methylases which set up the initial patterns of methylation during embryogenesis [22]. However, data suggest that Dnmts have overlapping functions [23,24], and the precise role of any particular Dnmt is determined by the cellular context. During mammalian development, there are 'waves' of extensive demethylation of the genome in the primordial germ cell stage and pre-implantation embryo [25-28]. A mammalian protein with specific demethylase activity for CpG dinucleotides has been reported [29,30], although it remains to be fully characterized biochemically.

Epigenetic regulation of transcription

The precise mechanisms by which methylation and chromatin compaction regulate transcription are unclear, although several studies suggest that these two mechanisms are linked. MECP2 (methyl-CpG binding protein 2) is a transcriptional repressor that selectively recognizes methylated CpG dinucleotides [48,49]. MECP2, and other methyl-CpG binding proteins, associate with co-repressor complexes that include HDACs [50-53]. This directs the formation of stable repressive chromatin structures [54]. Recent findings [51,52] link the four different methyl-CpG binding domain (MBD) proteins, MECP2, MBD1, MBD2 and MBD3, with the chromatin-remodelling machinery, providing further evidence for the association between methylation and chromatin remodelling.
Therefore it seems that methylation acts through histone deacetylation to establish a repressive chromatin state that blocks the access of the transcription machinery, although at present we do not know how the initial patterns of methylation are set up de novo. However, for certain organisms, e.g. Drosophila, methylation is observed only in very early embryogenesis [55] (for decades it was believed that DNA methylation was non-existent in Drosophila), and others, like the yeast Schizosaccharomyces pombe, do not methylate their DNA at all. Therefore in some eukaryotic organisms chromatin-mediated mechanisms alone may be sufficient to mediate epigenetic regulation of gene expression. 0 Chromatin packaging 0 In the nucleus, DNA exists as a nucleoprotein complex termed chromatin. Chromatin is assembled from arrays of nucleosomes, each of which is approx. 200 bp of linear DNA wrapped around an octamer of histone proteins. Two distinct types of chromatin are known, heterochromatin and euchromatin. Heterochromatin is believed to represent regions of DNA-protein complexes that are in a tightly packed conformation [31,32]. Constitutive heterochromatin is usually found at the centromeric and subtelomeric regions of chromosomes 0 Spot shape modelling and data transformations for microarrays 1 Claus Thorn Ekstrom1, Soren Bak2, Charlotte Kristensen2 and Mats Rudemo1 0 Department 0 Present address: Poalis A/S, Buelowsvej 25, 1870 Frederiksberg C, Denmark 0 In order to study lowly expressed genes in microarray experiments, it is useful to increase the photometric gain in the scanning. However, a large gain may cause some pixels for highly expressed genes to become saturated, i.e. the registered pixel values become censored at the upper limit, which with 16-bit precision is $2^{16} - 1 = 65535$.
Techniques for adjustment of highly expressed signal intensities are given in Wit and McClure (2003) based on a small set of available spot summaries, such as spot mean, spot median and spot variance. As mentioned in Wit and McClure (2003), it should be possible to get more accurate adjustments when all pixel values are available. In the present paper, we study spatial statistical models for pixel values that should enable such adjustments. A convenient type of modelling is to transform data to become approximately Gaussian distributed with a mean value function determined by gene intensities and spot shapes and a corresponding covariance function. For such models, censored pixel values can be estimated optimally. We investigate several types of transformations on the pixel level such as the logarithmic transformation, the Box-Cox family (Box and Cox, 1964) and the inverse hyperbolic sine transformation (Huber et al., 2002; Durbin et al., 2002), also called the generalized logarithm (Rocke and Durbin, 2003). The inverse hyperbolic sine transformation has proven useful for analyzing microarray spot intensities, but here we apply it at the pixel level. The Box-Cox transformation with exponent 0.5, i.e. a square root transformation optimal for Poisson distributed counts, has been used in pixel-level analysis of microarray data by Glasbey and Ghazal (2003). The spot shapes studied include three types suggested by Wierling et al. (2002): (i) a cylindric plateau spot distribution, (ii) an isotropic two-dimensional (2D) Gaussian distribution and (iii) a crater spot distribution consisting of a difference between two scaled isotropic 2D Gaussian distributions. These models do not seem to provide a satisfactory description for the dataset considered, and we introduce a new class of models with polynomial-hyperbolic spot shape. With a second degree polynomial we get a considerably improved performance.
This spot shape may be regarded as a generalization of the cylindric plateau spot shape. 0 Spot shape models and transformations 0 The models are applied to a dataset obtained with a specially designed spotted 50mer oligonucleotide microarray. Here, the expression of 452 selected genes in transgenic Arabidopsis plants is compared with that of the corresponding genes in wildtype plants. Data include scans with different photometric gains ranging from no saturation to heavy saturation. 0 where 1 > 0, and an inverse hyperbolic sine transformation 0 DATA, TRANSFORMATIONS AND EXPLORATORY ANALYSIS Materials 0 Y = k arsinh 0 SPOT SHAPE MODELS 0 Based on empirical observations of spot intensity profiles as seen in Figure 1 as well as in Duggan et al. (1999) (Fig. 2) and Glasbey and Ghazal (2003) (Fig. 1), we desire a spatial spot shape model to have the following three properties: (i) it should be isotropic, i.e. the average intensity at a pixel x should depend only on the distance from x to the spot centre and not on the direction from the centre; (ii) it should allow for spot shapes resembling both `volcanos/craters/donuts' and `plateaus', since spot intensities are often highest near the edge of the spot and smaller near the spot centre, making the resulting spot shape resemble a volcano (middle panel of Fig. 1); and (iii) it should allow for spatial correlation, i.e. pixels close together and with the same distance from the spot centre should be more correlated than pixels further apart. 0 Let Z = Z(x) denote the intensity of a pixel x. Here, Z is a 16-bit integer, i.e. $0 \le Z \le 2^{16} - 1 = 65535$. Let Y (x) denote a transformation of Z(x), Y (x) = f (Z(x), ), (1) 0 where f (·, ) is a family of transformations depending on the parameter vector . In the following, we shall consider three transformations: A logarithmic transformation Y = k log(Z + 1 ), (2) 0 C.T.Ekstrom et al.
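The pixel-level transformations and the three spot shapes of Wierling et al. (2002) named above can be illustrated with a minimal sketch. All function and parameter names (`k`, `shift`, `scale`, `height`, `radius`, `sigma`) are hypothetical; in particular, the argument of the arsinh transformation is truncated in the text, so the `Z / scale` form used here is an assumption, and the Box-Cox function uses the standard textbook definition. The polynomial-hyperbolic spot shape is not sketched because its form is not given in this section.

```python
import math

SATURATION = 2**16 - 1  # 65535, upper limit of a 16-bit pixel value


def censor(z):
    """A saturated scanner reports at most the 16-bit upper limit."""
    return min(z, SATURATION)


def log_transform(z, k=1.0, shift=1.0):
    """Logarithmic transformation Y = k * log(Z + shift), with shift > 0."""
    return k * math.log(z + shift)


def box_cox(z, exponent=0.5):
    """Standard Box-Cox transformation; exponent 0.5 is a square-root-type transform."""
    if exponent == 0:
        return math.log(z)
    return (z ** exponent - 1.0) / exponent


def arsinh_transform(z, k=1.0, scale=1.0):
    """Inverse hyperbolic sine transform; the Z / scale argument is an assumption."""
    return k * math.asinh(z / scale)


def plateau(r, height, radius):
    """Cylindric plateau: constant mean intensity within `radius` of the centre."""
    return height if r <= radius else 0.0


def gaussian(r, height, sigma):
    """Isotropic 2D Gaussian profile as a function of distance r from the centre."""
    return height * math.exp(-r * r / (2.0 * sigma * sigma))


def crater(r, h_outer, s_outer, h_inner, s_inner):
    """Crater: difference between two scaled isotropic Gaussians."""
    return gaussian(r, h_outer, s_outer) - gaussian(r, h_inner, s_inner)
```

All three shapes depend on the pixel only through its distance `r` from the spot centre, which is the isotropy property (i); the crater profile is lower at the centre than near the rim when the inner Gaussian is narrower than the outer one.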
0 January 2003 0 The Importance of Thermodynamic Equilibrium for High Throughput Gene Expression Arrays 1 Gyan Bhanot,* Yoram Louzoun,† Jianhua Zhu,‡ and Charles DeLisi‡ 0 ABSTRACT We present an analysis of physical chemical constraints on the accuracy of DNA micro-arrays under equilibrium and nonequilibrium conditions. At the beginning of the article we describe an algorithm for choosing a probe set with high specificity for targeted genes under equilibrium conditions. The algorithm as well as existing methods is used to select probes from the full Saccharomyces cerevisiae genome, and these probe sets, along with a randomly selected set, are used to simulate array experiments and identify sources of error. Inasmuch as specificity and sensitivity are maximum at thermodynamic equilibrium, we are particularly interested in the factors that affect the approach to equilibrium. These are analyzed later in the article, where we develop and apply a rapidly executable method to simulate the kinetics of hybridization on a solid phase support. Although the difference between solution phase and solid phase hybridization is of little consequence for specificity and sensitivity when equilibrium is achieved, the kinetics of hybridization has a pronounced effect on both. We first use the model to estimate the effects of diffusion, crosshybridization, relaxation time, and target concentration on the hybridization kinetics, and then investigate the effects of the most important kinetic parameters on specificity. We find even when using probe sets that have high specificity at equilibrium that substantial crosshybridization is present under nonequilibrium conditions. Although those complexes that differ from perfect complementarity by more than a single base do not contribute to sources of error at equilibrium, they slow the approach to equilibrium dramatically and confound interpretation of the data when they dissociate on a time scale comparable to the time of the experiment.
For the best probe set, our simulation shows that steady-state behavior is obtained in a relaxation time of ~12-15 h for experimental target concentrations ~($10^{-13}$-$10^{-14}$) M, but the time is greater for lower target concentrations in the range ($10^{-15}$-$10^{-16}$) M. The result points to an asymmetry in the accuracy with which up- and downregulated genes are identified. 0 INTRODUCTION Single assay characterization of the response of thousands of genes to environmental perturbations is altering the research paradigm in biomolecular science. Applications are increasing explosively in areas as wide ranging as gene expression and regulation (Lashkari et al., 1997), genotyping and resequencing, and drug discovery and disease stratification (Eisen et al., 1998). The potential impact of micro-arrays on basic and applied biology is so important that an entire industry has been spawned, using any of dozens of variants of two generic methods to fabricate arrays--either direct deposition of probes (Schena et al., 1998; DeRisi et al., 1996; Duggan et al., 1999) or covalent attachment by in situ synthesis (Hughes et al., 2001; LeProust et al., 2000; Lipshutz et al., 1999; Singh-Gasson et al., 1999). The former method allows a wide range of substances such as presynthesized oligomers, proteins, cloned DNA, etc., to be used as probes. The latter is generally restricted to oligonucleotides but offers higher specificity. The central theme of this article is the physical chemical limits of specificity, i.e., the conditions that allow the best specificity. We consider mainly, though not exclusively, arrays of probes 20-30 nucleotides long, manufactured by in situ synthesis. These conditions minimize false hybridizations resulting from the slow equilibration that is characteristic of long probes, and avoid competition between surface-bound and solubilized probes.
Typically an array of tens to hundreds of thousands of different pixels, each consisting of a homogeneous set of 1-10 million oligonucleotide probes, is used to determine the expression levels of genes of known sequence. The molecules to be assayed, e.g., cDNA, are hybridized, during a 12-15 h incubation, with probes chosen to be their reverse complements. The most common detection method relies on fluorescence. Usually molecules from the target and reference cells are labeled with red and green dyes respectively; pixels are then scanned at the two distinct wavelengths to determine expression changes. Genes that are up- or downregulated in response to drugs, hormones, or other environmental influences are thus quickly identified. Although micro-array assays are high throughput in the sense that in excess of 10,000 genes at a time are probed, the number of false-positives is high, even for arrays prepared by in situ synthesis. Increased specificity is typically achieved by sacrificing sensitivity: only genes with a pronounced change in expression level, typically in the fifth percentile, are scored as having changed. 0 Gene Array Thermodynamics 0 The screened set, or a select group of the screened set, is then investigated further using traditional methods such as Northern blotting. Increased throughput is generally achieved by increased array density. However, as the above remarks imply, a substantial increase in throughput can be achieved by a well validated, high-specificity system. To increase specificity by rational design procedures, it is helpful to have a clear understanding of the physical limitations of the assay. This includes understanding the conditions that will provide the best specificity, the robustness to deviations from optimal conditions, the relation of optimal conditions to those prevalent in the most common experimental procedures, and strategies for optimization. This article is divided into two broad components: equilibrium and kinetic.
In the first section, we outline the thermodynamics of hybridization. Specificity and sensitivity are maximum when equilibrium has been achieved, but even under this ideal condition the method used to select probes affects the formation of crosshybrids, and thus it affects specificity. Probe selection is a large optimization problem. We discuss this below, and present a new probe selection method. Further below, we use this method to select probes for the full set of yeast genes and compare the specificities obtained at equilibrium where both specificity and sensitivity are maximum. This has particular implications for long probes inasmuch as length substantially reduces the rate at which equilibrium is approached, and consequently increases false-positives if equilibrium is not achieved. 0 melting temperature is easily obtained. Define $b$ as the equilibrium constant for bimolecular nucleation (formation of the first bond) in units of inverse concentration, and let $K$ be the (dimensionless) equilibrium constant for the formation of the remainder of the helix. For a helix with $n$ bases, there will be $n-1$ stacking interactions. We write the sum of the standard Gibbs free energies for the $n-1$ stacks as $\Delta H - T\Delta S$, so that the corresponding intramolecular equilibrium constant is $K = e^{-(\Delta H - T\Delta S)/RT}$, where $\Delta H$ and $\Delta S$ are the sums of the standard enthalpies and entropies for base stacking, in accordance with the base sequence. The free energy of the nucleation event also, to some extent, depends on the basepairs that nucleate dimerization. If $A$ is the free strand concentration and $B$ the concentration of hybrids, and we assume the molecules are either fully hybridized or completely separated, then $B = b A^2 K$. (1) 0 If $c_T$ is the total strand concentration, then by conservation $c_T = 2B + A$. In addition, at the melting temperature $T_m$ we have by definition $2B = A$.
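The two-state relations above, $B = bA^2K$ with $c_T = 2B + A$, can also be solved numerically for the free-strand fraction at any temperature. The following is a minimal sketch under stated assumptions: the numerical values of `dH`, `dS`, `b` and `cT` are hypothetical placeholders chosen only to give a plausible melting range, and the bisection bracket of 250-400 K is likewise an assumption.

```python
import math

R = 1.987  # gas constant in cal/(mol K)


def helix_constant(T, dH, dS):
    """K = exp(-(dH - T*dS) / (R*T)) for the n-1 stacking interactions."""
    return math.exp(-(dH - T * dS) / (R * T))


def free_fraction(T, dH, dS, b, cT):
    """Fraction A/cT of free strands, from B = b*A^2*K and cT = 2B + A."""
    bK = b * helix_constant(T, dH, dS)
    # Positive root of 2*bK*A^2 + A - cT = 0, written to avoid cancellation.
    A = 2.0 * cT / (1.0 + math.sqrt(1.0 + 8.0 * bK * cT))
    return A / cT


def melting_temperature(dH, dS, b, cT, lo=250.0, hi=400.0):
    """Bisect for the temperature at which half of the strands are free."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if free_fraction(mid, dH, dS, b, cT) < 0.5:
            lo = mid  # still mostly hybridized: melting point is higher
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical parameter values, for illustration only.
dH, dS = -180e3, -500.0  # cal/mol and cal/(mol K)
b, cT = 1e-3, 1e-9       # 1/M and M
Tm = melting_temperature(dH, dS, b, cT)
```

At the temperature returned by the bisection, the product `b * cT * helix_constant(Tm, dH, dS)` equals one to numerical precision, which is the condition obtained by inserting $2B = A$ into the two relations.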
Substituting these relations in the equation for $B$, and utilizing the definition of $K$, we have that $T_m = \dfrac{\Delta H}{R \log(b\,c_T) + \Delta S}$. (2) 0 The presence of a surface 0 Thermodynamics of hybridization 0 Melting profiles 0 As temperature is increased, an initially fully intact hybrid will gradually destabilize, and at high enough temperature, the strands will separate. Approximately 90% of the transition occurs over a temperature range of ~10-15 degrees for 25-mers, with the range narrowing as length increases. The so-called melting curve, determined under equilibrium conditions, is cooperative and has an inflection point which is referred to as the melting temperature, $T_m$. The melting temperature is defined as the temperature at which half the total number of strands are free (i.e., not hybridized). In general the population of hybridized strands will have a distribution of intact basepairs, and the arrangement of a given number of pairs will also be distributed. The common practice of neglecting partially hybridized states reduces a very complex multistage model to a two-state model, eliminates the physical basis for cooperativity, and broadens the melting profile. For short chains, however, it has little effect on the midpoint of the transition, introducing an error that is within the error caused by experimental uncertainty in the stacking free energy. For this two-state model in which partially hybridized states are neglected, a sequence-dependent expression for the 0 The formation of a DNA hybrid consists of a bimolecular nucleation event followed by formation of a double 1 Arnold Vainrub B.
Montgomery Pettitt 0 Surface Electrostatic Effects in Oligonucleotide Microarrays: Control and Optimization of Binding Thermodynamics 0 retical analysis of the surface electrostatic effects,6 which is in accord with recent experiments,7 we describe here the effect of the surface charge density on the melting curve and match/mismatch discrimination ratio for surface hybridization, and predict possible substantial improvements in several properties for microarrays. 0 Vainrub and Pettitt 0 The surface material, dielectric or metal, and the surface electrostatic conditions are shown to be critically important because they strongly determine the yield of the nucleic acid target hybridization to the surface-immobilized oligonucleotide probes. We propose to use these properties for control and enhancement of sensitivity during surface hybridization. In particular, an equal sensitivity of the probes with different base-pair composition may be achieved by adjustment of their specific linker molecule length or the local surface charge. Further, we suggest enhancement of the match/mismatch discrimination by narrowing the melting curve by optimizing the surface charge. Finally, we discuss a new microarray design using hybridization at low salt where the duplex stability is achieved by the positive surface charge. Under these conditions the target's secondary structure is melted, allowing hybridization to most of the target's nucleotides and increasing the sequencing information up to tenfold. 0 RESULTS AND DISCUSSION Statistical Thermodynamics of Hybridization 0 THEORETICAL MODEL AND CALCULATION METHODS 0 where $n$ is the fraction of the hybridized probes in equilibrium, $C_0$ is the concentration of the targets, and $\Delta G$ is the molar Gibbs free energy of the probe:target duplex formation. Equation (1) is valid under the condition that the target concentration is constant.
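Equation (1) itself is not reproduced in this section. As a minimal sketch, the code below assumes it has the Langmuir form $n = 1/\{1 + C_0^{-1}\exp(\Delta G/RT)\}$, with $\Delta G = \Delta H - T\Delta S$ and $C_0$ expressed relative to a 1 M standard state. The function name and the numerical values of `dH`, `dS` and `C0` are hypothetical and serve only to trace out a melting curve.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)


def hybridized_fraction(T, C0, dH, dS):
    """Assumed Langmuir form of Eq. (1): n = 1 / (1 + exp(dG / RT) / C0)."""
    dG = dH - T * dS  # molar Gibbs free energy of duplex formation
    return 1.0 / (1.0 + math.exp(dG / (R * T)) / C0)


# Hypothetical duplex parameters and target concentration.
dH, dS = -180.0, -0.5  # kcal/mol and kcal/(mol K)
C0 = 1e-9              # molar

# Melting curve: fraction of hybridized probes at a few temperatures (K).
curve = [(T, hybridized_fraction(T, C0, dH, dS)) for T in (300, 320, 330, 340, 360)]
```

Under this assumed form the hybridized fraction falls from nearly one to nearly zero as the temperature rises, and it increases with target concentration at fixed temperature, as expected for a Langmuir isotherm.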
For brevity, we omit a straightforward derivation for a general case when targets are depleted because of hybridization. Note that at constant temperature Eq. (1) corresponds to the well-known Langmuir adsorption isotherm equation, which is often used to interpret microarray experiments.3 For discussing the mechanism of the interaction below, we introduce here the interaction Gibbs free energy with the surface for the probe $V_p$, target $V_t$, and duplex $V_d$. This interaction impacts the hybridization equilibrium and therefore the parameters in Eq. (1) in several ways. First, the target concentrations on the surface $C_s$ and in solution $C_0$ vary according to the Boltzmann distribution formula $C_s = C_0 \exp(-V_t/RT)$ (2) 0 Second, the Gibbs free energy differences of the duplex formation on the surface $\Delta G_s$ and in solution $\Delta G$ differ by the change of the interaction energy after and before hybridization, $(V_d - V_p - V_t)$. Thus $\Delta G_s = \Delta G + V_d - V_p - V_t$ (3) 0 Equations 2 and 3 account for the target concentration and duplex binding strength changes near the surface, respectively. Substitution of Eqs. (2) and (3) in Eq. (1) gives the formula $n_s = 1/\{1 + C_0^{-1} \exp[(\Delta G + V_d - V_p)/RT]\}$, (4) which describes the effect of surface interactions on the hybridization equilibrium. 0 Surface Electrostatic Effects 0 This equation differs from Eq. (1) for hybridization in bulk by addition of $(V_d - V_p)$ to the hybrid formation free energy. Hence, if duplex and probe are attracted to the surface ($V_d < 0$ and $V_p < 0$), the stronger attraction of the duplex for the surface, $V_d < V_p$, promotes duplex formation. In contrast, a stronger surface repulsion of the duplex than the probe shifts the hybridization equilibrium toward melting of duplexes into single strand targets and probes. This approach can also be used out of thermodynamic equilibrium when the target's concentration on the surface $C_s$ is determined not by the Boltzmann distribution Eq. (2), but rather by some steady state transport process. The corresponding $C_s$ and Eq.
(3) should be substituted in Eq. (1) to obtain the equilibrium yield of the duplexes in surface hybridization, $n_s$. This is relevant to electronic DNA chips where the assayed nucleic acid is transported by electrokinetic drag13,14 and flow-through biochips.15 0 Surface Electrostatic Interaction 0 In order to evaluate the hybridization with the surface-tethered probes, one needs to know the probe $V_p$ and duplex $V_d$ interaction energies in Eq. (4). Recently, we calculated the oligonucleotide-surface interaction in electrolyte solution.6 We assumed the electrostatic interaction to be dominant since in microarray applications typically the oligonucleotide is tethered to the surface through a sufficiently long linker molecule, making the short-range van der Waals forces weak and therefore their effect small. The electrostatic Gibbs free energy was shown to be a sum of two components, $V_1$ and $V_2$. As depicted in Figure 1, $V_1$ corresponds to the direct electrostatic interaction with the surface charge and is attractive (repulsive) for the positively (negatively) charged surface because of the negative charge of the nucleic acid target. $V_2$ is the target's electrostatic free e 0 BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data 1 By ANNE-METTE K. HEIN 0 Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London W2 1PG, UK 0 Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London W2 1PG, UK 1 HELEN C. CAUSTON 0 Microarray Centre, MRC Clinical Sciences Centre, Imperial College, Hammersmith Hospital, London W12 0NN, UK 1 GRAEME K. AMBLER and PETER J. GREEN 0 Some key words: Bayesian, Affymetrix, GeneChip, probe-level analysis, gene expression, differential expression, MCMC 0 Introduction Microarrays are one of the new technologies that have developed in line with the sequencing of the human and other genomes and developments in miniaturization and robotics. They permit 0 A.K. Hein et al.
0 the expression profiles of tens of thousands of genes to be measured in a single experiment and promise to revolutionize the biomedical and life sciences. This is partly because the gene expression profiles obtained form a `signature' -- a molecular phenotype -- that can be used to characterize the type, age, disease state and growth conditions of an organism. Affymetrix are one of the leading manufacturers of microarrays (Affymetrix gene expression arrays are also referred to as `GeneChips') and these are widely used. They differ from many other array types in that a single labelled extract is hybridized to each array and because they contain multiple `match' and `mismatch' sequences for each transcript. This presents particular challenges for low-level data analysis including the integration of data from the multiple probes representing each transcript on an array to provide a measure that represents gene expression and its inherent uncertainty, and the bringing into par (`normalization') of data from different arrays. 0 Affymetrix Oligonucleotide arrays The oligonucleotide array technology exploits two fundamental biological properties: (a) mRNA is an intermediate product between genes encoded in DNA and their protein products, so mRNA abundance can be used as a measure of gene expression, and (b) single stranded RNA molecules have a high affinity to form double stranded structures. Pairing between RNA strands is highly specific and complementary strands have particularly high binding affinities. Oligonucleotide arrays contain hundreds of thousands of features. A feature is a small rectangular area, containing a large number of identical oligonucleotides. In general, a different oligonucleotide sequence is represented at each feature. The features on oligonucleotide arrays are referred to as probes. 
A measure of the abundance of a particular transcript RNA in a biological sample can be obtained by going through the following procedure: isolating RNA, making a labelled representation of it, fragmenting the sample, hybridizing the labelled, fragmented RNA to an array, washing off the material that has not hybridized and scanning the array to obtain fluorescence intensities at each probe (Schena et al., 1995). The abundance of a transcript is related to the intensity measured at the features representing the complementary RNA sequence. On GeneChip arrays oligonucleotides of length 25 are used. However, many genes are similar, sharing common motifs or subsequences, and cannot, in general, be uniquely identified by a single sequence of length 25. Therefore each gene is represented by a probe set, consisting of a number of probe pairs. A probe pair consists of a perfect match probe (PM) and a mismatch probe (MM). At each perfect match probe, an oligonucleotide which perfectly matches part of the transcript is represented. The detection of transcripts at the PMs of a probe set indicates that the gene is expressed, and the level of detection indicates the degree of expression. However, although complementary RNA sequences have particularly high affinities, sequences that are complementary over only part of the length of the sequence, or shorter sequence fragments, may also hybridize. We refer to the hybridization of non-complementary transcripts to the probes as non-specific hybridization. This is the motivation for including MM probes. The oligonucleotides represented at an MM probe are identical to those at the corresponding PM probe, except that the middle nucleotide is that of the complementary base. The intention is that, since PM and MM probes are almost identical, equal amounts of non-specific hybridization will occur at these probes. 
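The construction of an MM probe from its PM partner, as described above, can be sketched in a few lines. This is an illustration only: the function name and the example 25-mer are hypothetical, and the DNA alphabet is assumed for the oligonucleotides on the array.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Return the MM sequence: identical to the PM probe except that the
    middle nucleotide is replaced by its complementary base."""
    if len(pm) % 2 == 0:
        raise ValueError("probe length must be odd to have a middle nucleotide")
    mid = len(pm) // 2  # index 12 for a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


# Hypothetical 25-mer perfect match probe.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
```

Applying the function twice returns the original PM sequence, since complementing the middle base is its own inverse.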
Excess hybridization to the PM probe, relative to the MM probe will be due to specific hybridization, that is, the hybridization of complementary transcripts. A probe set for a gene typically consists of 11-20 PM and MM probe pairs, and these represent the information available about the expression of the gene. 0 BGX: a new gene expression index 1.2. Gene expression experiments and analysis 0 The generation of gene expression data is a multi-step process, and variability (from different sources) may be introduced at a number of experimental stages. The variability of interest is that of biological origin, e.g., variability in gene expression between experimental conditions, individuals or tissue types. Variability of non-biological origin may arise due to differences in the preparation of the biological samples to be hybridized, in the manufacture of the arrays, or in the process of scanning the arrays (see Hartemink et al. (2001) for a more detailed discussion). The replicability of raw gene expression data is low and gene expression data is notoriously noisy. This can be clearly demonstrated by hybridizing two technical replicates of the same biological sample on two arrays. The intensities obtained will often be found to differ (Figure 1). FIGURE 1 ABOUT HERE The analysis of gene expression data is usually treated as a multi-step process. The individual steps often consist of correcting the intensities for background noise, estimation of gene expression indices, normalization between samples, assessment of which genes are differentially expressed and clustering of genes or conditions with similar expression profiles or patterns. The focus of this paper is on the steps leading to the estimation of gene expression and on detection of differential expression. A drawback of splitting up the analysis of gene expression data into separate steps that are dealt with independently is that the error associated with each step is ignored in the downstream analysis. 
In assessing differential expression, it is clearly of interest to know how reliable the expression index of a gene is. In turn, in the estimation of the gene expression index, it is of interest to quantify the variability in the background corrected intensities, on which the estimation is based. A primary aim of the work presented here is to develop a statistically coherent framework for the analysis of Affymetrix GeneChip arrays, in which the splitting up of the analysis into separate steps is avoided. 1.3. Bayesian hierarchical modelling of Affymetrix gene expression data In this paper we present Bayesian hierarchical models for the analysis of gene expression data, where all steps in the process, and thus the associated errors, are modelled simultaneously. For clarity, we first set out a model for estimating the expression of genes using data obtained from a single array. In the model, background correction for non-specific hybridization and calculation of gene expression indices are considered simultaneously. We base the inference on the full posterior distributions for the parameters, so that, in addition to point estimates of gene expression levels we obtain their credibility intervals. Next, we extend the model to encompass the more commonly encountered situation, in which different experimental conditions are considered, and where replicate arrays may be available under some or all of the conditions. Here all information is used simultaneously to make the relevant inferences: where replicate arrays are available, measures of the expression of genes are obtained from a simultaneous consideration of the probe sets for the genes on the arrays. 
When experimental conditions are compared it is often of interest to identify genes that are differentially expressed, and to rank the genes according to their degre 0 Short Technical Reports 0 Analysis of DNA Microarrays by Non-Destructive Fluorescent Staining Using SYBR® Green II 0 ABSTRACT A simple, non-destructive procedure is described to determine the quality of DNA arrays before they are used. It consists of a preliminary staining step of the DNA microarray by using SYBR® Green II, a fluorophore with specific affinity for ssDNA, followed by a laser scan analysis. The surface quality, integrity and homogeneity of each DNA spot of the array can thus be assessed. After this preliminary control, which may avoid further analytical steps that lead to the waste of precious biological samples, a fully reversible staining procedure is performed that produces an array ready for subsequent use. 0 INTRODUCTION The use of microarrays is growing exponentially (5). The technology consists of dense arrays of DNA spots deposited on suitably prepared surfaces, mainly glass. Several formats have been 0 BioTechniques 0 plate and primers used. A portion of each PCR amplification product (5 µL) was examined by agarose gel electrophoresis, followed by ethidium bromide staining. Only PCR products showing a clear and strong band on UV transillumination were recovered by ethanol precipitation and resuspension in 15 µL 3x standard saline citrate (SSC) (450 mM NaCl, 45 mM sodium citrate, pH 7.0). The DNA concentration was determined using PicoGreen® reagent (Molecular Probes), a fluorescent nucleic acid stain useful for quantitating dsDNA in solution. The final concentration of DNA averaged 50 ng/µL. Samples were transferred into 96-well plates, which were sealed and stored at -20°C until used.
Preparation of Polylysine-Coated Glass Slides Standard glass microscope slides (Sigma Aldrich) were pre-cleaned by immersion for at least 2 h in an alkaline wash solution consisting of 10% (w/v) NaOH and 57% (v/v) ethanol, followed by rinsing five times in double-distilled water. The slides were then gently shaken for 1 h in a coating solution consisting of 35 mL Poly-L-Lysine (Sigma Aldrich; 0.1% w/v in water), 35 mL filtered PBS and 280 mL double-distilled water. Coated slides were extensively washed with double-distilled water, centrifuged at low speed (80 x g), dried in a vacuum drying oven at 45°C for 10 min and then stored at room temperature in a tightly sealed slide box. Slides were used after at least two weeks to produce a sufficiently hydrophobic surface. This aging process is a key step in obtaining a suitable surface for array preparation. Printing of DNA Microarrays Target DNA samples in 3x SSC were spotted on the glass slides using a piezoelectric pipet (Nanoplotter System™, Gesim GmbH, Germany). The pipet was programmed to release about 10 nL DNA solution for each DNA spot. Spots were arrayed in a 20 x 20 arrangement (400 spots in a 1.5 x 1.5-cm square with a center-to-center spacing between spots of approximately 750 µm) or a 30 x 30 arrangement (900 spots in a 1.5 x 1.5-cm square with a center-to-center spacing of 500 µm). After deposition, arrayed DNA spots were completely dried by overnight incubation at room temperature in a covered box. Printed slides were rehydrated (DNA side down) in a plastic humid chamber (Sigma Aldrich) until spots glistened and then snap-dried at 100°C.
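The printing geometry above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the DNA mass per spot assumes that the spotted solution has the average concentration of 50 ng/µL reported for the resuspended PCR products, which the text does not state explicitly.

```python
def spot_centres(n_per_side, pitch_um):
    """Centre coordinates (x, y) in micrometres of an n x n square grid."""
    return [(col * pitch_um, row * pitch_um)
            for row in range(n_per_side)
            for col in range(n_per_side)]


def grid_span_cm(n_per_side, pitch_um):
    """Distance between the first and last spot centre along one side, in cm."""
    return (n_per_side - 1) * pitch_um / 1e4  # 1 cm = 10,000 micrometres


def dna_per_spot_ng(volume_nl, conc_ng_per_ul):
    """Mass of DNA deposited per spot (1 nL = 0.001 microlitre)."""
    return volume_nl * 1e-3 * conc_ng_per_ul


span_20 = grid_span_cm(20, 750)    # 20 x 20 arrangement, 750 um pitch
span_30 = grid_span_cm(30, 500)    # 30 x 30 arrangement, 500 um pitch
mass = dna_per_spot_ng(10, 50)     # about 10 nL at an assumed 50 ng/uL
```

Both arrangements span slightly less than 1.5 cm between outermost spot centres, consistent with the stated 1.5 x 1.5-cm square, and the assumed concentration corresponds to roughly 0.5 ng of DNA per spot.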
0 BMC Bioinformatics 0 Methodology article 0 BioMed Central 0 Open Access 0 In silico microdissection of microarray data from heterogeneous cell populations 1 Harri Laehdesmaeki1, Ilya Shmulevich2, Valerie Dunmire2, Olli Yli-Harja1 and Wei Zhang*2 0 Background: Very few analytical approaches have been reported to resolve the variability in microarray measurements stemming from sample heterogeneity. For example, tissue samples used in cancer studies are usually contaminated with the surrounding or infiltrating cell types. This heterogeneity in the sample preparation hinders further statistical analysis, significantly so if different samples contain different proportions of these cell types. Thus, sample heterogeneity can result in the identification of differentially expressed genes that may be unrelated to the biological question being studied. Similarly, irrelevant gene combinations can be discovered in the case of gene expression based classification. Results: We propose a computational framework for removing the effects of sample heterogeneity by "microdissecting" microarray data in silico. The computational method provides estimates of the expression values of the pure (non-heterogeneous) cell samples. The inversion of the sample heterogeneity can be facilitated by providing accurate estimates of the mixing percentages of different cell types in each measurement. For those cases where no such information is available, we develop an optimization-based method for joint estimation of the mixing percentages and the expression values of the pure cell samples. We also consider the problem of selecting the correct number of cell types. Conclusion: The efficiency of the proposed methods is illustrated by applying them to carefully controlled cDNA microarray data obtained from heterogeneous samples.
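The idea of inverting sample heterogeneity when the mixing percentages are known can be illustrated with a minimal sketch. It assumes a linear mixing model, in which the measured expression of a gene in each sample is the fraction-weighted sum of its expression in the pure cell types; this linearity is an assumption made here for illustration and is not stated in the abstract. The function name, the mixing fractions and the expression values are hypothetical, and only two cell types are handled.

```python
def unmix_two_types(fractions, measured):
    """Least-squares estimate of the pure expression values (x1, x2) of one gene.

    fractions: list of (a1, a2) mixing fractions, one pair per sample
    measured:  list of measured expression values, one per sample
    Assumed model: measured[i] = a1 * x1 + a2 * x2
    """
    s11 = sum(a1 * a1 for a1, _ in fractions)
    s12 = sum(a1 * a2 for a1, a2 in fractions)
    s22 = sum(a2 * a2 for _, a2 in fractions)
    t1 = sum(a1 * y for (a1, _), y in zip(fractions, measured))
    t2 = sum(a2 * y for (_, a2), y in zip(fractions, measured))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("mixing fractions do not determine both cell types")
    # Solve the 2 x 2 normal equations by Cramer's rule.
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


# Hypothetical mixing fractions for five samples and one gene.
fractions = [(1.0, 0.0), (0.75, 0.25), (0.5, 0.5), (0.25, 0.75), (0.0, 1.0)]
pure = (200.0, 40.0)
measured = [a1 * pure[0] + a2 * pure[1] for a1, a2 in fractions]
x1, x2 = unmix_two_types(fractions, measured)
```

With noise-free synthetic measurements the pure values are recovered exactly; with real data the least-squares solution averages over measurement error. The joint estimation of unknown mixing percentages mentioned above is a harder optimization problem and is not attempted here.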
The results demonstrate that the methods are capable of reconstructing both the sample and cell type specific expression values from heterogeneous mixtures and that the mixing percentages of different cell types can also be estimated. Furthermore, a general purpose model selection method can be used to select the correct number of cell types.

Table 3: The measured mixing percentages (RKO/normal) in the five heterogeneous samples.
sample #1: RKO 100, normal 0

Recent developments in high-throughput genomic technologies have revolutionized the approaches aimed at understanding biological systems and emphasized the need for computational and systems biology research. Microarray analysis, for instance, can provide massive amounts of information about a biological sample by simultaneously measuring thousands of transcript levels. Application of such methodologies has already yielded important molecular insight into cellular phenotypes under various experimental conditions [1] and provided new knowledge about the development and treatment of human diseases, such as cancers [2-4]. During the last several years, microarray technology has undergone continued improvement with better quality control in the overall measurement process, ranging from hybridization conditions to image processing techniques [5]. Nevertheless, to fully harness the power of the microarray technology to study biological materials such as cancer tissues, one has to deal with a source of measurement variability that comes from the biological materials themselves, which rarely consist of homogeneous cell populations. For example, except for a few types of immune-privileged tissues such as the brain, most solid tumor tissues contain infiltrating lymphocytes as a result of the immune response.
Most tumor tissues also contain endothelial cells as part of the necessary vasculature systems that provide nutrients for the tumor cells. The complexity of this problem is that different tumor tissues contain different proportions of these non-tumor cells. Therefore, if tumor tissues are used without consideration of such a mixing phenomenon, measurement of differential gene expression will certainly be confounded by the heterogeneous cell populations. In some studies [6], pathologists carefully evaluated the tissues and only selected tissues with more than a certain percentage of tumor cells. This prescreening step, however, results in the exclusion of many tumor tissues from the study and contributes to the small sample size problem in some of the studies. Alternatively, laser capture microdissection (LCM) technology can be used to purify the tumor cells from mixed populations [7]. This approach has been very successful in DNA-based studies because of the relatively high stability of DNA. However, for microarray studies, which require less stable RNA, LCM has seen limited success because it is much more challenging to maintain RNA stability during the microdissection process. Other drawbacks of LCM are that such procedures are time-consuming and yield insufficient quantities of RNA, thus requiring multiple amplification steps that may confound quantitative inferences from gene expression data. A recent paper by Ghosh [8] introduced a mixture model based framework for determining differential expression in the presence of mixed cell populations. In this study, we aim at reconstructing the actual expression values of the pure cell types from the heterogeneous mixtures. That is, we develop a computational method for removing the effect of mixing from heterogeneous samples, that is, for microdissecting microarray data in silico. Similar analytical approaches have been previously proposed by Lu et al. [9], Stuart et al. [10] and Venet et al. [11]. Lu et al.
focused on estimating the fraction of cells in different phases of the cell cycle whereas Stuart et al. considered the problem of estimating the cell type specific expression patterns over all samples. Here we focus on estimating both the sample and cell type specific expression values using carefully controlled microarray experiments. The inversion of the 'cell mixing effect' can be made appreciably easier by providing estimates of the mixing percentages of different cell types in each measurement, which can be measured by an experienced pathologist. The entire process does not hinge upon such measurements, however, as the mixing percentages can be estimated within the modeling framework. Venet et al. [11] introduced some preliminary methods and results for tackling the same problem as we consider here. In particular, they used a regression-based framework similar to that in [10] and to ours. We also consider the problem of selecting the correct number of cell types using the cross-validation model selection framework.

The microarray data to which we apply our computational methods consist of five different heterogeneous mixtures of lymph node and colon cancer samples, which are hereafter abbreviated as normal and RKO, respectively. For more details, see the Materials and methods section. Each heterogeneous mixture consists of different fractions of different cell samples; see Table 3.

Inversion of sample heterogeneity
The first goal is to invert the mixing effect caused by sample heterogeneity. We apply the linear model developed in the Materials and methods section to the heterogeneous microarray data. The obtained results are presented below.

Figure 1 clearly shows that the heterogeneous samples ('m1' through 'm5') are located almost on a straight line in the 2-dimensional PCA space.
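Why linear mixtures fall on a line in PCA space, and how known mixing fractions allow the pure profiles to be recovered, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the expression values are synthetic, the mixing fractions are hypothetical (only the 100/0 split of sample #1 appears in Table 3 above), and ordinary least squares stands in for the regression framework developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes = 200
# Hypothetical pure expression profiles (rows: RKO, normal; columns: genes).
pure = rng.normal(loc=8.0, scale=2.0, size=(2, n_genes))

# Hypothetical mixing fractions (RKO, normal) for five samples; rows sum to 1.
fractions = np.array([[1.00, 0.00],
                      [0.75, 0.25],
                      [0.50, 0.50],
                      [0.25, 0.75],
                      [0.00, 1.00]])

# Linear mixing model: each measured sample is a weighted sum of the pure
# profiles plus a small amount of measurement noise.
mixed = fractions @ pure + rng.normal(scale=0.05, size=(5, n_genes))

# With the fractions known, invert the mixing for all genes by least squares.
estimated_pure, *_ = np.linalg.lstsq(fractions, mixed, rcond=None)

# PCA via the SVD of the centred data: mixtures of two profiles vary along a
# single direction, so the first component should carry nearly all variance.
centred = mixed - mixed.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

print("max abs error of estimated pure profiles:",
      np.abs(estimated_pure - pure).max())
print("variance explained by first component:", explained[0])
```

In this toy setting the first principal component absorbs almost all of the variance because the only systematic difference between samples is the mixing fraction, which is the same geometric picture the PCA plot describes.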
Furthermore, the line on which the heterogeneous samples are lying is parallel to the first principal component, suggesting that the most significant variation in the data is due to the linear mixing effect. The estimated expression profiles of the pure colon cancer cells and lymphocytes are close to samples #1 and #5, respectively, indicating that the inversion of the mixing phenomenon produces reasonable results. The results are more easily appreciated when only the most significant PCA component is shown. As discussed above, the variation in the most significant PCA component is due to the mixing effect. The results in Figure 2 (a) are as in Figure 1, but now shown in one dimension in order to facilitate the interpretation. Results in Figure 2 (b), in turn, are as in Figure 2 (a) except that the inversion was done using only samples #2, #3, and #4. This represents a more difficult and realistic case, since fewer mixtures are available. When comparing Figure 2 (a) with Figure 2 (b), one can conclude that the method performs slightly better when more samples are used to estimate the true expression profiles, a result that was expected. Overall performance, however, is good in both cases. The est

BMC Bioinformatics
BioMed Central
Open Access

ProbeMaker: an extensible framework for design of sets of oligonucleotide probes
Johan Stenberg*, Mats Nilsson and Ulf Landegren

Background: Procedures for genetic analyses based on oligonucleotide probes are powerful tools that can allow highly parallel investigations of genetic material. Such procedures require the design of large sets of probes using application-specific design constraints.
Results: ProbeMaker is a software framework for computer-assisted design and analysis of sets of oligonucleotide probe sequences. The tool assists in the design of probes for sets of target sequences, incorporating sequence motifs for purposes such as amplification, visualization, or identification.
An extension system allows the framework to be equipped with application-specific components for evaluation of probe sequences, and provides the possibility to include support for importing sequence data from a variety of file formats.
Conclusion: ProbeMaker is a suitable tool for many different oligonucleotide design and analysis tasks, including the design of probe sets for various types of parallel genetic analyses, experimental validation of design parameters, and in silico testing of probe sequence evaluation algorithms.

Increasing numbers of methods are being developed for parallel nucleic acid analyses for different purposes. Many of these methods employ sets of oligonucleotide probes or probe pairs that hybridize to the sequences targeted for analysis, allowing the probe sequences to be acted upon by one or more enzymes, creating new molecular species that reflect the presence or nature of the different target sequences. The reaction products generally contain identifying sequences or other features that allow the separation of signals originating from different targets. This is the case in methods such as the multiplex oligonucleotide ligation assay (OLA) [1], the multiplex ligation-dependent probe amplification assay (MLPA) [2], the RNA- and cDNA-mediated annealing, selection, extension and ligation assays (RASL, DASL) [3,4], the GoldenGate genotyping assay [5], multiplex minisequencing [6], and the padlock or molecular inversion probe assay [7,8]. The latter method has been used to genotype more than 10,000 single nucleotide polymorphisms (SNPs) in multiplex. Another method that utilizes sets of oligonucleotide probes for multiplex processing of nucleic acid molecules is the selector amplification technique.
This technique uses partially double-stranded oligonucleotides, called selectors, to circularize a selection of restriction fragments from total genomic DNA, and it incorporates a general sequence motif that allows parallel amplification of all circularized fragments using a single primer pair [9]. With molecular solutions to many tasks of highly parallel genetic analysis now at hand, other factors become limiting, such as the design and the synthesis of reagents. In the work presented here, we address the problem of large-scale probe design. When large numbers of probes are combined, the risk for unintended interactions between probes and targets must be considered. This risk places strict requirements on the design of sets of probes to be used together. In particular, it is important that probes do not contain sequences that result in the production of detectable signal from any probe in the absence of its cognate target molecule, or that otherwise interfere with the activity of other probes in the set. Due to these and other constraints and the many possible alternative probe sequences to evaluate, the difficulty of designing probe sets increases rapidly with the size of the probe sets. Many computer programs exist for the design of oligonucleotide probes such as PCR primers [10-12], microarray probes [13,14], and more [15]. These programs define algorithms to evaluate the risk of primer or probe sequences being involved in undesired interactions such as probe homo- or heterodimer formation, cross-hybridization, false priming, etc. However, the available programs are generally limited in scope, and are not applicable to the task of designing sets of complex probes containing multiple sequence elements. The ProbeMaker software presented herein is a framework for computer-assisted design and analysis of sets of oligonucleotide probe sequences composed of several functional sequence elements.
As the composition of probes and the constraints imposed on sets of probes vary between applications, this framework has been constructed to support the design of different types of probes using application-specific constraints, as defined by the user. ProbeMaker takes as input a set of target sequences and a number of sets of so-called 'tag' sequences. These tag sequences may serve as targets for restriction digestion, as binding sites for amplification primers or fluorescent detection probes, or as identification codes for individual amplification products that are decoded by hybridization to oligonucleotide arrays [16]. Probes are designed for each target by construction of target-specific sequences and addition of tag sequences according to rules specified by the user. Different combinations of sequence elements are evaluated for each probe, and a set of probe sequences is created that satisfies user-defined criteria.

The main objectives in the development of ProbeMaker were to provide a framework that is flexible, in the sense that it should support design of oligonucleotide probes for different purposes, and extensible, in that it should be possible to add support for designing new types of probes and to add new types of design constraints. Furthermore, the software should be adaptable to new applications, and it should have the potential to import sequence data from a variety of sources.

The flexibility is provided by the target and probe sequence data structures used. Each target defines two template sequences that are used to construct target-specific sequences (TSSs) to use in the corresponding probe. Each probe is made up of two such TSSs and a number of tag sequences, which may be located 5' of, between, or 3' of the TSSs. As TSSs may be of zero length, this system allows the design of many different types of probes. Support for more than two TSSs per probe was not deemed necessary as this is not used in any current methods. Furthermore, targets may be grouped, allowing the program to perform selection of tag sequences based on the relations of target sequences, for example variants of the same polymorphic sequence.

The extensibility is realized by using an extension mechanism for much of the functionality. Extensions are constructed in the form of Java classes that implement defined interfaces and may be loaded into the framework at run-time. This mechanism allows the addition of new target types and support for different formats for sequence input and output, as well as design constraints and acceptor schemes, the function of which will be described below.

ProbeMaker may be run through a graphical user interface or from the command line. For the graphical user interface, a set of target sequences and sets of tag sequences are provided as input by the user. Application-specific parameters for probe design and evaluation are set through the user interface. When running ProbeMaker from the command line, a project file defining all sequences and parameters is used as input. The potential for supporting different file formats is provided by using the sequence input system of the MolTools Java library [17]. A combination of components for sequence file parsing, sequence notation conversion, and post-import modifications is used to allow creation of sets of any type of target from a variety of sequence file formats, with the possibility to carry out other operations on the imported data, such as selecting which position within the target sequence to design probes for, or to group or sort sequences based on some particular property.

For a given set of targets, and a number of sets of tag sequences, ProbeMaker performs two tasks (Figure 1A). Firstly, TSSs are constructed for each target as determined by the target type in use, forming the basis for a probe for that target. Secondly, tag sequences are added to each probe sequentially in a pattern specified by the user.
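The probe layout described above (two TSSs, with tags placed 5' of, between, or 3' of them) and the two-step design process can be sketched as follows. This is an illustrative Python sketch only: ProbeMaker itself is built from Java classes, and every name here (`Probe`, `design_probe`, `longest_homopolymer`) as well as the homopolymer-run constraint is a hypothetical stand-in for the user-defined criteria, not ProbeMaker's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Probe:
    """Hypothetical probe layout: tags 5' of, between, and 3' of two TSSs."""
    tss_five_prime: str      # may be empty (zero length)
    tss_three_prime: str     # may be empty (zero length)
    tags_five_prime: List[str] = field(default_factory=list)
    tags_between: List[str] = field(default_factory=list)
    tags_three_prime: List[str] = field(default_factory=list)

    def sequence(self) -> str:
        """Concatenate all sequence elements in 5' to 3' order."""
        parts = (self.tags_five_prime + [self.tss_five_prime]
                 + self.tags_between + [self.tss_three_prime]
                 + self.tags_three_prime)
        return "".join(parts)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (toy constraint)."""
    longest = run = 0
    previous = ""
    for base in seq:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def design_probe(tss5: str, tss3: str, tag_candidates: List[str],
                 max_run: int = 4) -> Optional[Probe]:
    """Step 1: start from the TSSs. Step 2: try tags until one is accepted."""
    for tag in tag_candidates:
        probe = Probe(tss5, tss3, tags_between=[tag])
        if longest_homopolymer(probe.sequence()) <= max_run:
            return probe
    return None  # no candidate tag satisfied the constraint


probe = design_probe("ACGTAC", "GGTCCA", ["AAAAAA", "CTGACT"])
print(probe.sequence() if probe else "no acceptable probe")
```

Allowing either TSS to be an empty string mirrors the zero-length TSS idea in the text: the same structure then covers probes with a single target-specific arm.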
BMC Genomics
Research article
BioMed Central
Open Access

A generic approach for the design of whole-genome oligoarrays, validated for genomotyping, deletion mapping and gene expression analysis on Staphylococcus aureus
Yvan Charbonnier*1,2, Brian Gettler1,2, Patrice Francois1, Manuela Bento1, Adriana Renzoni3, Pierre Vaudaux3, Werner Schlegel2 and Jacques Schrenzel1,4

Background: DNA microarray technology is widely used to determine the expression levels of thousands of genes in a single experiment, for a broad range of organisms. Optimal design of immobilized nucleic acids has a direct impact on the reliability of microarray results. However, despite small genome size and complexity, prokaryotic organisms are not frequently studied to validate selected bioinformatics approaches. Relying on parameters shown to affect the hybridization of nucleic acids, we designed freely available software and experimentally validated its performance on the bacterial pathogen Staphylococcus aureus.
Results: We describe an efficient procedure for selecting 40-60 mer oligonucleotide probes combining optimal thermodynamic properties with high target specificity, suitable for genomic studies of microbial species. The algorithm for filtering probes from extensive oligonucleotide libraries fitting standard thermodynamic criteria includes positional information of predicted target-probe binding regions. This algorithm efficiently selected probes recognizing homologous gene targets across three different sequenced genomes of Staphylococcus aureus. BLAST analysis of the final selection of 5,427 probes yielded >97%, 93%, and 81% of Staphylococcus aureus genome coverage in strains N315, Mu50, and COL, respectively.
A manufactured oligoarray including a subset of control Escherichia coli probes was validated for applications in the fields of comparative genomics and molecular epidemiology, mapping of deletion mutations and transcription profiling.
Conclusion: This generic chip-design process merging sequence information from several related genomes improves genome coverage even in conserved regions.

Current hybridization technologies allow assaying thousands of nucleic acid sequences in a single reaction on a solid substrate. Such massively parallel systems offer unprecedented opportunities for basic research and diagnostic applications, including gene sequencing [1], detection of genetic polymorphisms [2], genome-composition analysis [3,4] and measurement of gene expression profiles in prokaryotes [5,6] or cancer cells [7]. Oligonucleotide probes (up to 70-mer) offer more flexibility than cDNA probes since they can be tailored according to optimal in silico physico-chemical and specificity properties, and applied to any sequence data. Early probe design software identified sets of probes sharing homogeneous thermodynamic properties for probe-target hybridization [8]. More elaborate software tools add cross-homology testing of probes against a reference database by BLAST (Basic Local Alignment Search Tool) [9,10] or prediction of secondary structures to the thermodynamically-based approach [11-14]. A frequent drawback of some of these algorithms is that they yield an excessive number of unprocessed BLAST outputs, which complicates final selection of the most specific probes. Furthermore, these approaches do not take into consideration probe interaction with the microarray surface, in particular the impact of the position of mismatches between the target and probes, as shown by Hughes et al. [15].
Designing reliable oligonucleotide probes with available software is quite difficult for bacterial genomes with low GC content [16], low complexity in sequence composition, or frequent conserved repeats leading to erroneous target identification by cross-hybridization. The reported method (OliCheck) implements an algorithm for filtering oligonucleotide probe libraries sharing homogeneous thermodynamic properties by using positional information of predicted target-probe binding regions. An additional characteristic of OliCheck is that it annotates probes recognizing highly conserved targets shared by different genomes. Staphylococcus aureus (S. aureus) was selected as a model organism for implementing and experimentally validating this approach. The choice of this clinically important pathogen for fundamental and applied genomic studies is prompted by the availability of several fully or partially sequenced strain genomes [16-18]. A set of feature elements was designed by OliCheck to yield extensive S. aureus genome coverage. This S. aureus specific probe set together with control probes was used to manufacture an oligoarray that was extensively validated for comparative genomics, molecular epidemiology, mapping of deletion mutations, and transcription profiling applications. The specificity, signal-response linearity, and influence of hybridization temperatures for transcript profiling are also described.

Further genomic oligoarrays of several distinct microbial species have been successfully designed using this generic methodological approach.

In silico properties of the S. aureus oligoarray and manufacturing of StaphChip
The final set of 5,335 S. aureus OliCheck-filtered probes recognized 97.5, 93.0, and 81.0% of N315, Mu50, and COL ORFs, respectively.
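The idea of a probe library "sharing homogeneous thermodynamic properties" can be illustrated with a simple pre-filter. The sketch below is not OliCheck's algorithm: it uses a common empirical salt-adjusted melting temperature approximation for long oligonucleotides, the sodium concentration and Tm window are arbitrary assumptions, and the BLAST-based specificity and positional filtering steps described in the text are not reproduced.

```python
import math


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def approx_tm(seq, na_molar=0.1):
    """Empirical Tm estimate in deg C:
    81.5 + 16.6*log10[Na+] + 0.41*(%GC) - 600/N.
    The 0.1 M sodium default is an assumption for illustration."""
    n = len(seq)
    return (81.5 + 16.6 * math.log10(na_molar)
            + 41.0 * gc_fraction(seq) - 600.0 / n)


def candidate_probes(orf, length=50, tm_min=65.0, tm_max=75.0):
    """Slide a window along an ORF and yield windows inside the Tm band.
    The window length and Tm band are hypothetical thresholds."""
    for start in range(len(orf) - length + 1):
        window = orf[start:start + length]
        tm = approx_tm(window)
        if tm_min <= tm <= tm_max:
            yield start, window, tm


# Toy ORF; a real run would iterate over annotated ORFs of a genome.
toy_orf = "ATGGCTAAAGAATTAGGTCGTGCATTACGTGAAGCTTTAGGTAAACCAGTTGATGCACGT" * 2
hits = list(candidate_probes(toy_orf, length=50, tm_min=60.0, tm_max=75.0))
print(f"{len(hits)} candidate windows pass the Tm filter")
```

For a low-GC genome such as S. aureus, a narrow Tm window of this kind rejects many windows outright, which is one reason the text describes probe design for such genomes as difficult.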
The low residual percentage of

[Figure: Steps A-D. Labels: N315 (2,593 ORFs); BLAST probes; Hybridization intensities prediction (%); Surface end / Solution end; Probe A, Probe B.]

BMC Genomics 2002, 3
BioMed Central
Methodology article
Open Access

Optimization and evaluation of T7 based RNA linear amplification protocols for cDNA microarray analysis
Hongjuan Zhao1, Trevor Hastie2, Michael L Whitfield3, Anne-Lise Borresen-Dale4 and Stefanie S Jeffrey*1

Background: T7 based linear amplification of RNA is used to obtain sufficient antisense RNA for microarray expression profiling. We optimized and systematically evaluated the fidelity and reproducibility of different amplification protocols using total RNA obtained from primary human breast carcinomas and high-density cDNA microarrays.
Results: Using an optimized protocol, the average correlation coefficient of gene expression of 11,123 cDNA clones between amplified and unamplified samples is 0.82 (0.85 when a virtual array was created using repeatedly amplified samples to minimize experimental variation). Fewer than 4% of genes show changes in expression level of 2-fold or greater after amplification compared to unamplified samples. Most changes due to amplification are not systematic, either within one tumor sample or between different tumors. Amplification appears to dampen the variation of gene expression for some genes when compared to unamplified poly(A)+ RNA. The reproducibility between repeatedly amplified samples is 0.97 when performed on the same day, but drops to 0.90 when performed weeks apart. The fidelity and reproducibility of amplification are not affected by decreasing the amount of input total RNA in the 0.3-3 µg range.
Adding template-switching primer, DNA ligase, or column purification of double-stranded cDNA does not improve the fidelity of amplification. The correlation coefficient between amplified and unamplified samples is higher when total RNA is used as template for both experimental and reference RNA amplification.
Conclusion: T7 based linear amplification reproducibly generates amplified RNA that closely approximates the original sample for gene expression profiling using cDNA microarrays.

Gene expression profiling using complementary DNA (cDNA) microarrays is being applied for multiple purposes such as defining the taxonomy of different molecular subtypes of human breast and other cancers [1-10] and discovering biomarkers and therapeutic targets [11,12]. A limitation of the use of this technology is that small specimens of human tissue, such as those obtained by core needle or fine needle aspiration (FNA) biopsies, may not be sufficient for microarray hybridization using direct labelling protocols. Typical microarray labelling procedures require 2-4 µg poly(A)+ RNA or 25-50 µg total RNA per cDNA microarray. This amount of poly(A)+ RNA or total RNA can be obtained from samples of human tissue that weigh more than 50-100 mg. However, core needle biopsies of breast cancers, for example, weigh in the 10-25 mg range and yield only 3-15 µg of total RNA. Small tumors identified using early detection strategies may thus be too small to excise a specimen with enough RNA for microarray analysis. A pilot study by Assersohn et al. [13] showed that only 15% of FNA samples from human breast cancers produced sufficient mRNA for expression array analysis. One approach to low specimen RNA input has been to use indirect labelling techniques to increase fluorescence signal intensity, such as with aminoallyl nucleotides.
Although these techniques are less expensive, we and other colleagues have found that they are not always as reliable as direct labelling methods. For valuable tumor specimens, reliability is paramount. A very recent report used amino C6dT-modified random hexamers to prime cDNA synthesis in conjunction with aminoallyl-dUTP and increased fluorescence intensity enough that as little as 1 µg of total RNA from cell lines gave sufficient signal for cDNA microarray hybridization [14]. The reliability of this method with human tumor specimens warrants further testing. RNA amplification techniques have been developed to address the need for sufficient RNA from tiny specimens for microarray hybridization. Other examples of specimens requiring amplification for genome-wide characterization of gene expression include purified populations of cells obtained by flow cytometry, laser capture microdissection, breast ductal or bronchial lavage, or microendoscopy. Although one group has used unamplified total RNA extracted from ~2 × 10^4 microdissected cells for hybridization on 5000 clone membrane-based arrays [15], most groups perform RNA amplification for this purpose [16-18], especially when using high-density slide-based arrays. The most commonly used mechanism for RNA amplification is a T7 based linear amplification method first developed by Van Gelder, Eberwine and coworkers [19-21]. This method utilizes a synthetic oligo(dT) primer containing the phage T7 RNA polymerase promoter to prime synthesis of first strand cDNA by reverse transcription of the poly(A)+ RNA component of total RNA. Second strand cDNA is synthesized by degrading the poly(A)+ RNA strand with RNase H, followed by second strand synthesis with E. coli DNA polymerase I.
Amplified antisense RNA (aRNA) is obtained from in vitro transcription of the double-stranded cDNA (ds cDNA) template using T7 RNA

Table 1: Correlation coefficients of amplified and unamplified expression levels of 14,044 genes selected according to the described criteria. Amplifications with or without TS primer and with two different ds cDNA cleanup protocols were performed on BC91 total RNA.
[Column headings: Column for ds cDNA cleanup; Reference RNA amplified; Total RNA; Poly(A)+ RNA; Virtual; Average]

Stefan Tomiuk is a member of the bioinformatics group at MEMOREC, a Cologne-based biotechnology company focusing on gene discovery and expression profiling by SAGE and cDNA microarrays. He participates in building up the company's cDNA collection and is responsible for the selection of DNA fragments suitable for microarray application. Kay Hofmann is head of the bioinformatics group at MEMOREC.

Microarray probe selection strategies
Stefan Tomiuk and Kay Hofmann

Keywords: cDNA microarray, expression profiling, high throughput, clustering, hybridisation

During recent years, DNA microarrays have become the method of choice to monitor the expression level of a large number of genes. Depending on the focus of the study and the method of microarray fabrication, a number of different strategies for probe selection may be most appropriate. One consideration concerns the length of the probe, ranging from some 25 residues used for oligonucleotide arrays to complete cDNAs. Unless resources are truly unlimited, an important decision to be made is the amount of effort to be put into the selection of genes and gene fragments. While high-throughput cDNA arraying projects usually will select from a collection of existing cDNA clones, smaller projects focusing on a number of selected genes can afford to selectively amplify fragments optimised for that purpose.
This paper discusses the full scope of probe selection strategies, highlighting the problems that may be encountered in the various systems.

DNA microarrays are made up of a collection of distinct nucleic acid samples, arranged in a regular lattice of spots on a solid support generally made of coated glass. Arrays intended to monitor changes in the expression level of various genes use cDNA samples or synthetic oligonucleotides derived from cDNA sequences.1,2 Other possible array applications include the detection of mutations or copy number changes on the genome level 3-5 and thus use samples derived from genomic DNA. The successful application of each DNA microarray technique requires particular conditions and prerequisites, which impose certain criteria for selecting appropriate DNA probes. The following paragraphs focus on probe selection strategies for the more widely used expression arrays of both the oligonucleotide- and cDNA-using variety. Nevertheless, some of these criteria are also valid for mutation-detection arrays.

GENERAL CONSIDERATIONS
When monitoring the expression level of a large number of genes, sufficient sensitivity and specificity of an array, as well as the broad coverage of all relevant genes, are of crucial importance. In addition, the quality of the array should guarantee the reproducibility of the results to ensure their statistical significance. A further prerequisite for a successful interpretation of the array results is a correct assignment and annotation of the DNA probes, providing an unambiguous link to the corresponding entries in gene and literature databases. Some aspects of probe design, including the fragment length, are influenced by the manufacturing process of the arrays.
Photolithographic procedures allow a massively parallel production of oligonucleotide arrays, but are restricted to an oligonucleotide length of 20-25 nucleotides due to the high error rate of each extension cycle.6-8 Alternative methods for in situ oligonucleotide synthesis, employing high-precision delivery of chemical

Physical properties of the probe influence hybridisation kinetics
High coverage but poor sample annotation in high density arrays
Short vs. long array probes

reliable hybridisation properties but the increased viscosity might complicate the array manufacturing process. In addition, increasing the fragment length raises the danger of non-specific cross-hybridisation events. If fragments of very heterogeneous length are used, the comparability of the investigated genes and the robustness of the array might suffer from the different hybridisation kinetics. Oligonucleotide probes with a length of 50-60 nucleotides may not be suitable for reliably distinguishing single base mismatches, but show an improved specificity and sensitivity compared to shorter oligonucleotides.9,30

The most appropriate probe selection strategy depends primarily on the objective of the experiment. As summarised in Figure 1, there is a whole spectrum of different approaches, differing in aspects of throughput, accuracy and the necessary effort before and after the microarray experiment. In situations where little prior information on relevant genes is available, or where the prime motivation is an unbiased overview of global changes in gene expression patterns, the high-density method is the appropriate choice. Typically, samples are selected from a preexisting collection of cDNA sequences or fragments, or they are synthesised by a method amenable to high throughput. The downside of this approach is a general lack of reliable sample annotation, shifting some of the necessary work to the post-hybridisation phase.
These high-density microarrays, which aim to cover the complete transcriptome of a biological system,2,7 are in contrast to small but specialised arrays that are designed with a focus on defined subject areas such as, for example, genes relevant to a particular metabolic pathway or a particular tissue type.31,32 The limited number of DNA fragments on these low-density arrays allows a more thorough selection and annotation protocol. Obviously, there also exists a whole range of intermediates between ultrahigh-density and high-accuracy arrays.

Microarray probe selection strategies

The quality of EST-based arrays depends on the reliability of the library used

Spotting without prior sequencing

PCR-amplification is the most reliable but most expensive probe generating method

In the following paragraphs, some common strategies for probe selection are discussed. The easiest and cheapest method consists of the spotting of clones from a library without prior sequencing. Only those clones that show differential expression after hybridisation are submitted to sequencing and further analysis. This strategy is particularly useful for arrays produced in small editions, since only a small fraction of presumably interesting genes must be annotated. The more frequently a particular array set-up is used, the less efficient becomes the deferment of the sequence analysis. Typical applications include high-throughput screens for potential new drug targets,33,34 or the analysis of `exotic' biological systems without any available sequence information. Owing to the frequent representation bias of some genes, a normalisation of the library used is strongly recommended for reaching a more equal distribution.35 A somewhat more refined strategy relies on available collections of sequenced cDNA clones.
Most of the available clones have the status of ESTs (expressed sequence tags36), and their corresponding sequences are collected in the dbEST database.37 Access to the physical clones of most animal ESTs is provided by the IMAGE consortium (Integrated Molecular Analysis of Genomes and their Expression),38 and by several distributors. Since clones from this exhaustive collection are also available in large sets, they are a valuable and widely used source for microarray probes. For plants and other organisms, similar sources exist. A comm

Research Update

Genome Analysis

Eubacterial phylogeny based on translational apparatus proteins

Celine Brochier, Eric Bapteste, David Moreira and Herve Philippe

Lateral gene transfers are frequent among prokaryotes, although their detection remains difficult. If all genes are equally affected, this questions the very existence of an organismal phylogeny. The complexity hypothesis postulates the existence of a core of genes (those involved in numerous interactions) that are unaffected by transfers. To test the hypothesis, we studied all the proteins involved in translation from 45 eubacterial taxa, and developed a new phylogenetic method to detect transfers. Few of the genes studied show evidence for transfer. The phylogeny based on the genes devoid of transfer is very consistent with the ribosomal RNA tree, suggesting that a eubacterial phylogeny does exist.

The completion of many genome sequence projects has revealed the fundamental importance of lateral gene transfers

species and that have no (or very few) duplicated copies. We concatenated the sequences of the 57 genes into a large fusion (~9000 amino acid positions). The phylogeny based on this fusion is very similar to that inferred from rRNA and gene content.
Detailed analysis revealed that 13 out of the 57 gene phylogenies were INCONGRUENT (see Glossary) with the phylogeny based on the fusion of the 57 genes, either due to methodological tree-reconstruction problems or to a few recent LGTs. A true organismal phylogeny for Bacteria seems to exist, which could be fully resolved by the analysis of a core group of very rarely transferred genes.

Phylogenetic analysis of a large protein fusion

For our analysis, we retrieved from the public databanks and from ongoing

Congruence and incongruence: Congruence is the agreement between phylogenies obtained using different datasets or different reconstruction methods. Trees are perfectly congruent if they display the same topology; that is, they reflect the same evolutionary history. By contrast, incongruent trees show conflicting robust nodes, which could be due to different evolutionary histories (e.g. lateral gene transfers) or tree reconstruction problems.

Γ law: Traditional models of sequence evolution assume that all positions in the sequences are equally likely to undergo a substitution, which reduces the complexity of these models. However, in reality, positions in sequences are more or less `free' to vary; that is, they have different probabilities of undergoing substitutions. This limits the biological realism of traditional models and their efficiency for phylogenetic reconstruction. The variation of substitution rates is commonly approximated using a gamma distribution, also known as a Γ law, which has a shape parameter α that specifies the range of rate variation [a]. Small α values result in an L-shaped distribution with extreme variation of rates (most sites are invariable, but a few have very high substitution rates). As α gets larger, the range of variation diminishes, until α approaches infinity and all sites have the same substitution rate.
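The effect of the gamma shape parameter described in this glossary entry can be illustrated numerically. The following is a minimal sketch, assuming the common mean-one parameterisation (shape alpha, scale 1/alpha), so that the variance of the relative rates is 1/alpha; the function name `site_rates` and the chosen alpha values are hypothetical and are not taken from the article.

```python
import numpy as np

def site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites positions.

    Assumption for illustration: rates follow a gamma distribution with
    shape alpha and scale 1/alpha, so the mean rate is 1 for any alpha.
    """
    rng = np.random.default_rng(seed)
    return rng.gamma(shape=alpha, scale=1.0 / alpha, size=n_sites)

for alpha in (0.2, 1.0, 50.0):
    rates = site_rates(alpha, 100_000)
    # Small alpha: many nearly invariable sites and a few very fast ones.
    # Large alpha: all sites evolve at almost the same rate.
    print(f"alpha={alpha:5.1f}  mean={rates.mean():.3f}  "
          f"variance={rates.var():.3f}  "
          f"fraction of sites with rate < 0.01: {np.mean(rates < 0.01):.3f}")
```

With a small shape value most sampled rates sit close to zero while the variance is large, which matches the L-shaped distribution described in the entry; with a large shape value the variance shrinks towards zero.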
HKY model: The Hasegawa, Kishino and Yano [b] model of sequence evolution is a merger of the Felsenstein [c] and the Kimura two-parameter models [d], which allows transitions and transversions to occur at different rates and base frequencies to vary during the course of evolution, respectively.

Jack-knife analysis: A statistical method to evaluate the robustness of an inference. It is based on the construction of random sub-samples of the original alignment by taking a fraction of the positions without replacement (in contrast to the bootstrap method, which allows replacement). Usually, trees are reconstructed with the random sub-samples and the robustness of each node is estimated as the number of its occurrences among these trees [e].

Log-Det method: A method to evaluate evolutionary distances that are consistent for sequences with different nucleotide or amino acid composition [f]. This approach is required because other methods tend to group sequences on the basis of their composition, irrespective of their evolutionary history.

Kishino-Hasegawa test: A test used for the estimation of incompatibility between alternative tree topologies with the same taxonomic sampling but obtained using different datasets [g]. Two tree topologies are significantly different if the difference of their likelihood values (expressed as ΔlnL, where L is the likelihood) is larger than 1.96 standard errors of the estimated likelihood difference. For a recent criticism of this test see Ref. [h].

Principal component analysis (PCA): This involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
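The principal component analysis entry can be made concrete with a small numerical sketch. The version below assumes the standard eigen-decomposition of the sample covariance matrix; the function name `pca` and the synthetic three-variable data set are hypothetical and serve only as an illustration, not as the procedure used in the article.

```python
import numpy as np

def pca(data):
    """Return eigenvalues, eigenvectors and scores, largest variance first.

    data: array of shape (n_observations, n_variables).
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs              # project data onto eigenvectors
    return eigvals, eigvecs, scores

# Synthetic example: two strongly correlated variables plus one independent one.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
data = np.column_stack([x,
                        2.0 * x + 0.1 * rng.normal(size=500),
                        rng.normal(size=500)])

eigvals, eigvecs, scores = pca(data)
print("fraction of variance per component:", eigvals / eigvals.sum())
```

The covariance matrix of `scores` is diagonal, which is the sense in which the principal components are uncorrelated, and the first component carries most of the variance because two of the three input variables are nearly collinear.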
Principal components are obtained by projecting the multivariate data vectors on the space spanned by the eigenvectors.

[Figure residue; taxon labels recovered from the figure: Proteobacteria, Spirochetes, Green sulfur, Chlamydiales, Mycoplasmas (Low G+C Gram positives), D. radiodurans, Low G+C Gram positives, High G+C Gram positives, Thermotogales, Aquificales.]

TRENDS in Genetics

genome projects sequences homologous to all Escherichia coli proteins classified as involved in translation in the Clusters of Orthologous Groups (COG) database [7], as well as the 16S and 23S rRNAs. We aligned 76 proteins from 45 bacterial species, having eliminated any proteins that are present only in a restricted sample of phyla (see http://sorex.snv.jussieu.fr/translation/translation.html). In addition, as a sample of transferred genes, we used the tRNA synthetases (tRS), most of which are known to have undergone numerous LGTs (perhaps related to antibiotic resistance [8,9]). The 76 genes were analysed individually, and 19 of them were excluded from further analyses because they were: (1) difficult to align reliably, (2) present in fewer than 42 of the 45 species, or (3) present in more than one copy for certain phyla (indicating possible ancient duplications and losses, and/or LGTs). The remaining 57 genes, after elimination of ambiguously aligned regions (alignments available on our website), were concatenated for the 45 bacterial species into a large fusion of 8857 amino acids (fusion P1). Most of the best-known bacterial phyla were represented, of which we had a broad taxonomic sampling for Proteobacteria and Gram-positive bacteria. We do not use Archa

Robustness, Flexibility, and the Role of Lateral Inhibition in the Neurogenic Network

Summary Background: Many gene networks used by developing organisms have been conserved over long periods of evolutionary time. Why is that?
We showed previously that a model of the segment polarity network in Drosophila is robust to parameter variation and is likely to act as a semiautonomous patterning module. Is this true of other networks as well? Results: We present a model of the core neurogenic network in Drosophila. Our model exhibits at least three related pattern-resolving behaviors that the real neurogenic network accomplishes during embryogenesis in Drosophila. Furthermore, we find that it exhibits these behaviors across a wide range of parameter values, with most of its parameters able to vary more than an order of magnitude while it still successfully forms our test patterns. With a single set of parameters, different initial conditions (prepatterns) can select between different behaviors in the network's repertoire. We introduce two new measures for quantifying network robustness that mimic recombination and allelic divergence and use these to reveal the shape of the domain in the parameter space in which the model functions. We show that lateral inhibition yields robustness to changes in prepatterns and suggest a reconciliation of two divergent sets of experimental results. Finally, we show that, for this model, robustness confers functional flexibility. Conclusions: The neurogenic network is robust to changes in parameter values, which gives it the flexibility to make new patterns. Our model also offers a possible resolution of a debate on the role of lateral inhibition in cell fate specification. Introduction In this paper, we use a computer model to explore the properties of the neurogenic network, originally characterized in Drosophila melanogaster. This is but one example of the many networks of cross-regulatory genes at work in complex organisms. Other familiar examples include the networks of segment polarity genes, of cell cycle genes, of circadian clock genes, and so on. 
Each of these seems to have remained more or less intact through long periods of evolutionary time and across

embryos and imaginal disks. Figure 1 shows our summary of the core genes, their products, and their interactions. In crafting Figure 1, we approached the model-building process as a biochemist approaches in vitro reconstitution; by adding to the system piece by piece, we hope to figure out how each design feature contributes to the function of the essential core network. We rationalize our choice of this diagram in the Supplementary Material available with this article online, with a synopsis as follows (Below, "ac" and "Ac" refer to the real achaete gene and its protein product, whereas "ac" and "AC" refer to corresponding nodes in the model): Delta (Dl) is a ligand for the receptor Notch (N). When Dl activates N, a cleaved-off cytoplasmic piece of N binds to the transcription factor Suppressor of Hairless (Su(H)), and that heterodimer activates Enhancer of split (E(spl)) complex genes. The proneural genes achaete (ac) and scute (sc) encode transcription factors that actually specify neural fate. Both Ac and Sc are autoactivating and cross-activating: they promote their own, and each other's, transcription. Thus, the proneural genes constitute a bistable switch at the heart of the neurogenic network. They also activate transcription of E(spl) and Dl. E(spl) in turn represses transcription of ac and sc. Thus, the loop works as follows: something activates ac and/or sc in the neural-competent cluster. They upregulate Dl, whose product activates N in neighboring cells, which, through Su(H), activates E(spl). E(spl) represses ac and sc in those neighboring cells. To achieve a neural fate, a cell must upregulate ac and sc enough that their autoactivation overwhelms E(spl)-mediated repression due to neighboring cells signaling through N.
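The feedback loop summarised above can be caricatured in a few lines of code. The sketch below is a deliberately reduced two-cell toy and not the model used in this paper: it collapses the whole Dl, N, Su(H), E(spl) chain into a single repressive Hill function of the neighbor's proneural level, it omits autoactivation, and the function names and the parameters `k` and `h` are arbitrary assumptions chosen only so that the symmetric state is unstable.

```python
def repression(neighbor_level, k=0.3, h=4):
    """Proneural production rate in a cell, given its neighbor's proneural level.

    Toy assumption: the neighbor's proneural activity drives Dl, which
    signals through N and E(spl) to repress production here; the chain is
    lumped into one decreasing Hill function with threshold k and exponent h.
    """
    return 1.0 / (1.0 + (neighbor_level / k) ** h)

def simulate(a1, a2, dt=0.01, steps=5000):
    """Explicit Euler integration of da_i/dt = repression(a_j) - a_i."""
    for _ in range(steps):
        da1 = repression(a2) - a1
        da2 = repression(a1) - a2
        a1, a2 = a1 + dt * da1, a2 + dt * da2
    return a1, a2

# A small initial bias (a hypothetical prepattern) is amplified until one
# cell has high proneural activity and its neighbor is shut down.
high, low = simulate(0.36, 0.34)
print(f"cell 1: {high:.3f}   cell 2: {low:.3f}")
```

In this toy, the cell that starts slightly ahead ends with high activity and represses its neighbor, while a perfectly symmetric start stays symmetric. It is meant only to show how mutual repression through neighbors can resolve a pattern, under the stated simplifications.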
We constructed three different models of the network in Figure 1, which we call "augmented", "standard", and "reduced". The standard network includes all components and interactions shown in Figure 1, except for cis-negative regulation of N activity by Dl and E(spl) autorepression (Figure 1 without red or blue connections). Experimental evidence for each of the latter interactions exists (see the Supplementary Material), but the literature has not given them much attention. Neither did we initially, but our results below regarding the augmented network (which adds the red connections) indicate that these may indeed be important. Our reduced network eliminates intracellular negative feedback from AC and/or SC to suppress ac and sc transcription (blue connections replacing red and green connections and their E(spl) hub). Such a simplified network could have functioned in a precursor to the Drosophila network since the similar process of anchor cell specification in the worm Caenorhabditis elegans appears to take place without E(spl)-like genes or function (X. Karp and I. Greenwald, personal

Involvement of Putative SNF2 Chromatin Remodeling Protein DRD1 in RNA-Directed DNA Methylation

Current Biology

eling protein CHR35 (At2g16390) [15], which is a member of a previously uncharacterized SNF2-like protein subfamily that is unique to plants. The DRD1 subfamily can be defined by four ProDom [16] domains (Figure 5). These overlap with matches to the functional signatures SNF2_N and HELICc, which together constitute the SWI/SNF ATPase domain essential for chromatin remodeling activity [17]. The drd1-1 mutation consists of a G-to-R change in the putative Mg2+ binding site of SNF2_N. Five additional drd1 alleles (drd1-2, drd1-3, drd1-4, drd1-5, and drd1-6) were identified and sequenced. They all contained a mutation in strongly conserved or functionally implicated regions of the SWI/SNF ATPase domain (Figure 5).
The DRD1 subfamily comprises six additional members, including a clear DRD1 homolog in rice (BAC84084) (Figure S2). CHR34 (At2g21450), which still shares all six ProDom domains, is the Arabidopsis protein most similar to DRD1. Another rice protein (AAM15781) is highly similar to DRD1 and also contains all six domains. The remaining three members [At1g05480, T25N20.14 (Q9ZVY9, similar to CHR31), and CHR40 (At3g24340)] have only four of the six ProDom domains in common

The stability of proteins in extreme environments

Rainer Jaenicke* and Gerald Boehm

Three complete genome sequences of thermophilic bacteria provide a wealth of information challenging current ideas concerning phylogeny and evolution, as well as the determinants of protein stability. Considering known protein structures from extremophiles, it becomes clear that no general conclusions can be drawn regarding adaptive mechanisms to extremes of physical conditions. Proteins are individuals that accumulate increments of stabilization; in thermophiles these come from charge clusters, networks of hydrogen bonds, optimization of packing and hydrophobic interactions, each in its own way. Recent examples indicate ways for the rational design of ultrastable proteins.

been isolated -- thousands of microbes were isolated from the first samples collected from the Challenger Deep at 110 MPa [2], but very few of them were truly barophilic [3•]. Their proteins are still terra incognita.

Limits of stability and growth

Proteins, independent of their mesophilic or extremophilic origin, consist exclusively of the 20 canonical natural amino acids. In the multicomponent system of the cytosol, these are known to undergo covalent modifications at extremes of temperature, pH and pressure (deamidation, elimination, disulfide interchange, oxidation, Maillard reactions, hydrolysis, etc. [4]).
Extremophiles must compensate for amino acid degradation either by using compatible protectants or by enhanced synthesis and repair. Little is known about the chemistry involved, for example, in the hydrothermal decomposition of proteins, and even less is known about protection and repair. At temperatures beyond 100°C, the thermal stabilities of the common amino acids follow the order (Val, Leu) > Ile > Tyr > Lys > His > Met > Thr > Ser > Trp > (Asp, Glu, Arg, Cys). In many cases, the half-lives of the degradation reactions are significantly shorter than the generation time of hyperthermophilic microorganisms [5]; up to this limit, biomolecules could still be resynthesized at biologically feasible rates. The temperature at which ATP hydrolysis becomes the limiting factor for viability lies between 110 and 140°C [6]. This temperature limit coincides with the temperature range at which the hydrophobic hydration of proteins vanishes and water becomes an `ordinary solvent' [1]. Apparently, both the integrity of the natural amino acids and the formation of the hydrophobic core upon protein folding are essential for viability. Extrinsic factors and compatible solutes may enhance the stability and shift the limits of growth of prokaryotes as well as eukaryotes [7].

Life on earth exhibits an enormous adaptive capacity. Except for centers of volcanic activity, the surface of our planet is `biosphere'.
In quantitative terms, the limits of the biologically relevant physical variables are -40 to +115°C (in the stratosphere and hydrothermal vents, respectively), 120 MPa (for hydrostatic pressures in the deep sea), aw 0.6 (for the activity of water in salt lakes) and

[Figure residue; panel titles: GCN4 - Average sensitivity, GCN4 - Average specificity; axis labels: Oligonucleotide length (nt), Formamide (%), Tiling start position (nt), specific/non-specific intensity (arbitrary units).]

Data extraction from composite oligonucleotide microarrays

Ilya Shmulevich*, Jaakko Astola1, David Cogdell, Stanley R. Hamilton and Wei Zhang

ABSTRACT Microarray or DNA chip technology is revolutionizing biology by empowering researchers in the collection of broad-scope gene information. It is well known that microarray-based measurements exhibit a substantial amount of variability due to a number of possible sources, ranging from hybridization conditions to image capture and analysis. In order to make reliable inferences and carry out quantitative analysis with microarray data, it is generally advisable to have more than one measurement of each gene. The availability of both between-array and within-array replicate measurements is essential for this purpose. Although statistical considerations call for increasing the number of replicates of both types, the latter is particularly challenging in practice due to a number of limiting factors, especially for in-house spotting facilities. We propose a novel approach to design so-called composite microarrays, which allow more replicates to be obtained without increasing the number of printed spots.
INTRODUCTION Oligonucleotide arrays (1,2), both synthesized and spotted, enjoy several advantages over cDNA-based arrays (3,4), such as simpler methodology to obtain DNA and better quality control, options to select high-specificity sequences to avoid cross-hybridization, and the potential to detect alternative spliced variants of genes (5). It is known that microarray gene expression measurements exhibit both between-slide and within-slide variability (6) and that apart from making efforts to improve the technology, having replicate measurements is essential for improving the reliability of subsequent quantitative analysis. Dealing with between-slide variability involves repeating entire microarray experiments. There exist some limitations, however, such as availability of RNA as well as cost factors. To address within-slide variability, the typical approach entails printing replicate spots on the same slide. However, spotting robots typically have a limitation on the number of spots that can be reliably printed. Thus, increasing

each well were resuspended in 1 ml of 50% DMSO array buffer (50 mM for each oligo). Spotting Oligos were spotted onto poly-L-lysine glass slides by a G3 solid pin spotter (Genomic Solutions, Ann Arbor, MI, USA), baked at 65°C for 90 min, and crosslinked with 65 mJ of ultraviolet radiation. Probe labeling, hybridization and quantification

or more oligos into the same spot. The challenge then is to recover the individual gene intensities by observing the intensities of the mixtures. This is, in fact, conceptually simpler than the blind source separation problem because we know exactly which genes are present in which spots and because intensities are simply scalars and not time-varying signals. In addition, the contributions from the mixed oligos are expected to be mutually independent, as they are designed to be non-homologous to each other, which is a fundamental assumption of all oligonucleotide microarrays.
The obvious benefit of this approach is that each gene is given an opportunity to make several contributions in different spots, each time with a different partner, and therefore, is also a type of replication. The question is whether the original gene expressions can be reliably recovered from such mixtures.

The microarray experiments were performed as described previously (13). Briefly, triplicate reverse transcription reactions using 100 mg of total RNA from RKO cells incorporated Cy3 d-CTP into cDNA. After G50 column purification, replicates were combined for uniformity and distributed to three identical microarray slides. Each slide was hybridized overnight at 60°C in a humid incubator, then washed at 37°C with increasing stringency until 0.1× SSC was used. Slides were scanned on a LSIV laser scanner (Genomic Solutions, Ann Arbor, MI, USA) and quantified using ArrayVision software (Imaging Research, Inc, St Catherine's, Ontario, Canada). RESULTS Our experiment consisted of designing a spotted microarray containing 30 genes represented in 50 bp oligos that are expressed at different levels in RKO colon cancer cells based on our prior experiments. Those genes were spotted individually five times each, as well as mixtures of all possible pairs of genes, for a total of (30 × 29) / 2 = 435 pairs. Thus, each of the 30 genes appeared 29 times with different partner genes. Finally, each mixture was replicated five times to facilitate statistical analysis. Total RNA was isolated from RKO colon cancer cells and used for microarray experiments. As a first step, we proceeded to discover how the intensities of signals of the mixtures are related to signal intensities of the individual genes. Prior to any experimentation, it was expected that the intensity of the mixture should be at least an increasing function of the individual intensities. In other words, the higher the expression of the two genes, the higher is the signal from their mixture.
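The pairing design described above can be checked with a short sketch. The code below enumerates all pairs of 30 genes, builds a binary spot-by-gene matrix, and then, purely as an illustration, simulates mixture intensities and recovers the individual intensities by ordinary least squares. The additive mixing with unit scale, the noise level, the simulated intensities and the names `pair_design`, `x_true` and `x_hat` are assumptions made for this example; they are not measurements or code from the study.

```python
from itertools import combinations

import numpy as np

def pair_design(n_genes):
    """Enumerate all gene pairs and build the binary spot-by-gene matrix.

    Row r has ones in the two columns of the genes mixed in spot r.
    """
    pairs = list(combinations(range(n_genes), 2))
    design = np.zeros((len(pairs), n_genes))
    for row, (i, j) in enumerate(pairs):
        design[row, i] = 1.0
        design[row, j] = 1.0
    return pairs, design

pairs, A = pair_design(30)
print(A.shape)                # (435, 30): 435 mixtures of 30 genes
print(set(A.sum(axis=0)))     # every gene takes part in 29 mixtures

# Hypothetical simulation: assume each mixture intensity is the sum of its
# two genes' intensities plus small additive noise.
rng = np.random.default_rng(0)
x_true = rng.uniform(1.0, 10.0, size=30)          # made-up gene intensities
y = A @ x_true + rng.normal(scale=0.05, size=len(pairs))

# Least-squares estimate of the individual intensities from the mixtures.
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.max(np.abs(x_hat - x_true)))
```

Because every gene appears in 29 different mixtures, the system is heavily overdetermined (435 equations for 30 unknowns), which is what lets the least-squares step average out spot-level noise in this simulated setting.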
It was further anticipated that the mixture would be a linear combination of the individual gene intensities. That is, if $x_i$ is the individual intensity of gene $i$, $x_j$ is the intensity of gene $j \neq i$, and $y_{k(i,j)}$ is the intensity of the mixture of genes $i$ and $j$, then $y_{k(i,j)} = a(x_i + x_j) + n$, $i, j = 1, \ldots, 30$, for some scalar $a$ and additive error component $n$. Here, $k(i,j)$ is simply an index that counts from 1 to 435, so $k(1,2) = 1$, $k(1,3) = 2$, ..., $k(29,30) = 435$. Note that since genes are simply mixed in equal proportions, there is no notion of `first' or `second' gene and thus, we would not expect different weights $a_i$ and $a_j$ for genes $x_i$ and $x_j$. Also, for the least-squares approach that we use below, no statistical description of the error component $n$ is required. Rewriting the above relationship in vector-matrix notation, we have: $y = aAx + n$ where $y$ is a 435 × 1 vector of mixtures, $x$ is a 30 × 1 vector of individual gene intensities, $A$ is a binary matrix of size 435 × 30 in which row $k(i,j)$ contains ones in the $i$th and $j$th positions

MATERIALS AND METHODS Oligonucleotide design For the proof-of-principle experiments, we

A novel sensitive microarray approach for differential screening using probes labelled with two different radioelements

H. Salin, T. Vujasinovic, A. Mazurie, S. Maitrejean1, C. Menini, J. Mallet and S. Dumas*

LGN, UMR 7091, CNRS, Batiment CERVI, 5eme Etage, Hopital Pitie Salpetriere, 83 boulevard de l'Hopital, F-75013 Paris, France and 1Biospace Mesures, 10 rue Mercoeur, F-75011 Paris, France

ABSTRACT We have developed a novel microarray approach for differential screening using probes labelled with two different radioelements. The complementary DNAs from the reverse transcription of mRNAs from two different biological samples were labelled with radioelements of significantly different energies (3H and 35S or 33P).
Radioactive images corresponding to the expressed genes were acquired with a MicroImager, a real time, high resolution digital autoradiography system. An algorithm was used to process the data such that the initially acquired radioactive image was filtered into two subimages, each representative of the hybridisation result specific for one probe. The simultaneous screening of gene expression in two different biological samples requires <100 ng mRNA without any amplification. In such conditions, the technique is sensitive enough to directly quantify the amount of mRNA even when present in small amounts: 10⁷ molecules in the probe as assessed with an added control sequence and 2 × 10⁵ molecules with an endogenous tyrosine hydroxylase mRNA. This novel technique of double radioactive labelling on a microarray is thus suitable for the comparison of gene expression in two different biological samples available in only small quantities. Consequently, it has great potential for various biological fields, such as neuroscience. INTRODUCTION DNA array technology is increasingly used for large-scale screening of gene expression. The availability of laser devices that can differentiate between several fluorescent dyes has led to most development efforts being concentrated on fluorescent labelling of probes to be hybridised onto DNA arrays (the immobilised nucleic acid is called the `target' and the free nucleic acid is called the `probe'). The use of two different fluorescent dyes, one to label probes from a control tissue and one to label probes from a tissue of interest, allows normalised quantification of gene expression. For example, standard high

of starting material required for radioactive labelling is only 2–400 ng mRNA to detect 2 × 10⁷ molecules (12). Previously, such analyses were possible only for one mRNA sample at a time.
A technique comparing several mRNA samples on the same high density array but attaining the sensitivity discussed above would be of great value. For example, the results could be normalized, each RNA sample being used as a control for the other, on each target of the microarray, as is possible with double fluorescent labelling (2). These considerations led us to develop a technique for simultaneous hybridisation of two differently labelled radioactive probes on the same glass support microarray and detection of the hybridisation result for each probe separately. The development of this procedure required a device for detection of radioactive emission that could discriminate between different radioactive emission spectra and also with a spatial discrimination appropriate for the microarray density. The MicroImager has these properties. We have previously shown the potential of this device in the discrimination of the radioactive emissions of two different radioelements for in situ hybridisation of two probes on a single tissue section (13,14). Here we describe methods of labelling and hybridisation allowing work with two radioactive probes simultaneously on a single glass support microarray. The sensitivity of this method was analysed and we demonstrate the potential of this novel approach in cases where only small samples are available. MATERIALS AND METHODS Gene array PCR products 300-1500 bp long were purified using the concert nucleic acid purification system and then spotted with an arrayer (Genetix) onto polylysine-coated slides (15). The cDNA clones used were obtained from adult rat brains by RT-PCR, from a positive and exogenous control luciferase cDNA sequence (572 bp insert) in the pGEM-T easy vector (Promega, France) and from a negative and exogenous control neomycin phosphotransferase cDNA sequence (738 bp insert) in the pGEM-T easy vector (Promega). A total of 384 clones were spotted onto the microarray. 
The microarray plan was made up of four blocks of four rows and 24 columns (as shown in Fig. 2). This plan was in duplicate on every microarray. Preparation of the luciferase RNA The luciferase RNA was prepared from the luciferase cDNA described above using the riboprobe combination system T7 (Promega). RNA extraction mRNA was directly isolated from crude extracts of rat brain tissues on magnetic beads [oligo(dT)25 Dynabeads; Dynal]. All experimental procedures were carried out in accordance with the European Communities Council Directive (24.xi.1986) and with the guidelines of the CNRS and the French Agricultural and Forestry Ministry (decree 87848, licence number A91429). All efforts were made to minimise animal suffering and to use only the number of animals necessary to produce reliable scientific data.

Sample preparation for hybridisation Aliquots of 100 ng mRNA were mixed with 0.1 µg random hexamers from a Superscript First-Strand Synthesis System for RT-PCR (Life Technologies, France), heated to 70°C for 10 min and cooled on ice. Probe synthesis and labelling were then performed in the presence of 5 mM MgCl2, 1× reverse transcription buffer (Life Technologies), 10 mM dithiothreitol, 100 U RNaseOUT RNase inhibitor (Life Technologies), 0.05 mM ddTTP, 0.5 mM dGTP and dTTP, 100 U Superscript II reverse transcriptase (Life Technologies) and 10 µCi [35S]dATP (Amersham) and 0.5 mM dCTP or 20 µCi [3H]dCTP (Amersham) and 0.5 mM dATP for the phosphorylated and tritiated probes, respectively, by incubation of the mixtures at 42°C for 50 min. RNA was eliminated by heating at 70°C for 15 min and treatment with 2 U RNase H (Life Technologies) at 37°C for 20 min. Unincorporated nucleotides were removed by passage through a P10 column (Bio-Rad). Hybridisation The probes were added to the hybridisation buffer (3.5× SSC, 0.3% SDS), heated to 95°C for 2 min, cooled to room temperature and then put on the microarray under parafilm (Fuji).
Hybridisation was performed in a cassette chamber (Telechem) submerged in a water bath at 60°C for 16–17 h. Following hybridisation, arrays were rinsed at room temperature in 2× SSC, 0.1% SDS, then 2× SSC, then 0.2× SSC, each washing step lasting 2 min. Acquisition of radioactive images with a MicroImager (Biospace Mesures, Paris, France) A thin foil of scintillating paper was placed in contact with the microarrays. β-Particles emitted by the hybridised probes were identified by acquisition of the light spot emissions in the scinti

Sensitivity and Specificity of Photoaptamer Probes*

Drew Smith§, Brian D. Collins, James Heil, and Tad H. Koch¶

Proteomics, the study of protein expression at the scale of cell, tissue, or organism (1, 2), has been defined by a single technology: two-dimensional gel separation followed by mass spectrometric analysis (3, 4). Although this technology is mature, powerful, and wonderfully sophisticated, it suffers from evident limitations in speed and sensitivity. Several days are required to process a single sample, and only 1000 of the most abundant proteins can be detected (5). The ideal proteomic technology would process samples in minutes or hours and be able to quantify even the most weakly expressed proteins. Two-dimensional gels and chromatographic methods separate and identify proteins on the basis of their physical characteristics. An alternative approach is to identify proteins by specific recognition. The potential advantage of this approach is that proteins that have similar size and charge but which

The abbreviations used are: SELEX, systematic evolution of ligands by exponential enrichment; A, aptamer; aFGF, acidic fibroblast growth factor; bFGF, basic fibroblast growth factor; NHS, N-hydroxysuccinimide; PDGF, platelet-derived growth factor; T, target protein; HIV, human immunodeficiency virus.
Molecular & Cellular Proteomics 2.1

under the harshest and most stringent conditions necessary to reduce background and improve signal. What is not established is the effect of photocross-linking on the specificity of the capture step. We set out to characterize, systematically and quantitatively, a set of photocross-linking aptamers, photoaptamers, with regard to their sensitivity and specificity. The photoreactive unit incorporated into our photoaptamers is 5-bromodeoxyuridine (BrdUrd), used for decades in protein-nucleic acid cross-linking studies. Rather than use short wave (254 or 266 nm) UV light for cross-linking, however, we irradiate at 308 nm using a XeCl excimer laser. This technique was developed by Koch and colleagues (12-16) and has been shown to result in specific and high yield cross-linking reactions. Light at 308 nm induces photoelectron transfer from a nearby electron donor to the bromouracil base via either excitation of the BrdUrd, excitation of the electron donor, or excitation of a BrdUrd-electron donor charge transfer state (17, 18). Amino acid residues that can serve as electron donors in BrdUrd photocross-linking include Tyr, Trp, His, Phe, Cys, Cys-Cys, and Met, of which only Tyr and Trp are excited at 308 nm (16-20). Cross-linking results from subsequent reaction of the resulting radical ion pair. In the absence of an electron donor the BrdUrd efficiently relaxes back to ground state (17). We hypothesized that photocross-linking via photoelectron transfer would actually enhance the specificity of the aptamer-protein capture reaction: although a protein might bind an aptamer nonspecifically, the probability that an appropriate amino acid would be positioned to cross-link with a BrdUrd residue would be low.
Some evidence for this view has been presented by Golden and co-workers (9), who showed that basic fibroblast growth factor (bFGF) photoaptamers could cross-link picomolar concentrations of target in the presence of serum with very little nonspecific cross-linking. Using these bFGF photoaptamers and a new photoaptamer raised against the HIV coat protein gp120MN we evaluated both the equilibrium binding constant and the relative rate of cross-linking to target proteins. We then compared these values to the values for a set of non-target proteins. These non-target proteins were chosen to provide an exacting test of specificity: 1) aFGF and gp120SF2 are the commercially available proteins most closely related to the target proteins; 2) platelet-derived growth factor (PDGF) is a highly basic heparin-binding growth factor that is notorious for its nonspecific DNA binding; and 3) thrombin is another heparin-binding protein. These experiments confirm the specificity of the photocross-linking reaction in the solution phase. We extend these results to microarray format by measuring cross-linking of immobilized photoaptamers to target protein. We find that the sensitivity and specificity of photocross-linking are maintained in this format: target proteins can be detected at subnanomolar concentrations in buffer and at nanomolar concentrations when spiked into serum.

EXPERIMENTAL PROCEDURES

Revealing Global Regulatory Features of Mammalian Alternative Splicing Using a Quantitative Microarray Platform

Molecular Cell

sive use of the latter approach was the application of "exon-junction" microarrays for the discovery of exon skipping events in human tissues and cell lines (Johnson et al., 2003). These authors used custom microarrays containing oligonucleotide probes complementary to mapped exon-exon junction sequences in RefSeq genes for the main purpose of discovering new AS events in human transcripts.
Despite the progress described above, a system has not yet been described that permits the large-scale quantitative profiling of alternative splicing in mammalian cell and tissue sources. This is primarily due to limitations stemming from the design of existing microarrays and the lack of suitable algorithms for data analysis. In this paper, we describe a microarray platform that permits the simultaneous quantification of the levels of thousands of alternative exons in mammalian cell and tissues sources. We have applied this system to the analysis of the regulation of 3126 sequence-verified AS events in diverse mouse tissues. The resulting data have generated hundreds of new inferences for functional roles of tissue-specific AS, insights into how the evolutionary origins of alternative exons relate to their inclusion levels in normal tissues, and information on global features of AS that underlie tissue-type specificity. This study therefore demonstrates the utility of a quantitative microarray platform for generating fundamental new insights into the global regulation of alternative splicing in mammals. Results A Custom Microarray for Quantitative Profiling of AS in Mammalian Cells In order to perform large-scale quantitative analyses of functionally diverse AS events in mammalian tissues, we developed a custom microarray to represent sequencevalidated AS events mined from mouse cDNA and EST sequence databases (refer to Experimental Procedures). To minimize representation of possible splicing errors or relatively low-abundance transcripts, we selected "cassette-type" AS events with the highest numbers of supporting cDNA and EST sequences from different cell and tissue sources. 
To enhance the sensitivity of detection and quantification of inclusion/exclusion levels of alternative exons, each AS event was measured by using six different oligonucleotide probes: one body probe for each exon sequence, designated as "C1, A and C2" probes (C, constitutive; A, alternative), and one junction probe for each of the three splice-junction sequences generated by AS, designated as "C1-A, A-C2 and C1-C2" probes (Figure 1A). In addition, a control probe specific to each intron sequence (located between C1 and A) was included to permit detection of unspliced pre-mRNA and/or contaminating genomic DNA in the hybridizations. From an initial starting set of 4892 AS events in our database, 3126 AS events were selected for monitoring on a single ink-jet printed microarray, manufactured by Agilent Technologies (Figure 1B). The vast majority of the AS events correspond to cassette-type alternative exons, and additional events may correspond to mutually exclusive alternative exons. The 3126 AS events are 0 represented by 2647 distinct genes, with 413 of the genes containing two or more AS events. In addition, 54 of the AS events represented on the microarray are duplicates and were monitored by sets of probes that in some cases are complementary to different sequences within the same exons. These served as reproducibility controls (see below). The 2647 AS genes represented on the microarray are associated with 1118 distinct Gene Ontology Biological Process (GO-BP) categories among a total set of 2362 GO-BP categories assigned to 10,361 Mouse Gene Informatics (MGI) markers (refer to Experimental Procedures; see below). This indicates that the AS genes represented on the microarray are associated with a diverse range of biological functions in mammalian cells. 
Quantitative Microarray Profiling of Alternative Splicing in Mouse Tissues
In order to assess the performance of our microarray system and to reveal global properties of alternative splicing in mammalian tissues, we hybridized

Molecular Cancer Therapeutics

Transcriptome analysis of endometrial cancer identifies peroxisome proliferator-activated receptors as potential therapeutic targets
Cathrine M. Holland,1,2 Samir A. Saidi,2 Amanda L. Evans,1 Andrew M. Sharkey,1 John A. Latimer,2 Robin A.F. Crawford,2 D. Stephen Charnock-Jones,2 Cristin G. Print,1 and Stephen K. Smith1,2

Endometrial carcinoma is the most common gynecologic malignancy and comprises 97% of all uterine cancers (1). There is a peak incidence between ages 55 and 65 years, with <5% of endometrial cancers occurring below age 40 years (2). The majority are of an endometrioid histologic subtype and display an association with obesity and diabetes mellitus (2). There is a pressing need to better understand the molecular basis for this disease, as 25% of women present with extrauterine disease with 5-year survival rates of ~31% and 10% for Federation Internationale des Gynaecologistes et Obstetristes stages 3 and 4 disease, respectively (2). An improved understanding of events at a molecular level is essential in the development of targeted therapy, with a view to improving survival and cure rates. There are increasing efforts to gain a more global view of the multiple, interrelated molecular changes that occur during tumorigenesis (3-6). The gene microarray is a high-throughput technology able to interrogate multiple genetic changes within tissues and cells (7-9). Consequently, there has been a marked increase in the use of microarrays to interrogate cancers at the genomic level. In addition to screening for candidate genes, microarrays may provide molecular diagnoses, thus avoiding some of the weaknesses of conventional diagnostic techniques (4, 10).
Despite the increasing use of microarray technology in cancer research, there have been difficulties obtaining meaningful biological information. The cost of genomewide, commercially available arrays may prohibit large experimental samples, and there are multiple sources of variation in experimental results complicating data analysis and interpretation (11). Large-scale gene expression analyses of endometrial cancer have mostly been confined to small sample sets and cell lines (12, 13) and have employed genome-wide, commercially available microarray systems (12). Previous microarray studies in endometrial cancer have highlighted differences in the abundance of individual genes between benign and malignant tissues (12, 13), although there has been little advance in the understanding of pathway-specific alterations that may contribute to endometrial tumorigenesis. Independent component analysis (ICA) is a sophisticated statistical method that aims to identify patterns of coregulated genes rather than individual transcript changes (14). We previously have applied high-density cDNA microarrays to determine gene transcript abundance in epithelial ovarian cancer (14). 0 Materials and Methods 0 Tumor Samples and RNA Preparation Twenty frozen endometrial carcinoma tissues, three atypical complex hyperplasias, and eight postmenopausal benign endometrial control tissues (four atrophic and four 0 PPARa Is a Molecular Target in Endometrial Cancer 0 quantitative, real-time PCR experiments were done in the ABI PRISM 7700 Sequence Detector (Applied Biosystems) according to the manufacturer's instructions and were done in triplicate. The resultant data were averaged for each sample. No-template controls were included in each experiment. Specific oligonucleotide primers and probes were used. 
These were designed for each of five genes [cyclooxygenase-2 (COX-2), vascular endothelial growth factor-B (VEGF-B), PPARα, PPARγ, and retinoid X receptor β (RXRβ)] using Primer Express 1.5 software (Applied Biosystems). Sequences are given below:
(a) COX-2: 5′-TGATCCCCAGGGCTCAAA-3′ (forward primer), 5′-ATCTGTCTTGAAAAACTGATGCGT-3′ (reverse primer), 5′-6FAM-TGATGTTTGCATTCTTTGCCCAGCACT-TAMRA-3′ (probe);
(b) VEGF-B: 5′-AGCACCAAGTCCGGATG-3′ (forward primer), 5′-GTCTGGCTTCACAGCACTG-3′ (reverse primer), 5′-6FAM-AGATCCTCATGATCCGGTACCCGT-TAMRA-3′ (probe);
(c) PPARα: 5′-GACGTGCTTCCTGCTTCATAGA-3′ (forward primer), 5′-CACCATCGCGACCAGATG-3′ (reverse primer), 5′-6FAM-TGGAGCTCGGCGCACAACCA-TAMRA-3′ (probe);
(d) PPARγ: 5′-CAGAGCAAAGAGGTGGCCAT-3′ (forward primer), 5′-GCTTTTGGCATACTCTGTGATCTC-3′ (reverse primer), 5′-6FAM-CATCTTTCAGGGCTGCCAGTTTCGC-TAMRA-3′ (probe);
(e) RXRβ: 5′-CCATCCGCAAAGACCTTACATAC-3′ (forward primer), 5′-GTTCCGCTGGCGCTTG-3′ (reverse primer), 5′-6FAM-TGCCGGGACAACAAAGACTGCACA-TAMRA-3′ (probe).
Results for gene abundance in each sample were normalized to abundance of an endogenous control gene. 18S rRNA was used as an endogenous control for all genes, with the exception of VEGF-B, for which β-actin was used. Preliminary experiments to determine tha

Patterns of Temperature Adaptation in Proteins from Methanococcus and Bacillus
John H. McDonald,* Alicia M. Grasso,* and Lidia K. Rejto

Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization
Paul T. Spellman,* Gavin Sherlock,* Michael Q. Zhang, Vishwanath R. Iyer,§ Kirk Anders,* Michael B. Eisen,* Patrick O. Brown,§ David Botstein,*¶ and Bruce Futcher

INTRODUCTION
In 1981 Hereford and coworkers discovered that yeast histone mRNAs oscillate in abundance during the cell division cycle (Hereford et al., 1981).
To date 104 messages that are cell cycle regulated have been identified using traditional methods, and it was estimated that some 250 cell cycle-regulated genes might exist (Price et al., 1991). There are several reasons why genes might be regulated in a periodic manner coincident with the cell cycle. Such regulation might be required for the proper functioning of mechanisms that maintain order during cell division. Alternatively, regulation of these genes could simply allow conservation of resources. Much of the literature has focused on the posttranscriptional mechanisms that control the basic timing of the cell cycle. However, there is also clear evidence that trans-acting factors play a critical role in the regulation of the abundance of many cell cycle-regulated transcripts. Most identified cell cycle controls that exert influence over mRNA levels do so at the level of transcription. Three major types of cell cycle transcription factors are known in yeast, the MBF and SBF factors, Mcm1p-containing factors, and Swi5p/Ace2p (Table 1). Many genes expressed at about the G1/S transition contain MCB or SCB elements in their promoters to which MBF and SBF bind respectively (for review, see Koch and Nasmyth, 1994). It is now apparent that SBF is not as specific for SCBs as was originally thought but, rather, can bind, at least in some cases, to motifs more closely matching the MCB consensus (Partridge et al., 1997). MBF and SBF are activated posttranslationally by Cln3p-Cdc28p, and SBF, at least, is inactivated by Clb2p-Cdc28p (Amon et al., 1993). It is this cyclin-dependent activation and inactivation that causes MBF- and SBF-mediated transcription to be cell cycle regulated. Mcm1p can bind with other DNA binding proteins to mediate a specific biological effect. In cooperation with Ste12p, Mcm1p directs the cell cycle expression of some genes in early G1 phase (Oehlen et al., 1996). In cooperation with an uncloned factor called "Swi five factor" (SFF), it induces the expression of CLB1, CLB2, BUD4, and SWI5 in M (Lydall et al., 1991; Sanders and Herskowitz, 1996). Finally, possibly acting without a partner, it induces transcription of CLN3, SWI4, and CDC6 at the M/G1 boundary (McInerny et al., 1997). The Mcm1p-SFF combination is interesting, because it is somehow activated by Clb2p-Cdc28p, and Mcm1p-SFF then induces further transcription of CLB2. Thus, Mcm1p is part of a positive feedback loop for CLB2 transcription. Finally, Swi5p and Ace2p, which are transcriptionally controlled by Mcm1p and SFF, are responsible for the expression of many genes in M and M/G1 (Kovacech et al., 1996). Some of these genes are responsible for inactivating Clb2p and promoting cytokinesis, thus allowing exit from mitosis, and allowing the cycle to begin anew.

by The American Society for Cell Biology

Table 1. Transcription factors that regulate the cell cycle
Complex   Composition     Site name   Site               Reference
SBF       Swi6p, Swi4p    SCB         CACGAAA            Nasmyth, 1985; Andrews and Herskowitz, 1989
MBF       Swi6p, Mbp1p    MCB         ACGCGT             Lowndes et al., 1991; McIntosh et al., 1991; Koch et al., 1993
Mcm1p     Mcm1p           MCM1        TTACCNAATTNGGTAA   Acton et al., 1997
SFF       SFF             SFF         GTMAACAA           Althoefer et al., 1995
Ace2p     Ace2p           SWI5        ACCAGC             Dohrmann et al., 1996
Swi5p     Swi5p           SWI5        ACCAGC             Knapp et al., 1996

Many cell cycle-regulated genes are involved in processes that occur only once per cell cycle. Such processes include DNA synthesis, budding, and cytokinesis.
Additionally many of these genes are involved in controlling the cell cycle itself, although in most cases it is unclear whether their regulated transcription is absolutely required. The cell division cycle is thus a complex self-regulating program, such that

Strains used in this study are shown in Table 2.

Media and Growth Conditions
YEP medium (Sherman, 1991) was used in all experiments, supplemented with an appropriate carbon source. Carbon sources are indicated in the descriptions of each experiment and were used at a

Molecular Biology of the Cell

Microarray Manufacture
Yeast ORFs were amplified using gene PAIRS primers (Research Genetics, Huntsville, AL). One hundred-microliter PCR reactions were performed in 96-well PCR plates using each primer pair with the following reagents: 1 µM each primer, 200 µM each dATP, dCTP, dTTP, and dGTP, 1× PCR buffer (Perkin Elmer-Cetus, Norwalk, CT), 2 mM MgCl2, and 2 U of Taq DNA polymerase (Perkin Elmer-Cetus). Thermal cycling was performed in Perkin Elmer-Cetus 9600 thermal cyclers with a 5-min denaturation step at 94°C, followed by 30 cycles with melting, annealing, and extension temperatures and times of 94°C, 30 s; 54°C, 45 s; and 72°C, 3 min 30 s, respectively. Production of the correct PCR product was verified by gel electrophoresis. Products deemed to have failed were reamplified either by repeating the PCR reaction with the gene PAIRS primers, ordering custom primers, or using the yeast ORF DNA (Research Genetics) as a template. Reamplification of failed PCRs used the same protocol as initial amplification. DNAs were prepared and printed onto microarrays as described previously (Shalon et al., 1996; DeRisi et al., 1997 [http://cmgm.stanford.edu/pbrown/]; Eisen and Brown, 1999) with 190-µm spacing between the centers of each element.
Each microarray was visually inspected, and all microarrays used in this study were estimated to be missing 1% of all elements except for arrays used in the cdc15 experiments, which were missing 3% of all elements. 0 Size-based Synchronization 0 Nine l 0 DNA Microarrays of the Complex Human Cytomegalovirus Genome: Profiling Kinetic Class with Drug Sensitivity of Viral Gene Expression 1 JAMES CHAMBERS,1 ANA ANGULO,2 DHAMMIKA AMARATUNGA,1 HONGQING GUO,1 YING JIANG,1 JACKSON S. WAN,1 ANTON BITTNER,1 KLAUS FRUEH,1 MICHAEL R. JACKSON,1 PER A. PETERSON,1 MARK G. ERLANDER,1 AND PETER GHAZAL2* Departments of Immunology and Molecular Biology, Division of Virology, The Scripps Research Institute, La Jolla, California 92037,2 and The R. W. Johnson Pharmaceutical Research Institute, San Diego, California 921211 0 MATERIALS AND METHODS Selection and synthesis of oligonucleotides for DNA microarrays. The complete set of ORFs from the HCMV genome was analyzed with a custom se- 0 CHAMBERS ET AL. 0 J. VIROL. 0 GTACCGTTGTACGCATTACAC3 ) and 18120 (5 GACGAAGATG CCGATGTGTGAC3 ). The resulting PCR fragments were isolated from agarose gels and then radiolabelled with [ -32P]dATP by the random-primed labelling method (Boehringer, Mannheim, Germany) according to the manufacturer's protocol. For TRL8-IRL8, TRL9-IRL9, UL15, UL31, UL48, UL66, and UL73, the corresponding oligonucleotides shown in Fig. 1 were used as probes, after being [ -32P]ATP end labelled with polynucleotide kinase (Stratagene). Oligonucleotide probes were hybridized to the filters for 1 h at 45°C by using Quick Hybridization solutions (Stratagene) under conditions recommended by the manufacturer. PCR-generated probes were hybridized with the filters for 12 h at 65°C in 1 Denhardt's solution, 6 SSC, and 100 g of denatured salmon sperm DNA/ml. 
Filters were washed to a stringency of 0.1% sodium dodecyl sulfate (SDS) at 60°C or 1% SDS at 42°C depending whether PCR-generated DNA fragments or oligonucleotides, respectively, were used during the hybridization. Hybridization signals were quantitated by using a Molecular Dynamics PhosphorImager system with ImageQuant software. MEME analysis of the upstream noncoding DNA sequences. The computer program Multiple EM for Motif Elicitation (MEME) was used to search for sequence motifs in 500 bp of noncoding sequences upstream of the initiation codon. MEME analysis was performed by using the sequence of strain AD169 of HCMV. The 5 noncoding regions were categorized according to class of expression as follows: E (TRL4-IRL4, UL104-5, UL11, UL112, UL124, UL13, UL16-7, UL24, UL26-7, UL35, UL4-5, UL45, UL53-7, UL77-9, US8-14, US16-7, US19, US23-4, US26, US28, and US30), early-late (E-L) (TRL-IRL6, TRLIRL10, TRL-IRL12, TRL-IRL13, UL1, UL106, UL130, UL40, UL44, UL46-7, UL49, UL72, UL83-5, UL95-8, US6-7, and US29), and L (TRL-IRL8, TRLIRL11, TRL-IRL14, UL100, UL103, UL111A, UL117, UL119, UL131, UL14, UL18, UL2-3, UL7, UL9, UL25, UL29, UL32-3, UL43, UL48, UL52, UL59, UL67, UL73, UL80, UL82, UL91-3, UL99, US18, and US27). By using MEME, 30 motifs (10 of 8 bases in length, 10 of 10 bases in length or longer, and 10 of 12 bases in length or longer) were derived from each gene set. The distribution of the combined 90 patterns was identified, allowing for 10% mismatch. MEME is available on the World Wide Web (20a). The resulting motifs that developed a significant polarized distribution pattern are summarized in Table 2. In addition, the transcription factor database (TFD) was used to search for known regulatory sequences. The TFD was downloaded from the National Center for Biotechnology Information. 0 quence analysis program that selected a 75-base sequence to be used as a microarray deposition target. 
The analysis preferentially selects unique sequences with a 3′ gene bias and a G-C content of 40 to 60% and rejects sequences that contain homopolymeric stretches and potential hairpin structures. The 3′ gene bias is preferred, as fluorescently labelled cDNA prepared for hybridization is generated by using oligo(dT) to prime poly(A) tails of mRNA. The selected target sequences were synthesized by using a PE Perseptive BioSystem (Framingham, Mass.) Expedite MOSS DNA synthesizer with membrane columns. Synthesized gene target oligonucleotides were cleaved, deprotected, and purified by standard procedures. Target oligonucleotides were transferred in triplicate to 96-well master plates at a concentration of 1 µg/µl (in 3× SSC [1× SSC is 0.15 M NaCl plus 0.015 M sodium citrate]) for robotic deposition. The sequence of oligonucleotides comprising the deposited HCMV ORF microarray is shown in Fig. 1. The small ORF UL48/49 (8) and the UL74 ORF described by Huber and Compton (13) were not included in the present chip design. Also shown in Fig. 1 is a subset of cellular genes that were included as internal controls for normalization between chips, as follows: elongation factor 1-alpha (accession no. M29548), human acidic ribosomal phosphoprotein (RiboPO; accession no. M17885), alpha tubulin (accession no. K00558), glyceraldehyde-3-phosphate deh

Accounting Units in DNA
S. J. BELL AND D. R. FORSDYKE*

Chargaff's first parity rule (%A = %T and %G = %C) is explained by the Watson-Crick model for duplex DNA in which complementary base pairs form individual accounting units. Chargaff's second parity rule is that the first rule also applies to single strands of DNA. The limits of accounting units in single strands were examined by moving windows of various sizes along sequences and counting the relative proportions of A and T (the W bases), and of C and G (the S bases). Shuffled sequences account, on average, over shorter regions than the corresponding natural sequence. For an E.
coli segment, S base accounting is, on average, contained within a region of 10 kb, whereas W base accounting requires regions in excess of 100 kb. Accounting requires the entire genome (190 kb) in the case of Vaccinia virus, which has an overall "Chargaff difference" of only 0.086% (i.e. only one in 1162 bases does not have a potential pairing partner in the same strand). Among the chromosomes of Saccharomyces cerevisiae, the total Chargaff differences for the W bases and for the S bases are usually correlated. In general, Chargaff differences for a natural sequence and its shuffled counterpart diverge maximally when 1 kb sequence windows are employed. This should be the optimum window size for examining correlations between Chargaff differences and sequence features which have arisen through natural selection. We propose that Chargaff's second parity rule reflects the evolution of genome-wide stem-loop potential as part of short- and long-range accounting processes which work together to sustain the integrity of various levels of information in DNA.

Academic Press

Introduction
When the base composition of natural duplex DNA is determined it is found that the quantities of A and T are equal and the quantities of C and G are equal. This is Chargaff's famous first parity rule (Chargaff, 1951). If a long DNA duplex is cut into two and the base composition of each part determined, the rule is found to hold precisely for the two parts, as for the duplex of origin. This division of the duplex can be continued down to individual bases (pairing with their complementary bases on the opposite strand of the duplex). Again Chargaff's parity rule is obeyed precisely (Watson & Crick, 1953). Disregarding nearest-neighbour influences (Turner, 1996), single base pairs can be regarded as fundamental "accounting units". The summation of these individual accounting units results in the precise A = T and C = G equivalences of duplex DNA sequences.
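The moving-window accounting described in the abstract can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names are hypothetical, and the normalisation (unpaired bases as a percentage of all bases in the window) is chosen because it reproduces the "one in 1162 bases, i.e. 0.086%" reading quoted above; the authors' exact formula may differ.

```python
from collections import Counter


def chargaff_differences(seq):
    """Return (W, S, total) Chargaff differences for one strand, in percent.

    Assumption: each difference is the number of bases lacking a potential
    pairing partner in the same strand (|A-T| for W, |C-G| for S),
    expressed as a percentage of all A/C/G/T bases counted.
    """
    counts = Counter(seq.upper())
    a, t, c, g = (counts[base] for base in "ATCG")
    n = a + t + c + g
    if n == 0:
        raise ValueError("sequence contains no A, C, G or T")
    w = 100.0 * abs(a - t) / n
    s = 100.0 * abs(c - g) / n
    return w, s, w + s


def windowed_differences(seq, window, step=None):
    """Slide a window along seq; return (start, W, S, total) per window.

    step defaults to the window size (non-overlapping windows).
    """
    step = step or window
    return [
        (start, *chargaff_differences(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Calling `windowed_differences(sequence, 1000)` would give per-window values at the 1 kb window size the abstract singles out; comparing them with the same call on a shuffled copy of the sequence is one way to explore the divergence the authors describe.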
That the equivalences have arisen, and are maintained, because they are of adaptive value to an 0 Academic Press 0 expected to resemble that resulting from the tossing of a biased coin for which heads (A or C) would be slightly favoured/disfavoured over tails (T or G), respectively, depending on their relative proportions in the total segment. The base composition o 0 Review: Proteins with Repeated Sequence--Structural Prediction and Modeling 1 Andrey V. Kajava 0 The relationship between the amino acid sequence and the three-dimensional structure of proteins with internal repeats is discussed. In particular, correlations between the amino acid composition and the ability to fold in a unique structure, as well as classification of the structures based on their repeat length, are described. This analysis suggests rules that can be used for the structural prediction of repeat-containing proteins. The paper is focused on prediction and modeling of solenoid-like proteins with the repeat length ranging between 5 and 40 residues. The models of leucine-rich repeat proteins and bacterial proteins with pentapeptide repeats are examined in light of the recently solved structures of the related molecules. © 2001 Academic Press Key Words: classification; molecular modeling; prediction; tandem repeats; structural bioinformatics. 0 Copyright © 2001 by Academic Press All rights of reproduction in any form reserved. 0 REVIEW: STRUCTURAL PREDICTION OF REPEAT-CONTAINING PROTEINS 0 their number has grown to about 40 since then (Groves and Barford, 1999; Kobe and Kajava, 2000). Despite this progress, these proteins are still underrepresented in the structural databases (about 0.5% of all structures), compared with sequence databases (about 5%). This lack of structural information is explained by the fact that the large molecular weight and the elongated shape of these molecules hamper X-ray and NMR studies. These difficulties add importance to the theoretical approaches. 
In this article, molecular modeling of several solenoidlike proteins will be described and some rules will be formulated for the theoretical prediction and modeling of these types of repetitive proteins. 0 IS A PROTEIN WITH REPEATS STRUCTURED OR UNSTRUCTURED? 0 This is the first question to answer when approaching a repetitive protein to predict its 3D structure. Most protein molecules fold into only one particular conformation determined by their amino acid sequence. This is especially correct for proteins with aperiodic sequences that fold into globular structures. Unstructured fragments of globular proteins, if any, represent only a minor part of the molecules and are located in loops or connections between stable structural domains. In contrast, proteins with repeats frequently do not have unique stable 3D structures. For example, experimental studies have failed to demonstrate the presence of a unique 3D structure for elastin (Urry et al., 1995), small proline-rich proteins of cell envelopes (Steinert et al., 1999), the circumsporozoite protein of Plasmodium falciparum (Esposito et al., 1989; Dyson et al., 1990), glutenin from wheat (Van Dijk et al., 1997), the serine-rich domain of rtoA protein from Dictyostelium discoideum (Brazill et al., 2000), histidine-proline-rich glycoprotein (Borza et al., 1996), and H1 histones (Hartman et al., 1977). The elastin molecules containing a set of repeats, e.g., VGVAPG and GFGVGAGVP, are unstructured and covalently cross-linked to generate an elastic meshwork that enables tissues such as arteries and lungs to deform and stretch without damage (Urry et al., 1995). The small proline-rich 3 protein of the human cell envelope having GxTKVPEP repeats (here and further in the text, "x" indicates a position with any residue) adopts a loose structure with some regions of protein occasionally folding in -turn conformations (Steinert et al., 1999). The circumsporozoite protein from P. 
falciparum, an agent of malaria, comprises a long tandem array of NANP repeats. This repetitive region can be elongated and flexible and may function similarly to the outer cell carbohydrates. The H1 histone molecules are thought to be responsible for pulling chromatin nucleosomes 0 The Comparative Genomics of Polyglutamine Repeats: Extreme Difference in the Codon Organization of Repeat-Encoding Regions Between Mammals and Drosophila 1 M. Mar Alba,1 Mauro F. Santibanez-Koref,2 John M. Hancock2,* ` ´~ 0 Abstract. Polyglutamine repeats within proteins are common in eukaryotes and are associated with neurological diseases in humans. Many are encoded by tandem repeats of the codon CAG that are likely to mutate primarily by replication slippage. However, a recent study in the yeast Saccharomyces cerevisiae has indicated that many others are encoded by mixtures of CAG and CAA which are less likely to undergo slippage. Here we attempt to estimate the proportions of polyglutamine repeats encoded by slippage-prone structures in species currently the subject of genome sequencing projects. We find a general excess over random expectation of polyglutamine repeats encoded by tandem repeats of codons. We nevertheless find many repeats encoded by nontandem codon structures. Mammals and Drosophila display extreme opposite patterns. Drosophila contains many proteins with polyglutamine tracts but these are generally encoded by interrupted structures. These structures may have been selected to be resistant to slippage. In contrast, mammals (humans and mice) have a high proportion of proteins in which repeats are encoded by tandem codon structures. In humans, these include most of the triplet expansion disease genes. 0 Key words: Glutamine repeats -- Replication slippage -- Comparative genome analysis -- Repeat evolution -- Triplet expansion diseases -- Triplet repeats -- Genome evolution 0 quences encoding polyglutamine repeats in the yeast genome (Alba et al. 
1999a) indicated that the majority does not consist of long runs of single codons, suggesting that in yeast point mutation is an important process in generating polyglutamine repeats. These observations raise the question to what extent the contribution of point mutation and slippage to the evolution of these structures differs in different evolutionary lineages. To study this we have analyzed large protein data sets from a further four model organisms that are currently the subjects of genome sequencing projects (Escherichia coli, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster) and compared them with S. cerevisiae, Mus musculus, and Homo sapiens repeats. The results show similarities and differences between species. For most of the eukaryotic species there is an overrepresentation of tracts encoded by long CAG tandem repeats, supporting the idea that recent slippage has been involved in the generation of a significant proportion of the tracts. However, on average about 70% of the tracts do not show evidence of recent slippage, and in D. melanogaster there is no clear evidence of a strong contribution from slippage. Furthermore, in the two mammalian species about one-third of the tracts are exclusively encoded by CAG and the length of the tracts is on average much longer than in other species. This suggests that slippage has played a more important role in the evolution of polyglutamine regions in mammals than in other taxa. Methods Database Searches 0 BLASTP (Altschul et al. 1990) at the NCBI was used to find all GenBank entries which contained genes encoding long polyglutamine tracts ( 6 glutamines) from E. coli, S. cerevisiae, C. elegans, A. thaliana, D. melanogaster, M. musculus, and H. sapiens. Redundancy in the primary data sets was eliminated by running FASTA within the GCG package (Pearson and Lipman 1988; GCG 1997). 
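As a hedged illustration of the tract criterion above, the Python sketch below locates glutamine runs in a protein sequence and labels the codon make-up of a tract's coding DNA. It is not the authors' pipeline (they used BLASTP and FASTA); a regular-expression scan is swapped in for clarity. The function names are hypothetical, and because the comparison sign before "6 glutamines" is not preserved in the text, the minimum length is left as a parameter.

```python
import re


def polyq_tracts(protein, min_len=6):
    """Return (start, length) for each run of at least min_len glutamines.

    Assumption: min_len=6 is read as "at least 6"; the source text has
    lost the comparison sign, so adjust if the criterion was stricter.
    """
    pattern = re.compile(r"Q{%d,}" % min_len)
    return [(m.start(), len(m.group())) for m in pattern.finditer(protein.upper())]


def codon_composition(coding_dna):
    """Label an in-frame glutamine-coding region by its codon usage.

    Assumes the region encodes only glutamine (CAG or CAA codons);
    anything other than a single repeated codon is reported as 'mixed'.
    """
    codons = [coding_dna[i:i + 3].upper() for i in range(0, len(coding_dna) - 2, 3)]
    if not codons:
        raise ValueError("region is shorter than one codon")
    kinds = set(codons)
    if kinds == {"CAG"}:
        return "pure CAG"
    if kinds == {"CAA"}:
        return "pure CAA"
    return "mixed"
```

Separating tract detection from codon labelling keeps the protein-level and DNA-level questions apart, which mirrors the distinction the abstract draws between tandem CAG repeats and interrupted CAG/CAA structures.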
Sequences with 95% identity were considered redundant, and only one representative sequence was used in the subsequent analysis. Where there was a discrepancy in the length of the polyglutamine tract in nearly identical sequences, we took the sequence with the longest tract. 0 Analysis of Codon Repeats 0 We used statistical analysis to analyze two properties of polyglutamine repeat-encoding regions. The first was the extent of deviation of the codon organization within these regions from random. This was measured by considering the deviation of the length of the longest run of each codon type from chance expectation (Alba et al. 1999a,b). The second property was the over- or underrepresentation of tandem codon repeats of a particular length in the whole set of polyglutamine-coding regions in a given species. Length of the Longest Homogeneous Run. As described previously (Alba et al. 1999a,b) the organizational homogeneity or otherwise of a region encoding a polyglutamine repeat has to be considered in the 0 Table 1. Polyglutamine tracts in different species
Species            CAG relative frequency (a)     Pure codon tracts
                   Genome      Tracts             CAG       CAA
S. cerevisiae      0.307       0.450*             4.7%      5.4%
C. elegans         0.331       0.430*             2.2%      5.8%
A. thaliana        0.442       0.465 (NS)         2.2%      11.3%
D. melanogaster    0.716       0.728 (NS)         7.3%      0%
M. musculus        0.743       0.824*             37.2%     0%
H. sapiens         0.674       0.830*             26.2%     0%
Chi-square test of the r 0 Tendency for Local Repetitiveness in Amino Acid Usages in Modern Proteins 1 Kazuhisa Nishizawa1*, Manami Nishizawa1 and Ki Seok Kim2 0 Systematic analyses of human proteins show that neural and immune system-specific, and therefore, relatively "modern" proteins have a tendency for repetitive use of amino acids at a local scale (~1-20 residues), while ancient proteins (human homologues of Escherichia coli proteins) do not. Those protein subsegments which are unique based on homology search account for the repetitiveness.
Simulation shows that such repetitiveness can be maintained by frequent duplication on a very short scale (one to two codons) in the presence of substitutive point mutation, while the latter tends to mitigate the repetitiveness. DNA analyses also show the presence of cryptic (i.e. "out of the codon frame") repetitiveness, which cannot fully be explained by features in protein sequences. Simulative modification of the amino acid sequences of immune system-specific proteins estimates that 2.4 duplication events occur during the period equivalent to ten events of substitution mutation. It is also suggested that the repetitiveness leads to longitudinal unevenness within a given peptide domain. Those peptide motifs which contain similarly charged residues are likely to be generated more frequently in the presence of the tendency for repetitiveness than in its absence. Therefore, the neutral propensity of DNA for duplication, which can also tend to generate repetitiveness in amino acid sequences, seems to be manifested primarily when the constraints on amino acid sequences are relatively weak, and yet may be positively contributing to generation of unevenness in modern proteins. 0 Academic Press 0 Keywords: microsatellite; coding regions; peptide motif; triplet repeat 0 Repetitive Use of Amino Acids 0 Results and Discussion 0 Identifying Differentially Expressed Genes in cDNA Microarray Experiments 0 ABSTRACT A major goal of microarray experiments is to determine which genes are differentially expressed between samples. Differential expression has been assessed by taking ratios of expression levels of different samples at a spot on the array and flagging spots (genes) where the magnitude of the fold difference exceeds some threshold. More recent work has attempted to incorporate the fact that the variability of these ratios is not constant. Most methods are variants of Student's t-test.
These variants standardize the ratios by dividing by an estimate of the standard deviation of that ratio; spots with large standardized values are flagged. Estimating these standard deviations requires replication of the measurements, either within a slide or between slides, or the use of a model describing what the standard deviation should be. Starting from considerations of the kinetics driving microarray hybridization, we derive models for the intensity of a replicated spot, when replication is performed within and between arrays. Replication within slides leads to a beta-binomial model, and replication between slides leads to a gamma-Poisson model. These models predict how the variance of a log ratio changes with the total intensity of the signal at the spot, independent of the identity of the gene. Ratios for genes with a small amount of total signal are highly variable, whereas ratios for genes with a large amount of total signal are fairly stable. Log ratios are scaled by the standard deviations given by these functions, giving model-based versions of Studentization. An example is given. Key words: beta-binomial model, microarray replication. 0 BAGGERLY ET AL. 0 INTRODUCTION 0 The human biological system is under the control of perhaps 40,000 genes. Genes are the encoded blueprints for the proteins that perform cellular functions. In going from genes to proteins, there is an intermediate step in which DNA is transcribed to single-stranded messenger RNA (mRNA). It is through mRNA that genes produce protein. Most of the time, the levels of mRNA reflect the abundance of the corresponding proteins in the cell. Perturbations of the cellular environment by such factors as radiation, heat, food intake, or genetic mutation lead to altered expression in a specific group of genes. A goal of functional genomics is to apply high-throughput technologies to identify, from the vast number of genes, the few genetic and molecular changes associated with a defined phenotype.
Identification of these genes can help us diagnose disease, identify targets for specific therapeutic intervention, or simply understand the basis of the underlying biological processes. A primary tool for functional genomics is the complementary DNA (cDNA) microarray, which is commonly used to measure the relative expression levels of thousands of genes in a given cell population. Using this approach, researchers have successfully found disease-related genes (Bittner et al., 2000; Clark et al., 2000; Fuller et al., 1999), and have developed new molecular classification schemes for cancers (Bittner et al., 2000; Golub et al., 1999). Microarrays are produced in a laboratory by placing thousands of different cDNA clones onto a solid surface: a nylon membrane or a chemically coated glass microscopy slide. For example, in a typical experiment we print 4,800 spots in a 4 × 12 format of patches, where each patch contains 100 different spots arranged in a 10 × 10 grid. At each spot, approximately 2 nanograms of a specific gene are deposited by a robotic arrayer. Once on the slide, the originally double-stranded DNA is denatured so that it splits into single strands which are bound to the surface. These single strands are then available to serve as specific attractants to the complementary single-stranded DNA molecules, a process called hybridization. To assess the expression levels of the genes in a given cell population, the cells are broken apart chemically (lysed) and total RNA is isolated according to a standard procedure. Then reverse transcriptase is used to convert the mRNA back into single-stranded complementary DNA, which is more stable than RNA. During the process of reverse transcription, fluorescent dyes or radioactively labeled nucleotides can be incorporated, providing a signal that can be monitored by detectors.
Further, two or more different fluorescent dyes can be used to label different samples, thus allowing simultaneous monitoring of two samples on the same microarray. After the labeled cDNA in a solution is obtained, it is placed onto the microarray surface and incubated to allow specific binding to the different DNA molecules bound to the array. We customarily call the immobilized DNA on the microarray "probe" and the labeled DNA in solution "target." (This target/probe dichotomy is, unfortunately, not set; the literature contains both this usage and the converse. We have chosen to follow the definition adopted in the January 1999 supplement to Nature Genetics, "The Chipping Forecast.") The amount of probe on the array is assumed to be vastly in excess of the amount of target, so that the amount binding to the probe is a function of the target copy number in the mixture. After washing to remove the nonspecific binding, the hybridized microarray is scanned using a laser scanner (for fluorescence) or a phosphorimager (for radioactive labels). We will focus on fluorescent labeling on glass slides in this paper, but the model proposed also holds for radioactive labeling, since the hybridization kinetics are similar. Both scanners produce computer images of the entire array whose pixel values are processed to estimate the rough amounts associated with individual spots. Unfortunately, these measurements do not correspond perfectly to the true expression levels. Reverse transcription and label incorporation work with different efficiencies for different mRNA sequences, so the relative expression levels of different genes within a sample cannot be measured reliably. However, the relative expression levels of the same gene in two different samples can be measured, as the reverse transcription efficiencies should be about the same. Comparing the images introduces two types of offset that must be corrected for.
First, there is a multiplicative offset, a normalization factor, associated with scans being made using different gain settings or using different amounts of raw material in the two samples. Second, there is a background level associated with the nonspot portions of the image, which must be subtracted before comparisons are made. Estimating and correcting for these offsets introduces variation, which we shall address below. For more detailed descriptions of the experimental protocols used in microarray preparation, the reader is referred to some of the papers addressing protocols (Eisen and Brown, 1999; Hedge et al., 2000). 0 Thus, cDNA microarrays allow us to compare genetic profiles of different samples (Schena et al., 1995, 1996). We may be able to use these profiles to identify genetic markers associated with various diseases by contrasting diseased and healthy tissue. Further, we may arrive at a more objective method of pathology that allows us to identify molecularly distinct subcategories of diseases, paving the way for more focused treatments. Some of this potential is beginning to be realized (Alizadeh et al., 1999; Alon et al., 1999; DeRisi et al., 1997; Eisen and Brown, 1999; Golub et al., 1999; Hughes et al., 2000b; Lee et al., 2000; Pollack et al., 1999; Ross et al., 2000; Scherf et al., 2000). Books on the methodology (Schena, 1999, 2000) are beginning to appear. From a statistical point of view, the initial question to be addressed in comparing relative expression levels is whether an observed difference corresponds to a real difference or simply a statistical fluctuation: How do we assess significance? Early papers (Schena et al., 1995, 1996; DeRisi et al., 1996) focused on sets of genes exhibiting more than a k-fold difference in expression level between samples, where the value of k was chosen more or less arbitrarily.
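The k-fold rule used in these early papers can be made concrete with a short sketch. The Python below is illustrative only: the function names, the default k = 2, and the example intensities are assumptions, not taken from the cited papers, and the two intensities are assumed to be already normalized and background-corrected.

```python
import math


def log2_ratio(sample_a, sample_b):
    """Return log2(a / b) for two positive, background-corrected intensities."""
    if sample_a <= 0 or sample_b <= 0:
        raise ValueError("intensities must be positive after background correction")
    return math.log2(sample_a / sample_b)


def exceeds_fold(sample_a, sample_b, k=2.0):
    """Flag a spot whose ratio is at least k or at most 1/k (k is an arbitrary choice)."""
    return abs(log2_ratio(sample_a, sample_b)) >= math.log2(k)


# A 4-fold increase and a 4-fold decrease have the same magnitude on the log scale.
up = log2_ratio(400.0, 100.0)
down = log2_ratio(100.0, 400.0)
print(up, down, exceeds_fold(400.0, 100.0), exceeds_fold(150.0, 100.0))
```

Working with the absolute log ratio treats a ratio of k and a ratio of 1/k identically, so one threshold covers both up- and down-regulation. The sketch makes no allowance for the variability of the ratio changing with signal intensity.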
Focusing on fold differences reduces to focusing on ratios, or equivalently log ratios, of expression levels. We prefer log ratios because they visually emphasize the equal importance of ratios of k and 1/k; on the log scale these have the same magnitude and differ only in sign. 0 Assessing significance: Historical background 0 In the first statistical attack on the problem of assessing when a log ratio is "significant" (Chen et al., 1997), the use of a fixed fold-difference is restated by assuming that the coefficient of variation associated with each signal is constant, but the fold multiple for significance thresholding is chosen in a less ad hoc fashion. The authors assess the overall level of variability associated with the log ratio measurements for a few "housekeeping" genes whose level of expression is assumed to be constant across samples an 0 General nonlinear framework for the analysis of gene interaction via multivariate expression arrays 1 Seungchan Kim Edward R. Dougherty 1 Michael L. Bittner Yidong Chen 0 National Institutes for Health National Human Genome Research Institute Laboratory for Cancer Genetics 1 Krishnamoorthy Sivakumar 1 Paul Meltzer Jeffrey M. Trent 0 National Institutes for Health National Human Genome Research Institute Laboratory for Cancer Genetics 0 Abstract. A cDNA microarray is a complex biochemical-optical system whose purpose is the simultaneous measurement of gene expression for thousands of genes. In this paper we propose a general statistical approach to finding associations between the expression patterns of genes via the coefficient of determination. This coefficient measures the degree to which the transcriptional levels of an observed gene set can be used to improve the prediction of the transcriptional state of a target gene relative to the best possible prediction in the absence of observations.
The method allows incorporation of knowledge of other conditions relevant to the prediction, such as the application of particular stimuli or the presence of inactivating gene mutations, as predictive elements affecting the expression level of a given gene. Various aspects of the method are discussed: prediction quantification, unconstrained prediction, constrained prediction using ternary perceptrons, and design of predictors given small numbers of replicated microarrays. The method is applied to a set of genes undergoing genotoxic stress for validation according to the manner in which it points toward previously known and unknown relationships. The entire procedure is supported by software that can be applied to large gene sets, has a number of facilities to simplify data analysis, and provides graphics for visualizing experimental data, multiple gene interaction, and prediction logic. © 2000 Society of Photo-Optical Instrumentation 0 Sequences and clones for over a million expressed sequenced tagged sites ESTs are currently widely available. Characterization of these genes lies behind the ability to collect them. Only 14% of identified clusters contain genes even tenuously associated with a known functionality. One way of gaining insight into a gene's role in cellular activity is to study its expression pattern in a variety of circumstances and contexts, as it responds to its environment and to the action of other genes. Recent methods facilitate large scale surveys of gene expression in which transcript levels can be determined for thousands of genes simultaneously. 
In particular, cDNA microarrays result from a complex biochemical-optical system incorporating robotic spotting and computer image formation and analysis.1-5 Since transcription control is accomplished by a method which interprets a variety of inputs,6-8 we require analytical tools for expression profile data that can detect the types of multivariate influences on decision making produced by complex genetic networks. In this paper we discuss a statistical-operational framework for finding associations between expression patterns of genes by determining whether knowledge of the transcriptional levels of a small 0 gene set can be used to predict the transcriptional state of another gene. A feature of the method is that it allows one to incorporate knowledge of other conditions, such as the application of particular stimuli or the presence of inactivating gene mutations, as predictive elements, thereby broadening the classes of information that can be simultaneously evaluated in modeling biological decision making. Our focus is on a general framework: the determination-prediction paradigm for analysis of gene interaction, comparison of constrained and unconstrained prediction in the face of limited microarray replications, estimation of the degree of determination given limited replications, interpretation of the results, and software to assist interpretation. Experimental results will be given for the purposes of explanation and verification. A particular instance of the general methodology has been applied in a separate biological paper see Sec. 4 .9 A methodological perspective is important for appreciating the range of applicability of the proposed framework, which is not limited to cDNA microarrays, but can be used for studying interaction in the context of other kinds of arrays. The mechanism of intergene association is not a factor in statistical prediction. The only factor is the ability to predict the target level from the predictor levels. 
The predictor genes may be upstream or downstream from the target gene in the actual genetic network, some may be upstream and some downstream, or they may be distributed about the network in such a way that their relation to the target gene is based on chains of interaction among various intermediate genes. Whatever the relationship of the predicting genes to the predicted, if knowledge of their states allows us to better predict the expression level of the target gene, then we infer there is some relationship: the better the prediction, the stronger the relation. 0 SPIE 0 October 2000 0 As the first step in carrying out nonlinear genomic prediction on gene expression profiles, data complexity is reduced by thresholding the changes in transcript level into ternary expression data: −1 (down-regulated), +1 (up-regulated), or 0 (invariant). This simplification is motivated by the way in which analysis is carried out on cDNA microarrays and by the need to collect many samples where gene expression levels vary due to altered cellular states. To find connections between genes, enough conditions must be sampled to detect the independent functioning of different genetic networks. This amount of sampling requires data from numerous arrays. When viewed across many arrays, the absolute intensity of signal detected by each element of the detector in this hybridization-based assay can be seen to vary based both on the process of preparing and printing the EST elements, and the processes of preparing and labeling the cDNA representations of the RNA pools. This problem is solved via internal standardization.
An algorithm that first calibrates the data internally to each microarray and statistically determines whether the data justify the conclusion that expression is up regulated or down regulated with 99% confidence is used to detect significant changes in the transcript level.10 Requiring a high confidence level ensures that the logical values −1 and +1 represent significant down and up regulation, and do not result from experimental variability. 0 Nonlinear Multivariate Prediction 0 The purpose of nonlinear multivariate prediction (filtering) is to predict (estimate) the output of a nonlinear system. Consider a system S having inputs X_1, X_2, ..., X_m to be observed and measured, along with other inputs, which we may have no way of measuring, and may not even be able to identify (Figure 1). We do not assume a known mechanism by which the output is determined, nor is there an assumption of causality. The prediction problem is to estimate the output of S given only the inputs X_1, X_2, ..., X_m. As indicated in Figure 1, we view X_1, X_2, ..., X_m as input variables to a logical system L that yields a logical value Y_pred that best predicts the value Y that S would provide, given the knowledge of the inputs X_1, X_2, ..., X_m. Statistical training uses only the fact that X_1, X_2, ..., X_m are among the inputs to S, the output Y of S can be measured, and a logical system L can be constructed whose output Y_pred statistically approximates Y. The underlying scientific assumption is that the full system S is beyond the reach of current technology and our knowledge of S is derived from its effect on observable input variables. The logic of L represents an operational model of our understanding.
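The prediction setup described above, together with the abstract's description of the coefficient of determination, can be illustrated with a small sketch. It assumes the coefficient takes the form (e0 − e) / e0, where e0 is the error of the best prediction made without observations and e is the error of the best prediction made from the observed genes. That form, the function names, and the toy ternary data are assumptions for illustration. The errors here are estimated on the same samples used to build the predictor, which is optimistic, and this is not the authors' procedure for small numbers of replicated arrays.

```python
from collections import Counter, defaultdict


def error_rate_without_observations(targets):
    """Error of the best constant guess: always predict the most common target value."""
    most_common_count = Counter(targets).most_common(1)[0][1]
    return 1.0 - most_common_count / len(targets)


def error_rate_with_predictors(predictors, targets):
    """Error of a lookup-table predictor: for each observed input pattern,
    predict the target value seen most often with that pattern."""
    table = defaultdict(Counter)
    for pattern, target in zip(predictors, targets):
        table[tuple(pattern)][target] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in table.values())
    return 1.0 - correct / len(targets)


def coefficient_of_determination(predictors, targets):
    """Relative reduction in error gained by observing the predictor genes."""
    e0 = error_rate_without_observations(targets)
    if e0 == 0.0:
        return 0.0  # the target never changes, so there is nothing to improve
    e = error_rate_with_predictors(predictors, targets)
    return (e0 - e) / e0


# Toy ternary data (-1, 0, +1): the target simply copies the single predictor gene.
predictor_states = [(1,), (-1,), (0,), (1,), (-1,), (0,)]
target_states = [1, -1, 0, 1, -1, 0]
print(coefficient_of_determination(predictor_states, target_states))
```

A value near 1 means the observed genes remove almost all of the prediction error, and a value near 0 means they add nothing beyond the best constant guess. Nothing in the calculation depends on whether the predictors lie upstream or downstream of the target.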
It is crucial to recognize that this operational model is contingent on existing technology, which determines the inputs that can be observed, the manner in which the inputs are 0 A Comprehensive View of Regulation of Gene Expression by Double-stranded RNA-mediated Cell Signaling* 1 Gary Geiss§, Ge Jin§¶, Jinjiao Guo¶, Roger Bumgarner, Michael G. Katze, and Ganes C. Sen¶ 0 Double-stranded (ds) RNA, a common component of virus-infected cells, is a potent inducer of the type I interferon and other cellular genes. For identifying the full repertoire of human dsRNA-regulated genes, a cDNA microarray hybridization screening was conducted using mRNA from dsRNA-treated GRE cells. Because these cells lack all type I interferon genes, the possibility of gene induction by autocrine actions of interferon was eliminated. Our screen identified 175 dsRNA-stimulated genes (DSG) and 95 dsRNA-repressed genes. A subset of the DSGs was also induced by different inflammatory cytokines and viruses demonstrating interconnections among disparate signaling pathways. Functionally, the DSGs encode proteins involved in signaling, apoptosis, RNA synthesis, protein synthesis and processing, cell metabolism, transport, and structure. Induction of such a diverse family of genes by dsRNA has major implications in host-virus interactions and in the use of RNAi technology for functional ablation of specific genes. 0 Double-stranded (ds)1 RNA is not a major constituent of mammalian cells, but many viruses produce it during their replication cycle as either an essential intermediate for RNA synthesis or a byproduct generated by annealing of complementary mRNAs encoded by the opposite strands of a DNA virus genome (1). In addition, some viruses encode RNA species, such as VA RNA or EBER RNA, which have considerable ds structures. 
Virtually nothing is known about how dsRNA affects viral and cellular gene expression and functions in a virally infected cell, although the role of PKR, the dsRNA-activated protein kinase, in inhibiting protein synthesis has been studied in cells infected with a variety of viruses (2). In the host-virus interaction context, dsRNA is closely associated with the interferon (IFN) system. dsRNA is a potent inducer of type I IFN synthesis and is believed to be the primary viral gene product that causes IFN production by infected cells (3). dsRNA has important roles in IFN actions as well. It is the obligatory activator of two classes of IFN-induced enzymes: PKR, the IFN-induced protein kinase, and 2-5(A) synthetases, whose products activate the latent ribonuclease, RNaseL. Moreover, transcription of some IFN-stimulated genes (ISGs) is also induced by dsRNA (4). That this induction is direct and not mediated by induced IFN was convincingly demonstrated in IFN-unresponsive cells and in cells that are devoid of the IFN gene locus (5, 6). Direct induction of some ISGs by dsRNA suggests that the encoded proteins will be induced in virally infected cells without any involvement of IFNs. Thus regulation of viral gene expression by these proteins is relevant for all infected cells, even in the absence of IFN treatment. Several transcription factors, such as NF-κB, IRF-3, and ATF-1, are known to be activated by dsRNA (7). Their activation is mediated by protein kinases including PKR, p38, JNK2, and IKK (7, 8), although the pathways of activation are not completely understood. For genes that are induced by either IFN or dsRNA, the same cis-element regulates their induction by both reagents. But entirely different signaling pathways and transcription factors are used by the two inducers (5). There has not been any attempt to systematically define the full repertoire of dsRNA-regulated genes.
Identification of these genes is required not only for revealing the nature of all signaling pathways used by dsRNA but also for defining the set of proteins that are induced by dsRNA or virus infection. In the current study, we started this investigation using a cDNA microarray hybridization analysis of RNA isolated from dsRNA-treated and -untreated GRE cells that are devoid of the type I IFN locus and cannot synthesize IFNs. Using this approach, we have identified more than a hundred DSGs, only a few of which were previously known to be dsRNA-inducible. Furthermore, we also identified multiple down-regulated genes. These genes were induced or repressed by dsRNA strongly, rapidly, and transiently. The encoded proteins are involved in a broad range of cellular functions and metabolic pathways. 0 EXPERIMENTAL PROCEDURES 0 dsRNA-regulated Gene Expression 0 Identification of dsRNA-regulated Genes (DRGs)--For undertaking a systematic analysis of human DRGs, we chose to use the glioma cell line, GRE (5). These cells lack the type I IFN locus and hence cannot synthesize IFN-β or any of the multiple IFN-α species in response to dsRNA or other stimuli. Because dsRNA treatment of GRE cells cannot induce IFNs, the possibility of secondary induction of the IFN-stimulated genes by autocrine actions of IFNs was eliminated. This consideration was highly pertinent because dsRNA is known to be a potent inducer of IFNs, and several DSGs are known to be induced by IFN as well. GRE cells were treated with the dsRNA, poly(I)·poly(C), for 6 h and poly(A) RNA was isolated from treated and untreated cells. We chose the length of treatment to be 6 h, because our previous studies have shown that this is the optimum time for induction of 561 mRNA that encodes the 56 kDa protein, P56 (5). The two sets of 0 Copyright 1997 by the American Chemical Society 0 The Efficiency of Light-Directed Synthesis of DNA Arrays on Glass Substrates 1 Glenn H. McGall,* Anthony D.
Barone, Martin Diggelmann, Stephen P. A. Fodor, Erik Gentalen, and Nam Ngo 0 building blocks in combination with polymeric semiconductor photoresist films as the photoimageable component.3 The development of chemistry and processes for DNA array 0 American Chemical Society 0 McGall et al. Scheme 1 0 (acetic anhydride/1-methylimidazole/2,6-lutidine/THF) and oxidation (I2/pyridine-H2O).7 After removing the acyl protecting groups from the bound fluorescein, relative densities of hydroxyl groups in different regions of the support could then be determined from surface fluorescence intensities. 0 For the purpose of this study, it was not necessary to achieve an absolute measure of the amount of bound fluorescein in any given region of the substrate, although the photon-counting capability of the fluorescence microscope would, in principle, enable one to do so. Instead, differences in surface fluorescence were used to obtain relatiVe values for surface density, providing a simple, internally consistent method for measuring chemical and photochemical efficiencies. 0 Beaucage, S. L. In Protocols for Oligonucleotides and Analogs; Agrawal, S., Ed.; Humana Press: Totowa, New Jersey, 1993; pp 33-61. 0 Light-Directed Synthesis of DNA Arrays on Glass Scheme 2 0 One potential source of interference with this kind of analysis is fluorescence quenching due to energy transfer interactions between adjacent fluorophores on the surface. The initial density of surface functional groups on the silanated glass substrates that were used in this work have been estimated to be in the range of 10-30 pmol/cm2.6 Assuming that the initial silanation of the support g 0 AAAI Press 0 The value of prior knowledge in discovering motifs with MEME 1 Timothy L. Bailey and Charles Elkan 0 MEME is a tool for discovering motifs in sets of protein or DNA sequences. 
This paper describes several extensions to MEME which increase its ability to find motifs in a totally unsupervised fashion, but which also allow it to benefit when prior knowledge is available. When no background knowledge is asserted, MEME obtains increased robustness from a method for determining motif widths automatically, and from probabilistic models that allow motifs to be absent in some input sequences. On the other hand, MEME can exploit prior knowledge about a motif being present in all input sequences, about the length of a motif and whether it is a palindrome, and (using Dirichlet mixtures) about expected patterns in individual motif positions. Extensive experiments are reported which support the claim that MEME benefits from, but does not require, background knowledge. The experiments use seven previously studied DNA and protein sequence families and 75 of the protein families documented in the Prosite database of sites and patterns, Release 11.1. 0 The new sequence model type allows each each sequence in the training set to have exactly zero or one occurrences of each motif. This type of model is ideally suited to discovering multiple motifs in the majority of cases encountered in practice. The motif-width heuristic allows MEME to automatically discover several motifs of differing, unknown widths in a single DNA or protein dataset. We also describe an improved method of finding multiple, different motifs in a single dataset. 0 Overview of MEME 0 The principal input to MEME is a set of DNA or protein sequences. Its principal output is a series of probabilistic sequence models, each corresponding to one motif, whose parameters have been estimated by expectation maximization (Dempster, Laird, & Rubin 1977). 
In a nutshell, MEME's algorithm is a combination of expectation maximization (EM), 0 OOPS, ZOOPS, and TCM models 0 The different types of sequence model supported by MEME make differing assumptions about how and where motif occurrences appear in the dataset. We call the simplest model type OOPS since it assumes that there is exactly one occurrence per sequence of the motif in the dataset. This type of model was introduced by Lawrence & Reilly (1990). This paper describes for the first time a generalization of OOPS, called ZOOPS, which assumes zero or one motif occurrences per dataset sequence. Finally, TCM (two-component mixture) models assume that there 0 Supported by NIH Genome Analysis Pre-Doctoral Training Grant No. HG00005. 0 MEME is an unsupervised learning algorithm for discovering motifs in sets of protein or DNA sequences. This paper describes the third version of MEME. Earlier versions were described previously (Bailey & Elkan 1994), (Bailey & Elkan 1995a). The MEME extensions on which this paper focuses are methods of incorporating background knowledge, or coping with its lack. For incorporating background knowledge, these innovations include automatic detection of inverse-complement palindromes in DNA sequence datasets, and using Dirichlet mixture priors with protein sequence datasets. Dirichlet mixture priors bring information about which amino acids share common properties and thus are likely to be interchangeable in a given position in a protein motif. This paper also describes a new type of sequence model and a new heuristic for automatically determining the width of a motif which remove the need for the user to provide two types of information. 0 an EM-based heuristic for choosing the starting point for EM, a maximum likelihood ratio-based (LRT-based) heuristic for determining the best number of model free parameters, multistart for searching over possible motif widths, and greedy search for finding multiple motifs. 0 for . 
DNA palindromes 0 The last column is an inverted version of the first column, the second to last column is an inverted version of the second column, and so on. As will be described below, MEME automatically chooses whether or not to enforce the palindrome constraint, doing so only if it improves the value of the LRT-based objective function. 0 Expectation maximization 0 Consider searching for a single motif in a set of sequences by fitting one of the three sequence model types to it. The dataset of $n$ sequences, each of length $L$, will be referred to as $X$. There are $m$ possible starting positions for a motif occurrence in each sequence. The starting point(s) of the occurrence(s) of the motif, if any, in each of the sequences are unknown and are represented by the variables $Z = \{Z_{ij}\}$ (called the "missing information"), where $Z_{ij} = 1$ if a motif occurrence starts in position $j$ in sequence $X_i$, and $Z_{ij} = 0$ otherwise. The user selects one of the three types of model and MEME attempts to maximize the likelihood function $\Pr(X \mid \phi)$ of a model of that type given the data, where $\phi$ is a vector containing all the parameters of the model. MEME does this by using EM to maximize the expectation of the joint likelihood of the model given the data and the missing information, $\Pr(X, Z \mid \phi)$. This is done iteratively by repeating the following two steps, in order, until a convergence criterion is met. E-step: compute 0 $$Z^{(t)} = E_{(Z \mid X, \phi^{(t)})}[Z]$$ 0 M-step: solve 0 $$\phi^{(t+1)} = \arg\max_{\phi} E_{(Z \mid X, \phi^{(t)})}[\log \Pr(X, Z \mid \phi)]$$ 0 where $\phi$ is a vector containing all the parameters of the model. This process is known to converge (Dempster, Laird, & Rubin 1977) to a local maximum of the likelihood function $\Pr(X \mid \phi)$. Joint likelihood functions. MEME assumes each sequence in the training set is an independent sample from a member of either the OOPS, ZOOPS or TCM model families and uses EM to maximize one of the following likelihood functions.
The logarithm of the joint likelihood for models 0 It is not necessary that all of the sequences be of the same length, but this assumption will be made in what follows in order to simplify the exposition of the algorithm. 0 A DNA palindrome is a sequence whose inverse complement is the same as the original sequence. DNA binding sites for proteins are often palindromes. MEME models a DNA palindrome by constraining the parameters of corresponding columns of a motif to be the same: the probability of a letter at one position equals the probability of its complement at the mirror-image position. 0 Here, $P_{a,j}$ is the probability of letter $a$ occurring at either a background position ($j = 0$) or at position $j$ of a motif occurrence ($1 \le j \le W$), $\theta_0$ denotes the parameters of the background component of the sequence model, and $\theta_1$ denotes the parameters of the motif component. Formally, the parameters of an OOPS model are the letter frequencies for the background and each column of the motif, and the width of the motif. The ZOOPS model type adds a new parameter, $\gamma$, which is the prior probability of a sequence containing a motif occurrence. A TCM model, which allows any number of (non-overlapping) motif occurrences to exist within a sequence, replaces $\gamma$ with $\lambda$, where $\lambda$ is the prior probability that any position in a sequence is the start of a motif occurrence. 0 are zero or more non-overlapping occurrences of the motif in each sequence in the dataset, as described by Bailey & Elkan (1994). Each of these types of sequence model consists of two components which model, respectively, the motif and nonmotif ("background") positions in sequences. A motif is modeled by a sequence of discrete random variables whose parameters give the probabilities of each of the different letters (4 in the case of DNA, 20 in the case of proteins) occurring in each of the different positions in an occurrence of the motif. The background positions in the sequences are modeled by a single discrete random variable.
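The DNA palindrome definition given above, and the idea of tying corresponding motif columns together, can be made concrete with a short sketch. The function names are hypothetical, and averaging the two tied columns is an assumption chosen for illustration rather than MEME's documented procedure.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def inverse_complement(seq):
    """Complement every base, then reverse the string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_dna_palindrome(seq):
    """True when the sequence equals its own inverse complement."""
    return seq.upper() == inverse_complement(seq)


def tie_palindrome_columns(columns):
    """Average column j with the complemented mirror-image column.

    `columns` is a list of dicts mapping each base to a probability.
    After tying, P(a at column j) == P(complement(a) at column W-1-j).
    """
    width = len(columns)
    tied = []
    for j in range(width):
        mirror = columns[width - 1 - j]
        tied.append({a: 0.5 * (columns[j][a] + mirror[COMPLEMENT[a]])
                     for a in "ACGT"})
    return tied


columns = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
]
tied = tie_palindrome_columns(columns)
```

Because each tied column is an average of two probability vectors, it still sums to one, so the constrained motif remains a valid model while using roughly half as many free parameters.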
If the width of the motif is $W$, we can describe the parameters of the two components of each of the three model types over the sequence alphabet in the same way as 0 For a ZOOPS model, the joint log likelihood is 0 For a ZOOPS model, 0 For a TCM model, 0 The M-step. The M-step of EM in MEME reestimates $\theta$ using the following formula for models of all three types: 0 Finding multiple motifs 0 All three sequence model types supported by MEME model sequences containing a single motif (although a TCM model can describe sequences with multiple occurrences of the same motif). To find multiple, non-overlapping, different motifs in a single dataset, MEME uses greedy search. It incorporates information about the motifs already discovered into the current model to avoid rediscovering the same motif. The process of discovering one motif is called a pass of 0 The conditional probability of a length-$W$ subsequence generated according to the background or motif component of a TCM model is defined to be 0 is a vector-valued indicator variable of lengt 0 New topical antiandrogenic formulations can stimulate hair growth in human bald scalp grafted onto mice 1 Amnon Sintov a,*, Sima Serafimovich b, Amos Gilhar b 0 Keywords: Androgenetic alopecia; Flutamide; Finasteride; Topical drug delivery; Skin permeation; Mice 0 Introduction 0 Testosterone metabolites exert a significant hormonal influence on hair growth by interacting with receptors at the follicular papilla. It has long been known that an increased susceptibility of scalp follicles to these androgens is the main cause of androgenetic alopecia (or male-pattern baldness) in genetically predisposed individuals (Imperato-McGinley et al., 1974; Ebling et al., 1991). In this type of alopecia, scalp follicles exhibit increased levels and activity of scalp 5α-reductase isoenzyme, which converts testosterone (T) to dihydrotestosterone (DHT) (Bingham and Shaw, 1973; Schweikert and Wilson, 1974).
Taken together, increased conversion of T to DHT and increased DHT binding capacity in bald scalp as compared to hairy scalp (Sawaya et al., 1989) provide a mechanistic explanation for androgenetic alopecia. DHT shortens the hair cycle and progressively miniaturizes scalp follicles. The miniaturized follicles all remain present and thus the possibility of reversal by re-enlargement exists. It is reasonable, therefore, to suppose that administration of 5α-reductase inhibitors and/or non-steroidal antiandrogens should bring about this reversal. Finasteride, a 4-azasteroid inhibitor of 5α-reductase, was introduced by Merck in 1989. Finasteride is known to inhibit the prostate 5α-reductase isoenzyme type 2 more effectively than the type 1 isoenzyme predominantly found in the skin of the scalp. However, while type 1 isoenzyme is located in the sebaceous glands, there is still significant activity of type 2 isoenzyme in the hair follicles (Sawaya and Price, 1997). This is, therefore, the reason why finasteride decreased the level of DHT in bald scalps after long-term oral administration (Diani et al., 1992; Dallob et al., 1994); it also provides the justification for the topical mode of delivery. It should be emphasized that oral finasteride has already been introduced as an effective hair growth treatment, with only minor systemic adverse effects. Nevertheless, systemic therapy for a disorder such as male-pattern baldness is obviously not the treatment of choice if the option of topical delivery is available. Another agent with hair growth potential is the nonsteroidal antiandrogen flutamide. This drug, produced by Schering-Plough, was introduced as a new potent compound for treatment of prostatic carcinoma (Martindale, 1993). The systemic administration of flutamide causes several unwanted side effects, such as reducing libido and impairing spermatogenesis in men and feminizing male fetuses in pregnant women.
Topical administration, therefore, is an important goal for such a drug, especially if indicated for skin disorders. In a comparative study, Chen et al. (1995) showed that topical administration of finasteride (in ethanol/propylene glycol vehicle) caused local inhibition of androgen-controlled sebaceous gland growth in hamster flank organ and that it had a similar action to that of the same doses of flutamide. To date, clinical studies have not been performed for testing the efficacy of topical flutamide in male-pattern baldness. It is likely that the success (i.e. effectiveness with minimal systemic exposure) of this drug would be dependent on a well-designed vehicle that would increase skin accumulation and decrease percutaneous absorption. In this paper, we present a new topical base formulation for finasteride and flutamide (representing two anti-DHT categories). We studied the effect of the topical preparations of these two compounds on the growth of human hair in a murine transplantation model. The effect was monitored in scalp skin biopsies taken from bald subjects before plastic surgery procedures. This model, which has been described previously by Gilhar et al. (1988), Van Neste (1996) and De Brouwer et al. (1997), is specific to male-pattern baldness, in which hairs of the bald skin graft do not re-enlarge after transplantation, while the hair of grafts taken from patients with alopecia areata (an auto-immune problem) begins to grow shortly after transplantation (Gilhar and Krueger, 1987). To correlate the pharmacological efficacy of the new drug-vehicle system with its cutaneous penetration properties, topical preparations containing flutamide were tested in vitro using excised hairless mouse skin. 0 Materials and methods 0 Formulation 0 Gel preparations containing 1% of flutamide (Eulexin, Schering-Plough Lab., Belgium) or finasteride (Proscar, Merck Sharp & Dohme, UK) were produced as follows.
The drug was dissolved in ethyl alcohol (30% w/w in the final gel for flutamide, and 58% w/w in the final gel for finasteride); then 1% glyceryl oleate (as an enhancer) and distilled water were added gradually with mixing. The solutions were finally gelled by adding 4% hydroxypropyl methylcellulose (for flutamide) or ethylcellulose (for finasteride). A vehicle corresponding to the flutamide formulation but containing no drugs was prepared for the purpose of in vivo comparison. In addition, a 1% flutamide formulation without enhancer was prepared and tested in vitro together with the formulation containing the enhancer (as described above), and a hydroalcoholic formulation (1:1 ethanol-water). 0 Animals 0 Severe combined immune deficient mice (male Prkdc SCID-Charles River, UK), 2-3 months of age, were used in this study. The mice were grown in a pathogen-free animal facility. 0 Skin grafting 0 Punch grafts, 0.5 mm², obtained from scalp skin of five bald men were used for transplantation to the SCID mice (three grafts per mouse). The transplantation procedure was performed as previously described (Gilhar et al., 1988). Each graft was inserted, through an incision in the skin, into the subcutaneous tissue over the lateral thoracic cage of each mouse, and covered with a standard band aid dressing. The dressing was removed on day 7, and the grafts, which were located at the surface, were treated from day 8 for 60 days as described below. The procedure protocol related to animals was reviewed and approved by the Institutional Animal Care and Use Committee. 0 Treatment 0 Specimens of each topical preparation, 20-30 mg, were spread gently over each transplanted
Table 1 Distribution of the histological hair structures in the treated grafts
Treatment            Anagen (%)   Catagen (%)   Telogen (%)
Before treatment     -            35.7          64.2
Finasteride          30.4         22.8          46.8
Flutamide            47.0         26.5          26.5
Vehicle (control)    10.5         24.6          64.9
 0 Finasteride Flutamide Vehicle (control) 0 a No difference between groups was found for T or DHT (P>0.05). 0 scopically in the horizontal sections with the aid of a calibrated ocular micrometer. Hair structures in the histological specimens were counted. 0 In vitro permeation testing 0 The in vitro diffusion of a topical drug through skin (in which the flux of the drug molecules through human cadaver or animal skin is determined) was performed basically according to the FDA guidelines (Skelly et al., 1987). Bas 0 Ecdysone-regulated puff genes 2000 1 C.S. Thummel 0 Keywords: Ecdysone; Drosophila metamorphosis; Gene regulation 0 these hormones could act directly on the nucleus, triggering a complex regulatory cascade of gene expression (Yamamoto and Alberts, 1976). Through a series of detailed and elegant studies, Ashburner and co-workers proposed a model for the regulation of gene expression by 20-hydroxyecdysone (referred to hereafter as ecdysone) (Fig. 1). Briefly, this model proposed that ecdysone, bound to its specific receptor, directly induces the expression of a small set of early regulatory genes. The protein products of these genes, in turn, repress their own expression and induce a much larger set of late target genes. It was assumed that these late genes would function as effectors that directly or indirectly control the appropriate biological responses to the pulse of ecdysone. Ashburner and colleagues also determined that the late puffs could be divided into two classes, based on their regulation by ecdysone (Ashburner and Richards, 1976). The early-late puffs are induced relatively rapidly after the addition of hormone and require the continuous presence of ecdysone for their activity, much like the early puffs. The late-late puffs, in contrast, are induced at later times and are prematurely induced upon ecdysone withdrawal.
This latter result was interpreted to mean that the ecdy- 0 E63-1: an ecdysone-inducible calcium binding protein that can regulate salivary gland glue secretion 0 Molecular analysis of the 63F early puff provided the first evidence that not all early puffs encode transcriptional regulators. This work identified a pair of divergently transcribed ecdysone-inducible genes: E63-1 and E63-2 (Andres and Thummel, 1995). E63-2 produces a single 1.2 kb mRNA with no extended open reading frames. Genetic studies indicate that this gene has no essential functions during development, suggesting that it may only be expressed due to its proximity to E63-1 (Vaskova et al., 2000). In contrast, E63-1 encodes a calcium-binding protein with four EF hands, most closely related to calmodulin. The regulation of E63-1 provides a further departure from prior studies of early puff genes, in that it is induced by ecdysone in a tissue-specific manner. Low to moderate levels of E63-1 are widely expressed in the third instar larvae, prior to the late larval ecdysone pulse. Only in the salivary gland is E63-1 transcription rapidly and directly induced by the hormone at puparium formation (Andres and Thummel, 1995). This restricted pattern of induction, combined with the known role of calcium-binding proteins in regulating secretion, led to the proposal that E63-1 might contribute to the physiology of the salivary gland by regulating ecdysone-induced secretion. Although loss-of-function mutants provide an ideal means of testing this model, inactivation of the E63-1 gene has no detectable effect on viability or reproduction (Vaskova et al., 2000). In retrospect, this is not surprising, given that other calcium-binding proteins are encoded by the Drosophila genome. Consistent with possible functional redundancy in this pathway, recent studies have shown that salivary glands compromised for both calmodulin and E63-1 are defective in glue secretion (T.V. Do and A.J. Andres, personal communication).
In addition, ectopic expression of E63-1 in transgenic animals is sufficient to trigger glue secretion if the intracellular calcium levels are elevated (A. Biyasheva et al., 2001). Moreover, ecdysone alone can lead to increased levels of intracellular calcium in larval salivary glands, with a detectable increase after 2 h of exposure. Ecdysone thus leads to two responses that can synergistically trigger salivary gland glue secretion -- increased levels of E63-1 expression as well as increased cytoplasmic calcium levels (Fig. 2). Although the time frame for calcium elevation suggests that this is a secondary response to the hormone, the mechanism by which calcium levels are affected remains to be determined. E63-1 protein shows dynamic changes in its subcellular distribution as the salivary glands secrete glue, providing further evidence of a possible role in glue secretion (Vaskova et al., 2000). Initially, before the glue is secreted, E63-1 is localized to cell membranes, in the 0 The E23 early puff gene may regulate ecdysone responses by controlling intracellular hormone concentrations 0 The 23E ecdysone-inducible puff is among the last early puffs described by Ashburner to be 0 Special Feature 0 Signalling by CD95 and TNF receptors: Not only life and death 0 Walter and Eliza Hall Institute of Medical Research, Royal Melbourne Hospital, Parkville, Victoria, Australia 0 Summary Members of the TNF family of receptors play important roles in normal physiology and in defence. The recent rapid progress in the understanding of the mechanisms of apoptosis has been accompanied by assumptions that TNF family receptors such as CD95 (Fas/APO-1) only have a role in regulating cell survival. While regulation of cell death is one important function of TNF family receptors, they are capable of activating signal transduction pathways that have many other effects.
The present review will focus on signalling of some TNF family receptors in the immune system, not only for apoptosis, but also for survival or activation. Key words: apoptosis, CD95, NF-κB, signal transduction, TNF receptors. 0 TNF receptor family 0 The tumour necrosis factor receptor (TNFR)/nerve growth factor receptor (NGFR) family of molecules regulates a number of biological functions, such as growth, differentiation and apoptosis in multiple cell types. In the immune system, members of this receptor family are involved in the development of peripheral lymphoid organs, regulation of induced inflammatory responses and removal of cells at the end of an immune response. The TNFR family consists of more than 15 different molecules. Most are type I membrane proteins which resemble each other largely in their extracellular regions, which all contain 2-6 characteristic cysteine-rich domains.1 The TNF family receptors are activated upon binding of their cognate ligands, most of which are trimers with a structure similar to TNF. Sometimes the ligands are cell bound type II membrane proteins, but several are cleaved off and appear as soluble trimers. Induction of trimers or higher order complexes of the TNF family of receptors allows their cytoplasmic domains to aggregate intracytoplasmic signalling molecules. 0 so-called because it is required for these receptors to transmit apoptotic signals. The DD is a protein-protein interaction motif consisting of six alpha helices that allow two proteins with DD to bind to each other.
Structurally the DD is related to two other homotypic interaction domains, the death effector domain (DED), and the caspase recruitment domain (CARD).2 0 Death domain adaptors: TRADD, FADD, RIP and RAIDD 0 Binding of TNF to TNFR1 induces recruitment of the DD-containing protein TRADD to the DD of TNFR1.3 Overexpression of TRADD alone also induces the TNF-regulated responses apoptosis and activation of the transcription factors NF-κB and Jun kinase (JNK), presumably because TRADD provides docking sites for downstream signalling proteins to the receptor complex.4 Two of the proteins that TRADD recruits to the signalling complex also bear death domains. One of these, RIP, has an N-terminal kinase domain and a C-terminal DD. Knockout studies have shown that RIP is required for induction of NF-κB by TNF.5 The other, Fas-associated protein with death domain (FADD), has a C-terminal DD, and an N-terminal DED. FADD is required for cell death signalling by TNFR1 and also by CD95, to which it binds directly via its death domain.6-8 The DED of FADD allows it to bind to DED in the pro-domain of caspase 8. Through these interactions, ligation of TNFR1 or CD95 can result in the formation of a death-inducing signalling complex, which leads to activation of caspase 8, a cell death effector protease. Once activated, caspase 8 cleaves and activates downstream caspases, such as caspase 3, ultimately leading to cell death.
Because cells from mice lacking caspase 8 are resistant to death induced by TNF receptors, CD95 and DR3, apoptosis triggered by all of these receptors must converge on this caspase.9 However, FADD must have other functions because FADD knockout mice die during embryogenesis, and lymphocytes from FADD-dominant negative transgenic mice do not proliferate normally in response to T cell mitogens in vitro.10-12 0 Signalling pathways controlled by TNF receptors 0 The cytoplasmic domains of the TNFR family, which are more diverse than the extracellular portions, do not have any intrinsic enzymatic activity, hence they signal by inducing aggregation of intracellular adaptor molecules (Fig. 1). 0 Death domains 0 The cytoplasmic domains of TNFR1 (p55), CD95 (Fas/APO-1), NGFR (p75), death receptor (DR) 3, TRAIL-R1 and TRAIL-R2 all bear a motif termed a `death domain' (DD), 1 C Magnusson and DL Vaux 0 The group of TNF receptor-associated factors (TRAF) interacts with members of the TNFR family. To date, six TRAF proteins have been identified: TRAF1, TRAF2, TRAF3 (CRAF, LAP-1, CD40-bp), TRAF4 (CART1), TRAF5 and TRAF6 (reviewed in ref. 18). With the exception of TRAF4, TRAF proteins interact with receptor molecules either directly, or indirectly through binding to other TRAF, or through binding to TRADD. The TNFR2 (p75), CD40, CD30 and lymphotoxin-β receptor (LTβR) contain conserved, cytoplasmic TRAF binding motifs and are able to bind directly to TRAF proteins. Because TRAF2 can bind to TRADD, which in turn can associate with TNFR1, TRAF2 can indirectly participate in signalling from this receptor as well. The TRAF molecules share similar C-terminal domains, designated the TRAF domain, which is involved in protein-protein interactions. TRAF2, TRAF3, TRAF5 and TRAF6 also bear an N-terminal RING finger, a zinc binding motif found in several types of intracellular proteins.19-23 TNF receptor-associated factor proteins interact as homodimers or in heterodimeric complexes.
For example, TRAF2 binds to TRADD, the TNFR2, LTβR, CD40 or CD30 via its C-terminal TRAF domain, probably as a heterodimeric complex with TRAF1 or TRAF5, or as a homodimer.18,19 It has also been shown that TRAF proteins may signal from other receptors in addition to TNFR family molecules. TRAF6, which binds to CD40, is also involved in IL-1 receptor signalling through interaction with IRAK, a serine/threonine kinase that also has a DD.24 Studies of TRAF2 and TRAF3 knockout mice have shown that TRAF proteins are required for activation of Jun/AP-1 signalling by TNF receptors, and have important roles for normal development, since these mice die during early life.25,26 0 RIP is an adaptor protein with a C-terminal death domain that can associate with the DD in the cytoplasmic domain of CD95. Via TRADD, RIP can also associate with the TNFR1.4 Cells from RIP knockout mice show increased susceptibility to TNF-mediated killing and fail to activate NF-κB in response to TNF.5 This indicates that RIP is required for NF-κB activation by TNF. Because RIP is a serine/threonine kinase, it is likely to phosphorylate, and thereby activate, kinases that phosphorylate the inhibitor of NF-κB, IκB.13 Interestingly, RIP knockout mice also have abnormal development of lymph nodes, similar to those in lymphotoxin-β (LTβ) receptor-deficient mice.14,15 Therefore it is possible that RIP also takes part in signalling from these receptors. However, because the LTβR lacks a DD, if it does signal via RIP then it must do so indirectly (see following). Another DD-bearing adaptor molecule implicated in TNF signalling of apoptosis is `RIP-associated ICH-1/CED-3-homologous protein with a death domain' (RAIDD). In addition to the DD, RAIDD has a CARD which allows it to bind to the CARD of procaspase 2.16 Overexpression of RAIDD in vitro induces apoptosis, suggesting that this interaction is functional.
However, the significance of this pathway for induction of cell death is uncertain because neither CD95 ligand (CD95L) nor TNF is able to induce apoptosis in mice lacking FADD or caspase 8. In these mice, RAIDD and caspase 2 would presumably be able to function normally. Furthermore, TNF-α was still able to induce cell death in the absence of caspase 2.17 0 Inhibitor-of-apoptosis proteins 0 In some cell types in vitro, ligation of CD95 is able to activate the JNK/SAPK pathway. A candidate for mediating this 0 CD95 and TNF receptor signalling 0 activity is the CD95 `death domain-associated protein' Daxx, which was identified in yeast two-hybrid 0 Springer-Verlag 1997 1 Russell L. Margolis · Meena R. Abraham · Shawn B. Gatchell · Shi-Hua Li · Arif S. Kidwai · Theresa S. Breschel · O. Colin Stine · Colleen Callahan · Melvin G. McInnis · Christopher A. Ross 0 cDNAs with long CAG trinucleotide repeats from human brain 0 Trinucleotide repeat expansion mutation is now known to cause 12 diseases, most with neuropsychiatric features (Linblad and Schalling 1996; Paulson and Fischbeck 1996; Ross 1995; Zoghbi 1996). Seven of these are known as the type 1 disorders - spinocerebellar ataxia type 1 (SCA1, Orr et al. 1993), SCA2 (Imbert et al. 1996; Pulst et al. 1996; Sanpei et al. 1996), Machado-Joseph disease (MJD or SCA3, Kawaguchi et al. 1994), SCA6 (Zhuchenko et al. 1997), dentatorubral pallidoluysian atrophy (DRPLA, Koide et al. 1994; Nagafuchi et al. 1994), Huntington's disease (HD, Huntington's Disease Collaborative Research Group 1993), and spinal and bulbar muscular atrophy (SBMA, La Spada et al. 1991). Each is caused by a (CAG)n expansion in an open reading frame, resulting in an expanded glutamine repeat. The properties of the repeats in the other (type 2) expansion mutation diseases vary widely. Myotonic dystrophy is caused by a 3′ untranslated region (CTG)n expansion (Brook et al. 1992; Fu et al. 1992; Mahadevan et al. 1992), the A and E forms of fragile X syndrome (Fu et al.
1991; Knight et al. 1993; Kremer et al. 1991; Verkerk et al. 1991) and some cases of Jacobsen's syndrome (Jones et al. 1995) result from 5′ untranslated region (CCG)n expansions, and Friedreich's ataxia is caused by an intronic (GAA)n expansion (Campuzano et al. 1996). Expandable trinucleotide repeats therefore are found in translated, transcribed but untranslated, and intronic regions; they may be G-C or A-T rich and range from minimal to highly variable in length in the normal population. At least four lines of evidence indicate that additional disorders may arise from trinucleotide repeat expansion mutations. First, an antibody (IC2) that specifically recognizes expanded glutamine repeats detects an expansion segregating with SCA7 (Trottier et al. 1995). Second, indirect evidence of CAG expansion has been detected using rapid expansion detection (RED, Schalling et al. 1993) in a pedigree with SCA7, and less clearly in heterogeneous populations of patients with bipolar affective disorder and schizophrenia (Linblad et al. 1996; Linblad and Schalling 1996; O'Donovan et al. 1995). Third, several neurodegenerative disorders, including SCA4, SCA5, SCA7, and familial Parkinson disease, are phenotypically similar to the type 1 expansion mutation disorders. Fourth, anticipation, the phenomenon of increasing phenotypic severity or decreasing age of onset in successive generations affected by a disease (McInnis 1996; Ross et al. 1993), is found in most of the expansion mutation diseases. Anticipation has been detected in a disparate group of other diseases, including affective disorder (Engstrom et al. 1995; McInnis et al. 1993; Nylander et al. 1994), schizophrenia (Chotai et al. 1995; Gorwood et al. 1996; Stober et al. 1995; Thibaut et al. 1995), autism (Stine 1993), familial Parkinsonism (Bonifati et al. 1995; Markopoulou et al. 1995; Payami et al. 1995; Plante-Bordeneuve et al. 1995), familial leukemias (Horwitz et al. 1996), Crohn's disease (Polito et al.
1996), Meniere's disease (Morrison 1995), torsion dystonia (LaBuda et al. 1993), rheumatoid arthritis (McDermott et al. 1996), facioscapulohumeral muscular dystrophy (Tawil et al. 1996), Holt-Oram syndrome (Newbury-Ecob et al. 1996), and familial spastic paraplegia (Raskind et al. 1997). We have sought to identify candidate genes for these disorders by screening cDNA libraries for the presence of DNA fragments containing CAG, CCG, CCA, and AAT trinucleotide repeats (Li et al. 1993; Margolis et al. 1995a, b). Our description of CTG-B37, a cDNA fragment with a highly polymorphic CAG repeat located within an open reading frame on chromosome 12, directly led to the finding that an expansion mutation within the CTG-B37 repeat causes DRPLA (Koide et al. 1994; Nagafuchi et al. 1994). This same strategy of screening cDNA libraries for trinucleotide repeats was later employed to identify the MJD gene (Kawaguchi et al. 1994) and the SCA6 gene (Zhuchenko et al. 1997). Screening genomic contigs for trinucleotide repeats was used to clone the gene for SCA2 (Pulst et al. 1996). Based on the repeats that expand to cause disease, repeats with the highest likelihood of undergoing expansion mutation consist of at least six consecutive CAG or CTG triplets in the transcribed portions of genes expressed in brain. To identify genes with these features, we have screened human adult frontal cortex and fetal brain cDNA libraries at high stringency for the presence of CAG or CTG repeats. We now report the identification and mapping of 19 of these cDNA fragments.
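The screening described above was done by hybridization against cDNA libraries. As a purely illustrative in-silico analogue of the stated criterion (at least six consecutive CAG or CTG triplets), a sequence could be scanned as in the sketch below; the function name `find_triplet_repeats` and its return format are assumptions made here, not part of the original study.

```python
import re


def find_triplet_repeats(seq, min_units=6):
    """Return (start, unit, copies) for each run of CAG or CTG triplets.

    A run must contain at least `min_units` back-to-back copies of the
    same triplet; `start` is the 0-based offset of the run in `seq`.
    """
    pattern = re.compile(
        r"(?:CAG){%d,}|(?:CTG){%d,}" % (min_units, min_units)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        run = match.group()
        hits.append((match.start(), run[:3], len(run) // 3))
    return hits


example = "AT" + "CAG" * 7 + "TTA" + "CTG" * 5
long_runs = find_triplet_repeats(example)          # default threshold of six
all_runs = find_triplet_repeats(example, min_units=5)
```

The threshold is a parameter because the text gives six triplets as a minimum for likely expansion, and a looser or stricter cutoff may suit other uses. The scan covers one strand only, which is why both CAG and CTG are searched.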
0 Materials and methods 0 cDNA cloning Adult human 0 Effects of a motilin receptor agonist (ABT-229) on upper gastrointestinal symptoms in type 1 diabetes mellitus: a randomised, double blind, placebo controlled trial 1 N J Talley, M Verlinden, D J Geenen, R B Hogan, D Riff, R W McCallum, R J Mack 0 Motilin is a 22 amino acid peptide hormone that is expressed throughout the gut.1 Motilin stimulates interdigestive antral contractions promoting gastric emptying; the receptor has recently been identified.2 Erythromycin is a potent motilin agonist, inducing phase 3 of the migrating motor complex1; it accelerates gastric emptying in healthy volunteers as well as in patients with diabetic gastroparesis or those post-vagotomy.3 4 Dyspepsia is a common problem in patients with diabetes mellitus.5 6 Between 27% and 58% of type 1 diabetics are reported to have gastroparesis, usually affecting solids but less often liquids.7 8 Symptoms of diabetic gastroparesis include postprandial distress, early satiety, bloating, fullness, and nausea and vomiting, but while gastroparesis is common, only a minority have overt symptomatology.7 8 Moreover, these symptoms also occur frequently in diabetics who do not have objective evidence of gastroparesis.6 The underlying mechanisms remain in dispute but disturbed vagal parasympathetic function and poor glycaemic control may both be important.8 9 In addition, increased levels of motilin have been observed in diabetic gastroparesis, which is likely to be a compensatory mechanism as motilin levels decreased with the introduction of a prokinetic.10 A prokinetic agent in diabetic gastroparesis has the potential to increase gastric emptying, improve dyspepsia, and better control plasma glucose levels. There has therefore been considerable interest in developing new prokinetics for gastroparesis, including motilin agonists that lack antibiotic activity.
ABT-229 has potent motilin agonist activity with essentially no antibiotic action.11 12 It dose dependently accelerates gastric emptying, and has a half life of 20 hours.11 12 Multidose studies have shown that the maximally effective dose was 5 mg twice daily for accelerating gastric emptying and 2.5 mg twice daily retained a modest but significant prokinetic effect.12 We aimed to test the hypothesis that ABT-229 would relieve postprandial symptoms in patients with diabetes mellitus. We further hypothesised that the maximum therapeutic gain over placebo would be observed in patients with diabetic gastroparesis on higher doses of ABT-229. To test these hypotheses, we conducted a randomised, placebo controlled, dose ranging trial in North American patients with type 1 diabetes mellitus. 0 Abbreviations used in this paper: HbA1c, glycated haemoglobin. 0 Methods 0 The trial was approved by the local institutional review boards, and all patients gave informed consent. 0 PATIENT SELECTION 0 Ambulatory patients at least 18 years of age with documented type 1 diabetes were eligible to be enrolled. All patients were by definition insulin dependent. A minimum three month history of chronic upper abdominal discomfort (that is, one or more of postprandial fullness, bloating, epigastric discomfort, early satiety, belching after meals, postprandial nausea, vomiting, or epigastric pain) was required. A total of 383 patients were screened (by 33 investigators in the USA and three in Canada between June 1997 and August 1998) (fig 1). Patients were required to have a normal upper endoscopy (that is, no ulcers or erosions in the oesophagus and gastroduodenum) in the three months before randomisation.
Furthermore, during the baseline evaluation over 14 days, patients had to have experienced one or more symptoms of postprandial upper abdominal discomfort on three or more days per week and on average have sufficiently severe symptoms (defined as an upper abdominal discomfort severity score of >149 mm and a postprandial fullness severity score of >29 mm on visual analogue scales, as described below). Patients were only enrolled if there were no serious comorbid illnesses and screening laboratory values were normal. Excluded were patients with gastro-oesophageal reflux disease, based on a normal endoscopy (only erythema was permitted), and 0 Figure 1 (patient flow): 383 patients screened; 113 screening failures; 270 patients randomised; 1 patient did not receive study drug; 269 intent to treat patients; 15 prematurely discontinued; 254 completed trial. 0 Each site was supplied with separate sets of study drug for the gastric emptying strata (normal and delayed); to ensure random assignment, patients in each stratum were given a number in sequential order from a separate computer generated randomisation list. A total of 270 patients were randomised but one was lost to follow up after the drug was dispensed and this patient was excluded. Patients treated (n=269) were randomly assigned to receive ABT-229 1.25 mg (n=55), 2.5 mg (n=58), 5 mg (n=53), 10 mg (n=55), or placebo (n=48) twice daily before breakfast and dinner for four weeks. These four doses were chosen based on the gastrokinetic effects of ABT-229 administered in healthy subjects.12 The 2.5 mg twice daily dose was only marginally significantly superior to placebo as it accelerated gastric emptying of the evening meal only. The maximally effective dose in healthy subjects was 5 mg twice daily. As the gastrokinetic effects of ABT-229 were largest in those with slower gastric emptying, a 1.25 mg dose was included in the trial.
To account for the possibility that patients with diabetic gastroparesis might be more resistant to therapy and require a higher dose, 10 mg was also included. Overall, 15 patients prematurely discontinued; the reasons were adverse events (n=10), treatment failure (n=2), lost to follow up (n=1), or other reasons (n=2), and the distribution was similar in each arm (fig 1). In total, 254 patients completed the trial.

The placebo was identical in appearance to active therapy. All medication was supplied in double blinded multidose bottles. An administrative blind break occurred for one patient.

Compliance, measured by a tablet count at week 4, was excellent. A minimum of 97% of patients in each treatment arm were at least 75% compli

Quality Indicators Increase the Reliability of Microarray Data

Wolfgang Raffelsberger,1 Doulaye Dembele,1 Mike G. Neubauer,2 Marco M. Gottardis,3 and Hinrich Gronemeyer1,*

1Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS/INSERM/ULP, B.P. 10142, F-67404 Illkirch Cedex, C. U. de Strasbourg, France
Departments of 2Applied Genomics and 3Oncology Drug Discovery, Bristol-Myers Squibb Pharmaceutical Research Institute, Princeton, New Jersey 08543-4000, USA

Large-scale gene expression profiling with DNA microarrays opens new dimensions to molecular biology but still lacks the overall precision of traditional low-scale techniques. We developed a novel strategy of data processing linking search stringency to quality indicators for efficient detection of low-level, regulated genes. Using retinoid-induced differentiation of NB-4 promyelocytic cells, the variation of expression profiles between biological duplicates was studied and compared with the changes induced by all-trans retinoic acid (atRA) treatment. An analysis of 4320 genes showed that retinoic acid has a mainly gene-activating function in NB-4 cells.
Treatment with atRA for 18 hours induced metabolic genes that may be associated with cell differentiation and signaling factors triggering later events leading to apoptosis; cytokine genes were among the highest stimulated by atRA. Notably, we identified a regulatory loop inhibiting MYC action: as MYC was downregulated, a cognate repressor of MYC was upregulated. Key Words: retinoic acid, cell differentiation, gene expression profiling, biostatistics 0 Until recently only a limited number of genes were accessible to gene expression profiling, as northern blot, RT-PCR, and ribonuclease protection assays are designed for single genes or small groups of genes at a time. During the course of the human genome project, comprehensive cDNA libraries became available allowing the development of techniques for massive parallel expression profiling. Two types of microarrays emerged either using oligonucleotides directly synthesized on a chip surface (Affymetrix) [reviewed in 1,2] or depositing cDNA PCR products on glass slides [reviewed in 1,3]. In parallel, clustering algorithms for data analysis have been developed [4-7]. High-density microarrays allowed genome-wide screening programs for identification of target genes or expression profiles in disease and cancer [reviewed in 8-10]. Large amounts of data have been generated quickly, but several types of problems encourage the development of novel concepts for data evaluation. Large data sets with intrinsic variation ("noisy data") have to be interpreted by recognizing and excluding outlier data from subsequent analysis in an automated and highly reliable way. 0 Edge Effect and Normalization The microarrays used had a considerable edge effect: spots located close to the edge of a slide displayed lower fluorescence signals than duplicate spots in the center of the slide. For each column a correction factor was introduced minimizing the normalized differences of spot-duplicate (left/right). 
As low spot intensity values have 3- to 10-fold elevated deviation (Fig. 1A), only the 60% most intense spot pairs were used. Spots at saturation were excluded. All normalizations between replicate slides or subsequently between different samples were based on the assumption that there are no major changes in expression levels for the bulk of the genes tested. This assumption was valid: it is supported by the near-identical shapes of cumulative frequency histograms of fluorescence intensities for different slides after median normalization (Fig. 1B).

Comparison with Quantitative RT-PCR and Previous Results Obtained with Affymetrix GeneChips
From preliminary experiments 18 genes were selected and their atRA-induced expression was assessed by real-time PCR. In general, most results were in agreement with the

arrays revealed upregul

Assessing the Drosophila melanogaster and Anopheles gambiae Genome Annotations Using Genome-Wide Sequence Comparisons

Olivier Jaillon,1 Carole Dossat,1 Ralph Eckenberg,1 Karin Eiglmeier,2 Beatrice Segurens,1 Jean-Marc Aury,1 Charles W. Roth,2 Claude Scarpelli,1 Paul T. Brey,2 Jean Weissenbach,1 and Patrick Wincker1,3

1Genoscope/Centre National de Séquençage and CNRS UMR 8030, 91057 Evry Cedex, France; 2Unité de Biochimie et Biologie Moléculaire des Insectes, Institut Pasteur, Paris 75724 Cedex 15, France

We performed genome-wide sequence comparisons at the protein coding level between the genome sequences of Drosophila melanogaster and Anopheles gambiae. Such comparisons detect evolutionarily conserved regions (ecores) that can be used for a qualitative and quantitative evaluation of the available annotations of both genomes. They also provide novel candidate features for annotation. The percentage of ecores mapping outside annotations in the A. gambiae genome is about fourfold higher than in D. melanogaster. The A.
gambiae genome assembly also contains a high proportion of duplicated ecores, possibly resulting from artefactual sequence duplications in the genome assembly. The occurrence of 4063 ecores in the D. melanogaster genome outside annotations suggests that some genes are not yet or only partially annotated. The present work illustrates the power of comparative genomics approaches towards an exhaustive and accurate establishment of gene models and gene catalogues in insect genomes.

nome annotations. We therefore carried out this type of global comparison between these two insect genomes.

RESULTS AND DISCUSSION
The Drosophila Annotation

Genome Research
Jaillon et al.

Ecores        47,134   n.d.   46,742   n.d.
Genes         13,468   n.d.   13,666   n.d.
Exons         54,771   n.d.   61,085   n.d.
Ecores/gene   3.17     n.d.   3.2      n.d.

Genes and exons stand for annotated genes and exons in the corresponding versions.

Drosophila/Anopheles Genomes Comparison

eral explanations that are not mutually exclusive may account for this observation. The high number of ecores could be the consequence of (1) an increased coding capacity in the genome of Anopheles, or (2) a larger number of pseudogenes or unmasked transposable elements in Anopheles, or (3) problems in the sequence assembly. Explanations (1) and (2) were not supported by a previous comparative analysis (Zdobnov et al. 2002). The presence of at least two different haplotypes in the A. gambiae strain sequenced is known to have int

How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach

Wei Pan*, Jizhen Lin and Chap T Le*

Microarrays are used to measure the (relative) expression levels of thousands of genes (or expressed sequence tags).
A comparison of gene expression in cells or tissues from two conditions may provide useful information on important biological processes or functions [1,2]. The challenge now is how to detect those genuine changes from noisy data. It is now known that simply using fold changes, as in the earlier days, is unreliable and inefficient [3,4]. More sophisticated statistical methods are called for. Many proposals have appeared in the literature [3-10]. In particular, it has been noticed that it may be necessary to design an experiment that uses multiple arrays (or multiple spots on each array) containing multiple measurements for each gene under each condition. One reason is that, because of a high noise-to-signal ratio, a single array may not provide enough information that can be reliably extracted [11]. More important, multiple measurements from each gene make it possible to assess the potentially different variability of genes. The problem then seems to fall within the traditional two-sample comparison in statistics. Two of the best known two-sample statistical tests are the two-sample t-test and the Wilcoxon test (or equivalently, the Mann-Whitney test). The t-test is parametric and is based on the assumption that the gene-expression levels have normal distributions. In contrast, the Wilcoxon test is nonparametric and is based on the ranks of observed gene-expression levels. Although the t-test is robust to departures from normality and the Wilcoxon test

Genome Biology

Results and discussion
A statistical model
We consider a generic situation in which, for each gene $i$, $i = 1, 2, \ldots, N$, we have (relative) expression levels $X_{1i}, \ldots, X_{mi}$ from $m$ microarrays under condition 1, and $Y_{1i}, \ldots, Y_{mi}$ from $m$ arrays under condition 2. We need to assume that $m$ is an even integer.
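The two-sample comparison described above can be illustrated gene by gene. The following is a minimal sketch, not the procedure proposed in the paper: it assumes the unequal-variance (Welch) form of the t-statistic and a plain pair-counting form of the Mann-Whitney U statistic, and the function names `t_statistic` and `rank_sum_u` are hypothetical. It computes test statistics only, not p-values.

```python
import math
from statistics import mean, variance


def t_statistic(x, y):
    """Two-sample t-statistic for one gene.

    x, y -- expression levels under condition 1 and condition 2.
    Assumption: unequal-variance (Welch) form, using sample variances.
    """
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def rank_sum_u(x, y):
    """Mann-Whitney U statistic for sample x.

    Counts the pairs (xi, yj) with xi > yj; a tie contributes 1/2.
    Only the ordering of the values matters, which is what makes the
    Wilcoxon / Mann-Whitney test rank-based.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u
```

With m measurements per condition, U ranges from 0 to m*m, and values near either extreme indicate that one condition consistently ranks above the other.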
A general statistical model is assumed for gene expression data:

$$X_{ji} = \mu_{(1),i} + \varepsilon_{ji}, \qquad Y_{li} = \mu_{(2),i} + e_{li},$$

where $\mu_{(1),i}$ and $\mu_{(2),i}$ are the mean expression levels for gene $i$ under the two conditions respectively, and $\varepsilon_{ji}$ and $e_{li}$ are independent random errors with means and variances $E(\varepsilon_{ji}) = E(e_{li}) = 0$, $\mathrm{Var}(\varepsilon_{ji}) = \sigma^2_{(1),i}$ and $\mathrm{Var}(e_{li}) = \sigma^2_{(2),i}$. [...] depend on the mean expression $\mu_{(c),i}$. Also, we do not even need to assume that $\sigma^2_{(1),i} = \sigma^2_{(2),i}$ unless $\mu_{(1),i} = \mu_{(2),i}$. A goal is to detect all genes with $\mu_{(1),i} \neq \mu_{(2),i}$. This can be accomplished through statistical hypothesis testing.

nonparametrically. T

Copyright 2004 by the Genetics Society of America DOI: 10.1534/genetics.104.026658

The DrosDel Collection: A Set of P-Element Insertions for Generating Custom Chromosomal Aberrations in Drosophila melanogaster

Edward Ryder,* Fiona Blows,* Michael Ashburner,* Rosa Bautista-Llacer,* Darin Coulson,* Jenny Drummond,* Jane Webster,* David Gubb,* Nicola Gunton,* Glynnis Johnson,* Cahir J. O'Kane,* David Huen,* Punita Sharma,* Zoltan Asztalos,* Heiko Baisch, Janet Schulze, Maria Kube, Kathrin Kittlaus, Gunter Reuter, Peter Maroy, Janos Szidonya, Asa Rasmuson-Lestander,§ Karin Ekstrom,§ Barry Dickson,** Christoph Hugentobler, Hugo Stocker, Ernst Hafen, Jean Antoine Lepesant, Gert Pflugfelder,§§ Martin Heisenberg,*** Bernard Mechler, Florenci Serras, Montserrat Corominas, Stephan Schneuwly,§§§ Thomas Preat,**** John Roote* and Steven Russell*,1

GENETICALLY tractable model organisms are valuable research tools for uncovering basic biological principles that are conserved through evolution. Many molecular pathways, such as signaling cascades, gene regulatory pathways, and cell cycle control circuits, were first characterized genetically in model systems. The subsequent molecular cloning of the genes involved in such pathways has shown how evolution has utilized basic molecular building blocks to control a wide variety of biological processes.
Key to the success of such approaches has been the ability to carry out genetic screens 0 for components that function in particular pathways and characterize how individual genes participate in such pathways. The fruit fly, Drosophila melanogaster, is one such tractable model that has been used extensively to elucidate many conserved genetic hierarchies. One particularly powerful approach with Drosophila is the ability to rapidly carry out focused genome-wide screens for pathway components by identifying loci that modify specific phenotypes (see St. Johnston 2002 for review). In this approach, a sensitized genetic background, most commonly exhibiting an easily scored adult phenotype such as rough eyes or a wing defect, is used to search for mutations in genes that make the phenotype more severe (enhancer) or more like wild type (suppressor). Mutation-bearing chromosomes are introduced into the 0 E. Ryder et al. 0 specific recombinase (FRT site) placed within intron one. In the case of RS3, a second FRT site is placed upstream of the first of the mini-white exons; in the case of RS5 the second FRT site is located downstream of the mini-white exons. Golic and Golic demonstrated how a pair of RS3 and RS5 elements can be used to generate chromosome rearrangements by design. These chromosome rearrangements include both deficiencies and duplications (Figure 6). Since the insertion site of any P element can be precisely mapped to the genomic sequence, the end points of any chromosome aberration derived from a pair of these RS elements can be determined with single-base-pair resolution. The problem of genetic background heterogeneity is less easily overcome. Powerful genetic methods are available with D. melanogaster to construct "isogenic" lines and we have used these methods in our current screen (Ashburner 1989). 
However, in the absence of practical methods to preserve these lines cryogenically, there is no way to prevent the slow, but inevitable, divergence of these lines in subsequent years. While this may be a drawback in the long term, there can be no doubt that, in the medium term, a deficiency kit in a homogeneous genetic background will be of considerable utility in genome-scale analysis of Drosophila. We describe here the construction of a set of isogenic lines that form the basis for a mobilization screen with RS elements. We describe the isolation and mapping of 3000 new P-element-insertion lines on this background and demonstrate their utility for generating deletions precisely mapped onto the genome sequence. This work is a prelude to an ongoing effort to generate a precisely mapped deletion kit that will cover as much of the genome of D. melanogaster as is possible. In addition, we have constructed a genetic and computational toolkit that allows individual researchers to design and synthesize deletions in regions of particular interest. The materials we have generated are all publicly available. 0 MATERIALS AND METHODS Genetic nomenclature is according to FlyBase (2003). The FM7 balancer stocks were ob 0 Steroid signaling in plants and insects--common themes, different pathways 1 Carl S. Thummel1 and Joanne Chory2,3 0 Outside of mammals, two model systems have been the focus of intensive genetic studies aimed at defining the molecular mechanisms of steroid hormone action--the flowering plant, Arabidopsis thaliana, and the fruit fly, Drosophila melanogaster. Studies in Arabidopsis have benefited from a detailed description of the brassinosteroid (BR) biosynthetic pathway, allowing the effects of mutations to be linked to specific enzymatic steps. 
More recently, the signaling cascade that functions downstream from BR production has been defined, revealing for the first time how the hormone can exert its effects on gene expression through a cell surface receptor and phosphorylation cascade. In contrast, studies of steroid hormone action in Drosophila began in the nucleus, with a detailed description of the transcription puffs activated by the steroid hormone 20-hydroxyecdysone (20E) in the giant polytene chromosomes. Subsequent genetic studies have revealed that these effects are exerted through nuclear receptors, much like mammalian hormone signaling. Most recently, genetic studies have begun to elucidate the ecdysteroid biosynthetic pathway which, until recently, remained largely undefined. Our current understanding of steroid hormone signaling in Arabidopsis and Drosophila provides a number of intriguing parallels as well as distinct differences. At least some of these differences, however, appear to be due to deficiencies in our understanding of these pathways. Below we discuss recent breakthroughs in defining the molecular mechanisms of BR biosynthesis and signaling in plants, and we compare and contrast this pathway with what is known about the mechanisms of ecdysteroid action in Drosophila. We raise some current questions in these fields, the answers to which may reveal other similarities in steroid signaling in plants and animals. Brassinosteroid biosynthesis and homeostasis Although plants and animals diverged more than 1 billion years ago, it is remarkable that polyhydroxylated 0 steroidal molecules are used as hormones in both of these kingdoms, as well as in algae and fungi. Brassinosteroids (BRs), a class of plant-specific steroid hormones, control many of the same developmental and physiological processes as their animal and fly counterparts, including regulation of gene expression, cell division and expansion, differentiation, programmed cell death, and homeostasis. 
The regulation of these processes by BRs, acting together with other plant hormones, leads to the promotion of stem elongation and pollen tube growth, leaf bending and epinasty, root growth inhibition, proton-pump activation, and xylem differentiation (Mandava 1988; Clouse and Sasse 1998). In addition, useful agricultural applications have been found such as increasing yield and improving stress resistance of several major crop plants (Ikebawa and Zhao 1981; Cutler et al. 1991). Although the existence and biological activity of these plant steroids had been described in a large body of literature, they only found their way into the mainstream of plant hormone biology a few years ago, when the available biochemical and physiological data were complemented by the identification of BR-deficient mutants of Arabidopsis (Clouse et al. 1996; Kauschmann et al. 1996; Li et al. 1996; Szekeres et al. 1996), pea (Nomura et al. 1999), and tomato (Bishop et al. 1999; Koka et al. 2000). Mutations in 8 loci of Arabidopsis and several additional loci in tomato and pea result in plants with reduced levels of BR biosynthetic intermediates and lead to distinct phenotypes (Bishop et al. 1996; Li et al. 1996; Szekeres et al. 1996; Choe et al. 1998a,b, 1999a,b, 2000; Klahre et al. 1998; Nomura et al. 1999; Kang et al. 2001). In Arabidopsis, loss-of-function mutations in these genes have pleiotropic effects on development. In the dark, the mutants are short, have thick hypocotyls and open, expanded cotyledons, develop primary leaf buds, and inappropriately express light-regulated genes. In the light, these mutants are dark green dwarfs, have reduced apical dominance and male fertility, display altered photoperiodic responses, show delayed chloroplast and leaf senescence, have reduced xylem content, and respond improperly to fluctuations in their light environment 0 Thummel and Chory 0 (Chory et al. 1991, 1994; Millar et al. 1995; Szekeres et al. 1996; Fig. 1). 
Such phenotypic differences between BR-deficient mutants and wild-type Arabidopsis plants indicate that these genes (and by inference, BRs) play an important role throughout Arabidopsis development. Exogenous application of brassinolide (BL, the most active BR, and generally thought to be the endpoint of the biosynthetic pathway) leads to the normalization of their phenotypes. A biosynthetic pathway derived solely from biochemical studies provided an excellent framework for the characterization of these mutants, and was in turn confirmed and refined by their analysis (for review, see Clouse and Sasse 1998; Noguchi et al. 2000; Friedrichsen and Chory 2001; Fig. 1). Because of their striking mutant phenotypes, which led to the identification of most BR biosynthetic genes, considerable progress has been made in understanding the mechanisms of BR homeostasis. Multiple control mechanisms for regulating the levels of BRs in plants have been identified, including regulation of biosynthesis, inactivation, and feedback regulation from the signaling pathway. BR-deficient mutants have helped to determine that BL is not synthesized via a simple linear biosynthetic pathway. Recently, two pathways, the early C-6 oxidation and late C-6 oxidation pathways, were proposed for the biosynthesis of BL (Choi et al. 1996, 1997). In the early C-6 oxidation pathway, hydroxylation of the side chain occurs after C-6 oxidation, whereas in the late C-6 oxidation pathway the hydroxylation of the side chain occurs before position 6 of the B-ring is oxidized. Feeding experiments with intermediates of both pathways provided strong genetic evidence that both pathways operate in Arabidopsis (Fujioka et al. 1997; Choe et al. 1998a). A study with dwf4 mutants suggests that 6-deoxo-cathasterone is a starting point for a new subpathway as this compound is able to rescue dwf4 mutations (Choe et al. 1998a).
Of note, DWF4, a C-22 hydroxylase, appears to be the major rate-limiting step in the BR biosynthetic pathway based on feeding studies and overexpression of DWF4 in transgenic plants (Choe et al. 2001). Similarly, 6α-hydroxycampestanol could also be a starting point for a different subpathway whose intermediates act as "bridging molecules" between the early and late C-6 oxidation pathways. One simple explanation for plants having multiple pathways of BL biosynthesis is that these subpathways might be differentially regulated by various environmental or developmental signals. A possible point for light regulation of BR biosynthesis has very recently been identified and is indicated in red in Figure 1 (Kang et al. 2001). In addition, feeding experiments using det2 and dwf4 mutants have shown that BRs in the late C-6 oxidation pathway are more effective in rescuing light phenotypes, whereas the BRs in the early C-6 oxidation pathway show stronger activity in promoting hypocotyl elongation of dark-grown seedlings (Fujioka et al. 1997; Choe et al. 1998a). Endogenous levels of BRs are increased in BR-signaling mutants, such as Arabidopsis bri1 and its orthologous mutants in tomato, pea, and rice (discussed below; Noguchi et al. 1999; Yamamuro et al. 2000; Bishop and Yokota 2001). These BR-insensitive mutants show the largest increases in the early C-6 oxidation BRs. In Ara-

GENES & DEVELOPMENT

Steroid hormone signaling

Fredj Tekaia a,*, Edouard Yeramian b, Bernard Dujon a

Keywords: Hyperthermophiles; Mesophiles; Thermostability; Amino acid composition; Evolution; Multivariate analyses

Introduction
One major aim of large-scale genomic projects is to reach a global understanding of the physiological functioning of living organisms.
Such understanding must encompass the puzzling discovery that certain organisms live in extreme conditions of temperature, pressure, and salinity, which were originally thought to be incompatible with life (for a recent review see Rothschild and Mancinelli, 2001, and references therein). With the genomic sequences of these organisms becoming available, it is rather surprising that no striking genomic counterparts seem to be associated with such extreme lifestyles. For example, at the DNA level, an

GENERAL AND COMPARATIVE

Yolk steroid hormones and sex determination in reptiles with TSD

Abstract
In reptiles with temperature-dependent sex determination (TSD), the temperature at which the eggs are incubated determines the sex of the offspring. The molecular switch responsible for determining sex in these species has not yet been elucidated. We have examined the dynamics of yolk steroid hormones during embryonic development in the snapping turtle, Chelydra serpentina, and the alligator, Alligator mississippiensis, and have found that yolk estradiol (E2) responds differentially to incubation temperature in both of these reptiles. Based upon recently reported roles for E2 in modulation of steroidogenic factor 1, a transcription factor known to be significant in the sex differentiation process, we hypothesize that yolk E2 is a link between temperature and the gene expression pathway responsible for sex determination and differentiation in at least some of these species. Here we review the evidence that supports our hypothesis. © 2003 Elsevier Science (USA). All rights reserved.

Temperature-dependent sex determination
Sex determination is thought to occur in two basically different modes. There is genetic sex determination (GSD), in which sex chromosomes determine the sex of the individual, and environmental sex determination (ESD), in which environmental factors determine sex.
In one form of ESD, temperature-dependent sex determination (TSD), the temperature at which the eggs are incubated determines the sex of the hatchlings. There are three different patterns or temperature profiles that have been described for TSD species, male-female (MF), female-male (FM), and female-male-female (FMF). In the MF pattern, low temperatures produce a majority of males, high temperatures produce mostly females, and intermediate temperatures produce a ratio of males to females. The intermediate temperature that produces a 1:1 ratio of males to females is referred to as the pivotal temperature for the species. Several turtle species have been reported to show this profile, including the painted turtle, Chrysemys picta and the red-eared slider turtle, Trachemys scripta (Ewert et al., 1994). In the FM pattern, the temperature regimen is reversed, with high 0 temperatures producing mainly males, low temperatures producing primarily females, and again, intermediate temperatures producing ratios of males to females. This pattern has been reported for some lizards (Viets et al., 1994), including the skink, Eulamprus tympanum, the only viviparous TSD lizard reported to date (Robert and Thompson, 2001). In the third TSD pattern, FMF, females are produced at low temperatures, a majority of males are produced at an intermediate temperature, and predominantly females are produced again at high temperatures. In this system there are two pivotal temperatures at which ratios of males to females are produced. This pattern is displayed in all the crocodilians studied to date, including the American alligator, Alligator mississippiensis (Lang and Andrews, 1994). 
In the snapping turtle, Chelydra serpentina, the usual TSD pattern is FMF (Ewert et al., 1994); however, the TSD pattern in some populations of snapping turtles varies slightly from that described, being MF, with males predominating at lower temperatures, females at higher temperatures, and a single pivotal temperature range. The period of development during which sex is determined, the thermosensitive period (TSP), falls within the middle one-third to one-half of the total incubation time (Wibbels et al., 1991a), and temperature influences the rate of development as well as the sex of the hatchling.

Temperature is apparently not the only factor influencing sex determination, at least in some of these species. There are reports of large variations in the ratios of males to females produced among clutches of eggs laid by different females at the pivotal temperature, where one would expect to see a 1:1 ratio (Rhen and Lang, 1998, Fig. 1). This would indicate that other factors, perhaps some maternal contribution, could influence the outcome of the sex determining process. Clutch identity or "clutch effects" have also been reported to influence other aspects of offspring fitness, including residual yolk mass, fat body mass and total mass of hatchling snapping turtles (Rhen and Lang, 1999). Moreover, studies of post-hatch growth of snapping turtles showed significant clutch effects in growth rates that were independent of egg mass (Rhen and Lang, 1995). These differences could also be due to differential hormone deposition in yolk, as has been reported in some avian species (Frank et al., 1991; Schwabl, 1996; Schwabl et al., 1997).

Gene expression patterns during sex differentiation of TSD reptiles
What is known about the sex differentiation process in reptiles with TSD? The gene expression pattern that leads to sex determination and subsequent testis or ovary differentiation has been defined best in mammalian species, which utilize GSD.
SRY (Sex-determining region of the Y chromosome) is thought to be the primary determinant of testis differentiation in mouse and human systems (reviewed by Koopman et al., 2001), but there is no known homologue of SRY in TSD reptiles. There are a number of candidate genes that are present

but since the embryonic adrenal gland is extremely active, these results do not accurately reflect activity of the gonad alone (T. Wibbels, personal communication). Since in mammalian species SF-1 works in conjunction with SOX9 to up-regulate AMH for male differentiation, SF-1 must participate in completely different interactions in chickens and alligators, where it is upregulated in females. Recent reports indicate that DAX1, an orphan nuclear receptor, inhibits the expression of genes in the male differentiation pathway, possibly by modulating the activity of SF-1 (reviewed by Parker and Schimmer, 2002). DAX1 also has reported interactions with estrogen receptors and is thought to act as a corepressor, so it could play a role in estrogen signaling pathways (Zhang et al., 2000). Cytochrome P450 aromatase expression, a

FEBS 23893

Gene expression data analysis

Alvis Brazma*, Jaak Vilo

what are the functional roles of different genes and in what cellular processes do they participate; how are genes regulated, how do genes and gene products interact, what are these interaction networks; how does gene expression level differ in various cell types and states, how is gene expression changed by various diseases or compound treatments.

Knowing the gene transcript abundance in various tissues, developmental stages and under various conditions is important for attacking these questions. Although mRNA is not the ultimate product of a gene, transcription is the first step in gene regulation, and information about the transcript levels is needed for understanding gene regulatory networks.
Moreover, the measurement of mRNA levels is currently considerably cheaper and can be done in a more high-throughput way than direct measurement of protein levels. The correlation between mRNA and protein abundance in the cell may not be straightforward; nevertheless, the absence of mRNA in a cell is likely to imply a low level of the respective protein, and thus at least qualitative estimates about the proteome can be based on the transcriptome information. The mRNA and protein level correlation studies are under way (see [1]). Monitoring gene expression at the transcript level has become possible due to the advent of DNA microarray technologies (see [2]). A microarray is a glass slide onto which single-stranded DNA molecules are attached at fixed locations (spots). There may be tens of thousands of spots on an array, each related to a single gene. Microarrays exploit the preferential binding of complementary single-stranded nucleic acid sequences. There are several variations of microarray technologies, each used in a specific way. One of the most popular experimental platforms is used for comparing mRNA abundance in two different samples (or a sample and a control). RNA from the sample and control cells is extracted and labeled with two different fluorescent labels, e.g. a red dye for the RNA from the sample population and a green dye for that from the control population. Both extracts are washed over the microarray. Gene sequences from the extracts hybridize to their complementary sequences in the spots. To measure the relative abundance of the hybridized RNA, the array is excited by a laser. If the RNA from the sample population is in abundance, the spot will be red; if the RNA from the control population is in abundance, it will be green. If sample and control bind equally, the spot will be yellow, while if neither binds, it will not fluoresce and will appear black.
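The colour rule described above can be sketched in a few lines. This is an illustration only: the detection threshold `floor`, the dominance ratio `fold`, and the function names are assumptions introduced here, not values or procedures taken from the text.

```python
import math


def spot_color(red, green, floor=100.0, fold=2.0):
    """Classify one spot from its two channel intensities.

    red   -- intensity of the sample channel (red dye)
    green -- intensity of the control channel (green dye)
    floor -- assumed threshold below which a channel counts as unbound
    fold  -- assumed ratio beyond which one channel is said to dominate
    """
    if red < floor and green < floor:
        return "black"                       # neither sample nor control bound
    ratio = max(red, 1.0) / max(green, 1.0)  # clamp to avoid division by zero
    if ratio >= fold:
        return "red"                         # sample RNA in abundance
    if ratio <= 1.0 / fold:
        return "green"                       # control RNA in abundance
    return "yellow"                          # sample and control bind about equally


def log2_ratio(red, green):
    """Relative expression of sample versus control on a log2 scale."""
    return math.log2(max(red, 1.0) / max(green, 1.0))
```

A log2 ratio is a common way to turn the two intensities into a single number per spot: it is positive when the sample channel dominates, negative when the control channel dominates, and near zero when the two bind equally.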
Thus, from the fluorescence intensities and colors for each spot, the relative expression levels of the genes in the sample and control populations can be estimated. By measuring transcription levels of genes in an organism under various conditions, at different developmental stages and in different tissues, we can build up `gene expression profiles' which characterize the dynamic functioning of each gene in the genome. We can imagine the expression data represented in a matrix with rows representing genes, columns representing samples (e.g. various tissues, developmental stages and treatments), and each cell containing a number characterizing the expression level of the particular gene in the particular sample. We will call such a table a gene expression matrix. Building up a database of such matrices will help us to understand gene regulation, metabolic and signaling pathways, the genetic mechanisms of disease, and the response to drug treatments. For instance, if overexpression of certain genes is correlated with a certain cancer, we can explore which other conditions affect the expression of these genes and which other genes have similar expression profiles. We can also investigate which compounds (potential drugs) lower the expression level of these genes. 0 2. From raw data to gene expression matrix 0 Like many experimental technologies, microarrays measure the target quantity (i.e. relative or absolute mRNA abundance) indirectly by measuring another physical quantity: the intensity of the fluorescence of the spots on the array for each fluorescent dye, i.e. for each optical wavelength (so-called channel). Therefore the raw data produced by microarrays are in fact monochrome images (Fig. 1).
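Before the image-processing step is considered, the gene expression matrix defined above can be pictured with a small sketch. The gene names, sample names and numbers below are invented for illustration, and Pearson correlation is used as one possible way to look for "similar expression profiles"; the text does not prescribe a particular similarity measure.

```python
from math import sqrt

# Rows are genes, columns are samples (all names and values are hypothetical).
genes = ["geneA", "geneB", "geneC"]
samples = ["tissue1", "tissue2", "treated", "control"]
matrix = [
    [1.0, 2.0, 3.0, 4.0],   # expression profile of geneA
    [2.0, 4.0, 6.0, 8.0],   # geneB rises with geneA
    [4.0, 3.0, 2.0, 1.0],   # geneC falls as geneA rises
]

def pearson(x, y):
    """Pearson correlation of two profiles (assumes neither is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def most_similar(gene):
    """Return the other gene whose profile correlates best with `gene`."""
    i = genes.index(gene)
    scores = {g: pearson(matrix[i], matrix[j]) for j, g in enumerate(genes) if j != i}
    return max(scores, key=scores.get)

partner = most_similar("geneA")
```

Each entry of such a matrix is ultimately derived from the monochrome channel images produced by the scanner.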
Transforming these images into the gene expression matrix is a nontrivial process: the spots corresponding to genes on the microarray should be identified, their boundaries determined, the fluorescence intensity from each spot measured and compared to the background intensity and to these intensities for other channels. The software for this initial image processing is often provided with the image scanner, since it will depend on particular properties of the hardware. Often laborious manual adjustment of the grid for spots is used. We will not discuss the raw data processing in detail in this paper; a survey of image analysis software can be found at http://cmpteam4.unil.ch/biocomputing/array/software/MicroArray_Software.html. In any physical experiment it is important to know not only the value of the measurement, but also the standard error or 0 Nutrient control of gene expression in Drosophila: microarray analysis of starvation and sugar-dependent response 1 Ingo Zinke, Christina S.Schütz, Jörg D.Katzenberger, Matthias Bauer and Michael J.Pankratz1 0 Institut für Genetik, Forschungszentrum Karlsruhe, Postfach 3640, D-76021 Karlsruhe, Germany 0 We have identified genes regulated by starvation and sugar signals in Drosophila larvae using whole-genome microarrays. Based on expression profiles in the two nutrient conditions, they were organized into different categories that reflect distinct physiological pathways mediating sugar and fat metabolism, and cell growth. In the category of genes regulated in sugar-fed, but not in starved, animals, there is an upregulation of genes encoding key enzymes of the fat biosynthesis pathway and a downregulation of genes encoding lipases. The highest and earliest activated gene upon sugar ingestion is sugarbabe, a zinc finger protein that is induced in the gut and the fat body.
Identification of potential targets using microarrays suggests that sugarbabe functions to repress genes involved in dietary fat breakdown and absorption. The current analysis provides a basis for studying the genetic mechanisms underlying nutrient signalling. Keywords: fat/feeding/microarrays/starvation/sugar 0 Halaas, 1998). Malfunctioning of physiological pathways underlying nutrient signalling and energy homeostasis can have major consequences for human health, and the modern society is facing ever increasing cases of physiological disturbances such as eating disorders, diabetes and obesity. As the dietary requirement for sugars, fats and amino acids is essentially universal, many aspects of the basic logic of nutrient signalling should be conserved. The finding that both Drosophila and Caenorhabditis elegans possess components of insulin signalling supports this view (Lehner, 1999; Brogiolo et al., 2001; Gems and Partridge, 2001). As part of our analysis of Drosophila larval feeding behaviour, we previously identified lipase 3 (lip3) and phosphoenolpyruvate carboxykinase (pepck) as being upregulated upon starvation (Zinke et al., 1999). Upon addition of sugar, this upregulation was completely suppressed for lip3, but not for pepck. These results demonstrated that different nutrient conditions can have very specific effects on gene expression patterns in Drosophila larvae. We have now used Affymetrix microarrays to identify genes regulated by starvation and by sugar in order to study the mechanisms underlying nutrient signalling. Based on the pattern of response to different nutrient conditions and on existing knowledge of metabolic pathways, we could categorize the identified genes into groups that reflect distinct physiological functions. We have further characterized a zinc finger transcription factor that is one of the earliest and highest upregulated genes upon sugar ingestion. 
Identification of potential target genes indicates that this transcription factor functions to repress genes involved in dietary fat breakdown and absorption. 0 Drosophila larvae are continuous feeders and show large growth in a relatively short time period. About 5 days after egg laying (AEL), they stop feeding, leave the food to enter the wandering stage and pupariate shortly thereafter (Figure 1A). Within this normal developmental progression, there are several notable variations that become apparent under different environmental conditions. One intriguing observation was made by Beadle et al. (1938). When larvae are starved before 70 h AEL, they die within several days, whereas if they are starved after this time point, they do not grow, but still survive and differentiate to give rise to small adult flies. The authors concluded that some `organizational change occurs in larvae at about 70 h' and termed this the `70 h change' (Beadle et al., 1938). This survival after the 70 h change period is independent of whether the larvae are starved or placed on sugar; however, before 70 h, larvae placed in sugar live for much longer than those under starvation conditions (over a week as compared with ~2 days; see also Britton and Edgar, 1998; Zinke et al., 1999). Clearly, there is a difference in the metabolic programme that becomes activated across this point upon change in nutrient status. As the period before 70 h is critical for survival, we decided to perform the experiments prior to this point. For each time and nutrient condition, two chips were used with each chip being hybridized to the samples collected independently (Figure 1B). 0 © European Molecular Biology Organization
0 Categorization of nutrient-dependent genes 0 Mechanisms for differences in monozygous twins 1 Paul Gringrasa,*, Wai Chenb,c 0 Keywords: Twin; Monozygous; Genetic mechanisms 0 Introduction Over 200 pairs of twins are assessed each year at the Multiple Births Foundation, London. Despite often appearing indistinguishable to strangers, no `identical' twins assessed are so alike that their mothers fail to distinguish them accurately. Physical differences may be as subtle as one small mole, or a differently positioned hair crown; 0 but still, they exist and are unmistakable once identified. Many parents can also differentiate their `identical' twins by their personalities, some even claim from a very early age. Physical similarities between MZ twins are well recognised; and these similarities have long formed the basis of many instruments and clinical methods designed to classify zygosity, such as questionnaires and physical examinations. Even the most experienced practitioners can, however, `misclassify' zygosity in about 6% of cases [1], and molecular genetic methods are now the preferred method for establishing zygosity [2]. The term `identical'--although frequently used--is not synonymous with `monozygous' (MZ). Most MZ twins are phenotypically very similar, yet there are significant numbers of MZ pairs who are neither phenotypically nor genotypically identical. Even if one assumes a completely equal `apportioning' of genetic endowment when twinning occurs, the twin pair will only remain identical if post-zygotic genetic, post-zygotic epi-genetic and post-zygotic environmental factors affect each twin equally. Given the extent of these influences and many potential opportunities for disruption during the long and complex intrauterine development, it is perhaps surprising that so many MZ twins do turn out to be so alike. Nevertheless, it is these anomalous cases of discordant twins that have taught us much about human genetics, development and twinning in the past. 
It is likely that they will continue to do so when new technologies are applied to future research in this area. This review summarises some past findings of well established studies, and also some from more recent exploratory studies using more experimental techniques and designs. We will first consider the ante-natal environmental factors and their effects, and then the genetic factors that contribute to discordance in MZ twins. Some examples of discordancy do not necessarily fit into the above neat categories. For convenience, they have been grouped together and discussed in the final section on `discordancies of unknown origin'. 0 Timing of monozygous twinning Monozygous (MZ) twinning occurs when one single fertilised egg gives rise to two separate embryos. The timing of this division can be an important contributory factor in determining the post-zygotic discordance in MZ twins. This timing can be characterised by the differences in amniotic sac, chorionic and placental anatomical formation [3]. In principle, the earlier twinning occurs, the less the twins will share common supportive structures; and the later, the more. The extreme example of late twinning are conjoint twins who even share some somatic organs. If twinning takes place prior to the first 4 days after conception, two separate placentas and sets of membranes are formed: that is, one set for each embryo. Such twins are called dichorionic (DC) MZ twins, and they account for about one third of all MZ twins. After the `fourth' day, the progenitor cells of the placenta become separated from the inner cell mass of the embryo. As a result, for twinning occurring after this, only one single placenta will develop. 
This single monochorionic (MC) placenta serves both embryos, and in the majority of cases, contains anastomoses of blood vessels that connect the embryos. After about the eighth day, the MC MZ pair will share a common amniotic sac, in addition to the common MC placenta [4]. About 5% of MZ twins are monochorionic (MC) and monoamniotic (MA). Twinning after the second week results in the very rare phenomenon of conjoined twins (see Table 1). All MC twins are MZ by definition, and this is still the `gold standard' when defining monozygosity. Although often seen in animals, vascular communications in dichorionic placentae in man are extremely rare [5]. The combination of monochorionicity and arterioarterial anastomoses is a better proof of monozygosity than any genetic test currently available. If placentation has not already been established by ultrasound in the first trimester, it relies on placental examination by pathologists; unfortunately, this still has not become routine clinical practice in most hospitals, despite numerous pleas in the literature [6,7]. 0 Table 1 0 Chorionicity, amnionicity, frequency: dichorionic, diamniotic, one-third of monozygous twins; monochorionic, diamniotic, approximately two-thirds of monozygous twins; monochorionic, monoamniotic, five percent of monozygous twins; conjoined twins. Timing for conjoint twins is theoretical and only suggested by animal models. 0 Ante-natal environmental factors 3.1. Chorionicity, twin-twin transfusion syndrome and discordant birth weight Anastomotic connections between foetal circulations are present in around 90% of MC placentas. These anastomoses can result in the `twin to twin transfusion syndrome' (TTTS) [8]. This can result either in a chronic ante-partum transfusion or acute intrapartum transfusion. In the former event, growth discordance occurs and there are risks for both the donor and recipient.
These include the possibility of the donor becoming malnourished and growth retarded, while the recipient is at risk of cardiac hypertrophy, polycythaemia and hydramnios. In general, the mortality and morbidity rate for both twins in this situation is high without intervention [9]. The acute transfusion syndrome occurs intrapartum and causes increased mortality and morbidity, through both hypovolaemia and hypotension in one twin, and polycythaemia in the other. Even without TTTS, discordant birth weight in MZ twins remains common as a result of: (1) unequal in-utero blood supply, and hence growth; and perhaps (2) in theory, unequal division of inner cell mass at twinning. Although such differences may diminish with age, there is a growing body of evidence that significant discrepancy in birth weight may lead to long-lasting physiological changes in both twins. The concept of `foetal programming' proposes that intrauterine growth affects long-term growth and metabolism in later life. Epidemiological studies linking low birth weight with hypertension and coronary artery disease in adult life suggest that undernutrition before birth `programmes' later cardiovascular outcome [10]. Associations between `small for dates' babies with later insulin resistance and cardiovascular disease are consistent with the hypothesis that late gestation may be a window of sensitivity to nutrition in terms of its influence on later cardiovascular disease. In twins discordant for the development of non-insulin-dependent diabetes (NIDDM), birth weight has been found to be lower in the affected twin [11]. Investigators continue to use twins with discordant birth weight as a means to test the `foetal programming' hypothesis, while assuming the twin pair would share common confounding variables such as social class, genetic endowment and post-natal environments.
Two teams have recently reported the importance of birth weight in twins, independent of genetic differences, in influencing their blood pressure as adults [12]. Evidence for `foetal programming' has even been found in early infancy: in a small cohort of MZ twins, where a twin - twin transfusion had occurred, differences in arterial distensibility were found in the donor twin when compared to the recipient [13]. Appealing though the findings from twin studies may be, the extent to which they are generalisable to singleton population is un 0 Genome-wide identification of in vivo Drosophila Engrailed-binding DNA fragments and related target genes 1 Pascal Jean Solano1,*, Bruno Mugat1,*, David Martin2, Franck Girard1, Jean-Marc Huibant1, Conchita Ferraz1, Bernard Jacq2, Jacques Demaille1 and Florence Maschat1, 0 1Institut de Genetique Humaine (UPR 1142). 141 rue de la Cardonille, 34396 Montpellier, France 2Laboratoire de Genetique et Physiologie du Developpement (UMR 6545), IBDM, Parc Scientifique 0 de Luminy, 13288 Marseille, 0 Cedex 9, France 0 SUMMARY Chromatin immunoprecipitation after UV crosslinking of DNA/protein interactions was used to construct a library enriched in genomic sequences that bind to the Engrailed transcription factor in Drosophila embryos. Sequencing of the clones led to the identification of 203 Engrailed-binding fragments localized in intergenic or intronic regions. Genes lying near these fragments, which are considered as potential Engrailed target genes, are involved in different developmental pathways, such as anteroposterior patterning, muscle development, tracheal pathfinding or axon guidance. We validated this approach by in vitro and in vivo tests performed on a subset of Engrailed potential targets involved in these various pathways. Finally, we present strong evidence showing that an immunoprecipitated genomic DNA fragment corresponds to a promoter region involved in the direct regulation of frizzled2 expression by engrailed in vivo. 
0 Key words: Engrailed, Chromatin immunoprecipitation, In vivo targets, Drosophila 0 INTRODUCTION Identification of target genes that are directly regulated by transcription factors is a key issue in developmental biology, and has been the purpose of several recent studies. Indeed, the genome-wide location of DNA-binding proteins using genomic microarrays has been performed in yeast (Iyer et al., 2001; Lieb et al., 2001; Ren et al., 2000). In mammalian cells, CpG island microarrays have allowed the identification of promoter regions capable of binding to the E2F transcription factor (Weinmann et al., 2002). Recently, whole-genome microarray assays associated with bioinformatic methods have also been successfully performed to identify direct target genes of the Dorsal transcription factor in Drosophila (Markstein et al., 2002; Stathopoulos et al., 2002). Identifying the genes that are directly regulated by transcription factors, rather than merely in the downstream pathways, remains essential for understanding gene function (Liang and Biggin, 1998; Mannervik, 1999; Furlong et al., 2001; Egger et al., 2002). Homeodomain transcription factors play key roles during development by coordinating the behavior of most cells within their domains of expression (Garcia-Bellido, 1975; Lawrence and Morata, 1992), and identifying their target genes is challenging (Biggin and McGinnis, 1997). Interestingly, whereas homeodomain proteins recognize closely related binding sites, they are involved in specific genetic pathways and their absence produces very specific phenotypic effects 0 P. J. Solano and others Weinmann et al., 2001; Weinmann et al., 2002). However, UV light is believed to be more efficient in fixing proteins that are directly bound to DNA (Toth and Biggin, 2000). In the present report, we constructed a library enriched in genomic sequences that bind Engrailed protein in Drosophila embryos, by using UV crosslinking and chromatin immunoprecipitation (UV-X-ChIP). 
Systematic sequencing of the recovered clones led to the identification of 203 potential direct targets of engrailed and evidence is presented to show that some of them represent bona fide engrailed targets. MATERIALS AND METHODS 0 Tissue-Specific Gene Expression and Ecdysone-Regulated Genomic Networks in Drosophila 0 Developmental Cell 60 0 midgut, larval epidermal cells and adult epidermal progenitor cells (midgut imaginal islands), respond in opposite ways to ecdysone. The larval epidermal cells initiate the process of programmed cell death, while the imaginal cells proliferate and form the adult midgut. These diverse responses to a single hormone offer an opportunity to study tissue-specific genomic activity during a developmental process that is coordinately regulated throughout the animal. We define the complements of genes expressed during the process of metamorphosis in specific tissues. We show that computational analysis of genome-wide gene expression patterns can facilitate the identification of cis-regulatory elements and a cognate transcription factor. We also show that the network that controls metamorphosis can be extended beyond the ecdysone-regulatory cascade to include components of other well-studied signaling pathways. 0 Results Identification of Transcripts Enriched in Different Tissues and Organs Delineating networks on a genome-wide scale requires a catalog of gene expression patterns in each tissue or organ. Of particular interest are those genes that have high levels of expression in only certain tissues or times during development. We isolated five different organs and tissues from the Drosophila melanogaster Canton-S strain (Figure 1A). Samples were collected in triplicate approximately 18 hr before puparium formation (BPF), when larvae are at the end of their feeding and growing phase but have not yet begun metamorphosis (Riddiford, 1993). 
We compared RNA isolated from each organ or tissue to a common reference RNA sample taken from identically staged whole animals. The use of a linear amplification protocol enabled small amounts of sample 0 BMC Bioinformatics 0 BioMed Central 0 Open Access 0 Array-A-Lizer: A serial DNA microarray quality analyzer 1 Andreas Petri*, Jan Fleckner and Mads Wichmann Matthiessen 0 Petri et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. 0 Background: The proliferation of DNA microarray results has made it necessary to implement a uniform and quick quality control of experimental results to ensure the consistency of data across multiple experiments prior to actual data analysis. Results: Array-A-Lizer is a small and convenient stand-alone tool providing the necessary initial analysis of hybridization quality of an unlimited number of microarray experiments. The experiments are analyzed for even hybridization across the slide and between fluorescent dyes in two-color experiments in spotted DNA microarrays. Conclusions: Array-A-Lizer allows the expedient determination of the quality of multiple DNA microarray experiments allowing for a rapid initial screening of results before progressing to further data analysis. Array-A-Lizer is directed towards speed and ease-of-use allowing both the expert and non-expert microarray researcher to rapidly assess the quality of multiple microarray hybridizations. Array-A-Lizer is available from the Internet as both source code and as a binary installation package. 0 The ongoing development of DNA microarray analysis equipment has diminished both the price and workload associated with microarray experiments, leading to the generation of data at a tremendous rate.
It is not unusual for a group of researchers to be able to produce and scan 50-100 microarray slides per week. The processing of such large amounts of experimental data first requires verification of the overall quality of the experiments. Array-A-Lizer employs two tests to monitor the quality of the hybridization with respect to uniformity across the slide as well as relative intensity of the fluorescent dyes in two-color experiments: 1) spectrum analysis of the signal across the microarray slide and 2) comparison of the two dyes that are used in two-color experiments (for instance Cy3 and Cy5). 0 The Array-A-Lizer graphical user interface (GUI) is created in Borland Delphi and the statistical calculations are carried out in the R-project statistical scripting language [1]. Array-A-Lizer includes a microdistribution of the R-project and contains options for specifying the graphical output type as either bitmaps or postscript. Array-A-Lizer supports experiment files from GenePixPro and Spotfinder through an open architecture, which can be extended to include other file formats. Array-A-Lizer runs on the Microsoft Windows platform. 0 BMC Bioinformatics 2004, 5 0 Results and discussion 0 Array-A-Lizer is an application for rapid quality control of large DNA microarray experiments. The program consists of a collection of scripts that are contained and accessed through a GUI to ease their use (figure 1). The main advantage of the program is the rapid processing of an unlimited number of experiments. Array-A-Lizer generates reports with a graphical analysis of each experiment, providing the researcher with a rapid survey of the quality of experiments (figures 2 and 3). Additionally, the program returns an overview of the results in the system browser with hyperlinks to each analysis report (figure 4). Array-A-Lizer facilitates the generation of several plots that detail the quality of the experiments.
Two different analysis modes can be chosen, resulting in either a set of diagnostic plots or a spatial representation of the data. In comparison to existing analysis packages, Array-A-Lizer is both quick and easy to use. It is a stand-alone application that can be installed on any desktop computer running MS Windows. It is intended for easy visualization of microarray data allowing both the expert and non-expert microarray researcher to assess the quality of multiple microarray hybridizations. 0 Diagnostic report In this mode, the experimental data are used to generate several diagnostic plots (figure 2) as well as statistics on the identified spots. The Array-A-Lizer diagnostic report includes both MvA plots (figure 2A left) [2] and red/green scatter plots (figure 2A right), both of which show spot intensities after local background subtraction. MvA plots display the log intensity ratio $M = \log_2(R/G)$ versus the mean log intensity $A = \log_2\sqrt{RG}$. This plot type is widely used to visualize array data because it directly displays the red to green ratios, which are often the quantities of interest in most experiments. Furthermore, MvA plots make it easy to identify intensity-dependent biases in the data (i.e. curvature or 'banana shape'). In scatter plots, the intensities from the green channel are plotted against the red channel after log2 transformation. Genes displaying a difference in signal intensities in the two channels are plotted off the diagonal and genes showing similar intensities are plotted close to the diagonal. A common source of variation in microarray data acquisition is incorrectly balanced photomultiplier tube (PMT) settings during scanning.
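The two MvA quantities defined above can be computed directly from background-subtracted intensities. The following Python sketch is not Array-A-Lizer's own code (the program is described as using Delphi and R); the function names and the example intensities are assumptions made for illustration.

```python
import math
from statistics import median

def mva(red, green):
    """Return (M, A) for one spot from background-subtracted intensities."""
    m = math.log2(red / green)          # log intensity ratio, M = log2(R/G)
    a = 0.5 * math.log2(red * green)    # mean log intensity, A = log2(sqrt(R*G))
    return m, a

def mva_table(spots):
    """Compute (M, A) for every spot with positive intensity in both channels.

    Spots where background exceeded foreground (non-positive values) have no
    defined logarithm and are skipped here.
    """
    return [mva(r, g) for r, g in spots if r > 0 and g > 0]

# Hypothetical (red, green) intensities; the last spot is a "negative value".
spots = [(1000.0, 1000.0), (4000.0, 1000.0), (500.0, 2000.0), (-20.0, 300.0)]
points = mva_table(spots)
median_m = median(m for m, _ in points)
```

For balanced channels the M values scatter around 0; a median M well away from 0 would point to a channel imbalance such as the PMT mismatch just mentioned.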
This results in overall differences in signal intensities obtained from either channel and a shift of the data from the x-axis ($M = 0$) or the diagonal (red = green) of the ideal MvA and scatterplot respectively (figure 2B). Finally, the diagnostic analysis generates histograms of the log2 transformed data for comparison of the distribution of intensities between the two channels. The histograms display the signal intensities across the slide (figure 2C). Overamplified channels (PMT levels are set too high) will result in many saturated spots, which is revealed as an overrepresentation of high intensity values (figure 2D). The diagnostic report includes information on which files were used for the analysis, the number of saturated spots, and the number of negative values, i.e. the number of spots where the background intensity was higher than the foreground intensity. 0 Spatial report The spatial analysis results in a graphical representation of microarray data according to the location on the slide (figure 3). From each channel, three different plots are generated showing the log2 transformed foreground intensities, the background intensities, and a plot showing the location of negative values (background higher than foreground). This analysis method can be used to identify spatial effects on the hybridized arrays such as fading or illumination at the edges due to cover-slip effects (figure 3A and 3B) or scratches and artifacts resulting from inadequate washing of slides (figure 3C and 3D). The cut-off values on the background plot can be set from the GUI prior to starting the analysis.
Keeping these limits fixed will allow easy detection of pronounced fluctuations in background intensities both between and within slides. 0 With the reduced cost and labor of DNA m 0 TECHNICAL REPORTS 0 CA). Touchdown PCR amplifications were performed as recommended18. Cycle sequencing protocols were used with ABI sequencers at the Hutchinson Center Biotechnology Facility. DHPLC. Mutation detection was performed using the Transgenomic WAVE system. Following PCR amplification, the Pfu polymerase was inactivated, and the DNA samples were heated and cooled to form heteroduplexes18. For most fragments, the predicted WAVE (v.3.5) melting temperatures and separation gradients were used19. 0 We thank Bruce Draper for helpful discussions. This work was supported by grant RO1 GM29009 (to S.H.) from the National Institutes of Health. S.H. is an investigator of the Howard Hughes Medical Foundation, which also provided support for Karen Wolfe of the James Roberts lab, whom we thank for helping us with the screen. 0 High-fidelity mRNA amplification for gene profiling 1 Ena Wang1,3, Lance D. Miller2,3, Galen A. Ohnmacht1, Edison T. Liu2, and Francesco M. Marincola1* 0 TECHNICAL REPORTS 0 QUANTITATIVE TRAIT LOCI IN DROSOPHILA 1 Trudy F. C. Mackay 0 Phenotypic variation for quantitative traits results from the simultaneous segregation of alleles at multiple quantitative trait loci. Understanding the genetic architecture of quantitative traits begins with mapping quantitative trait loci to broad genomic regions and ends with the molecular definition of quantitative trait loci alleles. This has been accomplished for some quantitative trait loci in Drosophila. Drosophila quantitative trait loci have sex-, environment- and genotype-specific effects, and are often associated with molecular polymorphisms in non-coding regions of candidate genes. These observations offer valuable lessons to those seeking to understand quantitative traits in other organisms, including humans.
0 Transfer of genetic material from one strain to another by repeated backcrosses. With marker-assisted introgression, markers that distinguish the parental strains are used to track the desired interval and select against the undesired genotype. 0 The ease with which Mendelian and quantitative traits give up their genetic secrets is inversely proportional to the relative importance of the two classes of trait for human health, agriculture, evolution and even functional genomics. Although devastating to the possessor, highly deleterious alleles that cause inborn errors of metabolism and other single gene disorders are rare in the general population. By contrast, susceptibility to common diseases such as atherosclerosis, arthritis, diabetes, hypertension and schizophrenia is affected by multiple genetic factors and by the environment. These diseases are therefore quantitative traits (FIG. 1), and affect a large proportion of the human population. Similarly, individuals vary quantitatively in their response to drug therapy. There is great excitement in the human genetics community and the pharmaceutical industry that susceptibility loci for common diseases and individual variation in drug response can be identified and the molecular basis for this variation determined. This knowledge will herald a new era of personalized medicine in which environment-specific risk factors for common diseases are assessed for individual genotypes (and hopefully avoided by the patient) and pharmaceutical treatment is genotype dependent. Similar arguments apply to the agriculture industry, in which most characters of economic importance in domestic animal and crop species are quantitative. There is a long history of success in improving productivity traits 0 by selective breeding for favourable phenotypes. 
Knowledge of the allelic status at each locus affecting these traits will greatly facilitate this process, and will enable INTROGRESSION of favourable alleles from other strains, while simultaneously eliminating deleterious alleles. Variation for quantitative traits is the raw material on which the forces of evolution act to produce phenotypic diversity and adaptation. Major research efforts in evolutionary quantitative genetics are aiming to determine how genetic variation for adaptive quantitative traits is maintained in natural populations; whether the loci at which variation occurs within a population are the same as those that cause divergence between populations and species; and how the answers to these questions depend on the relationship of the trait to the ultimate quantitative trait -- reproductive fitness. So a comprehensive understanding of the evolutionary process is contingent on a detailed description of the molecular genetic basis of variation for quantitative traits in natural populations. The complete genome sequences of the yeast Saccharomyces cerevisiae1, the nematode Caenorhabditis elegans2 and the fruitfly Drosophila melanogaster3 reveal that a large fraction of these genomes is uncharted phenotypic territory. In Drosophila, for example, only 2,500 of the 13,600 genes and predicted genes (18%) have been characterized by classic genetic and molecular methods3. An important challenge for the future is to devise ways of determining the phenotypic effects of 0 NATURE REVIEWS | GENETICS 0 Macmillan Magazines Ltd 0 A1A1 Phenotype Phenotype A1A2 A2A2 A1A1 A1A2 A2A2 Phenotype Frequency A1A1 A1A2 A2A2 0 Phenotypic value 0 No GEI Parallel reaction norms 0 GEI Reaction norms cross 0 GEI Change of variance 0 ANTAGONISTIC PLEIOTROPY 0 Alternative homozygous genotypes (A1A1, A2A2) have opposite phenotypic effects under different conditions. 
0 CONDITIONAL NEUTRALITY 0 The difference between quantitative trait loci genotypes is only expressed under some conditions. 0 A statistic to quantify dispersion about the mean. In quantitative genetics, the phenotypic variance, VP, is the observed variation of the trait in a population. VP is partitioned into components due to variation in the additive (VA), dominance (VD) and epistatic (VI) genetic variance, the variance attributable to the environment (VE), and gene-environment correlations and interactions. 0 uncharacterized and predicted genes. Conventional screens for mutations with large phenotypic effects can lead to the identification of function for a biased sample of genes -- mutating one gene in a pathway in which there is functional redundancy might not cause a major effect on the phenotype. Furthermore, homozygous lethal mutations define loci that are essential for viability, but less severe mutations at these loci may have unknown and unexpected pleiotropic effects on morphology, physiology and behaviour. So, genetic screens for mutations with subtle, quantitative effects and genetic analysis of naturally occurring variation for quantitative traits will be important components of the functional genomics tool kit. Until very recently, the genetic basis of variation for quantitative traits was inferred solely from statistical estimates of correlations between relatives, response to artificial selection and changes of mean and VARIANCE of the trait on inbreeding and crossing⁴,⁵. To reap the benefits of a thorough understanding of quantitative traits, we must lift this statistical fog⁶ and describe quantitative genetic variation in terms of complex genetics (FIG. 1). Specifically, a full understanding of the genetic architecture of a quantitative trait will require answers to the following questions. What are the loci at which mutational variation affecting the trait occurs? What are the spontaneous mutation rates at these loci?
What loci affect naturally occurring variation within and between populations of a single species, and between species? What are the homozygous and heterozygous effects of alleles at these loci? Are the effects of the individual loci on the final phenotype independent (additive), or is the effect of multiple loci on the phenotype nonlinear (epistasis)? What is the effect of quantitative trait locus (QTL) alleles on multiple quantitative traits, including reproductive fitness (pleiotropy)? How do the homozygous, heterozygous, epistatic and pleiotropic QTL effects vary between the sexes and in a range of ecologically relevant environments? What defines a QTL allele at the molecular level? What are QTL allele frequencies within and between populations? At present, detailed genetic dissection of quantitative traits is most feasible in genetically tractable and well-characterized model systems. Drosophila melanogaster is one of the model organisms that provides us with all the tools necessary for identifying QTL and characterizing them at the molecular level⁷ (FIG. 2). Over eight decades of research on this organism have provided us with a library of stocks that bear mutations at single loci and deficiency chromosomes that cover around 70% of the genome. The P transposable element has been harnessed as a transformation vector and modified for efficient insertional mutagenesis, analysis of tissue-specific expression patterns, general and targeted overexpression, and, most recently, homologous rec 0 review 0 In control: systematic assessment of microarray performance 1 Harm van Bakel & Frank C.P. Holstege+ 0 Expression profiling using DNA microarrays is a powerful technique that is widely used in the life sciences. How reliable are microarray-derived measurements? The assessment of performance is challenging because of the complicated nature of microarray experiments and the many different technology platforms.
There is a mounting call for standards to be introduced, and this review addresses some of the issues that are involved. Two important characteristics of performance are accuracy and precision. The assessment of these factors can be either for the purpose of technology optimization or for the evaluation of individual microarray hybridizations. Microarray performance has been evaluated by at least four approaches in the past. Here, we argue that external RNA controls offer the most versatile system for determining performance and describe how such standards could be implemented. Other uses of external controls are discussed, along with the importance of probe sequence availability and the quantification of labelled material. Keywords: expression profiling; external controls; microarray; performance; quality; spikes 0 DNA microarrays are universal tools that can be applied throughout the life sciences (Brown & Botstein, 1999; Lockhart & Winzeler, 2000; Young, 2000). mRNA-expression profiling is the most frequent application. Such microarray hybridizations determine changes in mRNA levels between two samples or result in an absolute quantification that is correlated to mRNA levels. How reliable are these measurements? Given the widespread interest, it is surprising that there have been relatively few systematic analyses of microarray performance. One reason for this lack of assessment is the complicated nature of microarray technology; there is no single `microarray technology', but rather a collection of different technology platforms. Established platforms include Affymetrix GeneChips (Santa Clara, CA, USA), PCR-product-based cDNA arrays and long oligomer arrays that are manufactured in-house or by Agilent (Palo Alto, CA, USA). New platforms are still being introduced, such as the Illumina Beadarray 0 (San Diego, CA, USA; Fan et al, 2004) or the Universal Hexamer Array from Agilix (New Haven, CT, USA; Roth et al, 2004). 
To complicate matters further, many technical alternatives are possible within each platform for each of the numerous steps between sample preparation and data analysis. These include diverse methods of generating labelled material, various hybridization conditions, different microarray scanners and settings, a range of image-quantification techniques, and several approaches for determining statistically and biologically significant differential gene expression. Microarray technology is therefore an amalgamation of many different techniques, even within individual technology platforms. This complexity makes the need for comparing performance even stronger, whilst confounding such comparisons. Determining reliability is a complicated undertaking if all aspects are to be assessed in a non-arbitrary way across the different platforms and their variants. In addition, reliability is a sensitive issue for those groups that provide the technology. Finally, not every application requires reliable estimates of mRNA level changes. This should be interpreted as an indication of the power of microarray technology, as even lower quality data can yield important results. Improved performance would nevertheless benefit all applications. A high degree of reliability is a requirement if certain fields, such as systems biology (Ideker et al, 2001) or diagnostic mRNA-expression profiling (van de Vijver et al, 2002) are to mature. A strong argument can be made for investigating how the technology can be systematically assessed, given its increased usage, the costs that are involved and the fact that the aim is to determine the mRNA levels of all genes, including those that are expressed at nearly zero levels. Here, we describe approaches for determining microarray performance and propose that the use of external control RNAs is a versatile and robust method for achieving this goal. 0 Accuracy and precision 0 Which performance parameters should be assessed?
The two main characteristics of data quality are accuracy and precision. Whereas accuracy refers to how close a measurement is to the real value, precision indicates how often a measurement yields the same result (Fig 1). When microarray data are discussed, the focus is often on precision; that is, reproducibility rather than accuracy. Reproducibility is easier to assess, by taking repeated measurements. Previous reviews have discussed the pitfalls that are involved in determining reproducibility, such as the confusion between 0 EUROPEAN MOLECULAR BIOLOGY ORGANIZATION 0 Controlling microarray performance H. van Bakel & F.C.P. Holstege 0 [Fig 1 panels; recoverable labels: 'Measured mean', 'Real value'.] 0 mized. Confounding artefacts are still being uncovered (Diehl et al, 2001; Ramdas et al, 2001; Chuaqui et al, 2002; Fare et al, 2003; Martinez et al, 2003; Raghavachari et al, 2003; 't Hoen et al, 2003; Lyng et al, 2004). Therefore, monitoring quality would benefit individual hybridizations and projects. This could also aid in analyses of the data that are now being collected in public databases (Edgar et al, 2002; Brazma et al, 2003). In these cases, internal quality control would allow the refinement of decisions about which data to use, depending on the requirement for different quality parameters. 0 Approaches to determining performance 0 One method that can be used to optimize protocols is to measure and increase the signal intensity (Rickman et al, 2003; Wrobel et al, 2003). The underlying assumption is that increased signal-to-noise ratios will yield better quality hybridizations. However, an increase in signal might be aspecific; for example, owing to increased cross-hybridization or the nonspecific binding of fluorophores to nucleic-acid probes (Chuaqui et al, 2002). It is therefore risky to optimize signal-to-noise ratios without knowing whether specificity is being maintained.
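The distinction drawn above between accuracy and precision can be made concrete with a small calculation. The following is only an illustrative sketch: the function name `accuracy_and_precision` and the use of the sample standard deviation as the precision summary are assumptions made here, not quantities defined in the review.

```python
import statistics


def accuracy_and_precision(measurements, real_value):
    """Summarize repeated measurements of one quantity.

    Returns (bias, spread):
      bias   = measured mean minus the real value (small |bias| = accurate)
      spread = sample standard deviation of the replicates (small = precise)
    Both summaries are illustrative choices, not taken from the source.
    """
    measured_mean = statistics.fmean(measurements)
    bias = measured_mean - real_value
    spread = statistics.stdev(measurements)
    return bias, spread


if __name__ == "__main__":
    # Hypothetical replicate log-ratios for a transcript whose real log-ratio is 1.0.
    tight_but_wrong = [2.0, 2.1, 1.9]    # precise, not accurate
    right_but_noisy = [0.2, 1.8, 1.0]    # accurate on average, not precise
    print(accuracy_and_precision(tight_but_wrong, 1.0))
    print(accuracy_and_precision(right_but_noisy, 1.0))
```

Repeated measurements alone give only the second number, which is why reproducibility is the easier property to report; the first number requires knowing the real value, for example from a control of known concentration.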
A second approach is to determine the correlation between new methods and an approach that is already in use. Different amplification and labelling techniques are usually assessed by comparison to a standard cDNA-synthesis protocol (Mahadevappa & Warrington, 1999; Manduchi et al, 2002; Gupta et al, 2003; t Hoen et al, 2003; Kenzelmann et al, 2004). A correlation coefficient only shows how similarly two protocols behave; it does not give information on their individual accuracy. A high correlation (Barczak et al, 2003) might therefore mean that the technologies that are being compared both suffer from the same error. Moreover, a low correlation (Tan et al, 2003) still begs the question of which technique is better. Another use of correlation is to monitor reproducibility; for example, between the two dye channels of cDNA arrays. The drawback is that the technology is being optimized for yielding identical intensities, rather than for accurately reporting what most users are interested in: differences in mRNA levels. Perfectly tight same-versus-same scatter plots, which are often touted in publications or advertisements as proof of superior performance, should be treated with caution. Optimization that is based on achieving tight scatter plots can lead to a decreased ability to report changes in mRNA levels. Ideally, optimization should focus on reporting relative or absolute mRNA levels and should take into account the entire range of expression levels. A third method for performance evaluation is to use an established cell-culture experiment in which changes in mRNA levels are verified by other means, such as northern blotting analysis or quantitative reverse transcription (RT)-PCR (Taniguchi et al, 2001; Yuen et al, 2002; Polacek et al, 2003; Loguinov et al, 2004; Roth et al, 2004). Using such established differentials is a good method because it optimizes the reporting of differences in expression, which is the goal of most microarray hybridizations. 
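The caution about correlation raised above, that two protocols can agree closely while sharing the same error, can be illustrated with a toy simulation. This is a sketch under stated assumptions: the "shared error" is modelled as a compression of all log-ratios towards zero by a hypothetical factor `compression`, and every parameter value is invented for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_genes=2000, compression=0.5, noise_sd=0.1, seed=0):
    """Two protocols that share one systematic error (ratio compression).

    Returns (correlation between the protocols,
             mean absolute error of protocol A against the true log-ratios).
    """
    rng = random.Random(seed)
    true_lr = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
    prot_a = [compression * t + rng.gauss(0.0, noise_sd) for t in true_lr]
    prot_b = [compression * t + rng.gauss(0.0, noise_sd) for t in true_lr]
    r_ab = pearson(prot_a, prot_b)
    mean_abs_err = statistics.fmean(abs(a - t) for a, t in zip(prot_a, true_lr))
    return r_ab, mean_abs_err


if __name__ == "__main__":
    r_ab, err = simulate()
    print(f"correlation between protocols: {r_ab:.3f}")
    print(f"mean absolute error vs. truth: {err:.3f}")
```

In this toy setting the two protocols correlate strongly with each other even though each under-reports every change by half, which is the situation a correlation coefficient alone cannot detect.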
One disadvantage is that verification and optimization are driven by the differences that are reported by the microarrays, rather than by all of the mRNA-level differences that are present in the experimental system. There is no test for false-negative differentials unless RT-PCR, for example, is carried out on many hundreds of genes that are not reported as being differentially expressed in the microarray experiment. A further drawback is that this method, similar to those described above, does not lend itself to the routine assessment of each individual microarray hybridization before optimization. 0 Genome-Wide Location and Function of DNA Binding Proteins 1 Bing Ren,1* François Robert,1* John J. Wyrick,1,2* Oscar Aparicio,2,4 Ezra G. Jennings,1,2 Itamar Simon,1 Julia Zeitlinger,1 Jörg Schreiber,1 Nancy Hannett,1 Elenita Kanin,1 Thomas L. Volkert,1 Christopher J. Wilson,5 Stephen P. Bell,2,3 Richard A. Young1,2 0 Understanding how DNA binding proteins control global gene expression and chromosomal maintenance requires knowledge of the chromosomal locations at which these proteins function in vivo. We developed a microarray method that reveals the genome-wide location of DNA-bound proteins and used this method to monitor binding of gene-specific transcription activators in yeast. A combination of location and expression profiles was used to identify genes whose expression is directly controlled by Gal4 and Ste12 as cells respond to changes in carbon source and mating pheromone, respectively. The results identify pathways that are coordinately regulated by each of the two activators and reveal previously unknown functions for Gal4 and Ste12. Genome-wide location analysis will facilitate investigation of gene regulatory networks, gene function, and genome maintenance. Many proteins bind to specific sites in the genome to regulate genome expression and maintenance.
Transcriptional activators, for example, bind to specific promoter sequences and recruit chromatin modifying complexes and the transcription apparatus to initiate RNA synthesis (1-3). The reprogramming of gene expression that occurs as cells move through the cell cycle, or when cells sense changes in their environment, is effected in part by changes in the DNA binding status of transcriptional activators. Distinct DNA binding proteins are also associated with origins of DNA replication, centromeres, telomeres, and other sites, where they regulate chromosome replication, condensation, cohesion, and other aspects of genome maintenance (4, 5). Our understanding of these proteins and their functions is limited by our knowledge of their binding sites in the genome. The genome-wide location analysis method we have developed allows protein-DNA interactions to be monitored across the entire yeast genome (6). The method combines a modified chromatin immunoprecipitation (ChIP) procedure, which has been previously used to study protein-DNA interactions at a small number of specific DNA sites (7), with DNA microarray analysis. Briefly, cells were fixed with formaldehyde, harvested, and disrupted by sonication. The DNA fragments cross-linked to a protein of interest were enriched by immunoprecipitation with a specific antibody. After reversal of the cross-links, the enriched DNA was amplified and labeled with a fluorescent dye (Cy5) with the use of ligation-mediated-polymerase chain reaction (LM-PCR). A sample of DNA that was not enriched by immunoprecipitation was subjected to LM-PCR in the presence of a different fluorophore (Cy3), and both immunoprecipitation (IP)-enriched and -unenriched pools of labeled DNA were hybridized to a single DNA microarray containing all yeast intergenic sequences (Fig. 1). A single-array error model (8) was adopted to handle noise associated with low-intensity spots and to permit a confidence estimate for binding (P value). When independent samples of 1 ng of genomic DNA were amplified with the LM-PCR method, signals for greater than 99.8% of genes were essentially identical within the error range (P value 10⁻³). The IP-enriched/unenriched ratio of fluorescence intensity obtained from three independent experiments was used with a weighted average analysis method to calculate the relative binding of the protein of interest to each sequence represented on the array.
To investigate the accuracy of the genome-wide location analysis method, we used it to identify sites bound by the transcriptional activator Gal4 in the yeast genome. Gal4 activates genes necessary for galactose metabolism and is among the best characterized transcriptional activators (1, 9). We found 10 genes to be bound by Gal4 (P value 0.001) and induced in galactose using our analysis criteria (Fig. 2A). These included seven genes previously reported to be regulated by Gal4 (GAL1, GAL2, GAL3, GAL7, GAL10, GAL80, and GCY1). The MTH1, PCL10, and FUR4 genes were also bound by Gal4 and activated in galactose. Each of these results was confirmed by conventional ChIP analysis (Fig. 2B) (6), and MTH1, PCL10, and FUR4 activation in galactose was found to be dependent on Gal4 (Fig. 2C). Both microarray and conventional ChIP showed that Gal4 binds to GAL1, GAL2, GAL3, and GAL10 promoters under glucose and galactose conditions, but the binding was generally weaker in glucose (6). The consensus Gal4 binding sequence that occurs in the promoters of these genes (CGGN₁₁CCG) can also be found at many sites through the yeast genome where Gal4 binding is not detected; therefore, sequence alone is not sufficient to account for the specificity of Gal4 binding in vivo. Previous studies of Gal4-DNA binding have suggested that additional factors such as chromatin structure contribute to specificity in vivo (10, 11).
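The consensus quoted above, CGG followed by any 11 bases and then CCG, is simple to scan for, which makes the point of the paragraph easy to test: a scan lists candidate sites, not bound sites. The sketch below is illustrative only; the function name `find_gal4_sites` and the example sequence are hypothetical and are not taken from the paper. Because the reverse complement of CGG-N11-CCG has the same CGG-N11-CCG form, scanning one strand is sufficient.

```python
import re

# CGG, then any 11 bases, then CCG (17 bp in total). The lookahead lets
# overlapping candidate sites be reported as separate matches.
GAL4_CONSENSUS = re.compile(r"(?=(CGG[ACGT]{11}CCG))")


def find_gal4_sites(sequence):
    """Return (0-based start, matched 17-mer) for each consensus match."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in GAL4_CONSENSUS.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical promoter fragment containing one candidate site.
    fragment = "tt" + "CGG" + "A" * 11 + "CCG" + "tt"
    for start, site in find_gal4_sites(fragment):
        print(start, site)
```

A match from such a scan would still need location data of the kind described above before being called a Gal4 target, since the text reports many consensus matches at which no binding is detected.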
The identification of MTH1, PCL10, and FUR4 as Gal4-regulated genes reveals previously unknown functions for Gal4 and explains how regulation of several different metabolic pathways can be coordinated (Fig. 2D). MTH1 encodes a transcriptional repressor of certain HXT genes involved in hexose transport (12). Our results suggest that the cell responds to galactose by increasing the concentration of its galactose transporter at the expense of other transporters. In other words, while Gal4 activates expression of the galactose transporter gene GAL2, Gal4 induction of the MTH1 repressor gene leads to reduced levels of glucose transporter expression. The Pcl10 cyclin associates with Pho85p and appears to repress the formation of glycogen (13). Thus, the observation that PCL10 is Gal4-activated suggests that reduced glycogenesis occurs to maximize the energy obtained from galactose metabolism. FUR4 encodes a uracil permease (14), and its induction by Gal4 may reflect a need to increase intracellular pools of pyrimidines to permit efficient uridine 5′-diphosphate (UDP) addition to galactose catalyzed by Gal7. We next investigated the genome-wide binding profile of the transcription activator Ste12, which functions in the response of haploid yeast to mating pheromones (15). Activation of the pheromone-response pathway by mating pheromones causes cell cycle arrest and transcriptional activation of more than 200 genes in a Ste12-dependent fashion (8, 15). However, it is not clear which of these genes are directly regulated by Ste12 and which are regulated by other ancillary factors. The genome-wide binding profile of epitope-tagged Ste12, determined before and after pheromone treatment in three independent experiments, indicates that 29 pheromone-induced genes are regulated directly by Ste12. Figure 3A lists the yeast genes whose promoter regions are bound by Ste12 at the 99.5% confidence level (i.e., P value 0.005) and whose expression is induced by α factor.
These 29 genes are likely to be directly regulated by Ste12 because (i) all have promoter regions bound by Ste12, (ii) exposure to pheromone causes an increase in their transcription, and (iii) pheromone induction of transcription is dependent on Ste12. Of the genes that are directly regulated by Ste12, 11 are already known to participate in various steps of the mating process (Fig. 3B). FUS3 and STE12 encode components of the signal transduction pathway involved in the response to pheromone (16); AFR1 and GIC2 are required for the formation of mating projections (17-19); FIG2, AGA1, FIG1, and FUS1 are involved in cell fusion (20-23); and CIK1 0 The End of the Microarray Tower of Babel: Will Universal Standards Lead the Way? 1 Ernest S. Kawasaki 0 NCI Advanced Technology Center, Bethesda, MD 0 A PROLIFERATION OF MICROARRAY PLATFORMS AND ASSOCIATED TECHNOLOGIES 0 Table 1 gives a list of sources for obtaining whole genome arrays, which are defined as arrays that have approximately the entire gene complement of the genome represented on one slide or chip. You will note that there are large differences in the size of the probes, the number of probe sets, and the total number of probes per array. This and many other technological differences found in these platforms will be enumerated, with pointers as to how or why these differences can cause discordant results between platforms. The nomenclature convention followed here is that the "probe" is the gene sequence arrayed on the chip, and the "target" is the RNA sequence to be labeled and hybridized to the probes. 0 The probe size, number of probe sets and the total number of probes per array are indicated. 0 Probe manufacture.
The probes for the arrays may be made in situ by photolithographic or ink-jet methods, or by standard oligonucleotide synthesis protocols followed by attachment to various substrates.³ Because the methods are so varied, it is difficult to estimate the purity of the probes or their true sizes, and large differences in these parameters can have a great influence on signal intensi- 0 A decade of microarray publications. The number of publications per year derived from PubMed using the terms "microarray" or "microarrays" is shown. 0 for detecting mRNAs of low abundance than the long probe arrays. Thus, probe size can be a confounding factor when comparing the same genes across many platforms (Table 1). Probe element size and concentration. The element or spot size diameters range from 11 microns to ~200 microns in the different platforms. The size of the array elements (spots), their size in µm², and concentration in the number of molecules per spot are given in Table 2. There is also a large difference in the number of probe molecules per spot, with estimates from several million to hundreds of millions of molecules. This can heavily influence the kinetics of hybridization, signal quantification, and signal intensities of the probes, and these important factors will vary from platform to platform. Probe number per array. The number of probe sets may vary from 30,000 to 54,000, but the total number of probes per array actually ranges from about 30,000 to greater than 500,000 (Table 2). Microarrays may contain one probe per gene or up to twenty probes per gene. This fact alone can make it difficult to directly compare the data from platforms with such a wide range of the number of probes per mRNA sequence. Proper probe annotation. This is an intense area of investigation.⁶⁻⁸ The sequence databases for expressed genes are still in a state of flux, such that probe sequences derived from older databases may be dramatically different from the latest version.
It has been found that some probe sequences no longer exist in the database, or were not annotated properly and now have different IDs or names. Thus, platforms may have probe sequences that do not exist in the genome or have the incorrect designation, and this has been an important source of confusion in the analysis of array data. Target preparation. There is no standard way of isolating RNA for target labeling, although almost all microarray experimentalists follow the rule of analyzing the integrity of their RNA samples before beginning labeling steps. Many expression profiling experiments in the past were uninterpretable simply because of poor RNA quality. A common method to test RNA integrity is through the use of an Agilent 2100 Bioanalyzer, which provides an electrophoretic tracing and a RNA integrity number (RIN) for judging RNA quality.9 Target synthesis. Targets are commonly synthesized via cDNA reactions on total RNA or by in vitro synthesis of linearly amplified RNA using T7 RNA polymerase technologies.10 The cDNA targets are thought to faithfully represent the original concentrations of the mRNA in the sample, but linearly amplif 0 BIOINFORMATICS APPLICATIONS NOTE 0 arrayMagic: two-colour cDNA microarray quality control and preprocessing 1 Andreas Buness, Wolfgang Huber, Klaus Steiner, Holger Sueltmann and Annemarie Poustka 0 that can at any time be re-run or extended. The compendium technology (Gentleman, 2004) can be used to produce distributable objects containing the data as well as revivable documents reporting the processing. We aimed to integrate normalization methods, quality scores and visualizations that had been reported previously. In addition, we provide tools for dealing with different microarray layouts within one experiment and for merging data from replicate probes or hybridizations. The researcher obtains an instant overview on the quality of the experiment. 
0 Normalization strategies for two-colour microarrays can be divided into two groups: adjustment of the colour channels or of the log-ratios. Moreover, depending on the experimental design and the objectives, either a single-channel intensity or a log-ratio-based analysis might be more appropriate. The tool offers log-ratio-based normalization by means of the loess method (Yang et al., 2002) and direct intensity-based normalization by means of vsn (Huber et al., 2002) and quantile normalization (Bolstad et al., 2003) methods. We will also use the terms 'log-ratios' and 'log-transformed intensities' for the data resulting from the vsn method. Groups of hybridizations, subsets of spots, e.g. by grid, print-tip or PCR plate, as well as colour channels can be normalized separately. Plots characterizing the distributions of the log-ratios and colour channels before and after normalization were generated (Fig. 1b). 0 Two-colour cDNA microarray technology has evolved into a routine laboratory procedure. Our motivation in implementing arrayMagic was to deal with the large amount of data generated by microarray projects in an efficient, reliable and reproducible manner. We focused on preprocessing and quality assurance, leaving out high-level analysis, which has to be addressed specifically. The main design goal was to allow for the rapid construction of customized quality assessment and control (QA/QC) and preprocessing pipelines for such projects from a small set of building blocks. arrayMagic bridges the gap between the image quantification software and subsequent statistical and explorative analyses like testing for differential expression or classification.
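Of the normalization methods listed above, quantile normalization is the simplest to state: every array or channel is forced to share one intensity distribution, namely the rank-wise mean across arrays. The following is a generic sketch of that idea and is not arrayMagic's code (the package itself runs in R); the function name `quantile_normalize` is hypothetical, and ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity vectors.

    arrays: list of lists, one per hybridization or colour channel.
    Returns new lists in which the k-th smallest value of every array is
    replaced by the mean of the k-th smallest values across all arrays.
    """
    n_spots = len(arrays[0])
    if any(len(a) != n_spots for a in arrays):
        raise ValueError("all arrays must contain the same number of spots")

    # Reference distribution: mean of the sorted values at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_spots)
    ]

    normalized = []
    for a in arrays:
        order = sorted(range(n_spots), key=lambda i: a[i])  # spot indices by rank
        out = [0.0] * n_spots
        for rank, spot in enumerate(order):
            out[spot] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After this step all arrays have identical value distributions by construction, so diagnostic plots of the distributions are informative mainly before normalization.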
It simplifies the task of building processing pipelines that are reproducible, which means that even for idiosyncratic experimental designs and non-trivial combinations and selections of the data the whole procedure from raw data to normalized, quality-controlled, annotated and summarized data is documented in a not too verbose script 0 QUALITY CONTROL AND ASSESSMENT 0 Quality-assured data are a prerequisite for any reliable high-level analysis. In addition, quality control allows the laboratory procedures to be monitored and improved. The quality of hybridizations is best assessed in the context of normalization. In a model-based approach like vsn, the model is a summary of past experience and our expectations on the data. Thus, it can be used to identify hybridizations or groups of measurements that do not fit. Other methods like loess or quantile normalization place more emphasis on making the data conform in any situation. In these cases, statistics of the data distribution can be calculated (e.g. location and scale of the distribution of normalized log-ratios) and compared against expectations. Moreover, as long as the majority of the data are assumed to be acceptable, outlier detection methods can be used for quality control. Visual inspection of the data is supported by spatial false-colour representations of foreground and background intensities and the log-ratios. This allows scratches and artefacts to be detected (Fig. 1a). Most notably, the spatial plots of the normalized data are useful for assessing the necessity of background correction and for assuring spatial homogeneity of the data. Several quality scores are calculated, stored in a report file and are visualized in part. These scores include spot replicate concordance, the correlation of the two colour channels and a robust measure of noise W for each hybridization. W is defined as the median absolute deviation of the normalized log-ratios q_i, i.e.
$W = \mathrm{mad}_i(q_i) = \mathrm{median}_i\left(|q_i - \mathrm{median}_j(q_j)|\right)$. A minority of differentially expressed genes should not disturb $W$. We do not find it practical to define universally applicable thresholds on quality scores. They should be evaluated not on the level of a single hybridization, but in the context of all data in the experiment. In our experience this has been very useful in detecting outliers in large-scale experiments. In particular, a global view on all pairwise similarities between all hybridizations shown in Figure 1c has proved to be useful. For two arrays a and b, we define a similarity score $S_{ab} = \mathrm{mad}_i(x_{ia} - x_{ib})$, where $x_{ia}$ can be the log-ratio of the i-th probe on the a-th array, or the log-transformed normalized intensity of an individual colour channel. Especially in the case of biologically related samples, this is an informative measure of similarity. 0 The open source software tool arrayMagic facilitates the analysis of two-colour cDNA microarray data. It aims to provide quality-assured and normalized data. The script-based pipeline supports reproducible batch-like processing. The workflow starts with quantified image scan result files. Several quality scores and diagnostics are calculated and visualized, which offer a broad view. The processed data can be exported as an HTML file or as a tab-delimited file with spot and sample annotation and may serve as input for follow-up analysis in commonly used tools of choice. Naturally, high-level follow-up analysis in the framework of R and Bioconductor is supported by adequate representation of the data. Documentation of all functionality and a step-by-step example following a typical workflow is part of the package. 0 Gentleman,R. (2004) Reproducible research: a bioinformatics case study. Stat. Appl. Genet. Mol. Biol., 3. Gentleman,R., Carey,V.J., Bates,D.J., Bolstad,B.M., Dettling,M., Dudoit,S., Ellis,B., Gautier,L., Ge,Y., Gentry,J. et al.
(2004) Bioconductor: open software development for computational biology and bioinformatics. Bioconductor Project Working Papers. Working Paper 1. Huber,W., von Heydebreck,A., Sueltmann,H., Poustka,A. and Vingron,M. (2002) Variance stabilization applied to microarray 0 Normalization of microarray data using a spatial mixed model analysis which includes splines 1 David Baird1,, Peter Johnstone2 and Theresa Wilson3 0 AgResearch, 0 techniques for normalization have been suggested, including linear regression (Hedenfalk et al., 2001), ratio statistics (Chen et al., 1997), local smoothing (Yang et al., 2002) and analysis of variance (Kerr et al., 2000; Chu et al., 2002). Yang et al. (2002) compare these approaches and suggested a method which allows for differences induced by different print tips. We extend this idea to model the rows and columns over the whole slide and within the print tips and also autocorrelation in the printing order. This differs from other methodology in that we are able to correct unwanted variation arising from unevenness of the slide surface and scanning efficiency. The usual statistical modelling approach is taken where all possible sources of noise are jointly fitted in one model, with the need for each term being assessed using statistical significance of the reduction in remaining unexplained variation. Model terms can be added or removed as required. The fitted model then indicates where useful modification of our protocols and equipment would help minimize variation in future experiments. 0 METHODS Amplification of ESTs 0 Microarray technology has been used extensively to survey patterns of gene expression in a range of biological models. Using our own collection of bovine expressed sequence tags (ESTs) we have constructed large cDNA arrays (up to 22 000 ESTs) for use in several of our research projects. For such large arrays it is essential to identify sources of variation and correct for them to allow for robust use of this technology. 
Through normalization procedures, such variations can be identified and removed to obtain data for follow-on research. The analysis of the microarrays is a two-step process: a within-slide analysis aimed at normalization and, if required, standardization, followed by a between-slide analysis to estimate the differences between targets and their consistency. Various 0 Mixed models using splines for microarray data 0 C, washed for 5 min each in (1) 2 x SSC, 0.1% SDS, (2) 1 x SSC and (3) 0.1 x SSC, centrifuged at 500 g for 5 min, dried and scanned. 0 Allocation of probes to slides 0 Randomization is a well-known device used to ensure the valid application of significance tests and confidence intervals (Fisher, 1951). Randomization also disarms critics who suggest an allocation of experimental units has been chosen which is favourable to an author's hypothesis (Cox, 1992). Because of these properties, it is routine in traditional experiments to randomly allocate treatments to the experimental units. In microarray experiments, the physical constraints imposed by the storage of probes in 96-well plates and by the microarray printing robots ensure that a fully randomized layout is not possible. However, printing the 96-well plates in random order is possible and is justified in that some randomization is better than the alternative of no randomization. 0 ANALYSIS Measure of differential expression in probes 0 that the value M will be randomly distributed around a mean value of 0. Other approaches to handling values close to or below background can be used. One option is to make no background correction, which will shrink all values of M towards zero, with large reductions for spots of low intensity and minimal reductions for spots of high intensity. This has the advantage of reducing the variation of low-intensity spots, but the disadvantage of reducing the sensitivity for identifying differentially expressed ESTs with low expression levels.
Any spatial trends that remain in the log ratios because of trends in the background can be estimated and removed as part of the spatial model, as explained later in this paper. Another alternative, suggested by Durbin and Rocke (2003) in the context of transforming a single channel's expression, is to add a constant to all values in each channel as part of a more complex transformation. The constant to be added in the Durbin and Rocke approach is estimated as that giving the best stabilized error variance. For large expression values, these approaches have virtually no effect on the log ratio, but for values just below and above the minimum cut-off, the relative differences between the approaches may be substantial. The advantage of using logs over more complicated transforms is that the resulting values are more naturally interpreted by the experimenter. Which approach is best in terms of giving unbiased results can only be ascertained by a uniform study, which is not available in our current datasets. 0 Within slide dye bias 0 It is typically found that the mean of M depends on the intensity level of the probe. If we define A, the average log-intensity of the probe as A= 0 We have used a value of 0.5 for k, but have tried values between 0.1 and 1.0. The value of k controls how much the information on the probe is down-weighted, with larger values reducing the value of M towards 0. If both dyes have negative corrected intensities, then there is no information in the probe, and M is set to be a missing value. It is expected that the majority of probes in the sample will show no differential expression, and 0 then a plot of M versus A [an MA plot (Dudoit et al., 2002)] often shows a departure from the zero reference line. It is expected that the level of differential expression is independent of the brightness of the probes. Figure 1 shows the MA plot for one of our microarrays.
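As a rough illustration of the quantities plotted in an MA plot, the following minimal sketch computes M and A for each spot from the two background-corrected channel intensities. It assumes the conventional definitions M = log2(R) - log2(G) and A = (log2(R) + log2(G))/2; the base-2 logarithm, the function name `ma_values` and the channel names `red` and `green` are assumptions made for this example and are not taken from the text. Spots with a non-positive intensity in either channel are simply returned as missing here, which is a simplification and does not reproduce the down-weighting by k described above.

```python
import math


def ma_values(red, green):
    """Return two lists (M, A) for paired channel intensities.

    Hypothetical helper for illustration only.
    M is the log ratio of the two channels and A is the average
    log-intensity, both using base-2 logarithms (an assumption).
    A spot with a non-positive intensity in either channel has no
    defined logarithm and is reported as None in both lists.
    """
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            # Simplification: treat the spot as missing.
            m_vals.append(None)
            a_vals.append(None)
            continue
        log_r = math.log2(r)
        log_g = math.log2(g)
        m_vals.append(log_r - log_g)        # log ratio
        a_vals.append((log_r + log_g) / 2)  # average log-intensity
    return m_vals, a_vals


if __name__ == "__main__":
    m, a = ma_values([1024, 256, -5], [256, 256, 100])
    print(m, a)
```

Plotting the returned M values against the A values gives the kind of scatter in which an intensity-dependent departure from the zero line would be visible.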
It can be seen that the mean location of the M values is below zero for A between 8 and 11 and above zero for A > 11, falling back to zero as A approaches 16. The points falling on the two lines at the left of the plot are due to the truncation of the intensity of the dyes to the minimum values. Figure 3 shows an MA plot o 0 Developmental roles and molecular characterization of a Drosophila homologue of Arabidopsis Argonaute1, the founder of a novel gene superfamily 1 Youhei Kataoka1, Masatoshi Takeichi2 and Tadashi Uemura2,a,* 0 Background: Arabidopsis Argonaute1 (AGO1) is the founder of a novel gene superfamily that is conserved from fission yeasts to humans. AGO1, and several other members of this superfamily are necessary for stem cell renewal or RNA interference. However, little has been reported about their roles in animal development or about the molecular activities of any of the members. Results: We have isolated a Drosophila homologue of AGO1, dAGO1, in our attempt to search genetically for regulators of Wingless (Wg) signal transduction. dAGO1 is broadly expressed in the embryo and the imaginal disc. dAGO1 over-expression at wing margins suggested that it behaves as a positive regulator in the genetic background employed. Loss-of-function mutations of dAGO1, unexpectedly, did not give typical segment polarity phenotypes of the wg class; instead, dAGO1 maternal and zygotic mutant embryos showed developmental defects, with malformation of the nervous system being the most prominent. The mutant decreased in the numbers of several types of neurones and glia examined. The dAGO1 protein was distributed in the cytoplasm and co-sedimented with poly(U)- or poly(A)-conjugated beads. Conclusion: Our results suggest that the dAGO1 protein exerts its developmental functions by binding to RNA either directly or indirectly. 
0 Cells are endowed with a variety of mechanisms to repress the translocation of signalling components to nuclei in the absence of extracellular stimuli. One straightforward strategy to inhibit translocation is to destroy key components such as transcription factors before they enter the nucleus. The pioneering examples of such proteolytic targets are β-catenin and its Drosophila homologue, Armadillo (Arm; McCrea et al. 1991; Peifer et al. 1992). Unless a cell receives secreted proteins of the Wnt family, β-catenin/Arm are degraded in the cytoplasm (Orsulic & Peifer 1996; Pai et al. 1997). Following the binding of Wnt to its receptor in the Frizzled family, the proteolytic mechanism is inactivated, and β-catenin enters nuclei, leading to the transcription of target genes (Cadigan & Nusse 1997; Wodarz & Nusse 1998; Peifer & Polakis 0 © Blackwell Science Limited 0 Most Wnt proteins evoke the β-catenin-mediated signalling cascade, which plays important roles in cell proliferation and fate determination in animal development. Besides a role as a transcriptional activator in Wnt signal transduction, Arm/β-catenin binds cadherin, and this complex is essential for cell adhesion at cell-cell junctions (Oda et al. 1994; Cox et al. 1996; Muller & Wieschaus 1996; Iwai et al. 1997). Curiously, these two functions of Arm/β-catenin are separable (Orsulic & Peifer 1996). Because of the dual functions of β-catenin, overproduction of cadherin sequesters β-catenin and blocks Wnt signalling in Xenopus (for example, see Heasman et al. 1994). Similarly, cadherin overproduction in Drosophila wings mimics one of the loss-of-function phenotypes of wingless (wg), one of the most characterized Wnt genes in terms of developmental roles (Sanson et al. 1996; this study). During the third instar larval stage, Wg is produced in a stripe of cells in the developing wing blade, and these cells are responsible for patterning the margin of the adult wing (Phillips & Whittle 1993; Couso et al.
1994). Without this Wg function in late stages of disc development, the 0 Y Kataoka et al. 0 wings lose their marginal structures, a phenotype that can be reproduced with high penetrance by DE-cadherin overproduction (compare Fig. 3A with 3B). Transgenic flies that overproduce DE-cadherin along their wing margin are healthy and fertile. Thus, the strain provides an appropriate tool for conducting genetic searches for new regulators of Wg signalling, as has been previously attempted (Greaves et al. 1999). 0 Our search allowed us to identify a Drosophila homologue of Arabidopsis Argonaute1 (AGO1), which is required for the dorsoventral identity of the leaf, development of the axial meristem (a group of undifferentiated, dividing cells), and post-transcriptional gene silencing (Bohmert et al. 1998; Lynn et al. 1999; Fagard et al. 2000). AGO1 is the founder of a novel gene superfamily that is highly conserved among 0 Roles of a Drosophila homologue of AGO1 0 fission yeast, plants and animals, which is designated the AGO1 gene superfamily in this article. To clarify the developmental roles of dAGO1, we examined both loss-of-function and over-expression phenotypes in the embryo and in the imaginal disc. Although the amino acid sequences of the proteins of this superfamily do not predict their molecular activities, our results suggested that the dAGO1 protein binds to RNA either directly or indirectly. 0 level of mRNA (compare Fig. 4A with 4B) and was used in subsequent studies. 0 Subdivision of the AGO1 superfamily 0 At least three alternatively spliced transcripts are made from dAGO1, and we focused on one of them, CT42236, which is equivalent to the EST clone LD09501 (Fig. 1A; Adams et al. 2000). The predicted dAGO1 protein consisted of 950 amino acids, and its molecular weight was estimated as 106 kDa.
As in the case of all members of the AGO1 superfamily, amino acid sequences of dAGO1 provided no definite information about its molecular activity. Phylogenetic trees and multiple alignments of amino acid sequences suggest that this superfamily consists of two distinct subfamilies and several orphans (Fig. 1B,C). We named one of these subfamilies the AGO1 subfamily, which includes AGO1, dAGO1 and an S. pombe protein, SPCC736.11. The other subfamily was designated as the PIWI subfamily, because the founder is a Drosophila protein of the piwi gene, which controls the division of germ-line stem cells (Cox et al. 1998). The orphans whose mutants were isolated, are C. elegans rde-1, which is required for RNA interference (Tabara et al. 1999), and Neurospora QDE-2, which is required for quelling, a phenomenon similar to co-suppression (Cogoni & Macino 1997; Fagard et al. 2000). Every member of the superfamily shares a conserved box of 43 residues near the carboxy terminal (Cox et al. 1998), and proteins of the AGO1 subfamily share a longer stretch of 86 residues on average (the AGO1 box; Fig. 1C, D). What distinguishes the two subfamilies most is the presence or absence of a region that is 0 Identification of a Drosophila AGO1 homologue essential for viability 0 To identify new components of the Wg signal transduction pathway, we performed a genetic screen for dominant modifiers of the wing-margin phenotype caused by the over-expression of DE-cadherin (see details in Experimental procedures). We focused on a P-element insertion line, l(2)k08121 (Spradling et al. 1995, 1999), in which we found that the transposon was inserted into gene CG6671 (Fig. 1A; Adams et al. 2000). This gene is homologous to Arabidopsis AGO1 (Bohmert et al. 1998) as described below; therefore we designated this Drosophila gene dAGO1. 
The lethality of l(2)k08121 was due to a loss of dAGO1 function, as shown by the fact that remobilization of the P-element recovered the lethality and that expression of a cDNA clone (LD09501; Rubin et al. 2000) under a heat-shock promoter made l(2)k08121 homozygotes and l(2)k08121/Df develop to adulthood. l(2)k08121 is a strong allele, as was shown by a great reduction in the 0 Open Access 0 Computational identification of Drosophila microRNA genes 1 Eric C Lai¤, Pavel Tomancak¤, Robert W Williams and Gerald M Rubin 0 These authors contributed equally to this work. 0 Background: MicroRNAs (miRNAs) are a large family of 21-22 nucleotide non-coding RNAs with presumed post-transcriptional regulatory activity. Most miRNAs were identified by direct cloning of small RNAs, an approach that favors detection of abundant miRNAs. Three observations suggested that miRNA genes might be identified using a computational approach. First, miRNAs generally derive from precursor transcripts of 70-100 nucleotides with extended stem-loop structure. Second, miRNAs are usually highly conserved between the genomes of related species. Third, miRNAs display a characteristic pattern of evolutionary divergence. Results: We developed an informatic procedure called 'miRseeker', which analyzed the completed euchromatic sequences of Drosophila melanogaster and D. pseudoobscura for conserved sequences that adopt an extended stem-loop structure and display a pattern of nucleotide divergence characteristic of known miRNAs. The sensitivity of this computational procedure was demonstrated by the presence of 75% (18/24) of previously identified Drosophila miRNAs within the top 124 candidates. In total, we identified 48 novel miRNA candidates that were strongly conserved in more distant insect, nematode, or vertebrate genomes. We verified expression for a total of 24 novel miRNA genes, including 20 of 27 candidates conserved in a third species and 4 of 11 high-scoring, Drosophila-specific candidates. 
Our analyses lead us to estimate that drosophilid genomes contain around 110 miRNA genes. Conclusions: Our computational strategy succeeded in identifying bona fide miRNA genes and suggests that the number of miRNA genes is nearly 1% of the number of predicted protein-coding genes in Drosophila, a percentage similar to that recently attributed to miRNAs in other metazoan genomes. 0 Although the analysis of sequenced genomes to date has focused most heavily on the protein-coding set of genes, all genomes also contain a constellation of non-coding RNA genes. With the exception of certain classes of RNAs with strongly conserved sequences and/or structures, such as ribosomal and transfer RNAs, identification of most non-coding RNAs has historically been a relatively serendipitous affair. Only very recently have there been concerted efforts to identify such genes systematically, using both experimental and computational approaches [1]. Our collective ignorance of the totality of non-coding RNA genes was laid bare by recent work on microRNAs (miRNAs), 0 Genome Biology 2003, 4:R42 0 an abundant family of 21-22 nucleotide non-coding RNAs [2,3]. The founding members of this family, lin-4 and let-7, were identified through forward analysis of extant Caenorhabditis elegans mutants [4,5]. Both of these RNAs are post-transcriptional regulators of developmental timing that function by binding to the 3' untranslated regions (3' UTRs) of target genes [5-8]. Although they were long regarded as genetic curiosities possibly specific to nematodes, let-7 was subsequently found to be broadly conserved across bilaterian evolution [9] and miRNA genes are now recognized as a pervasive and widespread feature of animal and plant genomes [10-16].
In general, it is thought that miRNA biogenesis proceeds via intermediate precursor transcripts of more than 70 nucleotides that have the capacity to form an extended stem-loop structure (pre-miRNA), although at least some pre-miRNAs are further derived from even longer transcripts (primary miRNA transcripts, or pri-miRNAs). These can exist as long individual pre-miRNA precursor transcripts, as operon-like multiple pre-miRNA precursors, or even as part of primary mRNA transcripts. Processing of pri-miRNA into the pre-miRNA stem-loop occurs in the nucleus, while subsequent processing of pre-miRNA into 21-22-mers is a cytoplasmic event mediated by the RNase III enzyme Dicer [17-20]; Dicer is also responsible for cleavage of long perfectly double-stranded RNA into 21-22 nucleotide fragments during RNA interference (RNAi) [2,21]. These latter molecules, known as silencing RNA (siRNA), bind to and trigger the degradation of perfectly homologous mRNA molecules via RISC, a double-strand RNA-induced silencing complex containing nuclease activity [22,23]. Although the in vivo function of only a few miRNAs is known so far, it is believed that the vast majority are likely to participate in post-transcriptional gene regulation of complementary mRNA targets. Interestingly, perfect or near-perfect target complementarity is associated with mRNA degradation [24-26], similar to the effects of siRNA, whereas imperfect base-pairing is associated with regulation by translational inhibition [6,27]. Recently, siRNAs with imperfect match to target mRNA were observed to function as translational inhibitors [28], suggesting that the type of 21-22 nucleotide RNA-mediated regulation may be largely determined by the quality of target complementarity. The vast majority of the approximately 300 miRNAs currently known were identified through direct cloning of short RNA molecules.
Although this method has been quite successful thus far, its practicality is limited by the necessity for a considerable amount of RNA as raw material for cloning, and cloned products are often dominated by a few highly expressed miRNAs. For example, 41% of miRNAs cloned from HeLa cells are variants of let-7, 28% of human brain miRNAs are variants of miR-124, and 45% of miRNAs cloned from human heart and 32% of those cloned from early 0 Drosophila embryos are miR-1 [10,29]. In fact, it has been opined that few additional mammalian miRNAs will be easily identified by the direct cloning method [30]. As a complementary approach to miRNA identification, we developed an informatic strategy ('miRseeker') and applied it to the completed genomes of Drosophila melanogaster and D. pseudoobscura, which are some 30 million years diverged. miRseeker subjects conserved intronic and intergenic sequences to an RNA folding and evaluation procedure to identify evolutionarily constrained hairpin structures with features characteristic of known miRNAs. The specificity of this computational procedure was shown by the presence of 18 out of 24 reference miRNAs within the top 124 candidates. We identified a total of 48 novel miRNA candidates whose existence was strongly supported by conservation in other insect, nematode or vertebrate genomes. Expression of 24 novel miRNA genes was verified by northern analysis (including 20 out of 27 candidates that were supported by third-species conservation and 4 out of 11 high-scoring predictions specific to Drosophila), demonstrating that the bioinformatic screen was successful. As might be expected, the newly verified miRNA genes vary tremendously with respect to abundance and developmental expression profile, suggesting diverse roles for these genes. 
Inference of our false-positive prediction and false-negative verification rates (based on our ability to identify known miRNAs and detect the expression of highly conserved, and thus presumed genuine, novel miRNAs) leads us to estimate that drosophilid genomes contain around 110 miRNA genes, or nearly 1% of the number of predicted protein-coding genes. Taken together with other concurrent genomic analyses [31-34], our results make it likely that most miRNAs in completed animal genomes have now been identified. Collectively, this sets the stage for both genome-wide and targeted studies of this functionally elusive family of regulators. 0 Evolutionarily conserved characteristics of miRNA genes 0 Unstructured sequence 0 Conserved stem-loop 0 Evaluation of cadmium-induced transcriptome alterations by three color cDNA labeling microarray analysis on a T-cell line 1 George Th. Tsangaris *, Athanassios Botsonis, Ioannis Politis, Fotini Tzortzatou-Stathopoulou 0 Keywords: Cadmium; Heavy metals; cDNA microarray; Gene regulation; Toxicogenomics; Apoptosis 0 Introduction The massive and rapid increase in human genome-scale DNA sequencing and the concomitant development of methods and technologies for the exploitation of this information, have recently indicated that reliable predictions should not be based on any single gene, but on multi-gene ex- 0 has been shown that Cd compounds induced tumors in lungs, testes, prostate as well as hematopoietic system malignancies (Degraeve, 1981; IARC, 1993; Waalkes and Rehm, 1994), while in cultured mammalian cells they induced morphological transformations, chromosomal aberrations and gene mutations (DiPaolo and Castro, 1979; Ochi and Ohsawa, 1983; Ochi et al., 1984; Yang et al., 1996; Hwua and Yang, 1998).
Previous work on a human T-cell line (CEM-C12) has shown that Cd exerts its toxic effect via apoptosis (el Azzouzi et al., 1994), while a comparative study of the apoptotic effect of Cd in immune system cell lines has shown a differential Cd-induced apoptosis, which may disturb the immune system's normal growth and development (Tsangaris and Tzortzatou-Stathopoulou, 1998a). On the cellular level, Cd is highly reactive with sulfhydryl groups of proteins and can substitute for zinc in certain enzymes (Vallee and Ulmer, 1972; Figueiredo-Pereira et al., 1998); acting through an orphan zinc receptor, it can provoke the production of inositol triphosphate and the subsequent release of calcium from internal stores, thereafter stimulating protein kinase C (Block et al., 1992; Smith et al., 1994). Cd has also been reported to activate p38 and extracellular regulated kinase (ERK) in rat brain tumor cells (Hung et al., 1998) and c-Jun N-terminal kinase (JNK) in porcine renal epithelial cells (Matsuoka and Igisu, 1998). On the molecular level, Cd has been shown to induce mRNA levels of several genes such as c-jun, c-myc (Jin and Ringertz, 1990), c-fos (Wang and Templeton, 1998), metallothionein (MT) (Karin et al., 1987) and heme oxygenase 1 (HMOX1) (Alam et al., 1989; Takeda et al., 1994). We and others have shown that in nucleated blood cells, and particularly lymphocytes, Cd time- and dose-dependently activates transcription of both the metallothionein-IIA (MT-IIA) and heat shock protein 70 (HSP70) genes (Pellegrini et al., 1994a,b). Thus, MT-IIA is induced after exposure to low Cd concentrations, whereas HSP70 is induced at higher concentrations.
In the present study, we investigated by cDNA microarrays the cadmium-induced transcriptome alterations on the immature T-cell line CCRF-CEM, analyzing 1455 genes, after incubation of 0 the cells for 6 and 24 h with two different Cd concentrations (10 and 20 mM), applying for the first time three-fluorescent-dye cDNA labeling, followed by simultaneous three-laser analysis, on the same microarray slide. 0 ml per well of acid isopropanol (0.04 N HCl) and the plates were read on an ELISA reader (Stat-Fax 2100, Awareness Technology, Palm City, FL). The data were expressed as the percentage of the number of viable cells in cadmium-treated cells as compared to untreated cells (control). 0 Materials and methods 0 Quantification of apoptotic cells 0 The detection and quantification of apoptosis was performed as previously described (Tsangaris and Tzortzatou-Stathopoulou, 1996). Briefly, after the exposure of the cells (2 x 10^6 cells/ml) for 6 or 24 h to various Cd2+ concentrations, 8 ml of the cell suspension were mixed with 2 ml of a fluorescent EtBr-containing dye (0.1 mg/ml EtBr, 1.5% NP40, in PBS). This suspension was placed on a microscope slide and covered with a coverslip. Fluorescent-stained cells were examined with an Epi-Fluorescence Microscope (Optiphot-2, Nikon, Japan). The cells were scored and categorized as normal, apoptotic or necrotic and the results were expressed as the percentage of each cell type among the total cells counted. For each Cd2+ concentration at each time point, more than five slides were prepared and more than 500 cells/slide were examined. 0 Media and reagents 0 The medium for cell cultures was RPMI 1640, supplemented with 10% heat-inactivated fetal bovine serum (FBS, Invitrogen/Life Technologies International, Paisley, England), 100 U/ml penicillin, 100 mg/ml streptomycin, 2 mM L-glutamine and 20 mM HEPES buffer (serum medium) (all derived from Biochrom, Berlin, Germany). Cadmium chloride (Cd2+) (Sigma Chem. Co., St.
Louis, MO) was dissolved in water at 10 mM, stored at 4 °C (stock solution) and diluted to appropriate concentrations immediately before use in culture medium without FBS. 0 Cell cultures 0 The CCRF-CEM human immature T-cell line was obtained from the European Collection of Cell Cultures (ECACC, Salisbury, UK). Cells (3 x 10^5 cells/ml) were cultured in serum medium at 37 °C in a humidified atmosphere containing 5% CO2 in air, and the medium was changed every 3 days. For each experiment, cells (1 x 10^6 cells/ml) were harvested at the exponential growth phase and resuspended in 10% serum medium in the presence of Cd2+ for 6 or 24 h in Falcon 75 cm2 flasks (Becton Dickinson, Oxnard, CA). 0 RNA isolation and cDNA production 0 After the incubation of the cells for 6 or 24 h, with or without Cd2+, 10 x 10^6 cells were centrifuged (270 x g, 10 min, 4 °C) and the pellets were washed twice with ice-cold normal saline. The cell pellets 0 Research article 0 Identification of Pax2-regulated genes by expression profiling of the mid-hindbrain organizer region 1 Maxime Bouchard1,2,*, David Grote1,2,*, Sarah E. Craven3, Qiong Sun1, Peter Steinlein1 and Meinrad Busslinger1 0 The paired domain transcription factor Pax2 is required for the formation of the isthmic organizer (IsO) at the midbrain-hindbrain boundary, where it initiates expression of the IsO signal Fgf8. To gain further insight into the role of Pax2 in mid-hindbrain patterning, we searched for novel Pax2-regulated genes by cDNA microarray analysis of FACS-sorted GFP+ mid-hindbrain cells from wild-type and Pax2-/- embryos carrying a Pax2GFP BAC transgene. Here, we report the identification of five genes that depend on Pax2 function for their expression in the mid-hindbrain boundary region. These genes code for the transcription factors En2 and Brn1 (Pou3f3), the intracellular signaling modifiers Sef and Tapp1, and the non-coding RNA Ncrms.
The Brn1 gene was further identified as a direct target of Pax2, as two functional Pax2-binding sites in the promoter and in an upstream regulatory element of Brn1 were essential for lacZ transgene expression at the mid-hindbrain boundary. Moreover, ectopic expression of a dominant-negative Brn1 protein in chick embryos implicated Brn1 in Fgf8 gene regulation. Together, these data defined novel functions of Pax2 in the establishment of distinct transcriptional programs and in the control of intracellular signaling during mid-hindbrain development. 0 Key words: Mid-hindbrain development, Pax2-regulated genes, Sef, Tapp1, Ncrms, En2, Brn1, Fgf8 regulation, Mouse 0 The midbrain and cerebellum develop from an organizing center that is formed at the junction between the embryonic midbrain and hindbrain, known as the isthmus. This isthmic organizer (IsO) was discovered because of its property of inducing an ectopic midbrain or cerebellum, when transplanted into the chick diencephalon or hindbrain, respectively (reviewed by Liu and Joyner, 2001a; Wurst and Bally-Cuif, 2001). The IsO activity recruits the surrounding tissue into either a midbrain or cerebellum fate by controlling cell survival, proliferation and differentiation along the anteroposterior axis of the mid-hindbrain region. The formation of the IsO is the result of complex cross-regulatory interactions between transcription factors (Otx, Gbx, Pax and En) and secreted proteins (Wnts and Fgfs), culminating in the expression of the signaling molecule Fgf8 at the mid-hindbrain boundary (Liu and Joyner, 2001a; Wurst and Bally-Cuif, 2001; Ye et al., 2001). Fgf8 is the central mediator of IsO activity, as it is both necessary and sufficient for inducing midbrain and cerebellum development (Crossley et al., 1996; Chi et al., 2003). Once formed, the IsO is maintained by a positive feedback loop involving multiple mid-hindbrain-specific regulators. 
Consequently, the IsO is lost upon individual 0 mutation of these regulators, whereas ectopic expression of a single factor activates most other components of the regulatory cascade (Nakamura, 2001). Owing to this interdependence, the hierarchical relationship among the different regulators remains largely elusive during the maintenance phase of IsO activity (Liu and Joyner, 2001a; Wurst and Bally-Cuif, 2001). The initiation of IsO development crucially depends on the transcription factor Pax2 (Favor et al., 1996; Brand et al., 1996), which shares similar DNA-binding and transactivation functions with Pax5 and Pax8 of the same paired domain protein subfamily (Kozmik et al., 1993; Doerfler and Busslinger, 1996). Pax2 is the earliest known gene to be expressed throughout the prospective mid-hindbrain region in late gastrula embryos (Rowitch and McMahon, 1995). The initially broad expression pattern of Pax2 is progressively refined to a narrow ring centered at the mid-hindbrain boundary by embryonic day 9.5, while the related Pax5 and Pax8 genes are activated in the same region at 3-4 and 6-7 somites, respectively (Urbanek et al., 1994; Rowitch and McMahon, 1995; Pfeffer et al., 1998). Consistent with this sequential gene induction, mutation of the Pax2 gene leads to the loss of the midbrain and cerebellum in mouse and zebrafish embryos (Favor et al., 1996; Brand et al., 1996; Bouchard et al., 2000), whereas the inactivation of Pax5 or Pax8 results in a mild 0 Development 132 (11) cerebellar midline defect or no brain phenotype at all (Urbanek et al., 1994; Mansouri et al., 1998). The severe mid-hindbrain deletion is, however, only observed in Pax2-/- mouse embryos on the C3H/He genetic background (Bouchard et al., 2000), where the compensating Pax5 and Pax8 genes fail to be activated at the mid-hindbrain boundary (Pfeffer et al., 2000; Ye et al., 2001) similar to the Pax2.1 (noi) mutant embryos of the zebrafish (Pfeffer et al., 1998). 
In the absence of Pax2, Otx2, Gbx2 and Wnt1 are normally transcribed at early somite stages, while the expression of En1 is reduced in the developing mid-hindbrain region (Ye et al., 2001). Importantly, Fgf8 expression is never initiated at the mid-hindbrain boundary of Pax2-/- C3H/He embryos (Ye et al., 2001), resulting in the complete absence of IsO activity and subsequent apoptotic loss of the mid-hindbrain tissue starting at the 12-somite stage (Pfeffer et al., 2000; Chi et al., 2003). To further investigate the role of Pax2 at the onset of mid-hindbrain development, we searched for novel Pax2-regulated genes by gene expression profiling of mid-hindbrain cells isolated by FACS sorting from wild-type and Pax2-/- E8.5 embryos. This unbiased approach identified the En2, Brn1 (Pou3f3 - Mouse Genome Informatics), Sef (Il17rd - Mouse Genome Informatics), Tapp1 (Plekha1 - Mouse Genome Informatics) and non-coding Ncrms genes as genetic Pax2 targets that are totally dependent on Pax2 function for their expression in the mid-hindbrain region. The transcription factors En2 and Brn1, as well as the signaling modifiers Sef and Tapp1, implicate Pax2 in the establishment of distinct transcriptional programs and the control of intracellular signaling during mid-hindbrain development. Biochemical and transgenic analyses demonstrated that Pax2 directly activates the mid-hindbrain-specific expression of Brn1 by interacting with two functional Pax2/5/8-binding sites in the promoter and an upstream regulatory element of the Brn1 gene. Moreover, ectopic expression of a dominant-negative Brn1 protein in chick embryos implicated Brn1 as a novel regulator of Fgf8 expression. The identification of new Pax2-regulated genes has thus provided important insight into the role of Pax2 in mid-hindbrain development. 0 Review articles 0 Genetic modules and networks for behavior: lessons from Drosophila 1 Robert R.H.
Anholt 0 The aim of this review is not to provide an exhaustive survey of the literature, as this would be a near impossible task, but rather to highlight fundamental principles using Drosophila as a model organism with examples from recent studies. It should be noted that, while the focus of this article is on the genetic architecture of behavior, similar principles apply to other complex traits as well. Behaviors as complex traits Behaviors show all the hallmarks of quantitative traits. They arise from the coordinated actions of multiple genes and their phenotypes are significantly affected by genome-environment interactions.(1,2) Consequently, neurogenetic studies of behaviors face the typical challenges characteristic of quantitative traits: environmental variation that is often hard to control, and a vast number of independently segregating genes with both additive and epistatic interactions that render it difficult to predict phenotypic values from one generation to the next. To dissect the genetic architecture of such traits, it is desirable to minimize environmental variation and essential to precisely control the genetic background. This is difficult to achieve in human populations and, although inbred strains of mice have been used successfully in gene mapping studies, such studies are laborious and often are limited by their ability to define only large chromosomal regions that harbor possible candidate genes (quantitative trait loci, QTL).(3) Furthermore, different QTL are often identified in different environmental, physiological or developmental conditions, which further complicates efforts to understand the genetic architecture of the behavior under study. Whereas considerable advances have been made in the study of neurogenetics of behavior using mouse model systems, obtaining a comprehensive description of the genetic architecture of even a single behavioral trait appears to be a gargantuan task for every behavioral trait examined to date. 
Most behavioral genes in mice have been identified as a consequence of spontaneous mutations or as a result of homologous recombination studies, which, however, do not always yield unambiguously interpretable phenotypes.(4) Furthermore, genetic background variation and/or restricted sample sizes often limit resolution of such studies to identifying only genes with large effects. Nonetheless, knockout mice have confirmed yet again, one gene at a time, the polygenic 0 Introduction Behaviors are the quintessential unifying feature of all animal life forms and essential for survival and procreation. Behaviors are the ultimate expression of the nervous system and depend on the coordinated expression of ensembles of genes. This article seeks to describe how our views of the genetic architecture of behavior have evolved from attempts to connect individual mutations as isolated pieces of a complex puzzle into the current realization of dynamic multidimensional networks of interacting pleiotropic genes. An appreciation of the genetic architecture of any complex trait demands attention to genetic background and sex effects, and incorporates interactions between the genome and both the physical and, in the case of behavioral phenotypes, social environment.(1,2) 0 BioEssays 26.12 0 Drosophila related information, is publicly available (http://flybase.bio.indiana.edu/). Genetic networks A diverse spectrum of behaviors has been studied in Drosophila, including courtship and mating behavior,(18,19) circadian behavior(20-22) and sleep,(23) general locomotor activity,(24) geotaxis,(25) grooming behavior,(26) chemosensory responsiveness,(27-29) foraging behavior,(30,31) aggression,(32) and memory and learning.(33,34) Mutations affecting critical genes have been identified for many of these traits. 
Traditionally, mutant screens identify genes that affect the trait one at a time and subsequently attempt to place these loci into pathways that subserve the behavior under study. Recent applications of functional genomic approaches to behavior have transformed the traditional view of simple linear genetic pathways, in which a single mutation has a restricted effect on a specialized function, into a more complex concept of plastic genetic networks.(35) This was illustrated by transcriptional profiling studies of circadian genes, which identified a large and diverse group of oscillating genes that are co-regulated under the control of the Clock gene.(36,37) Using high-density oligonucleotide microarrays, McDonald and Rosbash identified in wild-type flies 134 cycling genes, which included not only known members of the circadian clock, but also a large number of genes not previously known to cycle, encoding detoxification enzymes, ligand carrier proteins, neuropeptide modulators, proteins involved in cuticle formation, genes involved in immune defense, a diverse array of miscellaneous enzymes as well as predicted proteins of unknown function. A larger group of 267 genes with altered transcriptional regulation was identified when Clk mutants were analyzed. 
Such Clk-regulated genes included unexpected co-regulated groups, with 17 genes encoding antimicrobial peptides and 9 encoding pheromone or odorant-binding proteins, indicating that the Clk mutation has widespread direct and indirect effects throughout the transcriptome.(36) Similar results were obtained simultaneously and independently by Clar 0 BMC Genomics 0 Research article 0 BioMed Central 0 Open Access 0 Performance evaluation of commercial short-oligonucleotide microarrays and the impact of noise in making cross-platform correlations 1 Richard Shippy1, Timothy J Sendera*1, Randall Lockner1, Chockalingam Palaniappan1, Tamma Kaysser-Kranich1, George Watts2 and John Alsobrook3 0 There are several commercial microarray systems currently available on the market for genome-scale gene expression analysis. Different microarray manufacturers provide distinct underlying technologies, protocols and reagents specific to each system [1]. Despite the widespread use of microarrays, much ambiguity regarding data analysis, interpretation and correlation of the different technologies exists. There is a need for standardization that will facilitate comparison of microarray data from different platforms [2]. Comparison and cross-validation between microarray platforms would greatly increase the understanding and value of the wealth of data generated from each microarray experiment [3]. A number of cross-platform comparisons have reported a failure to demonstrate an acceptable level of correlation between different microarray technologies [4-7]. Some of the difficulties in correlating data can be attributed to fundamental differences between cDNA and oligonucleotide based microarray technologies. For example, target preparation differences and single vs. dual labeling techniques complicate the comparisons. 
Furthermore, cDNA arrays have difficulty in distinguishing between splice variants and highly homologous genes, while oligonucleotide arrays can make these distinctions if designed appropriately. However, when considering oligonucleotide platforms, which have widespread popularity, direct comparisons between different platforms should be less complex and more direct. We assert that differences in platform sensitivity, reproducibility and annotation cross-referencing accuracy account for a majority of the irreconcilable differences previously reported between different platforms [4-7]. When considering these factors we demonstrate a strong correlation between expression ratio data from two different commercially available short oligonucleotide based microarray technologies. This paper provides a comprehensive guideline for microarray analysis, interpretation and cross-platform correlation. There are two commercially available high-density microarray platforms that use short oligonucleotides for expression profiling. CodeLink (GE Healthcare formerly Amersham Biosciences, Chandler, AZ) and GeneChip (Affymetrix, Santa Clara, CA) microarray platforms utilize oligonucleotide gene target probes of 30 and 25 bases, respectively. Some of the notable differences between the GeneChip and CodeLink systems are, respectively, multiple probes vs. one pre-validated probe per gene target, two-dimensional surface vs. three-dimensional array matrix, and in situ synthesized oligonucleotides vs. presynthesized, non-contact oligonucleotide deposition. We restricted our comparative analysis to these two platforms because these systems are most similar with respect to oligonucleotide length, target preparation, and single color indirect labeling methodology. 
Since these commercial assays are similar, and systematic variables were isolated by using the same total RNA starting material for all target preparations, we expected disparity in performance to reflect differences inherent to the microarray platforms. To provide data for comparison of the platforms, five technical replicates of brain and pancreas were processed on each platform and the results were compared for reproducibility, sensitivity, and similarity of results. Standard manufacturer-recommended protocols and settings were employed to obtain the raw data from each platform. In the case of Affymetrix GeneChip, a recent cross-platform microarray evaluation [7] used two additional algorithms [8,9] for analysis of the GeneChip data and found the same level of discordance across the three analysis algorithms as was observed in the cross-platform microarray comparisons [7]. We therefore restricted our analysis of the GeneChip data to the Affymetrix recommended MAS 5.0 software [10]. This methodology was followed to simulate the results users would achieve by following current protocols supplied with each microarray system. 0 Two different tissue types were compared in this study to ensure a large number of differentially expressed genes, and provide expression ratios across a wide dynamic range for derivation of the correlation coefficient between the two platforms. The array-to-array precision of each microarray platform was calculated from the five replicates within each tissue sample. The pair-wise array-to-array precision of each microarray platform is illustrated in Figure 1 with respective noise levels for both CodeLink and GeneChip. In these graphs all 10,763 uniquely represented genes, common between both microarray platforms, are illustrated. The GeneChip comparisons display a wider distribution relative to CodeLink at the lower end of the fluorescence detection range. 
While this wider distribution could be interpreted as indicating a lower level of precision relative to CodeLink, precision should only be assessed for the population of genes with expression values above the noise calculation (i.e. 'present' on the arrays being considered). Due to the variation in noise and specificity level between expression detection systems, each system must individually define its own threshold level cutoff for resultant confidence in signals above technical noise. In addition, in a multi-oligonucleotide detection system, a defined algorithm must be set to determine the method for combining individual probe data to yield a final gene expression level. Therefore, we used each manufacturer's indications for gene signals that should be considered confidently above system noise. The wider distribution observed in the GeneChip platform is within the noise population and therefore should not penalize the overall precision measurements. Qualitatively, CodeLink and GeneChip showed similar 0 Genotyping by apyrase-mediated allele-specific extension 1 Afshin Ahmadian, Baback Gharizadeh, Deirdre O'Meara, Jacob Odeberg and Joakim Lundeberg* 0 Center for Physics, Astronomy and Biotechnology, Department of Biotechnology, The Royal Institute of Technology (KTH), Roslagstullsbacken 21, SE-106 91 Stockholm, Sweden 0 ABSTRACT This report describes a single-step extension approach suitable for high-throughput single-nucleotide polymorphism typing applications. The method relies on extension of paired allele-specific primers and we demonstrate that the reaction kinetics were slower for mismatched configurations compared with matched configurations. In our approach we employ apyrase, a nucleotide degrading enzyme, to allow accurate discrimination between matched and mismatched primer-template configurations. 
This apyrase-mediated allele-specific extension (AMASE) protocol allows incorporation of nucleotides when the reaction kinetics are fast (matched 3′-end primer) but degrades the nucleotides before extension when the reaction kinetics are slow (mismatched 3′-end primer). Thus, AMASE circumvents the major limitation of previous allele-specific extension assays in which slow reaction kinetics will still give rise to extension products from mismatched 3′-end primers, hindering proper discrimination. It thus represents a significant improvement of the allele-extension method. AMASE was evaluated by a bioluminometric assay in which successful incorporation of unmodified nucleotides is monitored in real-time using an enzymatic cascade. INTRODUCTION Genome analysis techniques have increasingly been adapted to identify and score single-nucleotide polymorphism (SNP) to elucidate the genetics of individual differences in drug response and disease susceptibility. A number of different techniques have been proposed to scan sequence variations in a high-throughput fashion. Many of these methods are based on hybridization techniques, which discriminate between allelic variants. High-throughput hybridization of allele-specific oligonucleotides can be performed on microarray chips (1), microarray gels (2) or by using allele-specific probes (molecular beacons) in the PCR (3). Other technologies suitable for SNP genotyping are mini-sequencing (4), mass 0 Development and Validation of a Diagnostic DNA Microarray To Detect Quinolone-Resistant Escherichia coli among Clinical Isolates 1 Xiaolei Yu,1 Milorad Susa,2 Cornelius Knabbe,2 Rolf D. Schmid,1 and Till T. Bachmann1* 0 J. CLIN. MICROBIOL. 0 detection of quinolone resistance. 
Although there are several platforms available for array-based single-nucleotide polymorphism, e.g., allele-specific hybridization (34), single-base primer extension (26), allele-specific amplification (1), or allele-specific oligonucleotide ligation (13), we chose allele-specific hybridization because its robust performance should be suitable for routine clinical application. In contrast to the above-mentioned genotyping methods, the use of allele-specific hybridization allowed not only the identification of the mutated amino acid but also the exact substitution, which could have different contributions to resistance and can be used as a marker in epidemiological studies. 0 MATERIALS AND METHODS Strains. In total, 30 E. coli clinical isolates from four different hospitals in Germany (Backnang, Stuttgart, Schorndorf, and Winnenden) (referred to here as E. coli 1 to 30) were used for this study. These strains were isolated from urine (n = 20), swabs (n = 7), secretions (n = 2), and blood (n = 1) of patients. The susceptibility against quinolone was determined according to NCCLS guidelines by using either ciprofloxacin alone (n = 23) or both ciprofloxacin and levofloxacin (n = 7). The genomic DNA was isolated from a bacterial pure culture by using a QIAamp DNA minikit (Qiagen, Hilden, Germany) according to the manufacturer's protocol. DNA sequencing. For the DNA sequencing, a 418-bp fragment of E. coli, which included the QRDRs, was amplified by PCR with primers described previously (35). The 50-µl PCR mixture included approximately 80 ng of template (genomic DNA of E. coli), a 0.4 pM concentration of each primer, 0.25 mM deoxynucleoside triphosphates, 1.5 mM Mg2+, and 2.5 U of Taq polymerase (Eppendorf, Hamburg, Germany). The PCRs were performed in a thermocycler (Mastercycler gradient) (Eppendorf) with the following parameters: 94°C for 5 min; 30 cycles at 94°C for 1 min, 52°C for 1 min, and 72°C for 1 min; and a final elongation at 72°C for 10 min. 
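As a rough check on the cycling parameters just listed, the programmed hold times can be summed. The following is a minimal sketch, not part of the published protocol: the constant and function names are hypothetical, and ramp times between temperatures are ignored because the text gives only hold times.

```python
# Hold times (minutes) taken from the cycling parameters stated in the text.
INITIAL_DENATURATION_MIN = 5          # 94°C for 5 min
CYCLES = 30
STEP_MINUTES_PER_CYCLE = (1, 1, 1)    # 94°C, 52°C and 72°C, 1 min each
FINAL_ELONGATION_MIN = 10             # 72°C for 10 min


def total_hold_minutes():
    """Sum the programmed hold times; temperature ramping is ignored (assumption)."""
    cycling = CYCLES * sum(STEP_MINUTES_PER_CYCLE)
    return INITIAL_DENATURATION_MIN + cycling + FINAL_ELONGATION_MIN


print(total_hold_minutes())  # 105
```

Under these assumptions the program holds for 105 min in total; the real run time on the instrument would be longer by the ramping time.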
The amplified fragment, which was purified with a QIAquick PCR purification kit (Qiagen) according to the manual provided by the manufacturer, was used for direct sequencing. The sequencing was done with the same primer pairs, a Big-Dye terminator cycle sequencing kit (Applied Biosystems, Darmstadt, Germany), and a Prism 377 DNA sequencer (Applied 0 QUARTERLY 0 DNA microarrays, a novel approach in studies of chromatin structure. 1 Piotr Widlak½ 0 Department of Experimental and Clinical Radiobiology, Center of Oncology, Gliwice, Poland 0 Key words: DNA microarray, genomics, epigenomics, chromatin, nucleosomes The DNA microarray technology delivers an experimental tool that allows surveying expression of genetic information on a genome-wide scale at the level of single genes -- for the new field termed functional genomics. Gene expression profiling -- the primary application of DNA microarrays technology -- generates monumental amounts of information concerning the functioning of genes, cells and organisms. However, the expression of genetic information is regulated by a number of factors that cannot be directly targeted by standard gene expression profiling. The genetic material of eukaryotic cells is packed into chromatin which provides the compaction and organization of DNA for replication, repair and recombination processes, and is the major epigenetic factor determining the expression of genetic information. Genomic DNA can be methylated and this modification modulates interactions with proteins which change the functional status of genes. Both chromatin structure and transcriptional activity are affected by the processes of replication, recombination and repair. Modified DNA microarray technology could be applied to genome-wide study of epigenetic factors and processes that modulate the expression of genetic information. 
Attempts to use DNA microarrays in studies of chromatin packing state, chromatin/DNA-binding protein distribution and DNA methylation pattern on a genome-wide scale are briefly reviewed in this paper. 0 Completion of the Human Genome Project has opened a new era in studies of functions of cells and organisms. Identification of the 0 thousands of genes forming genomes brings us to the next frontier: elucidation of the functions of these genes and their interactions -- 0 P. Widlak 0 DNA microarrays 0 plate, is typical for regions where active (or potentially active) genes are located. On the other hand, non-active repressed genes are located primarily in regions of packed/condensed chromatin (heterochromatin) (reviewed in: Groudine & Felsenfeld, 2003; Fry & Peterson, 2001). Because of technical limitations, the knowledge about the actual state of chromatin packing/condensation and its relationship to transcriptional activity was until recently restricted to a small number of genes studied in a few model organisms. The DNA microarray technology delivered the unique opportunity to survey the chromatin structure on a genome-wide scale at the resolution of single genes. In fact, modified DNA microarray technology has already been applied to genome-wide study of epigenetic factors and processes that regulate the expression of genetic information (reviewed in: Pollack & Iyer, 2002). This new field could be termed "epigenomics" (Novik et al., 2002). This paper briefly describes attempts to use DNA microarrays in studies of chromatin structure on a genome-wide scale. 0 ized to a DNA microarray, either "standard" or "specialized" (e.g. microarrays of promoter sequences or CpG islands). DNA could be fluorescence labeled either during PCR amplification or without amplification. The most essential step in such "structural" array protocols is initial isolation/fractionation of genomic DNA in a way that would reflect the problem to be analyzed. 
Several principles that lie behind such fractionation procedures are listed below. 0 Differential physicochemical characteristics of nucleoprotein complexes 0 The initial implementation of DNA microarray technology into genome structural research was comparative genomic hybridization (CGH) array, which allowed high resolution analysis of gene copy number (Solinas-Toldo et al., 1997; Pinkel et al., 1998). The primary difference between gene expression microarrays and the CGH array is replacement of RNA samples with DNA ones as a starting material. Two DNA samples are labeled with different fluorophores and co-hybridized to a DNA microarray, and their fluorescence ratio represents the relative DNA copy number. Similar strategies could be applied to study other aspects of genome structure: "test" and "reference" DNA samples that are differentially labeled and co-hybrid- 0 One of such strategies, originally described by Garrard and coworkers (reviewed in: Huang & Garrard, 1988), has been used to fractionate chromatin based on differential solubility of histone H1-containing and histone H1-free nucleosomes. Isolated nuclei were briefly incubated at "physiological" ionic strength with micrococcal nuclease, which specifically cleaves internucleosomal linker DNA. That treatment solubilized 10-20% of the chromatin, which was collected as the first supernatant fraction termed S1. After removal of salt an additional 50-60% of the chromatin was solubilized, which was collected as the second supernatant fraction termed S2. The S1 fraction contained primarily mononucleosomes lacking histone H1 while S2 consisted of histone H1-containing oligonucleosomal particles. Another strategy to fractionate genomic DNA based on specific nucleoprotein complexes that seems to be potentially applicable to DNA microarray analysis would be isolation of nuclear matrix-attached DNA (Sumer et al., 2003). 
The nuclear matrix is a putative skeletal structure isolated from nuclei after removal 0 TECHNICAL REPORTS 0 Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays 0 We developed a new DNA microarray-based technology, called protein binding microarrays (PBMs), that allows rapid, high-throughput characterization of the in vitro DNA binding- site sequence specificities of transcription factors in a single day. Using PBMs, we identified the DNA binding-site sequence specificities of the yeast transcription factors Abf1, Rap1 and Mig1. Comparison of these proteins' in vitro binding sites with their in vivo binding sites indicates that PBM-derived sequence specificities can accurately reflect in vivo DNA sequence specificities. In addition to previously identified targets, Abf1, Rap1 and Mig1 bound to 107, 90 and 75 putative new target intergenic regions, respectively, many of which were upstream of previously uncharacterized open reading frames. Comparative sequence analysis indicated that many of these newly identified sites are highly conserved across five sequenced sensu stricto yeast species and, therefore, are probably functional in vivo binding sites that may be used in a condition-specific manner. Similar PBM experiments should be useful in identifying new cis regulatory elements and transcriptional regulatory networks in various genomes. The interactions between transcription factors and their DNA binding sites are an integral part of transcriptional regulatory networks. They control the coordinated expression of thousands of genes during normal growth and in response to external stimuli. Much progress has been made recently in the identification and analysis of mRNA transcript profiles1,2, locations of in vivo binding sites of transcription factors3-6 and protein-protein interactions7-10. But many transcription factors still have unknown DNA binding specificities and regulatory roles. 
Earlier technologies aimed at characterizing DNA-protein interactions are time-consuming and not scalable. Microarray-based readout of chromatin immunoprecipitation (ChIP-chip), or genome-wide location analysis, is currently the most widely used high-throughput method for identifying in vivo genomic binding sites for transcription factors3-6. But some ChIP-chip experiments do not result in significant enrichment of bound fragments in the immunoprecipitated sample. In addition, there may be transcription factors of interest for which a specific antibody is not available or for which the culture conditions or time points that allow its expression and activity are not known. We previously developed a spotted microarray technology that used primer-extended, double-stranded synthetic DNAs to quantify the differences in binding affinities for various DNA binding-sequence variants. This technology allowed us to distinguish proteins with similar binding-site preferences and to determine the binding specificities of proteins with degenerate sequence preference11. Another group recently extended this technology to use surface plasmon resonance12. Although surface plasmon resonance can provide kinetic data, it is not currently scalable to a large number of samples. Here we developed a new in vitro DNA microarray technology for genome-scale characterization of the sequence specificities of DNA-protein interactions. This protein-binding microarray (PBM) technology allows the determination of in vitro binding specificities of individual transcription factors in a single day, by assaying the sequence-specific binding of those individual transcription factors directly to double-stranded DNA microarrays spotted with a large number of potential DNA-binding sites. A DNA-binding protein of interest is expressed with an epitope tag, purified and then bound directly to a double-stranded DNA microarray. 
The PBM is then washed to remove any nonspecifically bound protein and labeled with a fluorophore-conjugated antibody specific for the epitope tag (Fig. 1a). We focused our efforts on the genome of the yeast Saccharomyces cerevisiae because of its usefulness as a model organism for both experimental and computational studies. Binding-site data from PBMs on yeast transcription factors corresponded well with binding-site specificities determined from ChIP-chip. Moreover, comparative sequence analysis of the PBM-derived binding sites indicated that many of the sites bound in PBMs, including some not identified by ChIP-chip, are highly conserved in other sensu stricto yeast genomes and therefore are probably functional in vivo binding sites that potentially are used in a condition-specific manner. Our PBM technology should aid in the annotation of many regulatory proteins whose DNA-binding specificities have not been characterized and in the construction of gene regulatory networks. 0 NUMBER 12 0 DECEMBER 2004 0 [Figure: PBM workflow. Steps: bind epitope-tagged TF to dsDNA microarrays; label with fluorophore-tagged antibody to epitope; scan triplicate microarrays; calculate normalized PBM data. Panel labels: GST, SybrGreen I.] 0 RESULTS PBM experiments As a validation of this approach, we bound CBP-FLAG-Rpn4 fusion protein to microarrays spotted with positive and negative control spots for binding by Rpn4. We labeled the protein-bound array with Cy3-conjugated M2 primary antibody to FLAG (Sigma) and scanned it with a microarray scanner (GSI Lumonics ScanArray). Only the spots that contain good matches to the binding-site motif for Rpn4 have high signal intensity (Supplementary Fig. 1 online). As we previously found that higher signal intensity is generally indicative of higher DNA-protein binding affinity11, this CBP-FLAG-Rpn4 PBM indicates that our PBM technology is successful in identifying sequence-specific transcription factor binding. 
Next, we applied the PBM technology on a genome-wide scale by using whole-genome yeast intergenic arrays in PBM experiments to identify the sequence specificities and target genes of three yeast transcription factors: Abf1, Rap1 and Mig1. Abf1 has a zinc-finger DNA-binding domain, binds origins of replication and regulates ribosome synthesis. Rap1 binds DNA through a Myb-like helix-turn-helix DNA-binding domain and, in addition to regulating ribosome synthesis13, regulates telomere length and expression at the silent mating-type loci HML and HMR14. Mig1 has a zinc-finger DNA-binding domain and is involved in the repression of glucose-repressed genes15. We used Abf1, Rap1 and Mig1, dually tagged at the N terminus with glutathione S-transferase (GST) and His6, in PBM experiments using microarrays spotted with essentially all the intergenic regions in the yeast genome3. The washed, protein-bound microarrays were labeled with Alexa 488-conjugated antibody to GST (Molecular Probes) and scanned with a microarray scanner. The microarray TIF images were quantified using GenePix Pro version 3.0 software. A whole-genome yeast intergenic microarray that was used in a PBM experiment with Rap1 is shown in Figure 1b,c. Negative control PBMs did not show sequence-specific DNA binding (Supplementary Fig. 2 online). For each transcription factor, experiments were done in triplicate. We found that the PBM data were highly reproducible, with most spots having a coefficient of variation (i.e., s.d. divided by the mean) <0.3 (Supplementary Fig. 3 online). To normalize the PBM data by relative DNA concentration, we stained separate microarrays from the same print run with SybrGreen I (Molecular Probes), which is specific for double-stranded DNA. The distribution of the log ratios of mean PBM to mean SybrGreen I signal intensities for the set of triplicate Rap1 PBM experiments is shown in Figure 2a. 
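The two per-spot quantities described above, the coefficient of variation across replicates and the log ratio of mean PBM to mean SybrGreen I signal, can be sketched as follows. This is an illustration only: the function names and intensity values are hypothetical, the sample standard deviation is assumed, and base 2 is assumed for the logarithm because the text does not state the base.

```python
import math
import statistics


def coefficient_of_variation(replicates):
    """s.d. divided by the mean, as defined in the text (sample s.d. assumed)."""
    return statistics.stdev(replicates) / statistics.mean(replicates)


def log_ratio(pbm_replicates, sybr_replicates):
    """Log of mean PBM over mean SybrGreen I signal (base 2 is an assumption)."""
    return math.log2(statistics.mean(pbm_replicates) / statistics.mean(sybr_replicates))


# Hypothetical intensities for a single spot; not data from the study.
pbm = [1200.0, 1000.0, 1100.0]
sybr = [550.0, 500.0, 600.0]

cv = coefficient_of_variation(pbm)   # about 0.09, below the 0.3 level quoted above
ratio = log_ratio(pbm, sybr)         # 1.0, since the mean PBM signal is twice the mean SybrGreen I signal
```

Dividing by the SybrGreen I signal before taking the log is what makes spots with different amounts of printed DNA comparable.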
The spots on the left, whose distribution is fit well by a Gaussian function, are bound nonspecifically by the transcription factor. Conversely, the heavy upper tail of the distribution corresponds to spots that are bound specifically by the transcription factor. For each spot, we calculated a P value for specific binding based on the magnitude of its log ratio relative to the standard deviation of the Gaussian distribution. The numbers of unique spots that pass a P-value threshold of 0.05, 0.01 or 0.001 for t 0 Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects 1 George C. Tseng1, Min-Kyu Oh2, Lars Rohlin2, James C. Liao2 and Wing Hung Wong1,3,* 0 1Department 2Department 0 ABSTRACT We consider the problem of comparing the gene expression levels of cells grown under two different conditions using cDNA microarray data. We use a quality index, computed from duplicate spots on the same slide, to filter out outlying spots, poor quality genes and problematical slides. We also perform calibration experiments to show that normalization between fluorescent labels is needed and that the normalization is slide dependent and non-linear. A rank invariant method is suggested to select nondifferentially expressed genes and to construct normalization curves in comparative experiments. After normalization the residuals from the calibration data are used to provide prior information on variance components in the analysis of comparative experiments. Based on a hierarchical model that incorporates several levels of variations, a method for assessing the significance of gene effects in comparative experiments is presented. The analysis is demonstrated via two groups of experiments with 125 and 4129 genes, respectively, in Escherichia coli grown in glucose and acetate. 
INTRODUCTION Although cDNA microarrays have been used for global monitoring of gene expression in many areas of biomedical research (1), methods for analysis of the resulting data are only beginning to be addressed systematically (2-7). We have performed a series of calibration and comparative experiments to address several important issues in data analysis and study design of microarray experiments. In each calibration experiment we purified total RNA from Escherichia coli cells and divided the sample into two aliquots for labeling by Cy3 and Cy5. The two separately labeled samples were then pooled and subdivided into hybridization solutions for hybridization to multiple slides. In the first group of experiments each slide had 125 E.coli genes multiply spotted (4 spots/gene) on it, while in the second each slide had 4129 genes singly spotted. The first and second groups of experiments will be called the 125 and 4129 gene projects, respectively, hereafter. Several levels of replication are embedded in the design of these calibration experiments and the resulting data provide information on the relative importance of variations due to spots, labels and slides. Based on this information, we formulate an approach to the analysis of comparative experiments where the samples to be compared are differentially labeled. The main components are as follows. (i) Detect and filter out poor quality genes on a slide using measurements from multiple spots. This procedure is not applicable in singly spotted designs. (ii) Perform slide-dependent non-linear normalization of the log ratios of the two channels. (iii) Apply hierarchical model-based analysis to the normalized log ratio scale, where assessment of the significance of gene effects is aided by statistical information obtained from calibration experiments, if they are available. Details of the experiments are given below and the analysis methodology is developed, justified and illustrated. 
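To make the idea behind component (ii) more concrete, the sketch below shows one simple way of picking genes whose intensity ranks agree between the two channels; such genes are plausible candidates for fitting a normalization curve. It is a simplified, single-pass illustration and not the authors' procedure: the function names, the tolerance of 10% of the gene count and the intensities are all assumptions.

```python
def ranks(values):
    """Rank position of each value (0 = smallest); ties broken by original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def rank_invariant_genes(cy3, cy5, tolerance=0.1):
    """Indices of genes whose Cy3 and Cy5 ranks differ by at most tolerance * n.

    The tolerance value is a hypothetical choice made for illustration.
    """
    n = len(cy3)
    r3, r5 = ranks(cy3), ranks(cy5)
    return [i for i in range(n) if abs(r3[i] - r5[i]) <= tolerance * n]


# Hypothetical intensities for ten genes; genes 2 and 8 change rank strongly.
cy3 = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
cy5 = [110, 210, 950, 410, 520, 610, 720, 805, 120, 1010]

selected = rank_invariant_genes(cy3, cy5)
print(selected)  # [0, 1, 3, 4, 5, 6, 7, 9]
```

Only the selected genes would then be used to estimate the slide-specific, non-linear relationship between the channels, so that truly differentially expressed genes do not distort the curve.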
A discussion of other important issues, such as why a two label design is useful and whether gene-label interaction is an important consideration, is also provided. MATERIALS AND METHODS Preparation of the DNA array In the 125 gene project, to ensure uniform quality and quantity of the DNA probes, we constructed a gene library consisting of 125 genes each cloned into pBluescript II KS+ (Stratagene, La Jolla, CA) as previously reported (8,9). These genes are involved in various aspects of E.coli physiology, including glycolysis, the TCA cycle, the pentose phosphate pathway, fermentation pathways, the heat shock response, major biosynthetic pathways and the respiratory system. The gene probes used in microarray construction were obtained by PCR amplifying the inserted genes using pBluescript II KS+specific primers (Genosys, The Woodlands, TX), 0 5-GGCCGCTCTAGAACTAGTGGAT-3 and 5-CTCGAGGTCGACGGTATCGATA-3. PCR products were precipitated with ethanol and redissolved in 15 µl of 350 mM sodium bicarbonate/carbonate buffer, pH 9.0. Each gene was spotted four times on a slide to analyze the reliability and variability. In the 4129 gene project we performed the PCR reactions using Genosys E.coli ORFmers (the entire genome of E.coli) and an Eppendorf MasterTaq kit (Westbury, NY). Among 4290 primers, 161 failed to make products or proper sized products. The 4129 PCR products, representing 96% of the predicted open reading frames (10), were precipitated with propanol twice and then dissolved in 10 µl of 350 mM sodium bicarbonate/ carbonate buffer, pH 9.0. They were arrayed with single spotting on each slide. All resulting slides with DNA probes underwent post-processing according to the protocol suggested by Eisen and Brown (11). 
RNA purification and labeling Escherichia coli strain MC4100 [F- araD139 (argF-lac) U169 rpsL150 relA1 flb5301 deoC1 ptsF25 rbsR] was cultured in shake flasks using M9 minimal medium (12) containing either 0.5% glucose or acetate as carbon source supplemented with 125 mg/l (w/v) arginine. When the optical density of the culture reached 0.4-0.6 at 550 nm, total RNA was purified from 1 × 10^9 cells using the RNeasy Midi kit from Qiagen (Valencia, CA). The resulting RNA solution was incubated at 37°C with 100 U DNase (Gibco BRL, Rockville, MD) and 40 U RNasin RNase inhibitor (Promega, Madison, WI) for 30 min, extracted with phenol/chloroform and then precipitated with ethanol. After dissolution in 10-20 µl of RNase-free water, 30 µg total RNA was labeled with either Cy3 or Cy5 during reverse transcription. The reverse transcription cocktail included 200 U Superscript RNase H- reverse transcriptase (Gibco BRL), E.coli gene-specific C-terminal primers (Genosys), 0.5 mM dATP, dTTP and dGTP, 0.2 mM dCTP and 0.1 mM Cy3- or Cy5-labeled dCTP (Amersham Pharmacia, Piscataway, NJ). After reverse transcription the RNA was degraded by adding 5 µl of 1 N NaOH and incubating at 65°C for 40 min. The resulting cDNA, labeled with either Cy3 or Cy5, was diluted with 60 µl of TE buffer, pH 8.0, and then mixed together. The labeled cDNA mixture was then concentrated to 1-2 µl using Micron50 (Millipore, Bedford, MA). Hybridization and scanning The concentrated Cy3- and Cy5-labeled cDNA was resuspended in 10 µl of hybridization solution, consisting of 50% formamide, 3x SSC, 1% SDS, 5x Denhardt's solution, 0.1 mg/ml salmon sperm DNA and 0.05 mg/ml yeast total RNA. Hybridization solution without 5x Denhardt's solution was also used for comparison. The labeled cDNA was denatured at 95°C for 3 min then quickly chilled on ice. The cDNA was then placed on the slide and covered by a coverslip.
The slide was assembled with a hybridization chamber (Corning, Charlotte, NC) and hybridized for 14-20 h at 42°C. The hybridized slide was washed in 2x SSC, 0.1% SDS for 5 min at room temperature and then 0.2x SSC for 5 min prior to scanning. After drying the hybridized slides were scanned with an Affymetrix 418 scanner (Santa Clara, CA) and the scanned images analyzed with the software program Imagene (Biodiscovery, Santa Monica, CA). The median intensities of 0 spot areas were calculated and imported into the program S-Plus (MathSoft, Cambridge, MA). Description of experiments We performed four calibration experiments and two comparative experiments in the 125 gene project, two calibration and two comparative ones in the 4129 gene project. Calibration experiments used the same mRNA pool divided into two aliquots and labeled separately with two different dyes in order to investigate variations in this technology. Some calibration experiments used genes from E.coli grown in acetate, while the others used E.coli grown in glucose. The comparative experiments labeled mRNA from E.coli grown in acetate with Cy3 and mRNA from E.coli grown in glucose with Cy5. Different slides in the same experiment were hybridized with the same pool of labeled cDNA and different experiments in the same project redid the whole experiment with the same pool of mRNA. We will use C, R and S to denote the calibration experiment, comparative (real) experiment and slide, respectively, and suffix numbers to indicate the sequence in the two projects. For example, C3S2 indicates slide 2 in the third calibration experiment and R1S2 slide 2 in the first comparative experiment. Some slides did not use Denhardt's solution during hybridization while others did. Detailed information concerning experimental design is listed in Table 1. RESULTS AND DISCUSSION Outline of analysis procedure The steps of the proposed analysis are herein briefly described. 
The motivation and justification of each step will be given in subsequent sub-sections. To analyze a calibration experiment: (i) compute a quality measure for ea 0 TRENDS in Biochemical Sciences 0 Standardization of protocols in cDNA microarray analysis 1 Vladimir Benes and Martina Muckenthaler ´ 0 European Molecular Biology Laboratory, Meyerhofstrasse 1 D-69117 Heidelberg, Germany 0 TRENDS in Biochemical Sciences 0 Here, we list the points to consider during a cDNA microarray experiment starting from gene, to spot, to insight: Genome-wide expression profiling vs specialized microarrays Selection and sequence verification of cDNA samples 0 Background cut-off 1 0 Establishment of the technological microarray platform Synthesis and purification of gene fragments Surface chemistry Spotting conditions Array design Preparation of the experimental and the reference sample High quality RNA extraction from cultured cells, tissues, patient biopsies, laser capture microdissection taking into consideration that the experimental and the reference samples must be treated identically Choice of methodology for the synthesis of fluorescent-labelled cDNA Yield of purified total RNA Accuracy, sensitivity, background noise Labour intensity and working time Financial aspects Implementation of controls (non-specific background, normalization and ratio) 0 Background cut-off 0 Number and type of replicates (technical, biological) Data acquisition and evaluation Data normalization (global, intensity-dependent) 0 Interpretation of the microarray data (comparison, clustering, selforganizing maps) Independent validation of the data (quantitative reverse transcriptase real-time PCR, Northern blot, in situ hybridization) 0 numerous variations that can occur at each step (Box 1). 
Generally, experimental and systematic variations can be distinguished: experimental variability can be controlled by careful experimental design [8] and through a sufficient number of experimental repeats; systematic variations have to be addressed by controls on the array. A possible source for systematic variations can be the irregular deposition of PCR amplified cDNAs on the glass surface by different printing pins (including `carry-over' of the samples between adjacent sample wells caused by inferior washing of the pins) or biases associated with different fluorescent dyes. It has been recognized that fluorescent 0 dyes such as Cy3 and Cy5 exhibit different quantum yields and are differentially sensitive to photobleaching [9,10]. Depending upon the type of the activated surface, these dyes also show varying background levels (E. Furlong, pers. commun.). Although this phenomenon has not been thoroughly studied, it has been indicated that the direct incorporation of Cy3 and Cy5 modified-nucleotide analogues into the cDNA might introduce sequence-specific artefacts [11,12]. This is likely to be caused by the variable and differing rates by which these bulky nucleotide analogues are in 0 Significance analysis of microarrays applied to the ionizing radiation response 1 Virginia Goss Tusher*, Robert Tibshirani, and Gilbert Chu* 0 Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. 
For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation. 0 Microarray Hybridization. Each gene in the microarray was represented by 20 oligonucleotide pairs, each pair consisting of an oligonucleotide perfectly matched to the cDNA sequence, and a second oligonucleotide containing a single base mismatch. Because gene expression was computed from differences in hybridization to the matched and mismatched probes, expression levels were sometimes reported by the GENECHIP ANALYSIS SUITE software as negative numbers. 0 Northern Blot Hybridization. Total RNA (15 µg) was resolved by agarose gel electrophoresis, transferred to a nylon membrane, and hybridized to specific radiolabeled DNA probes, which were prepared by PCR amplification. 0 DNA microarrays contain oligonucleotide or cDNA probes for measuring the expression of thousands of genes in a single hybridization experiment. Although massive amounts of data are generated, methods are needed to determine whether changes in gene expression are experimentally significant. Cluster analysis of microarray data can find coherent patterns of gene expression (1) but provides little information about statistical significance.
Methods based on conventional t tests provide the probability (P) that a difference in gene expression occurred by chance (2, 3). Although P = 0.01 is significant in the context of experiments designed to evaluate small numbers of genes, a microarray experiment for 10,000 genes would identify 100 genes by chance. This problem led us to develop a statistical method adapted specifically for microarrays, Significance Analysis of Microarrays (SAM). SAM identifies genes with statistically significant changes in expression by assimilating a set of gene-specific t tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene. Genes with scores greater than a threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements. The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set. To demonstrate its utility, SAM was used to analyze a biologically important problem: the transcriptional response of lymphoblastoid cells to ionizing radiation (IR). Materials and Methods 0 Results RNA was harvested from wild-type human lymphoblastoid cell lines, designated 1 and 2, growing in an unirradiated state (U) or in an irradiated state (I) 4 h after exposure to a modest dose of 5 Gy of IR. RNA samples were labeled and divided into two identical aliquots for independent hybridizations, A and B. Thus, data for 6,800 genes on the microarray were generated from eight hybridizations (U1A, U1B, U2A, U2B, I1A, I1B, I2A, and I2B). We scaled the data from different hybridizations as follows. A reference data set was generated by averaging the expression of each gene over all eight hybridizations.
The data for each hybridization were compared with the reference data set in a cube root scatter plot. We chose the cube root scatter plot because it resolved the vast majority of genes that are expressed at low levels and permitted the inclusion of negative levels of expression that are sometimes generated by the GENECHIP software. A linear least-squares fit to the cube root scatter plot was then used to calibrate each hybridization. After scaling, a linear scatter plot was generated for average gene expression in the four A aliquots (U1A, I1A, U2A, and I2A) vs. the average in the four B aliquots (U1B, I1B, U2B, and I2B), a partitioning of the data that eliminates biological changes in gene expression (Fig. 1A). The linear scatter plot confirmed that the data were generally reproducible but failed to resolve genes expressed at low levels. Better resolution of these genes was achieved by the cube root scatter plot (Fig. 1B), which revealed three salient features: the large percentage of genes (24%) assigned negative levels of expression, the large percentage of genes with low levels of expression, and the low signal-to-noise ratio at low levels of expression. To assess the biological effect of IR, a scatter plot was generated for average gene expression in the four irradiated states vs. the four unirradiated states (compare Fig. 1 B and C). A few of the potentially significant changes in gene expression are indicated by arrows in Fig. 1C, but the effect was not easily quantified, and a method was needed to identify changes with statistical confidence. 0 Abbreviations: SAM, significance analysis of microarrays; FDR, false discovery rate; IR, ionizing radiation; FWER, family-wise error rate. 0 GM08925 (Coriell Cell Repositories, Camden, NJ) were seeded at 2.5 × 10^5 cells/ml and exposed to IR 24 h later. RNA was isolated, labeled, and hybridized to the HUGENEFL GENECHIP microarray according to manufacturer's protocols (Affymetrix, Santa Clara, CA). 0 Preparation of RNA.
Human lymphoblastoid cell lines GM14660 and 0 The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact. 0 where $\sum_m$ and $\sum_n$ are summations of the expression measurements in states I and U, respectively, $a = (1/n_1 + 1/n_2)/(n_1 + n_2 - 2)$, and $n_1$ and $n_2$ are the numbers of measurements in states I and U (four in this experiment). To compare values of d(i) across all genes, the distribution of d(i) should be independent of the level of gene expression. At low expression levels, variance in d(i) can be high because of small values of s(i). To ensure that the variance of d(i) is independent of gene expression, we added a small positive constant s0 to the denominator of Eq. 1. The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s0 was chosen to minimize the coefficient of variation. For the data in this paper, this computation yielded s0 = 3.3. Scatter plots of d(i) vs. s(i) are shown in Fig. 2. The scatter plot for relative difference between states I and U is shown in Fig. 2A. By contrast, the scatter plot for relative difference between cell lines 1 and 2 shows more marked changes in Fig. 2B. These relative differences exceeded random fluctuations in the data, as measured by the relative difference between hybridizations A and B in Fig. 2C. Although the relative difference computed from hybridizations A and B provided a control for random fluctuations, additional controls were needed to assign statistical significance to the biological effect of IR. Instead of performing more experiments, which 0 Tusher et al. 0 April 24, 2001 0 where xI(i) and xU(i) are defined as the average levels of expression for gene (i) in states I and U, respectively.
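The two displayed equations that these definitions refer to did not survive extraction. Read together, the definitions (the state averages, the scatter s(i), the constant a, and the constant s0 added to the denominator of Eq. 1) are consistent with the following form. This is a reconstruction from the surrounding text under that assumption, not a verbatim copy of the printed equations; the bars on the averages are notation introduced here.

```latex
d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}
\tag{1}

s(i) = \sqrt{\, a \left\{ \sum_m \left[ x_m(i) - \bar{x}_I(i) \right]^2
                        + \sum_n \left[ x_n(i) - \bar{x}_U(i) \right]^2 \right\} }
\tag{2}

a = \frac{1/n_1 + 1/n_2}{n_1 + n_2 - 2}
```

With four measurements per state, as in this experiment, a = (1/4 + 1/4)/6 = 1/12.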
The ``gene-specific scatter'' s(i) is the standard deviation of repeated expression measurements: 0 Our approach was based on analysis of random fluctuations in the data. In general, the signal-to-noise ratio decreased with decreasing gene expression (Fig. 1). However, even for a given level of expression, we found that fluctua 0 Nonparametric methods for identifying differentially expressed genes in microarray data 1 Olga G. Troyanskaya 1, Mitchell E. Garber 1, Patrick O. Brown 2, 3, David Botstein 1, and Russ B. Altman 1, 0 Department 0 BACKGROUND DNA microarray technology allows for the monitoring of expression levels of thousands of genes under a variety 0 of conditions. A major question in microarray studies is how to select genes associated with specific physiological states or clinical parameters-genes whose expression in a tumor sample is related to a specific tumor subtype or patient survival. In a clinical context, such differentially expressed genes are often referred to as clinical markers. Clinical markers can form the basis for diagnostic tests, particularly if they can be assayed in reliable and inexpensive ways. Identification of clinical markers may lead to improved diagnosis and treatment guidance, early disease detection, and clinical outcomes prediction. While routine clinical use of microarrays is still not feasible, they may provide methods for fast, accurate, and systematic identification of biomedical markers from the data generated by gene expression experiments. Clinicians can then assay the expression of one or a few such markers by immunohistochemistry or quantitative PCR (Kim, 2001). Moreover, relating specific groups of genes with specific biological correlates is a critical step toward understanding the underlying molecular mechanisms and identifying novel therapeutic targets. 
The most commonly used tools for identification of differentially expressed genes include qualitative observation (usually following some form of clustering of expression patterns), heuristic rules, and model-based probabilistic analysis. The simplest heuristic is setting cutoffs for gene expression changes over a background expression level. In an early gene expression study, Iyer et al. (1999) sought genes whose expression changed by a factor of 2.20 or more in at least two of the experiments. DeRisi et al. (1997) looked for 2-fold induction of gene expression compared to baseline. Xiong et al. (2001) identified indicator genes based on classification errors by feature wrappers (including linear discriminant analysis, logistic regression, and support vector machines). Although this approach is not based on specific data modeling assumptions, the results are affected by assumptions behind the specific classification methods used for scoring. 0 Nonparametric identification methods for differentially expressed genes 0 sum test) with heuristic-based inference. We evaluate the performance of these methods on generated expression data as well as on real biological data sets. 0 METHODS Experimental methods We implemented and evaluated three methods for modelfree identification of differentially expressed genes in microarray analysis: a nonparametric t-test, a Wilcoxon rank sum test, and a heuristic idealized discriminator method. The evaluation included applications to both simulated data and real biological data. By using simulated data, we could first evaluate the methods on data sets with known differentiator genes in the context of different noise levels. The simulated data were generated to create plausible distributions of microarray expression values while not perfectly matching any particular data set. 
From qualitative comparisons of distribution histograms and Quantile-Quantile plots of several biological data sets (Alizadeh et al., 2000; Garber et al., 2001; Gasch et al., 2000), we found that normally generated data with uniform noise generated from uniform distribution in the range of U(-0.01, 0.01) to U(-0.1, 0.1) approximated the true distributions reasonably well. Such an approximate fit to biological data is similar to the differences in data distributions between real microarray experiments. To test the methods, we generated ten simulated data sets (5000 genes by 40 experiments each) at each of the six noise levels (U(-0.01, 0.01), U(-0.05, 0.05), U(-0.1,0.1), U(-0.5,0.5), U(-0.75,0.75), U(-1.0,1.0)). Increasing noise levels in the data sets allowed us to test robustness of the methods on very noisy data. Each data set included twenty predictor genes (markers), whose values were generated from two different normal distributions: group 1 (20 experiments) and group 2 (20 experiments). The rest of the genes, for which all values were generated from one normal distribution per gene, were considered nonpredictors. The means of each normal distribution were generated from a random normal distribution with a mean of 0 and standard deviation of 0.25 for nonpredictors and standard deviation of 0.5 for predictors. Each of the methods was then applied to each simulated data set, and true positive rate (TPR) and false positive rate (FPR) were calculated according to the following formulae. 0 TPR = number of predictors identified 0 A spline function approach for detecting differentially expressed genes in microarray data analysis 1 Wenqing He 0 Prossermen Center for Health Research, Samuel Lunenfeld Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada M5G 1X5 0 Microarray technology has been increasingly used in medical studies such as cancer research. 
This technology makes it possible to measure the expressions of thousands of genes simultaneously under a variety of conditions. The objectives of microarray studies often include finding genes which have different expressions between conditions and making predictions on outcomes such as tumor types in cancer research. In most cases, the predictions are based on those genes that are differentially expressed, and therefore, detection of differentially expressed genes plays an important role. Commonly used methods for identification of differentially expressed genes include qualitative observations, heuristic rules such as cutoff settings, and model-based probability analyses. Iyer et al. (1999) selected genes whose expression changed by a factor of at least 2.20 relative to baseline in at least two arrays. DeRisi et al. (1997) selected genes with at least 2-fold changes over their baseline expression. These heuristic rules focus only on the magnitude of expression changes; variation in gene expression is not accounted for. Moreover, the cutoff values for identifying differentially expressed genes are arbitrary. Thus, these methods have not been used widely. Several probability approaches have been proposed to detect differentially expressed genes. One intuitive method is the two sample t-test. Two sample t-tests select genes that have significantly different means between conditions. One problem with two sample t-tests is that some genes with small differences between conditions may be selected because of their very small within group variation. To correct the effect of the small variance, Tusher et al. (2001) proposed a modified t-statistic for which a constant is added to the denominator of the traditional t-statistic.
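The effect of adding a constant to the denominator can be illustrated with a small sketch. It assumes the pooled, equal-variance form of the t-statistic denominator; the function names, the two example genes and the value s0 = 0.5 are hypothetical and chosen only for illustration, not taken from any of the cited studies.

```python
import math
import statistics


def pooled_scatter(group_a, group_b):
    """Pooled standard error of the difference in means (equal-variance t denominator)."""
    n1, n2 = len(group_a), len(group_b)
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    # Sum of squared deviations of each measurement from its own group mean.
    ss = sum((x - mean_a) ** 2 for x in group_a) + sum((x - mean_b) ** 2 for x in group_b)
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    return math.sqrt(a * ss)


def modified_t(group_a, group_b, s0=0.0):
    """Mean difference divided by (scatter + s0); s0 = 0 gives the ordinary t-statistic."""
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    return diff / (pooled_scatter(group_a, group_b) + s0)


# Hypothetical log-expression values: (condition 1, condition 2) for two genes.
quiet_gene = ([1.00, 1.01, 0.99, 1.00], [1.05, 1.06, 1.04, 1.05])  # tiny shift, tiny variance
strong_gene = ([1.0, 1.4, 0.7, 1.1], [3.0, 3.5, 2.6, 3.1])         # large shift, ordinary variance

for name, (cond1, cond2) in (("quiet", quiet_gene), ("strong", strong_gene)):
    plain = modified_t(cond2, cond1)            # ordinary t-statistic
    damped = modified_t(cond2, cond1, s0=0.5)   # constant added to the denominator
    print(name, round(plain, 2), round(damped, 2))
```

With these made-up values the ordinary statistic ranks the low-variance gene slightly higher (about 8.7 versus 8.5), whereas with s0 = 0.5 the ordering reverses and the gene with the large shift stands out. How the constant should be chosen for real data is a separate question that this sketch does not address.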
As microarray data commonly contain various types of variation, the normality assumption for expression measurements is often not adequate (Hunter et al., 2001), and therefore normal-distribution-based inference may not be valid. In this context, non-parametric methods are more attractive because no specific distributions of data are required. Dudoit et al. (2002) used a non-parametric t-test with a corrected family-wise error rate to detect differentially expressed genes. Tusher et al. (2001) discussed significance analysis of microarrays (SAM), in which repeated measurements are permuted to estimate the false discovery rate of differentially expressed genes. Efron and Tibshirani (2002) considered a Wilcoxon statistic and estimated the associated distributions using an empirical Bayes approach. Pan et al. (2002) applied a mixture normal approach to a t-type statistic when the sample size under each condition is even. Zhao and Pan (2003) further proposed a modified statistic which overcomes the disagreement of the null statistic and test statistic under the null hypothesis (no differential expressions here), and their method can be used for data without even numbers of samples. 0 M-spline for detecting differentially expressed genes 0 The basic idea of those non-parametric approaches is to construct a null statistic and a test statistic which have the same distribution under the null hypothesis; deviation of the distribution of the test statistic under the alternative hypothesis is then used to identify differentially expressed genes. The distributions of the null and test statistics under the null and alternative hypotheses are estimated non-parametrically. Although non-parametric methods have the advantage of not requiring a specific distribution form, there are some drawbacks. The inferential procedures based on non-parametric methods are generally complex (Efron et al., 2000; http://www-stat.stanford.edu/tibs/research.html).
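The permutation idea mentioned above, in which array labels are shuffled to estimate how many genes would be called by chance, can be sketched in a few lines. This is a deliberately simplified illustration and not the full procedure of any cited method: the statistic is a plain difference in means, one relabelling is shared by all genes, and the function names, threshold, number of permutations and simulated data are all assumptions made for the example.

```python
import random
import statistics


def mean_difference(values, n1):
    """Mean of the condition-2 columns minus mean of the first n1 (condition-1) columns."""
    return statistics.fmean(values[n1:]) - statistics.fmean(values[:n1])


def permutation_fdr(data, n1, threshold, n_perm=200, seed=0):
    """Estimate the false discovery rate for genes with |statistic| >= threshold.

    data: list of per-gene lists of expression values (first n1 entries = condition 1).
    Returns (number of genes called, estimated FDR).
    """
    rng = random.Random(seed)
    n = len(data[0])
    called = sum(abs(mean_difference(row, n1)) >= threshold for row in data)
    false_calls = []
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)  # one random relabelling of arrays, shared by all genes
        permuted = ([row[j] for j in order] for row in data)
        false_calls.append(sum(abs(mean_difference(row, n1)) >= threshold for row in permuted))
    expected_false = statistics.fmean(false_calls)
    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, fdr


# Simulated data (hypothetical): 200 genes, 4 + 4 arrays, first 10 genes truly shifted.
rng = random.Random(1)
n1 = n2 = 4
data = []
for g in range(200):
    shift = 3.0 if g < 10 else 0.0
    row = [rng.gauss(0.0, 0.5) for _ in range(n1)] + [rng.gauss(shift, 0.5) for _ in range(n2)]
    data.append(row)

called, fdr = permutation_fdr(data, n1, threshold=2.0)
print(called, round(fdr, 3))
```

Because the truly shifted genes also take part in the permutations, relabellings that happen to resemble the real grouping still produce large differences, so an estimate of this kind tends to be conservative.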
Non-parametric estimates may not be as efficient as the parametric estimates, and therefore the tests for differentially expressed genes may not have adequate power. The non-parametric Wilcoxon test, for example, is rank based and does not make use of all available information for genes; thus it may have low power to identify differentially expressed genes (Thomas et al., 2001). Furthermore, as pointed out in Pan (2002), the Wilcoxon test is not applicable when the expression levels of a gene may have unequal variances under the two experimental conditions. In this paper, we propose to use a weakly parametric approach to characterize the density functions for both differentially and non-differentially expressed genes. Specifically we consider a spline function approach. This approach is widely used in survival analysis to model the hazard functions (e.g. He and Lawless, 2003). Its appeal is that no strong assumptions about the underlying distributions are needed, and the inferences are likelihood based and therefore straightforward. We use maximum likelihood methods to estimate the parameters involved in the density functions and the prior probability of differentially expressed genes. The posterior probability is then used to identify differentially expressed genes. The proposed method is applied to a real data set, and the results are compared with those obtained by some existing methods. A simulation study is also conducted to assess the performance of the proposed method. We end with concluding remarks. 0 The primary interest here is to detect genes which are differentially expressed under the two conditions. In many applications, the focus is on identifying genes with different mean expressions. For gene i, i = 1, . . . , N, assume that gene expressions follow the model $Y_{ij} = \mu_{i1} + \epsilon_{ij}$ and $Y_{ik} = \mu_{i2} + \epsilon_{ik}$, 0 METHODS Microarray data 0 Let the matrix [Yij] denote a microarray data set of gene expressions, i = 1, . . . , N, j = 1, . . . 
, n, with rows being genes and columns being arrays (samples). Without loss of generality, consider two different experimental conditions, and let expression measurements for microarrays under conditions 1 and 2 be indexed by j = 1, . . . , n1, and j = n1 + 1, . . . , n1 + n2, respectively, where n1 + n2 = n. The entries of the matrix may be the log ratios in cDNA microarrays, or summary differences of the perfect match (PM) and mismatch (MM) scores from oligonucleotide arrays. 0 where $\mu_{i1}$ and $\mu_{i2}$ are the mean expressions of gene i under conditions 1 and 2, respectively, and $\epsilon_{ij}$, j = 1, . . . , n1, and $\epsilon_{ik}$, k = n1 + 1, . . . , n1 + n2, are independent random errors with mean 0 and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. $\sigma_1^2$ and $\sigma_2^2$ are not necessarily equal. It is a common assumption that random errors are symmetric. Note that the normality assumption is not made here. It is of interest to test the null hypothesis $H_0: \mu_{i1} = \mu_{i2}$, i.e. whether or not gene i is differentially expressed under the two conditions. This may appear to be a problem of the two-sample comparison. However, the characteristics of microarray data limit the direct application of traditional statistical tests. The total number N of genes is large, usually several thousand or more, whereas the numbers of arrays (n1 and n2 here) are usually small (<100; in some cases, the array numbers are <20). These features make traditional t-tests or non-parametric rank-based tests infeasible (Pan, 2003). Furthermore, when multiple comparisons are needed, it is difficult to specify various significance levels. To utilize the large size of N and information between genes, a plausible way is to select differentially expressed genes based on the distributions of some statistics related to all gene expression levels {Yi1, . . . , Yin1} and {Yi,n1+1, . . . , Yin} for i = 1, . . . , N. For gene i let $z_i$ and $Z_i$ be statistics that have the same distribution under the null hypothesis $H_0: \mu_{i1} = \mu_{i2}$.
Under the alternative hypothesis $H_a: \mu_{i1} \neq \mu_{i2}$, however, the distribution of $Z_i$ deviates from its distribution under the null hypothesis, whereas the distribution of $z_i$ does not change. $z_i$ and $Z_i$ are often called the null and test statistics. Several authors discussed the formulation of such summary statistics. The Wilcoxon statistic was discussed in Efron and Tibshirani (2002). Pan et al. (2002) considered 0 Microarrays permit the analysis of gene expression, DNA sequence variation, protein levels, tissues, cells and other biological and chemical molecules in a massively parallel format. Robust microarray manufacture, hybridization, detection and data analysis technologies permit novice users to adapt this exciting technology readily, and more experienced users to push the boundaries of discovery. 0 Trends in microarray analysis 0 [Figure: sample preparation workflows. Panel labels: purify mRNA; label cDNA; mix; hybridize and wash; scan and superimpose; composite image.] 0 electricity, organic vapors and biological contaminants can improve the quality of microarray manufacture in all settings, ranging from the smallest research laboratories to the largest commercial facilities (see Supplementary Note online). Fluorescent probes for expression profiling are typically prepared from total RNA or messenger RNA (mRNA) by reverse transcription, although many different labeling strategies are available. Methods that use T7 RNA polymerase produce large amounts of amplified RNA and are widely used to generate probes from small amounts of sample. Because amplified RNA is produced by linear amplification with T7 polymerase, population skewing and the loss of quantitation are minimal. Control and experimental samples can be labeled separately with fluors that have non-overlapping emission spectra, including cyanine, Alexa, and other fluorescent derivatives. Two samples labeled with different fluors can be hybridized to a single chip to derive absolute and comparative expression information in the two samples. 0 to allow their import into software programs for data mining and modeling (ref. 24). Transformed and normalized data are represented and modeled using a variety of software tools, including scatter plots, principal component analysis (PCA), cluster diagrams, self-organizing maps (SOMs), neural networks and other algorithms (refs. 25-29). Although the mathematical and statistical basis of the computational tools is complex, each endeavors to provide functionally relevant relationships between genes and gene products, assign putative function to unknown sequences, identify potential disease markers, elucidate the biochemical basis of drug and hormone action, and so forth (see Supplementary Table F online). The experimental aspects of microarray analysis are linked to data extraction, analysis and modeling in the microarray workflow process (Fig. 2). Intranets and the Internet, together with relational database warehouses, figure centrally in generating, mining, storing and retrieving microarray data (Fig. 2). Downloadable software (`shareware') packages are available free of charge to microarray researchers worldwide (see Supplementary Note online). Forums on microarray data analysis, such as the Critical Assessm 0 MIAME, we have a problem 1 Robert Shields 0 Trends in Genetics, Elsevier, 84 Theobald's Road, London, UK, WC1X 8RR 0 consistency is improved because the same cross-hybridizing sequences are then detected by all platforms [3]?
As if the problems associated with different platforms were not enough, a recent trio of articles [4-7] showed not only inconsistencies across platforms but also inconsistencies among laboratories that were using the same platform, and even using the same RNA samples. Matters were improved by the use of common protocols for RNA work-up and also, and the importance of this is not widely appreciated, common methods of data handling and analysis. If scientists are to create gene expression databases that incorporate results from multiple laboratories, it is simply not good enough to adhere to the minimal information about microarray experiment (MIAME) guidelines, which only focus on the documentation of experimental details, while failing to address real problems with the technology and how it is used. Equally depressing is the rush to apply microarrays to obtain `gene signatures' to aid disease diagnosis and prognosis. Again results from different groups studying ostensibly the same disease are frequently non-concordant [7,8]. The use of different microarray platforms is partly to blame for this. But perhaps most of the problem comes from lack of `inferential literacy' meeting lack of epidemiological savvy. The Toxicogenomics Research Consortium suggested that more-consistent results would be achieved not with signatures from individual genes but by examining the gene ontology (GO) categories of the differentially expressed genes [6]. Perhaps, but it is a sobering comment that when two RNA samples were compared in different laboratories, on different platforms and analysed in the same way, gene-by-gene list comparisons varied. All that could be agreed on were the changes in different GO categories - representative of the tissue of origin of the samples [6]. If scientists in different laboratories cannot agree on an ordered list of gene-expression differences when presented with the same two RNA samples, we really do have a problem. So what is the solution? 
Obviously, putting the right probes on the array would be a start - interrogating the same transcript or splice form is important. Consistent standards between laboratories would help improve the consistency of results - but consistency is not enough - after all the results within a laboratory were all consistent but the results can be consistently wrong. What we need is a proper evaluation of microarrays (including sample extraction and work-up, data handling and analysis) and an understanding of what is important to achieve consistent, accurate and reproducible results across laboratories. But perhaps 0 most important is that scientists understand the nature of the technology they are using - including experimental design, execution and analysis. We need to go beyond MIAME. 0 Miron, M. and Nadon, R. (2006) Inferential literacy for experimental high-throughput biology. Trends Genet. 22, (this issue, February 2006) doi: 10.1016/j.tig.2005.12.001 2 Draghici, S. et al. (2006) Reliability and reproducibility issues in DNA microarray measurements. Trends Genet. 22, (this issue, February 2006) doi: 10.1016/j.tig.2005.12.005 0 Project Creates Repository for Microarray Datasets 0 NEWS 0 NCBI GEO: mining millions of expression profiles--database and tools 1 Tanya Barrett, Tugba O. Suzek, Dennis B. Troup, Stephen E. Wilhite, Wing-Chi Ngau, Pierre Ledoux, Dmitry Rudnev, Alex E. Lash, Wataru Fujibuchi and Ron Edgar* 0 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD, USA 0 INTRODUCTION Since 2000, the Gene Expression Omnibus (GEO) has served as a public repository for high-throughput molecular abundance experimental data, providing free distribution and shared access to comprehensive datasets (1). These data include single and multiple channel microarray-based experiments 0 The principle architecture of the GEO database remains as described previously (1). 
Briefly, data submitted to GEO are stored in a relational database partitioned into three upper-level entity types: Platform, Sample and Series. A Platform describes the list of elements (e.g. oligonucleotide probesets, cDNAs, SAGE tags, antibodies) being assayed or 0 A Drosophila full-length cDNA resource 1 Mark Stapleton*, Joe Carlson*, Peter Brokstein*, Charles Yu*, Mark Champe*§ Reed George*, Hannibal Guarin*, Brent Kronmiller*¶, Joanne Pacleb*, Soo Park*, Ken Wan*, Gerald M Rubin*¥# and Susan E Celniker* 0 comment reviews 0 reports deposited research 0 Background: A collection of sequenced full-length cDNAs is an important resource both for functional genomics studies and for the determination of the intron-exon structure of genes. Providing this resource to the Drosophila melanogaster research community has been a long-term goal of the Berkeley Drosophila Genome Project. We have previously described the Drosophila Gene Collection (DGC), a set of putative full-length cDNAs that was produced by generating and analyzing over 250,000 expressed sequence tags (ESTs) derived from a variety of tissues and developmental stages. Results: We have generated high-quality full-insert sequence for 8,921 clones in the DGC. We compared the sequence of these clones to the annotated Release 3 genomic sequence, and identified more than 5,300 cDNAs that contain a complete and accurate protein-coding sequence. This corresponds to at least one splice form for 40% of the predicted D. melanogaster genes. We also identified potential new cases of RNA editing. Conclusions: We show that comparison of cDNA sequences to a high-quality annotated genomic sequence is an effective approach to identifying and eliminating defective clones from a cDNA collection and ensure its utility for experimentation. 
Clones were eliminated either because they carry single nucleotide discrepancies, which most probably result from reverse transcriptase errors, or because they are truncated and contain only part of the protein-coding sequence. 0 refereed research interactions information 0 One of the goals of the Berkeley Drosophila Genome Project is to define experimentally the transcribed portions of the genome by producing a collection of fully sequenced cDNAs. We have previously reported the construction of cDNA 0 libraries from a variety of tissues and developmental stages; these libraries were used to generate over 250,000 expressed sequence tags (ESTs), corresponding to approximately 70% of the predicted protein-coding genes in the Drosophila melanogaster genome [1,2]. We used computational analysis 0 Genome Biology 0 Stapleton et al. 0 of these ESTs to establish a collection of putative full-length cDNA clones, the Drosophila Gene Collection (DGC) [1,2]. Here, we describe the process by which we sequenced the full inserts of 8,921 cDNA clones from the DGC, describe the methods by which we assess each clone's likelihood of containing a complete and accurate protein-coding region, and illustrate how these data can be used to uncover additional cases of RNA editing. We have confirmed the identification of 5,375 cDNA clones that can be used with confidence for protein expression or genetic complementation. 0 Results and discussion 0 Sequencing strategy 0 Current approaches to full-insert sequencing of cDNA clones include concatenated cDNA sequencing [3], primer walking [4], and strategies using transposon insertion to create priming sites [5-9]. We adopted a cDNA sequencing strategy that relies on an in vitro transposon insertion system based on the MuA transposase, combined with primer walking (see Materials and methods for details). The production of full-insert sequences from DGC cDNAs is summarized in Tables 1 and 2. For DGCr1, clones were sized before sequencing. 
Small clones (< 1.4 kilobases (kb)) were sequenced with custom primers and larger clones were sequenced using either mapped or unmapped transposon insertions. For DGCr2, clones were not sized and a set of unmapped transposon insertions was sequenced to generate an average of 5x sequence coverage. For both DGCr1 and r2, custom oligonucleotide primers designed using Autofinish [10] were used to bring the sequences to high quality. To date, we have completed sequencing 93% of the DGCr1 clone set and 80% of the DGCr2 clone set. The strategy used for sequencing DGCr1 clones appears to be more efficient, because on average they required fewer sequencing reads than DGCr2 clones. However, we were able to reduce cycle time and increase throughput using the shotgun strategy adopted for sequencing the DGCr2 clones. The average insert size of the 8,770 high-quality cDNA sequences that have been submitted to GenBank is 2 kb and they total 17.5 megabases (Mb) of sequence. The largest clone (SD01389) is 8.7 kb and is derived from a gene (CG10011) that encodes a 2,119-amino-acid ankyrin repeat-containing protein. 0 Candidate clones to be sequenced Submitted to GenBank Clones in progress 0 Evaluating the coding potential of each cDNA on the basis of its full-insert sequence 0 For many potential uses in proteomics and functional genomics [11-13], it is important to establish cDNA collections comprised only of cDNAs with complete and uncorrupted open reading frames (ORFs). To determine which of our sequenced clones meet this standard, we compared them to the annotated Release 3 genome sequence [14,15] using a combination of BLAST [16] and Sim4 [17] alignments (see Materials and methods for details). 0 We grouped the cDNAs into four categories (Table 3). The first category contains a total of 5,916 cDNA clones, or 68% of the sequenced clones. 
We are confident that 5,375 of these clones contain a complete and accurate ORF, as they precisely match the Release 3 predicted protein for the corresponding gene. An additional 541 clones are from the SD, GM and AT libraries, which were generated from fly strains that are not isogenic with the strain used to produce the genome sequence. The predicted ORFs from clones from these libraries were required to be identical in length to the Release 3 predicted protein with less than 2% amino-acid difference to be placed in this category. We cannot at present distinguish whether these differences result from strain polymorphisms or reverse transcriptase (RT) errors. However, our own internal estimates of RT errors (see below), based on the observed nucleotide substitution rate in cDNAs derived from the same strain as the genomic 0 Table 3 cDNA analysis comment DGCr1 Clones that encode complete ORFs ORFs identical to the Release 3 predicte 0 Donor/Acceptor Interactions in Systematically Modified RuII-OsII Oligonucleotides 1 Dennis J. Hurley and Yitzhak Tor* 0 Abstract: Donor/acceptor (D/A) interactions are studied in a series of doubly modified 19-mer DNA duplexes. An ethynyl-linked RuII donor nucleoside is maintained at the 5′ terminus of each duplex, while an ethynyl-linked OsII nucleoside, placed on the complementary strands, is systematically moved toward the other terminus in three base pair increments. The steady-state RuII-based luminescence quenching decreases from 90% at the shortest separation of 16 Å (3 base pairs) to 11% at the largest separation of 61 Å (18 base pairs). Time-resolved experiments show a similar trend for the RuII excited-state lifetime, and the decrease in the averaged excited-state lifetime for each duplex is linearly correlated with the fraction quenched obtained by steady-state measurements. Analysis according to the Förster dipole-dipole energy transfer mechanism shows a reasonable agreement.
Deviation from idealized behavior is primarily attributed to uncertainty in the orientation factor, κ². Analyzing D/A interactions in an analogous series of doubly modified oligonucleotides, where the ethynyl-linked RuII center is replaced with a saturated two-carbon linked complex, yields an excellent correlation with the Förster mechanism. As this simple change partially relaxes the rigid geometry of the donor chromophore, these results suggest that the deviation from idealized Förster behavior observed for the duplexes containing the rigidly held RuII center originates, at least partially, from ambiguities in the orientation factor. Surprisingly, analyzing both quenching data sets according to the Dexter mechanism also shows an excellent correlation. Although this can be interpreted as strong evidence for a Dexter triplet energy transfer mechanism, it does not imply that this electron exchange mechanism is operative in these D/A duplexes. Rather, it suggests that systems that transfer energy via the Förster mechanism can under certain circumstances exhibit Dexter-like "behavior", thus illustrating the danger of imposing a single physical model to describe D/A interactions in such complex systems. While we conclude that the Förster dipole-dipole energy transfer mechanism is the dominant pathway for D/A interactions in these modified oligonucleotides, a minor contribution from the Dexter electron exchange mechanism at short distances is likely. This complex behavior distinguishes DNA-bridged RuII/OsII dyads from their corresponding low molecular-weight and covalently attached counterparts. 0 The DNA double helix has been shown to be an intriguing medium for exploring charge transfer phenomena.1 The intricacies of these processes have widely been probed using photoactive and redox-active transition metal coordination compounds.2 Much less attention has been given, however, to energy transfer processes in similarly metal-modified DNA oligonucleotides.
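The contrast between the two mechanisms named in the abstract can be illustrated with their textbook distance dependences: Förster efficiency falls off as an inverse sixth power of the donor/acceptor separation, while the Dexter exchange rate decays exponentially. The sketch below is only an illustration under stated assumptions. The Förster distance `R0_ANGSTROM` and the Dexter length scale `L_ANGSTROM` are hypothetical placeholders, not values from the paper, so the printed numbers are not expected to reproduce the reported 90% and 11% quenching.

```python
import math


def forster_efficiency(r, r0):
    """Textbook Förster transfer efficiency: E = 1 / (1 + (r / r0)**6)."""
    return 1.0 / (1.0 + (r / r0) ** 6)


def dexter_relative_rate(r, length_scale):
    """Textbook Dexter exchange rate relative to r = 0: exp(-2 r / L)."""
    return math.exp(-2.0 * r / length_scale)


# Hypothetical parameters chosen only for illustration (assumptions).
R0_ANGSTROM = 30.0  # Förster distance, at which E = 0.5
L_ANGSTROM = 10.0   # Dexter effective orbital-overlap length scale

# The shortest and longest separations quoted in the abstract.
for r in (16.0, 61.0):
    e = forster_efficiency(r, R0_ANGSTROM)
    k = dexter_relative_rate(r, L_ANGSTROM)
    print(f"r = {r:5.1f} Å  Förster E = {e:.3f}  Dexter relative rate = {k:.2e}")
```

Because the orientation factor κ² enters the Förster distance, any uncertainty in κ² shifts `R0_ANGSTROM` and therefore the whole efficiency curve, which is the sensitivity the abstract points to.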
The relatively complex excited-state manifold of polypyridine RuII and OsII compounds can be engaged in multiple relaxation mechanisms, including dipole-dipole (Förster) and electron exchange (Dexter) energy transfer processes (Figure 1).3,4 In simple heteronuclear RuII-OsII dyads, the mode of the 0 10.1021/ja020172r CCC: $22.00 © 2002 American Chemical Society 0 pend on Hec1 may signal checkpoint activation through diffusible Mad2 complexes. In Hec1-depleted cells, this activity could be generated through CENP-E or BubR1. Because kinetochores were not stretched in Hec1-depleted cells (30), it is plausible that persistent checkpoint activity was caused by lack of tension. Injection of antibodies to Hec1 into bladder carcinoma cells was reported to cause aberrant mitotic progression and cell death but no checkpoint arrest (23). This result could be explained if these tumor cells were checkpoint-deficient or if the injected antibodies interfered with checkpoint signaling. In Saccharomyces cerevisiae, mutations in the Hec1 homolog Ndc80 caused chromosome segregation defects without activating the checkpoint (24, 26). This may relate to the fact that kinetochores in budding yeast bind only a single MT, whereas those in vertebrate cells capture multiple MTs (8, 9). Furthermore, kinetochore-MT interactions and checkpoint signaling in vertebrates may involve two distinct pathways: one centered on Hec1 interacting with Mad1/Mad2 and the other on CENP-E interacting with CENP-F and BubR1, both pathways converging onto APC/C (35, 36). Yeast has a clear counterpart of Hec1 but lacks an obvious homolog of CENP-E. The human kinetochore protein Hec1 was required, together with Mps1, for recruiting the Mad1/Mad2 complex to kinetochores. Moreover, Hec1-depleted cells displayed persistent spindle checkpoint activity although they lacked significant amounts of Mad1 or Mad2 at kinetochores.
This latter observation contrasts with models emphasizing the importance of high steady-state levels of kinetochore-associated Mad1/Mad2 complexes in checkpoint signaling and instead suggests that some protein that does not depend on Hec1 for kinetochore localization is able to communicate with diffusible Mad2 complexes. Many tumor cells are thought to be defective in the spindle checkpoint (37 ). Any interference with Hec1 function in checkpoint-deficient cells, be it through siRNA or other specific inhibitors, is predicted to result in mitotic catastrophe, thereby causing the demise of most progeny. In contrast, normal checkpoint-proficient cells may arrest transiently in response to reversible Hec1 inhibition. Thus, Hec1 may be an attractive target for therapeutic intervention in cancer and other hyperproliferative diseases. 0 Gene Expression During the Life Cycle of Drosophila melanogaster 0 Molecular genetic studies of Drosophila melanogaster have led to profound advances in understanding the regulation of development. Here we report gene expression patterns for nearly one-third of all Drosophila genes during a complete time course of development. Mutations that eliminate eye or germline tissue were used to further analyze tissue-specific gene expression programs. These studies define major characteristics of the transcriptional programs that underlie the life cycle, compare development in males and females, and show that large-scale gene expression data collected from whole animals can be used to identify genes expressed in particular tissues and organs or genes involved in specific biological and biochemical processes. Molecular studies of development in multicellular organisms have gone through two major phases during the past three decades. Initially, solution hybridization studies quantitated transcript abundance and showed that large-scale changes in gene expression accompany development (1). 
In Drosophila, such studies suggested that 5000 to 7000 different polyadenylated RNA species are produced at each stage of the life cycle and that the composition of this set of RNAs shifted during development (1). These analyses gave an overview of genome activity during development, but they could not follow the expression of individual genes or reveal their identities. Later, when it became possible to clone individual genes (2, 3), RNA blots and in situ hybridization revealed when and where individual genes were active. This second phase of analysis allowed 0 an initial determination of the links between molecules and developmental functions. This gene-by-gene approach has dominated developmental biology for the past two decades. DNA microarrays extend the single-gene approach to the genome level by measuring the transcript levels of thousands of genes simultaneously (4 - 6). Here we present the transcriptional profiles for about one-third of all predicted Drosophila genes (7) throughout the life cycle, from fertilization to aging adults. cDNA microarrays were used to analyze the RNA expression levels of 4028 genes in wild-type flies examined during 66 sequential time periods beginning at fertilization and spanning the embryonic, larval, and pupal periods and the first 30 days of adulthood, when males and females were sampled separately (Fig. 1A). Early embryos change rapidly, so overlapping 1-hour periods were sampled; adults were sampled at multiday intervals (Fig. 1A) (8). We compared each experimental sample to a common reference sample made from pooled mRNA representing all stages of the life cycle, allowing us to measure each transcript's relative abundance (8). We refer to this relative abundance at each time as a gene's transcript or expression level, and to each gene's overall pattern of expression during development as its transcript or expression profile. 
Expression of most genes assayed (3483 out of 4028, 86%) changed significantly [P < 0.001, analysis of variance (ANOVA)] during the 40-day period surveyed (8). Of these, 3219 genes exhibited at least a fourfold difference between their highest and lowest levels of expression (Fig. 1B and table S1). The vast majority of these developmentally modulated genes ( 88%) are expressed during the first 20 hours of development, before the end of embryogenesis (Fig. 1, B and C). To identify patterns of gene reexpression during development, we applied a peak-finding algorithm (8) to each gene's expression profile. We found that 36.3% of the genes (1169 genes) showed a single major peak of expression (Fig. 1D, left panels), whereas 40.3% (1298) showed two peaks (Fig. 1D, right panels) and 23.4% (752) showed three or more peaks (fig. S1 and tables S2 to S6). Many genes are expressed in two waves 0 BMC Genomics 0 Methodology article 0 BioMed Central 0 Open Access 0 Utilization of a labeled tracking oligonucleotide for visualization and quality control of spotted 70-mer arrays 1 Martin J Hessner*1,2, Vineet K Singh3, Xujing Wang1,2, Shehnaz Khan2, Michael R Tschannen2 and Thomas C Zahrt3 0 Hessner et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. 0 Spotted oligonucleotide arrays, 70-mers, gene expression analysis 0 Background: Spotted 70-mer oligonucleotide arrays offer potentially greater specificity and an alternative to expensive cDNA library maintenance and amplification. Since microarray fabrication is a considerable source of data variance, we previously directly tagged cDNA probes with a third fluorophore for prehybridization quality control.
Fluorescently modifying oligonucleotide sets is cost prohibitive, therefore, a co-spotted Staphylococcus aureus-specific fluorescein-labeled "tracking" oligonucleotide is described to monitor fabrication variables of a Mycobacterium tuberculosis oligonucleotide microarray. Results: Significantly (p < 0.01) improved DNA retention was achieved printing in 15% DMSO/1.5 M betaine compared to the vendor recommended buffers. Introduction of tracking oligonucleotide did not effect hybridization efficiency or introduce ratio measurement bias in hybridizations between M. tuberculosis H37Rv and M. tuberculosis mprA. Linearity between the mean log Cy3/Cy5 ratios of genes differentially expressed from arrays either possessing or lacking the tracking oligonucleotide was observed (R2 = 0.90, p < 0.05) and there were no significant differences in Pearson's correlation coefficients of ratio data between replicates possessing (0.72 ± 0.07), replicates lacking (0.74 ± 0.10), or replicates with and without (0.70 ± 0.04) the tracking oligonucleotide. ANOVA analysis confirmed the tracking oligonucleotide introduced no bias. Titrating target-specific oligonucleotide (40 µM to 0.78 µM) in the presence of 0.5 µM tracking oligonucleotide, revealed a fluorescein fluorescence inversely related to target-specific oligonucleotide molarity, making tracking oligonucleotide signal useful for quality control measurements and differentiating false negatives (synthesis failures and mechanical misses) from true negatives (no gene expression). Conclusions: This novel approach enables prehybridization array visualization for spotted oligonucleotide arrays and sets the stage for more sophisticated slide qualification and data filtering applications. 0 Page 1 of 11 0 (page number not for citation purposes) 0 BMC Genomics 2004, 5 0 variable DNA probe deposition and retention on the solid support surfaces. 
To minimize variations using this fabrication platform, a number of approaches have been described that allow direct visualization of array integrity following printing and blocking procedures. Commonly used methods include the staining of microarrays with DNA-binding fluorescent dyes, or the hybridization of "universal" targets (i.e. random 9-mers) to the spotted DNA elements [15,16]. While these techniques provide useful information regarding the physical characteristics of the array, its integrity may be compromised during subsequent de-staining or stripping procedures required prior to hybridization of labeled targets [16]. Consequently, investigators typically only examine one or a few representative slides to access the quality of a printed batch. Previously, we have reported the development and use of a novel three-color cDNA array platform that allows immobilized probes to be directly visualized [17-19]. Utilizing this format, oligonucleotide primers used to amplify cDNA targets are labeled at their 5' end with fluorescein, a dye compatible with commonly used cyanine labeling dyes and confocal laser scanners possessing narrow bandwidths [18,20]. Element/array morphology, surface DNA deposition/retention, and surface background can be monitored on each slide. Thus, in our laboratory, all cDNA arrays are imaged for quality control prior to hybridization, maximizing the use of quality arrays for subsequent experimental procedures. It is likely that many or all of the benefits to using a directly-coupled fluorophore are also applicable to oligonucleotide-based microarrays; however, synthesis costs make this approach unfeasible. In this report, we describe the use and evaluation of a Staphylococcus aureus-specific fluorescein-labeled 70-mer "tracking" oligonucleotide as a third-color quality control measure of a Mycobacterium tuberculosis-specific oligonucleotide-based microarray. 
0 Results and Discussion 0 Page 2 of 11 0 (page number not for citation purposes) 0 BMC Genomics 2004, 5 0 Variation in gene expression within and among natural populations 1 Marjorie F. Oleksiak1, Gary A. Churchill2 & Douglas L. Crawford1 0 Evolution may depend more strongly on variation in gene expression than on differences between variant forms of proteins1. Regions of DNA that affect gene expression are highly variable, containing 0.6% polymorphic sites2. These naturally occurring polymorphic nucleotides can alter in vivo transcription rates3-7. Thus, one might expect substantial variation in gene expression between individuals. But the natural variation in mRNA expression for a large number of genes has not been measured. Here we report microarray studies addressing the variation in gene expression within and between natural populations of teleost fish of the genus Fundulus. We observed statistically significant differences in expression between individuals within the same population for approximately 18% of 907 genes. Expression typically differed by a factor of 1.5, and often more than 2.0. Differences between populations increased the variation. Much of the variation between populations was a positive function of the variation within populations and thus is most parsimoniously described as random. Some genes showed unexpected patterns of expression-- changes unrelated to evolutionary distance. These data suggest that substantial natural variation exists in gene expression and that this quantitative variation is important in evolution. 0 each of 907 genes. The loop design is substantially different from the most commonly used `reference microarray' design, in which each RNA sample of interest is used to probe the same reference sample and all values are expressed as ratios of the sample signal to the reference signal. We proposed to answer two questions. 
First, what proportion of genes are differentially expressed between individuals within the same population? Second, how many genes are differentially expressed between populations? To address these questions, we applied ANOVA methods to the log_e-normalized data18. Unlike most microarray strategies (but similar to one previous study19), ours did not depend on assessing ratios of fluorescent signals, whereby only large differences can be detected. Instead, we investigated which genes showed statistically significant variations in expression. The expression levels of 161 genes (18%) were significantly different between individuals within the same population at the nominal P value of 0.01 (Fig. 2), as determined using standard statistical tables or permutation analyses within each gene. This number of significant genes is 18 times larger than the nine false positives expected under the null hypothesis when P = 0.01. To provide tighter control of type I errors (falsely rejecting the null hypothesis), we considered applying a multiple-testing adjustment to these tests20. Experiment-wide control of type I error at the 5% level corresponds to an individual test P value of 6 × 10⁻⁵. Only 37 of the 161 genes showed significant differences in expression between individuals at this level of stringency, which may be overly conservative. We chose to use the significance level of P = 0.01 and accept a greater type I error in our analyses. 0 The proportion (18%) of loci differing significantly in expression between individuals within the same population is similar to the percentage of loci that differ significantly in expression between different strains of yeast21 (24%) and the percentage of loci that show non-zero variance in Drosophila melanogaster19 (25%), as determined by previous studies. These studies by necessity used pooled samples, and thus could not measure variation in expression between individuals in natural populations.
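The multiple-testing figures quoted for the 907 genes can be checked with a few lines of arithmetic. The sketch below assumes a Bonferroni-style adjustment (family-wise level divided by the number of tests). The text says only that "a multiple-testing adjustment" was considered, so the specific correction and the function names are assumptions made for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha):
    """Per-test P-value threshold under a Bonferroni adjustment (assumed here)."""
    return family_alpha / n_tests


N_GENES = 907        # genes on the array, from the text
N_SIGNIFICANT = 161  # genes significant at P = 0.01, from the text

false_pos = expected_false_positives(N_GENES, 0.01)
per_test = bonferroni_threshold(N_GENES, 0.05)

print(f"expected false positives at P = 0.01: {false_pos:.2f}")
print(f"fraction of genes significant: {N_SIGNIFICANT / N_GENES:.1%}")
print(f"observed / expected: {N_SIGNIFICANT / false_pos:.1f}")
print(f"Bonferroni per-test threshold at the 5% level: {per_test:.1e}")
```

Under this assumption the per-test threshold is about 5.5 × 10⁻⁵, which rounds to the 6 × 10⁻⁵ quoted in the text, and the expected false-positive count of about nine matches the "nine false positives" stated there.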
In humans there is a large variation in gene expression between individuals; in a global comparison of mRNA levels of chimpanzees and humans, there was greater variation within the human population than between human and chimpanzee populations22. These results support our finding of large variation in gene expression between individuals and emphasize the importance of examining individual variation. An ANOVA analysis calculates significance using an F statistic, and significant F values require that the variation between samples is significantly larger than the residual variation within samples20. Thus, finding significant differences between individuals requires that the variation between individuals be larger than experimental variation (for example, variation due to printing, hybridization, array differences and other factors). One measure of the experimental variation is the coefficient of variation (c.v.) of gene expression for each individual among the eight replicates, which equals the standard deviation divided by the mean, expressed as a percentage. Nearly all (99%) of the genes for each individual had a c.v.error of less than 15% (Fig. 2). The statistical significance of the differences in expression of 161 genes depended on this small experimental error. We minimized experimental error by using eight replicate measures per individual for each gene and using normalized data rather than the ratio typically used in a reference design. Ratios of two values, each having its own variation, have larger experimental variation20. Not surprisingly, genes for which there was little experimental variation (low c.v.error values) showed the greatest significant differences in expression between individuals within the same population, and genes with large experimental variation values did not differ significantly (Fig. 2). 0 Regulation of noise in the expression of a single gene 1 Ertugrul M. Ozbudak1, Mukund Thattai1, Iren Kurtser2, Alan D. 
Grossman2 & Alexander van Oudenaarden1 0 Nature Publishing Group http://genetics.nature.com 0 Stochastic mechanisms are ubiquitous in biological systems. Biochemical reactions that involve small numbers of molecules are intrinsically noisy, being dominated by large concentration fluctuations1-3. This intrinsic noise has been implicated in the random lysis/lysogeny decision of bacteriophage-λ4, in the loss of synchrony of circadian clocks5,6 and in the decrease of precision of cell signals7. We sought to quantitatively investigate the extent to which the occurrence of molecular fluctuations within single cells (biochemical noise) could explain the variation of gene expression levels between cells in a genetically identical population (phenotypic noise). We have isolated the biochemical contribution to phenotypic noise from that of other noise sources by carrying out a series of differential measurements. We varied independently the rates of transcription and translation of a single fluorescent reporter gene in the chromosome of Bacillus subtilis, and we quantitatively measured the resulting changes in the phenotypic noise characteristics. We report that of these two parameters, increased translational efficiency is the predominant source of increased phenotypic noise. This effect is consistent with a stochastic model of gene expression in which proteins are produced in random and sharp bursts. Our results thus provide the first direct experimental evidence of the biochemical origin of phenotypic noise, demonstrating that the level of phenotypic variation in an isogenic population can be regulated by genetic parameters. 0 We selected as our reporter system a single-copy chromosomal gene with an inducible promoter. As an estimated 50-80% of bacterial genes are transcriptionally regulated8, this system typifies the majority of naturally occurring genes, allowing our results to be extended to natural systems.
We incorporated a single copy of our reporter, the green fluorescent protein gene (gfp), into the chromosome of B. subtilis. We chose to integrate gfp into the chromosome itself, rather than in the form of plasmids, as variation in plasmid copy number9,10 can act as an additional and unwanted source of noise. Transcriptional efficiency was regulated by using an isopropyl-β-D-thiogalactopyranoside (IPTG)-inducible promoter, Pspac, upstream of gfp, and varying the concentration of IPTG in the growth medium. Translational efficiency was regulated by constructing a series of B. subtilis strains (Table 1) that contained point mutations in the ribosome binding site (RBS) and initiation codon of gfp11.
Table 1 · Translational mutants: point mutations in the RBS and initiation codon of gfp
Strain   Ribosome binding site          Initiation codon   Translational efficiency
ERT25    GGG AAA AGG AGG TGA ACT ACT    ATG                1.00
ERT27    GGG AAA AGG AGG TGA ACT ACT    TTG                0.87
ERT3     GGG AAA AGG TGG TGA ACT ACT    ATG                0.84
ERT29    GGG AAA AGG AGG TGA ACT ACT    GTG                0.66
The use of two different strategies to regulate transcriptional and translational processes introduces a potential bias in the relative contributions of these processes to biochemical noise. As a control, we constructed four additional strains (Table 2) whose transcription rates were altered by mutations in the promoter region of the reporter gene. As described below, both strategies of transcriptional regulation produced similar results. We measured expression of green fluorescent protein (GFP) for single cells in a bacterial population using flow cytometry. Variation in GFP expression from cell to cell (phenotypic noise) is seen in a histogram (Fig. 1a) showing the protein expression levels (p) measured during a typical experiment. The histogram is characterized by a mean value $\langle p \rangle$ and a standard deviation $\sigma_p$.
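As an illustration only, the histogram statistics just described, together with the variance-to-mean ratio that the text goes on to use as the noise strength, might be computed from per-cell fluorescence values as sketched below. The function name `noise_strength` and the synthetic gamma-distributed readings are assumptions made for this example; they are not taken from the paper, and the use of the population (rather than sample) standard deviation is likewise a choice of this sketch.

```python
import random
import statistics


def noise_strength(values):
    """Return (mean, standard deviation, variance/mean) of per-cell expression values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # spread of the measured population itself
    return mean, sd, sd ** 2 / mean


# Made-up fluorescence readings (arbitrary units); not data from the paper.
rng = random.Random(0)
cells = [rng.gammavariate(4.0, 25.0) for _ in range(10_000)]
mean, sd, strength = noise_strength(cells)
print(f"mean = {mean:.1f}, sd = {sd:.1f}, variance/mean = {strength:.1f}")
```

Because the ratio carries the units of the measurement, values from different instruments are comparable only after a common calibration.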
The phenotypic noise strength, defined as the quantity $\sigma_p^2/\langle p \rangle$ (variance/mean), is sensitive to the biochemical sources of stochasticity that we wished to study and is therefore the unit in which we report our results. We measured phenotypic noise strength for the four different translational strains as we varied IPTG concentration between 30 µM (near-basal transcription) and 1 mM (full operon induction). For example, Fig. 1b shows flow cytometer results for the four strains at full induction, whereas Fig. 1c shows the results from a series of flow cytometer experiments conducted on a single strain (ERT3) as IPTG concentration was varied. A summary of all of our experimental results (Fig. 2a) shows the measured noise strength as a simultaneous function of both transcriptional efficiency (varying [IPTG] in the growth medium) and translational efficiency (using different strains with mutations in the RBS and initiation codon). As the addition of IPTG and mutations in the gfp RBS are not expected to affect normal cellular processes, all contributions to phenotypic noise remained unchanged throughout our experiment, except fluctuations in rates of transcription and translation. The response of phenotypic noise strength to a change in either translational efficiency (Fig. 2b) or transcriptional efficiency (Fig. 2c) indicates the isolated contribution of that parameter to the phenotypic noise.
Table 2 · Transcriptional mutants: point mutations in the Pspac promoter
Strain   -10 regulatory region (-10 ... +1)   Transcriptional efficiency
ERT57    CAT AAT GTG TGT AAT                  6.63
ERT25    CAT AAT GTG TGG AAT                  1.00
ERT53    CAT AAT GTG TGC AAT                  0.79
ERT51    CAT AAT GTG TGA AAT                  0.76
ERT55    CAT AAT GTG TAA AAT                  0.76
[Figure 1: flow cytometry histograms of GFP expression; axes: number of cells versus p (fluorescence units); panel labels: [IPTG] = 30 µM, 75 µM and 1 mM]
We find that the phenotypic noise strength shows a strong positive correlation with translational efficiency (Fig. 2b, slope=21.8), in contrast to the weak positive correlation observed for transcriptional efficiency (Fig. 2c, slope=6.5). Switching from the ERT27 strain to the ERT25 strain (an increase in translational efficiency of about 15%; Table 1) increases the noise strength from 32 to 35 units; the same effect is achieved only upon doubling transcriptional efficiency (a 100% increase) from the half-induction to the full-induction level. Experiments involving the control strains, in which transcription rates were altered by mutation rather than by operon induction, supported the weak correlation between noise strength and transcriptional efficiency (Fig. 2c inset, slope=7.3). The differential nature of our measurements (investigating changes rather than absolute values) makes our results independent of the specific properties of the reporter protein, such as gene locus or folding characteristics. This suggests that increased translational efficiency will strongly increase the variation in the expression of any naturally occurring gene. A stochastic model for the expression of a single gene (Fig. 3a) predicts that the noise strength ($\sigma_p^2/\langle p \rangle$) is greater than Poissonian ($\sigma_p^2/\langle p \rangle = 1$) and is simply an increasing function of translational efficiency12:
$$\frac{\sigma_p^2}{\langle p \rangle} \cong 1 + b$$
Here, $b = k_P/\gamma_R$ is the average number of proteins synthesized per mRNA transcript; these proteins are injected into the cytoplasm in sharp bursts (Fig. 3b). The measured asymmetry between the noise contributions of transcriptional and translational parameters is consistent with this prediction and provides evidence of
[Figure 2: noise strength $\sigma_p^2/\langle p \rangle$ (fluorescence units) versus translational efficiency and transcriptional efficiency]
Fundamentals of experimental design for cDNA microarrays 1 Gary A. Churchill 0 Sources of variation in microarray experiments The design of a two-color microarray experiment can be considered as having three layers. Figure 1 shows an example of an experiment that compares the effects of two treatments--A and B--on gene-expression profiles in a mouse tissue. At the top layer of the experiment are the experimental units, the two mice to whom each treatment is applied. The term `treatment' pertains to any attribute, such as the sex or strain of the organism, of primary interest in the experiment. The mice were selected to be representative of a population of mice and, if possible, the treatment should be assigned using a randomizing device such as a coin toss. Assigning at least two mice to each treatment group ensures that there is biological replication in the experiment. In the middle layer, two RNA samples are obtained from each mouse. These technical replicates may be two independent RNA extractions or two aliquots of the same extraction. The RNA samples are assigned to two different dye labels, indicated by the red and green test tubes. They are then paired (one red and one green) and mixed for co-hybridization on microarray slides. The bottom layer of the experiment involves the arrangement of array elements on the slides. In this example, duplicate spots of each cDNA clone have been printed side by side. The many sources of variation in a microarray experiment can be partitioned along these three layers. Biological variation (top layer) is intrinsic to all organisms; it may be influenced by genetic or environmental factors, as well as by whether the samples are pooled or individual. Technical variation (middle layer) is introduced during the extraction, labeling and hybridization of samples.
Measurement error (bottom layer) is associated with reading the fluorescent signals, which may be affected by factors such as dust on the array. Valid statistical tests for differential expression of a gene across the samples can be constructed on the basis of any of these variance components, but there are important distinctions in how the different types of tests should be interpreted. If we are interested in determining how the treatments affect different biological populations represented in our samples, statistical tests should be based on the biological variance. If our interest is to detect variations within treatment groups, the tests should be based on technical variation. For example, Olesiak et al.1 employed both types of tests to look at variation between and within natural populations. Tests 0 based on measurement error variance can also be constructed but are of limited utility2. For most questions of interest, the higher two levels of variation are appropriate for constructing tests, and hence good designs should incorporate replication at the higher layers. 0 Experimental units and treatments The correlation observed between ratios of fluorescent intensity from duplicate spots on a single microarray slide will typically exceed 95%. This is often interpreted as a demonstration that microarray assays are reproducible. However, if the same target sample is divided and hybridized to two different microarray slides, the correlation across hybridizations is likely to fall to the 60 to 80% range, somewhat lower if the dye labeling is reversed. Correlations between samples obtained from individual inbred mice may be as low as 30%. If the experiments are carried out in different laboratories, the correlations may be lower still. These decreasing correlations reflect the cumulative contributions of multiple sources of variation. It is tempting to avoid biological replication in an experiment because results will appear to be more reproducible. 
The apparent increase in statistical power is illusory, however, and significant findings may simply reflect chance fluctuations in the particular animals chosen for the experiment. In general, it is appropriate to take steps to vary the conditions of the experiment--for example, by assaying multiple animals--to ensure that the effects that do achieve statistical significance are real and will be reproducible in different settings3. Identifying the independent units in an experiment is a prerequisite for a proper statistical analysis, as any hidden correlations in the data can lead to bias and inflated levels of statistical significance. Statistical independence is a relative concept. For example, hybridizations of the same target sample to multiple slides may be viewed as independent replicates if the intent is to characterize that sample accurately. However, in an experiment where the question of interest concerns a biological comparison at the whole-organism level (for example, a comparison of geneexpression profiles between genetically altered and control animals), the technical replicates from any one sample may no longer be regarded as independent. Details of how individual animals and samples were handled throughout the course of an experiment can be important to 0 Allocating resources in a microarray experiment The precision of estimated quantities depends on the variability of the experimental material, the number of experimental units, the number of repeated observations per unit and the accuracy of the primary measurements4. The basis for drawing inferential conclusion is the residual error (or mean squared error, MSE), which quantifies the precision of estimates and thus allows one to determine whether estimated quantities are significantly different in the statistical sense. In a microarray experiment, the residual error can be decomposed into three components of variance corresponding to the three layers of the design (Fig. 1). 
The first component is the intrinsic variation of the biological units within a treatment group, which we will denote by 2 . 0 are multiple treatment factors). If there are no degrees of freedom left, there may be no information available to estimate the biological variance, the statistical tests will rely on technical variance alone, and the scope of the conclusions will be limited to the samples in hand. If there are 5 df or more, you are in good shape (see Box). In some circumstances, a large number of experimental units may be available, perhaps more than can be measured individually, in which case we have the option to form pools of individual samples. In other cases, pooling may be a necessity owing to the limited availability of RNA. Pooling the original experimental units creates new units, the pools. Pooling can reduce the biological component of variation, but it cannot reduce the variability due to sample handling or measurement error. In a two-sample comparison, we could consider making two large pools of all available units and measuring each pool multiple times. This is a poor design, as it does not allow estimation of the between-pool variance. By pooling all the available samples together we have minimized the biological variance, but we have also eliminated all independent replication. It is better to use several pools and fewer technical replicates. 0 Pairing samples for hybridizations The 0 The effect of replication on gene expression microarray experiments 1 Paul Pavlidis1,, Qinghong Li2 and William Stafford Noble3, 0 Columbia 0 Replication is a straightforward method for improving the quality of inferences made from experimental studies. However, replication increases the cost of experiments and, typically, the amount of material needed. In general, it makes sense to do as much replication as is necessary to achieve a desired level of sensitivity and specificity, but not much more. 
This trade-off between cost and statistical power arises frequently in gene expression microarray experiments. Replication is clearly necessary in this domain (Lee et al., 2000; Novak et al., 2002), but microarray experiments are costly and involve RNA samples that are often difficult to obtain. We therefore need techniques for estimating in advance how many replicates should be performed in a given study. 0 A standard approach to the problem of estimating the statistical properties of a planned set of data is `power analysis'. Power analysis estimates the probability of correctly rejecting the null hypothesis in favor of a specific alternative while maintaining a particular Type I error rate. For the situations we consider here, the alternative hypothesis is usually expressed in terms of `effect size', the actual difference in the group means (relative to the variance) that is desired to be detected. A mathematical model of the data is then used to estimate how many replicates are needed to achieve the desired Type I and Type II error rates. Certain parameters for the modeled data (most critically, the expected variability) are often estimated from real data, perhaps from a pilot study. Although clearly a useful tool, power analysis comes with some caveats. First, the estimated variability is critically dependent on the assumptions of the model and the quality of the input parameter estimates. A second set of assumptions enters into the statistical test that is used to evaluate the null hypothesis. In addition, for gene expression studies, power analysis is potentially extremely complex, with a separate set of parameters for each gene, not to mention the need to account for complex interactions among genes. To our knowledge such a complete power calculation has not been attempted, though some papers have used simpler power analyses to study microarray expression data (Zien et al., 2002; Hwang et al., 2002; Pan et al., 2002).
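To make the idea of a conventional power analysis concrete, the following is a minimal sketch for a single gene, assuming a two-sided comparison of two group means and the textbook normal approximation. The function name `replicates_per_group`, the default error rates, and the choice to express effect size as the mean difference divided by a common standard deviation are assumptions of this example, not the procedure of any paper cited above.

```python
from math import ceil
from statistics import NormalDist


def replicates_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sided, two-group comparison.

    effect_size is the difference in group means divided by the common
    standard deviation. Uses the normal approximation
    n = 2 * ((z_{1 - alpha/2} + z_{power}) / effect_size) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # controls the Type I error rate
    z_power = z.inv_cdf(power)          # controls the Type II error rate
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.5, 1.0, 2.0):
    print(f"effect size {d}: about {replicates_per_group(d)} replicates per group")
```

The sketch treats one gene in isolation; it says nothing about the per-gene parameters or gene-gene interactions that make a complete calculation difficult.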
In this paper, we study the effect of increasing (or decreasing) replication on the detection of differentially expressed genes in real data sets, avoiding the assumptions required to simulate data. However, because in real data sets we do not know which genes truly show differential expression, we cannot directly assess power. Instead, we examine aspects of the results which are of interest to biologists and which complement traditional power analyses. We make our findings as general as possible by analyzing many data sets. We consider a simple general type of experiment, the goal of which is to identify genes that are differentially expressed between two experimental groups (for example, tumor and normal tissue). The two groups each contain a number of 0 Effect of replication on microarray experiments 0 replicate samples. These replicates are derived from different biological sources, as opposed to so-called `technical replicates', in which the same biological sample is tested multiple times. Differentially expressed genes are identified by a statistical test for group comparison (such as a t-test), where the null hypothesis is equality of the group means. A p-value threshold is applied following the test to establish a desired Type I error rate. The final result obtained from this hypothetical experiment is a list of genes that are differentially expressed at a particular level of statistical confidence. To study various levels of replication, we use a random sampling approach. Given a real data set, we simulate smaller data sets of various sizes by randomly selecting samples from it. For example, if we start with a data set containing at least 12 replicates in each group, then we can make data sets of any level of replication (up to 12) by randomly selecting from the real samples (Fig. 1). We then examine properties of each of these sampled data sets with methods described below. 
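The random sampling approach described above can be sketched roughly as follows: draw r replicate samples per group from the full data set, then rank genes by a two-group test statistic. The data layout (a dictionary from gene name to a list of replicate values), the function names, the use of a Welch-style t statistic, and the synthetic data are all assumptions made for illustration; the sketch ranks by |t| only and does not compute p-values.

```python
import random
import statistics


def t_statistic(a, b):
    """Welch-style t statistic for two lists of replicate measurements."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se if se > 0 else 0.0


def subsample_and_rank(group_x, group_y, r, rng):
    """Pick r replicates per group at random (r >= 2) and rank genes by |t|.

    group_x, group_y: dict mapping gene name -> list of per-replicate values.
    Returns gene names ordered from largest to smallest |t|.
    """
    n_x = len(next(iter(group_x.values())))
    n_y = len(next(iter(group_y.values())))
    cols_x = rng.sample(range(n_x), r)  # the same replicates are used for every gene
    cols_y = rng.sample(range(n_y), r)
    scores = {}
    for gene in group_x:
        a = [group_x[gene][i] for i in cols_x]
        b = [group_y[gene][i] for i in cols_y]
        scores[gene] = abs(t_statistic(a, b))
    return sorted(scores, key=scores.get, reverse=True)


# Synthetic data: 5 genes, 12 replicates per group, only g0 truly shifted.
rng = random.Random(1)
genes = [f"g{i}" for i in range(5)]
group_x = {g: [rng.gauss(0.0, 1.0) for _ in range(12)] for g in genes}
group_y = {g: [rng.gauss(5.0 if g == "g0" else 0.0, 1.0) for _ in range(12)] for g in genes}
for r in (3, 6, 12):
    print(r, subsample_and_rank(group_x, group_y, r, rng)[:2])
```

Repeating the call many times per replication level, and comparing the resulting rankings, is one way to build up the kind of statistics the text describes.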
We repeat this procedure on many data sets, for every possible level of replication, for many random samples, to generate a large set of statistics on the properties of data sets of various sizes. We consider two qualities of each sampled data set. The first and most important is the ability to obtain any results at all, that is, to find genes that meet our statistical criteria. We refer to this property as `apparent power' to distinguish it from power in the strict sense. Because increasing sample size will essentially always increase power, it might be reasonable for an experimenter to choose a level of replication that is sufficient to yield `enough' high-confidence candidates, where `enough' must be defined by the needs of the experiment. The second quality that we consider is the stability of the results. Note that stability is only meaningful if some genes have met our statistical criteria. We define stability as the tendency for the results to remain the same as the replication level is changed. We define two metrics of stability, which differ in their stringency. First, we consider the stability of the identities of the genes that meet the statistical criteria. Second, we consider the rank order of those genes. Details of our metrics are provided in the methods section, below. Our goal is to identify, for each data set, a level of replication that yields good performance according to our metrics, but without requiring an unreasonably large number of replicates. We wish to ask, `Can we find useful results with only a few replicates?' and at the other extreme, `Do we need 30 replicates?' Although the experimental design used here is simple--identifying differentially expressed genes across two conditions--the techniques that we describe could be applied to a wide range of situations. Our results suggest that while statistical power is a critical consideration in experimental design, researchers should also consider the stability of the results they obtain. 
While the specific findings are data dependent, we found that good apparent power and stability can usually be obtained with fewer than 0 Tissue class X Tissue class Y 0 For r = (3...n) r=6 Up to 100 random trials Up to 10 random trials 0 Full data set 0 n = 8 per group 0 Sample (S) 0 per group 0 Sample-test (Stest) 0 per group 0 T-test ranking Threshold 0 Sample-selected (Ssel) 0 Sample (S; 'gold standard') 0 Comparisons for stability determination 0 T-test ranking 0 Sample-test-selected (Stestsel) 0 Sample-test (Stest) 0 replicates, and often with fewer than 10. On the other hand, using fewer than five replicates almost always results in poor apparent power and low stability. The methods we present can be used i 0 Error-correcting microarray design 1 Arshad H. Khan,a Alex Ossadtchi,b Richard M. Leahy,b and Desmond J. Smitha,* 0 Abstract We describe a microarray design based on the concept of error-correcting codes from digital communication theory. Currently, microarrays are unable to efficiently deal with "drop-outs," when one or more spots on the array are corrupted. The resulting information loss may lead to decoding errors in which no quantitation of expression can be extracted for the corresponding genes. This issue is expected to become increasingly problematic as the number of spots on microarrays expands to accommodate the entire genome. The error-correcting approach employs multiplexing (encoding) of more than one gene onto each spot to efficiently provide robustness to drop-outs in the array. Decoding then allows fault-tolerant recovery of the expression information from individual genes. The error-correcting method is general and may have important implications for future array designs in research and diagnostics. © 2003 Elsevier Science (USA). All rights reserved. 
0 Keywords: Efficiency; Error-correcting codes; Fault-tolerance; Microarrays; Overhead 0 Relative expression levels for two different biological samples can be measured simultaneously for several thousand genes using cDNA microarrays [1]. The arrays are created robotically using pins to spot different cDNAs as a 2D grid on a treated glass slide. The RNA from the two samples is labeled using fluorescent dyes with distinct spectra and cohybridized to the array. A photomultiplier tube (PMT) is then used to collect an image of the stimulated fluorescence for each of the two fluorophores at every spot. Relative transcript abundances for each gene are quantitated as the log-ratio of the fluorescence intensities. Many factors can affect the accuracy of microarrays: spot size, pin effects, hybridization efficiency, the response of the PMT, and the quality of the labeled RNA [2]. Taking the ratio for the two fluorophores at each spot to compute relative expression can help mitigate effects common to both samples, such as spot size and hybridization efficiency. However, poor spot formation and neighborhood background fluorescence may be so severe that little or no useful information can be extracted from affected spots. As the 0 number of spots on microarrays expands to accommodate the entire genome, the occurrence of such "drop-outs" will tend to increase. Current microarray designs are not robust to these errors and are susceptible to loss of experimental information from genes that may be essential for a particular study. Error-correcting codes play a fundamental role in reducing inaccuracies during data transmission in digital communication systems [3]. An important concept in these codes is overhead, the percentage of transmitted bits employed for error correction. The converse quantity is efficiency. 
The simplest approach to error correction employs replication of all bits; however, this carries considerable overhead (low efficiency), and much more economical and elegant schemes have been devised. In this report, we describe a new approach to microarray design that employs the concepts of error-correcting codes. The approach is thus fault tolerant, and expression levels for each gene can be estimated in the presence of corrupted spots. The design is based on the use of a binary encoding scheme in which two or more genes are multiplexed onto each spot. Using a decoding procedure, the expression level for each gene can then be recovered. In the case in which 0 one or more spots are corrupted, the decoder discards these data and computes the expression level for each gene using the remaining spots. The coding scheme has greater efficiency (less overhead) than simple approaches such as duplication of all spots, an important consideration since it is necessary to keep array sizes within bounds. We first describe the error-correcting approach and then studies of error-correcting performance, linearity, and sensitivity. In the first set of investigations, four genes are encoded using six spots, providing robustness to loss of two spots. However, higher degrees of multiplexing can be used to reduce the total number of spots (greater efficiency, less overhead) while still providing error-correcting capabilities. In additional implementations, we demonstrate the utility of this principle. 0 Results Error-correcting codes Error-correcting codes are formulated for finite alphabets and are based on the introduction of redundancy into data transmitted over a channel, in the case of block codes by using k codeword bits to encode n source bits, where k n. Redundancy in the code allows detection and correction of errors. The microarray problem differs in a fundamental way, since gene expression levels are continuously variable. 
Consequently, it is not possible to work in the finite field framework. Nevertheless, because of the impracticality of combining fractional amounts of cDNA for different genes, use of a binary encoding matrix is appropriate. We denote by x the vector of RNA levels corresponding to a set of n genes. We will assume that hybridization rates are unaffected by the multiplexing process. Then the total concentration of RNA y at k multiplexed spots can be written as
$$y = TGSx, \qquad (1)$$
where G is the $k \times n$ binary encoding matrix, S is a diagonal matrix with elements s(j,j) denoting the affinity of RNA from the jth gene to cDNA on the array, and T is a diagonal matrix with elements t(i,i) denoting spot-specific effects, such as size, that are not included in S. The ith row of the encoding matrix G has n entries of value 1 and 0, indicating which of the n genes are encoded in the ith spot through inclusion of their cDNA. The encoding matrix is chosen to maximize error-correcting capabilities while minimizing propagation of noise effects. Let us assume for now that the entire process is linear, concentration levels are measured directly, T and S are identity matrices, and measurement noise is identical and independent at each spot. The expression levels can be computed to minimize error variance by multiplying the measurements y by the pseudoinverse $G^{+}$ of G [4]. It is then possible to design G to minimize the noise variance 0 Slide reading and decoding The multiplex spot signal Differences in affinity and spot sizes from gene to gene make absolute quantitation extremely difficult using cDNA microarrays. Consequently, ratios of intensity between two fluorescence images are typically used to determine relative expression [1]. Let the vector $x^{\mathrm{Cy5}}$ denote the concentration of RNA corresponding to the n genes labeled with Cy5. Let $y^{\mathrm{Cy5}} = TGSx^{\mathrm{Cy5}}$ denote the vector representing the concentrations of labeled RNA hybridized to the k multiplex spots.
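Under the simplifying assumptions stated above (a linear process, T and S equal to the identity, no noise), decoding by the pseudoinverse amounts to a least-squares solve of Eq. (1), and a corrupted spot can be handled by leaving its row out. The sketch below illustrates this in plain Python. The 6 × 4 encoding matrix `G` (each spot carrying one pair of genes), the RNA levels, and the function names are hypothetical choices for this example and are not the authors' design.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("retained spots no longer determine every gene")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def decode(G, y, dropped=()):
    """Least-squares estimate of x from y = G x, ignoring the spots in `dropped`."""
    rows = [i for i in range(len(G)) if i not in dropped]
    n = len(G[0])
    # Normal equations (G^T G) x = G^T y; their solution equals the
    # pseudoinverse applied to y when the retained rows have full column rank.
    gtg = [[sum(G[i][a] * G[i][b] for i in rows) for b in range(n)] for a in range(n)]
    gty = [sum(G[i][a] * y[i] for i in rows) for a in range(n)]
    return solve(gtg, gty)


# Hypothetical 6-spot x 4-gene binary encoding matrix (not the authors' design).
G = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
x_true = [5.0, 1.0, 3.0, 2.0]  # made-up RNA levels
y = [sum(g * x for g, x in zip(row, x_true)) for row in G]  # Eq. (1) with T = S = I
y[2] = 0.0                      # pretend the third spot is corrupted
print(decode(G, y, dropped={2}))
```

With this particular matrix, some pairs of lost spots are not recoverable: dropping the first and last rows, which share no gene, leaves a rank-deficient system and `solve` raises `ValueError`. A design intended to tolerate any two drop-outs would need a different choice of G.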
Similarly, define vectors $x^{\mathrm{Cy3}}$ and $y^{\mathrm{Cy3}}$ for concentrations of Cy3-labeled RNA. The quantity to be extracted from the microarray data is thus the ratio
$$r_i = \log\left(x_i^{\mathrm{Cy5}} / x_i^{\mathrm{Cy3}}\right), \quad i = 1, \dots, n. \qquad (5)$$
We assume the response of the scanner PMT used to measure fluorescence is linear, so that the measured image intensity can be written as
$$I^{\mathrm{Cy5}} = aTGSx^{\mathrm{Cy5}}, \quad I^{\mathrm{Cy3}} = aTGSx^{\mathrm{Cy3}}, \qquad (6)$$
where a is the calibration factor. From these measurements we compute the vector z of log ratios of the multiplexed gene expression levels, i.e.,
$$z_i = \log\left(I_i^{\mathrm{Cy5}} / I_i^{\mathrm{Cy3}}\right), \quad i = 1, \dots, k, \qquad (7)$$
and from these we estimate the expression ratios $r_j$, $j = 1, \dots, n$, as defined in Eq. (5). Nonlinear decoding algorithm We use a nonlinear decoding algorithm to estimate the relative expression levels for each gene. We first identify and discard any corrupted spots to leave an index set that is a subset of $\{1, \dots, k\}$. The remaining spots are then processed by numerically minimizing the function, $J(\hat{x}^{\mathrm{Cy3}}, \hat{x}^{\mathrm{Cy5}})$ 0 assess the quantitative performance and sensitivity of the microarrays over a large dynamic range, 10 different amounts of kidney RNA were cohybridized to each microarray in the 0 Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale 1 Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown* 0 DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used to carry out a comprehensive investigation of the temporal program of gene expression accompanying the metabolic shift from fermentation to respiration. The expression profiles observed for genes with known metabolic functions pointed to features of the metabolic reprogramming that occur during the diauxic shift, and the expression patterns of many previously uncharacterized genes provided clues to their possible functions.
The same DNA microarrays were also used to identify genes whose expression was affected by deletion of the transcriptional co-repressor TUP1 or overexpression of the transcriptional activator YAP1. These results demonstrate the feasibility and utility of this approach to genomewide exploration of gene expression patterns. 0 The complete sequences of nearly a dozen 0 microbial genomes are known, and in the next several years we expect to know the complete genome sequences of several metazoans, including the human genome. Defining the role of each gene in these genomes will be a formidable task, and understanding how the genome functions as a whole in the complex natural history of a living organism presents an even greater challenge. Knowing when and where a gene is expressed often provides a strong clue as to its biological role. Conversely, the pattern of genes expressed in a cell can provide detailed information about its state. Although regulation of protein abundance in a cell is by no means accomplished solely by regulation of mRNA, virtually all differences in cell type or state are correlated with changes in the mRNA levels of many genes. This is fortuitous because the only specific reagent required to measure the abundance of the mRNA for a specific gene is a cDNA sequence. DNA microarrays, consisting of thousands of individual gene sequences printed in a high-density array on a glass microscope slide (1, 2), provide a practical and economical tool for studying gene expression on a very large scale (3-6). Saccharomyces cerevisiae is an especially 0 favorable organism in which to conduct a systematic investigation of gene expression. The genes are easy to recognize in the genome sequence, cis regulatory elements are generally compact and close to the transcription units, much is already known about its genetic regulatory mechanisms, and a powerful set of tools is available for its analysis. 
A recurring cycle in the natural history of yeast involves a shift from anaerobic (fermentation) to aerobic (respiration) metabolism. Inoculation of yeast into a medium rich in sugar is followed by rapid growth fueled by fermentation, with the production of ethanol. When the fermentable sugar is exhausted, the yeast cells turn to ethanol as a carbon source for aerobic growth. This switch from anaerobic growth to aerobic respiration upon depletion of glucose, referred to as the diauxic shift, is correlated with widespread changes in the expression of genes involved in fundamental cellular processes such as carbon metabolism, protein synthesis, and carbohydrate storage (7). We used DNA microarrays to characterize the changes in gene expression that take place during this process for nearly the entire genome, and to investigate the genetic circuitry that regulates and executes this program. Yeast open reading frames (ORFs) were amplified by the polymerase chain reaction (PCR), with a commercially available set of primer pairs (8). DNA microarrays, containing approximately 6400 distinct DNA sequences, were printed onto glass slides by using a simple robotic printing device (9). Cells from an exponentially growing culture of yeast were inoculated into fresh medium and grown at 30°C for 21 hours. After an initial 9 hours of growth, samples were harvested at seven successive 2-hour intervals, and mRNA was isolated (10). Fluorescently labeled cDNA was prepared by reverse transcription in the presence of Cy3 (green)- or Cy5 (red)-labeled deoxyuridine triphosphate (dUTP) (11) and then hybridized to the microarrays (12). To maximize the reliability with which changes in expression levels could be discerned, we labeled cDNA prepared from cells at each successive time point with Cy5, then mixed it with a Cy3-labeled "reference" cDNA sample prepared from cells harvested at the first interval after inoculation.
In this experimental design, the relative fluorescence intensity measured for the Cy3 and Cy5 fluors at each array element provides a reliable measure of the relative abundance of the corresponding mRNA in the two cell populations (Fig. 1). Data from the series of seven samples (Fig. 2), consisting of more than 43,000 expression-ratio measurements, were organized into a database to facilitate efficient exploration and analysis of the results. This database is publicly available on the Internet (13). During exponential growth in glucose-rich medium, the global pattern of gene expression was remarkably stable. Indeed, when gene expression patterns between the first two cell samples (harvested at a 2-hour interval) were compared, mRNA levels differed by a factor of 2 or more for only 19 genes (0.3%), and the largest of these differences was only 2.7-fold (14). However, as glucose was progressively depleted from the growth media during the course of the experiment, a marked change was seen in the global pattern of gene expression. mRNA levels for approximately 710 genes were induced by a factor of at least 2, and the mRNA levels for approximately 1030 genes declined by a factor of at least 2. Messenger RNA levels for 183 genes increased by a factor of at least 4, and mRNA levels for 203 genes diminished by a factor of at least 4. About half of these differentially expressed genes have no currently recognized function and are not yet named. Indeed, more than 400 of the differentially expressed genes have no apparent homology to any gene whose function is known (15). The responses of these previously uncharacterized genes to the diauxic shift therefore provide the first small clue to their possible roles. The global view of changes in expression of genes with known functions provides a vivid picture of the way in which the cell adapts to a changing environment. Figure 3 shows a portion of the yeast metabolic pathways involved in carbon and energy metabolism.
Mapping the changes we observed in the mRNAs encoding each enzyme onto this framework allowed us to infer the redirection in the flow of metabolites through this system. We observed large inductions of the genes coding for the enzymes aldehyde dehydrogenase (ALD2) and acetyl-coenzyme A (CoA) synthase (ACS1), which function together to convert the products of alcohol dehydrogenase into acetyl-CoA, which in turn is used to fuel the tricarboxylic acid (TCA) cycle and the glyoxylate cycle. The concomitant shutdown of transcription of the genes encoding pyruvate decarboxylase and induction of pyruvate carboxylase rechannels pyruvate away from acetaldehyde, and instead to oxaloacetate, where it can serve to supply the TCA cycle and gluconeogenesis. Induction of the pivotal genes PCK1, encoding phosphoenolpyruvate carboxykinase, and FBP1, encoding fructose-1,6-bisphosphatase, switches the directions of two key irreversible steps in glycolysis, reversing the flow of metabolites along the reversible steps of the glycolytic pathway toward the essential biosynthetic precursor, glucose-6-phosphate. Induction of the genes coding for the trehalose synthase and glycogen synthase complexes promotes channeling of glucose-6-phosphate into these carbohydrate storage pathways. Just as the changes in expression of genes encoding pivotal enzymes can provide insight into metabolic reprogramming, the behavior of large groups of functionally related genes can provide a broad view of the systematic way in which the yeast cell adapts to a changing environment (Fig. 4). Several classes of genes, such as cytochrome c-related genes and those involved in the TCA/glyoxylate cycle and carbohydrate storage, were coordinately induced by glucose exhaustion. In contrast, genes devoted to protein synthesis, including ribosomal proteins, tRNA synthetases, and translation, elongation, and initiation factors, exhibited a coordinated decrease in expression.
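As a rough illustration of the fold-change thresholds used in this analysis (induction or repression by a factor of at least 2 or 4), the following Python sketch classifies genes from their Cy5/Cy3 expression ratios. The gene labels and ratio values are hypothetical placeholders, not data from the study, and the function name `classify_by_fold_change` is an assumption made for this example.

```python
import math
from typing import Dict


def classify_by_fold_change(ratios: Dict[str, float], fold: float = 2.0) -> Dict[str, str]:
    """Label each gene as induced, repressed, or unchanged.

    `ratios` maps a gene label to its Cy5/Cy3 ratio (time point / reference).
    A gene is called induced if the ratio is at least `fold`, and repressed
    if it is at most 1/`fold`. Working in log2 space makes the two
    directions symmetric.
    """
    threshold = math.log2(fold)
    calls = {}
    for gene, ratio in ratios.items():
        log_ratio = math.log2(ratio)
        if log_ratio >= threshold:
            calls[gene] = "induced"
        elif log_ratio <= -threshold:
            calls[gene] = "repressed"
        else:
            calls[gene] = "unchanged"
    return calls


# Hypothetical ratios for three placeholder genes (not measured values).
example_ratios = {"geneA": 4.2, "geneB": 0.4, "geneC": 1.3}
calls_2fold = classify_by_fold_change(example_ratios, fold=2.0)
calls_4fold = classify_by_fold_change(example_ratios, fold=4.0)
print(calls_2fold)
print(calls_4fold)
```

Raising `fold` from 2 to 4 shrinks both the induced and the repressed sets, which matches the smaller gene counts reported at the stricter threshold.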
M 0 adulthood, specific combinations of tumor suppressor genes may cooperate to control proliferation, differentiation, and survival in different cell lineages. 0 Microarray Analysis of Drosophila Development During Metamorphosis 1 Kevin P. White,* Scott A. Rifkin, Patrick Hurban, David S. Hogness 0 Metamorphosis is an integrated set of developmental processes controlled by a transcriptional hierarchy that coordinates the action of hundreds of genes. In order to identify and analyze the expression of these genes, high-density DNA microarrays containing several thousand Drosophila melanogaster gene sequences were constructed. Many differentially expressed genes can be assigned to developmental pathways known to be active during metamorphosis, whereas others can be assigned to pathways not previously associated with metamorphosis. Additionally, many genes of unknown function were identified that may be involved in the control and execution of metamorphosis. The utility of this genome-based approach is demonstrated for studying a set of complex biological processes in a multicellular organism. The generation of vast amounts of DNA sequence information, coupled with advances in technologies developed for the e 0 A common reference for cDNA microarray hybridizations 1 Ellen Sterrenburg, Rolf Turk, Judith M. Boer, Gertjan B. van Ommen and Johan T. den Dunnen* 0 ABSTRACT Comparisons of expression levels across different cDNA microarray experiments are easier when a common reference is co-hybridized to every microarray. Often this reference consists of one experimental control sample, a pool of cell lines or a mix of all samples to be analyzed. We have developed an alternative common reference consisting of a mix of the products that are spotted on the array. Pooling part of the cDNA PCR products before they are printed and their subsequent amplification towards either sense or antisense cRNA provides an excellent common reference. 
Our results show that this reference yields a reproducible hybridization signal in 99.5% of the cDNA probes spotted on the array. Accordingly, a ratio can be calculated for every spot, and expression levels across different hybridizations can be compared. In dye-swap experiments this reference shows no significant ratio differences, with 95% of the spots within an interval of ±0.2-fold change. The described method can be used in hybridizations with both amplified and non-amplified targets, is time saving and provides a constant batch of common reference that lasts for thousands of hybridizations. INTRODUCTION cDNA microarraying is currently widely used to assess differential gene expression (1). Simultaneous hybridization of two samples labeled with different fluorescent dyes provides an intensity ratio that reflects the relative mRNA levels (2). Though adequate for comparison of two samples, assessment of expression levels across multiple samples, for example in a time series, becomes complicated. For multiarray comparisons, hybridization of a common reference sample simultaneously with each experimental sample is recommended (3,4). Initially one sample, e.g. mRNA originating from one cell line or time point zero, was used as a common reference (5-7). A disadvantage of this approach is that the control sample does not provide a signal in all spots and, since for these no ratio can be calculated, they are usually disregarded in the analysis. Sometimes these gaps are filled in by applying a program that is designed to estimate missing values (8). However, to avoid using an estimation program or other alternatives, the ideal reference should ensure consistent and non-zero values for all probes on the array, guaranteeing that no information is lost when the ratios are calculated (4).
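To illustrate why a non-zero reference signal matters, the sketch below compares two samples hybridized on separate arrays through their shared reference. In log space the reference term cancels, and a spot without reference signal yields no ratio at all. The intensities and the function names are hypothetical and serve only to show the arithmetic.

```python
import math
from typing import Optional


def log2_ratio(sample: float, reference: float) -> Optional[float]:
    """log2(sample / reference) for one spot, or None if no ratio can be formed."""
    if sample <= 0 or reference <= 0:
        return None  # no usable signal: the spot would be dropped or imputed
    return math.log2(sample / reference)


def compare_via_reference(ratio_a: Optional[float], ratio_b: Optional[float]) -> Optional[float]:
    """log2(A / B) for the same probe on two arrays sharing one reference.

    log2(A/ref) - log2(B/ref) = log2(A/B), so the reference cancels out.
    """
    if ratio_a is None or ratio_b is None:
        return None
    return ratio_a - ratio_b


# Hypothetical spot intensities for one probe on two different arrays.
array1 = log2_ratio(sample=800.0, reference=200.0)  # sample A vs reference
array2 = log2_ratio(sample=200.0, reference=400.0)  # sample B vs reference
a_vs_b = compare_via_reference(array1, array2)
print(a_vs_b)

# A probe with no reference signal cannot be compared across arrays.
missing = compare_via_reference(log2_ratio(800.0, 0.0), array2)
print(missing)
```

The second case is the situation the text describes for a single control sample used as reference: probes it does not light up produce gaps, whereas a reference with signal in every spot avoids them.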
A reference consisting of a labeled PCR product from a part of the vector that all the spotted probes have in common, as has been described for filter hybridizations, meets this criterion (9). However, it will not compete with the target cDNA for hybridization to the specific sequence of the probe. Consequently, the ratios obtained from such a hybridization may not always reflect the amount of RNA present in the experimental sample (e.g. saturated spots). Another described common reference consists of a pool of RNA originating from different cell lines (3,10-12). This approaches the ideal situation, but cell culturing is very time- and space-consuming. In addition, gene expression in the pooled cell lines may not represent all genes present on the array and it may change over time under even slightly different growth conditions and other variables like passage number. Furthermore, it is difficult to repeatedly quantify and pool large amounts of RNAs from multiple sources in a reliable and reproducible way. Bergstrom et al. used such a common reference and reported a coverage of 90% of the array by the reference (13). An alternative to this method, which does provide signal in all spots that need to be analyzed, is pooling part of the RNA of all the experimental samples (e.g. cell lines or biopsies) which will be used in that particular experiment (4,14). The disadvantage here is that this approach is experiment specific and each time a new experiment is performed, a new reference pool has to be made. Furthermore, if the amount of experimental samples is limiting, it is not possible to use part of it for the common reference and if one wants to study individual samples (e.g. new incoming patients), there is no reference sample present. The experiments presented here demonstrate the use of a common reference for cDNA microarrays consisting of a mix of all probes spotted on the array.
The PCR reference is made by pooling a fraction of all amplified probes before they are printed. Single-stranded products are synthesized in a subsequent in vitro transcription reaction and the product is labeled in parallel with the experimental target. The method can be used in hybridizations with both amplified and non-amplified 0 dichloromethane was used. After extraction, the aqueous layer was transferred to a fresh tube and purified and concentrated by ethanol precipitation. Antisense cRNA transcripts were generated using the Ampliscribe Sp6 High Yield Transcription kit (Epicentre), starting with 1 µg of pooled PCR product (Fig. 1). In addition to the protocol, 1 µl of RNasin (Fermentas) was added and the reaction was incubated at 42°C for 3 h. The generated cRNA was washed three times with 450 µl of diethylpyrocarbonate-treated water using a Microcon-100 column (Millipore). cRNA (750 ng) was reverse transcribed with random hexamers, and labeled through incorporation of Renaissance cyanine 5-dUTP (Cy5) or Renaissance cyanine 3-dUTP (Cy3) (NEN) according to the protocols of Ross et al. (12) with the following modifications: 8 µg of random hexamer primers were used in the reaction and before incubation at 42°C the mixture was incubated at room temperature for 10 min. Target preparation Human fibroblast cultures were grown in DMEM without phenol red (Gibco BRL) supplemented with 1% glucose, 2% glutamax, 100 U/ml penicillin, 100 µg/ml streptomycin and 10% heat-inactivated fetal bovine serum (Gibco BRL). Cells were coll 0 BETWEEN GENOTYPE AND PHENOTYPE: PROTEIN CHAPERONES AND EVOLVABILITY 1 Suzanne L. Rutherford 0 Protein chaperones direct the folding of polypeptides into functional proteins, facilitate developmental signalling and, as heat-shock proteins (HSPs), can be indispensable for survival in unpredictable environments. Recent work shows that the main HSP chaperone families also buffer phenotypic variation.
Chaperones can do this either directly through masking the phenotypic effects of mutant polypeptides by allowing their correct folding, or indirectly through buffering the expression of morphogenic variation in threshold traits by regulating signal transduction. Environmentally sensitive chaperone functions in protein folding and signal transduction have different potential consequences for the evolution of populations and lineages under selection in changing environments. 0 The heat-shock proteins (HSPs) are highly conserved families of enzymes and CHAPERONES that are involved in the folding and degradation of damaged proteins. They are rapidly and concertedly mobilized in large numbers by cells that are under stress. The mobilization of HSPs is an important component of a universal and tightly orchestrated stress response that has probably allowed organisms to survive otherwise lethal temperatures throughout evolution1,2. Even at normal temperatures, several HSP chaperones are essential for viability, and promote the successful folding and activity of many cellular proteins2-4. Recent reports document further roles of some of the constitutively important chaperone families that are expressed at the population level5-8. Genetic or pharmacological manipulation of these chaperones alters the expression of genetic variation in several systems. Therefore, as well as having a vital role in stress physiology, chaperones also provide a plausible molecular mechanism for regulating the capacity of populations and lineages for evolutionary adaptation to changing environments -- EVOLVABILITY. It is thought that during periods of environmental stress, competition for chaperones by stress-damaged proteins compromises the ability of the chaperones to protect or fold their usual targets, thereby reducing the activities of most target proteins9,10. 
According to recent studies, the modulation of chaperone and target functions in response to stress would alternately mask and expose phenotypic variation, depending on the degree of stress and the availability of free chaperones11-14. This indicates that chaperones control a reserve of neutral genetic variation, which builds up in populations under normal conditions and could be expressed as heritable phenotypic variation during periods of environmental change. As the rate of evolution is limited by heritable variation in fitness, this chaperone-mediated mechanism might allow populations and lineages to better adapt to severe environmental change. The expression of random genetic variation is expected to be largely deleterious to individual fitness. However, both individual organisms and interbreeding groups of organisms produce the differential `births' (new individuals or groups) and `deaths' (loss of reproductive fitness or extinction) that are required for evolution. Under certain circumstances, population-level traits can increase group fitness more than they decrease individual fitness, even though the evolutionary forces that operate at each 0 A class of proteins that, by preventing improper associations, assist in the correct folding or assembly of other proteins in vivo, but that are not a part of the mature structure. 0 NATURE REVIEWS | GENETICS 0 Nature Publishing Group 0 The ability of random genetic variation to produce phenotypic changes that can increase fitness (intrinsic evolvability) or the ability of a population to respond to selection (extrinsic evolvability). Extrinsic evolvability depends on intrinsic evolvability, as well as on external variables such as the history, size and structure of the population. 0 GROUP SELECTION 0 Selection on traits that increase the relative fitness of populations or lineages of organisms at some fitness cost to individuals. 
All of the feasible mechanisms require selection on lineages or small interbreeding groups of related individuals in subdivided populations. 0 MUTATION LOAD 0 The accumulated deleterious alleles that are carried by a population at any given time. 0 EXPRESSED MUTATION RATE 0 The rate of phenotypic change that results from the continuing accumulation of new mutations (expressed mutation rate = total mutation rate - neutral mutation rate). 0 THRESHOLD TRAITS 0 Quantitative traits that are discretely expressed in a limited number of phenotypes (usually two), but which are based on an assumed continuous distribution of factors that contribute to the trait (underlying liability). 0 evolutionary time6. This work attracted the attention of biologists ranging from protein biochemists to ecologists and evolutionary biologists17-20. Recent experiments indicate that p 0 USING DROSOPHILA AS A MODEL INSECT 1 David Schneider 0 The fruitfly Drosophila melanogaster has become such a popular model organism for studying human disease that it is often described as a little person with wings. This view has been strengthened with the sequencing of the Drosophila genome and the discovery that 60% of human disease genes have homologues in the fruitfly. In this review, I discuss the approach of using Drosophila not only as a model for metazoans in general but as a model insect in particular. Specifically, I discuss recent work on the use of Drosophila to study the transmission of disease by insect vectors and to investigate insecticide function and development. 0 Insects transmit pathogens that sicken and kill millions of people annually. Between 300 and 500 million people are infected each year, and more than a million die from malaria alone (WHO 2000 report on health). To put these numbers into perspective, the number of people killed by malaria in 1998 was comparable to the number of people killed by breast and prostate cancer, melanoma and leukaemia combined. 
Although malaria is, by far, the most serious insect-borne disease, there are still other arthropod-transmitted illnesses, such as Chagas disease, leishmaniasis, sleeping sickness and river blindness, which infect hundreds of thousands of people each year. Insects are vectors for many animal, as well as human, diseases. Furthermore, insects affect human health by damaging our food supply and, by eating and damaging crops, insects also function as vectors for various plant diseases. Because of the development of insecticide resistance in vector insects and insect pests, and because of the development of antibiotic resistance in disease-causing organisms, it is essential to continuously develop new methods to fight these scourges. In addition, we must devise effective approaches to fight the spread of disease where none has existed before. This will involve developing new pesticides and antibiotics to keep ahead of resistance, as well as new, unique approaches. For example, modern molecular biology has led to the development of transgenic crops that are resistant to insects. This approach can narrow the target range of control techniques and limit our dependence on chemical insecticides. Similarly, creative applications on the basis of knowledge of insect biology should yield even more results. The massive amount of information known about Drosophila should be put to use in this endeavour. This review is divided into three parts. First, I discuss briefly how the fruitfly has been used to solve general problems in insect biology. Second, I discuss how insects act as vectors for human diseases and how our understanding of Drosophila biology has contributed to this field1 (see link to human homologues in the fruit fly). Last, I discuss how Drosophila can help us to understand and control agricultural pests. This is not intended to be a global review of modern approaches to vector biology or pesticide research. 
This review focuses on how Drosophila has informed or could inform work in the fields of vector biology and pesticide research. 0 The fruitfly as a model insect 0 The fruitfly has been a general testing ground for genetic concepts and techniques that have applications for both vector biology and pest control. For example, a promising twist on the `sterile-male' technique, used to reduce insect population size, has been modelled in Drosophila2. Typically, sterile-male projects involve isolating large numbers of male insects and then sterilizing them using radiation. The males are then released into the wild, where they overwhelm local males, and prevent productive matings from occurring. Although this approach has been used successfully to reduce populations of screw worms and the tsetse fly3, its effectiveness is limited by the ability to isolate large numbers of homogeneous male populations and by the reduced viability of irradiated males. The new technique, developed in the fruitfly, takes advantage of a tetracycline-repressible transcription transactivator (TRTT). The first step is to create fruitflies that express the TRTT-encoding gene under the control of a yolk promoter, so that expression is limited to females. The fly is also made transgenic for a dominant-lethal gene that is expressed under the control of TRTT. This permits easy sorting of males because, in the absence of tetracycline, all female offspring die, whereas males are unaffected by tetracycline treatment because they never express the transactivator. The technique also results in non-productive matings, as all female offspring die and all male offspring carry, and will transmit, the lethal constructs. The net result is a simple method of producing male fruitflies and a simple method of sterilizing a population.
Now that the value of this approach has been shown in the fruitfly, the procedure should be applied to other insects as genetic transformation becomes more readily available. A second example of how advances in our understanding of Drosophila biology have improved our ability to manipulate insects is the development of methods for transforming genes into other insects4,5. The fruitfly has functioned both as a source of transposable elements and as a system for developing transformation techniques. Both avenues have led to the genetic transformation of mosquitoes6,7. Transgenic tools will facilitate the dissection of mosquito-parasite interactions and could lead to the development of parasite-resistant vectors. These two examples show the usefulness of the fruitfly in pioneering technologies that should be central in understanding and controlling the spread of insect-borne diseases and insect pests in general. In both examples, the fruitfly is used not because we are interested in studying it but because Drosophila is the simplest insect to manipulate. 0 Insects as vectors of human disease 0 ANOPHELINE 0 A mosquito of the subfamily that includes the genus Anopheles, and which may transmit malaria. 0 CULICINE 0 A mosquito of the subfamily that includes the genera Mansonia, Aedes and Culex, and which may transmit several diseases. 0 There is a large variety of vector-borne diseases (TABLE 1). From bacteria to viruses, and protozoans to worms, almost every type of pathogen has adapted to use insects as vectors (FIG. 1). Vectors provide a means of getting in and out of the vertebrate host by hitching a ride in a blood meal. In practice, however, insects are not usually passive carriers when transmitting disease from animal to animal; instead, parasites must overcome many barriers to colonize the insect host8,9. There are situations where passive transmission occurs but, for the diseases listed in TABLE 1, biological transmission is the rule10.
This review focuses on malaria because this is, by far, the most life-threatening of all insect-borne diseases. In humans, malaria is caused by four species of the protozoan genus Plasmodium11 (FIG. 2), of which a single species, P. falciparum, is responsible for most malarial deaths. There is stringent host-parasite specificity for most species of plasmodia when interacting with both their vertebrate and insect hosts. For example, ANOPHELINE mosquitoes are the insect vectors for all human-specific plasmodia whereas Plasmodium gallinaceum, which infects ground fowl, uses CULICINE mosquitoes as vectors12. Multifaceted approaches, such as the coordinated use of vaccines, antibiotics and public health measures have been important in limiting disease. Unfortunately, few of these tools are available to fight malaria. There is, at present, no vaccine against any of the plasmodia strains that infect humans. Furthermore, parasites have developed resistance to many of the drugs available to fight the disease13-15 and probably will develop resistance to new drugs as they are introduced. Because most people afflicted with malaria reside in developing cou 0
Table 1 | Arthropod-borne diseases
Pathogen    Disease              Vector
Viruses     Dengue fever         Mosquito
            West Nile fever      Mosquito
            Yellow fever         Mosquito
Bacteria    Plague               Flea
            Typhus               Louse
            Lyme disease         Tick
Protozoa    Malaria              Mosquito
            Leishmaniasis        Sand fly
            Sleeping sickness    Tsetse
            Chagas disease       Kissing bug
Worms       River blindness      Black fly
            Filariasis           Mosquito
0 Chagas disease, typhus Leishmania, plague, sleeping sickness Malaria, filariasis, arbovirus 0 NATURE REVIEWS | GENETICS 0 Macmillan Magazines Ltd 0 The DNA sequence and analysis of human chromosome 6 1 A. J. Mungall*, S. A. Palmer, S. K. Sims, C. A. Edwards, J. L. Ashurst, L. Wilming, M. C. Jones, R. Horton, S. E. Hunt, C. E. Scott, J. G. R. Gilbert, M. E. Clamp, G. Bethel, S. Milne, R. Ainscough, J. P. Almeida, K. D. Ambrose, T. D. Andrews, R. I. S. Ashwell, A. K. Babbage, C. L. Bagguley, J.
Bailey, R. Banerjee, D. J. Barker, K. F. Barlow, K. Bates, D. M. Beare, H. Beasley, O. Beasley, C. P. Bird, S. Blakey, S. Bray-Allen, J. Brook, A. J. Brown, J. Y. Brown, D. C. Burford, W. Burrill, J. Burton, C. Carder, N. P. Carter, J. C. Chapman, S. Y. Clark, G. Clark, C. M. Clee, S. Clegg, V. Cobley, R. E. Collier, J. E. Collins, L. K. Colman, N. R. Corby, G. J. Coville, K. M. Culley, P. Dhami, J. Davies, M. Dunn, M. E. Earthrowl, A. E. Ellington, K. A. Evans, L. Faulkner, M. D. Francis, A. Frankish, J. Frankland, L. French, P. Garner, J. Garnett, M. J. R. Ghori, L. M. Gilby, C. J. Gillson, R. J. Glithero, D. V. Grafham, M. Grant, S. Gribble, C. Griffiths, M. Griffiths, R. Hall, K. S. Halls, S. Hammond, J. L. Harley, E. A. Hart, P. D. Heath, R. Heathcott, S. J. Holmes, P. J. Howden, K. L. Howe, G. R. Howell, E. Huckle, S. J. Humphray, M. D. Humphries, A. R. Hunt, C. M. Johnson, A. A. Joy, M. Kay, S. J. Keenan, A. M. Kimberley, A. King, G. K. Laird, C. Langford, S. Lawlor, D. A. Leongamornlert, M. Leversha, C. R. Lloyd, D. M. Lloyd, J. E. Loveland, J. Lovell, S. Martin, M. Mashreghi-Mohammadi, G. L. Maslen, L. Matthews, O. T. McCann, S. J. McLaren, K. McLay, A. McMurray, M. J. F. Moore, J. C. Mullikin, D. Niblett, T. Nickerson, K. L. Novik, K. Oliver, E. K. Overton-Larty, A. Parker, R. Patel, A. V. Pearce, A. I. Peck, B. Phillimore, S. Phillips, R. W. Plumb, K. M. Porter, Y. Ramsey, S. A. Ranby, C. M. Rice, M. T. Ross, S. M. Searle, H. K. Sehra, E. Sheridan, C. D. Skuce, S. Smith, M. Smith, L. Spraggon, S. L. Squares, C. A. Steward, N. Sycamore, G. Tamlyn-Hall, J. Tester, A. J. Theaker, D. W. Thomas, A. Thorpe, A. Tracey, A. Tromans, B. Tubby, M. Wall, J. M. Wallis, A. P. West, S. S. White, S. L. Whitehead, H. Whittaker, A. Wild, D. J. Willey, T. E. Wilmer, J. M. Wood, P. W. Wray, J. C. Wyatt, L. Young, R. M. Younger, D. R. Bentley, A. Coulson, R. Durbin, T. Hubbard, J. E. Sulston, I. Dunham, J. Rogers & S. 
Beck* 0 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK 0 Following the announcement of the completion of the human genome project on 14 April 2003, we present here our findings on the mapping, sequencing and analysis of chromosome 6. Chromosome 6 was best known for the major histocompatibility complex (MHC), a region of 3.6 megabases (Mb) on band 6p21.3 of the short arm. The MHC has an essential role in the innate and adaptive immune system, and is characterized by high gene density, high polymorphism and high linkage disequilibrium. Much of what we know today about genetic variation and the organization of haplotypes was first discovered from studies of this region. At a time when genetic variation was assessed by serology rather than sequence, the term `haplotype' was first introduced to describe "the combination of individual antigenic [MHC] determinants that are positively controlled by an allele"1. Because of its crucial role in immunity and its association with many common diseases, the MHC was sequenced well ahead of the rest of chromosome 6 (ref. 2). Particular care was taken to ensure that the highest quality was achieved for the sequence, analysis and annotation of chromosome 6. The annotation of all gene structures was manually checked and, in some cases, led to the correction of known reference genes. In addition to the genome sequences of Mus musculus and Tetraodon nigroviridis, the comparative analysis was enhanced by the inclusion (for the first time in the analysis of human chromosomes) of the recently assembled genomes of Rattus norvegicus, Fugu rubripes and Danio rerio. Our analysis is available through the new vertebrate genome annotation (VEGA) database (http://vega.sanger.ac.uk/), making the chromosome 6 annotation a high-quality and instantly available resource.
0 Clone map and sequence map 0 Bacterial clone contigs were assembled using restriction enzyme fingerprinting and sequence-tagged site (STS) content analysis of the clones, anchored to a radiation hybrid (RH) map with a marker density of 16 per Mb. A tiling path of 1,797 clones and polymerase chain reaction (PCR) fragments (see Supplementary Table S1) was selected for sequencing, spanning the chromosome in nine contigs separated by gaps of 50-200 kilobases (kb), as estimated by DNA fibre fluorescence in situ hybridization (FISH) (see Supplementary Table S2). All but two gaps (gaps 2 and 6) reside in the pericentromeric or sub-telomeric chromosomal regions. We assessed the chromosome coverage in several ways. First, 38% of the clones selected for sequencing were hybridized to metaphase chromosomes using FISH. This provided independent support of the map construction and also highlighted the presence of intra- and interchromosomal repeats. Next we identified known chromosome 6 markers in both genetic (deCODE3 and Marshfield comprehensive genetic maps4) and RH maps (n = 3,036). D6S1694 was the only genetic marker found to be absent from the sequence. The position of D6S1694 on these maps indicates that it is likely to reside within gap 6, between the sequences AL135906 and AL731777. We also accounted for all RefSeq genes mapping to chromosome 6. In the final sequence, no RefSeq gene was entirely missing. Three RefSeq 0 Nature Publishing Group 0 MICROARRAY TECHNOLOGIES Creation of a minimal tiling path of genomic clones for Drosophila: provision of a common resource 1 Volker Hollich1, Eric Johnson2, Eileen E. Furlong3, Boris Beckmann1, Joseph Carlson4, Susan E. Celniker4, and Joerg D. Hoheisel1 0 INTRODUCTION Representing the entire genome of an organism on DNA microarrays rather than the coding regions only is prerequisite to various functional analyses, such as chromatin immunoprecipitation experiments (1).
But even for transcriptional profiling analyses, it could be advantageous, since a comprehensive coverage would by definition represent a complete and normalized gene repertoire irrespective of the status of sequence annotation. In order to produce a genomic tiling path, typically, a large set of PCR primers is designed on the basis of the genome sequence. A recent publication (2) reports on experiments performed on a relatively small set of such fragments that represent in total about 3 Mb of the Drosophila chromosomes 2 and 3. However, this approach is rather time-consuming and expensive. For coverage of the entire 115-Mb Drosophila sequence with 3-kb non-overlapping fragments, more than 76,000 primer molecules would be needed (115 Mb divided by 3 kb gives roughly 38,300 fragments, each requiring two primers). Alternatively, the very DNA fragments on which the sequencing process was performed could be utilized to such an end. Since usually shotgun clones form the basis of large-scale sequencing projects, all fragments could be readily amplified with a single primer pair, thus creating enormous savings in time and expense. 0 BioTechniques 0 Slightly disadvantageous is the fact that the fragments cannot be placed end-to-end, but would overlap in part. Thus, slightly more fragments would be needed to cover a genome. However, a certain degree of redundancy in coverage may prove to be beneficial for analytical purposes. Adopting the latter strategy, we set out to cover the genome of Drosophila melanogaster by selecting a minimal tiling path across the entire genome from the bacterial artificial chromosome (BAC)-based subclone libraries used in the sequencing project (3). MATERIALS AND METHODS Clone Selection Based on the sequencing data, a minimal tiling path was calculated for each subclone contig. This was accomplished by construction of a directed acyclic graph for every contig. Within this graph, each clone is represented by a vertex, and the set of vertices within the contig is called V.
An edge between two vertices is intro- 0 ing project--sublibraries covering regions 1-11 of chromosome X and all of the left arm of chromosome 3 as well as a global shotgun library--had been destroyed prior to the start of this initiative. To construct the tiling path, we initially determined the sequence positions of the subclones within the regions that are defined by 638 BAC clones (5). This included not only subclones, which had been produced from the respective BAC, but also subclones derived from P1 clones generated during an earlier phase of the sequencing project. The D. melanogaster chromosome arms of euchromatic sequence Release 3 (6) had been constructed by joining the individual sequences that represent the BAC clone inserts. As a result, the location of each BAC within an arm is known precisely. As a control, we compared the distance of the BAC end sequences within the genomic sequence and the actual length of each BAC insert used in our analysis. On the template of overlapping BAC sequences, the position of the shotgun clones was extracted from the Phrap sequence assembly, thus defining the start and end of each subclone insert. Subsequently, overlapping subclones were 0 combined into contigs. Because of both unfinished BACs and missing shotgun clones, however, 2641 gaps remained in addition to the absent X(1-11) and 3L areas (Figure 1). These gaps could not be filled with 2-kb clones from the whole genome shotgun approach, since these clones were not available either. Since the tiling path is based on randomly produced fragments, there is bound to be some overlap between them. However, as known from earlier analyses (e.g., References 2 and 7), this is rather an advantage (e.g., increasing resolution and providing some degree of redundancy). In the selection process of a minimal path, minimizing the degree of overlap between clones on 2L gave rise to 1.1% more clones, whereas aiming at a minimal total of clones resulted in 39.2% more overlap. 
This is due to the variation in clone lengths. Thus, the clone with the least overlap might be shorter than another clone, which spans further. Analyses on the other arms led to similar results. As the overlap-optimized path has only a small percentage of additional clones, we decided to base our minimal tiling path on this selection process, resulting in a set of 25,135 clones. In Figure 1, the coverage of the chromo- 0 Open Access 1 M Hild¤*, B Beckmann¤, SA Haas¤, B Koch*, V Solovyev§, C Busold, K Fellenberg, M Boutros¶, M Vingron, F Sauer*¥, JD Hoheisel and R Paro* 0 An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome 0 Hild et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
0 Background: While the genome sequences for a variety of organisms are now available, the precise number of the genes encoded is still a matter of debate. For the human genome several stringent annotation approaches have resulted in the same number of potential genes, but a careful comparison revealed only limited overlap. This indicates that only the combination of different computational prediction methods and experimental evaluation of such in silico data will provide more complete genome annotations. In order to get a more complete gene content of the Drosophila melanogaster genome, we based our new D. melanogaster whole-transcriptome microarray, the Heidelberg FlyArray, on the combination of the Berkeley Drosophila Genome Project (BDGP) annotation and a novel ab initio gene prediction of lower stringency using the Fgenesh software. Results: Here we provide evidence for the transcription of approximately 2,600 additional genes predicted by Fgenesh. Validation of the developmental profiling data by RT-PCR and in situ hybridization indicates a lower limit of 2,000 novel annotations, thus substantially raising the number of genes that make a fly. Conclusions: The successful design and application of this novel Drosophila microarray on the basis of our integrated in silico/wet biology approach confirms our expectation that in silico approaches alone will always tend to be incomplete. The identification of at least 2,000 novel genes highlights the importance of gathering experimental evidence to discover all genes within a genome.
Moreover, as such an approach is independent of homology criteria, it will allow the discovery of novel genes unrelated to known protein families or those that have not been strictly conserved between species. 0 Genome Biology 2003, 5:R3 0 Results and discussion 0 Combined annotation 0 To overcome the known limitations in gene prediction, we constructed our Drosophila transcriptome microarray by first combining the BDGP Drosophila genome annotation Release 2 and the BDGP cDNA collection Release 1 [15] and then we also included an ab initio prediction based on the Fgenesh software [16]. We merged the combined BDGP set with the 20,622 Fgenesh predicted genes (Heidelberg Prediction, Heidelberg Collection (HDC)), based on the assumption that predictions showing an overlap of more than 30% of their exon sequences represent the same gene, resulting in a set of 21,396 potential genes (Figure 1). While the fact that nearly 97% of the BDGP genes were also predicted by Fgenesh validates our overlap criterion, we still found a further 7,464 predicted genes (36.2%; HDC unique) not represented in the BDGP annotation. 0 Computational analysis of the combined annotation 0 The simplest explanation for the high number of HDC unique predictions may be the relaxed stringency criterion applied. Consequently, a careful inspection of the two sets (BDGP/FlyBase versus HDC) showed a high degree of similarity for most common predictions; differences were largely confined to the 5' and 3' ends of the predictions as may be expected.
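The 30% exon-overlap criterion used for merging the two prediction sets can be sketched in Python as follows. The text does not state which prediction the percentage is measured against, so this sketch assumes it is the shorter of the two; the function names, the half-open coordinate convention and the requirement that exons within one prediction do not overlap each other are likewise illustrative assumptions.

```python
def exon_length(exons):
    """Total number of bases in a list of half-open (start, end) exons."""
    return sum(end - start for start, end in exons)

def shared_exon_bases(exons_a, exons_b):
    """Bases covered by exons of both predictions.

    Assumes exons within each prediction do not overlap one another.
    """
    total = 0
    for s1, e1 in exons_a:
        for s2, e2 in exons_b:
            total += max(0, min(e1, e2) - max(s1, s2))
    return total

def same_gene(exons_a, exons_b, threshold=0.30):
    """True if the shared exon bases exceed `threshold` of the shorter
    prediction's exon length (the choice of denominator is an assumption)."""
    shorter = min(exon_length(exons_a), exon_length(exons_b))
    if shorter == 0:
        return False
    return shared_exon_bases(exons_a, exons_b) / shorter > threshold


bdgp_model = [(100, 200), (300, 400)]      # hypothetical BDGP prediction
hdc_model_1 = [(150, 250), (500, 600)]     # shares 50 of 200 exon bases
hdc_model_2 = [(120, 200), (300, 400)]     # shares all 180 of its exon bases
```

With these toy coordinates, `same_gene(bdgp_model, hdc_model_1)` is false (25% overlap) and `same_gene(bdgp_model, hdc_model_2)` is true, so only the second pair would be collapsed into one potential gene.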
This is not only because ab initio gene prediction algorithms have most difficulties in locating the precise ends of a gene, but also because the HDC predictions contain only coding regions - while the BDGP/FlyBase annotat 0 Shotgun DNA microarrays and stage-specific gene expression in Plasmodium falciparum malaria 0 Q 2000 Blackwell Science Ltd 0 unravelling additional important aspects of malaria biology and the general approach may be applied to any organism, regardless of how much of its genome is sequenced. 0 Introduction In the fight against malaria, there are only eight commonly used drugs and no reliable vaccines (White, 1996; Holder, 1999). Many strains of the malaria parasite Plasmodium falciparum are now resistant to our antimalarial compounds (Peters, 1998) and, in some parts of the world, resistance to new antimalarial agents may be occurring faster than before (Rathod et al., 1997). To help overcome these problems, global malaria initiatives have invested heavily in sequencing the Plasmodium falciparum genome and the next challenge is to correlate genome sequences to function (Wellems et al., 1999). Based on sequencing efforts to date, about half the malarial genome coding regions will have unknown function (Gardner et al., 1998; Bowman et al., 1999). Relating these genome sequences to malaria biology will be particularly challenging because the experimental tools to study malaria are limited (Wellems et al., 1999). First, most species of malarial parasites and most stages of P. falciparum cannot be routinely maintained in cell culture. Even the erythrocytic cycle of P. falciparum, which can be cultured, is very slow, labour intensive, and expensive to propagate. Second, the experimental power of transfection technology in P. falciparum and other malarial species is restricted at present. 
Although the erythrocytic stages can be transfected, gene disruptions are only possible for non-essential genes, as this part of the parasite life cycle is haploid (Wellems et al., 1999). Gene replacement is not possible because there is no negative selection system. Transfection efficiencies in P. falciparum are so poor that no gene function has been established purely on the basis of genetic complementation with a library of malarial genes or through a population of random knock outs. Finally, as the complete sexual life cycle of P. falciparum can only be studied in mosquitoes and as yet not in vitro, classical genetics can only be performed with great difficulty (Walliker et al., 1987). Not surprisingly, only two genetic crosses have been performed with malaria parasites and only a handful of traits have been mapped (Walliker et al., 1987; Vaidya et al., 1995; Wang et al., 1997; Wellems et al., 1999). 0 Shotgun DNA microarray for malaria Clearly, there is an urgent need for additional methods for assessing gene function in malaria. Recently, it has become possible to decipher transcriptional programmes of organisms by studying gene expression en masse (Brown and Botstein, 1999). DNA microarray technologies offer an opportunity to look at changes in gene expression in thousands of genes simultaneously under different physiological conditions (DeRisi et al., 1997; DeRisi and Iyer, 1999). Because the malarial genome is not completely sequenced, a variation on the standard array technology was used in this study. Inserts from a malarial genomic library were arrayed randomly to generate `shotgun' microarrays. To measure variation in expression of genes during the parasite life cycle, the arrays were probed with differentially labelled cDNAs prepared from total RNA isolated from cells at defined developmental stages. PCR products on the array that showed differential hybridization were sequenced. 0 et al., 1984; Vernick and McCutchan, 1998). 
Such digestion was expected to capture long stretches of unique coding regions and avoid over-representation of flanking sequences or introns on the array. Individual colonies from the unamplified library were immediately transferred to a 96-well plate. Amplified inserts from 8000 independent clones were analysed by agarose gel electrophoresis. Only PCR products greater than about 300 bp were applied to the DNA array. The average size of the insert applied to the array was 1-2 kb, but some clones had PCR products as large as 5 kb. In addition to clones from this library, several previously characterized genes encoding stage-specific malarial surface antigens (MSP1, Pfs25, Pfs28, Pfs48/45) were included in the prototype array (Holder, 1988; Kaslow et al., 1988; Duffy et al., 1993; Kocken et al., 1993). Transcriptional differences between trophozoites and gametocytes The usefulness of the shotgun microarray for analysing malarial transcription programmes was evaluated by comparing gene expression between two differentiated forms of Plasmodium. Trophozoite-specific RNA was used as a template to generate Cy3-labelled cDNA (green fluorescence) and late-stage gametocyte-specific RNA was used to generate Cy5-labelled cDNA (red fluorescence). Equal amounts of the two labelled cDNA populations were mixed and hybridized to the shotgun microarray. Fluorescence signals from Cy3 and Cy5 label were separately measured at each spot on the array using a 0 Results and discussion Array construction The malaria shotgun microarray was constructed by printing 3648 PCR-amplified inserts from a P. falciparum DNA library (Fig. 1). To provide as complete a representation of genes as possible, and to minimize bias towards specific sequences, a mung bean nuclease genomic library was used. Mung bean nuclease preferentially cuts malarial DNA in regions flanking coding regions (McCutchan 0 © 2000 Blackwell Science Ltd, Molecular Microbiology, 35, 6-14 0 R. E. Hayward et al. genes).
Third, when the 50 arrayed genes showing the highest red/green fluorescence and the 35 genes with the highest green/red fluorescence were sequenced, they were found to include several previously known stage-specific genes (Table 2A and B). Among the trophozoite-selective gene transcripts identified in this way, MSP-1 was represented twice (Table 2A). Other transcripts such as HRP-1 (histidine-rich protein-1), RAP-1 (rhoptry-associated protein-1) and PfEMP-3 (P. falciparum erythrocyte membrane protein 3) were also found to be trophozoite-specific in comparison to stage IV-V gametocytes. The stage-specific expression of these proteins is consistent with association of knob proteins, rhoptry proteins, PfEMP 3 and merozoite function in asexual stage parasites (Holder et al., 1985; Ellis et al., 1987; Holder, 1988; Pasloske et al., 1993), but not in late stage (III-V) gametocytes (Day et al., 1998). Among the sexual stage-selective transcripts, we identified sequences coding for the known gametocyte-specific genes Pfg377 and Pfs2400 (11.1 gene) (Table 2B, Fig. 3A; 0 scanning confocal microscope (DeRisi et al., 1997). The red/green fluorescence ratio provided a measure of the relative abundance of transcripts, from each DNA segment represented on the array, in trophozoites compared with late-stage gametocytes (Fig. 2A; the raw data from this hybridization and all the figures in this publication may be accessed on the web at http://derisilab.ucsf.edu/malaria/). Reliability The faithfulness of the shotgun DNA microarray for reporting stage-specific gene expression was apparent in four ways. First, three separate hybridizations from three independent cDNA preparations showed virtually identical differential hybridization patterns (Fig. 2B).
Second, genes such as Pfs25, Pfs28, Pfs48/45 and MSP1, which were known to be expressed in a stage-selective fashion and which were applied to the microarray as controls, exhibited 0 ANALYTICAL BIOCHEMISTRY 0 A combined oligonucleotide and protein microarray for the codetection of nucleic acids and antibodies associated with human immunodeficiency virus, hepatitis B virus, and hepatitis C virus infections 1 Agnès Perrin,a,* David Duracher,b Magali Perret,c Philippe Cleuziat,b and Bernard Mandrand,a 0 UMR 2142 CNRS-bioMérieux, 46 allée d'Italie, 69364 Lyon Cedex 07, France; Apibio, Zone ASTEC, 15 rue des Martyrs, 38054 Grenoble Cedex 9, France; UMR 2142 CERVI IFR INSERM 74, 24 avenue Tony Garnier 69365, Lyon Cedex 07, France 0 Keywords: Hybridization; DNA; Microtiter plate well; Densitometry; Enzyme substrate; Alkaline phosphatase; Immunoassays; ELISA; Complexity; Multidetection 0 Coinfections by hepatitis B (HBV) and C (HCV) viruses are frequent in seropositive patients infected with human immunodeficiency virus (HIV) since the same routes of transmission are shared by these viruses (drug abusers, blood transfusion, etc.) [1]. Diagnosis and therapy follow-up of such associated diseases are possible by the combination of several individual assays for testing pertinent parameters. The immune response to HIV type 1 (HIV-1) is oriented mainly against gag and env glycoproteins, but a period of about 3 weeks is observed between contamination and appearance of anti-HIV antibodies. During this period, p24 protein is present in the serum of most patients. The recent emergence of combined assays for the codetection of p24 antigenemia and anti-HIV antibody titer--e.g., HIV Duo assay (bioMérieux)--allows reducing the delay between contamination and diagnosis [2].
Quantification of HIV-1 genome is achieved by molecular techniques, which take on more importance since they are extremely sensitive [3], viral load RNA being predictive of CD4 decline, acquired immune deficiency syndrome progression, and patient survival [4]. In the case of HBV infection, the presence of plasma hepatitis B surface antigen (HBs-Ag) indicates an active HBV infection [5]. Furthermore, testing HBV DNA levels during therapy may allow early recognition of patients who do not respond to therapy [3], as both the DNA and the protein are often associated for HBV follow-up [6]. On the other hand, appearance of antiHBs antibodies is an indicator of patient recovery. Detection of HCV infection by a HCV positivity has been facilitated by the development of antibody assays [7]. However, these methods are of restricted use due to the period of several weeks between infection and seroconversion [8]. Alternatively, amplification of viral nucleic acid is an effective means for direct HCV quantification [9]. Many commercial tests currently available permit the detection of each of these parameters in separate assays. Emerging protein microarray technology enabling one to set up more complex systems such as antigen microarrays for serodiagnosis of several infectious diseases [10] has been proposed. Other generic array formats designed for the detection of a wider range of infectious or toxic substances have been proposed, notably by Lee et al. [11] or Yang et al. [12]. These chips could be used indiscriminately for either immunoassays or DNA hybridization. Multiplexed assays based on tagged microspheres are also well adapted for versatile applications targeting proteomics or genomics [13,14]. But to our knowledge, no description of a technique allowing the simultaneous, real-time codetection of immunological and DNA hybridization reactions has been made in the literature. 
Our proposal in this work is a microarray based on a standard 96-well microplate format for which the potential as a protein microarray has already been demonstrated [15]. Each well is functionalized by 16 spots comprising nucleic acids and viral proteins, each of these probes allowing the detection of a parameter relevant for the diagnosis or follow-up of three frequently associated viral infections (HIV, HBV, HCV). Immunological models are chosen so that a systematic comparison is possible between CombOLISA and validated immunoassay platforms such as ELISA in microtiter plates or the VIDAS automat. 0 (SK431: TGCTATGTCAGTTCCCCTTGGTTCTCT and SK462: AGTTGGAGGACATCAAGCAGCCATGCAAAT) [15] and a 5′-aminated probe for capture of amplified products (CHIV: GAGACCATCAATGAGGAAGCTGCAGAATGGGAT) [16] were synthesized by Eurogentec (Seraing, Belgium) as were all other oligonucleotides. HCV RNA targets were extracted from serum of chronically infected patients using Nucleospin RNA Virus Kit (Macherey-Nagel, Hoerdt, France) and amplified by RT-PCR with 5′-biotinylated primers (RC21: CTCCCGGGGCACTCGCAAGC and RC1: GTGTAGCCATGGCGTTAGTA) [17]. The 5′-aminated probe CHCV (CATAGTGGTCTGCGGAACCGGTGAGT) [18] was designed to capture biotinylated amplified products. HIV and HCV targets were amplified by RT-PCR under the following conditions using an Access kit from Promega (Madison, WI, USA): 1× AMV/Tfl reaction buffer, 1.8 mM MgSO4, 0.2 mM dNTP, 1 µM primers, 1 U of AMV reverse transcriptase, and 5 U of Tfl DNA polymerase; RT cycle 48 °C for 45 min; 35 PCR cycles (94 °C for 30 s, 60 °C for 1 min, 68 °C for 2 min); final extension at 68 °C for 7 min. PCR templates were analyzed on agarose gels stained with ethidium bromide and revealed under UV illumination. Concentrations of amplified products were evaluated by comparison to band density of a mass ladder (Eurogentec). HIV and HCV amplicons were 46 and 23 nM, respectively.
HBV A synthetic single-stranded nucleic target (74 bp) (CCCAGTAAAGTTCCCCACCTTATGAGTCCAAGGAATTACTAACATTGAGATTCCCGAGATTGAGATCTTCTGCGA) from the HBV genome [19], a 5′-aminated capture probe for target hybridization (CHBV: ATCTCGGGAATCTCAATGTTAG), and a 5′-biotinylated detection probe that also hybridizes to the synthetic target (DHBV: TATTCCGACTCATAAGGTG) were synthesized. Immunoassay Recombinant HCV core protein, whose synthesis is described elsewhere [20], and HIV envelope glycoprotein GP160 were obtained from bioMérieux. HBs antigens were obtained from Hytest (Turku, Finland) for the Ay subtype and from Cliniqa (Fallbrook, CA, USA) for the Ad subtype. GP160 and HBs antigens were the same as those used for adsorption on receptacles of the VIDAS instrument in the HIV Duo kit and in the Anti-HBs Total kit, respectively. Two proteins (NSP1, NSP2) having no affinity in the present study were also spotted to verify immunological reaction specificity. Infected human sera were kindly provided by the Croix-Rousse 0 Materials and methods Nucleic acid probe and DNA targets HIV HIV-1 RNA was bought from Ambion (Austin, TX, USA). Biotinylated primers for amplification 0 Hospital (Lyon, France). Alkaline phosphatase-labeled goat anti-human IgG (AP-GaH IgG) was from Jackson Immunoresearch (West Grove, PA, USA) and alkaline phosphatase-labeled streptavidin (AP-SA) was from Sigma (St. Quentin, France). Microarray setup Capture probes CHIV, CHBV, and CHCV were diluted at 10 µM in a coating buffer (150 mM Na2HPO4/NaH2PO4, 450 mM NaCl, 1 mM EDTA, pH 7.4). Nonspecific proteins (NSP1, bovine serum albumin; NSP2, human chorionic gonadotropin) were diluted at 50 µg/ml in 50 mM carbonate buffer, pH 9.3. GP160, HBs antigens, and HCV core proteins were diluted at 10 µg/ml in phosphate-buffered saline (PBS; 50 mM Na2HPO4/NaH2PO4, 150 mM NaCl, pH 7.4).
Spotting was carried out with the Biochip Arrayer (Perkin-Elmer, Boston, MA, USA), which is based on a submicroliter noncontact, drop-on-demand piezoelectric dispensing technology providing a typical spot diameter of 250 µm. Each 0 BRIEF COMMUNICATIONS 0 Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays 1 Hajime Matsuzaki, Shoulian Dong, Halina Loi, Xiaojun Di, Guoying Liu, Earl Hubbell, Jane Law, Tam Berntsen, Monica Chadha, Henry Hui, Geoffrey Yang, Giulia C Kennedy, Teresa A Webster, Simon Cawley, P Sean Walsh, Keith W Jones, Stephen P A Fodor & Rui Mei 0 We present a genotyping method for simultaneously scoring 116,204 SNPs using oligonucleotide arrays. At call rates >99%, reproducibility is >99.97% and accuracy, as measured by inheritance in trios and concordance with the HapMap Project, is >99.7%. Average intermarker distance is 23.6 kb, and 92% of the genome is within 100 kb of a SNP marker. Average heterozygosity is 0.30, with 105,511 SNPs having minor allele frequencies >5%. 0 Single-nucleotide polymorphisms (SNPs) are emerging as the marker of choice for a broad spectrum of genetic analyses. Previously, we demonstrated a highly accurate approach for genotyping over 10,000 SNPs which combines reduction in genome complexity with the allele-discriminating specificity of oligonucleotide arrays (refs. 1, 2). Recent advancements in array technology, assay and algorithm development, together with new SNP content from 0 RNA interference microarrays: High-throughput loss-of-function genetics in mammalian cells 1 Jose M. Silva, Hana Mizuno, Amy Brady, Robert Lucito, and Gregory J. Hannon* 0 RNA interference (RNAi) is a biological process in which a double-stranded RNA directs the silencing of target genes in a sequence-specific manner. Exogenously delivered or endogenously encoded double-stranded RNAs can enter the RNAi pathway and guide the suppression of transgenes and cellular genes.
This technique has emerged as a powerful tool for reverse genetic studies aimed toward the elucidation of gene function in numerous biological models. Two approaches, the use of small interfering RNAs and short hairpin RNAs (shRNAs), have been developed to permit the application of RNAi technology in mammalian cells. Here we describe the use of a shRNA-based live-cell microarray that allows simple, low-cost, high-throughput screening of phenotypes caused by the silencing of specific endogenous genes. This approach is a variation of ``reverse transfection'' in which mammalian cells are cultured on a microarray slide spotted with different shRNAs in a transfection carrier. Individual cell clusters become transfected with a defined shRNA that directs the inhibition of a particular gene of interest, potentially producing a specific phenotype. We have validated this approach by targeting genes involved in cytokinesis and proteasome-mediated proteolysis. 0 similarly using cell microarrays for loss-of-function genetics. This is accomplished by creating a microarray of living cells that have been transfected in situ with either small interfering RNAs (siRNAs) or with DNA constructs that direct the expression of short hairpin RNAs (shRNAs). These are effective at initiating a silencing response and in creating defined areas (spots) of cells in which suppression of a targeted gene generates an expected phenotype. Such arrays will find broad application to high-throughput low-cost phenotype-based screens in mammalian cells. Materials and Methods 0 Microarray Printing and Reverse Transfection. Transfection mixes 0 RNA interference (RNAi) has emerged as one of the standard techniques to study gene function in diverse experimental systems. Introduction of double-stranded RNA (dsRNA) into a cell decreases the level of the complementary mRNAs producing a knockdown of the corresponding protein.
The current model of the RNAi mechanism proposes that the silencing ``trigger'' is processed by Dicer into small RNAs of 21-22 nucleotides in length. These become incorporated into an RNA-induced silencing complex with endonuclease activity (RISC), which, in turn, identifies and cleaves homologous mRNAs (1, 2). Based on this approach, genomewide RNAi approaches have been used successfully for phenotype-based screens in Caenorhabditis elegans (3-5) and Drosophila melanogaster (6, 7). In part, these successes derive from the availability of convenient and inexpensive methods for producing and introducing dsRNA. For example, it has previously been shown that RNAi can be triggered by soaking C. elegans in a solution of dsRNA (8), or by feeding worms with E. coli expressing gene-specific dsRNAs (9). In Drosophila cells a soaking protocol is also available allowing an easy method of introducing dsRNA (10). Unfortunately, similarly straightforward approaches for triggering silencing have not been described in mammals. Analysis of multiples genes requires a ``gene by gene'' method, in which individual transfections must be performed, making these studies expensive, tedious, and dependent on high-throughput robotic systems. Cell microarrays represent a novel alternative to classical approaches to phenotype-based assays in mammalian cells. Cell microarrays were first described by Ziauddin and Sabatini (11), who demonstrated that cells grown on a glass substrate could take up DNA-lipid complexes that had been deposited on the slide before cells were plated. Cells essentially became transfected in situ, with defined spots of transfected cells localized over the printed DNAs. These studies demonstrated the use of conventional DNA constructs for creating phenotypes based on ectopic expression. Here we investigate the possibility of 0 Reporter Assays. 
One hundred sixty dots containing a dual 0 reporter vector expressing GFP dsRed fluorescent proteins (gift of Alla Karpova, Cold Spring Harbor Laboratory) and individual shRNAs were printed. All shRNA were part of a library of U6 polymerase III promoter-driven hairpins (28). Four groups of experiments with 40 dots (each) were printed: the first group contained only dual reporter vector, the second group contained the reporter vector plus an shRNA or siRNA against firefly luciferase (Ff shRNA and Ff siRNA), the third group contained 0 Ninety-Six-Well Plate Analyses. All RNAi microarray results were 0 validated by using cells transfected in 96-well tissue culture plates. Cells were transfected with LT-1 (Mirus, Madison, WI) according to the manufacturer's instructions at 50-70% confluence. The plasmids containing appropriate constructs were cotransfected, keeping the same ratios used in the arrayed slides but with a total mass of 100 ng of DNA for each transfected well. Again, results were analyzed after 60 h of incubation. Results 0 Targeting Reporter Genes in Situ by Using siRNAs. Given previous 0 the reporter vector plus a shRNA or a siRNA against GFP that has no effect in the expression level of the protein (GFP shRNA-1 and GFP siRNA-1), and the last group contained the reporter vector plus a shRNA that reduces by 90% the GFP signal when tested in culture plates (GFP shRNA-2 and GFP siRNA-2). Several cell lines were tested for transfection, NIH 3T3, IMR90 E1A, HeLa, and HEK 293T. To test the stability of the printed array, we repeated the assay at different time points after printing, day 0, 1 week, 2 weeks, 4 weeks, and 2 months. For testing the stability of the transfection master mix, we stored the solution at 4°C and then printed new slides and assayed them at the time points described above. 0 Proteasome-Mediated Proteolysis Assays. Thirty shRNAs targeting different proteasome subunits were printed in triplicate. 
Every dot harbored an shRNA-expression vector, a plasmid expressing dsRed (dsRed N-1, Clontech), and a vector encoding a proteasome fluorescent reporter (ZsProSensor, Clontech). This reporter encodes a fusion protein that has been engineered to show varying levels of expression depending on the status of the proteasome pathway. Every transfection master mix contained 400 ng of dsRed vector, 100 ng of ZsProSensor, and 1 µg of shRNA plasmid. Twenty micrograms of total protein lysates was used for Western blot analysis. Rabbit anti-PSMC-6 subunit of the proteasome (Affinity, Biomol, Plymouth Meeting, PA), rabbit anti-ubiquitin (StressGen Biotechnologies, Victoria, Canada), and mouse anti-β-actin (United States Biological, Swampscott, MA) antibodies were also used in these studies. Cytokinesis Defect Assays. Eight shRNAs targeting the motor 0 successes in ectopically expressing genes by reverse transfection (11), we hoped that similar approaches could be coupled with the use of RNAi to produce knockdown phenotypes. Therefore, we began by testing the ability of siRNAs to be deposited on a microarray as lipid-RNA comple 0 An Arabidopsis promoter microarray and its initial usage in the identification of HY5 binding targets in vitro 1 Ying Gao1,2, Jinming Li3, Elizabeth Strickland2, Sujun Hua4, Hongyu Zhao5, Zhangliang Chen1, Lijia Qu1 and Xing Wang Deng1,2,* 0 Key words: Arabidopsis, HY5, promoter microarray, transcription factor-promoter interaction 0 Abstract To analyze transcription factor-promoter interactions in Arabidopsis, a general strategy for generating a promoter microarray has been established. This includes an integrated platform for promoter sequence extraction and the design of primers for the PCR amplification of the promoter regions of annotated genes in the Arabidopsis genome. A web-interfaced primer-retrieval program was used to obtain up to 10 primer pairs with a suitability ranking given to each gene.
We selected primer pairs for the promoters of about 3800 genes, and greater than 95% of the promoter fragments from the total genomic DNA were successfully amplified by PCR. These PCR products were purified and used to print an Arabidopsis promoter microarray. This initial promoter microarray was used to study the in vitro binding of the transcription factor HY5 to its promoter targets. A set of promoter fragments exhibited consistent and strong interaction with the HY5 protein in vitro, and computational analysis revealed that they were enriched with the HY5 consensus binding G-box motif. Thus, a promoter microarray can be a useful tool for identifying transcription factor binding sites at the genomic scale in higher plants. 0 Introduction Transcription factor-promoter interactions are fundamentally important for understanding the regulation of genome expression, and, thus, eukaryotic cell growth and development. A series of recent papers revealed critical insights in the genome-wide transcription regulatory network using a global genome-wide analysis of transcription factor binding sites in several model organisms, including yeast (Ren et al., 2000; Iyer et al., 2001; Simon et al., 2001; Wyrick et al., 2001), Drosophila (Markstein et al., 2002; Stathopoulos 0 et al., 2002; Orian et al., 2003), and mammalian cells (Horak et al., 2002; Ren et al., 2002; Weinmann et al., 2002). Although a combination of gene expression analysis and computational prediction strategy has been employed previously to understand genome expression regulation in Arabidopsis (Hong et al., 2003; Ramirez-Parra et al., 2003), the analysis of transcription factor-promoter interactions has been largely limited to individual genes (Saha et al., 2001; Egelkrout et al., 2002; Lopez-Molina et al., 2002). 
The Arabidopsis thaliana genome encodes at least fifteen-hundred transcription factors, which

We retrieved the assembled Arabidopsis chromosome sequences and annotation information from MAtDB - the MIPS Arabidopsis thaliana database (ftp://ftpmips.gsf.de/cress/). The annotation information included gene contig names, entry codes, gene structures, and transcription directions. The promoter region of each gene was located according to the annotation information and was then extracted from the chromosome sequences. Representative promoter deletion analyses have shown that most Arabidopsis genes have functional promoters within 1400 bp of their translational start sites (Conley et al., 1994; Tjaden et al., 1995; Honma and Goto, 2000; Haralampidis et al., 2002; Brown et al., 2003). Therefore, we used 1400 bp as an upper limit for our promoter sequence selection of Arabidopsis genes. To select promoter fragments for microarray construction, we also considered the need for uniform promoter size, so as to reduce the variation in PCR amplification yield as well as in hybridization efficiency. Therefore, the following principles were followed in selecting promoter fragments for PCR amplification. First, the longest fragment size of the PCR products was 1400 bp. Second, the minimum fragment size of the PCR products was set to 500 bp. Third, the promoter 3′ end was always near, and no more than 50 bp upstream of, the ATG. To apply the above principles, the transcription direction of the selected gene and the length of the intergenic region between this gene and its upstream neighbor gene were considered. These intergenic regions in the genome were grouped into 14 types, and in each case a distinct formula was used to define the promoter region for PCR amplification (Figure 2). The promoter sequences from these defined promoter regions were then extracted from the chromosome sequences, stored in the database, and used for primer selection.
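The three size rules above can be illustrated with a minimal sketch. It is not the authors' program: it collapses the 14 intergenic-region types and their formulas (Figure 2, not reproduced here) into a single forward-strand case, and the function name `promoter_window` and its parameters `atg`, `upstream_limit` and `gap` are hypothetical names chosen for illustration.

```python
from typing import Optional, Tuple

MAX_LEN = 1400  # longest PCR fragment allowed by the text (bp)
MIN_LEN = 500   # shortest PCR fragment allowed by the text (bp)
MAX_GAP = 50    # promoter 3' end lies at most this far upstream of the ATG (bp)


def promoter_window(atg: int, upstream_limit: int, gap: int = 0) -> Optional[Tuple[int, int]]:
    """Return a half-open (start, end) promoter window on the forward strand.

    atg            -- 0-based coordinate of the A of the start codon
    upstream_limit -- first usable coordinate upstream, e.g. just past the
                      neighbouring gene (a hypothetical simplification)
    gap            -- distance between the promoter 3' end and the ATG
    Returns None when the available region is shorter than MIN_LEN.
    """
    if not 0 <= gap <= MAX_GAP:
        raise ValueError("gap must be between 0 and 50 bp")
    end = atg - gap                              # 3' end of the promoter fragment
    start = max(upstream_limit, end - MAX_LEN)   # cap the fragment at 1400 bp
    if end - start < MIN_LEN:
        return None                              # intergenic region too short
    return start, end


# A long intergenic region is capped at 1400 bp; a short one is rejected.
print(promoter_window(5000, 0))
print(promoter_window(5000, 4700))
```

In this simplified form the upstream neighbor only ever shortens the window; the orientation-dependent cases that the paper handles with separate formulas are deliberately left out.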
A da

Microarray and Functional Gene Analyses of Sulfate-Reducing Prokaryotes in Low-Sulfate, Acidic Fens Reveal Cooccurrence of Recognized Genera and Novel Lineages

Alexander Loy,1 Kirsten Küsel,2 Angelika Lehner,3 Harold L. Drake,2 and Michael Wagner1*

MATERIALS AND METHODS

Site description. The two low-moor fens, designated Schlöppnerbrunnen I (50°08′14″N, 11°53′07″E) and Schlöppnerbrunnen II (50°08′38″N, 11°51′41″E), that were investigated are in the Lehstenbach catchment in the Fichtelgebirge mountains in northeastern Bavaria (Germany). The catchment has an area of 4.2 km², and the highest elevation is 877 m above sea level. Ninety percent of the

SULFATE-REDUCING PROKARYOTES IN ACIDIC FENS

TABLE 1. 16S rRNA gene-targeted primers

Short name(a)   Full name(b)              Specificity                                          Annealing temp (°C)   Sequence (5′-3′)
616V            S-D-Bact-0008-a-S-18      Most Bacteria
630R            S-D-Bact-1529-a-A-17      Most Bacteria
1492R           S- -Proka-1492-a-A-19     Most Bacteria and Archaea
ARGLO36F        S-G-Arglo-0036-a-S-17     Archaeoglobus spp.
DSBAC355F       S- -Dsb-0355-a-S-18       Most "Desulfobacterales" and "Syntrophobacterales"
DSMON85F        S-G-Dsmon-0085-a-S-20     Desulfomonile spp.
DSMON1419R      S-G-Dsmon-1419-a-A-20     Desulfomonile spp.
SYBAC282F       S- -Sybac-0282-a-S-18     "Syntrophobacteraceae" and some other Bacteria
SYBAC1427R      S- -Sybac-1427-a-A-18     "Syntrophobacteraceae"
DBACCA65F       S-S-Dbacca-0065-a-S-18    Desulfobacca acetoxidans
DBACCA1430R     S-S-Dbacca-1430-a-A-18    Desulfobacca acetoxidans

(a) Short name used in the reference or in this study.
(b) Name of 16S rRNA gene-targeted oligonucleotide primer based on established nomenclature (6).
The annealing temperature was 52°C when the primer was used with forward primer 616V or ARGLO36F, and the annealing temperature was 60°C when the primer was used with forward primer DSBAC355F.

area is covered with Norway spruce (Picea abies [L.] Karst.) of different ages.
Upland soils in the catchment are not water saturated, have developed from weathered granitic bedrock, and are predominantly cambisols and cambic podsols (according to the Food and Agriculture Organization system). Considerable parts of the catchment (approximately 30%) are covered by minerotrophic fens or intermittent seeps. The annual precipitation in the catchment is 900 to 1,160 mm, and the average annual temperature is 5°C. Schlöppnerbrunnen I is covered with patches of Sphagnum moss and spruce, and the soil is a fibric histosol and is usually water saturated; in years with extremely hot summer months, the upper soil can become dry. Schlöppnerbrunnen II is permanently water saturated and completely overgrown by the grass Molinia caerulea. The soil of Schlöppnerbrunnen II has a larger amount of bioavailable Fe3+ than the soil of Schlöppnerbrunnen I has. The soil pHs of Schlöppnerbrunnen I and II were approximately 3.9 and 4.2, respectively; the soil solution pH varied between 4 and 6. Dialysis chambers. A soil solution from the upper 40 cm of each site was sampled with dialysis chambers (27) every 2 months from July 2001 to November 2002. Each dialysis chamber consisted of 40 1-cm cells covered with a cellulose acetate membrane with a pore diameter of 0.2 µm. Prior to installation, the chamber was filled with anoxic, deionized water. The dialysis chambers were placed in the water-saturated fens for 2 weeks prior to sampling. On the sampling date, each chamber was closed (i.e., made airtight), transported to the laboratory, and sampled with argon-flushed syringes. Collection of soil. For microcosms, soil samples from three different depths (approximately 0 to 10, 10 to 20, and 20 to 30 cm) were obtained in December 2001 in sterile airtight vessels, transported to the laboratory, and processed within 4 h.
For isolation of DNA, soil cores (diameter, 3 cm) from four different depths (approximately 0 to 7.5, 7.5 to 15, 15 to 22.5, and 22.5 to 30 cm) were collected on 24 July 2001 and immediately cooled on ice. Soil samples were brought to the laboratory, where they were diluted 1:1 (vol/vol) in phosphate-buffered saline (130 mM NaCl, 10 mM NaH2PO4, 10 mM Na2HPO4; pH 7.3), homogenized by vortexing, and stored at -20°C. Anoxic microcosms. Thirty-gram (fresh weight) portions of soil were placed into 125-ml infusion flasks (Merck

The Use of Carbohydrate Microarrays to Study Carbohydrate-Cell Interactions and to Detect Pathogens

Matthew D. Disney and Peter H. Seeberger*
Laboratory for Organic Chemistry, Swiss Federal Institute of Technology Zuerich, ETH Hoenggerberg HCI F315, Wolfgang-Pauli-Strasse 10, 8093 Zuerich, Switzerland

Summary

The use of carbohydrate microarrays to investigate the carbohydrate binding specificities of bacteria, to detect pathogens, and to screen antiadhesion therapeutics is reported. This system is ideal for whole-cell applications because microarrays present carbohydrate ligands in a manner that mimics interactions at cell-cell interfaces. Other advantages include assay miniaturization, since minimal amounts (~picomoles) of a ligand are required to observe binding, and high throughput, since thousands of compounds can be placed on an array and analyzed in parallel. Pathogen detection experiments can be completed in complex mixtures of cells or protein using the known carbohydrate binding epitopes of the pathogens in question. The nondestructive nature of the arrays allows the pathogen to be harvested and tested for antibacterial susceptibility. These investigations allow microarray-based screening of biological samples for contaminants and of combinatorial libraries for antiadhesion therapeutics.
Introduction

Carbohydrates displayed on the surface of cells play critical roles in cell-cell recognition, adhesion, and signaling between cells, and serve as markers for disease progression. Neural cells use carbohydrates to facilitate development and regeneration [1]; cancer cell progression is often characterized by increased carbohydrate-dependent cell adhesion and the enhanced display of carbohydrates on the cell surface [2]; viruses recognize carbohydrates to gain entry into host cells [3]; and bacteria bind to carbohydrates for host cell adhesion [4]. Identification of the specific saccharides involved in these processes is important to better understand cell-cell recognition at the molecular level and to aid the design of therapeutics and diagnostic tools. Many interactions at cell-cell interfaces involve multiple binding events that occur simultaneously [5, 6]. This "multivalent" type of binding amplifies affinities relative to interactions that involve only a single ligand [6]. This effect has led to the development of multivalent antiadhesive therapeutics against bacteria [7, 8] and viruses by displaying carbohydrates on flexible polymers [9-11]. Dendrimers and bovine serum albumin (BSA) have also been used as multivalent scaffolds [8]. Additionally, devices that are responsive to the presence of a

Results and Discussion

Cell Adhesion to Carbohydrate Arrays

Five different monosaccharides equipped with an ethanolamine linker on their reducing ends were used to construct the carbohydrate arrays (Figure 1). Functionalized sugars were spotted onto glass slides that had been coated with the amine-reactive homobifunctional disuccinimidyl carbonate linker. In initial tests, 10 µl of a 20 mM carbohydrate solution was placed onto different positions on the surface. Slides were hybridized with 10^9 E. coli (ORN178) cells that had been stained with a nucleic acid staining dye (Figure 2).
After removing unbound bacteria by washing, slides were scanned using a fluorescent array scanner. Results show that a strongly fluorescent signal (signal to noise [S/N] >10) was observed at positions where mannose was immobilized; hybridization with unstained E. coli resulted in a weak signal (S/N ~2). The remainder of the slide exhibited no signal above background (data not shown). Next, an arraying robot was used to construct high-density arrays. The robot spatially delivered 1 nl of carbohydrate-containing solutions that ranged in concentration from 20 mM to 15 µM, and the resulting spots had a diameter of ~200 µm. Several types of slides were tested to optimize array performance. Standard amine-coated glass slides were reacted with either disuccinimidyl carbonate or disuccinimidyl tetrapolyethyleneglycol linkers; alternatively, CodeLink polymer-coated slides were used (data not shown). For each of these

Chemistry & Biology 1702

slides, ORN178 bound to mannose and not to the other carbohydrates. Furthermore, binding occurred with a signal to noise ratio of >100 despite the small size of the spots (Figure 3). CodeLink slides had the best performance since they gave the highest binding signal and the lowest background. These slides were used in all subsequent array experiments where monosaccharides were displayed. Most likely, the three-dimensional manner in which the carbohydrates were immobilized on these slides is responsible for the enhanced performance. Other arrays that displayed mono- to nonamannosides, which were constructed as described [15], were tested for binding to ORN178 (see Supplemental Data). Results from these experiments show that ORN178 has little preference among these mannosides, despite their varying lengths and linkage stereochemistry. This likely reflects that recognition by this strain occurs through only a single mannose residue, and that the stereochemistry of the linkage plays little role in binding.
The observation of cell adhesion to arrays constructed using an arraying robot with microarray-size spots is promising. A previous report studied adhesion of chicken hepatocytes and human T cells to carbohydrate arrays that were manually constructed. These spots were 1.7 mm in diameter and allowed for ~200 spots to be placed on a single slide [16]. The arrays described here show that the interactions of bacteria with carbohydrates can be studied in a high-throughput manner. Due to the smaller spot size used here, a much larger number of interactions can be screened in parallel. The minimal amount of carbohydrate sufficient to detect binding was determined. Analyte consumption is an important aspect for carbohydrate arrays, since materials isolated from natural sources are in short supply. Several 1 nl aliquots of serially diluted solutions of carbohydrate that ranged in concentration from 20 mM to 15 µM were arrayed. A concentration-dependent decrease in signal was observed, and delivery of as little as 20 fmol to a slide was sufficient to obtain a signal above background (Figure 4). Different concentrations of bacteria were next hybridized with the arrays to determine the bacterial detection limit. As expected, a concentration-dependent decrease in signal was observed. When 10^6 or more ORN178 cells were incubated, signals were well above background (Supplemental Data); however, hybridization of 10^5 cells gave a signal that approached background, thus defining the current detection limit. This sensitivity rivals or exceeds that of methods requiring a bacterial enrichment step prior to detection [17]. Standard microscopic images were taken of ORN178 bound to three mannose-containing spots. Images show that ORN178 adhered only to these positions, which are densely covered with bacteria (Figure 4), and that no bacteria are observed outside of this area. This illustrates that these slides are resistant to nonspecific adhesion of bacteria.
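The amounts quoted above follow from volume times concentration. The short sketch below works through that arithmetic for the two ends of the dilution series, reading the garbled lower concentration as 15 µM (an assumption based on the femtomole-scale result reported in the text); the helper name `moles_delivered` is hypothetical.

```python
def moles_delivered(volume_litres: float, concentration_molar: float) -> float:
    """Amount of substance (mol) in a spotted droplet: n = V * c."""
    return volume_litres * concentration_molar


SPOT_VOLUME = 1e-9  # 1 nl per spot, as stated in the text

# Highest concentration in the series: 20 mM
top = moles_delivered(SPOT_VOLUME, 20e-3)
# Lowest concentration in the series: 15 µM (assumed reading of the garbled unit)
bottom = moles_delivered(SPOT_VOLUME, 15e-6)

print(f"20 mM spot: {top * 1e12:.0f} pmol")     # picomoles
print(f"15 µM spot: {bottom * 1e15:.0f} fmol")  # femtomoles
```

Under these assumptions the series spans roughly three orders of magnitude in delivered ligand, from tens of picomoles down to tens of femtomoles, which is the same scale as the reported 20 fmol detection threshold.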
Assessing the Carbohydrate Binding Specificities of Different Bacterial Strains

The arrays were tested for their ability to probe differences in carbohydrate binding affinities between re-

Carbohydrate Microarrays to Detect Pathogens 1703

Intact cell adhesion to glycan microarrays

Department of Pharmacology and Molecular Sciences, The Johns Hopkins School of Medicine, 725 N. Wolfe Street, Baltimore, MD 21205; 4 Instituto de Microbiologia Prof. Paulo de Goes, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil; and 5 Glycominds, Ltd., Lod 71291, Israel

A rapid and reproducible method was developed to detect and quantify carbohydrate-mediated cell adhesion to glycans arrayed on glass slides. Monosaccharides and oligosaccharides were covalently attached to glass slides in 1.7-mm-diameter spots (200 spots/slide) separated by a Teflon gasket. Primary chicken hepatocytes, which constitutively express a C-type lectin that binds to nonreducing terminal N-acetylglucosamine residues, were labeled with a fluorescent dye and incubated in 1.3-µL aliquots on the glycosylated spots. After incubating to allow cell adhesion, nonadherent cells were removed by immersing the slide in phosphate-buffered saline, inverting, and centrifuging in a sealed custom acrylic chamber so that cells on the derivatized spots were subjected to a uniform and controlled centrifugal detachment force while avoiding an air-liquid interface. After centrifugation, adherent cells were fixed in place and detected by fluorescent imaging. Chicken hepatocytes bound to nonreducing terminal GlcNAc residues in different linkages and orientations but not to nonreducing terminal galactose or N-acetylgalactosamine residues. Addition of soluble GlcNAc (but not Gal) prior to incubation reduced cell adhesion to background levels. Extension of the method to CD4+ human T-cells on a 45-glycan diversity array revealed specific adhesion to the sialyl Lewis x structure.
The described method is a robust approach to quantify selective cell adhesion using a wide variety of glycans and may contribute to the repertoire of tools for the study of glycomics.

Key words: CD4+/glycomics/hepatocyte/lectins/oligosaccharides

Introduction

Carbohydrate-mediated cell-cell recognition is emerging as an important component in the repertoire of molecular recognition events that underlie the orderly development and functioning of multicellular organisms (Crocker and

L. Nimrichter et al.

lectins or by generating multivalency using chimeras or secondary binding proteins. In nature, multivalency is often generated by lectin expression on cell surfaces, where lectin molecules self-associate or cluster in response to multivalent binding arrays on an apposing surface (Weis and Drickamer, 1996; Weisz and Schnaar, 1991). Here we report methods that detect specific adhesion of intact cells to covalent carbohydrate microarrays engineered on glass slides. These methods take advantage of the natural multivalency of cell surface carbohydrate binding to extend the applicability of glycan microarrays.

Results

Glass-slide arrayed carbohydrates

Defined glycosides were covalently arrayed on standard-size glass slides using previously described chemistry (Schwarz et al., 2003). The array consisted of 8 rows and 25 columns of 1.7-mm-diameter spots separated by a Teflon gasket (Figure 1).

Adhesion of intact chicken hepatocytes to GlcNAc-terminated glycans

A method for quantifying intact cell adhesion to glass slide glycan arrays was developed and refined using primary chicken hepatocytes, which express the well-defined GlcNAc-specific chicken hepatic lectin on their surface (Drickamer, 1981). Initial experiments used slides with multiple spots derivatized with GlcNAc, Gal, linker arm (control), and no modification (Figure 2). Chicken hepatocytes adhered selectively to spots derivatized with GlcNAc glycosides.
Cell adhesion to Gal-derivatized spots and control surfaces was very low. Varying the conditions for blocking nonspecific cell adhesion (5 mg/mL or 10 mg/mL bovine serum albumin [BSA]) did not alter the results. Microscopic examination of the wells (Figure 3) confirmed

Cell adhesion to glycan microarrays

scientific report

Parasite-specific immune response in adult Drosophila melanogaster: a genomic study

Katarina Roxström-Lindquist*, Olle Terenius* & Ingrid Faye+

Insects of the order Diptera are vectors for parasitic diseases such as malaria, sleeping sickness and leishmaniasis. In the search for genes encoding proteins involved in the antiparasitic response, we have used the protozoan parasite Octosporea muscaedomesticae for oral infections of adult Drosophila melanogaster. To identify parasite-specific response molecules, other flies were exposed to virus, bacteria or fungi in parallel. Analysis of gene expression patterns after 24 h of microbial challenge, using Affymetrix oligonucleotide microarrays, revealed a high degree of microbe specificity. Many serine proteases, key intermediates in the induction of insect immune responses, were uniquely expressed following infection with the different organisms. Several lysozyme genes were induced in response to Octosporea infection, while in other treatments they were not induced or were downregulated. This suggests that lysozymes are important in antiparasitic defence.

The majority of insect vectors for human parasites are found among dipterans. In an attempt to understand the immunological basis for Anopheles vector capacity, Schneider & Shahabuddin (2000) successfully used Drosophila melanogaster and the malaria parasite Plasmodium gallinaceum as a vector-parasite model system. Ookinetes injected into the fly haemocoel developed into sporozoites that were infective when injected into the chicken host.
However, when the flies were fed parasitized blood or ookinetes, parasite development was hampered, indicating that the important barrier to parasite development resides in the gut of this insect. Either certain mosquito-specific invasion routes are not present in Drosophila, or the malaria parasites encountered

EUROPEAN MOLECULAR BIOLOGY ORGANIZATION

Drosophila were fed with DCV, 30-50% of the flies died within 6 days after infection (Gomariz-Zilber et al, 1995). This is the first whole-genome study on antiparasitic response in D. melanogaster. We demonstrate that Drosophila responds to an oral infection with Octosporea by upregulating a new and specific set of genes. Many of the genes with unknown function have signal peptides and will be a subject for future analyses of antiparasitic activity.

Antiparasitic gene expression in Drosophila
K. Roxström-Lindquist et al.

RESULTS AND DISCUSSION

Genome data analysis

The Drosophila gene expression in response to different microbes was examined after 24 h of natural infection of adult males. The RNA was hybridized to Affymetrix Drosophila GeneChips, and Affymetrix MAS 5.0 software was used for the calculation of expression and statistical analyses of the chips (supplementary information table 1 online). Duplicates of each infection were compared to duplicates of normal flies in a 2 × 2 matrix (supplementary information text part A online). The genes that were significantly increased (P < 0.0025, Wilcoxon's signed ranks test) in all four comparisons were defined as induced genes. In total, 427 genes were induced and selected for further analysis (supplementary information table 2 online). The fungal infection generated the strongest response, with 298 genes induced, and the parasitic infection induced 127 genes. In the viral and bacterial infections, a low number of genes were significantly induced: 11 and 10, respectively.
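The selection rule just described (two infected replicates against two control replicates, with a gene counted as induced only if it passes in all four pairwise comparisons) can be sketched as follows. This is an illustration of the filtering logic only, not the MAS 5.0 workflow: the replicate labels, the gene names and the p-values are invented for the example, and the threshold is read as P < 0.0025.

```python
from itertools import product

P_CUTOFF = 0.0025  # significance threshold quoted in the text

INFECTED = ("inf1", "inf2")  # hypothetical labels for duplicate infected chips
CONTROL = ("ctl1", "ctl2")   # hypothetical labels for duplicate control chips


def induced_genes(pvalues):
    """Return genes called 'increased' in every infected-vs-control comparison.

    pvalues maps gene -> {(infected_chip, control_chip): p-value of increase}.
    """
    comparisons = list(product(INFECTED, CONTROL))  # the 2 x 2 = 4 comparisons
    return sorted(
        gene
        for gene, table in pvalues.items()
        if all(table[pair] < P_CUTOFF for pair in comparisons)
    )


# Invented example data: geneA passes all four comparisons, geneB fails one.
example = {
    "geneA": {("inf1", "ctl1"): 0.0001, ("inf1", "ctl2"): 0.0004,
              ("inf2", "ctl1"): 0.0010, ("inf2", "ctl2"): 0.0020},
    "geneB": {("inf1", "ctl1"): 0.0001, ("inf1", "ctl2"): 0.0300,
              ("inf2", "ctl1"): 0.0010, ("inf2", "ctl2"): 0.0020},
}
print(induced_genes(example))
```

Requiring agreement across all four comparisons is a conservative choice: a single discordant replicate pair is enough to exclude a gene, which trades sensitivity for fewer false calls.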
The significantly induced genes are found in many different functional classes (Fig 1). A common feature in the four infections was that many of the genes encode enzymes, in particular serine proteases: Octosporea, 35% enzymes (13% serine proteases); Beauveria, 24% (8%); Serratia, 60% (50%); and DCV, 36% (27%) (supplementary information table 4 online). Unique or common induction of a gene was determined by comparing the expression of each induced gene selected in one treatment with its expression in other treatments (supplementary information text part A online). The numbers of uniquely induced genes were 214, 59 and 2 in response to Beauveria, Octosporea and DCV, respectively; this constitutes 65% of the 427 induced genes and thereby demonstrates specificity in the immune response (Fig 2). Many genes were induced in several infections; 16 genes are designated as common in response to all four infections. The genes in common encode the antimicrobial proteins Attacin A, Cecropin A1, Cecropin A2, Drosomycin and Metchnikowin, as well as a gene involved in acetyl-CoA homeostasis (CG8628), one serine protease (CG6483) and nine genes with unknown functions (supplementary information table 3 online).

Confirmation of genes responding to Beauveria infection

The antifungal peptide genes Drosomycin and Metchnikowin (Ekengren & Hultmark, 2001, and references therein) were heavily induced by Beauveria in our study: 14.3- and 19.9-fold, respectively (Table 1). In a similar experiment, where the D. melanogaster strain OregonR was naturally infected with the same strain of Beauveria, the response at 24 h was lower than in our results: Drosomycin 6.4-fold and Metchnikowin 4.4-fold (De Gregorio et al, 2001). The Canton S flies used in our study died within 5 days (Fig 3), whereas 90% of the OregonR flies used by De Gregorio et al (2002) were still alive at that time point.
This 0 may indicate that our flies were more heavily infected, or that there is a certain genetic difference between these two wild-type isolates of D. melanogaster. Turandot M (TotM) is a stress-induced humoral protein gene in Drosophila, earlier shown to be upregulated by the Gram-negative bacterium Enterobacter cloacae b12 when injected into adults (Ekengren & Hultmark, 2001). In our study, TotM is induced 13.7fold by fungal infection (Table 1) and 2.4-fold by bacterial feeding. The strong fungal induction could reflect the stress response inferred by cuticular penetration. Notably, in De Gregorio's study TotM (CG14027) is, after 24 h, upregulated 3.6-fold by the fungal infection and 13.6-fold by septic injury. This is a recurring pattern of contrasting results on fungal versus bacte 0 The Human MitoChip: A High-Throughput Sequencing Microarray for Mitochondrial Mutation Detection 1 Anirban Maitra,1,3 Yoram Cohen,2 Susannah E.D. Gillespie,3 Elizabeth Mambo,2 Noriyoshi Fukushima,1 Mohammad O. Hoque,2 Nila Shah,4 Michael Goggins,1 Joseph Califano,2 David Sidransky,1,2 and Aravinda Chakravarti3,5 0 et al. 1998; Fliss et al. 2000; Bianchi et al. 2001; Jones et al. 2001; Parrella et al. 2001; Sanchez-Cespedes et al. 2001; Chen et al. 2002; Copeland et al. 2002). The frequency of mitochondrial mutations in these studies is high, with half to two-thirds of cancers harboring at least one somatic mutation. The mitochondrial genome is an ideal target for mutation detection in cancers for several reasons. First, mitochondrial mutations in cancer are not only common, but unlike nuclear genes, do not appear to be restricted by cancer type (Polyak et al. 1998; Fliss et al. 2000; Jones et al. 2001; Sanchez-Cespedes et al. 2001). Second, detection of mitochondrial DNA mutations in clinical samples (such as exfoliated cells in urine, or lavage fluids) offers a distinct advantage over nuclear DNA because of the high copy number of mitochondrial genomes in cancer cells. 
Fliss et al. (2000) determined that mitochondrial DNA was 19 to 220 times as abundant as mutated p53 nuclear DNA in matched body fluids from cancer patients. Similarly, Jones et al. (2001) confirmed the facile detection of mitochondrial DNA mutations in primary tumors with a 30% or less neoplastic cellularity, whereas known nuclear DNA mutations could not be detected in the nonenriched samples. Finally, the presence of mitochondrial DNA mutations in a proportion of preneoplastic lesions suggests that mutations occur early in multistep tumor progression (Jeronimo et al. 2001; Parrella et al. 2001; Ha et al. 2002), and hence, may be used as a tool for early detection of cancer in clinical samples, including body fluids and serum (Hibi et al. 2001; Jeronimo et al. 2001; Nomoto et al. 2002; Okochi et al. 2002). Current strategies for using the mitochondrial genome as a screening tool in cancer are limited by the availability of a high-throughput platform for mutation detection. Even with the

Genome Research

Mitochondrial Sequencing Microarray

Reproducibility of Array-Based Sequencing

availability of sensitive and rapid mutation detection platforms such as automated capillary sequencers and denaturing high-performance liquid chromatography (HPLC; Medintz et al. 2001; Liu et al. 2002), the routine sequencing of 16.5 kb of mitochondrial DNA is an onerous task. Microarrays are inherently parallel devices that offer the promise of determining the genotypes at every site of interest with a limited level of effort (Hacia 1999). Chee et al. developed the first mitochondrial sequencing microarray in 1996, composed of "tiled" oligonucleotide sequencing probes synthesized using standard photolithography and solid-phase DNA synthesis (Chee et al. 1996).
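To make the idea of "tiled" sequencing probes concrete, the sketch below generates, for each interrogated position of a reference sequence, four probes that differ only in their central base. It is a generic illustration of the tiling principle on one strand, not the design of either chip discussed here: the default probe length of 25 nt, the centered substitution position, the function name `tile_probes` and the toy reference sequence are all assumptions.

```python
BASES = "ACGT"


def tile_probes(reference: str, probe_len: int = 25):
    """Yield (position, {base: probe}) for every fully covered position.

    Each position gets four probes of length probe_len that match the
    reference except at the central base, which is set to A, C, G and T in
    turn; the probe carrying the reference base is the perfect match.
    """
    if probe_len % 2 == 0:
        raise ValueError("probe_len must be odd so that a central base exists")
    half = probe_len // 2
    for center in range(half, len(reference) - half):
        window = reference[center - half:center + half + 1]
        yield center, {b: window[:half] + b + window[half + 1:] for b in BASES}


# Toy reference and short probes, purely for illustration.
for pos, probes in tile_probes("ACGTACGTAC", probe_len=5):
    print(pos, probes)
```

In a hybridization readout, the base whose probe gives the strongest signal at a position would be taken as the call for that position; tiling the complementary strand as well gives a second, independent set of calls for the same site.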
This microarray platform, however, had several limitations, including the requirement for generating RNA by in vitro transcription of genomic DNA for chip hybridization, tiling of only a single strand of the target mitochondrial sequence on the chip, and absence of robust genotype assignment software. We have developed a "second-generation" sequencing microarray for high-throughput analysis of mitochondrial coding

A custom microarray platform for analysis of microRNA gene expression

J Michael Thomson1, Joel Parker2,5, Charles M Perou2-4 & Scott M Hammond1,2

MicroRNAs are short, noncoding RNA transcripts that post-transcriptionally regulate gene expression. Several hundred microRNA genes have been identified in Caenorhabditis elegans, Drosophila, plants and mammals. MicroRNAs have been linked to developmental processes in C. elegans, plants and humans and to cell growth and apoptosis in Drosophila. A major impediment in the study of microRNA function is the lack of quantitative expression profiling methods. To close this technological gap, we have designed dual-channel microarrays that monitor expression levels of 124 mammalian microRNAs. Using these tools, we observed distinct patterns of expression among adult mouse tissues and embryonic stem cells. Expression profiles of staged embryos demonstrate temporal regulation of a large class of microRNAs, including members of the let-7 family. This microarray technology enables comprehensive investigation of microRNA expression, and furthers our understanding of this class of recently discovered noncoding RNAs.

MicroRNAs comprise a large family of noncoding RNAs found in organisms ranging from nematodes to plants to humans (see ref. 1 for a review). Over 200 microRNAs have been identified in mammals, either through computational searches or by RT-PCR-mediated cloning. These RNAs function as natural triggers of the RNAi pathway, regulating gene expression at a post-transcriptional step.
MicroRNA biogenesis begins with a primary transcript that contains a stem-loop structure1. This transcript is processed by the ribonuclease III enzyme Drosha, liberating the stem-loop, which is termed the precursor. This precursor is transported out of the nucleus in a process dependent on the Ran GTPase and the export receptor exportin-5. Further processing in the cytoplasm by the ribonuclease III enzyme Dicer leads to the production of mature RNAs of ~22 nucleotides (nt) that are incorporated into the RNAi effector complex RISC (RNA-induced silencing complex). Complementarity with elements in mRNAs leads to suppression of gene expression. In cases where the microRNA is an imperfect match to the mRNA, as with C. elegans lin-4, recognition leads to reduction in protein levels without affecting mRNA levels. In plants, mRNA targets in the scarecrow-like family of transcription factors contain sequences perfectly complementary to the microRNA miR-39. Similarly, in mammals, miR-196 has near-perfect identity with elements in the mRNA of the homeobox transcription factor gene HoxB8 (ref. 2). In this case recognition of the mRNA by microRNAs leads to cleavage, rather than translational repression, analogous to siRNA-mediated gene silencing3,4. Despite the large number of identified microRNAs, the scope of their roles in regulating cellular gene expression is not known. The founding members of this family of noncoding RNAs are the C. elegans lin-4 and let-7 (refs. 5,6). Expression of these microRNAs, originally termed short-temporal RNAs, is essential for proper timing of events during larval development. For example, levels of the let-7 RNA increase during the fourth larval stage and the adult stage, resulting in suppression of larval-specific genes, including lin-41 (ref. 6). Partially complementary elements in the lin-41 mRNA are binding sites for let-7 (ref. 7).
The role of microRNAs in cell lineage and development has recently been found to extend to mammalian systems. miR-181 is highly expressed in hematopoietic progenitors, and its overexpression promotes differentiation into B-lineage cells8. The regulation of homeobox genes by microRNAs further links this gene family to mammalian developmental processes2. One approach to identifying the cellular roles of microRNAs is the identification of mRNA targets. Several groups have developed computational methods to search for target sequences of microRNAs (see ref. 1 for a discussion). These methods have yielded hundreds of candidate targets in plants, Drosophila and mammals that implicate microRNAs in a diverse range of cellular pathways. Essential for the interpretation of these data, however, is an understanding of microRNA expression patterns vis-à-vis expression patterns of predicted targets. The temporally restricted expression of large sets of microRNAs in C. elegans and Drosophila has been reported9-11. More recently, tissue-specific expression patterns of mammalian microRNAs have been described12. All data were obtained by northern blot analysis of microRNA levels. As a refinement to this approach, the use of nylon macroarrays for analysis of 44 microRNAs during brain development has been reported13. All the aforementioned approaches, however,

prevents edge effects. We adapted MJ Research in situ PCR chambers as disposable hybridization chambers. A reference oligonucleotide set corresponding to all mature microRNAs, labeled with Cy5 (red channel), was included in all hybridizations. This reference set provides an internal hybridization control for every probe on the array. In principle, this could permit absol

MicroRNAs: SMALL RNAs WITH A BIG ROLE IN GENE REGULATION

Lin He and Gregory J. Hannon

MicroRNAs are a family of small, non-coding RNAs that regulate gene expression in a sequence-specific manner.
The two founding members of the microRNA family were originally identified in Caenorhabditis elegans as genes that were required for the timed regulation of developmental events. Since then, hundreds of microRNAs have been identified in almost all metazoan genomes, including worms, flies, plants and mammals. MicroRNAs have diverse expression patterns and might regulate various developmental and physiological processes. Their discovery adds a new dimension to our understanding of complex gene regulatory networks.

RNA INTERFERENCE (RNAi). A form of posttranscriptional gene silencing, in which dsRNA induces degradation of the homologous mRNA, mimicking the effect of the reduction, or loss, of gene activity.

The discovery of miRNAs

The founding member of the miRNA family, lin-4, was identified in C. elegans through a genetic screen for defects in the temporal control of post-embryonic development10,11. In C. elegans, cell lineages have distinct characteristics during four different larval stages (L1-L4). Mutations in lin-4 disrupt the temporal regulation of larval development, causing L1 (the first larval stage)-specific cell-division patterns to reiterate at later developmental stages10. Opposite developmental phenotypes -- omission of the L1 cell fates and premature development into the L2 stage -- are observed in worms that are deficient for lin-14 (REF. 12). Even before the molecular identification of lin-4 and lin-14, these loci were placed in the same regulatory pathway on the basis of their opposing phenotypes and antagonistic genetic interactions11. Most genes identified from mutagenesis screens are protein-coding, but lin-4 encodes a 22-nucleotide non-coding RNA that is partially complementary to seven conserved sites located in the 3′-untranslated region (UTR) of the lin-14 gene (FIG. 1b)13,14.
lin-14 encodes a nuclear protein, downregulation of which at the end of the first larval stage initiates the developmental progression into the second larval stage13,15. The negative regulation of LIN-14 protein expression requires an intact 3′ UTR of its mRNA14, as well as a functional lin-4 gene13. These genetic interactions inspired a series of molecular and biochemical studies demonstrating that the direct, but imprecise, base pairing between lin-4 and the lin-14 3′ UTR was essential for the ability of lin-4 to control LIN-14 expression through the regulation of protein synthesis16-18. Through an analogous mechanism, lin-4 also negatively regulates the translation of lin-28, a cold-shock-domain protein that initiates the developmental transition between the L2 and L3 stages19. Compared with lin-14, lin-28 has fewer lin-4 binding sites, which might lead to its translational repression being delayed following lin-4 expression owing to less efficient lin-4 binding6,19. The discovery of lin-4 and its target-specific translational inhibition hinted at a new mechanism of gene regulation during development. In 2000, almost 7 years after the initial identification of lin-4, the second miRNA, let-7, was discovered, also using forward genetics in worms. let-7 encodes a temporally regulated 21-nucleotide small RNA that controls the developmental transition from the L4 stage into the adult stage20-22. Similar to lin-4, let-7 performs its function by binding to the 3′ UTR of lin-41 and hbl-1 (lin-57), and inhibiting their translation20-24. The identification of let-7 not only provided another vivid example of developmental regulation by small RNAs, but also raised the possibility that such RNAs might be present in species other than nematodes.
Unlike lin-4, the orthologues of which in flies and mammals initially escaped bioinformatic searches, and were only recognized recently25,26, both let-7 and lin-41 are evolutionarily conserved throughout metazoans, with homologues that were readily detected in molluscs, sea urchins, flies, mice and humans27. This extensive conservation strongly indicated a more general role of small RNAs in developmental regulation, as supported by the recent characterization of miRNA functions in many metazoan organisms.

[Figure 1 residue. Recoverable panel labels: "lin-4 pre-miRNA"; "lin-4 miRNA" (22 nt); "Ribosome", "ORF", "lin-14". The base-pairing diagrams are not recoverable from the extracted text.]

miRNAs and siRNAs -- what's the difference?

Hundreds of miRNAs have now been identified in various organisms, and the RNA structure and regulatory mechanisms that have been characterized in lin-4 and let-7 still provide unique molecular signatures as to what defines miRNAs. miRNAs are generally 21-25-nucleotide, non-coding RNAs that are derived from larger precursors that form imperfect stem-loop structures (FIG. 1a)4,5. The mature miRNA is most often derived from one arm of the precursor hairpin, and is released from the primary transcript through stepwise processing by two ribonuclease-III (RNase III) enzymes28,29. At least in animals, most miRNAs bind to the target 3′ UTR with imperfect complementarity and function as translational repressors (see below for a discussion of plant miRNAs)4.
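The notion of perfect versus imperfect complementarity between a miRNA and a site in a target 3′ UTR can be made concrete with a small sketch. The Python below is only an illustration under simplifying assumptions: it aligns the two strands antiparallel without gaps, so it cannot represent the bulges and loops seen in real miRNA-target duplexes, and the function names and the toy 8-nt sequences are hypothetical rather than taken from any published target-prediction method.

```python
# Watson-Crick pairs and G:U wobble pairs for RNA.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pairing_track(mirna, site, allow_wobble=True):
    """Return a per-position pairing string for an ungapped antiparallel duplex.

    Both sequences are written 5'->3'. Position i of the miRNA is paired with
    position i counted from the 3' end of the site. '|' marks a Watson-Crick
    pair, ':' a G:U wobble pair, and ' ' a mismatch.
    """
    mirna, site = mirna.upper(), site.upper()
    if not mirna or len(mirna) != len(site):
        raise ValueError("sequences must be non-empty and of equal length")
    track = []
    for m, t in zip(mirna, reversed(site)):
        if (m, t) in WATSON_CRICK:
            track.append("|")
        elif allow_wobble and (m, t) in WOBBLE:
            track.append(":")
        else:
            track.append(" ")
    return "".join(track)


def fraction_paired(mirna, site, allow_wobble=True):
    """Fraction of positions that are paired (Watson-Crick or wobble)."""
    track = pairing_track(mirna, site, allow_wobble)
    return sum(ch != " " for ch in track) / len(track)


if __name__ == "__main__":
    toy_mirna = "UCCCUGAG"  # hypothetical 8-nt example, 5'->3'
    # A perfect site, a site with one mismatch, a site with one G:U wobble.
    for toy_site in ("CUCAGGGA", "CUCAGAGA", "CUUAGGGA"):
        print(toy_site, repr(pairing_track(toy_mirna, toy_site)),
              fraction_paired(toy_mirna, toy_site))
```

A site scoring 1.0 with only Watson-Crick pairs corresponds to the perfect complementarity associated with cleavage, whereas lower scores correspond to the imperfect pairing described above for animal miRNAs. Any threshold separating the two would be an additional assumption.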
Almost coincident with the discovery of the second miRNA, let-7, small RNAs were also characterized as components of a seemingly separate biological process, RNA interference (RNAi). RNAi is an evolutionarily conserved, sequence-specific gene-silencing mechanism that is induced by exposure to dsRNA30. In many systems, including worms, plants and flies, the stimulus that was used to initiate RNAi was the introduction of a dsRNA (the trigger) of ~500 bp. The trigger is ultimately processed in vivo into small dsRNAs of ~21-25 bp in length, designated as small interfering RNAs (siRNAs)31,32. It is now clear that one strand of the siRNA duplex is selectively incorporated into an effector complex (the RNA-induced silencing complex; RISC). The RISC directs the cleavage of complementary mRNA targets, a process that is also known as post-transcriptional gene silencing (PTGS) (FIG. 2)33. The evolutionarily conserved RNAi response to exogenous dsRNA might reflect an endogenous defense mechanism against virus infection or parasitic nucleic acids30. Indeed, mutations of the RNAi components greatly compromise virus resistance in plants, indicating that PTGS might normally mediate the destruction of the viral RNAs34. In addition, siRNAs can also regulate the expression of target transcripts at the transcriptional level, at least in some organisms. Not only can siRNAs induce sequence-specific promoter methylation in plants35,36, but they are also crucial for heterochromatin formation in fission yeast37,38, and transposon silencing in worms39,40. Fundamentally, siRNAs and miRNAs are similar in terms of their molecular characteristics, biogenesis and effector functions (see below for details). So, the current distinctions between these two species might be arbitrary, and might simply reflect the different paths through which they were originally discovered. 
miRNAs and siRNAs share a common RNase-III processing enzyme, Dicer, and closely related effector complexes, RISCs, for post-transcriptional repression (FIG. 2). In fact, much of our current knowledge of the biochemistry of miRNAs stems f

Understanding the molecular responses to hypoxia using Drosophila as a genetic model
Reza Farahani a, Gabriel G. Haddad a,b,*

Keywords: Anoxia, tolerance, genetic approaches; genes, anoxia tolerance, d ADAR; invertebrates, Drosophila melanogaster

'genetic' discoveries were being made, even without the understanding of the basis for heredity. Subsequent to this era, and more recently in the past couple of decades, the emphasis has shifted to a totally different paradigm. At present, a considerable amount of research is tied to the understanding of behavioral, biochemical or genetic processes at the molecular level because it may have direct implications for a disease process in mammals or humans. Examples include the past effort to understand the development of the thorax (or bithorax) and the ongoing effort to solve the molecular underpinnings of aging, tumor formation, alcohol intoxication, neurodegeneration, and memory. We have been interested in a variety of questions that span from O2 sensing to the cellular and molecular responses to hypoxia and to injury from anoxia. Although most of our previous work has been done in mammals, we have recently discovered that Drosophila is very resistant to O2 deprivation (Krishnan et al., 1997). This opened major avenues for us since Drosophila has been used so effectively in so many relevant research areas, as noted above. Indeed, in spite of many advances in monitoring oxygenation, there is still considerable morbidity and mortality arising from conditions with O2 deprivation leading to hypoxic/ischemic damage, especially brain injury.
Part of this failure is related to the complexity of the cascade of events that ensues after hypoxia. Hence, Drosophila has been used in our laboratory to solve some of the questions related to tolerance or susceptibility to hypoxia. In this review, the role and importance of genetic models, such as Drosophila melanogaster, are discussed and an example illustrating how to harness the power of Drosophila genetics is detailed. We also detail approaches that have been used in flies or other genetic models and have been shown to be very useful. We demonstrate that these approaches have also been fruitful in trying to understand hypoxic responses and the basis for tolerance or susceptibility to hypoxic tissue injury.

Some of the more recent studies in our laboratory as well as in others, using molecular and genetic approaches, have provided evidence that there are genes that can protect against or predispose to cell injury and death when nerve cells are exposed to O2 deprivation (Ma and Haddad, 1997; Haddad et al., 1997; Ma et al., 1999; Ma and Haddad, 1999, 2000). In this review, we will survey some of these novel approaches, focus on genetic models and delineate some of their experimental power.

2.1. Forward genetics

2.1.1. Tolerance to hypoxia, a Drosophila phenotype

Drosophila can be placed in 100% N2 for several hours and yet survive the stress with no apparent injury: following return to a normoxic milieu, they can mate, fly, and see, among other complex behaviors that seem to be intact. Furthermore, electron microscopic studies of the central nervous system of the fly did not show any disruption or swelling of any cellular organelles or membranes. The time period during which flies can sustain such a stress (i.e. hours) is clearly very significant since the life span of these flies is just over 1 month.
One of the interesting aspects of the Drosophila phenotype with respect to anoxia tolerance is that, unlike other animals (such as the turtle), Drosophila is tolerant not because of a lack of sensitivity to the stress. Indeed, these animals are very sensitive to stress and 'sense' hypoxia: when exposed to a partial pressure of O2 (PO2) of 0 (anoxia). Under these circumstances, flies lose coordination, stop moving first and then fall and remain motionless for the rest of the anoxic period (Krishnan et al., 1997; Haddad et al., 1997). When they are exposed to about 2-3% O2 (which is extremely low by mammalian standards), they continue flying and moving for hours albeit at a slower pace than in normoxic conditions. Their O2 consumption during hypoxia (2-3% O2) drops to about 20% of control and this demonstrates that they 'sense' the lack of O2 at the cellular level. Therefore, we believe that the Drosophila tolerance to the lack of O2 is derived from their ability to

Approaches for the study of hypoxia

Many approaches have been taken to study questions about the importance in nerve cell response and/or injury due to O2 deprivation. Some investigators have used acute settings and mostly electrophysiologic techniques, to examine ionic homeostasis (Haddad and Jiang, 1993). Others have relied on morphometric and anatomic approaches, and still others have focused almost exclusively on molecular approaches, especially in settings in which the stress is modest and cells and tissues withstood prolonged periods of hypoxia (Banasiak and Haddad, 1998; Banasiak et al.,

number of mutant lines (deficiencies, inversions, duplications, etc.) and chromosomal markers available for mapping and mutagenesis. (iii) There are tools available for the study of cell or organ physiology in Drosophila such as the Giant Fiber System, which is very well studied in Drosophila (Haddad et al., 1997).
Finally, (iv) P-elements, which are transposable DNA elements with known sequences, have been very useful in Drosophila for cloning, mutagenesis and over-expression of genes using Gal4 syst

Anomalies in the Expression Profile of Interspecific Hybrids of Drosophila melanogaster and Drosophila simulans
Genome Research
Ranz et al.

RESULTS AND DISCUSSION

TECHNICAL REPORTS

Comparing genomic expression patterns across species identifies shared transcriptional profile in aging

We developed a method for systematically comparing gene expression patterns across organisms using genome-wide comparative analysis of DNA microarray experiments. We identified analogous gene expression programs comprising shared patterns of regulation across orthologous genes. Biological features of these patterns could be identified as highly conserved subpatterns that correspond to Gene Ontology categories. Here, we demonstrate these methods by analyzing a specific biological process, aging, and show that similar analysis can be applied to a range of biological processes. We found that two highly diverged animals, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, implement a shared adult-onset expression program of genes involved in mitochondrial metabolism, DNA repair, catabolism, peptidolysis and cellular transport. Most of these changes were implemented early in adulthood. Using this approach to search databases of gene expression data, we found conserved transcriptional signatures in larval development, embryogenesis, gametogenesis and mRNA degradation.

Gene expression profiling measures the expression levels of thousands of genes at once1,2. Most expression profiling studies have focused on the specific genes that respond to specific conditions, but another important direction in functional genomics is to derive insight from global patterns of gene expression.
Genome-scale expression patterns have been used as physiological 'fingerprints' for classifying tumors3,4 and assigning uncharacterized mutations and drugs to known pathways5. Because they use information from many genes at once, patterns have great discriminating power, even when the transcriptional effects on individual genes are small5,6. The patterns of changes in gene expression observed in microarray experiments can be extensive and complex. To try to analyze these patterns, we exploited the principle that important biological processes are often conserved between organisms. We present an approach to comparative functional genomics based on shared patterns of regulation across orthologous genes. We also present a method for identifying conserved biological components of those patterns that correspond to Gene Ontology categories. These methods can be used to search databases of microarray experiments to discover connections among biological processes in different organisms.

RESULTS

Comparing genomic expression patterns across species

We used phylogenetic analysis to systematically identify orthologous groups of genes for all pairwise comparisons between C. elegans, D. melanogaster, Saccharomyces cerevisiae and Homo sapiens (Supplementary Tables 1-5 online). For C. elegans and D. melanogaster, we identified 3,851 most-conserved orthologous gene pairs (Fig. 1a). We used DNA microarrays in each organism to compare gene expression under different conditions (Fig. 1b). We then used gene phylogenetic relationships to match systematically the measurements of differential expression between orthologous genes from the two organisms (Fig. 1c). We used the correlation of the log-transformed relative change in expression of orthologous genes to assess the extent of shared regulation.
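The comparison described above, a correlation of log-transformed relative expression changes across orthologous gene pairs, can be sketched in a few lines. The Python below is a minimal illustration and not the authors' pipeline: the data layout (a dictionary mapping each gene to a reference and a test expression value), the function names and the toy numbers are all assumptions, and the use of base-2 logarithms is a choice made here, since the text does not state the base.

```python
import math


def log_ratio(test, reference):
    """log2 of the relative change in expression (test versus reference)."""
    return math.log2(test / reference)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def cross_species_correlation(expr_a, expr_b, ortholog_pairs):
    """Correlate log-ratios between two species over orthologous gene pairs.

    expr_a, expr_b: dict mapping gene -> (reference, test) expression values.
    ortholog_pairs: iterable of (gene_in_a, gene_in_b); pairs with a gene
    missing from either data set are skipped.
    """
    xs, ys = [], []
    for gene_a, gene_b in ortholog_pairs:
        if gene_a in expr_a and gene_b in expr_b:
            ref_a, test_a = expr_a[gene_a]
            ref_b, test_b = expr_b[gene_b]
            xs.append(log_ratio(test_a, ref_a))
            ys.append(log_ratio(test_b, ref_b))
    return pearson(xs, ys)


if __name__ == "__main__":
    # Hypothetical toy values: (young adult, middle-aged adult) expression.
    worm = {"w1": (1.0, 2.0), "w2": (1.0, 4.0), "w3": (1.0, 1.0)}
    fly = {"f1": (2.0, 8.0), "f2": (1.0, 16.0), "f3": (3.0, 3.0)}
    pairs = [("w1", "f1"), ("w2", "f2"), ("w3", "f3")]
    print(cross_species_correlation(worm, fly, pairs))
```

Working on log-ratios makes a two-fold increase and a two-fold decrease symmetric around zero. Assessing the significance of the resulting coefficient would require an additional test that this sketch does not include.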
Global similarity of transcriptional profiles of aging

Using this approach, we asked whether gene expression patterns in adult aging were shared by two highly diverged animals: the nematode C. elegans and the fruit fly D. melanogaster, whose last common ancestor existed about one billion years ago7. We used spotted-PCR-product microarrays1 to compare gene expression in middle-aged adult (6 d adult) and young adult (0 d adult) sterile C. elegans hermaphrodites and used Affymetrix oligonucleotide microarrays2 to compare expression in middle-aged adult (23 d old) and young adult (3 d old) female flies8. The cross-species Pearson correlation of the log-transformed relative change in expression of orthologous genes during aging was 0.144, which is significant at the 10^-11 level. Sixteen comparisons of independent experimental replicates all had high significance values, with a mean

review article

The immune response of Drosophila
Jules A. Hoffmann
Institut de Biologie Moleculaire et Cellulaire du CNRS, 67084 Strasbourg Cedex, France

Drosophila mounts a potent host defence when challenged by various microorganisms. Analysis of this defence by molecular genetics has now provided a global picture of the mechanisms by which this insect senses infection, discriminates between various classes of microorganisms and induces the production of effector molecules, among which antimicrobial peptides are prominent. An unexpected result of these studies was the discovery that most of the genes involved in the Drosophila host defence are homologous or very similar to genes implicated in mammalian innate immune defences. Recent progress in research on Drosophila immune defence provides evidence for similarities and differences between Drosophila immune responses and mammalian innate immunity.

Toll(s) in the host defence of Drosophila

Toll activation during the immune response (Fig. 1) is strictly dependent on the product of the Spaetzle gene.
The Spaetzle protein is a cystine-knot molecule with structural similarities to mammalian neurotrophins, and requires proteolytic cleavage for full biological activity23,24. This cleavage is induced by a proteolytic cascade activated as an early result of infection. The mature 12-kDa form of Spaetzle binds as a dimer to the Toll ectodomain with high affinity (Kd < 0.4 nM) and with a stoichiometry of one Spaetzle dimer to two receptor proteins25. The intracytoplasmic TIR domain of Toll interacts with three partners, each of which has a death domain. Two of these are adaptor proteins: the Drosophila homologue of MyD88 (refs 26-29), which in addition to the death domain has a TIR domain similar to that of Toll with which it associates, and Tube. Tube has no obvious mammalian homologue. The third death-domain protein in this receptor-adaptor complex is Pelle, which has a serine-threonine kinase domain and is homologous to mammalian IRAKs (interleukin-1 receptor-associated kinases; reviewed in ref. 30). Depending on the developmental stage, Toll can activate two closely related NF-κB proteins in immune-responsive tissues: DIF (Dorsal-related immunity factor; ref. 31) in adults, and Dorsal and/or DIF in larvae32-34. The end effect of Toll signalling is the dissociation of NF-κB protein from the ankyrin-repeat inhibitory protein Cactus, a homologue of mammalian IκBs. This process involves signal-dependent phosphorylation of Cactus, followed by its degradation by the proteasome35,36. The activation of Dorsal requires phosphorylation, in addition to dissociation from Cactus (see also refs 37,38). It is unclear how activation of the Toll receptor-adaptor complex leads to these various processes.
Although Drosophila expresses genes encoding members of the TRAF (TNF-receptor-associated factor) family and homologues of mammalian IKK-β (IκB kinase-β) and IKK-γ/NEMO, genetic studies have failed so far to demonstrate an involvement of any of these genes downstream of Toll. Furthermore, Pelle does not directly phosphorylate Cactus and the identity of the Cactus kinase remains elusive. The precise roles of the Toll pathway during the response to fungal and Gram-positive bacterial infection are not fully understood. One effect is obviously to direct the expression of various antimicrobial peptides. However, microarray data have indicated that hundreds of genes are markedly upregulated as a consequence of the challenge-dependent activation of Toll39,40, and their functions have not yet been adequately addressed. In addition to Toll, the Drosophila genome contains eight homologues (18-Wheeler/Toll-2 to Toll-9)41. Except for Toll, it has not been possible to unequivocally a

A genome-wide analysis of immune responses in Drosophila
Phil Irving*, Laurent Troxler*, Timothy S. Heuer, Marcia Belvin, Casey Kopczynski, Jean-Marc Reichhart*, Jules A. Hoffmann*, and Charles Hetru*§

Oligonucleotide DNA microarrays were used for a genome-wide analysis of immune-challenged Drosophila infected with Gram-positive or Gram-negative bacteria, or with fungi. Aside from the expression of an established set of immune defense genes, a significant number of previously unseen immune-induced genes were found. Genes of particular interest include corin- and Stubble-like genes, both of which have a type II transmembrane domain; easter- and snake-like genes, which may fulfil the roles of easter and snake in the Toll pathway; and a masquerade-like gene, potentially involved in enzyme regulation. The microarray data have also helped to greatly reduce the number of target genes in large gene groups, such as the proteases, helping to direct the choices for future mutant studies.
Many of the up-regulated genes fit into the current conceptual framework of host defense, whereas others, including the substantial number of genes with unknown functions, offer new avenues for research.

at either 18 or 25°C. Adult male flies were removed from the colonies at 1 day old and kept at 18°C until 3 days old. At this age, flies were either inoculated or designated as controls. Control and infected flies were snap-frozen in liquid nitrogen and stored at −80°C before extraction of total RNA.

Microbial Challenge of Flies. Inoculation with bacteria. The bacteria Escherichia coli and Micrococcus luteus were precultured in LB medium. Pellets taken when the cultures were in the log phase of growth were resuspended in a small amount of culture medium, and sharpened needles dipped into these suspensions were used to inoculate the flies. Flies were harvested at 6, 12, and 48 h after inoculation. Natural infection with fungi. Flies anaesthetized with CO2 were shaken for a few minutes in a Petri dish containing a sporulating culture of Beauveria bassiana. Flies covered with spores were placed in fresh tubes of Drosophila medium and kept at 25°C. Flies were collected 3 days after infection. Sample Preparation and Analysis. For each time point and infection

Innate immunity is the first-line defense of multicellular organisms that operates to limit infection after exposure to microbes. Invertebrates and vertebrates share a common ancestry for this defense system, illustrated by the striking conservation of the intracellular signaling pathways that regulate the rapid transcriptional response to infection in the fruit fly Drosophila and in mammals (1, 2). Because of its flexible genetics, Drosophila has emerged as a powerful model system for the study of innate immunity.
Prominent among the innate immunity reactions is the phagocytosis or encapsulation of the invading organism by the hemocytes (3) and the massive synthesis of antimicrobial peptides by the fat body (4, 5), a functional equivalent of the liver. Transcriptional induction of antimicrobial peptide genes is known to be controlled by at least two distinct pathways, Toll and Imd (6). Although much has been learned about Drosophila immunity through genetic screens and biochemical analyses, many questions remain. For example, what gene products are responsible for recognition of invading pathogens and how do they activate the Toll or Imd pathways? What genes other than the antimicrobial peptide genes are induced after immune challenge and what roles do these genes play in the innate immune response? To complement the genetic approaches currently underway, transcriptional profiling experiments were carried out to survey the majority of Drosophila genes for their response to bacterial and fungal infection, using Affymetrix (Santa Clara, CA) GeneChips. The induction of the various Drosophila antimicrobial peptides correlated well with many earlier studies based on Northern blotting experiments (7, 8), confirming the accuracy of the microarray methodology used. In addition, a large number of genes previously unknown to be induced by infection were identified. The potential role of these genes in recognition, signaling, and effector mechanisms of the Drosophila immune response can now be assessed by using reverse genetic tools available in Drosophila.

Materials and Methods

Drosophila Stocks. Cinnabar brown flies (cn bw) were reared on standard cornmeal medium in vials held in humid culture rooms,

L.T., and T.S.H. contributed equally to this work.
December 18, 2001

Table 1. Absolute and relative expression values for genes discussed in text

Caspase
  CG7486 (Dredd)       Death related ced-3 Nedd2-like protein
  CG7788 (Ice)         Interleukin-1 beta-converting enzyme
  CG14902 (Decay)      Death executioner caspase related to Apopain
  CG18188 (Daydream)   Death Associated Molecule related to Mch2
Defense or immunity protein
  CG11709 (PGRP-SA)    Peptidoglycan recognition protein-SA
  CG9681 (PGRP-SB1)    Peptidoglycan recognition protein-SB1
  CG14745 (PGRP-SC2)   Peptidoglycan recognition protein-SC2
  CG7496 (PGRP-SD)     Peptidoglycan recognition protein-SD
  CG14704 (PGRP-LB)    Peptidoglycan recognition protein-LB
  CG10146 (AttA)       Attacin-A
  CG18372 (AttB)       Attacin-B
  CG4740 (AttC)        Attacin-C
  CG7629 (AttD)        Attacin-D
  CG1365 (CecA1)       Cecropin A1
  CG1367 (CecA2)       Cecropin A2
  CG1878 (CecB)        Cecropin B
  CG1373 (CecC)        Cecropin C
  CG12763 (Dpt)        Diptericin A
  CG10794 (DptB)       Diptericin B
  CG1385 (D

Open Access

Genomic analysis of heat-shock factor targets in Drosophila
Ian Birch-Machin¤*, Shan Gao¤, David Huen, Richard McGirr*, Robert AH White* and Steven Russell

Birch-Machin et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We have used a chromatin immunoprecipitation-microarray (ChIP-array) approach to investigate the in vivo targets of heat-shock factor (Hsf) in Drosophila embryos. We show that this method identifies Hsf target sites with high fidelity and resolution. Using cDNA arrays in a genomic search for Hsf targets, we identified 141 genes with highly significant ChIP enrichment. This study firmly establishes the potential of ChIP-array for whole-genome transcription factor target mapping in vivo using intact whole organisms.
Chromatin immunoprecipitation or, more correctly, immunopurification (ChIP) has emerged as a valuable approach for identifying the in vivo binding sites of transcription factors [1-6]. Before the availability of complete genome sequence the use of this approach for identifying transcription targets on a genome-wide scale was, however, limited. Over the past few years, a number of laboratories have successfully used high-density DNA microarrays to identify sequences enriched by chromatin immunopurification (the ChIP-array approach). In the yeast Saccharomyces cerevisiae, microarrays containing virtually all of the intergenic sequences from the genome have been used to identify the binding sites of a large number of transcription factors [7,8]. In principle, the same techniques can be applied to higher eukaryotes, but the complexity of their genomes presents a challenge for the construction of full genomic microarrays.

Despite such difficulties, several studies have shown the feasibility of the ChIP-array approach with small regions of complex eukaryotic genomes using tissue culture systems. In cultured mammalian cells, for example, the binding sites for several transcription factors have been mapped using microarrays composed of specific promoter regions or enriched for promoter sequences with CpG arrays [9-11]. Although such studies are valuable in identifying some of the targets of particular transcription factors, they are limited because the microarray designs restrict the analysis to proximal promoter elements of a subset of genes. It would be preferable to examine binding sites in an unbiased fashion by constructing tiling arrays composed of all possible binding targets.
Such tiling arrays have been constructed on a small scale with microarrays containing a series of 1-kb fragments from the β-globin locus [12], or on a large scale with oligonucleotide arrays containing elements that detect all the unique sequences of human chromosomes 21 and 22 [13]. These studies indicate that the DNA-binding patterns of regulatory molecules in large eukaryotic genomes are complex and highlight the need for a comprehensive approach to understand how transcription factors interact with DNA in vivo. Drosophila melanogaster, with a genome complexity intermediate between that of yeast and human, provides a powerful system for investigating transcription factor targets and regulatory networks in a complex multicellular eukaryote. Recently, the principle of using Drosophila genome tile arrays to identify transcription factor binding sites in tissue culture cells has been demonstrated. Using a technique employing fusions between DNA-binding proteins and the Escherichia coli DNA adenine methyltransferase (DamID; [14]) the binding locations for the GAGA transcription factor and the heterochromatin protein HP1 were mapped within a 3-Mb region of the Drosophila genome in a tissue culture system [15]. Other studies have used this method to map proximal binding sites with cDNA arrays [16]. While this elegant technique has the advantage that high-quality antibodies against particular transcription factors are not required, and a recent study indicates that it may be possible to transfer from a tissue culture system to the intact organism [17], it clearly has limitations, as in vivo the DAM-tagged transcription factor is not expressed in its normal developmental context. It is therefore desirable to develop methods that allow the mapping of native transcription factors in their correct in vivo context within the organism.
Here we adapt chromatin immunopurification techniques using intact Drosophila embryos and demonstrate the reliable identification of in vivo binding sites for the heat-shock transcription factor Hsf on both genome tile and cDNA arrays. The response of most organisms to heat stress involves the rapid induction of a set of heat-shock proteins (Hsps), including several chaperone molecules that assist in protecting the cell from the deleterious effects of heat [18-21]. Several direct targets of the Hsf transcription factor are already well characterized. In higher eukaryotes, including Drosophila and mammals, heat stress results in the trimerization of Hsf monomers, which then bind with high affinity to regulatory elements (heat-shock elements, HSE) close to the transcriptional start sites of Hsp genes [22,23]. The Drosophila heat-shock system has been characterized at several levels, from the cytological mapping of Hsf-binding sites on polytene chromosomes [22] to the detailed molecular and biochemical analysis of transcriptional regulation at individual Hsp genes [24-26]. In this study we extend the analysis of the Drosophila heat-shock response by demonstrating that chromatin immunopurification from embryos can accurately map in vivo Hsf-binding sites on genome tile microarrays and identify new potential in vivo HSEs. In addition, using microarrays containing full-length cDNA clones for over 5,000 Drosophila genes we identify almost 200 genes that are reproducibly bound by Hsf upon heat shock in Drosophila embryos. The targets correspond well with previously identified cytological locations of Hsf binding on salivary gland polytene chromosomes, thus providing direct target genes associated with the low-resolution cytological analysis. A comparison with studies using S. cerevisiae Hsf [27,28] suggests that a set of conserved genes are regulated by Hsf in both organisms.
Overall, this study demonstrates the strong potential of this approach for in vivo genome-wide mapping of transcription factor binding sites in higher eukaryotes using the whole organism. 0 Results and discussion 0 Immunopurification of Hsf-bound chromatin 0 Genome Biology 2005, 6:R63 0 Nature Publishing Group http://genetics.nature.com 0 The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster 1 Wei Jin1,4*, Rebecca M. Riley1*, Russell D. Wolfinger2, Kevin P. White3, Gisele Passador-Gurgel1 & Greg Gibson1 0 Here we present a statistically rigorous approach to quantifying microarray expression data that allows the relative effects of multiple classes of treatment to be compared and incorporates analytical methods that are common to quantitative genetics. From the magnitude of gene effects and contributions of variance components, we find that gene expression in adult flies is affected most strongly by sex, less so by genotype and only weakly by age (for 1- and 6-wk flies); in addition, sex x genotype interactions may be present for as much as 10% of the Drosophila transcriptome. This interpretation is compromised to some extent by statistical issues relating to power and experimental design. Nevertheless, we show that changes in expression as small as 1.2-fold can be highly significant. Genotypic contributions to transcriptional variance may be of a similar magnitude to those relating to some quantitative phenotypes and should be considered when assessing the significance of experimental treatments. 0 Statistical genetic approaches to mapping genotype onto phenotype continue to place in a black box all the events occurring between the gene and the appearance of a trait.
Despite the historical successes of partitioning environmental and interaction effects into variance components1, it can be argued that the failure to include a mechanistic component in this general approach presents a considerable obstacle to the integration of developmental/physiological genetics and quantitative genetics. In this context, the precise quantification of intracellular processes such as transcription and translation should be an important goal of genomic analysis. Comparing gene expression among lines and treatments using complementary DNA microarray technology presents one means of achieving this goal. Currently, microarray data are most often analyzed by comparing an experimental treatment to a common control and measuring the ratio of inferred transcript levels for each gene from the ratio of fluorescence2. This approach is inadequate for quantitative analysis for two main reasons: the choice of arbitrary ratio thresholds has no sound basis in statistical theory and the approach does not provide the flexibility to allow direct comparison of different sources of variance. It has been pointed out that standard methods of quantitative genetic analysis can be applied to microarray data3,4, and in fact such methods suggest experimental designs that dispense with reference samples but increase statistical power as compared with ratio-based methods5. Relying on moderate levels of replication, these methods allow investigators to identify significant 0 reported, but the fact that most fly traits show sex variance and sex x genotype interactions9,13 in addition to the obvious differences between the male and female reproductive systems implies that the transcriptomes of the two sexes are likely to be quite different.
Here we use a long-standing and widely used statistical method from agricultural and quantitative genetics, the mixed model analysis of variance14, to rank the effects of sex, genotype and age on transcription and to draw comparisons between the contributions of sex and genotype to the variance of transcription and of phenotypic traits. 0 Our experimental design consisted of 24 cDNA microarrays, 6 for each combination of 2 genotypes (Oregon R and Samarkand) and the 2 sexes, involving 48 separate labeling reactions. We directly contrasted two time points, 1-wk and 6-wk adult flies, on each microarray. The dyes Cy3 and Cy5 were flipped for two of the six replicates of each genotype and sex combination. A common reference sample was not used. In total, we spotted 4,256 clones, representing a third of the genome; two-thirds of these were verified by resequencing before printing. We excluded 325 clones from the analysis because no consistent expression above background was detected. We analyzed fluorescence levels with the objective of establishing whether the level of expression of each gene relative to the sample mean of the labeling reaction varies according to sex, genotype and age. We used two sequential analyses of variance (ANOVAs). This procedure uses differences in normalized expression levels, rather than ratios, as the unit of analysis of expression differences, eliminating the need for a reference sample. The statistical model for each clone simultaneously fits the effects of the treatments of interest across the entire experiment, allowing direct contrasts of the magnitude of the effects caused by each treatment and interactions among treatments. Differences in global levels of transcription among treatments can also be tested (Methods).
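The two-step procedure described above can be sketched in code. The following is a minimal illustration, assuming a simplified version of the analysis: the first step expresses each log2 intensity relative to the mean of its labeling reaction (one array and one dye), and the second step estimates a single treatment effect per clone as a difference of group means. The field names (`array`, `dye`, `sex`, `age`, `clone`, `intensity`), the function names and the toy values are hypothetical, and the plain difference of means stands in for the mixed-model ANOVA, which is not reproduced here.

```python
from collections import defaultdict
from math import log2
from statistics import mean


def normalize(records):
    """Step 1: center each log2 intensity on the mean of its labeling reaction."""
    logs_by_reaction = defaultdict(list)
    for r in records:
        logs_by_reaction[(r["array"], r["dye"])].append(log2(r["intensity"]))
    reaction_mean = {key: mean(vals) for key, vals in logs_by_reaction.items()}
    return [
        dict(r, rel=log2(r["intensity"]) - reaction_mean[(r["array"], r["dye"])])
        for r in records
    ]


def factor_effect(normalized, clone, factor, level_a, level_b):
    """Step 2 (simplified): mean relative expression at level_a minus level_b."""
    a = [r["rel"] for r in normalized if r["clone"] == clone and r[factor] == level_a]
    b = [r["rel"] for r in normalized if r["clone"] == clone and r[factor] == level_b]
    return mean(a) - mean(b)


def rec(array, dye, sex, age, clone, intensity):
    """Build one hypothetical spot measurement."""
    return {"array": array, "dye": dye, "sex": sex, "age": age,
            "clone": clone, "intensity": intensity}


# Hypothetical toy data: two arrays, each contrasting 1-wk and 6-wk flies.
toy = [
    rec(1, "Cy3", "F", "1wk", "g1", 16), rec(1, "Cy3", "F", "1wk", "g2", 4),
    rec(1, "Cy5", "F", "6wk", "g1", 32), rec(1, "Cy5", "F", "6wk", "g2", 8),
    rec(2, "Cy3", "M", "1wk", "g1", 8), rec(2, "Cy3", "M", "1wk", "g2", 8),
    rec(2, "Cy5", "M", "6wk", "g1", 4), rec(2, "Cy5", "M", "6wk", "g2", 4),
]
normalized = normalize(toy)
sex_effect_g1 = factor_effect(normalized, "g1", "sex", "F", "M")
age_effect_g1 = factor_effect(normalized, "g1", "age", "1wk", "6wk")
```

In this toy set the clone `g1` differs between the sexes by one log2 unit (a twofold difference) and not at all between ages, even though raw intensities differ between dyes and arrays. Working with differences of normalized values, not ratios to a reference, is what lets the effects of several treatments be compared on one scale.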
In our experiment, the male samples tended to show higher fluorescence intensities than the female ones, although the magnitude of the effect was very small relative to significant individual gene differences. Cluster analysis of normalized expression levels tends to group genes according to the overall mean fluorescence intensity and, to some extent, to the greatest effect (in this case sex), but is inefficient at identifying groups of genes coregulated in more subtle ways. Nevertheless, after grouping genes according to the significance of fixed effects, we used TreeView15 to provide a visual representation analogous to the standard method of representing ratio effects (Fig. 1). Representative sex and sex x genotype interaction effects of various types are clearly seen in Fig. 1a,b, whereas more subtle genotype and age effects can be seen by close inspection of Fig. 1c,d. Plots of normalized expression levels for individual clones provide a visual means of assessing within- and among-treatment variance (Fig. 2). Lines link measures on a single array, with age-contrasted pairs of points corresponding to, from left to right, Oregon R females and males and then Samarkand females and males. The top two genes (fs(1)K10 and ebony) show significant effects of both sex and genotype, whereas the testes-enriched gene ocnus shows only the sex effect. CG9090, which encodes a putative mitochondrial phosphate transporter (Flybase; http://flybase.bio.indiana.edu), is unaffected by sex or genotype, but is consistently reduced in older flies (P<0.0001, ANOVA). Note that few of these effects exceed the commonly used arbitrary threshold of a twofold 0 Rapid evolution of male-biased gene expression in Drosophila 1 Colin D. Meiklejohn*, John Parsch, José M. Ranz*, and Daniel L. Hartl* 0 A number of genes associated with sexual traits and reproduction evolve at the sequence level faster than the majority of genes coding for non-sex-related traits.
Whole genome analyses allow this observation to be extended beyond the limited set of genes that have been studied thus far. We use cDNA microarrays to demonstrate that this pattern holds in Drosophila for the phenotype of gene expression as well, but in one sex only. Genes that are male-biased in their expression show more variation in relative expression levels between conspecific populations and two closely related species than do female-biased genes or genes with sexually monomorphic expression patterns. Additionally, elevated ratios of interspecific expression divergence to intraspecific expression variation among male-biased genes suggest that differences in rates of evolution may be due in part to natural selection. This finding has implications for our understanding of the importance of sexual dimorphism for speciation and rates of phenotypic evolution. 0 microarray intraspecific variation interspecific variation cDNA 0 Anisogamous reproduction is common in many animal and plant species and can produce a number of conflicts with important evolutionary consequences. For example, differential selection coefficients between the two sexes can lead to stable genetic polymorphisms or a decline in population mean fitness (1). It can also drive accelerated rates of phenotypic evolution, as many morphologies associated with sex and reproduction diverge more rapidly than other phenotypes (2). Molecular techniques that provide rapid and quantitative measures of genotypic and phenotypic variation have extended this pattern to include accelerated rates of evolution among proteins with sexual or reproductive functions (3, 4). Since then, most data supporting this observation have come from homologous nucleotide sequences of genes that are associated with sex or reproduction.
In ciliates, green algae, diatoms, angiosperms, fungi, and at least four animal phyla, unusually high ratios of nonsynonymous to synonymous substitutions (dN/dS) between species have been documented in sex-related genes (reviewed in ref. 5). Some of these genes also show high levels of intraspecific differentiation (5). In Drosophila, much of this work has focused on genes that are expressed in testes or accessory glands (e.g., refs. 6 and 7), although a high dN/dS has also been observed for genes expressed in females and components of the sex determination pathway (8). Protein coding sequences provide a natural context for studying rates of evolution, as the effect of a given nucleotide substitution on the polypeptide is predictable, and comparison between neighboring synonymous and nonsynonymous sites controls for mutation rate. Because of the lack of an analogous context for regulatory sequences, the rates and patterns of evolution in regions of the genome controlling gene expression are less well understood. Thus, it is not known whether the rapid rates of evolution among genes associated with sex and reproduction hold for gene expression as well. Because a large proportion of important phenotypic evolution may be the result of changes in gene expression (9, 10), understanding rates and patterns of regulatory change within and between species is critical for a comprehensive picture of biological evolution. Given the pattern seen for amino acid sequences and morphologies, we would predict that genes associated with sex should be evolving faster at the level of gene regulation as well. Indeed, much of the divergence among proteins in the male reproductive tract of Drosophila may be attributable to large changes in protein levels, which is likely due in part to changes in gene expression (3).
To test this prediction, we obtained gene expression data for 1/3 of the genome from adult males of eight strains of Drosophila melanogaster, and from adult males and females of one strain of D. melanogaster and one strain of Drosophila simulans. By analyzing intra- and interspecific expression differentiation within males and the sex-specificity of expression in both species, we show that gene expression in males evolves more rapidly than in females. Genes that are male-biased in their expression have on average more intra- and interspecific divergence in expression than genes with female-biased expression. Furthermore, comparison of intra- and interspecific differentiation suggests that at least some of the excess in divergence among male-biased genes (MBGs) is due to differential selective pressures acting on the expression of different sex-biased classes of genes. Materials and Methods 0 Gene Collection version 1.0 (12) were amplified by PCR with universal primers, and the products were confirmed by gel 0 This paper was submitted directly (Track II) to the PNAS office. Abbreviations: MBGs, male-biased genes; FBGs, female-biased genes; UBGs, unbiased genes; OBGs, ovary-biased genes. 0 Table 1. Overrepresentation of MBGs among genes with polymorphic expression within D. melanogaster 0 Subsets of genes include those that exhibit at least one pairwise difference between any two strains at the significance level indicated. G, G test of independence. 0 Influence of age, sex, and strength training on human muscle gene expression determined by microarray 0 THE LOSS OF SKELETAL MUSCLE 0 SKELETAL MUSCLE GENE EXPRESSION 0 Physiol Genomics · VOL 0 blood, and connective tissue, enclosed in cryovials, snap-frozen in liquid nitrogen, and stored at −80°C until analysis. Microarray molecular biology.
Total RNA was extracted using the SV RNA Isolation Kit (Promega) according to manufacturer's instructions (which included DNase I treatment) and quantitated by determining absorbance at 260 nm in triplicate, with the values averaged. For each microarray experiment, 1 µg of total RNA was used per hybridization; thus, 200 ng of total RNA was taken from each sample and pooled for each group. Arrays were hybridized according to the manufacturer's instructions, once for each experimental condition (baseline, ST) within a single group. Thus four total microarrays, one for each of the four groups, were hy 0 Transcriptional Repressor Functions of Drosophila E2F1 and E2F2 Cooperate To Inhibit Genomic DNA Synthesis in Ovarian Follicle Cells 0 CAYIRLIOGLU ET AL. 0 MOL. CELL. BIOL. 0 Research article 0 A genomic analysis of Drosophila somatic sexual differentiation and its regulation 1 Michelle N. Arbeitman1,*, Alice A. Fleming1, Mark L. Siegal1, Brian H. Null2 and Bruce S. Baker1 0 In virtually all animals, males and females are morphologically, physiologically and behaviorally distinct. Using cDNA microarrays representing one-third of Drosophila genes to identify genes expressed sex-differentially in somatic tissues, we performed an expression analysis on adult males and females that: (1) were wild type; (2) lacked a germline; or (3) were mutant for sex-determination regulatory genes. Statistical analysis identified 63 genes sex-differentially expressed in the soma, 20 of which have been confirmed by RNA blots thus far. In situ hybridization experiments with 11 of these genes showed they were sex-differentially expressed only in internal genital organs. The nature of the products these genes encode provides insight into the molecular physiology of these reproductive tissues.
Analysis of the regulation of these genes revealed that their adult expression patterns are specified by the sex hierarchy during development, and that doublesex probably functions in diverse ways to set their activities. 0 Key words: Drosophila, Sex determination, Microarray, Somatic, Reproduction 0 In essence, sexual reproduction is the process whereby two gametes, one contributed by each parent, fuse to form a new individual. Achieving this end is an elaborate process that in multicellular animals requires, along with germline development, the appropriate sex-specific development and physiology of the external genitalia, portions of the nervous system that control sex-specific reproductive behaviors, somatic tissues of the gonads (which play important roles in gametogenesis), and the internal genital organs (whose products are important both pre- and post-copulation for successful reproduction). Currently, we have limited knowledge, in any organism, of the sets of genes that are deployed sex-differentially in adult somatic tissues, and limited knowledge of their roles in sexual reproduction. Drosophila melanogaster is a powerful model system in which to acquire an understanding of the sex-specific physiology of adult somatic tissues, because we have a thorough understanding at the molecular-genetic level of the regulatory hierarchy that controls somatic sexual differentiation (Fig. 1) (reviewed by Cline and Meyer, 1996; Baker et al., 2001; Christiansen et al., 2002). There have been significant advances in understanding how the actions of DSXF and DSXM, terminal transcription factors in the hierarchy encoded by the doublesex (dsx) gene, are integrated with other key developmental hierarchies to achieve sex-specific patterns of growth, morphogenesis and differentiation (reviewed by 0 Christiansen et al., 2002). 
However, we have relatively little knowledge of the genes that are sex-differentially deployed in adults through the action of the two final genes in the hierarchy, dsx and fruitless (fru), which encodes (among several isoforms) a male-specific transcription factor hereafter referred to as FRUM. Several approaches have been used to identify genes expressed sex-differentially in D. melanogaster adults. The most thoroughly studied tissue is the male accessory gland, in which 75 genes have been identified using biochemical purification and differential cDNA hybridization (reviewed by Wolfner, 2002). Several of these genes encode proteins whose effects in the mated female have been characterized and include decreasing female receptivity to re-mating, increasing ovulation and egg laying, and facilitating sperm storage. Additional screens have focused on sex-differential gene expression in the head and foreleg. In head tissues, subtractive hybridization identified takeout (Dauwalder et al., 2002), and serial analysis of gene expression (SAGE) uncovered 46 sex-differentially expressed genes (Fujii and Amrein, 2002). From the foreleg, two genes implicated in male-specific chemosensory function (CheA29a and CheB42a) were isolated by subtractive cloning (Xu et al., 2002). Sex-differential gene expression in adults has also been studied using microarray technology (Jin et al., 2001; Arbeitman et al., 2002; Parisi et al., 2003; Ranz et al., 2003). In two of these studies (Arbeitman et al., 2002; Parisi et al., 2003), both the somatic and germline components of sex-differential expression were determined, but regulation by the sex-determination hierarchy was not explored. Here, we identify genes that are expressed sex-differentially in somatic tissues of adults and regulated by the sex hierarchy. Using arrays that assay approximately one-third (4040) of Drosophila genes, we analyzed adults mutant for the regulatory genes transformer (tra), dsx and fru (Fig.
1). To select a small number of such genes for further study, we chose a conservative approach. Stringent statistical analysis of these data, combined with data from wild-type adults and adults that lack germline tissue (Arbeitman et al., 2002), identified 63 genes that are sex-differentially expressed in the adult soma and regulated by the somatic sex hierarchy. Additional selection criteria, and validation by RNA blot analysis, defined a set of 11 genes for further characterization. In situ hybridization revealed that sex-differential expression of all 11 genes is confined to the internal genitalia. Analysis of the regulation of these genes revealed that the sex hierarchy functions during development to specify their adult expression patterns, and that dsx probably functions in diverse ways to set their activities. 0 Materials and methods 0 Drosophila stocks Flies were grown using standard conditions at 25°C, unless otherwise indicated. The wild-type stock was Canton S. XX tra, XX DsxD pseudomales, fru males and dsx intersexual mutant animals were wa/w; tra1/Df(3L)st-j7, w/+;DsxD/dsxm+r15 (XX), fru4-40/frup14 (XY), w/+; dsxm+r15/dsxd+r3 (XX), and w;dsxm+r15/dsxd+r3 (XY), respectively. tudor mutants are the progeny of virgin tud1 bw sp females crossed to Canton S males. tra2 temperature-shift experiments used the following genotypes: BsY;tra-2ts1/tra-2ts2(XY) and tra-2ts1/tra-2ts2 (XX).
0 fru males; if it is controlled by fru, its expression level is expected not to differ between tud females and dsxD pseudomales. First, the within-group mean square (MS) was calculated assuming the gene was under dsx control. Three means were calculated: $\bar{x}_{tudF}$, $\bar{x}_{dsxD}$ and $\bar{x}_{M}$. Then the sum of squared deviations of each data point from its respective mean was calculated and divided by the degrees of freedom:
$$MS_{DSX} = \frac{\sum_j (x_{3j} - \bar{x}_{tudF})^2 + \sum_j (x_{4j} - \bar{x}_{dsxD})^2 + \sum_j (x_{1j} - \bar{x}_{M})^2 + \sum_j (x_{2j} - \bar{x}_{M})^2}{\mathrm{df}}$$
The MS, assuming fru control, was calculated in the same way, except that genotypes were expected to have the same expression level, with the three means $\bar{x}_{wtM}$, $\bar{x}_{fruM}$ and $\bar{x}_{F}$:
$$MS_{FRU} = \frac{\sum_j (x_{1j} - \bar{x}_{wtM})^2 + \sum_j (x_{2j} - \bar{x}_{fruM})^2 + \sum_j (x_{3j} - \bar{x}_{F})^2 + \sum_j (x_{4j} - \bar{x}_{F})^2}{\mathrm{df}}$$
The MSs were then compared using an F test with the appropriate degrees of freedom. RNA blot analyses Total RNA was isolated with Trizol (Invitrogen), followed by RNeasy (Qiagen) or poly(A)+ isolation using Poly-ATtract (Promega). Blots were prepared from a Northern Max kit (Ambion). Radiolabeled RNA probes made with Strip-EZ kit (Ambion) were used at approximately 1-7 × 10^6 cpm/ml of hybridization solution. Blots were typic 0 Drosophila melanogaster MNK/Chk2 and p53 Regulate Multiple DNA Repair and Apoptotic Pathways following DNA Damage 1 Michael H. Brodsky,1,2* Brian T. Weinert,2 Garson Tsang,2,3 Yikang S. Rong,4 Nadine M. McGinnis,1 Kent G. Golic,5 Donald C. Rio,2 and Gerald M. Rubin2,3 0 BRODSKY ET AL. 0 MOL. CELL. BIOL. 0 RESEARCH ARTICLE 0 Patterns of Gene Expression During Drosophila Mesoderm Development 1 Eileen E. M. Furlong,1 Erik C. Andersen,1* Brian Null,1 Kevin P. White,2 Matthew P. Scott1 0 The transcription factor Twist initiates Drosophila mesoderm development, resulting in the formation of heart, somatic muscle, and other cell types. Using a Drosophila embryo sorter, we isolated enough homozygous twist mutant embryos to perform DNA microarray experiments. Transcription profiles of twist loss-of-function embryos, embryos with ubiquitous twist expression, and wild-type embryos were compared at different developmental stages. The results implicate hundreds of genes, many with vertebrate homologs, in stage-specific processes in mesoderm development. One such gene, gleeful, related to the vertebrate Gli genes, is essential for somatic muscle development and sufficient to cause neural cells to express a muscle marker. Formation of muscles during embryonic development is a complex process that requires coordinate actions of many genes.
Somatic, visceral, and heart muscle are all derived from mesoderm progenitor cells. The Drosophila twist gene (1), which encodes a bHLH transcription factor, is essential for multiple steps of mesoderm development: invagination of mesoderm precursors during gastrulation (2), segmentation (3), and specification of muscle types (4). The role of twist in mesoderm development has been conserved during evolution (5), perhaps because it controls conserved regulatory mesoderm genes. For example, tinman and dMef2 are regulated by Twist in flies (6, 7) (Fig. 1A) and are highly conserved in sequence and function in vertebrates (8-10). In Drosophila, somatic muscle forms from progenitor cells that divide to become muscle founder cells (11). Founder cells acquire unique identities controlled by transcription factors including Krüppel, S59, vestigial, and apterous. Each of the 30 body wall muscles in an abdominal hemisegment is initiated by a single founder cell and has unique attachments and innervations (12). To further clarify mechanisms underlying founder cell specification, myoblast fusion, and muscle patterning, we have used Drosophila mutants together with microarrays of cDNA clones. 0 dependent embryo collections, embryo sortings, and microarray hybridizations were conducted. The microarrays used for the analysis contained over 8500 cDNAs corresponding to 5081 unique genes plus a variety of controls [see Web fig. 3 for array details (13)]. Each embryonic RNA sample was compared with a reference sample, which contains RNA made from all stages of the Drosophila life cycle and allows direct comparisons among all the experiments. Sample and array variability was determined by calculating correlation coefficients and standard deviations for each gene for all pair-wise combinations of repeated samples. The median correlation coefficient is 0.92, and median standard deviation divided by mean is 0.246 [see Web text for validation information (13)].
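The replicate summary described above can be sketched as follows. This is a minimal illustration under one reading of the text: a Pearson correlation coefficient is computed for every pair of repeated samples, a standard deviation divided by the mean is computed for every gene across the repeats, and the median of each is reported. The function names and the toy values are hypothetical, and the sketch does not reproduce the authors' actual procedure or data.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, median, stdev


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def replicate_summary(replicates):
    """replicates: equal-length lists, one value per gene per repeated sample.

    Returns (median pairwise correlation, median per-gene SD / mean).
    """
    r_values = [pearson(a, b) for a, b in combinations(replicates, 2)]
    cv_values = [stdev(gene) / mean(gene) for gene in zip(*replicates)]
    return median(r_values), median(cv_values)


# Hypothetical toy data: three repeated samples measuring four genes.
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 2.0, 3.0, 4.0],
]
median_r, median_cv = replicate_summary(toy)
```

The two numbers answer different questions. The correlation asks whether repeated samples rank genes the same way, while the standard deviation divided by the mean asks how much a single gene's measurement scatters between repeats. In the toy data the repeats are perfectly correlated, yet each gene still varies because one sample is uniformly twice as bright.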
To determine how transcription was affected by the twist mutation, SAM (significance analysis of microarrays) analysis was used (17). Genes that are normally highly expressed in mesoderm should have lower transcript levels in twist homozygotes. Genes in other tissues whose expression depends on signals from the mesoderm might also have reduced expression. Transcripts of 130 genes, the "Twist-low" group, were significantly lower in twist mutants than in wild type (Fig. 2A). Conversely, cells that would have formed mesoderm may take on other fates in the absence of twist, such as neuroectoderm; therefore, many transcript levels could increase in twist mutants. Genes whose transcription is repressed by signals from the mesoderm would also be enriched in twist mutants. One hundred fifty genes, called the "Twist-high" group, have increased levels of RNA in twist mutant embryos (Fig. 2A). In total, 280 of 5000 genes had significant changes in transcript levels, with 10 false positives (17) [see Web text for validation information (13)]. The genes on the array include 15 previously characterized mesoderm-specific genes, all of which were significantly reduced in twist mutant embryos (Fig. 3A). The arrays also contain genes known to be transcribed in both mesoderm and other cell types. Significant changes in expression were detected for many of these genes (Fig. 3B). The 130 Twist-low genes were divided into three groups (A, B, and C) with similar trends of expression by a self-organizing map (SOM) clustering program (Fig. 1B) (18). The 24 group A genes, which included tinman, dMef2, and bagpipe, had reduced transcript levels in twist mutants at all developmental stages assayed. Most of the Twist-low genes fall into the B and C groups. The 62 group B "early genes" encode transcripts with reduced levels of expression in twist mutants only during stages 9-10, not later. One member of group B, stumps (dof/hbr) is
stumps RNA is abundant in the mesoderm at stages 9-10 and is strongly reduced by stage 11 (Fig. 1B) (19). At stage 11, stumps RNA accumulates in trachea, which are largely unaffected in twist mutants. The 44 group C genes have reduced transcript levels in twist mutant embryos only during late stage 11 and stage 12. These "late genes" include blown fuse, a gene essential for myoblast fusion (20); delilah, a gene required for somatic muscle attachment (21); and genes such as kettin, which is required to form contractile muscle (22). Given the predominantly early expression of twist, the early genes in groups A and B are the best candidates for direct transcription targets of Twist, though some indirectly activated genes may be present within these groups. Group C late genes are likely to be regulated by products of genes that are activated by Twist. In situ hybridizations were done using a previously uncharacterized representative of each Twist-low group (Fig. 1C). In each case, the hybridization pattern was consistent with the predicted time of transcription. A group A gene, CG15015 (GH16741), is transcribed in somatic muscle throughout stages 9-12. A group B gene, CG12177 (GH22706), is transcribed during early mesoderm development, but not later. CG14848 (GH21860), a group C gene, is expressed in the stomodeum but not the mesoderm during stages 9-10. Its mesoderm expression initiates during stage 11, the latest period of the twist experiment. Thus, combining loss-of-function mutant embryo analysis with staged embryo collections provides gene expression information for both tissue specificity and temporal expression. A complementary test: The transcription profile with twist overexpression. The misexpression of twist in the ectoderm is sufficient to convert both neuronal and epidermal tissues to a myogenic cell fate (4). 
RNA from embryos with ubiquitous twist expression was used to evaluate the ability of Twist to initiate mesoderm-like gene expression in cells that would normally form other tissue types. Genes whose transcript levels decrease in twist loss-of-function embryos and increase when twist is ubiquitous are excellent candidates for regulators of mesoderm development or differentiation. To ectopically express twist, a dominant gain-of-function mutation of the maternal gene Toll (Toll10B) was used (23). Activated Toll induces the expression of twist and snail in early embryos and of immune response genes in older embryos (Fig. 1A) (24, 25). Thus, the effects of Toll10B on gene expression reflect the activities 0 Dmp53 protects the Drosophila retina during a developmentally regulated DNA damage response 1 Omar W.Jassim, Jill L.Fink and Ross L.Cagan1 0 Ultraviolet (UV) light is absorbed by cellular proteins and DNA, promoting skin damage, aging and cancer. In this paper, we explore the UV response by cells of the Drosophila retina. We demonstrate that the retina enters a period of heightened UV sensitivity in the young developing pupa, a stage closely associated with its period of normal developmental programmed cell death. Injury to irradiated cells included morphology changes and apoptotic cell death; these defects could be completely accounted for by DNA damage. Cell death, but not morphological changes, was blocked by the caspase inhibitor P35. Utilizing genetic and microarray data, we provide evidence for the central role of Hid expression and for Diap1 protein stability in controlling the UV response. In contrast, we found that Reaper had no effect on UV sensitivity. Surprisingly, Dmp53 is required to protect cells from UV-mediated cell death, an effect attributed to its role in DNA repair. These in vivo results demonstrate that the cellular effects of DNA damage depend on the developmental status of the tissue. 
Keywords: apoptosis/Drosophila/Dmp53/retina/UV 0 UV-damaged DNA can be repaired by a number of mechanisms, including nucleotide excision repair (Friedberg, 2001) and photoreactivation (Carell et al., 2001). In the process of nucleotide excision repair, pyrimidine dimers are excised and replaced with undamaged nucleotides. The disorder xeroderma pigmentosum is linked to at least seven genetic loci that encode factors that participate in nucleotide excision repair (e.g. the nucleases XPF and XPG); patients exhibit hypersensitivity to UV light and a strong predisposition toward skin cancer. An alternate repair mechanism is photoreactivation. Many vertebrates and invertebrates use this system to repair pyrimidine dimers. It includes a light-dependent photolyase repair enzyme that binds to pyrimidine dimers; the dimer is then enzymatically restored to a monomeric form using 350–450 nm light as an energy source. Several lines of evidence suggest, however, that the damage provoked by UV irradiation is mediated by more than its ability to alter DNA. Activation of a number of signaling pathways, including JNK, EGFR and TNF, can occur in a manner independent of either prior nuclear signaling or effects on DNA (e.g. Kulms et al., 1999; Kulms and Schwarz, 2002a). This broad spectrum has led to the suggestion that most receptors that are activated by oligomerization can be affected by UV (Rosette and Karin, 1996). In some cell lines, effects on cellular proteins are thought to represent the principal UV-mediated insult. Once DNA is damaged, the tumor suppressor P53 mediates a cell's response by regulating expression of a number of targets including signal transduction factors, cell cycle regulators, cell repair genes and cell death regulators (Vousden and Lu, 2002). P53 also binds to specific DNA sites and damaged, single-strand DNA (Liu and Kulesz-Martin, 2001).
UV irradiation leads to stabilization of the P53 protein, in part due to its phosphorylation by ERK and P38 kinases (She et al., 2000; Chouinard et al., 2002). The kinases ATR and ATM have also been implicated in signaling, and perhaps even sensing DNA damage, leading to their subsequent targeting of P53 (Lakin et al., 1999; Tibbetts et al., 1999). The result is a dual role for P53: it can direct cell cycle arrest to permit DNA repair or promote cell death when this repair fails. The Drosophila P53 ortholog Dmp53 also acts in the cellular response to DNA damage. Following ionizing radiation, Dmp53 targets expression of the pro-apoptotic effector Reaper (Brodsky et al., 2000; Jin et al., 2000; Ollmann et al., 2000; Sogame et al., 2003). Consistent with this connection, removing Reaper in the larval wing disk results in a reduction of DNA damage-induced programmed cell death (PCD; Peterson et al., 2002). Overexpression of Dmp53 in the retina can lead to extensive cell death (Jin et al., 2000; Ollmann et al., 0 © European Molecular Biology Organization 0 UV irradiation of the Drosophila retina 0 These observations have led to the suggestion that Dmp53 is promoting inappropriate Reaper expression, although genetic tests did not confirm this association (Peterson et al., 2002). Reaper belongs to the family of RHG proteins that includes Hid, Grim, and Sickle; these proteins are critical during embryonic PCD (Grether et al., 1995). The role of Grim and Hid during radiation-mediated apoptosis has not been examined. Each of these family members is active in specific tissues and responds to specific death stimuli. For example, Reaper is active during embryonic segmentation and larval CNS development (Lohmann et al., 2002; Peterson et al., 2002), whereas Hid appears necessary for PCD within the pupal retina (Yu et al., 2002). RHG proteins direct apoptosis at least in part by targeting Diap1 (Drosophila inhibitor of apoptosis protein-1) for degradation. 
Diap1 normally inhibits caspase activity by direct binding, and removal of Diap1 leads to caspase activation and subsequent apoptosis. In Drosophila, regulation of Diap1 stability appears to be the primary step in the regulation of apoptosis (Martin, 2002). Its role in radiation-induced cell death, however, has yet to be explored. In this report, we exploit the developing Drosophila retina as a model system to explore the factors that provoke the UV and DNA damage response within an emerging epithelium. We utilize several advantages offered by the pupal retina as an in vivo model for UV irradiation: it is a simply constructed neuroepithelium, constituent cells are post-mitotic, the tissue is superficial and is therefore accessible (and highly sensitive) to UV irradiation, and the molecular aspects of its development have been studied extensively. We present a number of interesting features and factors associated with the retina's response to UV, and find parallels between this response and the factors that direct normal PCD during its development. 0 UV irradiation leads to retinal defects 0 of 40 000 mJ/cm2 was chosen for the assay as it resulted in a moderate roughening and ablation of the retina; ~10 000 mJ/cm2 resulted in minimal defects and ~100 000 mJ/cm2 resulted in near complete retinal ablation and eventual pupal death. The effect of radiation waned after 25 h APF (Figure 1D; see Supplementary data). By 42 h APF, the stage by which most developmental cell death is complete, the retina no longer responded to moderate UV treatment. We were unable to assess the sensitivity of the retina prior to 18 h APF as at that point the retina has yet to emerge from deeper within the developing pupa. 
The period of significant UV sensitivity (<25 h APF) corresponds to the early stages of cell death in the pupal retina (Cagan and Ready, 1989a; Wolff and Ready, 1991), suggesting that the signals modulating the induction of developmental cell death may regulate UV-induced cell death as well. Some of the phenotypes observed with UV were due to induction of apoptotic cell death: we observed condensed nuclei and fragmentation of DNA as assessed by TUNEL (Figure 1F). In addition, irradiation led to activation of caspases as assessed by antibodies that target the cleaved downstream caspases ca 0 RESEARCH ARTICLE 0 A Gene Expression Map for the Euchromatic Genome of Drosophila melanogaster 1 Viktor Stolc,1,5* Zareen Gauhar,1,2* Christopher Mason,2* Gabor Halasz,7 Marinus F. van Batenburg,7,9 Scott A. Rifkin,2,3 Sujun Hua,2 Tine Herreman,2 Waraporn Tongprasit,6 Paolo Emilio Barbano,2,4 Harmen J. Bussemaker,7,8 Kevin P. White2,3. 0 We used a maskless photolithography method to produce DNA oligonucleotide microarrays with unique probe sequences tiled throughout the genome of Drosophila melanogaster and across predicted splice junctions. RNA expression of protein coding and nonprotein coding sequences was determined for each major stage of the life cycle, including adult males and females. We detected transcriptional activity for 93% of annotated genes and RNA expression for 41% of the probes in intronic and intergenic sequences. Comparison to genome-wide RNA interference data and to gene annotations revealed distinguishable levels of expression for different classes of genes and higher levels of expression for genes with essential cellular functions. Differential splicing was observed in about 40% of predicted genes, and 5440 previously unknown splice forms were detected. Genes within conserved regions of synteny with D. pseudoobscura had highly correlated expression; these regions ranged in length from 10 to 900 kilobase pairs. 
The expressed intergenic and intronic sequences are more likely to be evolutionarily conserved than nonexpressed ones, and about 15% of them appear to be developmentally regulated. Our results provide a draft expression map for the entire nonrepetitive genome, which reveals a much more extensive and diverse set of expressed sequences than was previously predicted. Characterization of the complete expressed set of RNA sequences is central to the functional interpretation of each genome. For almost 3 decades, the analysis of the Drosophila genome has served as an important model for studying the relationship between gene expression and development. In recent years, Drosophila provided the initial demonstration that DNA microarrays could be used to study gene expression during development (1), and subsequent large-scale studies of gene expression in this and other developmental model organisms have given new insights into how 0 of the human genome and for Arabidopsis (11-13). Microarrays have also recently been used to characterize the great diversity of RNA transcripts brought about by differential splicing in human tissues (14). We used both types of approaches to characterize the Drosophila genome. Experimental design. To determine the expressed portion of the Drosophila genome, we designed high-density oligonucleotide microarrays with probes for each predicted exon and probes tiled throughout the predicted intronic and intergenic regions of the genome. We used maskless array synthesizer (MAS) technology (15, 16) to synthesize custom microarrays containing 179,972 unique 36-nucleotide (nt) probes (17). Of these, 61,371 exon probes (EPs) assayed 52,888 exons from 13,197 predicted genes, 87,814 nonexon probes (NEPs) assayed expression from intronic and intergenic regions, and 30,787 splice junction probes (SJPs) assayed potential exon junctions for a test subset of 3955 genes. 
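The array design described here lends itself to a quick consistency check and a small illustration. The sketch below is not the authors' design pipeline: it only confirms that the three reported probe classes sum to the reported total, and shows one plausible way to assemble a 36-nt splice junction probe from 18 nt of each flanking exon. The helper name `junction_probe` and the example sequences are hypothetical.

```python
# Probe counts as reported in the text.
EXON_PROBES = 61_371
NONEXON_PROBES = 87_814
JUNCTION_PROBES = 30_787
TOTAL_PROBES = 179_972

HALF = 18  # nt contributed by each exon to a 36-nt junction probe


def probes_add_up() -> bool:
    """Check that the three probe classes account for every probe on the array."""
    return EXON_PROBES + NONEXON_PROBES + JUNCTION_PROBES == TOTAL_PROBES


def junction_probe(upstream_exon: str, downstream_exon: str, half: int = HALF) -> str:
    """Hypothetical helper: join the last `half` nt of the upstream exon
    to the first `half` nt of the downstream exon."""
    if len(upstream_exon) < half or len(downstream_exon) < half:
        raise ValueError("each exon must contain at least `half` nucleotides")
    return upstream_exon[-half:] + downstream_exon[:half]


if __name__ == "__main__":
    print("probe classes sum to total:", probes_add_up())
    # Made-up exon sequences, for illustration only.
    probe = junction_probe("ACGT" * 10, "TTGCA" * 8)
    print(probe, len(probe))
```

Exons shorter than 18 nt are rejected here for simplicity; how such exons were actually handled is not stated in this extract.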
For the SJPs, we used 36-nt probes spanning each predicted splice junction, with 18 nt corresponding to each exon (14). RNA from six developmental stages during the Drosophila life cycle (early embryos, late embryos, larvae, pupae, and male and female adults) was isolated and reverse-transcribed in the presence of oligodeoxythymidine and random hexamers, and the labeled cDNA was hybridized to these arrays. The stages were chosen to maximize the number of transcripts that would be differentially expressed between samples on the basis of previous results (3, 7). Each sample was hybridized four times, twice with Cy5 labeling and twice with Cy3 labeling (fig. S1). Genomic and chromosomal expression patterns. We determined which exon or nonexon probes correspond to genomic regions that are transcribed at any stage during development (18). We used a negative control probe (NCP) distribution (fig. S3) to score the statistical significance of the EP or NEP signal intensities for each of the 24 unique combinations of stage, dye, and array, correcting for probe sequence bias (17, 19). These results were combined into a single expression-level estimate (19), a threshold for which was determined by requiring a false discovery rate of 5% (20). At this threshold, 47,419 of 61,371 EPs (77%) and 35,985 of 87,814 NEPs (41%) were significantly expressed at some point during the fly life cycle. Significantly expressed EPs correspond to 79% (41,559/52,888) of all exons probed and 93% (12,305/13,197) of all probed gene annotations. Our results confirmed 2426 annotated genes not yet validated through an EST sequence (Fig. 1A). Out of 10,280 genes represented by EST sequences, 0 OCTOBER 2004 0
These results support the conclusion that extensive expression of intergenic and intronic sequences occurs in the major evolutionary lineages of animals (deuterostomes and protostomes) and in plants. We noted that mRNA expression levels for protein-encoding genes varied with the protein function assigned in the Drosophila Gene Ontology (fig. S2) (21). For example, genes encoding G protein receptors were expressed at relatively low levels, whereas genes encoding ribosomal proteins were highly expressed. A gene's expression level was also associated with cellular compartmentalization and the biological process it mediates (fig. S2). For example, genes encoding cytosolic and cytoskeletal factors were more highly expressed than those predicted to function within organelles such as the endoplasmic reticulum, Golgi, and peroxisome. To determine whether a high level of gene expression was associated with essential genetic functions, we examined the expression levels of genes recently shown to be required for cell viability (Fig. 1B) in a genome-wide RNA interference (RNAi) screen in Drosophila (22). Compared to the rest of the genome, the genes identified as essential by RNAi showed a significant increase in expression during all stages of development (P = 0.0009, t test), even when the highly expressed ribosomal protein genes were omitted (P = 0.0005, t test). This result is also consistent with the observation that genes with mutant phenotypes from the 3-Mbase Adh genomic region are overrepresented in EST libraries (23). High levels of essential gene expression may in part reflect widespread expression in cells throughout the animal, and the relative RNA expression level may serve as a rough predictor of essential cellular function. We also examined changes in gene expression during the fly life cycle to determine what fraction of the entire genome is differentially expressed between developmental stages. 
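The detection percentages reported for the expression map can be re-derived from the raw counts given in the text. The short sketch below does only that arithmetic; the dictionary labels are shorthand introduced here, not terms from the paper.

```python
# (detected, total, percentage as reported in the text)
REPORTED = {
    "exon probes (EPs)": (47_419, 61_371, 77),
    "non-exon probes (NEPs)": (35_985, 87_814, 41),
    "exons probed": (41_559, 52_888, 79),
    "gene annotations probed": (12_305, 13_197, 93),
}


def percent(detected: int, total: int) -> int:
    """Percentage of detected items, rounded to the nearest whole number."""
    return round(100 * detected / total)


def check_reported(table=REPORTED) -> dict:
    """Return, for each row, whether the recomputed percentage matches the reported one."""
    return {name: percent(d, t) == p for name, (d, t, p) in table.items()}


if __name__ == "__main__":
    for name, ok in check_reported().items():
        d, t, p = REPORTED[name]
        print(f"{name}: {d}/{t} -> {percent(d, t)}% (reported {p}%, match={ok})")
```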
Figure 2A shows the expression signal intensities of transcripts from a typical 50-kilobase pair (kbp) region of the Drosophila genome during each major developmental stage. Stage-specific variation in expression is observed not only for exon probes, as expected, but also for intergenic and intronic probes. We used analysis of variance (ANOVA) (24) to systematically identify probes as differentially expressed at a false discovery rate of 5% (16). As expected, the majority of probes detecting differentially expressed sequences are also expressed above background noise level (89% of EPs and 81% of NEPs) (17) (Table 1). We found 27,176 EPs to be differentially expressed, corresponding to 76% of annotated genes, and even more when we applied a less conservative background model (fig. S4). The fact that the 0 SHORT REPORT 0 High resolution microarray comparative genomic hybridisation analysis using spotted oligonucleotides 1 B Carvalho, E Ouwerkerk, G A Meijer, B Ylstra 0 Background: Currently, comparative genomic hybridisation array (array CGH) is the method of choice for studying genome wide DNA copy number changes. To date, either amplified representations of bacterial artificial chromosomes (BACs)/phage artificial chromosomes (PACs) or cDNAs have been spotted as probes. The production of BAC/PAC and cDNA arrays is time consuming and expensive. Aim: To evaluate the use of spotted 60 mer oligonucleotides (oligos) for array CGH. Methods: The hybridisation of tumour cell lines with known chromosomal aberrations on to either BAC or oligoarrays that are mapped to the human genome. Results: Oligo CGH was able to detect amplifications with high accuracy and greater spatial resolution than other currently used array CGH platforms. In addition, single copy number changes could be detected with a resolution comparable to conventional CGH. 
Conclusions: Oligos are easy to handle and flexible, because they can be designed for any part of the genome without the need for laborious amplification procedures. The full genome array, containing around 30 000 oligos of all genes in the human genome, will represent a big step forward in the analysis of chromosomal copy number changes. Finally, oligoarray CGH can easily be used for any organism with a fully sequenced genome. 0 Abbreviations: BAC, bacterial artificial chromosome; CGH, comparative genomic hybridisation; CHORI, Children's Hospital Oakland Research Institute; oligo, oligonucleotide; PAC, phage artificial chromosome; PCR, polymerase chain reaction 0 Array comparative genomic hybridisation (array CGH) has been used successfully for the detection of genomic imbalances in human and mouse tumours.1-6 As chromosomal representations, approximately 2500 bacterial artificial chromosome (BAC) and phage artificial chromosome (PAC) clones have been amplified and spotted for genome wide CGH arrays, yielding a resolution of 1-1.5 Mb,7 in addition to cDNAs,8 which encompass a maximum of 13 824 genes and yield an average resolution of 267 kb.9 Although spatial resolution using cDNAs is currently higher, the number of cDNAs is finite and their sensitivity is lower. This reduced sensitivity of cDNAs is partly the result of cross hybridisation. Oligonucleotides (oligos) can theoretically circumvent the problems encountered with cDNAs. In addition, oligo-libraries are cheaper, easier to work with, and faster than cDNAs or BAC/PAC clones, because no DNA isolation or PCR amplification steps are necessary. The in silico design can control for the hybridisation temperature and specificity and there is no limit to the spatial resolution. Finally, oligos can be designed for any organism with a sequenced genome. 0 MATERIALS AND METHODS 0 Short report 0 for each clone. 
On the oligoarrays, each experiment was performed three times and data were taken from one representative experiment. Average and standard deviations of log2 ratios were calculated for each oligonucleotide across the three experiments. A moving average (window of eight by eight) was applied to plot genome wide graphs. 0 RESULTS AND DISCUSSION 0 We hybridised a 19 K human 60 mer oligoarray with breast tumour cell line (BT474) DNA, labelled with Cy3, and normal genomic kidney (female) DNA, labelled with Cy5. Ratios for the non-flagged oligos (35%) were ordered by their position on the chromosome (June 2002 freeze; http://genome.ucsc.edu/). We compared the oligo CGH profile with the BAC array CGH profile (fig 1). Both array profiles showed the same pattern: for example, on the short arm of chromosome 1 neither profile showed a change in DNA copy number, whereas on the q arm two amplified areas are present in both profiles. No aberrations can be seen on chromosome 2. The 0 standard deviation of the log2 ratio of the individual probes was 0.21 for the BAC array and 0.45 for the oligoarray. On chromosome 3, a loss on the short arm was evident on the oligoarray, and was also seen with the BAC array (fig 1). Figure 2 shows two regions of amplification on the q arm of chromosome 17: one narrow peak over the chromosomal region containing c-Erb-B2/neu (Her2)10 and a second amplicon distal to c-Erb-b2. The BAC array has three clones over c-Erb-B2, and the best possible judgment towards the start and end of the amplicon, according to the April 2003 freeze, is therefore 2.5 Mb. With the oligo approach, 38 non-flagged oligos represent amplified ratios in this region and the size of the amplicon is 2.4 Mb according to the April 2003 freeze. Thus, the actual resolution in the region is 63 kb on average. 
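The 63 kb figure follows directly from the amplicon size and the number of informative oligos. As a minimal sketch of that arithmetic (the function name `mean_spacing_kb` is introduced here for illustration and is not from the report):

```python
def mean_spacing_kb(region_mb: float, n_probes: int) -> float:
    """Average distance between informative probes, in kb, for a region given in Mb."""
    if n_probes <= 0:
        raise ValueError("n_probes must be positive")
    return region_mb * 1000 / n_probes


if __name__ == "__main__":
    # 2.4 Mb amplicon covered by 38 non-flagged oligos, as stated in the text.
    spacing = mean_spacing_kb(2.4, 38)
    print(f"average resolution: {spacing:.1f} kb")
```

This treats resolution as simple average probe spacing across the amplicon; the actual spacing between neighbouring oligos may be uneven.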
The log2 ratios for the three replicate BACs containing the c-Erb-b2 gene are 2.91, 3.06, and 2.53, of a similar order of magnitude to those obtained in three independent experiments with the oligoarray for the single oligo corresponding to the c-Erb-b2 gene: 3.4, 3.4, and 3.9. 0 Take home message 0 We describe pilot experiments that serve as a proof of principle that oligonucleotides are a feasible platform for array comparative genomic hybridisation (CGH). Oligoarray CGH can be rapidly, cost effectively, and easily used to measure chromosomal copy number changes for any organism with a fully sequenced genome 0 like to thank Professor D G Albertson, Professor D Pinkel, and laboratory staff (UCSF Comprehensive Cancer Centre) for their support in performing the hybridisation procedures and for the GM0143 DNA sample. BC is holder of fellowship SFRH/BPD/5599/2001 and is working in the frame of the Grant Project POCTI/CBO/41179/2001. This work is furthermore supported by the Dutch Cancer Society (VU 2002-2618). We thank the mapping core and map finishing groups of the Wellcome Trust Sanger Institute for initial BAC clone supply and verification. 0 Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA 1 Michael T. Barrett*, Alicia Scheffer*, Amir Ben-Dor*, Nick Sampas*, Doron Lipson*§, Robert Kincaid*, Peter Tsang*, Bo Curry*, Kristin Baird¶, Paul S. Meltzer¶, Zohar Yakhini*, Laurakay Bruhn*, and Stephen Laderman* 0 Array-based comparative genomic hybridization (CGH) measures copy-number variations at multiple loci simultaneously, providing an important tool for studying cancer and developmental disorders and for developing diagnostic and therapeutic targets. Arrays for CGH based on PCR products representing assemblies of BAC or cDNA clones typically require maintenance, propagation, replication, and verification of large clone sets. 
Furthermore, it is difficult to control the specificity of the hybridization to the complex sequences that are present in each feature of such arrays. To develop a more robust and flexible platform, we created probe-design methods and assay protocols that make oligonucleotide microarrays synthesized in situ by inkjet technology compatible with array-based comparative genomic hybridization applications employing samples of total genomic DNA. Hybridization of a series of cell lines with variable numbers of X chromosomes to arrays designed for CGH measurements gave median ratios for X-chromosome probes within 6% of the theoretical values (0.5 for XY/XX, 1.0 for XX/XX, 1.4 for XXX/XX, 2.1 for XXXX/XX, and 2.6 for XXXXX/XX). Furthermore, these arrays detected and mapped regions of single-copy losses, homozygous deletions, and amplicons of various sizes in different model systems, including diploid cells with a chromosomal breakpoint that has been mapped and sequenced to a precise nucleotide and tumor cell lines with highly variable regions of gains and losses. Our results demonstrate that oligonucleotide arrays designed for CGH provide a robust and precise platform for detecting chromosomal alterations throughout a genome with high sensitivity even when using full-complexity genomic samples. 0 cancer | DNA microarrays | genome 0 dated for expression profiling of 17,000 transcripts (expression array), was used to develop initial assay conditions for aCGH. The second design consisted of custom microarrays containing a higher density of probes that represent unique genomic sequences for selected chromosomes (CGH array). The content of the CGH array was biased toward gene regions, but it also included noncoding regions for chromosome-wide coverage. These arrays were used to explore performance improvements that could be made possible by developing oligonucleotide probe-selection methods specifically for CGH. Materials and Methods 0 Genomic DNA. 
We obtained genomic DNA from normal male 0 Array-based comparative genomic hybridization (aCGH) allows the identification of chromosomal regions of gains and losses in cancers and genetic diseases (1-5). Oligonucleotide-array probes can be designed in silico for any sequenced region of a genome, thus allowing genome-wide and higher-density region-specific coverage, in principle. Application-specific designs, assays, and analysis methods allow routine use of oligonucleotide arrays for gene-expression studies and characterization of DNA polymorphisms and mutations (6-11). Typically, these applications use labeled targets of markedly reduced complexity relative to a complete genome (for example, expressed sequences in transcriptional profiling and PCR amplicons for polymorphic allele analyses). The usefulness of oligonucleotide arrays for aCGH has also been examined by using targets of reduced complexity (12-16). However, the broadest use of aCGH, including both a simplified preparation of targets and hybridization of samples to any array design of interest, requires preserving the greatest possible complexity of targets derived from whole-genome samples. Therefore, we investigated and developed probe-design criteria, assay conditions, and analysis methods that enable 60-mer oligonucleotide arrays to be used for CGH measurements even when using total genomic DNA. We used two array designs for these studies. The first design, consisting of 60-mer oligonucleotide probes designed and vali- 0 46,XY and normal female 46,XX from Promega. The following cell lines are part of the National Institute of General Medical Sciences Human Genetic Cell Repository and were obtained from the Coriell Institute for Medical Research (Camden, NJ): 47,XXX (repository no. GM04626), 48,XXXX (repository no. GM01415D), 49,XXXXX (repository no. GM05009C), and the 18q deletion-syndrome cell line (repository no. GM50122). 
The colon (COLO 320DM, HT 29, and HCT116) and breast (MDA-MB-231 and MDA-MB-453) carcinoma cell lines were obtained from the American Type Culture Collection. Each cell line was grown under the conditions recommended by the supplier. Genomic DNA was prepared from each cell line by using the DNeasy tissue kit (Qiagen, Germantown, MD). Tumor biopsies were collected from 1980 to 2003 and accessed by means of the National Cooperative Human Tissue Network (Charlottesville, VA). Total cellular DNA was isolated from fresh-frozen tumor specimens by using TRIzol reagent (Invitrogen) extraction techniques and further purified by phenol-chloroform extraction. 0 Freely available online through the PNAS open access option. Abbreviations: CGH, comparative genomic hybridization; aCGH, array-based CGH. 0 and A.S. contributed equally to this work. 0 Technologies, the employer of M.T.B., A.S., A.B.-D., N.S., D.L., R.K., P.T., B.C., Z.Y., L.B., and S.L., manufactures DNA microarrays. 0 §Present address: Technion Israel Institute of Technology, Technion City, Haifa 32000, Israel. 0 December 21, 2004 0 Image and Data Analysis. Microarray images were analyzed by using 0 FEATURE EXTRACTION 0 aCGH. For each CGH hybridization, we digested 20 0 in plots of raw data are obscured by even a small percentage of outlier probes. Therefore, we applied a 50-kb moving average, as calculated below, to plots presented in Figs. 4-6. The log2 ratio measured for all m probes of the chromosome was smoothed by using the following weighted moving average: 0 where $y_i$ is the measured log2 ratio at $x_i$. The weights are given by the following triangular function: $$w(x) = \begin{cases} 0 & \text{for } x < -W \\ (W + x)/W & \text{for } -W \le x < 0 \\ (W - x)/W & \text{for } 0 \le x \le W \\ 0 & \text{for } x > W \end{cases} \qquad [2]$$ 0 software (version 6.1.1, Agilent Technologies). Default settings were used, except that probes from autosomal chromosomes were used for dye normalization by using the locally weighted linear-regression curve fit option. 
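The triangular smoothing described above can be illustrated with a short sketch. The exact form of the moving-average equation is not recoverable from this extract, so the code assumes the usual normalized weighted mean (sum of weight times log2 ratio, divided by the sum of weights), with weights that fall linearly from 1 at the probe itself to 0 at a distance W. The function names, the example positions and the choice W = 25 kb are illustrative assumptions, not values confirmed by the paper.

```python
def triangular_weight(x: float, W: float) -> float:
    """Triangular weight: 1 at x = 0, falling linearly to 0 at |x| >= W."""
    if abs(x) >= W:
        return 0.0
    return 1.0 - abs(x) / W


def smooth_log2_ratios(positions, log2_ratios, W):
    """Weighted moving average of log2 ratios along one chromosome.

    Assumed form (not taken verbatim from the paper):
        smoothed_i = sum_j w(x_j - x_i) * y_j / sum_j w(x_j - x_i)
    """
    if W <= 0:
        raise ValueError("W must be positive")
    smoothed = []
    for xi in positions:
        num = 0.0
        den = 0.0
        for xj, yj in zip(positions, log2_ratios):
            w = triangular_weight(xj - xi, W)
            num += w * yj
            den += w
        # den >= 1 because the probe itself always has weight 1
        smoothed.append(num / den)
    return smoothed


if __name__ == "__main__":
    # Made-up probe positions (bp) and log2 ratios with a single outlier probe.
    xs = [0, 10_000, 20_000, 30_000, 40_000]
    ys = [0.0, 0.0, 1.0, 0.0, 0.0]
    print(smooth_log2_ratios(xs, ys, W=25_000))
```

The double loop is quadratic in the number of probes, which is fine for a sketch; a production version would restrict the inner loop to probes within W of the current position.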
Also, we used signals from negative control featu 0 Requirement of Circadian Genes for Cocaine Sensitization in Drosophila 1 Rozi Andretic, Sarah Chaney, Jay Hirsh* 0 The circadian clock consists of a feedback loop in which clock genes are rhythmically expressed, giving rise to cycling levels of RNA and proteins. Four of the five circadian genes identified to date influence responsiveness to freebase cocaine in the fruit fly, Drosophila melanogaster. Sensitization to repeated cocaine exposures, a phenomenon also seen in humans and animal models and associated with enhanced drug craving, is eliminated in flies mutant for period, clock, cycle, and doubletime, but not in flies lacking the gene timeless. Flies that do not sensitize owing to lack of these genes do not show the induction of tyrosine decarboxylase normally seen after cocaine exposure. These findings indicate unexpected roles for these genes in regulating cocaine sensitization and indicate that they function as regulators of tyrosine decarboxylase. In response to exposure to volatilized freebase cocaine, Drosophila perform a set of reflexive behaviors similar to those observed in vertebrate animals, including grooming, proboscis extension, and unusual circling locomotor behaviors (1-3). Additionally, flies can show sensitization after even a single exposure to cocaine provided that the doses are separated by an interval of 6 to 24 hours (1). Sensitization, a process in which repeated exposure to low doses of a drug leads to increased severity of responses, has been linked to the addictive process in humans (4-6) and is potentially involved in the enhanced craving and psychoses that occur after repeated psychostimulant administration. We have shown circadian variation in the agonist responsiveness of Drosophila nerve cord dopamine receptors functionally coupled to locomotor output (7). 
This variation is dependent on the normal functioning of the Drosophila period (per) gene, the founding member of the circadian gene family (8, 9). Because changes in postsynaptic dopamine receptor responsiveness are also seen during cocaine sensitization in vertebrates (10-12), we examined flies mutant in circadian functions for alterations in responsiveness to cocaine. Wild-type (WT) flies or flies containing a per null mutation, per o, were exposed to 75 µg 0 of cocaine four times over 2 days, and the fraction of flies showing severe responses was quantified after each exposure (Fig. 1A). Whereas WT flies showed sensitization after 0 the initial cocaine exposure, per o flies showed no sensitization to either a normal or an increased dose even after repeated exposures. As with WT flies, per o flies showed a dose-dependent increase in the severity of responses, and the normal cocaine-induced types of behaviors were observed (13). per alleles that either shorten or lengthen the circadian periods show distinct patterns of cocaine responsiveness. The short-period mutants per S and perT (14, 15) both showed increased responsiveness to the initial cocaine exposure and weak sensitization to a second 75-µg exposure (Fig. 2A), with only the sensitization of per S showing statistical significance. Sensitization is not observed in these lines when tested with other cocaine doses (16). The long-period mutant per L1 (17) showed a normal initial cocaine response but no sensitization to a subsequent exposure. Similarly, other circadian genes showed effects on cocaine sensitization: Both clock and cycle mutants failed to sensitize when given two doses of cocaine (Fig. 2B). Because these mutants showed an increased sensitivity to the 0 first exposure (16), cocaine doses were decreased to 50 µg. The inability of clock and cycle to sensitize is markedly similar to the behavior of per o mutants. 
The gene product of timeless (tim), TIM, is required for nuclear translocation of PER and its stability in the cytoplasm; in timo mutants, cytoplasmic PER is degraded and per mRNA levels are constant (18-20). Cocaine responses in timo mutant flies were normal (Fig. 2B), both in initial responsiveness and in showing a robust sensitized response to the second exposure. Recently, a doubletime (dbt) protein with homology to human casein kinase I was identified and shown to be required for phosphorylation of PER (21). We tested cocaine responses in two viable dbt mutants, dbt S and dbt L, which shorten and lengthen the circadian locomotor period, respectively (22). dbt mutants required a substantially higher cocaine dose to show behaviors normally observed at 75 µg (Fig. 2B), but even at these higher doses dbt flies did not show significant sensitization. If the role of dbt in cocaine responsiveness is analogous to its role in circadian behavior, then PER phosphorylation status may be important in regulating both initial cocaine responsiveness and sensitization. Modulation of dopamine receptor responsiveness is important in both the sensitization to cocaine in vertebrate animals and in the circadian modulation of locomotion in Drosophila (7, 23). We tested whether cocaine-sensitized flies would show an increase in the responsiveness of the nerve cord dopamine D2-like receptors by using a preparation of behaviorally active decapitated flies that allows direct addition of drugs to the nerve cord (24). After decapitation, cocaine-sensitized WT flies locomoted significantly more than sham-treated controls in response to the dopamine D2-like agonist quinpirole (Fig. 1B). However, there was no increase in quinpirole responsiveness of per o flies that did not sensitize to repeated cocaine exposures. Thus, similar to the inability of the per o mutant to modulate receptor responsiveness as a function of the time of day (7), per o is unable to modulate dopamine receptor responsiveness after cocaine exposure. The observation that cocaine sensitization is associated with increased responsiveness of postsynaptic dopamine receptors shows additional similarities between this system and that in higher vertebrates, where a similar relation holds (12, 23). In Drosophila, sensitization requires the trace amine tyramine because the mutant inactive, which is defective in sensitization, shows both reduced tyramine and reduced levels of the enzyme involved in t 0 Fig. 2. Circadian mutants show altered cocaine responses. (A) per mutants. Flies carrying per mutations, as indicated, were exposed twice to 75 µg of volatilized cocaine 6 hours apart. The number of flies assayed, for first and second exposures, is as follows: WT CantonS, n = 105, 95; per o, n = 81, 60; perS, n = 114, 112; perT, n = 88, 52; per L1, n = 86, 83. (B) Other circadian mutants. As in (A), except that cocaine doses were adjusted to compensate for differences in cocaine responsiveness to the initial dose: WT CantonS exposed to 75 µg of cocaine, n = 105, 95; timo, n = 66, 63. Circadian mutants exposed to 50 µg of cocaine: clock, n = 187, 182; and cycle, n = 79, 79. dbt mutants were exposed to 100 µg of cocaine: dbt S, n = 59, 55; dbt L, n = 52, 51. In both (A) and (B), significant differences in responses to the first versus second exposures are indicated (*P < 0.05, **P < 0.01; χ2 test).
0 Genome-wide Transcriptional Orchestration of Circadian Rhythms in Drosophila* 1 Hiroki R. Ueda§¶, Akira Matsumoto¶**, Miho Kawamura§, Masamitsu Iino, Teiichi Tanimura**, and Seiichi Hashimoto§ 0 Circadian rhythms govern the behavior, physiology, and metabolism of living organisms. Recent studies have revealed the role of several genes in the clock mechanism both in Drosophila and in mammals. 
To study how gene expression is globally regulated by the clock mechanism, we used a high-density oligonucleotide probe array (GeneChip) to profile gene expression patterns in Drosophila under light-dark and constant dark conditions. We found 712 genes showing a daily fluctuation in mRNA levels under light-dark conditions, and among these the expression of 115 genes was still cycling in constant darkness, i.e. under free-running conditions. Unexpectedly, the expression of a large number of genes cycled exclusively under constant darkness. We found that cycling in most of these genes was lost in the arrhythmic Clock (Clk) mutant under light-dark conditions. Expression of periodically regulated genes is coordinated locally on chromosomes where small clusters of genes are regulated jointly. Our findings reveal that many genes involved in diverse functions are under circadian control and reveal the complexity of circadian gene expression in Drosophila. 0 cells (4, 5). Since information about all the possible transcription units is available in Drosophila (6, 7), we can extensively analyze the data for all the genes relating to their function. Functions of identified genes can be analyzed using various genetic tools and databases (9-11) available in Drosophila. 0 EXPERIMENTAL PROCEDURES 0 The use of Drosophila has been at the forefront of studies of the molecular and genetic basis of circadian rhythms (1). A number of clock genes have been identified in Drosophila, and interlocked per-tim and Clk feedback loops are now thought to underlie the central molecular machinery of circadian rhythms (2, 3). However, we still do not know how expression of the whole genome is orchestrated by the circadian mechanism nor have we identified all the genes involved. One comprehensive way to find all the rhythmically expressed genes is to use microarrays.
A number of genes regulated in a circadian manner have been identified in Arabidopsis and mammalian cultured 0 Genome-wide Orchestration of Circadian Rhythms 0 Microarray Analysis and Organization of Circadian Gene Expression in Drosophila 0 Summary We have used high-density oligonucleotide arrays to study global circadian gene expression in Drosophila melanogaster. Coupled with an analysis of clock mutant (Clk) flies, a cell line designed to identify direct targets of the CLOCK (CLK) transcription factor and differential display, we uncovered several striking features of circadian gene networks. These include the identification of 134 cycling genes, which contribute to a wide range of diverse processes. Many of these clock or clock-regulated genes are located in gene clusters, which appear subject to transcriptional coregulation. All oscillating gene expression is under clk control, indicating that Drosophila has no clk-independent circadian systems. An even larger number of genes is affected in Clk flies, suggesting that clk affects other genetic networks. As we identified a small number of direct target genes, the data suggest that most of the circadian gene network is indirectly regulated by clk. Introduction 0 Cycling Circadian Genes To isolate mRNA for analysis, we entrained wild-type Canton-S flies for 3 days in a standard 12:12 hr light dark (LD) cycle and then collected flies every 4 hr during the first full day in constant darkness (DD). This strategy was chosen to avoid light-regulated genes not under circadian control as well as the damping (e.g., a decreased cycling amplitude of circadian gene expression) that occurs during extended incubation in constant darkness (see Discussion). Fly head mRNA was harvested from the six time points, biotinylated cRNA prepared and Affymetrix Drosophila GeneChips used to probe the labeled cRNA. The final data set includes replicas of 4 chips for CT0, CT4, CT8 and CT12, 5 chips for CT16, and 3 chips for CT20. 
The GeneChip data were analyzed using a model-based expression approach with dCHIP software (Li and Hung Wong, 2001a, 2001b; for complete dataset, see Supplemental Table S3). To identify a set of circadian genes with confidence, we put the data through four sequential analyses. First, signals were averaged over the 6 time points, and those that did not have an average signal intensity greater than 20 were excluded. This step removed genes with very weak or dubious expression levels ( 40% of the transcripts). Second, we required the difference between the highest 0 Cell 568 0 Circadian Genes, Microarrays, and Drosophila 569 0 Table 1. Top 10 Highest Fold Cycling Genes Flybase ID ldlr CG11854 CG13856 per vri tim1 CG5798 CG2069 clk CG5156 Function scavenger receptor ligand binding or carrier unknown PAS domain clock protein par domain clock protein clock protein ubiquitin thiolesterase unknown bHLH PAS clock protein unknown Fold Cycling 40.8 5.7 5.6 5.3 4.8 4.6 0 Global Survey of Chromatin Accessibility Using DNA Microarrays 0 Program in Molecular Biophysics, Division of Cell and Molecular Biology, Southwestern Graduate School of Biomedical Science, Department of Molecular Biology, 3Hamon Center for Therapeutic Oncology Research, 4Center for Biomedical Inventions, 5 Department of Internal Medicine, 6Eugene McDermott Center for Human Growth and Development, and 7Department of Pharmacology, UT Southwestern Medical Center, Dallas, Texas 75390, USA; 8Department of Experimental and Clinical Radiobiology, Center of Oncology, Gliwice, 44-100, Poland 0 In recent years, the study of transcriptional regulation by epigenetic mechanisms has enjoyed a renaissance because of advances in DNA microarray technology. These developments include the creation of high-throughput CpG methylation resequencing microarrays (Hatada et al. 2002) and advances in using DNA microarrays to probe Chromatin Immuno-Precipitation (ChIP) assays (Ren et al. 2000) on a genomic scale. 
Even with all these advances, perhaps one of the most important epigenetic regulation systems, chromatin architecture, has been overlooked. By mediating the availability of specific DNA sequences to regulatory proteins, chromatin accessibility in the form of chromatin condensation or relaxation is thought to be a major regulator of transcription (Orphanides and Reinberg 2002). Current methods of studying chromatin architecture either measure the accessibility of the genome as a whole (Banerjee and Hulten 1994) or of a few sub-kilobase regions (Reid et al. 2000), but no technique is currently available to easily and simultaneously measure the chromatin accessibility of the whole genome at kilobase resolution (Urnov 2003; Crawford et al. 2004). In this paper, we describe a new method for using DNA microarrays to study the global chromatin accessibility state as a measure of nuclease accessibility in relation to expression at the resolution of single genes. The primary method we chose for isolating DNA by its chromatin accessibility state takes advantage of the solubility differences of histone H1-depleted mononucleosomes and histone H1-containing mono- and oligonucleosomes in the presence or absence of MgCl2 and KCl to recover different chromatin fractions based on their activity states. This method's utility was demonstrated by Rose and Garrard (1984) to study the chromatin packing of immunoglobulin light chain genes in relation to their transcription during B-cell development. A second method was optimized to use the preferential sensitivity of transcriptionally active chromatin to DNase I cleavage (Weintraub and Groudine 1976) to recover the relatively resistant regions as the "condensed" fraction using fragment length selection. Both of these methods are currently used in high-resolution, low-throughput chromatin accessibility studies.
To make these techniques both high resolution and high throughput, we optimized microarray-based comparative genomic hybridization (CGH) methods using commercially available probe sets or microarrays to probe the chromatin accessibility state en masse (Pollack et al. 1999; Weil et al. 2002). This "Chromatin Array" allows us to overcome the limited resolution and throughput problems of previous methods (Banerjee and Hulten 1994; Reid et al. 2000) by using the multiplex nature of microarray experiments while retaining the high resolution of low-throughput chromatin accessibility measurement techniques. Because this new type of microarray experiment has a novel output, we developed methods to interpret the chromatin state from the relationship of the condensed fraction's hybridization intensity as compared with the intensity of total genomic DNA. These data can then be related to the absolute RNA expression level measured on an identical microarray. To demonstrate the utility of the Chromatin Array method, we chose the cell line MCF7 because it has been extensively studied by other groups (Pollack et al. 1999; Ross et al. 2000). We show that the chromatin solubility assay recovered fractions based on the condensation state of the chromatin, and that the microarray-based measurements could accurately measure the accessibility. 0 Genome Research 0 The reproducibility of the condensation state measurements was independently verified using two different methods to extract the condensed chromatin for microarray-based measurements. To support the data analysis and interpretation, we used the Stanford Microarray Database (SMD) to validate our expression findings (Sherlock et al. 2001). Although the condensation state and expression measurement of a single gene may be of great value in transcriptional discovery, the biological relevance of the data on a global scale is possibly even more valuable.
By relating function as defined by the Gene Ontology (GO) database (Ashburner et al. 2000) to the condensation state of large groups of genes, specific accessibility signatures of functionally related genes can be identified. These signatures are based on the different functional gene groupings of a particular accessibility state, and the differences in functional group assignments observed across the different accessibility states (Jimenez-Sanchez et al. 2001). These signatures can then be used to uniquely define a cell line. By comparing the signatures of multiple cell lines, it should be possible to identify the disease- and tissue-specific components of the signatures. Analysis of the accessibility data in light of both the condensation state of single genes as well as its global relationship to gene function makes the development of the Chromatin Array method a novel and important addition to study chromatin structure-function relationships. 0 RESULTS AND DISCUSSION 0 The Chromatin Array Accurately Measures the Accessibility State of the DNA Recovered by the Chromatin Solubility Assay 0 The chromatin solubility assay first uses micrococcal nuclease to generate mono- and oligonucleosomes that are separated into three fractions designated S1, S2, and P. The transcriptionally active DNA is found in the S1 and P fractions, which in MCF7 comprise 68% of the total DNA. The S1 fraction is depleted in histone H1 and enriched in the high mobility group (HMG) proteins and heterogeneous ribonucleoproteins particles (HnRNPs), both of which are known to be associated with actively transcribed chromatin (Huang et al. 1986). Likewise, the P fraction is highly enriched in nonhistone proteins, and with further digestion, it can be partially converted to the S1 fraction (Rose and Garrard 1984; Huang et al. 1986). 
The S2 fraction represents 32% of the total DNA and contains nucleosomes stoichiometrically associated with histone H1 and highly deficient in nonhistone proteins (Rose and Garrard 1984). This S2 fraction operationally represents the most condensed chromatin fraction as indicated by previous studies that have demonstrated that his- 0 The number of genes (19,437 possible) in each group that pass all data possessing filters is shown. Reproducibility refers to the percentage of genes that yield similar results in an independent replicate experiment of chromatin solubility fractionation on a different array platform. Concordance refers to the percentage of genes between the merged fragment length selection data and the chromatin solubility da 0 Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment 1 Evangelia E Ntzani, John P A Ioannidis 0 DNA microarray analysis is a highly promising technique with broad applications. Simultaneous characterisation of the expression pattern of thousands of genes could allow better understanding of the molecular properties of healthy and diseased tissue.1,2 Such information might lead to more accurate diagnosis and individual prediction of clinical outcomes.3 Oncology has been one of the most promising specialties for this technique to date.4 By use of DNA microarrays, investigators have tried to predict the overall clinicopathological behaviour of diverse malignant disorders. Although this information could revolutionise cancer prognosis and therapy, there is a need for close scrutiny of the clinical performance of the new method. We undertook a systematic assessment of molecular profiling studies that used DNA microarray analysis to generate predictive models for clinical cancer outcomes. We also recorded studies that addressed the relation of molecular subtypes with other clinicopathological features of malignant diseases. 
We investigated the strength of the current evidence for the predictive performance of DNA microarray analyses in oncology, whether this predictive information is independent of known traditional predictors of cancer outcomes, and whether there are features that influence the chances that a DNA microarray study will find significant associations with clinical outcomes and correlates thereof. 0 Study eligibility and search strategy We selected original studies in which: cDNA or oligonucleotide microarray analyses were done for functional gene expression of at least 500 genes; samples from at least ten patients with cancer were included; and an attempt was made to classify cancers into subtypes for prospective prediction of a major clinical outcome or to assess correlations with any other clinicopathological variables. Major clinical outcomes were death, metastasis, recurrence, or clinical response to therapy. Studies were included whether or not they succeeded in subtyping. We excluded studies that focused on structural gene alterations and those that used only pooled samples, cancer cell lines, or xenografts. When various samples were used, we focused on individual patients' samples. We also excluded studies that contrasted normal (or premalignant) and malignant tissue samples without subtyping tumour samples; studies of differential gene expression among cancer tissue samples from different organs; studies aiming to separate known distinct entities (eg, myeloid vs lymphocytic leukaemia); and studies focusing a priori on a specific gene. We used the cut-off of 500 genes to exclude studies more focused on identifying the role of a limited number of preselected genes. Some early microarrays used slightly over 500 probes. We searched MEDLINE limited to human studies and using the terms "microarr*", "gene expression profiling", 0 For personal use. Only reproduce with permission from The Lancet publishing Group. 
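As a rough illustration only, the eligibility rules described above can be written as a simple screening predicate. The `Study` record and its field names are hypothetical and were introduced for this sketch; only the two numeric cut-offs (at least 500 genes, at least ten patients with cancer) and the listed exclusions come from the text, and the real screening was of course done by reading the reports.

```python
from dataclasses import dataclass

MIN_GENES = 500      # cut-off stated in the eligibility criteria
MIN_PATIENTS = 10    # cut-off stated in the eligibility criteria


@dataclass
class Study:
    """Hypothetical summary of one candidate report (field names are assumptions)."""
    n_genes: int                      # genes assayed for expression on the array
    n_cancer_patients: int            # individual patients with cancer sampled
    attempts_subtyping: bool          # tried to classify cancers into subtypes
    structural_alterations_only: bool = False   # focus on structural gene changes
    pooled_or_cell_line_only: bool = False      # only pooled samples, cell lines, xenografts
    single_gene_focus: bool = False             # a priori focus on one specific gene


def is_eligible(study: Study) -> bool:
    """Return True if the study passes the inclusion and exclusion rules sketched here."""
    excluded = (
        study.structural_alterations_only
        or study.pooled_or_cell_line_only
        or study.single_gene_focus
    )
    if excluded:
        return False
    return (
        study.n_genes >= MIN_GENES
        and study.n_cancer_patients >= MIN_PATIENTS
        and study.attempts_subtyping
    )
```

Note that the predicate does not ask whether subtyping succeeded, mirroring the statement that studies were included whether or not they succeeded in subtyping.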
0 We plotted on receiver operating characteristic (ROC) spaces the sensitivity and specificity of molecular subtypes for major clinical outcomes. Sensitivity and specificity estimates were calculated in a standard way from information presented in the reports and supplementary files of eligible studies. The major outcome definitions followed the main definition of the primary investigators. Whenever there were more than two resulting subtypes, the subtype with worse prognosis was compared against all others combined. Continuous predictive scores were split into two groups, as done by the primary investigators. Separate plots were drawn for independent validations, cross-validations, and unsupervised classifications. When different crossvalidations (complete and incomplete) were reported, we captured the predictive accuracy of all of them and discussed any d 0 Original article 0 Monitoring gene expression profile changes in bladder transitional cell carcinoma using cDNA microarray 1 Sun Ying-Hao, M.D.a,*, Yang Qing, M.D.a, Wang Lin-Hui, M.D.a, Gao Li, M.D.b, Tang Rong, M.D.c, Ying Kang, M.D.c, Xu Chuan-Liang, M.D.a, Qian Song-Xi, M.D.a, Li Yao, M.D.c, Xie Yi, M.D.c, Mao Yu-Ming, M.D.c 0 Keywords: Bladder neoplasms; Carcinoma; cDNA microarray; Gene 0 Introduction Cancers have been defined as a group of cells exhibiting an unrestrained proliferation phenotype. The development and progression of cancer result from complex changes in patterns of gene expression in the cell, which are accompanied by different histological or clinical classification of the abnormal cells' growth. It's very important to screen out these special genes from the human genome. Conventional methods such as northern or southern blot fail to achieve its expedient effect, but the advanced technique of cDNA microarrays works. 
It allows monitoring simultaneously the expression level of thousands of both selected known genes and cDNAs representing uncharacterized genes in one hybridization experiment. By employing this technique, detec- 0 Chipping away at brain function: mining for insights with microarrays 1 Gilbert L Henry, Karen Zito and Josh DubnauA 0 The impact of microarray studies on neurobiology has been limited because, with the exception of a few outstanding papers, most reports provide little more than lists of genes, often leaving the reader at a loss to understand which and how many of the identified transcripts will be true positives with significant biological impact. However, some recent papers have offered considerable biological insight by providing independent in vivo confirmation of the roles of candidate genes, offering a glimpse of the potential power of microarrays in neurobiological research. 0 to `genes with metabolic function'; in all cases, `genes of unknown function' dominate the pack. Second, the unavoidably high level of false positives inherent in the massively parallel quantification of small-magnitude effects has necessitated the use of careful, low-throughput follow-up assays to validate high-throughput array experiments. Despite these caveats, it is evident from several recent studies that genome-wide expression approaches, when validated with in vivo follow-up experiments, can yield significant insights. Our objective with this review is not to dwell upon the technical aspects of gene expression profiling in the brain, as experimental design and analysis methods, and the pitfalls associated with these, have been reviewed extensively elsewhere (e.g. in [1,2]), but instead to concentrate on the insights this technology has offered us as neurobiologists. With this in mind, we have chosen to discuss a subset of the most recent papers that we feel increase our understanding of brain function. 
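To make the false-positive caveat above concrete, a minimal sketch of the arithmetic follows. It assumes independent tests in which no transcript truly changes, and the array size and significance threshold in the example are illustrative assumptions, not values taken from any of the studies discussed here. The Bonferroni threshold is shown only as one familiar, conservative correction.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of chance 'hits' when every one of n_tests is truly null."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha / n_tests


if __name__ == "__main__":
    n_transcripts = 10_000   # assumed array size, for illustration only
    alpha = 0.05             # assumed per-transcript significance threshold
    print("expected chance hits:", expected_false_positives(n_transcripts, alpha))
    print("Bonferroni per-test threshold:", bonferroni_threshold(n_transcripts, alpha))
```

Under these assumed numbers, on the order of 500 transcripts would be expected to pass by chance alone, which is consistent with the emphasis placed here on low-throughput follow-up validation.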
0 Chips and brain development 0 One major effort of neurobiological research is the study of brain development. At the cellular level, there are questions concerning the genetic programs responsible for specification of neural cell fates and the differentiation of the myriad neuronal and glial types (the mammalian retina alone contains approximately 55 separate neuronal types [3]). At the circuit level, the processes of axon guidance, target selection, synapse formation and refinement of synaptic connections each rely on a combination of intrinsic gene-expression patterns and environmental influences. In addition, at the systems level, there is a drive to fully map the spatial and temporal expression patterns of each gene. The utilization of microarrays to probe gene expression patterns at each of these levels has resulted in the identification of a considerable number of candidate genes, of which a few have been confirmed with in vivo studies. 0 Cellular-level analyses 0 Abbreviations CREB cyclic AMP response-element binding protein EAE experimental autoimmune encephalomyelitis FACS fluorescence-activated cell sorting FGF18 fibroblast growth factor 18 FMR1 fragile X mental retardation gene FMRP fragile X mental retardation protein FraX fragile X syndrome GC granule cell G-CSF granulocyte colony stimulating factor GFP green fluorescent protein HD Huntington's disease htt huntingtin MS multiple sclerosis OPC olfactory progenitor cell PolyQ polyglutamine SCN suprachiasmatic nucleus 0 In the past few years, the use of genome-wide expression profiling in neurobiology has exploded. Although these studies have in a short period produced an impressive list of candidate genes, two issues have limited the scope of the ensuing biological insights. First, long lists of genes do not, on their own, further our understanding of the biology. 
In virtually all cases, most functional categories of genes are identified, ranging from `transcription factors' to `translation factors', from `signaling molecules' to `cell-cycle control genes' and from `cytoskeletal proteins' 0 In the past few years, several groups have used microarrays to probe for gene expression patterns that confer upon neural stem cells their unique ability both to self-renew and to differentiate into multiple cell types (e.g. [4-7]). In each case, >200 genes were identified, including many known markers of stem cells. It is worth noting, however, that a third-party comparison of the `stem-cell enriched' transcripts identified in two of these studies revealed an overlap of only 15 genes [8]. This small overlap is likely to be due to discrepancies in the manner in which the stem cell populations were isolated and to their lack of purity. The neural stem cells for these experiments were obtained from neurospheres, colonies of cultured stem cells from regions of the mammalian ventricular and subventricular zones. Neurospheres are known to be heterogeneous, containing only 3-4% true stem cells that give rise to all three neural lineages [9]. This heterogeneity creates a signal-to-noise problem for the detection of gene expression in a given cell type. One way to alleviate problems of tissue heterogeneity is through the analysis of single cells. Technologies for single-cell mRNA analysis have been under development for just over a decade [10], and recently the first neurobiological reports have emerged on the use of single cells in combination with microarrays (e.g. [11-13]). Tietjen et al. compared single neuronal progenitor cells (OPCs) from the olfactory bulb to mature olfactory sensory neurons [12].
The authors identified 197 genes enriched in OPCs, some of which were confirmed by in situ hybridizations to be expressed in proliferative regions of the olfactory epithelium. Evaluation of the overall success of these and the earlier experiments awaits a detailed examination of the expression patterns of the identified genes to determine their utility as markers of stem cells. The discovery of stem-cell marker genes should facilitate the identification and selection of stem-cell populations for functional studies as well as for therapeutic purposes. A second example of the advantages afforded by cell purification comes from a study of neuronal differentiation in Caenorhabditis elegans. Zhang et al. examined downstream targets of mec-3, a transcription factor required for the development and function of the touch-receptor neuron [14]. The authors used fluorescence-activated cell sorting (FACS) to isolate populations of GFP-expressing touch-receptor neurons from wild-type and mec-3 mutant animals. They identified 71 mec-3-dependent candidate genes, including seven of the nine known mec-3-dependent genes, two genes known to be expressed in touch receptors, and mec-17, a gene previously identified in an independent screen and required for the maintenance of touch-receptor differentiation. Seventeen of the newly identified and eight of the nine known mec-3-dependent genes contained in their promoter regions an over-represented heptanucleotide indirectly implicated in mec-3-dependent transcription, making them potential direct targets for mec-3 regulation. Thus, microarrays can facilitate identification of the genes responsible for differentiation of a particular neuronal subtype. Access to a homogeneous population of that neuronal type greatly facilitates this type of analysis. 0 Circuit-level analyses 0 attempted to address this issue in the developing pontocerebellar projection system.
First, the authors used a powerful combination of approaches to characterize gene expression in cerebellar granule cells (GCs). By examining developmental gene expression in the cerebellum (of which GCs make up a major component), in cultured GCs, in acutely isolated GCs, and in two strains of mutant mice that lack GCs, the authors identified genes that could play a role in GC differentiation. They then looked at gene expression changes in the pontine nucleus, which contains the presynaptic cells that project to and synapse on GCs of the cerebellum. With the expectation that, in the absence of their GC targets, pontine cells would not undergo target selection and synapse formation, the authors again used mutant mice that lack GCs, this time to identify candidate genes responsible for axonal outgrowth and synapse formation. Although these experiments reveal some potentially promising candidates, only a small fraction of genes were validated by in situ hybridizations, and thus it remains to be seen whether the newly identified candidate genes in fact have their hypothesized roles in GC differentiation, axon outgrowth and synapse formation. Additional investigations of the type described by Diaz et al. are expected to greatly improve our understanding of the genes involved in the establishment, maintenance and modification of the connectivity patterns of neuronal circuits. 0 Systems-level analyses 0 An ultimate goal of microarray studies in the brain is the assembly of a comprehensive map of gene expression across all neuronal types, brain regions, and developmental stages [1,16]. The resulting expression map should enable neuroscientists to gain a broader understanding of brain function through a systems-level analysis of coordinate gene regulation patterns. Several groups have identified genes with subregion-specific expression patterns, for example in the hippocampal subregions [17] or in the amygdaloid subnuclei [18]. 
However, it is increasingly clear that most individual laboratories do not have the resources to carry out large-scale validation of all candidate genes, which would be required before they could be incorporated into a comprehensive molecular atlas of th 0 Identification of genes involved in Drosophila melanogaster geotaxis, a complex behavioral trait 0 Nature Publishing Group http://genetics.nature.com 1 Daniel P. Toma1, Kevin P. White2, Jerry Hirsch3 & Ralph J. Greenspan1 0 Pioneering experiments on Drosophila melanogaster and Drosophila pseudoobscura investigated the nature of the genetic basis for extreme, selected geotaxic behavior. These experiments constituted the first attempt at the genetic analysis of a behavior. Selection and chromosomal substitution experiments successfully showed that there is a genetic basis for extreme geotaxic response in flies1-5 and, by implication, for behavior in general. These experiments also added to our understanding of the role of variation in phenotypic evolution and selection6-8. Despite their seminal contributions in behavioral genetics, population genetics and the study of selection, by their nature these experiments could not identify specific genes9. These results highlight both the success and the limitation of behavioral selection experiments. Although selection results tend to be representative of the natural interactions of genes that produce behavior10 and can demonstrate that a trait has a genetic basis, they do not pinpoint specific genes that influence the trait. This is partly due to the involvement of many genes and the relatively minor role of each in complex polygenic phenotypes--a problem that is especially acute for the intrinsically more variable phenotypes that are associated with behavior. The advent of cDNA microarray technology offers an easily generalized strategy for detecting gene expression differences and can complement other means of identifying the genes that underlie complex traits11. 
An expression difference may occur in a gene that is not itself polymorphic, but that gene may contribute to the realization of the phenotypic difference. 0 Results 0 Geotaxis behavior for selected lines As a starting point for identifying genes that affect a complex trait, we analyzed the selected, established Hi5 and Lo extreme 0 cDNA microarray and qPCR Initially, we used cDNA microarrays13 that contained about one-third of the predicted genes in the genome to identify roughly 250 genes that showed an approximately twofold or greater expression differential between the Hi5 and Lo lines. We did these experiments in duplicate with different sets of flies and removed the few genes that behaved inconsistently from further analysis. The number of genes that showed consistent differential expression was about 5% of those assayed. Thus, gene expression in these strains has been modified as the result of laboratory selection. The polymorphisms responsible for this differential gene expression probably derive both from variation that was present in the initial selected populations and from spontaneous mutations that occurred during the course of the selection experiments. Not all of these differentially expressed genes would be
0 Table 1 · Comparison of cDNA microarray and qPCR ratios of mRNAs
Gene         Array (Lo/Hi5)   qPCR (Lo/Hi5)
Experimental group
cry          3.57             5.96
Pdf          1.85             2.02
Pen          0.18             -
pros (l)     3.22             3.71
pros (sl)    3.22             1.57
Control group
cnk          0.92             0.69
Csp          1.03             1.00
for          1.27             1.42
mth          1.11             1.62
nmo          1.01             1.01
per          -                1.74
0 The average coefficient of variance for the qPCR results from each selected line was 19.33% with a range of 17.32-23.08% for Hi5, and 22.86% with a range of 21.96-24.17% for Lo. Because arrays were repeated only twice, no estimate of variance was possible.
We report no Pen qPCR data because, of six primer pairs tested, none amplified efficiently enough to obtain consistent results, although the direction of change for those that gave some amplification was in the predicted direction. pros has two splice variants17, short (s) and long (l), which the array did not resolve. We therefore designed a separate primer pair for each form, but the pair for the short form, designated (sl), amplifies both. 0 (CS), that was different from either of the selected lines. We tested the resultant strains (Table 2) in a geotaxis maze. We placed the mutants on a neutral background to assay for those genes that have the most robust phenotypic effect that is independent of the combination of alleles in the selected lines. We also tested the effects of varying the gene dosage of Pdf and pros. For Pdf, we constructed lines with Pdf01 (henceforth referred to as Pdf-) and the wildtype transgenic insertion Pdf+t3.530 (henceforth referred to as Pdf+t) to titrate its effect on the behavior. Likewise, for pros we used the mutant allele pros17 and the transgenic insertion pros+t30.8 (henceforth referred to as pros+t). The Pen and cry mutants deviated significantly from CS (Table 3 and Fig. 2a). Pdf- flies also deviated significantly from CS. There were also effects on geotaxic behavior in Pdf- flies owing to alterations in gene dosage and sex (genotype × sex interaction, F = 3.85, P < 0.0015; Table 4 and Fig. 2b). The sex-specific effect of varying Pdf gene dosage was graded, with the homozygous Pdf- males showing the same response as Hi5 males. In males, the effect was Hi5 = Pdf-/Pdf- > Pdf+t/+; Pdf-/Pdf- = Pdf+t/Pdf+t; Pdf-/Pdf- > Pdf-/+ = CS = Pdf+t/+ = Pdf+t/Pdf+t > Lo, where nonsignificance is indicated by `=' and significance is indicated by `>' (Table 4 and Fig. 2b).
Thus, although Pdf-/Pdf- males did not differ significantly from Hi5 males, adding one copy of the transgene significantly lowered their score. Adding 0 Molecular Characterization of Clinical Study Schizophrenia Viewed by Microarray Analysis of Gene Expression in Prefrontal Cortex 0 Neuron 54 0 The changes in schizophrenic subjects were assessed by gene expression profiling for 250 gene groups related to metabolic pathways, enzymes, functional pathways, or brain-specific functions. More than 98% of the gene groups, when compared to the expression pattern of all detectable transcripts, were not significantly different (p 0.05) between the schizophrenic and control subjects (Figures 1E-1H), establishing that other changes that we did detect are not due simply to human subject variability. This observation also is in agreement with previous findings that total mRNA levels in schizophrenic subjects are comparable to those in the unaffected human population (Harrison et al., 1997). However, several gene groups exhibited significantly changed expression in schizophrenic subjects, both within individual pairs and across pairs (presynaptic sec 0 A cDNA microarray from the telencephalon of juvenile male and female zebra finches 1 Juli Wade a, , Camilla Peabody a , Paul Coussens b , Robert J. Tempelman b , David F. Clayton c , Lei Liu d , Arthur P. Arnold e , Robert Agate e 0 Abstract Studies over roughly the last decade have emphasized the importance of gene expression in the development of structure and function of the songbird forebrain. However, few tools have been available to efficiently identify the critical factors. To that end, we have produced a normalized cDNA library from juvenile zebra finch telencephalon, and have spotted inserts from 2400 randomly selected cDNA clones on microarrays (1664 unique sequences). We have also added several previously cloned cDNAs of interest, including three representing genes encoded on sex chromosomes. 
Hybridizations comparing Cy3- and Cy5-labeled cDNA from the telencephalon of day 25 male and female zebra finches confirmed sexually dimorphic expression of the Z- and W-linked genes, demonstrating the utility of these microarrays for detecting differential expression and providing information about the relative expression of these genes in the brains of juveniles of this age. © 2004 Elsevier B.V. All rights reserved. 0 Keywords: Songbird; Sexual differentiation; Sexual dimorphism; Song development; Brain development 0 ing song playbacks differs in juvenile males and females (Bailey and Wade, 2003). Sexual differentiation of the neural circuits governing reproductive behaviors is regulated by gonadal steroid hormones in diverse vertebrate groups. However, in the zebra finch, numerous experiments have suggested that gonadal steroids are not critical to the masculinization or feminization of the forebrain regions controlling their courtship song (Arnold, 2002; Balthazart and Adkins-Regan, 2002). Instead, factors intrinsic to the brain are responsible, likely both steroid hormones synthesized within that organ and gene products (proteins) produced in neurons and/or glia (Agate et al., 2003; Holloway and Clayton, 2001). However, relatively little is known about the specific genes involved in sexual differentiation--those that influence or are influenced by steroid hormones, as well as those that independently cause masculine or feminine development. Similarly, although the expression of immediate early genes has been a powerful technique for functionally mapping anatomical structures critical to song perception and perhaps 0 song-related memory formation (Bailey et al., 2002; Mello and Clayton, 1994; Mello et al., 1992; Stripling et al., 2001), cellular activity downstream of fos or zenk activation remains largely unexplored because an efficient means of screening for the transcription of songbird genes has not existed. 
Until very recently only tens of gene products had been cloned from the songbird brain (Clayton, 1997), with a few isolated from the zebra finch telencephalon using differential display RT-PCR (Denisenko-Nehrbass et al., 2000; Veney et al., 2003). To identify the critical gene products more quickly, we developed a microarray of cDNAs from the zebra finch telencephalon useful for the study of gene expression under a variety of developmental conditions. Morphological differentiation of the song circuit(s) occurs until approximately 50 days after hatching. Although some characteristics are sexually dimorphic before post-hatching day 10 (Gahr and Metzdorf, 1999), anatomical differentiation occurs at the greatest rate during about days 20-35 (Bottjer et al., 1985; Kirn and DeVoogd, 1989; Nixdorf-Bergweiler, 1996). Also, under normal conditions, exposure to song during roughly post-hatching days 25-35 influences the ability of both sexes to produce and/or respond to it appropriately in adulthood (Clayton, 1988; Eales, 1985; Immelmann, 1969; Miller, 1979; Nordeen and Nordeen, 1997). Males typically form templates of their fathers' songs during this period, and then integrate these memories with their own attempts at production until they create a song quite similar to their fathers' by about 60 days of age (Nordeen and Nordeen, 1997). Although it takes another 2 weeks or so to reliably take on its permanent, stable form, the majority of song learning is completed by day 60. To focus on genes involved in sexual differentiation and development of song production and perception, cDNA microarrays were produced using normalized libraries generated from the telencephalons of males and females at days 10-60 post-hatching. In addition to testing hypotheses associated with those processes, depending on the design of the experiment, the cDNAs on these arrays can provide information about gene expression associated with changes in neural function under numerous conditions. 
0 Materials and methods 2.1. RNA isolation and library production RNA was isolated from the telencephalon of two males and two females at day 10, two females and one male at day 22, and one individual of each sex at days 30, 45, and 60 using Trizol (Invitrogen Life Technologies). The concentration of each sample was determined, and the purity and integrity of each was checked on 1% agarose gels before proceeding. Separate male and female cDNA libraries were produced a 0 LETTERS SCIENCE & SOCIETY POLICY FORUM BOOKS ET AL. PERSPECTIVES REVIEWS 0 In our report, "Evidence for coherent proton tunneling in a hydrogen bond network" (1), we presented nuclear magnetic resonance relaxometry results for calix(4)arene in the solid state. A peak at 35 MHz in the magnetic field dependence of the proton spin-lattice relaxation rate was interpreted as a manifestation of coherent proton tunneling in a cyclic array of four hydrogen bonds. In the course of further investigations, it has become apparent that the sample supplied to us contained residues of dichloromethane. This brings into question the assignment of the spectral feature because we cannot now rule out the possibility that it derives from quadrupole resonance transitions associated with chlorine nuclei. Thus, we must retract our report. Conclusions regarding the incoherent tunneling of protons in this material are not in question. 0 HIV Among Drug Users in China 0 J. Kaufman and J. Jing provide an excellent overview of the potentially catastrophic epidemic of HIV/AIDS in China in their Policy Forum "China and AIDS--the time to act is now" (28 June, p. 2339). They note that the Chinese epidemic began among injecting drug users (IDUs) and call for education on safer injection and clean needle programs to reduce HIV transmission among IDUs.
HIV among IDUs is clearly a major problem in China: (i) 68.7% of all reported cases of HIV are among IDUs; (ii) HIV infection has spread along drug distribution routes and has occurred among IDUs in all provinces; (iii) extremely rapid HIV transmission has occurred in some populations of IDUs, with incidence rates of over 30% per 1 CHENG FENG1 AND DON DES JARLAIS2 Kingdom HIV Prevention and Care Project, 27 Nanweilu, Beijing 100050, China. 2Baron Edmond de Rothschild Chemical Dependency Institute, Beth Israel Medical Center, First . Avenue at Street, New York, NY 10013, USA. 0 Trying to Make Sense of Disorder 0 Feng and Des Jarlais raise important points, and we fully agree with their opinions. Policies and programs to contain the spread of HIV among IDUs require much 0 In his article "A fresh take on disorder, or disorderly science?" (News Focus, 23 Aug., p. 1268), Adrian Cho reports on a lively controversy presently raging over what is called "Tsallis entropy," which has been wrongly supposed to be the physical entropy of the natural world, superseding the universal and general Clausius-Boltzmann statistical-thermodynamic entropy. The new definition of entropy developed by Constantino Tsallis is a very useful--and sophisticated--tool for generating a so-called nonextensive thermostatistics, which can be used for adjusting and analyzing experimental data in certain partic- 0 NOVEMBER 2002 1 ROBERTO LUZZI, AUREA R. VASCONCELLOS, J. GALVAO RAMOS Instituto de Fisica-Unicamp, 13083-970 Campinas, SP, Brasil. 0 Adrian Cho's article on Tsallis entropy ("A fresh take on disorder, or disorderly science," News Focus, 23 Aug., p. 1268) emphasizes the importance of nonextensive energies when analyzing complex systems. To complement his picture, I would like to draw attention to an alternative way of treating nonextensive energies, developed by Terrell Hill about 40 years ago (1-3).
Hill's approach is based on the fundamental foundation of Gibbs' ensembles and does not involve modifying the definition of entropy. To my knowledge, Hill's work remains the only comprehensive treat 0 Mfold web server for nucleic acid folding and hybridization prediction 1 Michael Zuker* 0 Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA 0
gi|26014111|ref|NW_044277.1|RnUn_1636 Rattus norvegicus WGS supercontig
ATGTTCAATTTTATCTAATCCCTGTTACTCTGGAAAACAGGTTAAAAAAAAAAATCCTCCACAATCCATT
TTCTGGAAAACAGCTTACTTCAAAGACCCACCCTTCCTGTAGGACTTTAGTACATCTTTCAGGTGCTTCT;
0 then the resulting sequence will be 0
        10         20         30         40         50
GIREFNWRNU NRAUUUSNOR VEGICUSWGS SUPERCONUI GAUGUUCAAU
        60         70         80         90        100
UUUAUCUAAU CCCUGUUACU CUGGAAAACA GGUUAAAAAA AAAAAUCCUC
       110        120        130        140        150
CACAAUCCAU UUUCUGGAAA ACAGCUUACU UCAAAGACCC ACCCUUCCUG
       160        170        180        190
UAGGACUUUA GUACAUCUUU CAGGUGCUUC U;
0 rather than 0 The letter `N' should be used for an unspecified base. It is not allowed to pair. The lett 0 BMC Bioinformatics 2002, 3 0 BioMed Central 0 Methodology article 0 Open Access 0 Oliz, a suite of Perl scripts that assist in the design of microarrays using 50mer oligonucleotides from the 3' untranslated region 1 Hao Chen* and Burt M Sharp* 0 Keywords: oligonucleotide microarray, Perl, UniGene 0 DNA microarrays usually involve the hybridization of labeled cDNA samples to a set of complementary DNA (either PCR products or synthetic oligonucleotides) fixed onto solid media. Spotting presynthesized oligonucleotides has many advantages, such as high sensitivity, convenience, and cost effectiveness. Most importantly, the use of oligonucleotide probes circumvents the high error rate that is associated with the PCR amplification of bacterial clones [4,6]. The starting point in the design of oligonucleotide microarrays is the identification of short DNA sequences that can be used as probes for the genes of interest.
Obviously, all sequences should be gene specific and have similar melting temperature (Tm). We have been interested in using the 3' untranslated region (3'UTR) as the target region for the design of oligonucleotide probes primarily because of the relatively high specificity of this region [8] and the availability of sequence information (in the form of Expressed Sequence Tags, ESTs). First, our approach involves the identification of genes of interest in the form of UniGene clusters. The sequences of these clusters were retrieved and assembled into contigs. Then, the 3'UTRs were parsed from the contigs. Finally, oligonucleotide sequences of 50 nucleotides with similar melting temperature (Tm) and GC content were selected and screened for specificity (Figure 1). 0 The Oliz suite was written in Perl (v.5.6) and was tested on the RedHat Linux (v.7.1) operating system. Oliz has four modules. The UNI module extracts UniGene clusters, which are assembled into contig(s) by the CONTIG module. Then, the UTR module parses the 3'UTRs of the contigs, and selects multiple 50mer sequences that are within the selected range for GC content (45-50%) and Tm (76°C ± 5). Lastly, the UNIQ module performs blast searches on the 50mers to ensure their gene specificity. 0 Step 1. UniGene retrieval and contig assembly A list of selected UniGenes is first compiled and used as the input file for the UNI module. The sequences contained in these UniGene clusters are extracted by the UNI module. To achieve this function, the UNI module requires a file that contains all the UniGene sequences of the species of interest. This file is available from NCBI's FTP site [ftp://ftp.ncbi.nih.gov/repository/UniGene]. The name of this file follows the convention of "species.seq.all". Then, the CONTIG module assembles each of the clusters into a contig using the CAP3 program [2]. Due to the high error rate in both the sequence and annotation of ESTs, clusters that contain only one EST sequence are excluded from further analysis. Step 2. Parsing 3'UTR The UTR module performs several tasks. Initially, it determines the orientation of a contig by comparing it to a reference sequence, such as those provided by the NCBI RefSeq project [5] (1st priority), or GenBank sequences with coding region annotations (2nd priority), or sequences with polyA tails (3rd priority). It is generally assumed that these sequences are in 5'-3' orientation. When the above approaches fail to identify the orientation of the contig, its cluster identifier is sent to a separate file. The orientation of these contigs can be obtained manually by cross-referencing to their homologues in other species, and then be incorporated into the results. 0 The UTR module's main function is to parse the 3'UTR of the contigs, according to the coding region annotation in the reference sequence. The length of the 3'UTR varies from gene to gene. Based on the average length of the transcripts obtained from oligo dT primed cDNA synthesis, we decided to target the last 500 bases of the 3'UTR as the region for the selection of 50mer oligonucleotides. In addition, the UTR module generates several HTML files to facilitate visual inspection of the results. These files contain links to the UniGene cluster sequences, the contigs and the 3'UTRs. 0 Step 3. Generating 50mer oligonucleotides with close Tms The EMBOSS prima program was used to select 50mer oligonucleotides with similar Tms for each 3'UTR. The Tm was set at 76 ± 5°C based on the average Tm for 50mers. The resulting 50mer sequences were saved as a comma-separated text file, ready for processing by the UNIQ module. Step 4. Similarity search One of the advantages of using the 3'UTR as the target region for hybridization is that this region has been under less evolutionary pressure to remain constant. However, this does not guarantee that all 50mers selected from this region are gene specific. Therefore it is necessary to identify potentially similar sequences in other genes. The UNIQ module automates the blastn search, analyzes the blastn results, and decides whether to retain or discard a particular 50mer based on the set criteria. The UNIQ module runs blastn searches using a local database constructed using sequences obtained from NCBI. While analyzing the sequences identified by blastn, it disregards accession numbers that are found in the same UniGene cluster as the 50mer. Matches that are oriented complementary to the 50mer also are ignored, and only sense/sense pairs are analyzed further. The orientation of the blastn matches is apparent when they are known genes. EST hits are judged based on their "clone_end" annotation. Kane et al. [3] reported that specificity of a 50mer oligonucleotide requires that it is less than 75% similar to all non-target transcripts. In addition, when it is 50-75% similar to a non-target transcript, the similar region must not include a stretch of sequence of greater than 15 contiguous bases. Since blast only returns part of the sequence where a match is found (usually less than 50 nucleotides), it is necessary for the UNIQ module to retrieve the entire matching sequence before calculating the overall sequence similarity. The guidelines reported by Kane et al. are then followed to determine whether candidate 50mers are acceptable. Occasionally all the candidate sequences generated by EMBOSS prima were disqualified when compared to one EST entry. This is a difficult issue, insofar as these ESTs may represent unknown genes, implying that the candidate 50mer is not gene specific.
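As a rough illustration of the specificity rule attributed to Kane et al. above, the following Python sketch applies the two thresholds (overall similarity below 75%, and no contiguous matching stretch longer than 15 bases when similarity is 50-75%) to an ungapped, position-by-position comparison of a 50mer with an equally long stretch of a non-target sequence. The function names and the ungapped comparison are assumptions made for illustration only; the UNIQ module itself is written in Perl and works from blastn output.

```python
def percent_identity(probe: str, target: str) -> float:
    """Percent of matching positions in an ungapped, equal-length comparison."""
    if len(probe) != len(target) or not probe:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(probe.upper(), target.upper()))
    return 100.0 * matches / len(probe)


def longest_identical_run(probe: str, target: str) -> int:
    """Length of the longest stretch of consecutive matching positions."""
    best = run = 0
    for a, b in zip(probe.upper(), target.upper()):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


def is_specific(probe: str, target: str,
                max_identity: float = 75.0,
                mid_identity: float = 50.0,
                max_run: int = 15) -> bool:
    """Return True if the probe passes both thresholds against one non-target."""
    identity = percent_identity(probe, target)
    if identity >= max_identity:
        return False  # too similar overall
    if identity >= mid_identity and longest_identical_run(probe, target) > max_run:
        return False  # 50-75% similar with a long contiguous match
    return True
```

A candidate 50mer would be kept only if `is_specific` holds for every non-target sequence it is compared against.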
However, the apparent similarity may simply be caused by errors in the EST. When this occurs, the UNIQ module performs another blastn search that excludes all the ESTs from the database. The accession number of the EST in question is provided in the output file and a detailed log file for each oligonucleotide sequence is also provided. 0 Experimental verification of the specificity of the 50mer oligonucleotides A set of 1816 rat specific 50mer oligonucleotide sequences was obtained using the methods described above. Most of these genes are known to be expressed in the central nervous system. These oligonucleotides were spotted in duplicate onto TeleChem SuperAmine slides. 0 brain mRNA. Five of these ten primer pairs amplified a single product with the expected length. A second PCR reaction was performed on these 5 RT-PCR products to selectively amplify the antisense strand while incorporating amino allyl dUTP. These antisense DNAs then were labeled with Cy3 fluorescent dye, and were used for microarray hybridization. Each microarray slide was only probed with one Cy3-labeled DNA. All of the five Cy3-labeled cDNAs hybridized to their expected spots. The subgrids (13 × 8 spots) that contain the specific hybridized spots are shown in Figure 2. Two spots on the array, known to have green autofluorescence (not shown), were excluded from the analysis. Depending on the specific cDNA sequence, there were 0-4 additional spots that had detectable fluorescence. This represent 0 Spotted Long Oligonucleotide Arrays for Human Gene Expression Analysis 1 Andrea Barczak,1 Madeleine Willkom Rodriguez,1 Kristina Hanspers,2 1 Laura L. Koth,1 Yu Chuan Tai,3 Benjamin M. Bolstad,3 Terence P. Speed,4,5 and David J. Erle1,6 0 Microarrays can be produced by deposition (or spotting) of DNA or by in situ synthesis of oligonucleotides on a solid substrate. Spotted cDNA arrays are typically produced by depositing PCR amplicons, made from cDNA clones, on modified glass slides (Schena et al.
1996). In general, PCR amplicons are several hundred to a few thousand base pairs, and one amplicon (or sometimes a few different amplicons) are used to probe each gene. These arrays can be produced by individual investigators or core facilities, or can be purchased commercially. Production of microarrays by in situ synthesis requires more sophisticated and costly equipment, and these arrays are generally produced commercially. One widely used implementation of this technology is the Affymetrix short oligonucleotide array (GeneChip). Here, photolithography and solid-phase chemistry are used to produce high-density arrays of 25-mer oligonucleotides (Lockhart et al. 1996). Each perfect-match oligonucleotide is paired with a mismatched oligonucleotide, and several (11-20) pairs of 25-mers are used for each gene. Various approaches have been used to verify the accuracy of microarray data. Microarray assay technology can be calibrated by spiking known quantities of one or several RNA transcripts into test samples. Alternatively, independent 0 Genome Research 0 Barczak et al. 0 We produced two different sets of spotted arrays using two collections of long oligonucleotide probes (Operon Human Genome Oligo Set Versions 1 and 2, Table 1). There were 10,801 UniGene clusters that were represented in both groups of probes, but the sequences of these two groups of probes were largely independent: Version 1 and Version 2 probes overlapped significantly (by at least 25 identical bases) for just 0 Genome Research 0 Spotted Long Oligonucleotide Arrays 0 of the 10,801 gene clusters that were represented in both versions. We also used commercially produced arrays containing sets of 25-mer probes synthesized in situ (Affymetrix U95Av2 GeneChips). We used all three groups of probes to compare gene expression in two total RNA samples, one made from K562 erythroleukemia cells and one made from a pool of 10 different cell lines. 
For spotted long oligonucleotide arrays, the RNA samples were used to produce labeled cDNA targets. Two color hybridizations were performed using Cy3- and Cy5-labeled targets derived from the two 0 ANALYTICAL BIOCHEMISTRY 0 A new polymeric coating for protein microarrays 1 Marina Cretich,a,* Giovanna Pirri,a Francesco Damin,a Isabella Solinas,b and Marcella Chiaria 0 Keywords: Protein microarrays; Polymer coating; Rheumatoid factor 0 Protein microarrays are becoming an important tool in proteomics, drug discovery programs, and diagnostics [1]. The amount of information obtained from small quantities of biological samples is significantly increased in the microarray format. This feature is extremely valuable in protein profiling, where samples are often limited in supply and unlike DNA, cannot be amplified [2]. Protein microarrays are more challenging to prepare than are DNA chips [3] because several technical hurdles hamper their application. The surfaces typically used with DNA are not easily adaptable to proteins, owing to the biophysical differences between the two classes of bioanalytes [4]. Arrayed proteins must be immobilized in a native conformation to maintain their biological function. Unfortunately, proteins tend to unfold when immobilized onto a support so as to allow internal hydrophobic side chains to form hydrophobic bonds with the solid surface [5]. The accessibility of the protein is also of crucial importance to achieve proper recognition during hybridization; protein-substrate interactions reduce the accessibility of the target, leading to false negative results. Another important requirement of the surface is to provide a low unspecific background because unwanted adsorption of proteins leads to false positive results. The presence of an aspecific background is one of the most severe problems in antibody microarrays [6].
The achievement of a low degree of unspecific binding is extremely difficult when the protein sample is a complex mixture of thousands of molecules [4]. Current microarray supporting materials can be divided into two major categories [7]: surfaces coated with gels, such as polyacrylamide and agarose, and surfaces derivatized with functional groups, such as aldehyde, epoxy, and amino groups (polylysine). Methods for on-chip protein analysis also include the ProteinChip array technology that is based on selective extraction and retention of proteins on chromatographic chip surfaces and analysis by laser desorption/ionization mass spectrometry [8]. 0 Recently, our group has introduced a new type of polymeric glass slide for DNA microarrays [9] obtained by adsorption of a copolymer of N,N-dimethylacrylamide (DMA),1 N,N-acryloyloxysuccinimide (NAS), and [3-(methacryloyl-oxy)propyl]trimethoxysilyl (MAPS): copoly(DMA-NAS-MAPS). Each monomer confers to the copolymer a specific feature. NAS is the reactive group able to bind amino-modified DNA and primary amines of lysines and arginines in proteins. DMA, which forms the polymer backbone, facilitates polymer adsorption on the glass surface, whereas MAPS covalently reacts with free silanols and stabilizes the coating. The coating is innovative in that it adsorbs onto the glass surface very quickly (10-30 min) from a diluted aqueous solution. Therefore, the coating procedure is fast and robust, providing an inexpensive hydrophilic functional surface. The performance of glass slides coated with the copoly(DMA-NAS-MAPS) has been studied extensively in DNA microarray experiments [9]. In the current work, copoly(DMA-NAS-MAPS) slides were used as a microarray support for protein-protein interaction experiments and in the assessment of rheumatoid factor (RF) in human serum samples. 0 Materials and methods Materials DMA and MAPS were obtained from Sigma (St. Louis, MO, USA). NAS was obtained from Polysciences (Warrington, PA).
Anti-rabbit immunoglobulin G (IgG) F(ab')2 fragments specific, developed in goat (goat IgG specific for the Fab fragments) were obtained from Jackson ImmunoResearch Laboratories (West Grove, PA, USA). Anti-human polyvalent immunoglobulins developed in goat (goat IgG), Tris, BSA, and Tween 20 were obtained from Sigma. Immunoglobulins from rabbit serum (rabbit IgG) were obtained from Life Line Lab (Pomezia, Italy). CodeLink Activated Slides were obtained from Amersham Biosciences (Piscataway, NJ, USA), and ArrayIt Super Aldehyde Substrates were obtained from TeleChem International (Sunnyvale, CA, USA). Glass slide coating Untreated microscope glass slides (Sigma) were pretreated with 1 M NaOH for 30 min and 1 M HCl for 1 h, 0 Abbreviations used: DMA, N,N-dimethylacrylamide; NAS, N,N-acryloyloxysuccinimide; MAPS, [3-(methacryloyl-oxy)propyl]trimethoxysilyl; RF, rheumatoid factor; IgG, immunoglobulin G; NHS, N-hydroxysuccinimide; D/P, dye-to-protein ratio; PMT, photomultiplier tube; S/N, signal-to-noise ratio; EIA, enzyme-linked immunoassay; ELISA, enzyme-linked immunosorbent assay; XRR, X-ray reflectivity. 0 where 170,000 M⁻¹ cm⁻¹ is assumed as the molar extinction coefficient for IgG. The dye-to-protein ratio (D/P) for the labeled IgG was calculated according to the following equation: $(D/P) = (1.13\,A_{552}) / [A_{280} - (0.08\,A_{552})]$, (2) 0 and scanned again. Mean intensity values of 4 × 4 spot subarrays were calculated and plotted against spotted concentration. Antibody Fab portion recognition on copoly(DMA-NAS-MAPS) slides Rabbit IgG were dissolved in a PBS buffer at different concentrations and spotted on the copoly(DMA-NAS-MAPS) slides. After overnight binding in a humid chamber, printed slides were rinsed and blocked with BSA (2% w/v) in a phosphate buffer (50 mM, pH 7.2) for 1 h.
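The dye-to-protein ratio of equation (2), given earlier in this section, is simple enough to check numerically. The Python sketch below evaluates it from the two absorbance readings; the function name and the example absorbances in the usage note are hypothetical, and the constants 1.13 and 0.08 are taken from the equation as printed.

```python
def dye_to_protein_ratio(a552: float, a280: float) -> float:
    """D/P = (1.13 * A552) / (A280 - 0.08 * A552), following equation (2)."""
    # The subtraction presumably corrects A280 for dye absorbance at 280 nm;
    # that reading is an interpretation, not something the text states.
    corrected_a280 = a280 - 0.08 * a552
    if corrected_a280 <= 0:
        raise ValueError("corrected A280 must be positive")
    return (1.13 * a552) / corrected_a280
```

For instance, with hypothetical readings of 0.5 at 552 nm and 1.04 at 280 nm, the corrected denominator is 1.0 and the function returns a ratio of about 0.565.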
The slides were incubated for 1 h with Cy3-labeled goat IgG specific for the Fab fragments, dissolved in the hybridization buffer (Tris-HCl, 0.1 M, pH 8; 0.1 M NaCl; 1% w/v BSA; 0.02% w/v Tween 20) at a concentration of 0.05 mg/ml. After washing with Tris-HCl (0.05 M, pH 9), 0.25 M NaCl, 0.05% Tween 20, PBS, and water, the slides were dried and scanned for fluorescence evaluation. Sandwich immunoassay on microarray format The capture antigen (rabbit IgG) was dissolved in 0 Evolution of new nonantibody proteins via iterative somatic hypermutation 1 Lei Wang*, W. Coyt Jackson*, Paul A. Steinbach*, and Roger Y. Tsien* 0 B lymphocytes use somatic hypermutation (SHM) to optimize immunoglobulins. Although SHM can rescue single point mutations deliberately introduced into nonimmunoglobulin genes, such experiments do not show whether SHM can efficiently evolve challenging novel phenotypes requiring multiple unforeseeable mutations in nonantibody proteins. We have now iterated SHM over 23 rounds of fluorescence-activated cell sorting to create monomeric red fluorescent proteins with increased photostability and far-red emissions (e.g., 649 nm), surpassing the best efforts of structure-based design. SHM offers a strategy to evolve nonantibody proteins with desirable properties for which a high-throughput selection or viable single-cell screen can be devised. 0 directed evolution mPlum Ramos red fluorescent protein 0 Materials and Methods 0 Introduction of the mRFP1.2 Gene into Ramos Cells. The mRFP1.2 gene was amplified with primer pair LW5 (5'-CGCGGATCCGCCACCATGGTGAGCAAGGGC-3') and LW3 (5'-CCATCGATTTAGGCGCCGGTGGAGTGGCG-3'), digested with BamHI and ClaI, and ligated into a precut pCLNCX (Imgenex, San Diego) derivative retroviral vector, in which the cytomegalovirus (CMV) promoter was replaced with the inducible Tet-on promoter.
The resultant plasmid, pCLT-mRFP, was cotransfected with pCL-Ampho (Imgenex) into HEK293 cells to make the retrovirus, which was subsequently used to infect Ramos cells [CRL-1596, American Type Culture Collection (ATCC)] together with another retrovirus harboring the reverse Tet-controlled transactivator. Ramos cells were grown in modified RPMI medium 1640 as suggested by ATCC. Doxycycline (2 µg/ml) was added to induce the expression of mRFP 24 h before FACS, and infected cells were sorted for six rounds to enrich red fluorescent cells. In the initial sorting, 5% of cells became red, indicating a multiplicity of infection well below 1. 0 Protein Evolution by FACS. Ratio sorting was applied to evolve mRFP mutants with red-shifted emissions. Ramos cells were excited at 568 nm, and two emission filters (660/40 and 615/40) were used. The ratio of intensity at 660 nm to that at 615 nm was plotted against the intensity at 660 nm. Cells with the highest ratio and sufficient intensity at 660 nm were collected (Fig. 1B). Usually one million cells were collected each time, and they were grown in the absence of doxycycline until 24 h before the next round of sorting. Mutant Characterization. Sorted cells were amplified in the absence of doxycycline, and 0.1 µg/ml doxycycline was then added for 10 h. Total mRNA was extracted from these cells and used as template for RT-PCR to clone mRFP mutant DNA with primer pair pCL5 (5'-AGCTCGTTTAGTGAACCGTCAGATC-3') and pCL3 (5'-GGTCTTTCATTCCCCCCTTTTTCTGGAG-3'). These mutant mRFP genes were subcloned into a pBAD vector (Invitrogen) and expressed in Escherichia coli. A His-6 tag was added to the C terminus to facilitate protein purification using Ni-NTA chromatography (Qiagen, Valencia, CA).
Spectroscopic measurements were as described previously (12), except that concentrations of mRFPs were determined by assuming an extinction coefficient after denaturation in 0.1 M NaOH of 44,000 M⁻¹ cm⁻¹ at 452 nm, the same value as that of similarly denatured Renilla GFP (13, 14). Photobleaching Measurements. Microdroplets of aqueous protein, pH 7.4, typically 5-10 µm in diameter, were created on a microscope coverslip under mineral oil and bleached by using a Zeiss Axiovert 200 microscope at 14.3 W/cm² with a 75-W xenon lamp and a 540- to 595-nm excitation filter. Reproducible results required preextraction of the mineral oil with aqueous buffer shortly before microdroplet formation. 0 Identification of Integration Loci. The integration loci of provirus 0 Freely available online through the PNAS open access option. Abbreviations: SHM, somatic hypermutation; mRFP, monomeric red fluorescent protein. Data deposition: The sequences reported in this paper have been deposited in the GenBank database [accession nos. AY786536 (mRaspberry) and AY786537 (mPlum)]. 0 by The National Academy of Sciences of the USA 0 November 30, 2004 0 APPLIED BIOLOGICAL SCIENCES 0 Wang et al. 0 MAXIMIZING THE POTENTIAL OF FUNCTIONAL GENOMICS 1 Lars M. Steinmetz* and Ronald W. Davis 0 Geneticists have made tremendous progress in understanding the genetic basis of phenotypes, and genomics promises to bring further insights at a rapid pace. The progress in functional genomics has been driven primarily by the development of new techniques that are used in a few dedicated research centres. Focusing on selected advances in genomic technologies, we assess the results that have been obtained so far, highlight the challenges faced by these new tools and suggest ways in which they can be overcome. We argue that progress in functional genomics will depend on developing high-throughput technologies that can easily be moved away from dedicated centres and into individual laboratories.
0 COMPLEX TRAITS 0 A trait that is determined by many genes, almost always interacting with environmental influences. 0 Biology is entering an exciting era brought about by the increase in genome-wide information. Functional genomics in particular is making rapid progress in assigning biological meaning to genomic data. The tools of functional genomics have enabled several systematic approaches that can provide the answers to a few basic questions for the majority of genes in a genome, including when is a gene expressed, where is its product localized, with which other gene products does it interact and what phenotype results if a gene is mutated. Functional genomics aspires to answer such questions systematically for all genes in a genome in contrast to conventional approaches that do so for one gene at a time. Several key biological challenges are central to continuing genome projects and are relevant to any eukaryotic organism, from yeast to humans. One challenge is to understand how genes that are encoded in a genome operate and interact to produce a complex living system. A related challenge is to determine the function of all the sequence elements in the genome. A third challenge is to understand the contributions of the multitude of sequence variants to phenotypic variation, both within and between species. One of the most enduring challenges in genetics has been to find the genetic variants that are responsible for COMPLEX 1 TRAITS . Current methods have mostly failed to meet 0 this challenge2, resulting in the need for new concepts and genome-wide technologies if this complexity is to be dissected. Despite the unresolved issues, the power and potential of functional genomics is impressive. We illustrate this here by discussing three core applications of genome technology, using selected examples from different organisms: genome-wide knock-out, gene expression and genetic mapping studies. 
We go beyond these examples to point out the areas in which technological improvements are possible. As functional approaches and verification of their accuracy often require genetic manipulation, many technical advances in functional genomics have their origin in model systems. Nonetheless, an effective transition of some of the technologies to humans is becoming more attractive3. The utility of such a transition can be maximized by careful evaluation of the power and limitation of these approaches. To obtain the most benefits from functional genomics, we argue, the technology, which is at present mainly carried out by a few dedicated centres, needs to become integrated into individual laboratories. Individual laboratories often have crucial expertise in a specific biological problem, and although functional genomics might provide approaches to address them, a key discovery can often only be made by bringing the 0 two together. We believe that for this to be achieved, two goals should be met: experiments must be further miniaturized and costs must be lowered. 0 Technological innovations 0 sequences takes centre stage3. With this role in mind, we evaluate three areas of functional genomics that have been piloted in different model systems. We indicate promising directions of research and suggest new approaches that need to be designed. Interfering with gene function. Phenotypic analysis of mutants has been a powerful approach for determining gene function. Gene function can be altered through gene deletions, insertional mutagenesis and RNA INTERFERENCE (RNAi) (BOX 1). Few methods offer the experimental control that is afforded by gene deletion. A true knock-out or null mutation achieves complete functional reduction of the encoded gene product. Because it is difficult to achieve in many organisms, compromises have been made by generating incomplete knock-outs. 
Gene products can be knocked-down or silenced as a result of point mutagenesis, insertional mutagenesis or RNAi. Although not yet feasible on a large scale, proteins might be targeted using drugs7, and it might eventually be possible to use drug compounds to generate knock-downs for every gene product in a genome and to apply them across species. The power of systematic mutant analysis is well illustrated by an experiment in which an international consortium systematically generated a gene deletion strain for every gene in the yeast Saccharomyces cerevisiae genome and analysed the phenotypes in a single tube assay8,9 (FIG. 1). The quantitative fitness measurements that are obtained for each gene with this tool enable applications beyond determining whether a gene is essential. This is an important advance because it opens up a wide variety of applications based on quantitative analysis, such as identifying functionally relevant genes and drug targets, comparing function and expression, defining candidate disease genes and studying molecular evolution (BOX 2). 0 Efforts towards increased miniaturization and decreased costs are exemplified by developments that originated from genome sequencing. In many ways, functional genomics was catalysed by the genome-sequencing projects: large-scale sequencing and the genome projects created an increase in available DNA sequences, around which new technologies that use this information were developed. A result is one of the most widely recognized and accessible genomics tools -- the DNA microarray -- which allows parallel hybridization assays to be carried out on an unprecedented, miniaturized scale. The second, and often unrecognized, contribution of the genome projects is the ~1,000-fold decrease in the cost of DNA sequencing, which had to be achieved to complete the Human Genome Project. 
The drop in sequencing costs facilitated large-scale sequencing projects of other organisms and has contributed to the fact that DNA sequencing is still the most frequently used technology for detecting DNA variation. Today, the comparison of genomes among several species allows the study of numerous biological features, such as studies of conserved sequences4-6. Developments of genomic technology have until now primarily focused on the generation of genome sequence data, from the development of genomeanalysis technologies to the generation of physical and genetic maps, the sequencing of model organism genomes and the completion of the human genome sequence. The next focus in genomics builds on the genome sequences and heralds the beginnings of an exciting phase of genome biology -- the true genome era, when deriving functional information from genome 0 Targeted deletion by homologous recombination 0 Precise gene deletion can be readily achieved by homologous recombination in yeast 8,9 and mouse11. Because this approach removes the targeted gene, functional reduction is complete. In organisms in which it works, this method is the gold standard. Unfortunately, homologous recombination does not work efficiently in several model organisms, including Arabidopsis and Caenorhabditis elegans. Although it has been shown to work in some cases, as seen recently in Drosophila12, the efficiencies are still too low for systematic application. 0 Insertional mutagenesis 0 Disruption of gene sequences can be achieved by insertional mutagenesis using transposons or other insertion sequences. Because the genome insertions are random, screening for disruption in a gene of interest is required. The insertion can lead to complete, incomplete or no functional reduction, depending on where the integration occurs. The insertion site and level of functional reduction therefore need to be determined experimentally. 
The method has been used extensively in Arabidopsis¹⁶, Drosophila⁷⁷,⁷⁸, yeast¹⁵, mouse⁷⁹ and C. elegans⁸⁰.

RNA interference

RNA INTERFERENCE (RNAi). A process by which double-stranded RNA specifically silences the expression of homologous genes through degradation of their cognate mRNA.

RNA interference (RNAi) is the newest technology for reducing gene expression. It follows reports of gene silencing in plants and other model organisms⁸¹, and is based on the observation from C. elegans that adding double-stranded RNA (dsRNA) to cells often interferes with gene function in a sequence-specific manner¹⁷. In most cases, the level of functional reduction is incomplete and the level of specificity is not entirely predictable²⁴⁻²⁶. Nevertheless, RNAi has been shown to work in many model organisms. Current applications are primarily in C. elegans¹⁸, Drosophila¹⁹, various plants⁸¹, tissue culture cells of Drosophila⁸² and mammals²³.

NATURE REVIEWS | GENETICS

[Figure residue: schematic of a deletion cassette (UPTAG, KanMX, DNTAG) replacing the ORF between Start and Stop, followed by PCR amplification of the tags. Remaining panel labels: CP, F, TAG, Hybr (truncated).]

Picoliter-Scale Protein Microarrays by Laser Direct Write
B. R. Ringeisen,* P. K. Wu, H. Kim, A. Piqué, R. Y. C. Auyeung, H. D. Young, and D. B. Chrisey
Naval Research Laboratory, Code 6372, 4555 Overlook Ave. SW, Washington, D.C. 20375
D. B. Krizman
Advanced Technology Center, National Cancer Institute, Gaithersburg, Maryland

We demonstrate the accurate picoliter-scale dispensing of active proteins using a novel laser transfer technique. Droplets of protein solution are dispensed onto functionalized glass slides and into plastic microwells, activating areas as small as 50 µm in diameter on these surfaces. Protein microarrays fabricated by laser transfer were assayed using standard fluorescent labeling techniques to demonstrate successful protein and antigen binding.
These results indicate that laser transfer does not damage the active site of the dispensed protein and that this technique can be used to successfully fabricate a functioning protein microarray. Also, as a result of the efficient nature of the process, material usage is reduced by two to four orders of magnitude compared to conventional pin dispensing methods for protein spotting.

Microarrays are used widely as an efficient method to identify thousands of different analytes in solution with a single assay (e.g., protein expression, drug efficacy, DNA binding, etc.) (1). In the field of genomics, microarrays are fabricated using different immobilized cDNA molecules to detect genes for both biological and medical research (2). This technology has increased the speed and efficiency of gene identification by orders of magnitude over more traditional assays such as Northern blot and RT-PCR approaches (3). The power and success of high-throughput screening experiments have resulted in a new industry that manufactures both standard cDNA microarrays and machines developed to fabricate arrays specific to user needs (4). The next, potentially more important step in biomedical research is that of high-throughput protein analysis (5). Because proteins perform most vital functions, many scientists believe that the key to early-stage disease detection, interdiction, and prevention lies with protein identification and expression analysis. One method of identifying proteins is to create an antibody microarray that uses thousands of different antibodies synthesized to bind specifically to different proteins (6). Knezevic et al. used this approach to successfully identify 365 different proteins and correlated differential patterns of protein expression with disease progression (7).
This

Materials and Methods

Matrix assisted pulsed laser evaporation direct write, or MAPLE DW, is a laser-based processing technique that is capable of fabricating structures from a wide range of materials including metals, dielectrics, polymers, active proteins, and even living cells (9, 10). Figure 1 shows a schematic of the MAPLE DW technique as applied to protein solutions. To dispense active biological fluids, a variable concentration protein solution is mixed using 40 vol % glycerol/60 vol % phosphate buffer solution (PBS) as a solvent. Altering the concentration of proteins in these solutions over several orders of magnitude is used as a method to control the density of active molecules on the microarray substrate. A 0.5- to 1.0-µL aliquot of protein solution is then uniformly coated at room temperature onto a UV transparent quartz disk over an area of 1 cm² by using a micropipet to spread the fluid and a spin coater to homogenize the film (disk is spun for 10 s at 1000 rpm) (11). A 193-nm laser pulse from an ArF excimer laser is first focused at the quartz/fluid interface to 150 × 200 µm² and 50 mJ/cm². This pulse is directed through the backside of the quartz support so that the laser energy first interacts with the fluid at the quartz interface. Layers of fluid near the support interface then evaporate as a result of localized heating from electronic excitation, rapidly forming a bubble beneath the fluid layer. When this bubble bursts, an aliquot of protein solution is released, propelling a droplet away from the quartz support to a substrate positioned 25 µm to several millimeters away. The amount of protein solution in the aliquot is reproducibly determined by the focused laser spot size (variable from 10² to 3 × 10⁴ µm²) and the thickness of the solution coating on the support (variable from 1 to 100 µm thick).
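The scaling of dispensed volume with spot size and coating thickness can be made concrete with a short calculation. The sketch below rests on two simplifying assumptions that are not stated in the text: that the coated aliquot spreads evenly over the stated area with no loss during spin coating, and that at most the full column of fluid under the laser spot is transferred, so the result is an upper bound. Function names are illustrative only.

```python
UM3_PER_PL = 1_000.0       # 1 pL = 1e3 cubic micrometres
UM3_PER_UL = 1.0e9         # 1 uL = 1 mm^3 = 1e9 cubic micrometres
UM2_PER_CM2 = 1.0e8        # 1 cm^2 = 1e8 square micrometres


def film_thickness_um(aliquot_ul, coated_area_cm2):
    """Mean film thickness, assuming the whole aliquot stays on the disk."""
    return (aliquot_ul * UM3_PER_UL) / (coated_area_cm2 * UM2_PER_CM2)


def max_transfer_volume_pl(spot_area_um2, thickness_um):
    """Upper bound on droplet volume: spot area times film thickness."""
    return spot_area_um2 * thickness_um / UM3_PER_PL


if __name__ == "__main__":
    # 0.5 uL spread over 1 cm^2, as in the coating step described above.
    t = film_thickness_um(0.5, 1.0)
    print(f"film thickness: {t:.1f} um")
    # Smallest and largest quoted spot sizes at that thickness.
    for area in (1.0e2, 3.0e4):
        v = max_transfer_volume_pl(area, t)
        print(f"spot {area:.0f} um^2 -> at most {v:.1f} pL")
```

Under these assumptions a 0.5-µL aliquot over 1 cm² corresponds to a film about 5 µm thick, which falls inside the 1 to 100 µm range quoted for the coating.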
Nearly all of the laser energy is absorbed interfacially, so that a minimal amount of the fluid coating is vaporized and the bulk protein solution is transferred in the liquid phase without significant heating (9). Movement of the computer-controlled stages is synchronized to the firing of the laser, enabling this tool to rapidly fabricate complex 2-D and 3-D structures, including microarrays. 0 Results and Discussion 0 Suppression subtractive hybridization: A method for generating differentially regulated or tissue-specific cDNA probes and libraries 1 LUDA DIATCHENKO*, YUN-FAI CHRIS LAU, AARON P. CAMPBELL, ALEX CHENCHIK*, FAUZIA MOQADAM*, BETTY HUANG*, SERGEY LUKYANOV, KONSTANTIN LUKYANOV, NADYA GURSKAYA, EUGENE D. SVERDLOV, AND PAUL D. SIEBERT* 0 solve the problem of the wide differences in abundance of individual mRNA species. Consequently, multiple rounds of subtraction are still needed (7). The mRNA differential display (8) and RNA fingerprinting by arbitrary primed PCR (9) are potentially faster methods for identifying differentially expressed genes. However, both of these methods have a high level of false positives (10, 11), biased for high copy number mRNA (12) and might be inappropriate in experiments in which only a few genes are expected to vary (11). Here we present a new PCR-based cDNA subtraction method, termed suppression subtractive hybridization (SSH), and demonstrate its effectiveness. SSH is used to selectively amplify target cDNA fragments (differentially expressed) and simultaneously suppress nontarget DNA amplification. The method is based on the suppression PCR effect previously described by our laboratories: long inverted terminal repeats when attached to DNA fragments can selectively suppress amplification of undesirable sequences in PCR procedures (14, 15). We have recently applied the suppression PCR effect in chromosome walking (14) and rapid amplification of cDNA ends (15). 
The subtraction method described here overcomes the problem of differences in mRNA abundance by incorporating a hybridization step that normalizes (equalizes) sequence abundance during the course of subtraction by standard hybridization kinetics. It eliminates any intermediate step(s) for physical separation of ss and ds cDNAs, requires only one subtractive hybridization round, and can achieve greater than 1,000-fold enrichment for differentially expressed cDNAs. We demonstrate the effectiveness of the SSH method by generating a testis-specific cDNA library and characterizing selected cDNA clones. Furthermore, we show that the subtracted cDNA mixture can be used directly as a hybridization probe for screening recombinant DNA libraries, such as a human Y chromosome cosmid library, thereby identifying chromosome-specific and tissue-specific expressed sequences.

MATERIALS AND METHODS

Oligonucleotides. The following gel-purified oligonucleotides were used. (i) cDNA synthesis primer: Pr16, 5′-TTTTGTACAAGCTT₃₀-3′. (ii) Adapters: adapter 1, 5′-GTAATACGACTCACTATAGGGCTCGAGCGGCCGCCCGGGCAGGT-3′ 3′-CCCGTCCA-5′

Abbreviation: SSH, suppression subtractive hybridization.
Data deposition: The sequences reported in this paper have been deposited in the GenBank data base (accession nos. H48477, H48478, H48931-H48939, H52858-H54046, H54559-H54560, H56769-H56778, and H64202-H64207).

Biochemistry: Diatchenko et al.

Impact of surface chemistry and blocking strategies on DNA microarrays
Scott Taylor¹, Stephanie Smith¹, Brad Windle² and Anthony Guiseppi-Elie¹,³,*

ABSTRACT The surfaces and immobilization chemistries of DNA microarrays are the foundation for high quality gene expression data. Four surface modification chemistries, poly-L-lysine (PLL), 3-glycidoxypropyltrimethoxysilane (GPS), DAB-AM-poly(propyleneimine hexadecaamine) dendrimer (DAB) and 3-aminopropyltrimethoxysilane (APS), were evaluated using cDNA and oligonucleotide sub-arrays.
Two un-silanized glass surfaces, RCA-cleaned and immersed in Tris-EDTA buffer, were also studied. DNA on amine-modified surfaces was fixed by UV (90 mJ/cm²), while DNA on GPS-modified surfaces was immobilized by covalent coupling. Arrays were blocked with either succinic anhydride (SA) or bovine serum albumin (BSA), or left unblocked, prior to hybridization with labeled PCR product. Quality factors evaluated were surface affinity for cDNA versus oligonucleotides, spot and background intensity, spotting concentration and blocking chemistry. Contact angle measurements and atomic force microscopy were performed to characterize surface wettability and morphology. The GPS surface exhibited the lowest background intensity regardless of blocking method. Blocking the arrays did not affect raw spot intensity, but affected background intensity on amine surfaces, BSA blocking being the lowest. Oligonucleotides and cDNA on unblocked GPS-modified slides gave the best signal (spot-to-background intensity ratio). Under the conditions evaluated, the unblocked GPS surface along with amine covalent coupling was the most appropriate for both cDNA and oligonucleotide microarrays.

INTRODUCTION The DNA microarray enables researchers to survey the entire transcriptome of virtually any cell population. This capability produces unprecedented quantities of raw data and enables the investigation of gene expression, functional genomics and

range of available surface chemistries. The GPS presents the reactive glycidoxy functional group to which amine-terminated oligonucleotides and cDNA, derived from amine-terminated primers, could be covalently affixed. The APS, PLL and DAB surfaces present varying densities of amine functionalities for hydrogen-bonding interactions with DNA. The RCA-cleaned glass slides served as a reference surface while the TEB immersion deliberately introduced surface contamination to otherwise cleaned glass slide surfaces.
The nonblocked surface served as the control for blocking. These surfaces and blocking strategies were evaluated by fabricating microarrays of cDNA and 30-mer oligonucleotides prepared using the human GAPDH gene sequence. The oligonucleotides and cDNA were spotted at five different concentrations and hybridized to Alexa Fluor 555-labeled GAPDH PCR product. Wettability of the surfaces was determined by contact angle measurements with hexadecane and ultrapure water. Surface morphology was characterized by atomic force microscopy (AFM).

MATERIALS AND METHODS

Cleaning, preparation and surface modification of microarray slides

In a class 1000 clean room, 50 VWR brand glass microscope slides (VWR 48300-025) were solvent cleaned by immersion for 1 min in boiling acetone followed by 1 min in boiling isopropanol. The slides were then washed in ult

Normalization strategies for cDNA microarrays
Johannes Schuchhardt*, Dieter Beule, Arif Malik¹, Eryc Wolski¹, Holger Eickhoff¹, Hans Lehrach¹ and Hanspeter Herzel
Institute for Theoretical Biology, Humboldt-Universitaet zu Berlin, Invalidenstrasse 43, D-10115 Berlin, Germany and ¹Max Planck Institute of Molecular Genetics, Ihnestrasse 73, D-14195 Berlin, Germany

ABSTRACT Multiple Arabidopsis thaliana clones from an experimental series of cDNA microarrays are evaluated in order to identify essential sources of noise in the spotting and hybridization process. Theoretical and experimental strategies for an improved quantitative evaluation of cDNA microarrays are proposed and tested on a series of differently diluted control clones. Several sources of noise are identified from the data. Systematic and stochastic fluctuations in the spotting process are reduced by control spots and statistical techniques. The reliability of slide to slide comparison is critically assessed within the statistical framework of pattern matching and classification.
INTRODUCTION Large areas of medical research and biotechnological development will be transformed by the evolution of high throughput techniques (1-3). Miniaturization and automatization enables the concurrent performance of many thousands or even millions of small-scale experiments on oligonucleotide chips (4,5) or spotted microarrays (6-8). Manufacturing processes and labeling techniques will lead to different performances (9,10) and detection ranges (11), but questions of statistical significance (12,13) and quality control (T.Beissbarth, K.Fellenberg, B.Brors, A.Arribas-Prat, M.J.Boer, V.N.Hauser, M.Scheideler, D.J.Hoheisel, G.Schuetz, A.Poustka and M.Vingron, submitted for publication; 14) are quite similar for the different technologies. Down-scaling of an experiment makes it generally sensitive to external and internal fluctuations (7). Since reliability of interaction patterns extracted from array data is essential for their interpretation (15,16), a reduction in these fluctuations by proper averaging and normalization procedures is of great practical interest (17). We will address this issue in the context of cDNA microarrays, spotted on glass slides and hybridized with a radioactively labeled probe. According to the experimental steps listed in Materials and Methods we will now give a list of the major sources of fluctuations to be expected in this type of microarray experiment. The list addresses fluctuations in probe, target and array 0 MATERIALS AND METHODS Array preparation A complex probe from several mouse tissues was purified and reverse transcribed with radioactively labeled cDNA. Arabidopsis thaliana cDNA (GenBank accession nos AF104328 and U29785) was spiked in a fixed amount for normalization purposes (18). Clones were amplified by PCR reaction, 5-amino-modified for attachment to glass slides, and purified (19). Prior to spotting, glass slides were cleaned and derivatized for covalent attachment of cDNA. 
A 384-pin gridding head (X5251; Genetix, Christchurch, UK) was used for spotting a grid of 384 blocks, each containing 36 spots. All clones were spotted twice within a block (double spotting). Details of the spotting pattern of library and control clones are explained in Figure 1. Altogether nine slides with an identical spotting pattern were produced. The radioactively labeled probe was hybridized on the cDNA array for 10 h at 42°C. For details on spotting technique and hybridization procedures see Eickhoff et al. (20).

Scanning and image processing

Arrays were exposed for 16 h to a Fuji BAS-SR 2025 intensifying screen (Raytest, Germany) and scanned at 25 µm resolution with a Fuji BAS 5000 phosphorimager (Raytest). The image was converted into a table of signal intensities using proprietary software.

Data processing

Intensity data were ordered in a table, each column corresponding to a slide and each row to a spot on the slide. The following normalization procedures were tested for their efficiency:
· no normalization, averaging over k slides;
· normalization by average intensity of control spots (slide-wise normalization) and averaging over k slides;
· division by the intensity of the two constant spots and averaging over k slides (pin-wise normalization);
· slide-wise normalization of the diluted and constant signals, averaging of the dilution and control signals over several slides, then quotient formation (average pin-wise normalization).

RESULTS

Non-specific background and overshining

The level of background noise and the influence of neighboring signal intensities is illustrated in Figure 2. The intensity of background spots is plotted versus the average signal intensity of the four next neighbor spots. The y-axis intercept of the linear regression gives an estimation of the non-specific background. The small background intensity indicates that there are only weak overshining effects for the 6 × 6 spotting pattern.
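The background estimate described above is an ordinary least-squares line fit of background-spot intensity against mean neighbor intensity. The following sketch shows that fit in plain Python. The function names and the toy numbers are illustrative, and the subtraction step in `subtract_overshine` is one plausible way to apply the fit, not necessarily the procedure used by the authors.

```python
def fit_line(x, y):
    """Least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def subtract_overshine(intensity, neighbor_mean, slope, intercept):
    """Remove the fitted background (intercept) and the fitted
    neighbor-dependent contribution (slope * neighbor_mean).
    Assumed correction scheme, for illustration only."""
    return intensity - (intercept + slope * neighbor_mean)


if __name__ == "__main__":
    # Toy data: mean intensity of the four nearest neighbors (x)
    # and intensity measured on empty background spots (y).
    neighbor_means = [0.0, 100.0, 200.0, 300.0]
    background = [5.0, 6.0, 7.0, 8.0]
    slope, intercept = fit_line(neighbor_means, background)
    print(f"overshining slope: {slope:.3f}")
    print(f"non-specific background: {intercept:.3f}")
    print(subtract_overshine(120.0, 200.0, slope, intercept))
```

In this reading the intercept plays the role of the non-specific background and the slope measures how strongly neighboring signal leaks into a spot.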
The regression can be used for correction of the systematic part of these errors. The radius used to quantify spots was varied systematically: for the given spotting density only weak changes are observed if the scanning radius is kept in a reasonable range of about half the spotting distance (data not shown). The magnitude of the background and overshining effects is substantially smaller than fluctuations induced by spotting variabilities quantified below.

Assessment of spotting variabilities

In order to facilitate interpretation of the experimental data we neglect all non-linearities from image processing and assume that hybridization reactions reach mass action equilibrium. Because different spots of a dilution series compete for the same probe, the amount of probe bound in each spot is proportional to the amount of target cDNA present in the spot. The observed signal intensity then reflects the amount of spotted cDNA. Fluctuations in spot size and in the hybridization

Comparison between Different Strategies of Covalent Attachment of DNA to Glass Surfaces to Build DNA Microarrays
Nathalie Zammatteo,* Laurent Jeanmart, Sandrine Hamels,* Stéphane Courtois,* Pierre Louette, Laszlo Hevesi, and José Remacle*

DNA microarray is a powerful tool allowing simultaneous detection of many different target molecules present in a sample. The efficiency of the array depends mainly on the sequence of the capture probes and the way they are attached to the support. The coupling procedure must be quick, covalent, and reproducible in order to be compatible with automatic spotting devices dispensing tiny drops of liquids on the surface. We compared several coupling strategies currently used to covalently graft DNA onto a glass surface. The results indicate that fixation of aminated DNA to an aldehyde-modified surface is a method of choice to build DNA microarrays. Both the coupling procedure and the hybridization efficiency have been optimized.
The detection limit of human cytomegalovirus target DNA amplicons on such DNA microarrays has been estimated to be 0.01 nM by fluorescent detection. © 2000 Academic Press Key Words: glass; functionalization; DNA probe; microarray. 0 DNA chip technology uses microscopic arrays of DNA molecules immobilized on solid supports for biomedical analysis such as gene expression analysis, polymorphism or mutation detection, DNA sequencing, and gene discovery (1). Several approaches can be used to prepare microarrays. DNA can be synthesized in situ on a glass surface using combinational chemistry (2). This method typically produces microarrays consisting of groups of oligonucleotides ranging in size from 10 to 25 bases 0 Copyright © 2000 by Academic Press All rights of reproduction in any form reserved. 0 ZAMMATTEO ET AL. 0 ditions. Covalent binding methods are thus preferred. Usually, DNA is cross-linked by ultraviolet irradiation to form covalent bonds between thymidine residues in the DNA and positively charged amino groups added on the functionalized slides (8). However, the location and the number of fixation sites of the DNA are not well defined so that the length and the sequences available for subsequent hybridization can vary with the fixation conditions. An alternative method is to fix DNA molecules at their extremities. Thus, carboxylated (9) or phosphorylated DNA (10) can be coupled on aminated supports as well as the reciprocal situation (11). Amino-terminal oligonucleotides can also be bound to isothiocyanate-activated glass (12), to aldehyde-activated glass (13), or to glass surfaces modified with epoxide (14). Thiol-modified or disulfide-modified oligonucleotides have also been grafted onto aminosilane via a heterobifunctional crosslinker (15) or on 3-mercaptopropylsilane (16). However, in these cases, the binding at high temperature was unstable. 
Recently, a more elaborate chemistry has been proposed for the construction of tethered molecules on glass to which DNA can be attached (17). A situation in which the accessibility of a tethered single-stranded probe covalently attached to the surface could be combined with the specificity of a long probe would represent a breakthrough in the field of DNA chips. In this paper we compare several methods of covalent coupling of DNA on activated glass, namely, the carbodiimide-mediated coupling of aminated, carboxylated, and phosphorylated DNA on carboxylic acid or amine-modified glass supports and the binding of aminated DNA to aldehyde-activated glass.

MATERIALS AND METHODS

Chemicals and Buffer

2-(N-morpholino)ethanesulfonic acid (Mes) and 1-methylimidazole (MeIm) were from Acros Chimica (Beerse, Belgium). Ethanol, maleic acid, NaCl, and SDS were from Merck (Darmstadt, Germany). 3-Aminopropyltrimethoxysilane, triethylamine solution, undecenoyl chloride, trifluoroethanol, anhydrous ether, trichlorosilane, and hexachloroplatinic acid were from Aldrich Chemical (Milwaukee, WI). NaBH₄, EDC, Tween 20 and streptavidin-Cy3 were from Sigma (St. Louis, MO). NHSS was from Pierce (Rockford, IL). Gloria milk powder was from Nestlé (Vevey, Switzerland). [α-³²P]dCTP was from Dupont de Nemours (Boston, MA). Oligonucleotides were purchased from Eurogentec (Seraing, Belgium). Silylated (aldehyde) and silanated (amine) microscope slides were from Cell Associates (Houston, TX). Untreated glass slides were purchased from Knittel Gläser (Germany). The arrayer used was a Charlyrobot model with 250-µm pins from Genetix (UK). DPX was from BDH Chemicals (UK).

GLASS FOR DNA-BINDING AND HYBRIDIZATION ASSAYS

The carboxylic acid terminal groups were obtained by hydrolysis of ester-functionalized slides by immersion into 8 M HCl solution at 95°C for 2 h.
The samples were then ultrasonically cleaned through three consecutive steps (10 min each) in distilled water and dried under an argon flow. The aldehyde functions were obtained in two steps: the reduction of the ester groups into alcohol groups followed by oxidation by PCC (pyridinium chl

Analysis of repeatability in spotted cDNA microarrays

When referring to a single array, the measured log ratio of a repeatedly spotted clone is then denoted $y_{ij}$, with clone $i$, and repeated spotting $j$ (where $j = 1, \ldots, k_i$). In the context of several arrays, we will use the notation $y_{ij}^{l}$ to denote the measurement in array $l$, with $l = 1, \ldots, d$.

Correlation. For each clone, we calculated the average Pearson product-moment (linear) correlation between pairs of spots across data from the $d$ arrays. If clone $i$ has been spotted $k_i$ times, there will be $k_i(k_i - 1)/2$ distinct pairs in its spot set. For a given pair of spots (denoted $ij$ and $ij'$) we constructed the vectors $[y_{ij}^{1}, \ldots, y_{ij}^{d}]$ and $[y_{ij'}^{1}, \ldots, y_{ij'}^{d}]$, and computed the correlation coefficient with respect to clone $i$ as

$$r_i = \frac{\sum_{l=1}^{d} \left(y_{ij}^{l} - \bar{y}_{ij}\right)\left(y_{ij'}^{l} - \bar{y}_{ij'}\right)}{\sqrt{\sum_{l=1}^{d} \left(y_{ij}^{l} - \bar{y}_{ij}\right)^{2} \, \sum_{l=1}^{d} \left(y_{ij'}^{l} - \bar{y}_{ij'}\right)^{2}}}$$

where $\bar{y}_{ij} = \frac{1}{d}\sum_{l=1}^{d} y_{ij}^{l}$ and $\bar{y}_{ij'} = \frac{1}{d}\sum_{l=1}^{d} y_{ij'}^{l}$ are the means of the two vectors over the $d$ arrays.

For a given clone, the correlation coefficient was calculated for all distinct pairs in the spot set, and the average correlation coefficient was used as an indicator of repeatability for the clone. To assess

Applications of DNA tiling arrays for whole-genome analysis
Todd C. Mocklerᵃ, Joseph R. Eckerᵃ,ᵇ,*

The completion of numerous genome sequences has introduced an era of whole-genome study. Gaining a more complete understanding of the genome's information content will dramatically improve our understanding of various biological processes. In parallel with the sequencing of entire genomes, recent advances in microarray technologies have made it feasible to interrogate an entire genome sequence with arrays.
Such high-density whole-genome DNA microarrays can be used as a generic platform for numerous experimental approaches to decode the information contained within the genome. In this review, we discuss several approaches using high-density whole-genome oligonucleotide microarrays for transcriptome characterization, novel gene discovery, analysis of alternative splicing, mapping of regulatory DNA motifs using the chromatin- 0 researchers to analyze various features of the genome, including evidence of transcriptional activity, binding of transcriptional regulators, and DNA methylation, at high resolution without reference to prior annotations. Other array designs rely on prior genome annotation to interrogate a particular subset of features of an entire genome (Fig. 2C). These arrays are clearly limited by the quality and completeness of the annotations on which they are based. 0 exon-scanning arrays were designed using only known and computationally predicted exons, they were of limited use for discovering novel genes or gene features, such as terminal exons that are often missed by the gene prediction algorithms. For some genomic regions, tiling arrays with partially overlapping (10-base increments) 60-mer probes were used to demonstrate the utility of high-resolution tiling 0 arrays for refining and confirming gene structures predicted by the 0 Defining the sequence-recognition profile of DNA-binding molecules 1 Christopher L. Warren, Natasha C. S. Kratochvil, Karl E. Hauschild, Shane Foister§, Mary L. Brezinski, Peter B. Dervan§, George N. Phillips, , and Aseem Z. Ansari¶ 0 Contributed by Peter B. Dervan, November 11, 2005 0 Determining the sequence-recognition properties of DNA-binding proteins and small molecules remains a major challenge. To address this need, we have developed a high-throughput approach that provides a comprehensive profile of the binding properties of DNA-binding molecules. 
The approach is based on displaying every permutation of a duplex DNA sequence (up to 10 positional variants) on a microfabricated array. The entire sequence space is interrogated simultaneously, and the affinity of a DNA-binding molecule for every sequence is obtained in a rapid, unbiased, and unsupervised manner. Using this platform, we have determined the full molecular recognition profile of an engineered small molecule and a eukaryotic transcription factor. The approach also yielded unique insights into the altered sequence-recognition landscapes as a result of cooperative assembly of DNA-binding molecules in a ternary complex. Solution studies strongly corroborated the sequence preferences identified by the array analysis.
chemical genomics | ligand-DNA recognition
A central goal of synthetic biology, chemical biology, and molecular medicine is the design and creation of synthetic molecules that can target specific DNA sites in the genome (1, 2). Such molecules can be harnessed to regulate biological processes such as transcription, recombination, and DNA repair (1-4). The greatest success in designing molecules with programmable DNA-binding specificity has been with polyamides (2). However, a major hurdle in the design of new classes of sequence-specific DNA-binding molecules is the inability to comprehensively define the full range of their DNA sequence-recognition properties, and therefore, the inability to predict all their potential target sites in the genome. Given the importance of understanding the basis of molecular recognition between DNA and its ligands, several methods have been developed to determine the sequence specificity of DNA-binding molecules (small molecules as well as proteins). The most frequently used approach is the systematic evolution of ligands by exponential enrichment (SELEX), which utilizes selection and enrichment of the DNA sequences that bind with the highest affinity to a molecule of interest (4).
This assay, although highly informative, identifies only the best binding sequences, whereas the less optimal, and often biologically relevant, sequences are missed. Other commonly used biochemical or biophysical approaches are labor-intensive and can be used only to study a limited set of sequence variants (5-10). Medium-throughput microarrays have also been developed in which duplex DNA molecules are immobilized on surfaces and protein binding is detected by surface plasmon resonance (11) or fluorescence (12, 13). Despite such demonstrations of feasibility, technical challenges have hindered the general application of these array platforms. A solution-phase medium-throughput assay utilizes DNA sequence variants presented in distinct wells and protein or small molecule binding detected by displacement of a DNA-intercalating fluorescent dye (14). Each of these medium-throughput approaches, however, is limited to querying DNA sequences with only three, four, or five permuted positions. 0 In a recent approach, a biased microarray bearing only the intergenic regions of yeast chromosome was used to map transcription factor binding sites in vitro (15). These arrays provide a biased binding profile and are limited to organisms with small and well annotated genomes. Another technique that circumvents this problem relies on sonicating genomic DNA into small fragments and adding a transcription factor to isolate putative binding sites (16). However, this method, like SELEX, is likely to overrepresent strong binding sites, thereby providing biased sequence-recognition profiles. These methods are not amenable to an unbiased analysis of the binding properties of small molecule DNA ligands. Chromatin immunoprecipitated (ChIP) DNA analyzed on oligonucleotide microarrays (chip) has also been used to map binding sites for DNA-binding transcription factors (17-19). 
Importantly, ChIP-chip studies have suggested that in vitro affinity of cooperatively binding transcription factors for specific DNA sequences is often recapitulated in the relative occupancy of these sequences in vivo (20, 21). This observation suggests that for a given transcription factor (or a set of cooperatively binding factors), the knowledge of its full sequence-recognition profile, measured in vitro, can be highly instructive in computationally identifying binding sites in the genome. Thus far, in the absence of genome-wide binding and expression data, computational approaches to identifying regulatory sites have been limited to phylogenetic comparisons of conserved noncoding sequences (22). However, unlike proteins, for most DNA-binding small molecules with unknown DNA-binding properties, ChIP-chip analysis is nontrivial, and phylogenetic comparisons are irrelevant. To bridge this gap between computational methods and molecular recognition properties of DNA ligands, we have developed a comprehensive high-throughput platform that can rapidly and reliably identify the cognate sites of DNA-binding molecules. This platform provides an unbiased analysis because it consists of a double-stranded DNA array that displays the entire sequence space represented by 8 bp (all possible permutations equal 32,896 molecules) and can currently be extended to as many as 10 variable base pair positions. We have also developed a systematic approach for treating the array data that can be applied to arrays of greater complexity. Because most metazoan DNA-binding proteins target 6-10 bp (23), and because DNA-binding small molecules rarely exceed 8 bp (24), our cognate site identifier (CSI) arrays should be capable of identifying and ranking sequences preferred by almost any DNA-binding ligand by itself, or, in many cases, in cooperatively binding pairs. 
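The figure of 32,896 molecules for 8 bp can be checked with a short calculation. The sketch below assumes (an interpretation, not something stated explicitly in the text) that a duplex and its reverse complement are the same molecule, so each reverse-complement pair is counted once and self-complementary sequences are counted as singletons. The function names are illustrative only.

```python
from itertools import product

# Watson-Crick complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_unique_duplexes(k: int) -> int:
    """Count k-bp duplexes by brute force, merging each sequence with its
    reverse complement (assumed here to describe the same duplex)."""
    canonical = set()
    for bases in product("ACGT", repeat=k):
        seq = "".join(bases)
        # Keep the lexicographically smaller of the two strand readings.
        canonical.add(min(seq, reverse_complement(seq)))
    return len(canonical)


def closed_form(k: int) -> int:
    """Same count without enumeration: (4^k + P) / 2, where P is the number
    of self-complementary sequences (4^(k/2) for even k, 0 for odd k)."""
    self_complementary = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + self_complementary) // 2


if __name__ == "__main__":
    print(count_unique_duplexes(8))  # brute-force count for 8 bp
    print(closed_form(10))           # count for 10 variable positions
```

Under this assumption the brute-force count for 8 bp agrees with the 32,896 quoted in the text, and the closed form gives the corresponding count for 10 variable positions without enumerating about a million sequences.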
Our approach derives comprehensive binding profiles from a rapid, unbiased, and unsupervised examination of the entire DNA sequence space. These analyses can be extended to DNA-binding proteins from any organism or, in the case of small molecules, used to predict binding sites in any genome.
Conflict of interest statement: No conflicts declared.
Abbreviations: ChIP-chip, analysis of chromatin-immunoprecipitated DNA on oligonucleotide microarrays; CSI, cognate site identifier; PA1, polyamide 1; PA2, polyamide 2; PA3, polyamide 3; Exd, extradenticle; Hox, homeobox transcription factors; Dp, dimethylaminopropylamide; Py, N-methylpyrrole; Py*, Cy3-Py; Im, N-methylimidazole.
by The National Academy of Sciences of the USA
January 24, 2006
APPLIED BIOLOGICAL SCIENCES
Results
averaged intensities were then converted into Z scores [Z = (signal − mean)/standard deviation] to reflect the signal-to-noise ratio (Fig. 2B). Sequences in the highest Z score bin ( 25) were subjected to several motif-searching algorithms (31-33), which identified 5′-
Array Design. The duplex DNA sequences are designed as self-complementary palindromes interrupted at the center by a TCCT sequence to facilitate the formation of DNA hairpins (Fig. 1). The 34-residue oligonucleotide is synthesized directly on the glass surface by using a maskless array synthesizer (25) that can readily create up to 786,000 spatially resolved features. After inducing hairpin formation, we found that 95% of the oligonucleotides in the array form duplexes (see Materials and Methods). In our hairpin design, we added three constant base pairs on either side of the 8 bp that were permuted (N1-N8 in Fig. 1). Previous work shows that this addition is sufficient to buffer the core of the hairpin stem against thermal end-fraying of the duplex and against deviations from B-form DNA resulting from the presence of the loop (26).
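The hairpin layout described above (constant flank, 8 permuted bp, constant flank, TCCT loop, then the reverse complement of the stem) can be illustrated with a minimal sketch. The flank sequences `CGC` and `GCG` are hypothetical placeholders, since the actual constant base pairs are not given in the text; only the TCCT loop and the 3 + 8 + 3 stem layout come from the description. The sketch yields a 32-nt oligonucleotide and does not attempt to account for the remaining residues of the 34-residue design.

```python
# Watson-Crick complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

LOOP = "TCCT"  # central loop sequence named in the text


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hairpin_oligo(core: str, flank5: str = "CGC", flank3: str = "GCG",
                  loop: str = LOOP) -> str:
    """Build stem + loop + reverse-complement(stem) for one 8-bp core.

    flank5 and flank3 are hypothetical placeholders for the three constant
    base pairs on either side of the permuted positions.
    """
    if len(core) != 8 or set(core) - set("ACGT"):
        raise ValueError("core must be exactly 8 bases drawn from A, C, G, T")
    stem = flank5 + core + flank3
    return stem + loop + reverse_complement(stem)


def stem_is_self_complementary(oligo: str, loop_len: int = len(LOOP)) -> bool:
    """Check that the two arms flanking the loop can pair into a duplex."""
    arm = (len(oligo) - loop_len) // 2
    return oligo[-arm:] == reverse_complement(oligo[:arm])


if __name__ == "__main__":
    oligo = hairpin_oligo("AAGAACAA")
    print(oligo, len(oligo), stem_is_self_complementary(oligo))
```

Folding such a strand back on itself at the loop pairs every stem base with its complement, which is the property the array relies on to present each 8-bp core as a duplex.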
There is good evidence that the core of a hairpin stem interacts with proteins and small molecule ligands indistinguishably from DNA duplexes composed of two individual complementary strands (27, 28).
Array Validation Using an Engineered Small Molecule. To test the accuracy and fidelity of the CSI array, we used a polyamide engineered to target a specific DNA sequence (PA1, Fig. 2A). Polyamides are DNA-binding small molecules composed of N-methylpyrrole (Py) and N-methylimidazole (Im) heterocycle rings. The arrangement of the heterocycles (Im or Py) can be programmed to create polyamides that target most naturally occurring 6- to 8-bp DNA sequences (2). PA1, in particular, was designed to target the sequence 5′-WWGWWCWW-3′ (W = A or T) (Fig. 2) (29). A Cy3 fluorescent dye is conjugated to the N-methyl posi
Quantifying DNA-protein interactions by double-stranded DNA arrays
Martha L. Bulyk1, Erik Gentalen2, David J. Lockhart2, and George M. Church1*
We have created double-stranded oligonucleotide arrays to perform highly parallel investigations of DNA-protein interactions. Arrays of single-stranded DNA oligonucleotides, synthesized by a combination of photolithography and solid-state chemistry, have been used for a variety of applications, including large-scale mRNA expression monitoring, genotyping, and sequence-variation analysis. We converted a single-stranded to a double-stranded array by synthesizing a constant sequence at every position on an array and then annealing and enzymatically extending a complementary primer. The efficiency of second-strand synthesis was demonstrated by incorporation of fluorescently labeled dNTPs (2′-deoxyribonucleoside 5′-triphosphates) and by terminal transferase addition of a fluorescently labeled ddNTP. The accuracy of second-strand synthesis was demonstrated by digestion of the arrayed double-stranded DNA (dsDNA) on the array with sequence-specific restriction enzymes.
We showed dam methylation of dsDNA arrays by digestion with DpnI, which cleaves when its recognition site is methylated. This digestion demonstrated that the dsDNA arrays can be further biochemically modified and that the DNA is accessible for interaction with DNA-binding proteins. This dsDNA array approach could be extended to explore the spectrum of sequence-specific protein binding sites in genomes. 0 Keywords: dsDNA arrays, restriction enzymes, DNA-protein interactions 0 Sequence-specific DNA binding by proteins controls transcription1, recombination2, restriction3, and replication4. Sequence requirements are usually determined by assays that measure the effects of mutations on binding of DNA and amino acid residues implicated in these interactions. These assays, which include nitrocellulose binding assays5, gel shift analysis6, Southwestern blotting7,8, or reporter constructs in yeast9, are usually considered too laborious for the analysis of many DNA variants. Therefore, we have developed a highly parallel method for studying the sequence specificity of DNA-protein interactions. We have taken advantage of oligonucleotide arrays, or DNA arrays, that have previously been used for mRNA expression analysis10-12, polymorphism analysis13-16, deletion strain analysis17, and for identifying clones from genetic selections18. However, the arrays used for these purposes contain single-stranded DNA (ssDNA) oligonucleotides, and most sequence-specific regulatory DNA-binding proteins bind double-stranded DNA (dsDNA). Therefore, we present a method for enzymatically converting ssDNA arrays into arrays of duplex DNA. Sequence-specific digestion at the cognate restriction sites has been demonstrated using restriction-enzyme digestion of dsDNA arrays. In addition, we show that the dsDNA can be altered biochemically. 
Arrays of biochemically modified DNA may be useful for applications that seek to determine the effects of modifications, such as methylation, on sequence-specific binding. The results presented here suggest that these dsDNA arrays will be well suited for the analysis of DNA-protein interactions, particularly for the discovery of the sequences recognized by transcription factors and the quantitative assessment of those important interactions.
Results and discussion
Second-strand synthesis. ssDNA arrays were made on an Affymetrix (Santa Clara, CA) DNA array synthesizer. A constant sequence was synthesized before any variable sequences were introduced, and these strands were used as templates for enzymatic second-strand synthesis. A primer complementary to the constant sequence was used in primer extension reactions, producing all the second strands on the array in a single enzymatic reaction. For our experiments, there are a number of advantages to creating dsDNA via primer extension instead of by chemically synthesizing single-stranded, self-complementary oligonucleotides19. First, 5′-(4,4′-dimethoxytrityl) (DMT) synthesis occurs with higher efficiency than that achieved with light-directed, 5′-(α-methyl-2-nitropiperonyl)oxycarbonyl (MeNPOC)20,21 synthesis. Therefore, longer strands of dsDNA can be made because only half as many nucleotides need to be produced by light-directed synthesis when the complementary strand is created via primer extension. Second, the exact complement of each template strand, including any degenerate nucleotides synthesized into the first strand, will be made because the Klenow fragment of DNA polymerase I is a highly processive polymerase with an error rate of approximately 10⁻⁵. Third, this mode of second-strand synthesis ensures a low mismatch rate as creation of dsDNA does not rely upon annealing a complex mix of exogenous complementary sequences.
To verify initially that the primer was annealing to all sequences, a fluorescein-labeled primer was hybridized to the array, and signal intensity was seen over the entire chip (data not shown). Subsequently, unlabeled primers were used in all primer-extension reactions. To confirm enzymatic extension of the primer, we included fluorescein-labeled dATP in a reaction along with unlabeled 2′-deoxyribonucleoside 5′-triphosphates (dNTPs) (Figs. 1 and 2A). As expected, there tended to be higher signal intensity in features with a greater proportion of adenine in the second strand (Fig. 3B). Of the features with identical subsites, those with longer spacers had higher signal intensities, as expected, because longer spacers allowed a greater number of fluorescein-labeled dATPs to be incorporated.
[Figure: schematic of an arrayed duplex, from the glass surface outward: HEG synthesis linker, constant priming sequence with annealed primer, proximal flanking sequence, half-site, spacer, half-site, distal flanking sequence. Panels: second-strand labeling by incorporation of fluorescein-labeled dNTPs (Klenow exo− polymerase, unlabeled and fluorescein-labeled dNTPs); second-strand labeling by terminal transferase addition of fluorescein-labeled ddNTPs after primer extension.]
The duplex DNA also can be end-labeled after synthesis (Fig. 2B) instead of being labeled by incorporation of fluorescein-tagged dNTPs. In this scheme, only unlabeled dNTPs were used in the primer-extension reactions. The 3′-ends of the newly synthesized strands were then end-labeled by addition of fluorescein-labeled ddNTP with terminal transferase (Fig. 4A). Only the 3′-ends of the second strands were available for addition in these terminal transferase reactions, because the 3′-ends of the first strands were covalently attached to hexaethylene glycol (HEG) linkers.
The observed variation in signal intensity from row to row was due to either different synthesis efficiencies or different efficiencies of terminal transferase addition for different sequences.
Restriction enzyme digestion. To determine whether the duplex DNA was both physically accessible and of proper structure for interaction with a protein, we digested dsDNA arrays with a restriction enzyme. This also confirmed that the second strands were synthesized correctly. A restriction enzyme with a 4 bp recognition site was chosen because the two subsites on the arrays were each either 3 or 4 bp long, although the design of the array can be changed according to the particular type of restriction enzyme being studied. The fluorescein-labeled dNTP included in the primer-extension reaction was chosen to be distal to the cleavage site (relative to the glass surface), so that after digestion the fluorescent label that had been incorporated into the second strand would be released (Fig. 3A). For end-labeled dsDNA arrays, the signal was distal to the cleavage site irrespective of the restriction site. Strand density and the distance of the strands from the array surface were varied to measure the effects of accessibility of the DNA strands for primer-extension reactions and enzymatic digestions. The distance from the surface was varied using either one or two HEG linkers. The two HEG linkers were expected to make the duplex DNA more flexible and more accessible by reducing steric hindrance from the glass surface and neighboring molecules. An array with variable densities and number of linkers was extended in the presence of fluorescein-labeled dATP, then digested with RsaI (Fig. 3B). As RsaI digestion leaves blunt ends between the T and the A of its recognition site (5′-GTAC-3′), incorporated label is lost with the portion of the strand that is released. Signal intensity loss was evaluated by calculating a z score for each feature.
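The exact formula used for this z score is not given here, so the following is only a minimal sketch of one plausible form. It assumes that each feature's loss is its intensity before digestion minus its intensity after, and that the loss is standardized against the mean and sample standard deviation of the losses over the whole array. The function name and input layout are illustrative assumptions.

```python
from statistics import mean, stdev


def loss_z_scores(before, after):
    """Per-feature z score of signal loss (assumed definition).

    before, after: equal-length sequences of feature intensities measured
    before and after digestion. The loss of each feature is standardized
    against the array-wide mean and sample standard deviation of the losses.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length inputs with at least 2 features")
    losses = [b - a for b, a in zip(before, after)]
    mu = mean(losses)
    sigma = stdev(losses)
    if sigma == 0:
        raise ValueError("all features lost the same amount of signal")
    return [(loss - mu) / sigma for loss in losses]


if __name__ == "__main__":
    # Three features lose 10 units; one feature loses 90 units.
    print(loss_z_scores([100, 100, 100, 100], [90, 90, 90, 10]))
```

Because the array-wide mean loss is subtracted first, a uniform drop in intensity shared by all features contributes nothing to the score, and only features that lose more signal than the array as a whole receive a large positive value.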
This statistic measures the amount of signal intensity loss beyond that due to photobleaching or other effects that might cause general signal intensity loss over the whole array. The average z score in the 30 features containing the RsaI recognition site was 7 (p 0 New developments in microarray technology Dietmar H Blohm* and Anthony Guiseppi-Elie 0 Microarrays have emerged as indispensable research tools for gene expression profiling and mutation analysis. New classification of cancer subtypes, dissecting the yeast metabolism and large-scale genotyping of human single nucleotide polymorphisms are important results being obtained with this technique. Realizing the microsphere-based massively parallel signature sequencing technique as fluid microarrays, building new types of protein arrays and constructing miniaturized flow-through systems, which can potentially take this technology from the research bench into industrial, clinical and other routine applications, exemplify the intense developments that are now ongoing in this field. 0 as from more than 200 companies worldwide engaged in the development and application of this technology. The scope of this review is therefore restricted to some examples of recent technical advances and research applications, and is focused on current trends in the movement of the microarray from being a purely research method to becoming an analytical instrument applicable in the clinic as well as in industry. 0 The present state of microarray technology 0 Working with microarrays requires the combination of at least five different components [8]: the chip itself with its special surface; the device for producing microarrays by spotting the nucleic acids (probes) onto the chip or for their in situ synthesis; a fluidic system for hybridization to target DNA; a scanner to read the chips; and sophisticated software programs to quantify and interpret the results. 
Additional tools are required for extracting nucleic acids from biological material to prepare them for the analysis. For each of these components special equipment is now commercially available. In addition, microarray components or complete systems, ready-to-use gene collections and PCR product libraries of cDNA and even comprehensive microarray studies are commercially offered as services (for details see [3,9]). Usually, the different systems show very different levels of reliability and reproducibility, are not compatible with each other and require a skilled scientist to setup, commission and even to routinely run them. The value of microarray experiments still depends critically on the quality of arraying, recently made possible by bubble jet technology [10·] or maskless in situ synthesis of oligonucleotides [11··]. Microarray experiments also depend on probe and target preparation, experimental variations during hybridization and specifically on the selection of the nucleic acids affixed to the microarray surface. Further, microarray experiments depend on the homogeneity of the surface and linking chemistries on the chip [12] as well as on background and overexposure problems during image processing [13]. Based on improvements in microarray surface chemistry [14,15·], scanner technology and software developments, quantitative changes in transcription activity can now be measured reproducibly in the range twofold or less, except in the case of low abundant mRNAs. However, technical standards or established procedures for the exact comparison of the different technical systems or among different approaches, such as cDNA-arrays versus oligonucleotidearrays [16,17], are still missing. Now, as before, the microarray field is moving very fast and new technical approaches and applications are emerging continuously. A remarkable recent advance is the development of `fluidic' microarrays, a system for massively parallel signature sequencing (MPSS). 
Millions of DNA-signatured microbeads, each 0 Analytical biotechnology 0 carrying a different cDNA attached by in vitro cloning, are repeatedly cycled between restriction type II cleavage, ligation steps and hybridization reactions to add decoder probes for reading the signatures. The number of microbeads carrying identical cDNAs are then counted by imaging them onto a charge-coupled device camera using a flow cell. Because ~250,000 microbeads are processed at once, even rare mRNAs can be assessed without prior knowledge of their sequence [18··]. Microbeads are also employed to attach molecular beacons that produce a fluorescence signal after binding of (unlabeled) target molecules. By encoding them with a particular dye signature, >107 randomly ordered microbeads can be analyzed simultaneously in a high-density fiber array using an imaging fluorescence system [19·]. To increase the sensitivity of microarrays a new `scanometric' detection system based on gold-nanoparticle-promoted silver reduction has been reported to be 100 times more sensitive than fluorescence measurement [20··]. As a method connecting genomics and proteomics (for review see [21]) microarray technology has also been used for large-scale peptide and protein analysis [22]. New protein microarrays can be used instead of the yeast two-hybrid system for in vitro analyzing protein-protein interactions, for identifying protein kinase substrates and for measuring interactions between proteins and low-molecular weight molecules and even low-affinity interactions [23,24·]. In addition, the microarray technique has been used to screen >18,000 antibodies against 15 different antigens in one experiment using high-density gridding of bacteria containing antibody genes and testing them using a solid-phase enzyme-linked immunosorbent assay (ELISA) [25]. 
Single-stranded nucleic acids coupled to proteins have been used to convert DNA microarrays into protein microarrays in a one-step, self-assembling hybridization process [26] and plasma polymerized protein films have been used to fabricate DNA-arrays [27·]. Another area of noteworthy advance, and one that has long been neglected, is the proper identification of sources of noise, error analyses and quantitative treatments of systematic and stochastic errors in DNA microarray analyses [13].
and ovarian tissue used in the National Cancer Institute for anti-cancer drug screening revealed clearly distinguishable profiles if assayed with 9703 human cDNA probes [33]. Fundamentally new insights have also been obtained in studies comparing highly and less metastatic melanoma cells [34], tumor and normal colon tissue [35], and acute myeloid leukemia versus acute lymphoblastic leukemia [36··]. Whether some results of this kind might be questionable has to be clarified, because aneuploidy was shown to lead to spurious correlation among expression profiles and to be more widespread than expected [37]. Full-genome expression profiles from 300 different mutants, physiological situations or chemical treatments of a yeast culture have been measured from 4553 genes and compared with 63 such profiles of an isogenic strain grown under standard conditions [38··]. The resulting 'compendium' database allowed the monitoring of hundreds of different cellular functions as one single assay using the microarray. This database was used to estimate that under constant conditions the level of gene induction or repression natively fluctuates in the range of twofold, but also to identify eight yeast ORFs as being involved in ergosterol biosynthesis, cell-wall structure, mitochondrial function or protein synthesis.
In addition, this database allowed the discovery that the cellular target of the anesthetic drug dyclonine in humans is the neuroactive sigma factor, which shows the greatest sequence homology to the affected yeast gene erg2p. Using the method of singular value decomposition (SVD), the complexity of large sets of microarray expression data can be reduced to show that the 'music of genes is orchestrated' through a few simple underlying patterns [39]. Meanwhile, experiments including up to 15,000 genes and more have been carried out to analyze the susceptibility of murine B cell lymphoma to apoptosis after irradiation [40], to characterize the different gene activities between placenta and embryos in mice [41], to measure the response of the human intestinal cells to infection with Salmonella bac
Making and Using DNA Microarrays: A Short Course at Cold Spring Harbor Laboratory
David J. Stewart
Meetings and Courses, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724 USA
conundrum is familiar. You are sent back in time to the Middle Ages with no artifact from the present, brought before the local ruler, and given 24 hours to prove you are indeed from the future, to impress the ruler and his advisors in some way, before you are executed in some suitably hideous fashion. What do you do? Toying with this conundrum reveals how little we know in a practical sense about the everyday items that surround us. Can you fix your car and your computer? My guess is that few, if any, readers can do so. And so it was with some trepidation that Cold Spring Harbor Laboratory agreed to host a short course in the Fall of 1999, funded in part by the National Cancer Institute, in which students, primarily biologists, would not only print, use, and analyze DNA microarrays but would start the course by building the machines used to print the arrays. For some time, Patrick Brown and colleagues (Chu et al. 1998; DeRisi et al. 1997; Lashkari et al.
1997) at Stanford had been advocating the idea that smaller laboratories could enter the fray and hype surrounding these emerging microarray technologies by building machines rather than by buying them, a self-help philosophy that was strengthened by the Brown laboratory's web-based publication in June 1998 of the MGuide, a step-by-step guide to construct the arrayer, complete with parts list. Indeed, a number of laboratories have gone ahead and built their own machines. Commercial vendors already offer some solutions for investigators interested in studying changes in genomewide gene expression. Efforts by Steve
from similar restrictions to the Affymetrix approach in terms of which genes the companies decide to array. Many of these products consist of low thousands, hundreds, or even tens of arrayed sequences. Meanwhile, a third approach, midway between the second strategy and the purist Stanford approach, is to buy an arrayer from a commercial vendor such as Cartesian Technologies (Irvine, CA), and then make the DNA chips de novo. This offers flexibility to the investigator in terms of which sequences are arrayed, and the technical support of the vendor in case the printing robot breaks down or becomes unaligned--printing tens of thousands of discrete DNA "features" requires that these arrayers are tightly aligned in both horizontal directions. However, these arrayers have specifications no better and are currently at least twice the cost of home-built machines. This brings us back to the Stanford approach--build the machines from scratch. And to our own trepidation, could a group of 16 biologists--selected from a pool of >125 applicants on the basis of their biological interests rather than their machining skills--actually build the machines, albeit with expert guidance from members and former members of the Brown and Botstein laboratories in Stanford, such that they could be used to print high-density DNA microarrays (Table 1)?
As is usual for Cold Spring Harbor courses, the students included laboratory heads, senior scientists, and postdocs, plus two from Britain, and one each from Sweden, Germany, and New Zealand, with the remainder coming from academic laboratories in the United States with widespread interest in topics ranging from the cell cycle, origins of replication, cancer (and the development of anti-cancer vaccines), signal transduction, apoptosis and neurobiology. Preference was given to individuals whose applications strongly suggested that they would move swiftly to develop and apply this technology at their home institutions and make it available to other investigators. The explicit intention was to spread the application of these techniques as widely as possible, both geographically and scientifically.
Genome Research
Table 1.
Juerg Baehler
Arul Chinnaiyan
David Collingwood
Bruce Futcher
Janet Hager
Christian Kaltschmidt
Thomas Kocarek
Maria Lagerstrom-Fermer
Matthias Lorenz
Donald Love
Michele Marron
Vivek Mittal
Daniel Notterman
Michael Ryan
Arthur Thompson
Sudha Veeraraghavan
Instructors: Ash Alizadeh (Stanford), Patrick Brown (Stanford), Max Diehn (Stanford), Michael Eisen (Lawrence Berkeley National Laboratory), Jo DeRisi (UCSF), and Paul Spellman (Stanford).
The students assembled at Cold Spring Harbor Laboratory on the night of October 19 to begin the 2-week course, and began building the arrayers the next morning. With one arrayer built in advance by Vishy Iyer and Jo DeRisi, a lead instructor in the course, serving as a guide, the students were able to build three complete machines by the third day of the course--these were long 16-hour days--despite "teething problems" in terms of broken or malfunctioning components (Fig. 1). Predictably, the students learned more from the problems that they encountered than an error-free assembly of the equipment might have offered.
By the fourth and fifth days, the course was printing duplicate arrays of the entire 6200-gene set of Saccharomyces cerevisiae, chips valued at several tens of thousands of dollars at current commercial prices, using clones 0 reduced by increasing the number of replicate arrays or even by altering the pattern of printing. With sufficient arrays printed and available for experimentation, the students were ready to prepare samples for hybridization. Regardless of how DNA microarrays are fabricated, at this point methods for using these arrays start to converge, particularly in terms of gene expression analysis. Because of the enormous variation in the number of mRNA molecules being analyzed, and because of the complexities of the hybridization kinetics of individual DNA sequences, microarrays are used to measure the ratio between a reference and a sample, typically labeled with green and red fluorescent dyes, rather than the absolute quantity of transcript. It is for this reason that raw array data are typically represented as a grid of dots of varying intensities of red, yellow and green. The individu 0 REVIEW Experiments using microarray technology: limitations and standard operating procedures 1 T Forster, D Roy and P Ghazal 0 Abstract Microarrays are a powerful method for the global analysis of gene or protein content and expression, opening up new horizons in molecular and physiological systems. This review focuses on the critical aspects of acquiring meaningful data for analysis following fluorescence-based target hybridisation to arrays. Although microarray technology is adaptable to the analysis of a range of biomolecules (DNA, RNA, protein, carbohydrates and lipids), the scheme presented here is applicable primarily to customised DNA arrays fabricated using long oligomer or cDNA probes.
Rather than provide a comprehensive review of microarray technology and analysis techniques, both of which are large and complex areas, the aim of this paper is to provide a restricted overview, highlighting salient features to provide initial guidance in terms of pitfalls in planning and executing array projects. We outline standard operating procedures, which help streamline the analysis of microarray data resulting from a diversity of array formats and biological systems. We hope that this overview will provide practical initial guidance for those embarking on microarray studies. 0 experiments with each chip hybridised with experimental and reference samples, thought must go into the correct selection of the reference material to ensure biological relevance to the study. Due consideration must be given to whether material is pooled or individually sampled. The entire planning stage is as important as the subsequent implementation (see below) and omissions at this stage can easily lead to non-representative or false results. Planning of a study benefits from multiple inputs from biological researchers as well as statistician/bioinformaticians with experience in microarray technology. Experimental sampling and extraction of RNA is a vitally important component of this process since successful microarray studies are dependent on the consistent extraction of high quality RNA. In broad terms, microarrays are performed on two basic biological systems: simple and complex. Simple biological systems are those where homogeneous cell populations are present, such as cell lines or purified cell populations. Sampling from simple systems is more likely to represent the expression level for the particular cell or tissue under study. Complex systems are typified by tissues and organs where there is a diversity of cellular substructures and mixed cellularity. 
Extraction of RNA from complex systems means that critical spatial and cellular information as to the origin of the signal is lost. This reduction of contextual information makes 1 T FORSTER 0 and others 0 Microarray standard operating procedures 0 gram) quantities of RNA are gleaned from these sampling strategies - quantities that are usually too small for conventional labelling strategies. New amplification methods for the labelling of minute quantities of RNA are now being employed. However, it is becoming increasingly evident that even highly purified cell populations and apparently homogeneous cell lines may demonstrate complexity of phenotype and metabolism at the individual cell level. This variation is likely to encompass differences in RNA turnover, sublocalisation, splicing and translational activity. This only serves to highlight the importance of standardising culture and purification methods as rigorously as possible to achieve consistency during sampling and extraction phases. Regardless of the RNA sampling methods employed, it is important to apply rigorous quality control to purified RNA populations. For instance, the Bioanalyser system from Agilent Technologies (Cheadle Royal Business Park, Stockport, Cheshire, UK) is now commonly employed to check the quality and consistency of RNA samples. The resulting absorbance profile provides a useful means of assessing the suitability of RNA for labelling. At this stage, consistency during labelling and hybridisation steps is the starting point for the generation of consistent array data (Hegde et al. 2000). The selection and production of the correct array format is important and a central feature of the process. The majority of custom arrays are produced by the direct deposition of nucleic acid probes as cDNA or long oligomeric sequences onto treated glass substrates.
The production of reproducible arrays with current pin printing methods is challenging. In our own Centre we have introduced a number of quality control steps to ensure consistency of array production, but these are outside the scope of this review. An essential theme is the requirement for microarray data to be compliant with MIAME (minimum information about a microarray experiment) (Brazma et al. 2001). In essence, this addition of standardised information about all stages of a microarray experiment allows for amalgamation of array data from different groups and sources in the public domain, ultimately permitting advanced and automatic data mining. Accordingly, there is an absolute necessity for the implementation of M-SOPs. The M-SOPs outlined here aid in the production of standardised project documentation, which ensures MIAME compliance for publication. In the following sections we outline in more detail the analytical steps of the workflow. Data Generation and Validation The chronological order of processes in a microarray project utilising customised arrays is given in Fig. 1. Approaches for individual and combined processing and analysis steps have recently been reviewed (Nature Genetics 2002, Speed 2002). Array scanning and image quantification The process of scanning an array is known as image acquisition, whereas the process of converting images to numerical data is referred to as image quantification or processing. The majority of microarray experiments involve the fluorescent detection of hybridised signal using confocal laser scanners. A wide variety of different scanning instruments are available, and a number of different image acquisition and quantification packages are associated with them. In general, selection of image quantification parameters (e.g. `adaptive', `fixed circle', `spot distance') should be carefully assessed and decided for each project as a whole, and will depend on array design, slide type and spot morphology.
As an exception to this, a limited form of manual input is often required to fine-tune the layout of the template quantification grid for individual arrays and care should be taken to avoid user bias. Apart from this limited fine-tuning, it should be noted that the image quantification method should be identical for all slides constituting a project, whereas image acquisition parameters, for instance laser power and/or photomultiplier settings, can be optimised from slide to slide. For a comparative discussion of issues concerned with statistical image a 0 TRENDS in Biotechnology 0 directed towards improvement of agricultural qualities, perhaps these goals can be combined to increase tolerance to temperature extremes, salinity, flooding, or insect pests in plants capable of pollutant detoxification or, more importantly for value enhancement, to transfer phytoremediative traits to elite plant cultivars having the highest biomass or agricultural productivity. Obviously, concerns about contaminant uptake and accumulation will limit the use of phyto-crops for food or human contact products, so every effort must be made to identify parent compound fate and toxicity for these applications. However, as observed with the development of chemopreventative-enriched, Se-hyperaccumulating plants, opportunities exist to combine pollutant decontamination capabilities with beneficial human and ecological health qualities in engineered plants. 0 see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.tibtech.2004.08.003 0 Exploring the post-transcriptional RNA world with DNA microarrays 1 Vishwanath R. Iyer 0 Genomic approaches are valuable for understanding the complex layer of gene regulation that involves the control of RNA processing, localization and stability. Recent work provides a prime example of the power of unbiased microarray-based methods to discover unexpected functions for proteins in the RNA world.
The challenges ahead relate to extending such approaches to larger genomes and to integrating this type of information with that generated by standard expression profiling. 0 Although gene expression is often regulated by transcription factors at the level of transcription initiation, the subsequent steps of RNA processing, turnover, subcellular localization and entry into the translation machinery strongly influence the extent of protein translation and the function of encoded proteins. Such post-transcriptional steps therefore have marked effects on the expression and function of genes in processes as diverse as cytokinesis, early embryonic development and neuronal function [1]. When trying to infer the global phenotypes of cells from large-scale mRNA expression profiling data, it is important to be aware of this intervening layer of gene regulation. Most post-transcriptional events are mediated by the association of RNAs with specific proteins or macromolecular protein complexes. Comprehensive determination of the RNA targets of RNA-binding proteins is therefore likely to be important in deciphering the complex events at this level of gene regulation. The La protein is a conserved eukaryotic protein that is thought to be important in the realm of post-transcriptional regulation and, as we discuss here, a recent study by Inada and Guthrie [2] provides a prime example of the use of a genomic approach to elucidate the targets and potential function of such an RNA-binding protein. Ribonomics with cDNA microarrays cDNA microarrays have been heavily used for quantitative mRNA profiling, but there are increasing examples of the varied use of cDNA microarrays to follow the fates of mRNAs in the cell after they are made, rather than to measure only their steady-state levels. One objective is to determine the binding targets of proteins that interact with RNAs at any point during the lifetime of the RNA.
Protein-RNA interactions represent one of the most abundant categories of molecular interactions in cells, and the total number of RNA-interacting proteins rivals that of other categories such as transcription factors and signaling molecules, even if one excludes the hundreds of proteins that are integral components of the spliceosome and ribosome [3,4]. Proteins can interact with an RNA from the time that it is transcribed, and they affect transcriptional efficiency, capping, 3′-end processing, splicing, nuclear export, subcellular localization, translation and turnover of RNA [5]. The sheer diversity, cell- and tissue-specificity, and conservation of RNA-binding proteins have led to the notion that primary transcripts, rather than advancing smoothly through each of the subsequent RNA processing steps, participate in a complex network of regulatory processes at the post-transcriptional level [6]. Clearly, identifying the RNA targets of specific RNA-binding proteins is likely to be at least as informative and important with regard to understanding global gene regulation as is measuring changes in steady-state levels of RNAs in response to cellular signals. The genomic strategy for determining the RNA partners of RNA-binding proteins involves immunoprecipitation of the protein of interest along with its associated RNA, fluorescent labeling of the enriched RNA (as cDNA), and finally microarray hybridization in conjunction with an appropriate reference probe (Figure 1). This approach was first used independently in the laboratories of Ron Vale [7] and Jack Keene [8] and was termed `ribonomics' by the latter. Variations of this method have been subsequently used to identify the targets of more than a dozen RNA-binding proteins (see Gerber et al. [9] and references therein).
The function of La in the cell A prime example of the power of ribonomics has been provided recently by Maki Inada and Christine Guthrie [2] in their analysis of the function of the La protein in yeast. La is a ubiquitous, nuclear RNA-binding protein that is conserved among eukaryotes. It is known to associate with the 3′-UUU-OH co 0 YOUNG INVESTIGATOR PERSPECTIVES DNA Microarray Analyses of Circadian Timing: The Genomic Basis of Biological Time 1 G. E. Duffield 0 Department of Integrative and Molecular Neuroscience, Division of Neuroscience and Psychological Medicine, Faculty of Medicine, Imperial College London, London, UK. Key words: circadian rhythm, microarray, clock gene, gene expression, clock controlled gene. 0 Abstract Many aspects of physiology and behaviour are organized around a daily rhythm, driven by an endogenous circadian clock. Studies across numerous taxa have identified interlocked autoregulatory molecular feedback loops which underlie circadian organization in single cells. Until recently, little was known of (i) how the core clock mechanism regulates circadian output and (ii) what proportion of the cellular transcriptome is clock regulated. Studies using DNA microarray technology have addressed these questions in a global fashion and identified rhythmically expressed genes in numerous tissues in the rodent (suprachiasmatic nucleus, pineal gland, liver, heart, kidney) and immortalized fibroblasts, in the head and body of Drosophila, in the fungus Neurospora and the higher plant Arabidopsis. These clock controlled genes represent 0.5–9% of probed genes, with functional groups covering a broad spectrum of cellular pathways. There is considerable tissue specificity, with only approximately 10% rhythmic genes common to at least one other tissue, principally consisting of known clock genes. The remaining common genes may constitute genes operating close to the clock mechanism or novel core clock components.
Microarray technology has also been applied to understand input pathways to the clock, identifying potential signalling components for clock resetting in fibroblasts, and elucidating the temperature entrainment mechanism in Neurospora. This review explores some of the common themes found between tissues and organisms, and focuses on some of the striking connections between the molecular core oscillator and aspects of circadian physiology and behaviour. It also addresses the limitations of the microarray technology and analyses, and suggests directions for future studies. The circadian timing system 0 Circadian rhythms are endogenous, near 24-h rhythms of physiology and behaviour generated by underlying genetic feedback loops occurring in a majority of organisms from prokaryotes to humans (1). Their importance to human health is becoming apparent, such as in the increasing occurrence of shift work and jet-lag (2), sleep syndromes (e.g. advanced sleep phase syndrome) (3), and in the connection of clock genes with cell division, tumour development and DNA damage–response pathways (4). It has long been appreciated that many neuroendocrine systems are regulated on a circadian basis, examples being rhythms of plasma melatonin and cortisol, behavioural parameters such as sleep onset and offset, cognitive attention, and physiological parameters such as core body temperature and urine output (5). In mammals, the master clock resides in the paired suprachiasmatic nuclei (SCN) of the hypothalamus (6, 7). Studies monitoring electrical activity of single dissociated SCN neurones have revealed that this oscillator mechanism resides within individual cells (6). These oscillators consist of interconnected molecular feedback loops composed of a positive loop, where activators drive the transcription of gene products, which feed back to repress the transcription of themselves and/or other core oscillator molecules (1, 6).
The core feedback loops of the mammalian clock consist of three period (Per) genes and two cryptochrome (Cry) genes (negative loop) and PAS domain proteins Bmal1 and Clock (positive loop) (6). The Per and Cry genes are activated by CLOCK:BMAL1 heterodimers. The PER and CRY proteins are then translated in the cytoplasm where PER1 and PER2 are phosphorylated by casein kinase Iε/δ. The phosphorylated PER proteins dimerize with CRY1 allowing them entry into the nucleus, where CRY1 is proposed to repress the activation of 0 DNA microarrays and their application to circadian biology 0 A number of central questions regarding circadian biology are amenable to investigation by DNA microarray technologies. (i) Although there exists considerable knowledge about the core oscillator mechanism, and some of the physiological and behavioural processes that are under circadian control, little is known about the connection between the oscillator and down-stream biological processes that are under clock control. Profiling of gene expression over several days can identify novel downstream genes 0 Computational Approach to Systems Biology: From Fraction to Integration and Beyond 1 Pawan K. Dhar, Hao Zhu, and Santosh K. Mishra* 0 Abstract--Systems biology is an approach to understanding the workings of whole biological systems. The various methods used for systems analyses range from experimental to computational. In this paper, we describe basic concepts of systems biology, modeling challenges that arise from the massively parallel interaction among components in biological systems, and what lies beyond integration of modular knowledge. Index Terms--Cellular automata, modeling, signaling pathway, systems biology. 0 I. ORIGIN OF SYSTEMS BIOLOGY 0 BIOLOGY IS systems by default. Surprisingly, biology has hardly been practiced that way.
Reductionism has been the dominant approach of experimental biologists, who like to: 1) reduce a problem into components (or modules); 2) integrate the modular knowledge by using assumptions; and 3) iterate reductionism and integration till a reasonably good understanding of the system appears. This classical way of doing biology was successfully practiced till recently, when researchers shifted focus from reductionism to integration. The advent of high-throughput technologies, such as microarrays that simultaneously measure thousands of gene expression profiles, significantly influenced the move toward a systems approach. However, going by the documented literature, the concept of a systems approach was born more than seven decades ago. In the early part of the last century, von Bertalanffy described a system as a group of dynamic and mutually interacting parts and processes and argued that the fundamental task of biology was to discover laws of biological systems [1]. In the 1940s, Wiener (1894-1964) searched for general biological laws using "cybernetics" as a guiding principle. Cybernetics is a field that describes common factors of control and communication in automatic machines, organizations, and living organisms [2]. This was the first attempt to look at biological complexity from a computational standpoint. Based on his work on communication engineering during World War II, Wiener proposed a common conceptual framework from men to machines. Though his contributions in communication engineering are well known, his discrete contribution in biology has largely been unappreciated, due to the unavailability of relevant biological data and unvalidated biological models at that time. Throughout the 1960s and 1970s, researchers from the fields of mathematics and engineering continued their hunt for mathematical and physical principles of biological systems, but faced similar problems of data scarcity and model validation.
However, a much bigger issue was the lack of understanding of the fundamental properties of living systems, i.e., dynamic and nonlinear behavior. Due to this reason, initial modeling efforts were helpful only to the extent of simulating isolated events without explaining their fundamental principles. The proposition of biochemical system theory and metabolic control theory sparked a renewed interest in this field [3]-[7]. With an enormous increase in computational bandwidth, the capability of solving large-scale mathematical equations registered a significant jump, making it a routine task to build large and complex biological models using mathematical equations. Recently a paradigm shift in biology, i.e., from low throughput, single investigator driven to high throughput, consortia driven, has occurred [8]. In parallel to these changes, a new era of modeling efforts, with novel strategies and methods, has emerged. Starting from the classical ordinary differential equations, new mathematical representations have been invented [9]-[12], broadening the area and encouraging more applications ranging from basic sciences to drug discovery. Even though systems biology has found widespread acceptance among researchers, a few fundamental issues remain. One of the main concerns has been the meaning and application of the "systems biology" itself. There is also an apprehension whether the term has gotten well ahead of the science. Though the term "systems biology" was coined many decades ago, Hood brought it into mainstream science few years back [13]. Alternative terms like network biology, integrative biology, or interactive biology have also been proposed. We hold the view that systems biology is a new way of doing biology, starting with experimental knowledge, passing through in silico modeling, and finally returning to biological experiments. Systems biology is an approach that works best when integrated with experimental biology. 
In this paper, we try to assess the role of computation in moving the biological knowledge from fraction to integration, the key features that differentiate systems biology from traditional biology. 0 IEEE 0 DHAR et al.: COMPUTATIONAL APPROACH TO SYSTEMS BIOLOGY 0 II. INTRODUCTION AND TERMINOLOGY In 2003, as the scientific community was commemorating the golden jubilee of the discovery of DNA's double helical structure, the question "what next?" was raised. There was a general consensus that transcriptomics and proteomics are much more challenging than genomics--a problem thought to be most demanding till recently. An accelerated postgenomics effort was triggered mainly due to the invention of high-throughput technologies. It is unlikely that a morass of data produced by sequencing, microarray, and gene knockout experiments can be fully captured with the current tools and technologies. The pressing need is not the quantity of data but their quality and semantics--something that cannot be addressed by a divide-and-conquer approach alone. Knowledge from modular biology gathered from "isolated" systems is conceptually crosslinked to create a molecular level and, by extension, cell, organ, and even organism level understanding. However, very often knowledge acquired through such an approach comes with exceptions and gaps. For example, Mendelian laws of inheritance apply in all conditions except when traits are multifactorial, in which an additive effect predominates. Another example is the coexistence of dominant alleles in the ABO blood group in humans. Likewise, the expansion of triplet repeats (CGG) is an "anticipation" phenomenon that sometimes results in neuromuscular diseases. Added to this is the "intramodular" inaccuracy and incompleteness of data. Thus, to gain a holistic view of cell transactions, the classical reductionism needs to be supplemented with an approach that builds the system bottom up and analyzes it top down.
With the recent development in instrumentation and information technology, this goal looks realistic. A cell is a massively parallel and interacting system. The parallel nature of the system speeds up the transfer of instructions within the cell, while the interactive feature determines the nonlinear and dynamic behavior of a system that exhibits feedback loops, noise, redundancy, and robustness. Furthermore, the cross-interactive nature of intracellular processes gives rise to fuzzy boundaries among pathways. For example, DNA polymerase participates in both the synthesis of a new DNA strand and the repair of DNA damage. Thus, it may be considered a common link between replication and repair pathways. Likewise there are hundreds of parallel and cross-interacting events within the cell, making it difficult to draw boundaries between pathways. Therefore, a practical rule of thumb is, a system is really where you draw a box. If systems biology could be simply defined as comprehensive biology or as biology at system level, then traditional Chinese and Indian medicine could be considered as precursors of systems biology. The systems biology is based on two prominent features. First, it is built on the knowledge gained from experimental biology; second, computational technologies are used to bridge multilayer experimental data. The goal is to describe biology not only at molecular level and the system level but also to understand life in the form of mechanisms and principles. Computational methods have been key driving forces in mathematics, physics, chemistry, and also biology. In systems biology, they function as hubs connecting theoretical, mathematical, and quantitative findings. The key is to find appropriate representations of biological events for numerically describing cellular processes. 
In fact, many biological processes are more suitably described in the language of computer science than that of mathematics, especially for those in which phenomenological knowledge is more easily available than precise mechanistic and quantitative description. Developmentally regulated pathways, signal transduction, and pattern formation are such cases. The importance of the computational approach in systems biology is underscored by the fact that it can provide effective description for systems at different levels. Though systems biology is sometimes practiced and termed as quantitative systems biology (QSB) or computational systems biology (CSB), both are approaches rather than hierarchical branches of systems biology. With the availability of high-throughput quantitative methods, concentrations of gene products, metabolites, and small molecules in di 0 Array of hope 1 Eric S. Lander 1 Bob Crimi 0 Genomics aims to provide biologists with the equivalent of chemistry's Periodic Table1 --an inventory of all genes used to assemble a living creature, together with an insightful system for classifying these building blocks. A short decade ago, the task of enumeration alone appeared to many to be a quixotic quest. Whereas chemical matter is composed of a mere hundred or so elements, organismal parts lists are huge--running into the thousands for bacteria and hundreds of thousands for mammals. Genomic mapping and sequencing, however, has steadily extended its dominion: it has domesticated the Megabase and will tame the Gigabase in the not-too-distant future. The next great challenge is to discern the underlying order. The Periodic Table summarized chemical propensities in its rows and columns, and thereby foreshadowed the secrets of subatomic structure. Understanding biological systems with 100,000 genes will similarly require organizing the parts by their properties.
The Biological Periodic Table will not be two-dimensional, but will reflect similarities at diverse levels: primary DNA sequence in coding and regulatory regions; polymorphic variation within a species or subgroup; time and place of expression of RNAs during development, physiological response and disease; and subcellular localization and intermolecular interaction of protein products. The traditional gene-by-gene approach will not suffice to meet the sheer magnitude of the problem. It will be necessary to take `global views' of biological processes: simultaneous readouts of all components. Arrays offer the first great hope for such global views by providing a systematic way to survey DNA and RNA variation. They seem likely to become a standard tool of both molecular biology research and clinical diagnostics. These prospects have attracted great interest and investment from both the public and private sectors. The reviews in this supplement describe important issues in this fast-moving area2-12. 0 used in semiconductor manufacture to produce arrays with 400,000 distinct oligonucleotides, each in its own 20 µm2 region15. Other companies are developing in situ synthesis with reagents delivered by ink-jet printer devices. The new generation of array technologies is still in its infancy. As one reviewer wryly notes8, the scientific literature contains more reviews about arrays than primary research papers applying them. The techniques have become established in only a few places. The tools remain prohibitively expensive for many laboratories (owing to the actual capital cost of setting up an arraying facility or the amortized capital costs reflected in the purchase price of arrays). Still, these problems are likely to be solved by economies of scale, free-market competition and time--just as they are for new generations of computer microprocessors. 0 differed (for example, in metastatic versus nonmetastatic derivatives of a tumour cell line). 
Deeper biological insight is likely to emerge from examining datasets with scores of samples--for example, multiple time points from multiple cell lines treated independently with multiple growth factors. Each gene defines a point in k-dimensional space (where k is the number of samples studied), and functional similarities are likely to reveal themselves as `clusters' in this space. Computational scientists working in the field of `data mining' have devised a dizzying assortment of techniques for clustering, predicting and visualizing patterns in high-dimensional space--most based on inherent assumptions about the types of patterns to be found. Empirical exploration will be needed to flesh out which types of datasets and analytical tools will be most fruitful for biology. How well can causation be inferred from correlation? The problem is akin to inferring the design of a microprocessor based on the readout of its transistors in response to a variety of inputs. The task is impossible in a strict mathematical sense, in that the microprocessor layout could be arbitrarily complicated, but is likely to prove at least somewhat tractable in a more constrained biological setting, especially when combined with ways to cut specific wires in biological circuits using antisense and related techniques. The great opportunities ahead would well justify an influx of bright young computational scientists and technologists into biology. 0 DNA variation Arrays can also be used to study DNA, with the primary application being identification and genotyping of mutations and polymorphisms. These applications pose rather different challenges than RNA expression monitoring, and many issues remain to be worked out. Identification of novel DNA variants has largely been the province of oligonucleotide, as opposed to spotted, arrays7,9. Exploiting the ability to perform custom synthesis at high density, one can construct a `tiling' array to scan a target sequence for mutations. 
Each overlapping 25-mer in the sequence is covered by four complementary oligonucleotide probes that differ only by having A, T, C or G substituted at the central position. An amplified product containing the expected sequence will hybridize best to the expected probe, whereas a sequence variation will typically alter the hybridization pattern. Such tiling arrays have been used to detect variants in such targets as the HIV genome, human mitochondria and the gene encoding p53. In such specific settings, the process can be optimized to have high specificity and sensitivity. The approach has also been used for much larger surveys--for example, a set 0 Microarray data normalization and transformation 1 John Quackenbush 0 The goal of most microarray experiments is to survey patterns of gene expression by assaying the expression levels of thousands to tens of thousands of genes in a single assay. Typically, RNA is first isolated from different tissues, developmental stages, disease states or samples subjected to appropriate treatments. The RNA is then labeled and hybridized to the arrays using an experimental strategy that allows expression to be assayed and compared between appropriate sample pairs. Common strategies include the use of a single label and independent arrays for each sample, or a single array with distinguishable fluorescent dye labels for the individual RNAs. Regardless of the approach chosen, the arrays are scanned after hybridization and independent grayscale images, typically 16-bit TIFF (Tagged Information File Format) images, are generated for each pair of samples to be compared. These images must then be analyzed to identify the arrayed spots and to measure the relative fluorescence intensities for each element. There are many commercial and freely available software packages for image quantitation. Although there are minor differences between them, most give high-quality, reproducible measures of hybridization intensities. 
For the purpose of the discussion here, we will ignore the particular microarray platform used, the type of measurement reported (mean, median or integrated intensity, or the average difference for Affymetrix GeneChips™), the background correction performed, or spot-quality assessment and trimming used. As our starting point, we will assume that for each biological sample we assay, we have a high-quality measurement of the intensity of hybridization for each gene element on the array. The hypothesis underlying microarray analysis is that the measured intensities for each arrayed gene represent its relative expression level. Biologically relevant patterns of expression are typically identified by comparing measured expression levels between different states on a gene-by-gene basis. But before the levels can be compared appropriately, a number of transformations must be carried out on the data to eliminate questionable or low-quality measurements, to adjust the measured intensities to facilitate comparisons, and to select genes that are significantly differentially expressed between classes of samples. 0 Expression ratios: the primary comparison Most microarray experiments investigate relationships between related biological samples based on patterns of expression, and the simplest approach looks for genes that are differentially expressed. If we have an array that has $N_{array}$ distinct elements, and compare a query and a reference sample, which for convenience we will call R and G, respectively (for the red and green colors commonly used to represent array data), then the ratio (T) for the ith gene (where i is an index running over all the arrayed genes from 1 to $N_{array}$) can be written as $T_i = R_i / G_i$. (Note that this definition does not limit us to any particular array technology: the measures $R_i$ and $G_i$ can be made on either a single array or on two replicate arrays.
Furthermore, all the transformations described below can be applied to data from any microarray platform.) Although ratios provide an intuitive measure of expression changes, they have the disadvantage of treating up- and downregulated genes differently. Genes upregulated by a factor of 2 have an expression ratio of 2, whereas those downregulated by the same factor have an expression ratio of 0.5. The most widely used alternative transformation of the ratio is the logarithm base 2, which has the advantage of producing a continuous spectrum of values and treating up- and downregulated genes in a similar fashion. Recall that logarithms treat numbers and their reciprocals symmetrically: log2(1) = 0, log2(2) = 1, log2(1/2) = -1, log2(4) = 2, log2(1/4) = -2, and so on. The logarithms of the expression ratios are also treated symmetrically, so that a gene upregulated by a factor of 2 has a log2(ratio) of 1, a gene downregulated by a factor of 2 has a log2(ratio) of -1, and a gene expressed at a constant level (with a ratio of 1) has a log2(ratio) equal to zero. For the remainder of this discussion, log2(ratio) will be used to represent expression levels. 0 Normalization Typically, the first transformation applied to expression data, referred to as normalization, adjusts the individual hybridization intensities to balance them appropriately so that meaningful biological comparisons can be made. 0 In the total intensity normalization described below, $G_i$ and $R_i$ are the measured intensities for the ith array element (for example, the green and red intensities in a two-color microarray assay) and $N_{array}$ is the total number of elements represented in the microarray. One or both intensities are appropriately scaled, for example, $G'_k = N_{total} G_k$ and $R'_k = R_k$.
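The symmetry of the log2 transformation of expression ratios discussed in this section can be illustrated with a minimal sketch. The function name `log2_ratio` and the example intensities are assumptions made for illustration only; they do not come from any particular analysis package.

```python
import math


def log2_ratio(red, green):
    """Return log2(R/G) for one array element.

    `red` and `green` are the query and reference intensities;
    both must be positive for the logarithm to be defined.
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


# Hypothetical intensity pairs: twofold up, twofold down, unchanged.
examples = [(2000.0, 1000.0), (1000.0, 2000.0), (1500.0, 1500.0)]
log_ratios = [log2_ratio(r, g) for r, g in examples]
```

A twofold increase and a twofold decrease give values of equal magnitude and opposite sign, which is the property that makes the log2 scale convenient for comparing up- and downregulated genes.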
There are a number of reasons why data must be normalized, including unequal quantities of starting RNA, differences in labeling or detection efficiencies between the fluorescent dyes used, and systematic biases in the measured expression levels. Conceptually, normalization is similar to adjusting expression levels measured by northern analysis or quantitative reverse transcription PCR (RT-PCR) relative to the expression of one or more reference genes whose levels are assumed to be constant between samples. There are many approaches to normalizing expression levels. Some, such as total intensity normalization, are based on simple assumptions. Here, let us assume that we are starting with equal quantities of RNA for the two samples we are going to compare. Given that there are millions of individual RNA molecules in each sample, we will assume that the average mass of each molecule is approximately the same, and that, consequently, the number of molecules in each sample is also the same. Second, let us assume that the arrayed elements represent a random sampling of the genes in the organism. This point is important because we will also assume that the arrayed elements randomly interrogate the two RNA samples. If the arrayed genes are selected to represent only those we know will change, then we will likely over- or under-sample the genes in one of the biological samples being compared. If the array contains a large enough assortment of random genes, we do not expect to see such bias. This is because for a finite RNA sample, when representation of one RNA species increases, representation of other species must decrease. Consequently, approximately the same number of labeled molecules from each sample should hybridize to the arrays and, therefore, the total hybridization intensities summed over all elements in the arrays should be the same for each sample. 
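Under the assumptions just stated, total intensity normalization amounts to rescaling one channel so that the summed intensities of the two channels agree. The following is a minimal sketch of that idea, assuming plain Python lists of positive intensities; the function name and the example values are hypothetical, and real pipelines would typically also handle background correction and flagged spots.

```python
def total_intensity_normalize(red, green):
    """Scale the green channel so both channels have equal total intensity.

    Returns the normalization factor (sum of red / sum of green),
    the rescaled green intensities, and the normalized ratios R/G'.
    """
    if len(red) != len(green) or not red:
        raise ValueError("channels must be non-empty and of equal length")
    n_total = sum(red) / sum(green)
    green_scaled = [n_total * g for g in green]
    ratios = [r / g for r, g in zip(red, green_scaled)]
    return n_total, green_scaled, ratios


# Hypothetical data: the red channel is uniformly twice as bright.
red = [200.0, 400.0, 600.0]
green = [100.0, 200.0, 300.0]
n_total, green_scaled, ratios = total_intensity_normalize(red, green)
```

In this toy case the twofold difference is purely a labeling or detection effect, so every normalized ratio becomes 1. Scaling only the green channel is one choice among several; scaling the red channel, or both, would serve equally well.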
Using this approach, a normalization factor is calculated by summing the measured intensities in both channels, $N_{total} = \frac{\sum_{i=1}^{N_{array}} R_i}{\sum_{i=1}^{N_{array}} G_i}$, so that the normalized expression ratio for each element becomes $T'_i = \frac{R'_i}{G'_i} = \frac{1}{N_{total}} \frac{R_i}{G_i}$, which adjusts each ratio such that the mean ratio is equal to 1. This process is equivalent to subtracting a constant from the logarithm of the expression ratio, $\log_2(T'_i) = \log_2(T_i) - \log_2(N_{total})$, which results in a mean log2(ratio) equal to zero. There are many variations on this type of normalization, including scaling the individual intensities so that the mean or median intensities are the same within a single array or across all arrays, or using a selected subset of the arrayed genes rather than the entire collection. 0 Lowess normalization In addition to total intensity normalization described above, there are a number of alternative approaches to normalizing expression ratios, including linear regression analysis1, log centering, rank invariant methods2 and Chen's ratio statistics3, among others. However, none of these approaches takes into account systematic biases that may appear in the data. Several reports have indicated that the log2(ratio) values can have a systematic dependence on intensity4,5, which most commonly appears as a deviation from zero for low-intensity spots. Locally weighted linear regression (lowess)6 analysis has been proposed4,5 as a normalization method that can remove such intensity-dependent effects in the log2(ratio) values. The easiest way to visualize intensity-dependent effects, and the starting point for the lowess analysis described here, is to plot the measured $\log_2(R_i/G_i)$ for each element on the array as a function of the $\log_{10}(R_i \cdot G_i)$ product intensities. This `R-I' (for ratio-intensity) plot can reveal intensity- 0 research focus 0 Protein microarray technology 1 Markus F. Templin, Dieter Stoll, Monika Schrenk, Petra C. Traub, Christian F. Voehringer and Thomas O.
Joos 0 Microarray technology allows the simultaneous analysis of thousands of parameters within a single experiment. Microspots of capture molecules are immobilised in rows and columns onto a solid support and exposed to samples containing the corresponding binding molecules. Readout systems based on fluorescence, chemiluminescence, mass spectrometry, radioactivity or electrochemistry can be used to detect complex formation within each microspot. Such miniaturised and parallelised binding assays can be highly sensitive, and the extraordinary power of the method is exemplified by array-based gene expression analysis. In these systems, arrays containing immobilised DNA probes are exposed to complementary targets and the degree of hybridisation is measured. Recent developments in the field of protein microarrays show applications for enzyme-substrate, DNA-protein and different types of protein-protein interactions. This article discusses theoretical advantages and limitations of any miniaturised capture-molecule-ligand assay system and discusses how the use of protein microarrays will change diagnostic methods and genome and proteome research. 0 w The fundamental principles of miniaturised 0 and parallelised microspot ligand-binding assays were described more than a decade ago. In the `ambient analyte theory', Roger Ekins and coworkers [1-4] explained why microspot assays are more sensitive than any other ligand-binding assay. At that time, the high sensitivity and enormous potential of microspot technology had already been demonstrated using miniaturised immunological assay systems. Nevertheless, the enormous interest that microarray-based assays evoked came from work using DNA chips. The possibility of determining thousands of different binding events in one reaction in a massively parallel fashion perfectly suited the needs of genomic approaches in biology. The rapid progress in whole-genome sequencing (e.g. 
[5,6]) and the increasing importance of expression studies (expressed sequence tag [EST] sequencing) was matched with efficient in vitro techniques for synthesising specific capture molecules for ligand-binding assays. Oligonucleotide synthesis and PCR amplification allow thousands of highly specific capture molecules to be generated efficiently. New trends in technology, mainly in microtechnology and microfluidics, newly established detection systems and improvements in computer technology and bioinformatics were rapidly integrated into the development of microarray-based assay systems. Now, DNA microarrays, some of them built from tens of thousands of different oligonucleotide probes per square centimetre, are well-established high-throughput hybridisation systems that generate huge sets of genomic data within a single experiment (Fig. 1). Their use for the analysis of single nucleotide polymorphisms and in expression profiling has already changed pharmaceutical research, and their use as diagnostic tools will have a big impact on medical and biological research. As known from gene expression studies, however, mRNA level and protein expression do not necessarily correlate [7-9]. Protein functionality is often dependent on posttranslational processing of the precursor protein and regulation of cellular pathways frequently occurs by specific interaction between proteins and/or by reversible covalent modifications such as phosphorylation. To obtain detailed information about a complex biological system, information on the state of many proteins is required. The analysis of the proteome of a cell (i.e. the quantification of all proteins and the determination of their post-translational modifications and how these are dependent on cell-state and environmental influences) is not possible without novel experimental approaches. High-throughput protein analysis methods allowing a fast, direct and quantitative detection are needed. 
Efforts are underway, therefore, to expand microarray technology beyond DNA chips and 0 research focus 0 Internal parameters Genetic Aging Diseases 0 External parameters Drugs Environment 0 Signal density Decrease 0 Cell Signal log (Total intensity) Signal density log (Signal/area) 0 Genetic analysis · SNP · Mutation · Sequencing 0 Expression analysis · mRNA · Protein 0 Interaction analysis · · · · · Protein-protein Antigen-antibody Enzyme-substrate Protein-DNA Ligand-receptor 0 Drug Discovery Today 0 Total amount of antibody 0 Drug Discovery Today 0 establish array-based approaches to characterise proteomes (Fig. 1) [10-12]. 0 Miniaturised ligand-binding assays: theoretical considerations 0 The ambient analyte assay theory shows that miniaturised ligand-binding assays are able to achieve a superior sensitivity. A system that uses a small amount of capture molecules and a small amount of sample can be more sensitive than a system that uses a hundred times more material. Ekins and coworkers [1-4] developed a sensitive microarray-based analytical technology and proved the high sensitivity of the miniaturised assay. With this system, analytes, such as thyroid stimulating hormone (TSH) or Hepatitis B surface antigen (HbsAG), could be quantified down to the femtomolar concentration range (corresponding to 106 molecules ml-1). Miniaturisation is the key to understanding the principle of miniaturised binding assays. Capture molecules are immobilised to the solid phase only in a very small area, the microspot - although the amount of capture molecules present in the system is low, a high density of molecules in the microspot can be obtained (Fig. 2). 
During an assay, target molecules, or analytes, are captured by the microspot but the number of 0 research focus 0 DNA mRNA Protein Biological target 0 DNA mRNA Protein 0 Protein Biological target 0 Amplification Different labeling 0 Amplification Labeling 0 Competitive binding 0 Differentially regulated targets 0 Quantification YYYY YYYY YYYY 0 Drug Discovery Today 0 THE USE AND ANALYSIS OF MICROARRAY DATA 1 Atul Butte 0 Functional genomics is the study of gene function through the parallel expression measurements of genomes, most commonly using the technologies of microarrays and serial analysis of gene expression. Microarray usage in drug discovery is expanding, and its applications include basic research and target discovery, biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease-subclass determination. This article reviews the different ways to analyse large sets of microarray data, including the questions that can be asked and the challenges in interpreting the measurements. 0 NATURE REVIEWS | DRUG DISCOVERY 0 Nature Publishing Group 0 Tissue or tissue under influence 0 cDNA or cRNA copy 0 Tagged or incorporating fluor 0 Fluorescent intensities scanned into computer 0 cDNA spotted on glass slide or oligonucleotides built on slide 0 Instead of fitting a complex polynomial curve to data, splines allow the fitting of data by putting together smaller, less complex curves. 0 NORTHERN BLOT 0 Different RNA molecules are separated by mass on a gel, then radioactively labelled complementary DNA or RNA molecules are used to quantify specific RNA amounts. 
0 REVERSE TRANSCRIPTION 0 determine differences in gene expression in tissues exposed to various doses of compounds; toxicogenomics, to find gene-expression patterns in a model tissue or organism exposed to a compound and their use as early predictors of adverse events in humans; target selectivity, to define a compound by the geneexpression pattern it provokes in a target tissue and then compare it with other compounds using these patterns; prognostic tests, to find a set of genes that accurately distinguishes one disease from another; and diseasesubclass determination, to find multiple subcategories of tumours in a single clinical diagnosis. Many free (BOX 1) and commercial software packages are now available to analyse microarray data sets, although it is still difficult to find a single off-the-shelf software package that answers all functional-genomics questions. As the field is still young, when developing a bioinformatics analysis pipeline, it is more important to have a good understanding of both the biology involved and the analytical techniques rather than having the right software. This article reviews the different ways to analyse microarray data, and will concentrate on choosing the appropriate method for the given hypothesis. 0 Normalization and noise 0 The synthesis of a strand of DNA from RNA, which is used to make a complementary DNA copy of sample RNA. 0 Before multiple microarray measurements can be integrated into a single analysis, the reported measurements need to be normalized, or modified (possibly corrected) to make them comparable.When microarrays are used to collect gene-expression data in an experiment in which the measurements are made at the same time, with homogeneous populations of similar cells and using a 0 single microarray technology, normalization might simply be a matter of adjusting the overall brightness of each scanned microarray image, assuming that the quantity of RNA is equal4. 
Other normalization methods include: using expression levels of `housekeeping' genes5; using assumptions that most genes do not change across experiments6; using SPLINES7; or other nonlinear techniques8,9. Typically, however, functional-genomics experiments are more complicated. Recently, increasing efforts have been invested in characterizing the `noise' in microarray technology. Studies addressing the reproducibility of microarray data analysed replicated data10, compared microarray measurements with NORTHERN BLOTS11,12 and SAGE13, and evaluated strategies for REVERSE TRANSCRIPTION14 and in vitro transcription amplification15. As a result, it has become increasingly clear that there are several substantial sources of noise in microarray data. Intra- and inter-microarray variations can markedly skew the interpretation of such expression data. First, improving the reliability of expression measurements starts with proper experimental design. For example, microarrays can measure across the genome, including genes with expression that is controlled by hormones, such as growth hormone or cortisol. So, if organ samples are acquired at various times during the day, genes that appear to be differentially expressed might only be reflecting normal circadian physiology. Pooling samples before hybridization might control for this biological `noise.' In addition, scanned hybridization images need to be inspected for artefacts, such as scratches and bubbles16,17. Measuring replicate microarrays for each biological sample allows the modelling of this technical noise. 0 Most reported expression data have been obtained on relatively homogeneous cell populations. However, when RNA is extracted from whole organs or from tumour biopsies, the sources of variation increase. There is substantial heterogeneity of expression in cell subpopulations in most organs and in many tumours.
Failure to account for such variation could lead to overinterpretation or spurious functional gene association. Microdissection of cell subpopulations (for example, with laser capture18) is possible only in a minority of the systems of interest. If microarray-based geneexpression measurements are to be reliable and economical, both at the level of basic biology and clinical assays, then all of these further sources of noise/variation must be incorporated directly into the analytical tools that interpret these data. A further issue that needs to be addressed is the difference between the two most commonly used microarray technologies: spotted cDNA microarrays, which report differences in gene expression between two samples, and oligonucleotide microarrays, which report absolute expression levels. Normalization techniques for one microarray technology might not apply to another, owing to differences in assumptions and the distributions of the output measurements. For example, if we assume th 0 A strategy for optimizing quality and quantity of DNA extracted from soil 1 Helmut Burgmann a,) , Manuel Pesaro a , Franco Widmer a,b, Josef Zeyer a ¨ 0 Keywords: DNA; Bead beating; Soil; Extraction 0 Introduction Molecular ecology relies heavily on methods for the direct extraction of DNA from environmental samples. Molecular methods for the analysis of gene pools using polymerase chain reaction ZPCR. or 0 cloning techniques rely on high quality nucleic acids as template, as these techniques require pure, unfragmented DNA templates. Extraction of pure nucleic acids from soil samples has been a challenge because of the complex and heterogeneous nature of the soil matrix and the inhibition of biochemical reactions by coextracted substances such as humic acids ZPorteous and Armstrong, 1993; Steffan and Atlas, 1991; Young et al., 1993.. The efficiency of the extraction is of equal importance. 
High DNA yields are important to obtain a low detection limit and to ensure the 0 from soil with a method optimized for quality of the extracted DNA, and we investigate the impact of the extraction method on the apparent microbial community. 0 Materials and methods 2.1. Soil sampling and storage One agricultural and five forest soil samples were collected in August 1999 from sites in northern Switzerland and the upper Rhône valley in southern Switzerland. They represent a range of typical European soils with respect to parameters like pH, texture and organic matter content (Table 1) (Favre, 1982; Richard and Lüscher, 1983). At each site, a block of soil was removed with a spade and the A horizons were separated and transported to the laboratory in plastic bags. All soils were passed through a 2.5-mm sieve and stored at 10 °C. DNA extractions were performed after an equilibration time of at least 3 weeks. While this method of storage allows for some change in the microbial communities over time, it was undesirable to freeze soil samples because of the additional physical stress introduced by freezing and thawing. 2.2. DNA extraction procedures Extractions were performed with a modification of a buffer previously described for RNA extraction (Cheung et al., 1994). The buffer contains 0.2% hexadecyltrimethylammonium bromide (CTAB), 1 mM dithiothreitol (DTT), 0.2 M sodium phosphate buffer (pH 8), 0.1 M NaCl and 50 mM EDTA. Silica or ceramic beads (Table 2, types A, C, and D) or bead mixtures (Table 2, types B and E) were weighed into sterile 2-ml microtubes, an amount of soil was added and the buffer was pipetted directly into the tube. The tubes were processed in the bead-beater (FastPrep FP120 bead-beater, Bio101/Savant, Farmingdale, NY), which allowed simultaneous processing of up to 12 samples. The machine supports beating speeds (maximum speed of the tube during vertical movement)
between 4.0 and 6.5 m s⁻¹ (in 0.5 m s⁻¹ increments), corresponding to approxi- 1 Osterliwald Rafz Steig Winzlerboden 0 Data from Favre (1982) and Richard and Lüscher (1983). Gartenacker from the upper Rhône valley in southern Switzerland, all other soils from northern Switzerland. 0 Experiment FastPrep parameters Bead types Amount of beads Temperature Reextractionc Maximum extractiond Comparison of soils 0 Beads (type)a A A, B, C, D, E A A A A A 0 Three-Detergent Method for the Extraction of RNA from Several Bacteria 0 Recent trends in molecular bacteriology have highlighted the importance of examining and comparing gene expression in different species in many cases. Also, studies with a number of different bacterial strains may be required when working on their ecology or population biology. In all such cases, high-efficiency protocols applicable to a variety of bacteria are relevant. A potential hurdle in the isolation of intact 0 RNA from bacteria is the relatively short half-life of the messenger RNA. Hence, the rapidity of cellular lysis and complete inhibition of RNases is of particular importance in such protocols. A mixture of detergents at low pH was previously shown to be efficient for cellular lysis for mycobacteria (4). On this basis, we have developed a three-detergent method for the isolation of RNA from several gram-negative bacterial species. In our method, cellular lysis is achieved through a combination of SDS, Tween® 20 and Triton® X-100 while genomic DNA contamination is reduced through acid depurination-cum-deproteination through the use of citrate-buffered phenol (pH 4.0). The three detergents are readily available: SDS is 0 tity of the RNA obtained. The RNA yields ranged between 21.8 and 47.2 µg RNA/mL starting culture, and the A260/A280 nm ratios were between 1.80 and 2.09. Figure 1A shows the gel profile of total RNA obtained from P. putida wild-type using different methods.
A non-denaturing gel was used because it shows more clearly both the RNA quantity and quality and the degree of persisting DNA. Figure 1, lane 3 shows that the quantity of RNA isolated using the three-detergent technique was significantly higher than when a single detergent was used (Figure 1, lanes 1 and 2, 2% and 5% SDS, respectively). Having established that this threedetergent method was the most efficient, we then proceeded to optimize the reduction of chromosomal DNA carry-over. The persisting DNA and RNA yields obtained from LiCl precipitation for 1 h, 3 h and overnight are shown in Figure 1, lanes 4-6, respectively. Total yields are reduced, but so are the persisting DNA. Lane 7 shows 0 Table 1. Average, Based on Three Experiments, RNA Recovery from Different Bacterial Strains 0 Strains P. putida 39169 P. putida 39169 P. putida 39169 P. putida 39169 Epicurian colifi XL1-Blue P. aeruginosa BO267 E. tarda PPD 130/91 B. cepacia 53267 A. tumefaciens AGL1 B. cereus 14579 B. subtilis 6051 0 Yield (µg RNA/mL µ Starting Culture) 47.2 ± 3.2 8.1 ± 22.1 25.1 ± 2.8 34.5 ± 2.7 35.7 ± 2.0 46.9 ± 3.3 25.3 ± 2.4 21.8 ± 2.0 24.4 ± 1.7 37.4 ± 2.4 (24)b 39.2 ± 1.9 (9)b 0 Cells were lysed in 20 mL of STT extraction buffer, and RNA was precipitated with a 1 vol of isopropanol; method 2: as in method 1, but with an additional lysozyme treatment prior to cell lysis with STT; method 3: RNA was precipitated with LiCl for either 1 h, 3 h or overnight; method 4: RNA was first precipitated with isopropanol and then DNase-treated. yields obtained if lysozyme treatment was omitted. 0 the RNA obtained from isopropanol precipitation followed by DNase I treatment. The contaminating DNA is fully removed, and the RNA yields are still higher (1.1- to 3.1-fold) than that obtained from LiCl precipitation. 
RNA was also isolated u 0 mRNA Extraction and Reverse Transcription-PCR Protocol for Detection of nifH Gene Expression by Azotobacter vinelandii in Soil 1 Helmut Burgmann,1* Franco Widmer,2 William V. Sigler,1 and Josef Zeyer1 ¨ 0 Soil Biology, Institute of Terrestrial Ecology, Swiss Federal Institute of Technology (ETH-Zurich), ¨ CH-8952 Schlieren,1 and Swiss Federal Research Station for Agroecology and Agriculture (FAL Reckenholz), CH-8046 Zurich,2 Switzerland ¨ 0 A. VINELANDII nifH ACTIVITY IN SOIL AND LIQUID CULTURE TABLE 1. Starting conditions for the experimental treatments and controls 0 A. vinelandii concn (cells ml 1 or cells g 1)b 0 Sucrose concn (%)c 0 NH4NO3 concn ( mol ml 1 or mol g 0 No. of replicates 0 LC N LC N SC N SC N LC control SC control Reference soil 0 Liquid medium Liquid medium Sterile soil Sterile soil Liquid medium Sterile soil Nonsterile soil 0 The Liquid medium was ATTC 14 medium, and the soil was Pappelacker (see text). Strain DSM 85. c The concentration in liquid medium was 2% (wt/vol), and the concentration in soil was 2% (wt/wt). d Concentration of NH4NO3 added. The soil contained additional indigenous nitrogen. e NA, not applicable. 0 most previous investigations high-density inoculation or very active communities were required in order to reliably detect mRNA. Reliable extraction of mRNA from soil is still considered a challenge in soil microbiological research (17). Recent progress in extraction technology, however, has shown that the approach is feasible (19). Here we describe an effective total RNA extraction protocol based on a previously described direct extraction procedure for total nucleic acids (8). Azotobacter vinelandii, an aerobic freeliving soil diazotroph, was cultivated in a previously sterilized soil and in liquid culture. This system was used to establish and verify a method for nifH mRNA extraction and detection by reverse transcription (RT) and PCR. 
N fixation was either induced by providing excess organic carbon (sucrose) or repressed by providing excess bioavailable N (NH4NO3). Population growth, bulk N-fixing activities, and nifH mRNA expression were monitored and compared in order to link nifH gene expression to N-fixing activity in a soil environment. 0 aubergine enhances oskar translation in the Drosophila ovary 1 Joan E. Wilson, Joanne E. Connell and Paul M. Macdonald* 0 Key words: aubergine, oskar, translation, maternal mRNA, Drosophila 0 RESEARCH ARTICLE 0 A Gene Expression Map for the Euchromatic Genome of Drosophila melanogaster 1 Viktor Stolc,1,5* Zareen Gauhar,1,2* Christopher Mason,2* Gabor Halasz,7 Marinus F. van Batenburg,7,9 Scott A. Rifkin,2,3 Sujun Hua,2 Tine Herreman,2 Waraporn Tongprasit,6 Paolo Emilio Barbano,2,4 Harmen J. Bussemaker,7,8 Kevin P. White2,3. 0 We used a maskless photolithography method to produce DNA oligonucleotide microarrays with unique probe sequences tiled throughout the genome of Drosophila melanogaster and across predicted splice junctions. RNA expression of protein coding and nonprotein coding sequences was determined for each major stage of the life cycle, including adult males and females. We detected transcriptional activity for 93% of annotated genes and RNA expression for 41% of the probes in intronic and intergenic sequences. Comparison to genome-wide RNA interference data and to gene annotations revealed distinguishable levels of expression for different classes of genes and higher levels of expression for genes with essential cellular functions. Differential splicing was observed in about 40% of predicted genes, and 5440 previously unknown splice forms were detected. Genes within conserved regions of synteny with D. pseudoobscura had highly correlated expression; these regions ranged in length from 10 to 900 kilobase pairs. 
The expressed intergenic and intronic sequences are more likely to be evolutionarily conserved than nonexpressed ones, and about 15% of them appear to be developmentally regulated. Our results provide a draft expression map for the entire nonrepetitive genome, which reveals a much more extensive and diverse set of expressed sequences than was previously predicted. Characterization of the complete expressed set of RNA sequences is central to the functional interpretation of each genome. For almost 3 decades, the analysis of the Drosophila genome has served as an important model for studying the relationship between gene expression and development. In recent years, Drosophila provided the initial demonstration that DNA microarrays could be used to study gene expression during development (1), and subsequent large-scale studies of gene expression in this and other developmental model organisms have given new insights into how 0 of the human genome and for Arabidopsis (11-13). Microarrays have also recently been used to characterize the great diversity of RNA transcripts brought about by differential splicing in human tissues (14). We used both types of approaches to characterize the Drosophila genome. Experimental design. To determine the expressed portion of the Drosophila genome, we designed high-density oligonucleotide microarrays with probes for each predicted exon and probes tiled throughout the predicted intronic and intergenic regions of the genome. We used maskless array synthesizer (MAS) technology (15, 16) to synthesize custom microarrays containing 179,972 unique 36-nucleotide (nt) probes (17). Of these, 61,371 exon probes (EPs) assayed 52,888 exons from 13,197 predicted genes, 87,814 nonexon probes (NEPs) assayed expression from intronic and intergenic regions, and 30,787 splice junction probes (SJPs) assayed potential exon junctions for a test subset of 3955 genes. 
For the SJPs, we used 36-nt probes spanning each predicted splice junction, with 18 nt corresponding to each exon (14). RNA from six developmental stages during the Drosophila life cycle (early embryos, late embryos, larvae, pupae, and male and female adults) was isolated and reverse-transcribed in the presence of oligodeoxythymidine and random hexamers, and the labeled cDNA was hybridized to these arrays. The stages were chosen to maximize the number of transcripts that would be differentially expressed between samples on the basis of previous results (3, 7). Each sample was hybridized four times, twice with Cy5 labeling and twice with Cy3 labeling (fig. S1). Genomic and chromosomal expression patterns. We determined which exon or nonexon probes correspond to genomic regions that are transcribed at any stage during development (18). We used a negative control probe (NCP) distribution (fig. S3) to score the statistical significance of the EP or NEP signal intensities for each of the 24 unique combinations of stage, dye, and array, correcting for probe sequence bias (17, 19). These results were combined into a single expression-level estimate (19), a threshold for which was determined by requiring a false discovery rate of 5% (20). At this threshold, 47,419 of 61,371 EPs (77%) and 35,985 of 87,814 NEPs (41%) were significantly expressed at some point during the fly life cycle. Significantly expressed EPs correspond to 79% (41,559/52,888) of all exons probed and 93% (12,305/13,197) of all probed gene annotations. Our results confirmed 2426 annotated genes not yet validated through an EST sequence (Fig. 1A). Out of 10,280 genes represented by EST sequences, 0 OCTOBER 2004 0 only 401 (3.0%) were not detected in these microarray experiments. Our finding that a large fraction of intergenic and intronic regions (NEPs) is expressed in D. melanogaster mirrors similar observations for chromosomes 21 and 22 in humans (16) and for Arabidopsis (14).
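The text states only that the expression threshold was set by requiring a false discovery rate of 5%; it does not spell out the procedure. As a minimal sketch, assuming a standard Benjamini-Hochberg step-up rule (an assumption, not necessarily the method of ref. 20), thresholding a list of per-probe p-values could look like this. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag tests as significant under the Benjamini-Hochberg step-up rule.

    Assumption: this is the generic textbook procedure, used here only to
    illustrate what "a false discovery rate of 5%" means operationally.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Everything ranked at or below the cutoff is called significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Made-up p-values for ten hypothetical probes.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(example_p, fdr=0.05)
print(sum(calls), "of", len(calls), "probes called expressed")
```

Under this rule the cutoff adapts to the number of probes tested, which matters when tens of thousands of probes are scored at once.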
These results support the conclusion that extensive expression of intergenic and intronic sequences occurs in the major evolutionary lineages of animals (deuterostomes and protostomes) and in plants. We noted that mRNA expression levels for protein-encoding genes varied with the protein function assigned in the Drosophila Gene Ontology (fig. S2) (21). For example, genes encoding G protein receptors were expressed at relatively low levels, whereas genes encoding ribosomal proteins were highly expressed. A gene's expression level was also associated with cellular compartmentalization and the biological process it mediates (fig. S2). For example, genes encoding cytosolic and cytoskeletal factors were more highly expressed than those predicted to function within organelles such as the endoplasmic reticulum, Golgi, and peroxisome. To determine whether a high level of gene expression was associated with essential genetic functions, we examined the expression levels of genes recently shown to be required for cell viability (Fig. 1B) in a genome-wide RNA interference (RNAi) screen in Drosophila (22). Compared to the rest of the genome, the genes identified as essential by RNAi showed a significant increase in expression during all stages of development (P = 0.0009, t test), even when the highly expressed ribosomal protein genes were omitted (P = 0.0005, t test). This result is also consistent with the observation that genes with mutant phenotypes from the 3-Mbase Adh genomic region are overrepresented in EST libraries (23). High levels of essential gene expression may in part reflect widespread expression in cells throughout the animal, and the relative RNA expression level may serve as a rough predictor of essential cellular function. We also examined changes in gene expression during the fly life cycle to determine what fraction of the entire genome is differentially expressed between developmental stages.
Figure 2A shows the expression signal intensities of transcripts from a typical 50-kilobase pair (kbp) region of the Drosophila genome during each major developmental stage. Stage-specific variation in expression is observed not only for exon probes, as expected, but also for intergenic and intronic probes. We used analysis of variance (ANOVA) (24) to systematically identify probes as differentially expressed at a false discovery rate of 5% (16). As expected, the majority of probes detecting differentially expressed sequences are also expressed above background noise level (89% of EPs and 81% of NEPs) (17) (Table 1). We found 27,176 EPs to be differentially expressed, corresponding to 76% of annotated genes, and even more when we applied a less conservative background model (fig. S4). The fact that the 0 Review articles 0 Control of developmental timing by small temporal RNAs: a paradigm for RNA-mediated regulation of gene expression 1 Diya Banerjee and Frank Slack* 0 BioEssays 24.2 0 For the majority of animals, spatial pattern is laid down over time and hence spatial identity is often a result of the temporal sequence of patterning events. The key role that developmental time plays in pattern formation is illustrated in the exquisite series of heterochronic grafting experiments performed by Summerbell et al.(20) When the tips of young chick limb buds are grafted onto older limb buds, the limbs develop with reiterations of limb segments along the proximal-distal (shoulder to fingers) axis, i.e. these limbs develop with two consecutive sets of humerus, radius, and ulna bones (Fig. 1). In the reciprocal heterochronic graft, old limb buds are grafted onto young limb buds and the limbs develop with deletion of segments along the proximal-distal axis, i.e. these limbs develop with a humerus immediately followed by digits, deleting the radius and ulna.
The proximal-distal axis of the limb develops over time with the proximal elements being produced first and the distal elements last. Undifferentiated cells in the progress zone divide under the influence of fibroblast growth factors (FGFs) produced from the apical epidermal ridge, the most distal structure in the limb bud. As their daughter cells move away from the FGF signal, they differentiate into limb elements.(21-23) The progress zone model proposes that the length of time that a progenitor cell spends in the progress zone dictates which proximal-distal fates its daughters will assume. Thus, spatial patterning in the proximal-distal axis can be thought of as a consequence of temporal patterning because the specification of each limb element is dependent on the relative age of the progenitor cell in the progress zone. Proximal elements are derived from daughters of younger progenitor cells and distal elements are derived from daughters of older progenitor cells. Another example of dependence on time for correct spatial patterning can be found during anterior-posterior patterning by Hox genes in vertebrates. Hox genes are arranged in linear clusters in which the physical order of individual Hox genes along the DNA correlates with their time of expression as well as their spatial domains of expression along the anterior-posterior axis. As cell proliferation progresses in the posteriorly migrating primitive streak, cells that are derived from developmentally younger progenitors become anteriorly located and express genes in the Hox cluster that are located near the 3′ end of the cluster. More posteriorly located cells derived from older progenitors express genes closer to the 5′ end of the cluster.
This correlative relationship, known as ``colinearity'', emphasizes the intimacy of the relationship between developmental space and time.(24,25) The observation of Hox gene colinearity raises the possibility that temporal and spatial patterning pathways may share common mechanisms and genes. A first hint of this possibility is the recent observation that hunchback and kruppel, two well-known regulators of spatial identity in Drosophila embryogenesis, are also required for temporal identity of neurons.(26) Temporal boundaries and segment identities Heterochronic genes can be thought of as the temporal equivalents of the homeotic spatial patterning genes. While homeotic mutations result in alterations as to where particular cell fates are expressed, heterochronic mutations result in temporal transformations of cell fate, that is, changes in when a particular cell fate is expressed (Fig. 1). Both sets of genes generate graded levels of morphogens that modify a basic reiterated pattern of segments. In Drosophila, spatial patterning involves expression of segmentation genes defining the segment boundaries in the early embryo, followed by specification of segment identity by the homeotic genes. Similarly, one can define two broad classes of developmental timing genes, temporal identity genes that affect the fate choices that a cell makes at a specific time and temporal boundary genes that set the pace of development, for example, the genes that control the timing of molting. The C. elegans heterochronic mutations identified thus far transform temporal cell fate identity without appreciably affecting the periodicity of progression through the larval stages. These mutations thus define temporal identity genes. The larval molting cycle is unaffected by the known heterochronic mutations in C. 
elegans, sug 0 Functional anatomy of siRNAs for mediating efficient RNAi in Drosophila melanogaster embryo lysate 1 Sayda M.Elbashir, Javier Martinez, Agnieszka Patkaniowska, Winfried Lendeckel and Thomas Tuschl1 0 Department of Cellular Biochemistry, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, D-37077 Göttingen, Germany 0 Duplexes of 21-23 nucleotide (nt) RNAs are the sequence-specific mediators of RNA interference (RNAi) and post-transcriptional gene silencing (PTGS). Synthetic, short interfering RNAs (siRNAs) were examined in Drosophila melanogaster embryo lysate for their requirements regarding length, structure, chemical composition and sequence in order to mediate efficient RNAi. Duplexes of 21 nt siRNAs with 2 nt 3′ overhangs were the most efficient triggers of sequence-specific mRNA degradation. Substitution of one or both siRNA strands by 2′-deoxy or 2′-O-methyl oligonucleotides abolished RNAi, although multiple 2′-deoxynucleotide substitutions at the 3′ end of siRNAs were tolerated. The target recognition process is highly sequence specific, but not all positions of a siRNA contribute equally to target recognition; mismatches in the centre of the siRNA duplex prevent target RNA cleavage. The position of the cleavage site in the target RNA is defined by the 5′ end of the guide siRNA rather than its 3′ end. These results provide a rational basis for the design of siRNAs in future gene targeting experiments. Keywords: PTGS/RNA interference/small interfering RNA 0 © European Molecular Biology Organization 0 S.M.Elbashir et al. 0 nucleotide mismatches between the siRNA duplex and the target mRNA abolish interference. These results provide a rational basis for the design of siRNAs for future gene targeting experiments. 0 We reported previously that two or three unpaired nucleotides at the 3′ end of siRNA duplexes were more efficient in target RNA degradation than blunt-ended duplexes (Elbashir et al., 2001b).
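The duplex geometry described above (21 nt strands with 2 nt 3′ overhangs) can be made concrete with a short sketch. It assumes that both strands are derived from the target sequence, so that the antisense strand is fully complementary to the target; the function names, the window convention and the example sequence are hypothetical and are not taken from the paper.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string written 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


def sirna_duplex(target, start, length=21, overhang=2):
    """Return (sense, antisense) strands, both written 5'->3'.

    `start` is the 0-based position in `target` of the first sense
    nucleotide. The sense strand copies the target; the antisense strand
    is the reverse complement of a window shifted `overhang` nucleotides
    upstream, which leaves `overhang` unpaired nucleotides at each 3' end.
    """
    paired = length - overhang  # 19 base pairs for a 21 nt siRNA
    if start < overhang or start + length > len(target):
        raise ValueError("window does not fit inside the target")
    sense = target[start:start + length]
    antisense = reverse_complement(target[start - overhang:start + paired])
    return sense, antisense


# Arbitrary example sequence, not a real luciferase fragment.
target = "GGAUCCAUGGAAGACGCCAAAAACAUAAAGAAAGG"
sense, antisense = sirna_duplex(target, start=5)
print("sense     5'-" + sense + "-3'")
print("antisense 5'-" + antisense + "-3'")
```

Changing `overhang` while keeping `length` fixed shifts the antisense window along the target, in the same spirit as the displaced strands tested in the paper.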
To perform a more comprehensive analysis of the function of the terminal nucleotides, we synthesized five 21 nt sense siRNAs, each displaced by one nucleotide relative to the target RNA, and eight 21 nt antisense siRNAs, each displaced by one nucleotide relative to the target (Figure 1A). By combining these sense and antisense siRNAs, a series of eight siRNA duplexes with symmetric overhanging ends were generated spanning a range from 7 nt 3′ overhang to 4 nt 5′ overhang. The interference was measured using the dual luciferase assay system (Tuschl et al., 1999; Zamore et al., 2000). siRNA duplexes were directed against firefly luciferase mRNA and sea pansy luciferase mRNA was used as internal control. The luminescence ratio of target to control luciferase activity was determined in the presence of siRNA duplex and was normalized to that observed in its absence. For comparison, the interference ratios of long dsRNAs (39-504 bp) are shown in Figure 1B (Elbashir et al., 2001b). The interference ratios were determined at concentrations of 5 nM for long dsRNAs (Figure 1A) and at 100 nM for siRNA duplexes (Figure 1C-J). The 100 nM concentration of siRNAs was chosen because complete processing of 5 nM 504 bp dsRNA would result in 120 nM total siRNA duplexes. The ability of 21 0 CHAPTER 8 0 Preparation and Analysis of Pure Cell Populations from Drosophila 1 Susan Cumberledge and Mark A. Krasnow 0 I. Introduction II. Purifying Embryonic Cells by Fluorescence-Activated Cell Sorting 0 A. Equipment and Reagents B. Methods III. Culturing and Analysis of Purified Cells A. Short-Term Culturing B. Fixation and Staining with Antibodies C. Stable Fluorescent Marking of Purified Cells IV. Conclusions References 0 I.
Introduction 0 As the genetic analysis of development and cell function in Drosophila melanogaster has burgeoned over the last 15 years, so has our ability to distinguish various cell types in developing tissues, using molecular cell markers that have become available mostly through gene cloning. As our understanding of development and cell function in vivo becomes more sophisticated, it is increasingly important to isolate the various cell types so that they can be more fully analyzed and manipulated in various ways. This allows one to test the emerging models of the underlying cellular and molecular processes and to characterize these processes biochemically and discover new components. 0 What has been needed is a convenient, reliable way to purify large quantities of different cell types from Drosophila. A wealth of knowledge has emerged from studies of purified cells and continuous cell lines from vertebrates, with the mammalian immune system perhaps the most dramatic example (Parks et al., 1989). In contrast, there have been only a few serious attempts to isolate and study pure populations of Drosophila cells. Mahowald and his colleagues have shown that highly enriched populations of pole cells (germ-line precursors) and neuroblasts can be obtained in reasonable quantity from embryos (Allis et al., 1977; Furst and Mahowald, 1985), and other groups (Bernstein et al., 1978; Storti et al., 1978) have described procedures for the isolation of myoblasts (see Mahowald (Chapter 7) and Ashburner (1989a) for reviews). This pioneering work demonstrated the feasibility of cell purification from Drosophila embryos, and it showed that purified cells can retain the ability to differentiate appropriately into morphologically distinct cell types. The fractionation schemes relied primarily on differences in general physical characteristics of the cells, such as their size, shape, density, or adhesive properties. For example,
pole cells, because they tend to have a low lipid content and are larger than most embryonic cells, can be purified by equilibrium density centrifugation followed by sedimentation velocity centrifugation (Allis et al., 1977). Neuroblasts also tend to be large and can be selectively enriched by centrifugal elutriation and adherence to glass (Furst and Mahowald, 1985). However, most Drosophila embryonic cells, at least during early embryogenesis, are rather unexceptional in morphology and hence may not be amenable to purification by methods based solely on such physical characteristics. Methods for purifying these cells must rely on other properties of the cells, such as expression of cell type-specific molecular markers. Surface markers have been widely used in mammalian systems to isolate specific cell types, particularly cells of the immune system (Parks et al., 1989). Antibodies that recognize specific cell surface antigens are commonly employed in the purification by using the antibodies to fluorescently label the cells followed by flow cytometry/fluorescence-activated cell sorting (FACS) or by coupling the antibodies to a solid phase and selectively resorbing the cells of interest ("panning") (Wysocki and Sato, 1978). These techniques have not been applied to Drosophila, at least in part because few antibodies to cell type-specific surface antigens have been available until recently. However, in Drosophila, many intracellular markers are known, perhaps the most important of which is the Escherichia coli lacZ (β-galactosidase) gene, which is not normally present but is easily introduced by P-element-mediated transformation.
Thousands of different strains expressing lacZ under control of various cell- and tissue-specific promoters and regulatory elements have been constructed, many by random insertion of a lacZ transposon such that lacZ expression comes under the control of an endogenous enhancer or regulatory element ("enhancer trap") (O'Kane and Gehring, 1987; Bier et al., 1989; Bellen et al., 1989). We have established a method, called whole animal cell sorting (WACS), for purifying the β-galactosidase expressing cells from such transgenic strains by FACS (Krasnow et al., 1991). The key technical innovation that opened the way to this approach was the development of a viable, fluorogenic β-galactosidase substrate (fluorescein di-β-D-galactopyranoside) that was shown to be effective in the analysis and purification of cultured mammalian cells engineered to express β-galactosidase (Nolan et al., 1988; Fiering et al., 1991). The general scheme for WACS is as follows (Fig. 1). (1) Embryos carrying a lacZ transgene expressed in a specific cell type are grown to the desired developmental stage. (2) Cells of the developing embryos are dissociated and stained with FDG and then stained with a viable cell stain and a dead cell stain. 0 [Fig. 1. General scheme for WACS: embryo with lacZ transgene; grow to desired developmental stage; dissociate cells; stain with a fluorogenic β-galactosidase substrate (FDG); stain with vital (CBAM) and dead cell dyes; purify live, β-galactosidase-expressing cells by FACS; then analyze directly, culture in vitro, or transplant into recipient embryo.] 0 II. Purifying Embryonic Cells by Fluorescence-Activated Cell Sorting 0 A. Equipment and Reagents 0 Flow Cytometer/FACS Instrument 0 We have used a modified Becton Dickinson FACStar Plus flow cytometer, equipped with two argon-ion lasers.
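The sorting step of the WACS scheme described above amounts to a logical gate on three fluorescence readings per cell: signal from the FDG reaction, signal from the viable cell stain, and signal from the dead cell stain. The following is a rough sketch of that logic only. The `Event` class, the field names and all threshold values are hypothetical; real gate boundaries are set on the instrument from control samples.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One cell's fluorescence readings, in arbitrary units."""
    fdg_signal: float    # fluorescein released from FDG by beta-galactosidase
    viable_stain: float  # signal from the viable cell stain
    dead_stain: float    # signal from the dead cell stain


# Hypothetical gate boundaries, chosen only to make the example run.
MIN_FDG = 100.0
MIN_VIABLE = 50.0
MAX_DEAD = 20.0


def in_sort_gate(event):
    """True for live cells that also show beta-galactosidase activity."""
    is_live = event.viable_stain >= MIN_VIABLE and event.dead_stain <= MAX_DEAD
    return is_live and event.fdg_signal >= MIN_FDG


events = [
    Event(fdg_signal=450.0, viable_stain=120.0, dead_stain=5.0),   # live, lacZ-positive
    Event(fdg_signal=10.0, viable_stain=130.0, dead_stain=4.0),    # live, lacZ-negative
    Event(fdg_signal=500.0, viable_stain=8.0, dead_stain=300.0),   # dead cell
]
sorted_cells = [e for e in events if in_sort_gate(e)]
print(len(sorted_cells), "of", len(events), "events fall in the sort gate")
```

Requiring both a viable-stain signal and the absence of dead-stain signal is stricter than using either stain alone.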
Dual laser flow cytometry, data collection, and multiparameter analysis are performed essentially as described by Parks et al. (1986, 1989). One argon-ion laser (488 nm, 400 mW output) is used to generate four signals: forward light scatter, large angle light scatter, fluorescein (detected through a 530/30-nm bandpass filter), and propidium iodide (detected through a 575/26-nm bandpass filter). A second argon-ion laser was used as an ultraviolet light source (351-363 nm, 50 mW) to excite calcein blue, whose emission was detected through a 405/20-nm filter. Data collection and multiparameter analysis are carried out on a Digital VAX computer system using the FACS/DESK software (Moore and Kautz, 1986). For applications in which the highest degree of cell purity and viability are not required, calcein blue staining can be omitted and a single laser flow cytometer (488 nm excitation) used for cell isolation. 0 Fluorescent Dyes and β-Gal 0 Fluorescence-activated cell sorting (FACS) of Drosophila hemocytes reveals important functional similarities to mammalian leukocytes 1 Rabindra Tirouvanziam*, Colin J. Davidson, Joseph S. Lipsick, and Leonard A. Herzenberg* 0 Drosophila is a powerful model for molecular studies of hematopoiesis and innate immunity. However, its use for functional cellular studies remains hampered by the lack of single-cell assays for hemocytes (blood cells). Here we introduce a generic method combining fluorescence-activated cell sorting and nonantibody probes that enables the selective gating of live Drosophila hemocytes from the lymph glands (larval hematopoietic organ) or hemolymph (blood equivalent). Gated live hemocytes are analyzed and sorted at will based on precise quantitation of fluorescence levels originating from metabolic indicators, lectins, reporters (GFP and β-galactosidase) and antibodies.
With this approach, we discriminate and sort plasmatocytes, the major hemocyte subset, from lamellocytes, an activated subset present in gain-of-function mutants of the Janus kinase and Toll pathways. We also illustrate how important, evolutionarily conserved, blood-cell-regulatory molecules, such as calcium and glutathione, can be studied functionally within hemocytes. Finally, we report an in vivo transfer of sorted live hemocytes and their successful reanalysis on retrieval from single hosts. This generic and versatile fluorescence-activated cell sorting approach for hemocyte detection, analysis, and sorting, which is efficient down to one animal, should critically enhance in vivo and ex vivo hemocyte studies in Drosophila and other species, notably mosquitoes. 0 Studies focusing on hematopoiesis and innate immunity in the model organism Drosophila melanogaster have identified extensive homologies between Drosophila hemocytes (blood cells) and mammalian leukocytes. Whole-animal functional studies have suggested that Drosophila hemocytes participate in similar activities to mammalian leukocytes, including phagocytosis/encapsulation of pathogens, release of reactive oxygen species (ROS) and reactive nitrogen species and antimicrobial peptides, activation of humoral serine protease cascades, scavenging of dead bodies, wound repair, and extracellular matrix deposition (1-6). Molecular genetic studies have unravelled important evolutionarily conserved regulatory elements, including transcription factors of the Runt/acute myelogenous leukemia (7), GATA (8), and Polycomb (9) families and integral transduction cascades, including the immune deficiency/tumor necrosis receptor (2), Toll/IL-1 receptor (2), Janus kinase (10, 11), mitogen-activated protein kinase (12), Notch (13), steroid (14), and vascular endothelial growth factor (15) pathways.
Compared to mammalian species, Drosophila is particularly well suited to study the molecular genetics of blood cell development and function, thanks to the existence of a well annotated genome database, assorted genetic tools, and large mutant collections (16). By contrast, the lack of single-cell assays for Drosophila hemocytes severely restricts the scope of cellular studies (10, 11). Accordingly, our knowledge of Drosophila hemocyte subsets and functions remains very limited. In mammals, the use of fluorescence-activated cell sorting (FACS) has driven much of the progress in subset discrimination and functional analysis of leukocytes (17). Current three-laser, ``multidimensional,'' FACS machines enable up to 14 simultaneous 0 Drosophila Stocks. Stocks used in this study include y, w67 (control), Tum-l [Janus kinase gain-of-function mutant (24)], and Toll10B [Toll gain-of-function mutant (25)]. The Tum-l 11707 line was generated by crossing the Tum-l line and the LacZ enhancer-trap line, 11707 (26). The GAL4-e33c upstream activating sequence (UAS)-gfp strain was generated by crossing flies carrying the GAL4-e33c enhancer trap (27) to flies carrying the gfp transgene under control of the UAS (GAL4 response element), thus achieving constitutive GFP expression in hemolymph and lymph glands hemocytes. For in vivo transfers, we used two GFP-expressing lines: His::GFP [ubiquitous expression of a fusion protein between histone His2AvD and GFP (28)] and Tum-l; His::GFP (generated by standard crossing). Stocks were fed standard cornmeal, molasses, yeast, and agar medium and were maintained at 25°C. Late wandering third instar larvae were used for all experiments because they show maximal hemocyte numbers in lymph glands and hemolymph (6, 14). 
0 Abbreviations: DHR, dihydrorhodamine 123; FACS, fluorescence-activated cell sorting; GSB, glutathione-S-bimane; GSH, glutathione; LacZ, β-galactosidase; MCB, monochlorobimane; PI, propidium iodide; ROS, reactive oxygen species; UAS, upstream activating sequence; WGA, wheat germ agglutinin. 0 by The National Academy of Sciences of the USA 0 CELL BIOLOGY 0 Hemocyte Collection. Hemolymph cells were collected by rupturing the larval cuticle with a pair of fine forceps. For the collection of lymph glands cells, lymph glands were carefully dissected out, rinsed, and ruptured by repeated pipetting with siliconized tips. Cells were collected in ice-cold Schneider's medium (Invitrogen GIBCO) containing 1× complete mini protease inhibitor mixture (Roche Applied Science) to prevent melanization, clump formation, and autolysis and kept on ice until incubation with FACS probes. Most analyses were performed with cells from 5-10 animals. However, several analyses were also performed with cells from one animal to validate single-animal hemocyte assays with both hemolymph- and lymph glands-derived hemocytes. 0 Tirouvanziam et al. 0 FACS Probes and Staining Procedures. The main probes validated so 0 For this purpose, H2, antilamellocyte antibody (L1a), and antiplasmatocyte antibody (P1b 0 GAL4 Enhancer Trap Targeting of the Drosophila Sex Determination Gene fruitless 1 Anthony J. Dornan,1 Donald A. Gailey,2 and Stephen F. Goodwin1* 0 INTRODUCTION The Drosophila sex-determination gene fruitless (fru) encodes transcription factors with a conserved BTB/ POZ dimerization domain at the amino terminus and one of four alternatively spliced zinc-finger domains at the carboxyl terminus (Ito et al., 1996; Ryner et al., 1996; Goodwin et al., 2000; Usui-Aoki et al., 2000).
With at least four identified promoters (designated P1, P2, P3, and P4) and both sex- and nonsex-specific alternative splicing, the gene's molecular complexity speaks to fru's pleiotropy (Ito et al., 1996; Ryner et al., 1996; Goodwin et al., 2000; Usui-Aoki et al., 2000; Anand et al., 2001). For example, fru regulates not only sex-specific aspects of the male nervous system associated with sexual behavior, but also other aspects of development com- 0 mon to both sexes (Anand et al., 2001; Song et al., 2002; Song and Taylor, 2003). Transcripts from the P1 promoter undergo sex-specific alternative splicing (Ryner et al., 1996; Heinrichs et al., 1998; Goodwin et al., 2000; Usui-Aoki et al., 2000), leading to a class of Fru proteins (FruM) that are present only in males (Lee et al., 2000). FruM proteins are expressed exclusively in the central nervous system (CNS) (Lee et al., 2000) and subserve the establishment of stereotypical male courtship behaviors, such as the ability of males to bend the abdomen in order to initiate mating, generation of a species-specific courtship song, fertility, and the concomitant differentiation of male-specific serotonergic innervation of parts of the internal reproductive organs and of a male-specific neuronally determined abdominal muscle, the muscle of Lawrence (MOL) (Gailey et al., 1991, Ito et al., 1996; Ryner et al., 1996; Goodwin et al., 2000; Usui-Aoki et al., 2000; Lee and Hall, 2000, 2001; Lee et al., 2001; Billeter and Goodwin, 2004; Manoli and Baker, 2004). fru also performs nonsex-specific essential roles in the development of the fly (Lee et al., 2000; Anand et al., 2001; Song et al., 2002; Song and Taylor, 2003). Genetic analysis of fru mutants demonstrated that P3- (and perhaps P4-) derived transcripts are necessary for viability in the adult and for fru's nonsex-specific functions (Ryner et al., 1996; Goodwin et al., 2000; Lee et al., 2000; Anand et al., 2001). 
Application of an antibody capable of detecting all classes of fru proteins (anti-FruCom; Lee et al., 2000; Song et al., 2002) showed that the other promoters (P2, P3, and P4) produce nonsexually dimorphic products with differing spatial and tem- 0 FRUITLESS GAL4 ENHANCER TRAP LINES 0 FruCom expression (Lee et al., 2000), a pattern that reflects P3- and P4-derived transcript expression. Given the lack of information pertaining to the function of these transcripts, the availability of a novel GAL4 element that recapitulates the associated endogenous FruCom expression provides a unique avenue to investigate the essential roles of these promoters and the sex-specific and nonsex-specific functions of fruitless in Drosophila development. RESULTS Molecular Verification of the Precise Replacement Events Using a targeted transposition strategy, 10 lines were confirmed to have precisely replaced the extant fru4 (P[PZ]) element insert with the donor GAL4 (P[GawB]) element at the original point of insertion (Fig. 1). Southern blots, PCR amplification of regions spanning the junction between the gene and the inserted P-element, and direct sequencing of these products confirmed the absence of the original element, the presence of a single GAL4 P-element for each replacement line, and that no deletions, either of the element itself or of the flanking regions of the locus, had occurred (data not shown; Gloor et al., 1991; Johnson-Schlitz and Engels, 1993; Sepp and Auld, 1999). This also determined the orientation of the inserted P-element. The original fru4 element is oriented such that the rosy+ marker gene is expressed from the same strand as fru (Goodwin et al., 2000), designated the ``same'' 0 letters to nature 0 Median bundle neurons coordinate behaviours during Drosophila male courtship 1 Devanand S. Manoli1,2 & Bruce S.
Baker2 0 Throughout the animal kingdom the innate nature of basic behaviour routines suggests that the underlying neuronal substrates necessary for their execution are genetically determined and developmentally programmed1-2. Complex innate behaviours require proper timing and ordering of individual component behaviours. In Drosophila melanogaster, analyses of combinations of mutations of the fruitless (fru) gene have shown that male-specific isoforms (FruM) of the Fru transcription factor are necessary for proper execution of all steps of the innate courtship ritual3-9. Here, we eliminate FruM expression in one group of about 60 neurons in the Drosophila central nervous 0 Nature Publishing Group 0 Males in which FruM expression had been eliminated in median bundle neurons by the P52a-GAL4-directed expression of UAS-fru MIR (P52a/fru MIR) were used in standard courtship assays (see Methods) to assess the FruM-dependent roles of these neurons in courtship. In P52a/fru MIR males, courtship latency, the period from the initial presentation of a virgin female to the initiation of courtship behaviour (defined here as wing extension; Fig. 1a), decreased (8 ± 1 s (± s.e.m.) for P52a/fru MIR males, compared with 94 ± 8 s for control males; Fig. 3a and Table 1). However, P52a/fru MIR males can still distinguish females from males, because they do not sustain courtship towards each other or towards control males (data not shown), unlike previously described mutants that exhibited a rapid initiation of courtship towards both virgin females and mature males13. We did several controls to ensure that the rapid initiation of courtship seen in P52a/fru MIR males is the consequence of blocking FruM expression in these 60 median bundle neurons. All of the individual transgenes used in these studies were backcrossed into a common genetic background before use.
For each of these transgenes the courtship behaviours of males carrying that transgene alone did not differ from our controls (Fig. 3a). Additionally, the P52a-GAL4-directed expression of a UAS-traF transgene (Fig. 1b) also eliminates FruM expression in these 60 neurons (data not shown) and reduces courtship latency (10 ± 2 s versus 94 ± 8 s) (Fig. 3a and Table 1). On the basis of these and other controls (see Methods), we conclude that it is the elimination of FruM protein expression in the ~60 median bundle neurons, through the P52a-driven expression of UAS-fru MIR, that is responsible for the decreased courtship latency. To address whether rapid courtship by P52a/fru MIR males was a reflection of general heightened activity, we performed short-, intermediate- and long-term locomotor assays on both control and P52a/fru MIR males14 (Table 1 and Fig. 3b). There were no significant differences in their activity (see Methods), suggesting that the behavioural differences observed in P52a/fru MIR males are specific to courtship. The longer courtship latency seen in wild-type relative to P52a/fru MIR males suggests that initiation of courtship by wild 0 Dispatch R23 0 Sexual Behaviour: Do a Few Dead Neurons Make the Difference? 0 Why do males and females behave so differently? Sexually dimorphic neural circuitry has just been found in parts of the fly's brain thought to control mating behaviour. Might this explain why males and females have such distinct sexual behaviours? 1 Jai Y. Yu and Barry J. Dickson 0 Males and females of most species behave rather differently, particularly when it comes to sex. This makes sexual behaviours attractive models for trying to understand innate behaviours in general.
Instead of trying to identify all the genes and all the neurons involved in a given behaviour, and then figure out how they all work, one can just look for the genes and neurons that make the sexes different, and try to understand how these genes and neurons shape the distinct sexual behaviours of males and females. In what might be a major step towards this goal, Kimura et al. [1] have now discovered a clear difference in neural circuitry in the brains of male and female fruit flies. This difference, they speculate, might just explain why male flies do the male thing and females do not.
Fly sex is a complicated business. To woo a female, the male must perform an elaborate song-and-dance courtship ritual [2]. The fruitless (fru) gene, the RNA transcript of which is spliced differently in males and females, plays a key role during development to lay the foundation for this behaviour (Figure 1). In males, fru RNA is spliced in such a way as to encode male-specific FruM proteins. Males that lack the fru gene [3], or splice it the wrong way [4], make a complete mess of the courtship ritual. For the most part, they do not even bother, and if they do, they are just as likely to try to woo another male as a female. What is more, females that splice fru RNA in the male way, and therefore make FruM, behave like males and try to woo other females [4]. So, genetically, fru seems to account for much of the difference between male and female sexual behaviour.
Can fru also lead us to the neuronal circuits in the brain that make the difference? It turns out that FruM is made in 3000 neurons in the male brain, or 3% of the total number of neurons [5]. These neurons are grouped into distinct clusters in various regions of the brain. Are these neurons also present in females, and if so, what is different about them? Because the female fru transcripts do not encode FruM, it has been rather difficult to identify cells in females that correspond to the FruM-expressing cells in males.
To circumvent this problem, two groups [6,7] recently used gene targeting to insert coding sequences for an independent marker (GAL4) into the fru locus, replacing the alternatively spliced exon so that the marker would be produced in both males and females. Surprisingly, these studies revealed that almost all of the FruM-producing neurons in the male have counterparts in the female, and at a gross level, they seem to be wired up the same way. Of course, this does not exclude more subtle differences in neuroanatomy, but without knowing which of these 3000 neurons make the essential difference, there seemed little point in going on to examine them all at higher resolution.
Kimura et al. [1] took a different line of attack, both technically and strategically. They isolated a random enhancer trap insertion further downstream in the fru locus, called NP21 (Figure 1). NP21 labels many, but not all, of the FruM neurons in males, as well as the corresponding cells in females. Kimura et al. [1] then went on to characterize some of these neurons at higher resolution, undeterred by the lack of behavioural data to indicate which of them might be the most relevant. Nevertheless, two sets of NP21-positive neurons clearly differed anatomically in males and females (Figure 1). One of these, belonging to the so-called fru-mAL cluster [5], particularly attracted their attention. These neurons seem to serve as a relay between the primary gustatory centre of the brain and higher brain regions thought to integrate information from multiple sensory modalities. There are, on average, about 30 NP21-positive fru-mAL cells in males and about five in females.
[Figure 1 (Current Biology). Panel labels: No FruM; Early-born fru-mAL neurons; Late-born fru-mAL neurons; Male sexual behaviour?; Female sexual behaviour?]
In a clever set of cell-labelling and lineage-tracing experiments, Kimura et al.
[1] found that these cells all derive from a common precursor which, in males, gives rise to two distinct classes of neurons: early-born neurons with contralateral dendritic projections, and later-born neurons with bilateral projections. In females,
need to find out what, if anything, such sex-specific circuits contribute to the all-important difference in sexual behaviour between males and females.
Kimura, K., Ote, M., Tazawa, T., and Yamamoto, D. (2005). Fruitless specifies sexually dimo
Functional analysis of fruitless gene expression by transgenic manipulations of Drosophila courtship
Adriana Villella*, Sarah L. Ferri, Jonathan D. Krystal, and Jeffrey C. Hall*
A gal4-containing enhancer-trap called C309 was previously shown to cause subnormal courtship of Drosophila males toward females and courtship among males when driving a conditional disrupter of synaptic transmission (shiTS). We extended these manipulations to analyze all features of male-specific behavior, including courtship song, which was almost eliminated by driving shiTS at high temperature. In the context of singing defects and homosexual courtship affected by mutations in the fru gene, a tra-regulated component of the sex-determination hierarchy, we found a C309 traF combination also to induce high levels of courtship between pairs of males and "chaining" behavior in groups; however, these doubly transgenic males sang normally. Because production of male-specific FRUM protein is regulated by TRA, we hypothesized that a fru-derived transgene encoding the male (M) form of an Inhibitory RNA (fruMIR) would mimic the effects of traF; but C309 fruMIR males exhibited no courtship chaining, although they courted other males in single-pair tests. Double-labeling of neurons in which GFP was driven by C309 revealed that 10 of the 20 CNS clusters containing FRUM in wild-type males included coexpressing neurons.
Histological analysis of the developing CNS could not rationalize the absence of traF or fruMIR effects on courtship song, because we found C309 to be coexpressed with FRUM within the same 10 neuronal clusters in pupae. Thus, we hypothesize that elimination of singing behavior by the C309 shiTS combination involves neurons acting downstream of FRUM cells.
reproductive behavior | C309 enhancer trap | shiTS transgene | traF transgene | inhibitory fru RNA transgene
Various portions of the CNS in Drosophila melanogaster are inferred to control separate elements of normal male courtship (e.g., refs. 1 and 2), in part by analysis of abnormal behavior (e.g., refs. 3-7). Some such studies have involved brain-behavioral analyses of the fruitless (fru) gene and its mutants (reviewed in ref. 8). Different fru mutants exhibit courtship subnormalities to varying degrees and at separate stages of the courtship sequence, depending on the mutant allele (e.g., refs. 9-12). Most fru mutants court other males substantially above levels normally exhibited by pairs or groups of wild-type males (e.g., refs. 12 and 13). The original fruitless mutation leads to spatially nonrandom decreases of fru-product presence (14, 15) within particular subsets of the normal CNS expression pattern (16, 17), which may be causally connected with the breakdown of recognition that is a salient effect of fru1 on male behavior (9, 12). fru-like courtship can be induced by the effects of a transgene that encodes GAL4 (a transcription factor derived from yeast). When this C309 enhancer trap was combined with a GAL4-drivable factor containing a dominant-negative, conditionally expressed variant of the shibire gene (shiTS), heat treatment of doubly transgenic males caused them to court females subnormally and to court other males vigorously (18). Although this strain had been termed a mushroom body enhancer trap in terms of the gal4 sequence it contains, being expressed "predominantly" within that dorsal-brain structure (19, 20), Kitamoto revealed that C309 drives marker expression in a widespread manner (18). Therefore, we sought to correlate various CNS regions in which this transgene is expressed with its effects on male behavior, emphasizing a search for "C309 neurons" that might overlap with elements of the FRUM pattern. We also entertained the possibility that the C309 shiTS combination causes a mere caricature of fruitless-like behavior. Therefore, what would be the courtship effects of C309 driving a transgene that produces the female form of the transformer gene product? This TRA protein participates in posttranscriptional control of fru's primary "sex transcript," so that FRUM protein is not produced in females (reviewed in ref. 8; also see refs. 16 and 21). If C309 and traF are naturally coexpressed in a subset of the to-be-analyzed neurons, feminization of the overlapping cells should eliminate this protein. We extended these transgenic experiments to target fruitless expression specifically by gal4 driving of an inhibitory RNA (IR) construct, which was generated with fru DNA by Manoli and Baker (22). Their experiments furnish one object lesson as to how "enhancer-trap mosaics" can delve into the neural substrates of a complex behavioral process, an approach commonly taken to manipulate brain structures and functions in courtship experiments (2-7). Because few genetic loci putatively identified by such transposons have been specified, the tactics we applied are in the context of CNS regions in which expression of a "real gene" is hypothesized to underlie well defined behaviors.
Materials and Methods
Supporting Information. For further details, see Tables 3-5 and
Stocks of D. melanogaster, Crosses, and Fly Handlings. Cultures were maintained as in ref. 23.
Pure control males came from a Canton-S wild-type (WT) stock. Other control types were male progeny of a given transgenic strain (see below) crossed to Canton-S. Adult males and females were collected and stored as in refs. 12 and 23 (see below for exceptions). The enhancer-trap line C309 (19) is homozygous for a gal4-containing transposon inserted into chromosome 2; such females were crossed separately to males carrying the following transgenes: UAS-shiTS (homozygous on chromosome 3), which disrupts synaptic transmission in a heat-sensitive manner under the control of a given gal4-containing, neurally expressed transgene (24); UAS-traF (homozygous on chromosome 2), which, when GAL4-driven, causes the female form of transformer (tra) mRNA to be produced (e.g., refs. 3 and 4); UAS-fruMIR [inserted into both the second and third chromosomes, the former heterozygous for the transgene and In(2LR)O,Cy, the latter homozygous], designed to produce a double-stranded IR that blocks production of male (M)-specific protein encoded by the endogenous fru gene (22); and UAS-egfp (homozygous on chromosome 2), which encodes an "enhanced" nuclear form of GFP (25).
PNAS Early Edition
INAUGURAL ARTICLE
Most culture rearings occurred at 25°C; but those involving UAS-fruMIR were effected separately at 25°C and 29°C, because the hotter condition was reported to accentuate the inhibitory effects of this transgene (22). Histochemistry involving effects of traF or fruMIR on the presence of FRUM in C309-expressing neurons used females from a stock carrying both C309 and UAS-egfp on the second chromosome (generated by meiotic recombination), crossed to UAS-traF or to "double-insert" UAS-fruMIR males. Additional transgene combinations used females from a C309 C309 Cha-gal80 In(3LR)TM6B,Hu transgenic stock, crossed separately to UAS-shiTS, UAS-traF, UAS-fruMIR, or UAS-egfp males; triply transgenic progeny should have gal4 driving eliminated in neurons that coexpress gal80 (see ref.
26) under the control of regulatory sequences from the Choline acetyltransferase (Cha) gene (see refs. 18 and 27).
Behavior. Basic courtship quantification. Audio/video recordings were obtained and processed as in refs. 12 and 23, but most of the current records were captured with a Sony VX2100 digital camera. For transgenic-male/WT-female pairings, the two types of flies were readily distinguishable despite the largely feminized external appearance of XY flies carrying C309 UAS-traF or C309 UAS-traF Cha-gal80. For transgenic-male/WT-male observations involving UAS-shiTS or UAS-fruMIR, the two male types look the same, so each WT male had the tip of one wing clipped off at the time of collection. Males including UAS-shiTS were stored at 25°C (permissive temperature) before testing. For restrictive-temperature observations, a male- and food-containing tube was placed in a 30°C water bath for 20-40 min, then aspirated into a mating cell for recording at 30°C. For permissive-temperature controls, test males remained in food containers at 25°C before transfer into female-containing chambers at that temperature. Recordings were converted to computerized files, and behaviors were "logged" and analyzed by using LIFESONGX (http://lifesong.bio.brandeis.edu, compare ref. 28) to compute percentages of observation periods during which any interfly interactions occurred (courtship index, CI) or courtship wing displays (wing extension index, WEI).
Song sounds. Digitized audio tracks were logged then analyzed (as in refs. 12 and 23), leading to computations of the parameters specified in Table 3.
Mating behaviors. Attempted copulations, mating-initiation latencies, and copulation successes were quantified for several fly pairs in a plastic device (see ref. 1), at 25°C or at 30°C for tests involving shiTS.
Courtship chaining.
Eight to 10 males of a given genotype were grouped in a food vial upon collection, stored for 3-4 days (at 25°C or 20°C), and then hand-timer recorded at 25°C for the amount of time that at least three males spent
NEWS & VIEWS
If decreasing atmospheric CO2 stabilized the glacial state in the Oligocene, might increasing atmospheric CO2 from fossil-fuel burning destabilize it in the future? The lesson to be learned here is that we should watch for subtle signs that we are moving from the icehouse world in which Earth has remained for 34 million years into a new, greenhouse world.
BEHAVIOURAL GENETICS
Sex in fruitflies is fruitless
Charalambos P. Kyriacou
The courtship rituals of fruitflies are disrupted by mutations in the fruitless gene. A close look at the gene's products -- some of which are sex-specific -- hints at the neural basis of the flies' behaviour.
Tra-binding sequences.) Similarly, Tra protein binds to the doublesex (dsx) gene and splices it in male- and female-specific modes (DsxM and DsxF, respectively)8. The DsxM and DsxF transcription factors mainly determine sexual morphologies8, but the sexual identity of the nervous system is shaped by fru. By forcing males to express the female-specific fruF transcript, Demir and Dickson1 produced males that showed the characteristics of the worst-affected fru mutants. These males were sterile, they barely courted females and they were more interested in courting males, forming courtship chains. By contrast, females jammed into fruM mode mated poorly, produced very few eggs, but -- astonishingly -- courted other females (Fig. 2), even to the point of forming chains. And an identity crisis of similar epic proportions was observed in females that were 'masculinized' using a different fru-related genetic trick3.
Finally, by feminizing specific abdominal glands in males to produce female pheromones, and placing the altered males with fruM females, the sex roles were reversed, so that the females courted the males1. In another nifty piece of genetic engineering, both teams2,3 generated flies in which they could, among other things, mark the parts of the nervous system (just 2%) that show sex-specific expression of Fru. Further genetic manipulations showed that high levels of male-male courtship result when the communication between these neurons is shut down, or when fruM expression in these neurons in males is inhibited2,3. Both studies found that the central nervous system of males and females looked very similar in terms of sex-specific fru expression, with few differences between the sexes in the numbers, positions or wiring of cells expressing Fru. The fru products were found in almost all sensory organs that have been implicated in courtship2,3. Olfactory sensory neurons showed some evidence for sexual dimorphisms. Those receptors that respond to pheromones project to certain other brain regions that are larger in males than females, reflecting the fact that sex pheromones have a greater functional significance in male Drosophila2. By reversibly shutting down the fru-expressing olfactory receptors, both in males and in masculinized females in the
other females -- apparently because of a genetic factor(s) on chromosome 2 (fru is on chromosome 3). Might this long-lost strain have carried a mutation in one of the fru target genes? The work discussed here may well find itself the focus of attention for those interested in the debate (scientific and political) on the genetic versus environmental bases of human sexuality. Perhaps we should remind ourselves that normal fly sexual preferences, unlike human sexual behaviour, cannot be modulated to any significant extent by altering experience11.
Shaken on impact
Erik Asphaug
A single recent impact may have modified the craters on the asteroid Eros into the pattern we see today. This finding has implications for how we view the structure of asteroids -- and for addressing any hazards they present.
Asteroids seem to get stranger with every passing year. Thomas and Robinson's finding (page 366 of this issue)1 -- that impact-induced vibrations of an asteroid may be the dominant mechanism reshaping its surface -- shakes things up still further. In the case of the well-studied asteroid Eros, the authors link this resurfacing mechanism to the recent impact of a meteoroid that left a particularly large crater. They thereby make the first detailed mechanical connection between surface observations and an asteroid'