0	scientific  comment
0	Acta  Crystallographica  Section  D
0	Biological  Crystallography
0	ISSN  0907-4449
0	WWWWhy  does  nature  stutter?  A  survey  of  strands  of  repeated  amino  acids
1	Edgar  F.  Meyer*  and  W.  John  Tollett  Jr²
0	Human stuttering is a simple example of the repetition of sounds or symbols, sometimes associated with single letters, and may be used to illustrate the amazing repetition of amino acids (symbolized by a letter, e.g. W) in proteins. A survey of available databases with highly improbable strings of single amino acids is tabulated. This paper concludes with a challenge to the crystallographic community to probe the structural origins of the structure–function relationship in this neglected area. When nature stutters, we should pay attention.
0	Current  address:  A&M  Consolidated  High  School,  College  Station,  TX  77840,  USA.
0	Introduction
0	That  34  virus  structures  were  detected  suggests  that  this  model  may  be  overly  simplistic  and  that  crosscorrelations  may  occur,  but  our  purpose  here  is  to  report  a  finding  and  encourage  others  to  explore  its  implications,  be  they  probabilistic,  statistical,  genetic,  functional  or  structural.
0	International Union of Crystallography. Printed in Denmark – all rights reserved
0	As gene, protein and structural databases were searched, who would have guessed that 67 consecutive threonines would be found in Cryptosporidium parvum (Barnes et al., 1998)? The probability of 67 repeats in a random sequence at a specific site is approximately 1 in 20^67 = 1/(1.5 × 10^87) events; the difference in probabilities is exponentially significant. Even though this statistical approximation begs for a more rigorous treatment, it is amazing. WWWWhat is nature telling us? Long consecutive strands of positively or negatively charged amino acids must carry electrostatic penalties, yet these too abound. In a nuclear transport protein (PDB code 1qbk), polyaspartate is augmented by two glutamates to create a startling exposed strand of 14 consecutive negatively charged residues. Intuitively, one could assume that uncharged amino acids would be more likely to occur repetitively, but polymethionine also has a relatively low occurrence (7). Because of pronounced peptide backbone angular constraints, proline was considered to be a 'helix breaker', but polyPro actually forms a left-handed helix (1jvr). In HIV-1 reverse transcriptase (1c9r; residues 315–326), an extended polyAla strand is parallel to an α-helix that is also rich in Ala. Conversely, a 12-Ala repeat forms a cluster of three α-helices at the tip of a tumor necrosis factor receptor (1czz). At this stage, it appears that while polyPro may be structurally conserved, polyAla is not. PolyCys is one of the few repeat sequences which is generally buried, forming a tight trimer knot in a spider toxin (1qdp), a triple S–S knot (1ag8), and a tight buried loop central to an amazing chain of seven S–S linkages in the ferric hydroxamate uptake receptor (1cw3, 1a4z).
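The naive estimate quoted above can be reproduced with a short sketch. It assumes 20 equally likely, independent residues at each position, which is the simplification the text itself flags as needing more rigorous treatment. The function names and the toy sequence are illustrative assumptions and are not taken from any database entry.

```python
from itertools import groupby
from math import log10


def longest_run(seq):
    """Return (residue, length) of the longest run of one repeated letter."""
    best = ("", 0)
    for residue, group in groupby(seq):
        n = sum(1 for _ in group)
        if n > best[1]:          # ties keep the first run found
            best = (residue, n)
    return best


def log10_run_probability(n, alphabet=20):
    """log10 of the chance that n given positions all hold one specified
    residue, assuming independent and equally likely residues (naive model)."""
    return -n * log10(alphabet)


# Hypothetical toy sequence: 67 threonines flanked by a few other residues.
toy = "MK" + "T" * 67 + "GA"
residue, n = longest_run(toy)
print(residue, n, round(log10_run_probability(n), 2))
```

Working in log10 keeps the number readable: a value near -87.17 corresponds to roughly 1 in 1.5 × 10^87.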
These  searches  reveal  a  wide  range  of  structures,  populations  and  probabilities,  summarized  by  abbreviated  tables  [tables  also
0	Acta Cryst. (2001). D57, 181–186
0	Table  1
0	GenBank  results,  23  June  2000.
0	=$key&id=1);  the  related  Chime  links  will  make  the  structural  results  more  readily  accessible  to  a  broader  audience].  While  some  entries  of  gene  sequences  are  deposited  without  comment  and/or  literature
0	citation  (Table  1),  many  protein  sequence  entries  (e.g.  PIR,  SwissProt,  EMBL)  are  cited  (Table  2)  and  infer  functional  roles.  Although  smallest  in  size,  the  Protein  Data  Bank  (Bernstein  et  al.,  1977;  Meyer,  1997;
0	Amino  acid  Alanine
0	Residues  129–148  129–148  497–517  497–517  241–260  241–260  241–260  241–260  241–260  13–42  138–187  24–69  720–768  266–311  50–95  777–822  285–325  11–33  1856–1900  362–402  152–191  58–95  58–95
0	GenBank  ID#  GBINV:DMJ001164  GBINV:AE003814  GBINV:DMU11383  GBINV:DMOVO  GBPRI:AF117979  GBPRI:D82344  GBROD:MMPHOX2B  GBROD:AB015672  GBPRI:AB015671  GBPRI:HUMFMR1  GBINV:DDU38197  GBINV:AF019981  GBINV:DDI238883  GBINV:AF104350  GBINV:AE001416  GBINV:AE001418  GBPLN:F11A17  GBPRI:HSU63332  GBINV:AF153362  GBVRT:CCJ002238  GBPRI:HSU80741  GBPRI:HUMTFIIDA  GBPRI:HS191N21
0	Arginine  Asparagine
0	GBPRI:HUMTFIID  GBINV:AF024654  GBINV:AE003446  GBROD:MMJ225123  GBROD:AF028737  GBPLN:SCYBR289W  GBPLN:SCDPB3  GBPLN:YSCSNF5  GBINV:AE003536  GBPLN:ATF17C15  GBPLN:ATF23E13  GBPLN:ATCHRIV85  GBPRI:HUMARB  GBPRI:L29496  GBPRI:HSU16371  GBPLN:ATAC011708  GBINV:AE003451  GBINV:AE003430  GBINV:DMSEG0007  GBVRL:AF169823  GBINV:CELC15C7  GBSYN:AF025672
0	ANALYTICAL  BIOCHEMISTRY
0	Effects  of  relative  humidity  and  buffer  additives  on  the  contact  printing  of  microarrays  by  quill  pins
1	Mark  K.  McQuain,a  Kevin  Seale,b  Joel  Peek,b  Shawn  Levy,c  and  Frederick  R.  Haseltona,*
0	Abstract DNA microarrays printed with quill pins exhibit significant variation in probe DNA spots. Interspot variations and nonuniform distribution of probe within spots are major sources of experimental uncertainty in microarray analysis. To gain better insight into the sources of variation, we analyzed 450 consecutive depositions printed at relative humidities between 40 and 80% using three print buffers. Increasing relative humidity improved printing performance by delaying pin failure but did not reduce the variability in spot characteristics. Adding either betaine or dimethyl sulfoxide (DMSO) to the print buffer also improved quill pin performance. Least interspot variation was observed with the DMSO additive printed at 80% relative humidity, but this additive also resulted in the greatest intraspot variation. Least intraspot variation was observed with 1.5 M betaine printed at 60% relative humidity, but these conditions produced microarrays with high interspot variability. Evaporation of printing solution from the quill reservoir appeared to be the primary cause of interspot and intraspot variations. Our studies indicate that relative humidity and printing solution additives reduce evaporation. Based on the spot variability requirements for a particular application, humidity and additives may be chosen to optimize either inter- or intraspot variability. © 2003 Elsevier Science (USA). All rights reserved.
0	Keywords:  DNA  microarrays;  Microfluidics
0	DNA  microarrays  are  important  tools  for  obtaining  high-throughput  genetic  information  and  are  often  used  for  expression  profiling,  gene  copy  estimation,  and  polymorphism  analysis  [1-11].  Though  they  have  been  applied  successfully  in  many  research  applications,  there  are  significant  problems  which  limit  their  use  to  qualitative  analysis  of  large  signal  changes.  To  compensate  for  experimental  variability,  almost  all  current  microarray  analyses  rely  on  differential  measurement  techniques  that  assess  results  compared  to  a  reference  [12].  Analysis  is  often  focused  on  the  most  reliable  and  repeatable  portions  of  the  data  [13].  The  difficulty  in  interpreting  the  remaining  data  is  usually  attributed  to  a  variety  of  factors,  including  inter-  and  intraspot  variations  [14,15].
0	Abbreviations  used:  SSC,  standard  saline  citrate;  DMSO,  dimethyl  sulfoxide;  R.H.,  relative  humidity;  RFU,  relative  fluorescence  unit.
0	interest to be captured and stored electronically. Length calibration was achieved using a laser-etched reference grid positioned to achieve sharp focus at the same height as the point of pin contact with the printing surface.
Scanning of multiple spots printed manually or robotically
For manual printing, the video microscope apparatus described above was used. Depositions of a freshly loaded pin were recorded over the course of a 10-min period at the rate of one deposition every 3 s. For robotic printing, a commercial robot (designed by
0	Comparative  effects  of  levosulpiride  and  cisapride  on  gastric  emptying  and  symptoms  in  patients  with  functional  dyspepsia  and  gastroparesis
0	Background: The efficacy of several prokinetic drugs on dyspeptic symptoms and on gastric emptying rates is well established in patients with functional dyspepsia, but formal studies comparing different prokinetic drugs are lacking. Aim: To compare the effects of chronic oral administration of cisapride and levosulpiride in patients with functional dyspepsia and delayed gastric emptying. Methods: In a double-blind crossover comparison, the effects of a 4-week administration of levosulpiride (25 mg t.d.s.) and cisapride (10 mg t.d.s.) on the gastric emptying rate and on symptoms were evaluated in 30 dyspeptic patients with functional gastroparesis. At the beginning of the study and after levosulpiride or cisapride treatment, the gastric emptying time of a standard meal was measured by 13C-octanoic acid breath test. Gastrointestinal symptom scores were also evaluated. Results: The efficacy of levosulpiride was similar to that of cisapride in significantly shortening (P < 0.001) the t1/2 of gastric emptying. No significant differences were observed between the two treatments with regard to improvements in total symptom scores. However, levosulpiride was significantly more effective (P < 0.01) than cisapride in improving the impact of symptoms on the patients' everyday activities and in improving individual symptoms such as nausea, vomiting and early postprandial satiety. Conclusion: The efficacy of levosulpiride and cisapride in reducing gastric emptying times with no relevant side-effects is similar. Improvement in the impact of symptoms on patients' everyday activities and in some symptoms such as nausea, vomiting and early satiety was more evident with levosulpiride than with cisapride.
0	Prokinetic drugs have been extensively tested in the treatment of functional dyspepsia. This is because gastrointestinal motor abnormalities and, in particular, delayed gastric emptying have been frequently reported in patients suffering from this common syndrome.1–6
0	These abnormalities are regarded as a likely source of symptoms even if no clear cause–effect relationship between severity of symptoms and degree of delay in gastric emptying has been proven to date.7 Among prokinetic drugs, several placebo-controlled trials have provided evidence on the efficacy of cisapride and dopamine receptor antagonists such as metoclopramide, domperidone, and recently levosulpiride in the treatment of functional dyspepsia.8–28 Metoclopramide, domperidone and levosulpiride have both antiemetic and prokinetic properties because they antagonize dopamine receptors in the central nervous system as
0	C.  MANSI  et  al.
0	© 2000 Blackwell Science Ltd, Aliment Pharmacol Ther 14, 561–569
0	MATERIALS  AND  METHODS
0	impact on everyday activities was scored as: 0, not at all bothersome; 1, a little bit bothersome; 2, moderately bothersome; 3, quite a bit bothersome; 4, extremely bothersome. The cut-off values of symptom scores for inclusion in the study were established on the basis of the data obtained by the same questionnaires filled in by 200 healthy volunteers (84 males, 116 females, aged 42 ± 4 years). A score decrease of at least 50% was defined as a 'symptom improvement'. The reproducibility of the symptom questionnaire had previously been validated in 40 patients with functional dyspepsia. The score evaluation of their symptoms was performed by the patients themselves on two separate occasions (2–4 weeks apart). The calculated κ-values were 0.84 for total severity scores, whereas scores for frequency, duration and impact were 0.72, 0.69, and 0.87, respectively.
Gastric emptying studies
Gastric emptying time was measured by means of 13C-octanoic acid breath test as previously described.34 This test was performed during the run-in period and at the end of each treatment. Patients were given a standard test meal consisting of one egg with 5 g of butter, two slices of white bread and 150 mL of water; 100 mg 13C-octanoic acid (Cortex Italia, Milan, Italy) was incorporated into the homogenized egg yolk, which was baked separately from the egg white. For practical reasons, the test meal was given at 13.00 hours, after an overnight fast, and eaten in 10 min. In order to interfere as little as possible with the subjects' normal eating habits, they were allowed to eat a light breakfast restricted to 100 mL of milk alone with 10 g of sugar at 07.00/08.00 hours.
Females  were  studied  during  the  first  10  days  of  the  menstrual  cycle.  Breath  samples  were  collected  just  before,  and  every  15  min  after  the  test  meal  for  6  h;  13CO2  measurements  were  performed  with  an  isotope  ratio  mass  spectrometer 
0	THE  THERMODYNAMICS  OF  DNA  STRUCTURAL  MOTIFS
1	John  SantaLucia,  1,2  and  Donald  Hicks2
0	Key Words secondary structure, prediction, hybridization, oligonucleotides, nucleic acid folding
Abstract DNA secondary structure plays an important role in biology, genotyping diagnostics, a variety of molecular biology techniques, in vitro-selected DNA catalysts, nanotechnology, and DNA-based computing. Accurate prediction of DNA secondary structure and hybridization using dynamic programming algorithms requires a database of thermodynamic parameters for several motifs including Watson-Crick base pairs, internal mismatches, terminal mismatches, terminal dangling ends, hairpins, bulges, internal loops, and multibranched loops. To make the database useful for predictions under a variety of salt conditions, empirical equations for monovalent and magnesium dependence of thermodynamics have been developed. Bimolecular hybridization is often inhibited by competing unimolecular folding of a target or probe DNA. Powerful numerical methods have been developed to solve multistate-coupled equilibria in bimolecular and higher-order complexes. This review presents the current parameter set available for making accurate DNA structure predictions and also points to future directions for improvement.
0	Loop Database
Hairpin Loops
Internal Loops
Bulges
Coaxial Stacking Parameters
Multibranched Loops
QUALITY OF SECONDARY STRUCTURE PREDICTIONS
MULTISTATE MODELING OF DNA FOLDING AND HYBRIDIZATION
FUTURE DIRECTIONS
0	INTRODUCTION  Biological  Importance  of  DNA  Secondary  Structure
0	Molecular  Biology  and  Biotechnology  Applications  of  DNA  Secondary  Structure
0	THERMODYNAMICS  OF  DNA  MOTIFS
0	of  biotechnology  techniques  that  exploit  the  three-dimensional  folding  potential  of  DNA  have  also  been  demonstrated  including  DNA  nanotechnology  (75)  and  DNA  computing  (21).
0	The  DNA  Folding  Problem
0	Similar to the protein and RNA folding problems, there is a corresponding "DNA folding problem" in which it is desired to predict the structure and folding energy of the DNA given its sequence. Fortunately, several features of DNA and RNA make them especially amenable to structure prediction. Notably, DNA and RNA secondary structures result from strong Watson-Crick pairing interactions, and tertiary interactions are a weaker second-order effect (81). Thus, to an excellent approximation, tertiary interactions may be neglected and accurate secondary structure prediction is possible. The strong pairing rules also allow for the DNA secondary structure to be reduced to discrete interactions in which two positions in a sequence are either paired or not. Even with the neglect of tertiary interactions such as pseudoknots, however, the number of possible secondary structures is approximately 1.8^N, where N is the sequence length (95). Fortunately, with the discrete pairing approximation, DNA and RNA are suitable for powerful dynamic programming algorithms, which were described in a previous review (83). Dynamic programming algorithms guarantee that for a given set of rules, the minimum energy structure (i.e., optimal) will be found in computation time of order N^3 with memory of order N^2, thereby allowing predictions of sequences with fewer than 10,000 nucleotides with currently available computers. Dynamic programming algorithms also predict suboptimal structures within user-defined energy and distance windows (94). This is important because the energy rules are not perfect and tertiary interactions are neglected (as are interactions with proteins and the specific interactions with magnesium or other cofactors).
Thus, one of the few structures near the free-energy minimum is likely to be correct. It is important to note the difference between selected functional sequences and random sequences of DNA or RNA. Random sequences have a low probability of folding into compact three-dimensional structures stabilized by tertiary interactions; thus random sequences are most amenable to secondary structure prediction because the neglect of tertiary interactions is appropriate. On the other hand, selected sequences (selected either by evolution or by in vitro selection, or rationally designed) are more likely to contain tertiary interactions, which compromise the reliability of the secondary structure prediction algorithms. This difference makes DNA folding much easier to predict (for random sequences) than corresponding biologically selected RNAs. Note that dynamic programming algorithms also neglect kinetically trapped structures and assume structures are populated according to an equilibrium Boltzmann distribution; thus the structures close to minimum free energy are most probable. Recently, we have also extended the dynamic programming algorithm to predict bimolecular optimal and suboptimal structures so that match and mismatch hybridizations of a short probe to long target DNA may be readily identified on
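To make the order-N^3 time and order-N^2 memory scaling concrete, the following is a minimal sketch of a dynamic programming recursion over a discrete paired/unpaired model. It is a deliberate simplification and not the energy-based algorithm the review refers to: it maximizes the number of Watson-Crick pairs (a Nussinov-style recursion) instead of minimizing nearest-neighbor free energy, ignores pseudoknots, and assumes a minimum hairpin loop of three unpaired bases. The function name and the `min_loop` parameter are illustrative assumptions.

```python
def max_pairs(seq, min_loop=3):
    """Maximum number of nested Watson-Crick pairs in a DNA sequence.

    best[i][j] holds the optimum for the subsequence seq[i..j], so the table
    needs O(N^2) memory and the three nested loops take O(N^3) time.
    """
    pairs = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # distance between i and j
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case 1: base j is unpaired
            for k in range(i, j - min_loop):     # case 2: base j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


print(max_pairs("GGGAAACCC"))  # a small hairpin: three G-C pairs closing an AAA loop
```

Replacing the "+ 1" pair score with motif free energies (stacks, loops, mismatches) is what turns this counting scheme into an energy minimization, which is why a thermodynamic parameter database is needed.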
0	Overview  of  the  DNA  Thermodynamic  Database
0	Dynamic  programming  algorithms  for  DNA  secondary  structure  predicti
0	Articles
Nearest-Neighbor Thermodynamics and NMR of DNA Sequences with Internal A·A, C·C, G·G, and T·T Mismatches
1	Nicolas  Peyret,  P.  Ananda  Seneviratne,  Hatim  T.  Allawi,  and  John  SantaLucia,  *
0	ABSTRACT: Thermodynamic measurements are reported for 51 DNA duplexes with A·A, C·C, G·G, and T·T single mismatches in all possible Watson-Crick contexts. These measurements were used to test the applicability of the nearest-neighbor model and to calculate the 16 unique nearest-neighbor parameters for the 4 single like-with-like base mismatches next to a Watson-Crick pair. The observed trend in stabilities of mismatches at 37 °C is G·G > T·T ≈ A·A > C·C. The observed stability trend for the closing Watson-Crick pair on the 5′ side of the mismatch is G·C ≥ C·G ≥ A·T ≥ T·A. The mismatch contribution to duplex stability ranges from -2.22 kcal/mol for GGC·GGC to +2.66 kcal/mol for ACT·ACT. The mismatch nearest-neighbor parameters predict the measured thermodynamics with average deviations of ΔG°37 = 3.3%, ΔH° = 7.4%, ΔS° = 8.1%, and TM = 1.1 °C. The imino proton region of 1-D NMR spectra shows that G·G and T·T mismatches form hydrogen-bonded structures that vary depending on the Watson-Crick context. The data reported here combined with our previous work provide for the first time a complete set of thermodynamic parameters for molecular recognition of DNA by DNA with or without single internal mismatches. The results are useful for primer design and understanding the mechanism of triplet repeat diseases.
0	DNA mismatches occur in vivo due to misincorporation of bases during replication (1), heteroduplex formation during homologous recombination (2), mutagenic chemicals (3, 4), ionizing radiation (5), and spontaneous deamination (6). Knowledge of the thermodynamics of DNA mismatches will be useful for elucidating the mechanisms of polymerase fidelity and mismatch repair efficiency. Moreover, thermodynamic parameters for mismatch formation are important for DNA secondary structure prediction (see http://sun2.science.wayne.edu/jslsun2 and http://mfold1.wustl.edu/mfold/dna/form1.cgi). Recent work has shown that triplet repeat sequences form transiently stable hairpins that contain like-with-like base mismatches (7-14). The formation of these secondary structures can induce genome expansion or deletion during replication (15, 16) resulting in at least 11 different human diseases (17-19). Mismatch thermodynamics is also important for molecular biological techniques such as PCR (20), Southern blotting (21), single-stranded conformational polymorphism (SSCP) (22-24), sequencing by hybridization (25, 26), antigene targeting (27), Kunkel site-directed mutagenesis (28), and optimization of DNA chip arrays for diagnostics (29). These techniques require optimization of sequence, temperature,
0	and solution conditions to avoid detection or amplification of wrong sequences. Previous work from our laboratory has shown that a NN model is valid to describe the thermodynamics of DNA structures involving canonical A·T and G·C base pairs (30-32) as well as G·T (31), G·A (33), C·T (34), and A·C (35) mismatches. We hypothesized that the nearest-neighbor model is also applicable to single A·A, C·C, G·G, and T·T mismatches. To test this hypothesis, thermodynamic measurements of 45 sequences combined with 6 from the literature (36, 37) were used to derive NN parameters for like-with-like base mismatches. 1-D NMR and CD studies were used to qualitatively probe the structures formed by the mismatches. These data combined with our previous results provide a complete thermodynamic database for DNA molecular recognition by DNA with or without single internal mismatches.
MATERIALS AND METHODS
DNA Synthesis and Purification. Oligonucleotides were graciously provided by Hitachi Chemical Research and were synthesized on solid support using standard phosphoramidite chemistry (38). Oligonucleotides were detached from the
0	Abbreviations: Na2EDTA, disodium ethylenediaminetetraacetate; eu, entropy unit; MES, 2-(4-morpholino)ethane sulfonate; NMR, nuclear magnetic resonance; NN, nearest-neighbor; SVD, singular value decomposition; TLC, thin-layer chromatography; UV, ultraviolet.
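0	$$Y^\circ_{\mathrm{total}} = Y^\circ_{\mathrm{initiation}} + Y^\circ_{\mathrm{sym}} + 2Y^\circ(\mathrm{GG/CC}) + 2Y^\circ(\mathrm{GA/CT}) + 2Y^\circ(\mathrm{AG/TC}) + 2Y^\circ(\mathrm{G\underline{T}/C\underline{T}}) \tag{2}$$
0	The notation GT/CT refers to a 5′-GT-3′ dimer hydrogen bonded to a 3′-CT-5′ dimer with the mismatch underlined. The mismatch contribution to duplex stability is given by rearranging eq 2:
0	$$2Y^\circ(\mathrm{G\underline{T}/C\underline{T}}) = Y^\circ_{\mathrm{total}} - Y^\circ_{\mathrm{initiation}} - Y^\circ_{\mathrm{sym}} - 2Y^\circ(\mathrm{GG/CC}) - 2Y^\circ(\mathrm{GA/CT}) - 2Y^\circ(\mathrm{AG/TC}) \tag{3}$$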
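The rearrangement in eq 3 amounts to simple bookkeeping, sketched below. The sketch assumes that the quantity being summed is a free energy in kcal/mol and that every Watson-Crick dimer and the mismatch dimer occur twice, as in the duplex described by eq 2. All numerical values are hypothetical placeholders chosen for easy arithmetic; they are not measured parameters from this or any other study, and the function name is likewise an assumption.

```python
def mismatch_increment(total, initiation, symmetry, wc_increments, copies=2):
    """Per-dimer mismatch contribution obtained by rearranging the
    nearest-neighbor sum: subtract initiation, symmetry and the
    Watson-Crick increments from the total, then divide by the number
    of times the mismatch dimer occurs in the duplex."""
    wc_sum = copies * sum(wc_increments)
    return (total - initiation - symmetry - wc_sum) / copies


# Hypothetical placeholder values (kcal/mol), for illustration only.
value = mismatch_increment(
    total=-5.6,                       # assumed experimental total
    initiation=2.0,                   # assumed initiation term
    symmetry=0.4,                     # assumed symmetry correction
    wc_increments=[-1.8, -1.3, -1.3]  # assumed GG/CC, GA/CT, AG/TC terms
)
print(round(value, 2))
```

With these placeholders the bracketed difference is 0.8 kcal/mol for the two mismatch dimers together, so each dimer contributes 0.4 kcal/mol.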
0	Thus, the mismatch contribution is calculated by subtracting the initiation, symmetry, and Watson-Crick nearest-neighbor increments (31) from the total experimental value.
Number of Linearly Independent Parameters. In our previous studies of G·T, G·A, A·C, and C·T single mismatches, we showed that it is impossible to uniquely solve for eight dimer nearest neighbors from a data set of oligomers containing only single internal mismatches (31). Instead, within the limits of the nearest-neighbor model, only seven linearly independent trimers are sufficient to accurately predict internal mismatch thermodynamics. In the case of single like-with-like base mismatches (i.e., A·A, C·C, G·G, and T·T), however, symmetry allows for a unique solution of four internal nearest-neighbor dimers to be found. In particular, the dimer nearest neighbors can be uniquely solved from sequences that contain these trimers:
0	where X = A, C, G, or T. According to the nearest-neighbor model, any sequence with an internal X·X mismatch can be determined from linear combinations of eqs 4a-d. It should be noted, however, that even though it is possible to uniquely solve for the X·X dimer nearest-neighbor parameters from a set of oligonucleotides with only internal mismatches, these parameters cannot be used to accurately predict the thermodynamics of duplexes with terminal mismatches. As we found earlier (31), terminal mismatches always make favorable contributions to dup
0	REVIEW  ARTICLE
0	The  marks,  mechanisms  and  memory  of  epigenetic  states  in  mammals
1	Vardhman  K.  RAKYAN,  Jost  PREIS,  Hugh  D.  MORGAN  and  Emma  WHITELAW1
0	It is well recognized that there is a surprising degree of phenotypic variation among genetically identical individuals, even when the environmental influences, in the strict sense of the word, are identical. Genetic textbooks acknowledge this fact and use different terms, such as 'intangible variation' or 'developmental noise', to describe it. We believe that this intangible variation results from the stochastic establishment of epigenetic modifications to the DNA nucleotide sequence. These modifications, which may involve cytosine methylation and chromatin remodelling, result in alterations in gene expression which, in turn, affect the phenotype of the organism. Recent evidence, from our work and that of others in mice, suggests that these epigenetic modifications, which in the past were thought to be cleared and reset on passage through the germline, may sometimes be inherited by the next generation. This is termed epigenetic inheritance, and while this process has been well recognized in plants, the recent findings in mice force us to consider the implications of this type of inheritance in mammals. At this stage we do not know how extensive this phenomenon is in humans, but it may well turn out to be the explanation for some diseases which appear to be sporadic or show only weak genetic linkage.
0	Key words: chromatin, inheritance, methylation.
0	The various cell types in a multicellular organism are genotypically identical and yet phenotypically different. This is due to differences in the patterns of gene expression that exist between the different cell groups. The stable maintenance of these differences is thought to be due to epigenetic control of gene expression. This involves physically 'marking' the DNA, without altering the nucleotide sequence, either by the addition of methyl groups to certain cytosine bases and/or the packaging of the DNA into a highly condensed state. These modifications interfere with the DNA-protein interactions that facilitate transcription, resulting in transcriptional silencing of the epigenetically modified allele. Epigenetic modifications can, therefore, cause phenotypic variation in the absence of genetic differences. It is well recognized that 'silenced' alleles can be inherited through many rounds of DNA replication, and therefore epigenetic modifications or 'marks' can be maintained through mitotic cell divisions. Generally, however, it has been assumed that these marks are erased and reset at some stage during gametogenesis or early embryogenesis to reinstate the totipotency of the developing embryo. There is now an increasing body of evidence which suggests that epigenetic marks at some mammalian alleles are not completely erased from one generation to the next, resulting in complex patterns of inheritance that do not conform to Mendelian principles. Therefore not only can phenotype vary in the absence of genetic and environmental factors, described by some as 'intangible variation' [1] or 'developmental noise' [2], but these phenotypic differences can also be inherited by the offspring.
This  review  will  present  a  brief  overview  of  the  role  of  methylation  and  chromatin  remodelling  in  epigenetic  regulation
0	of  gene  expression,  followed  by  examples  of  classic  epigenetic  phenomena  in  mammals.  We  will  then  discuss  the  evidence  available  for  epigenetic  inheritance  through  the  germline,  with  an  emphasis  on  murine  models,  which  suggest  that  this  form  of  inheritance  may  be  occurring  at  a  number  of  mammalian  loci.
0	EPIGENETIC  MODIFICATIONS  OF  DNA
0	The  two  mechanisms  by  which  DNA  is  epigenetically  marked,  although  there  may  be  others  yet  to  be  discovered,  are  methylation  and  chromatin  condensation.  Both  of  these  mechanisms  are  associated  with  gene  silencing,  and  recent  evidence,  discussed  below,  suggests  that  these  two  mechanisms  are  not  mutually  exclusive,  but  instead  act  in  concert  to  silence  gene  expression  in  mammalian  cells.
0	DNA  methylation
0	Methylation  involves  the  enzymic  transfer  of  a  methyl  group  to  the  5-position  of  the  pyrimidine  ring  of  a  cytosine  residue  [3-5].  This  usually  occurs  at  cytosine  bases  that  are  immediately  followed  by  a  guanine,  known  as  CpG  dinucleotides  [6,7].  In  mammalian  genomes,  the  CpG  dinucleotide  is  greatly  underrepresented  due  to  the  increased  spontaneous  deamination  rate  of  5-methylcytosine  into  thymine.  Of  the  CpGs  present,  approx.  70  %  are  methylated  [8],  whereas  the  majority  of  unmethylated  CpGs  occur  in  small  clusters  known  as  CpG  islands,  which  are  ordinarily  found  within  or  near  promoters  or  first  exons  of  `  housekeeping  '  genes  [9,10].  Methylation  is  catalysed  by  DNA  methyltransferases  (Dnmts)  and  four  mammalian  Dnmts  have  been  identified  so  far,  Dnmt1
0	V.  K.  Rakyan  and  others
0	the  vicinity  and  reassociating  with  the  newly  assembled  chromatin  following  DNA  replication.  Evidence  for  this  mechanism  comes  from  the  observation  that  some  HATs  form  part  of  a  complex  that  remains  associated  with  its  target  DNA  throughout  the  cell  cycle  [42-44].  A  second  mechanism  may  involve  targeting  the  HATs  and  HDACs  to  regions  of  methylated  DNA,  so  that  preexisting  acetylation  patterns  are  propagated  along  with  methylation  patterns  during  DNA  replication.  Indeed,  it  has  recently  been  discovered  that  the  maintenance  methylase,  Dnmt1,  can  interact  with  a  histone  deacetylase  [45-47].
0	Dnmt2 [12], Dnmt3A and Dnmt3B [13], although our understanding of how these enzymes function is sketchy at best. Dnmt1 is probably involved in maintaining methylation patterns through mitosis [14]. Following DNA replication, the two double-stranded daughter molecules initially contain a hemi-methylated CpG pattern, which is recognized and converted into the fully methylated parental pattern by Dnmt1 [15]. However, it has been found that the error rate of replication of methylation patterns of an artificially methylated DNA sequence transfected into cell lines is significantly higher than that observed for DNA replication [16,17]. In addition, a later study [18] showed that clonal populations of histologically homogeneous cells did not have homologous methylation patterns. These findings have been confirmed by more recent work, using the highly sensitive bisulphite conversion method to analyse methylation patterns in vivo [19,20]. Therefore the infidelity of replication of methylation patterns has the potential to generate phenotypic diversity among genetically identical cells of the same lineage. Dnmt2 may play a role in epigenetic control of centromere function [21], and Dnmt3A and 3B are thought to be de novo methylases which set up the initial patterns of methylation during embryogenesis [22]. However, data suggest that Dnmts have overlapping functions [23,24], and the precise role of any particular Dnmt is determined by the cellular context. During mammalian development, there are `waves' of extensive demethylation of the genome in the primordial germ cell stage and pre-implantation embryo [25-28].
A  mammalian  protein  with  specific  demethylase  activity  for  CpG  dinucleotides  has  been  reported  [29,30],  although  it  remains  to  be  fully  characterized  biochemically.
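The link between imperfect maintenance methylation and clonal diversity can be illustrated with a small simulation. This is a minimal sketch under strong assumptions that are not taken from the review: each CpG site is treated independently, `fidelity` is a hypothetical per-site probability that a hemi-methylated site is restored after replication, and de novo methylation and active demethylation are ignored.

```python
import random


def replicate(pattern, fidelity, rng):
    """Copy a CpG methylation pattern through one round of replication.

    pattern  -- list of bools, True where the parental CpG is methylated
    fidelity -- hypothetical per-site probability that the maintenance
                step restores a hemi-methylated site to full methylation
    """
    # Unmethylated sites stay unmethylated (no de novo activity is
    # modelled); methylated sites are kept with probability `fidelity`.
    return [m and rng.random() < fidelity for m in pattern]


def clonal_patterns(founder, fidelity, divisions, seed=0):
    """Return the patterns of all cells descended from one founder cell."""
    rng = random.Random(seed)
    cells = [list(founder)]
    for _ in range(divisions):
        # Every cell gives rise to two daughters per division.
        cells = [replicate(c, fidelity, rng) for c in cells for _ in (0, 1)]
    return cells


founder = [True] * 20
perfect = clonal_patterns(founder, fidelity=1.0, divisions=5)
leaky = clonal_patterns(founder, fidelity=0.95, divisions=5)
print(len({tuple(c) for c in perfect}))  # one shared pattern
print(len({tuple(c) for c in leaky}))    # number of distinct patterns in the clone
```

With perfect fidelity every descendant carries the founder pattern; lowering the hypothetical fidelity lets distinct patterns accumulate within a single clone, which is the qualitative point made above.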
0	Epigenetic  regulation  of  transcription
0	The precise mechanisms by which methylation and chromatin compaction regulate transcription are unclear, although several studies suggest that these two mechanisms are linked. MECP2 (methyl-CpG binding protein 2) is a transcriptional repressor that selectively recognizes methylated CpG dinucleotides [48,49]. MECP2, and other methyl-CpG binding proteins, associate with co-repressor complexes that include HDACs [50-53]. This directs the formation of stable repressive chromatin structures [54]. Recent findings [51,52] link the four different methyl-CpG binding domain (MBD) proteins, MECP2, MBD1, MBD2 and MBD3, with the chromatin-remodelling machinery, providing further evidence for the association between methylation and chromatin remodelling. Therefore it seems that methylation acts through histone deacetylation to establish a repressive chromatin state that blocks the access of the transcription machinery, although at present we do not know how the initial patterns of methylation are set up de novo. However, for certain organisms, e.g. Drosophila, methylation is observed only in very early embryogenesis [55] (for decades it was believed that DNA methylation was non-existent in Drosophila), and others, like the yeast Schizosaccharomyces pombe, do not methylate their DNA at all. Therefore in some eukaryotic organisms chromatin-mediated mechanisms alone may be sufficient to mediate epigenetic regulation of gene expression.
0	Chromatin  packaging
0	In  the  nucleus,  DNA  exists  as  a  nucleoprotein  complex  termed  chromatin.  Chromatin  is  assembled  from  arrays  of  nucleosomes,  each  of  which  is  approx.  200  bp  of  linear  DNA  wrapped  around  an  octamer  of  histone  proteins.  Two  distinct  types  of  chromatin  are  known,  heterochromatin  and  euchromatin.  Heterochromatin  is  believed  to  represent  regions  of  DNA-protein  complexes  that  are  in  a  tightly  packed  conformation  [31,32].  Constitutive  heterochromatin  is  usually  found  at  the  centromeric  and  subtelomeric  regions  of  chromosomes
0	Spot  shape  modelling  and  data  transformations  for  microarrays
1	Claus  Thorn  Ekstrom1,,  Soren  Bak2  ,  Charlotte  Kristensen2,  and  Mats  Rudemo1
0	Department
0	In  order  to  study  lowly  expressed  genes  in  microarray  experiments,  it  is  useful  to  increase  the  photometric  gain  in  the  scanning.  However,  a  large  gain  may  cause  some  pixels  for  highly  expressed  genes  to  become  saturated,  i.e.  the  registered
0	Present  address:  Poalis  A/S,  Buelowsvej  25,  1870  Frederiksberg  C,  Denmark
0	pixel values become censored at the upper limit, which with 16-bit precision is $2^{16} - 1 = 65535$. Techniques for adjustment of highly expressed signal intensities are given in Wit and McClure (2003) based on a small set of available spot summaries, such as spot mean, spot median and spot variance. As mentioned in Wit and McClure (2003), it should be possible to get more accurate adjustments when all pixel values are available. In the present paper, we study spatial statistical models for pixel values that should enable such adjustments. A convenient type of modelling is to transform data to become approximately Gaussian distributed with a mean value function determined by gene intensities and spot shapes and a corresponding covariance function. For such models, censored pixel values can be estimated optimally. We investigate several types of transformations on the pixel level such as the logarithmic transformation, the Box-Cox family (Box and Cox, 1964) and the inverse hyperbolic sine transformation (Huber et al., 2002; Durbin et al., 2002), also called the generalized logarithm (Rocke and Durbin, 2003). The inverse hyperbolic sine transformation has proven useful for analyzing microarray spot intensities, but here we apply it at the pixel level. The Box-Cox transformation with exponent 0.5, i.e. a square root transformation optimal for Poisson distributed counts, has been used for pixel-level analysis of microarray data by Glasbey and Ghazal (2003). The spot shapes studied include three types suggested by Wierling et al.
(2002): (i) a cylindric plateau spot distribution, (ii) an isotropic two-dimensional (2D) Gaussian distribution and (iii) a crater spot distribution consisting of a difference between two scaled isotropic 2D Gaussian distributions. These models do not seem to provide a satisfactory description for the dataset considered, and we introduce a new class of models with polynomial-hyperbolic spot shape. With a second degree polynomial we get a considerably improved performance. This spot shape may be regarded as a generalization of the cylindric plateau spot shape.
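The three isotropic spot shapes listed above can be written as radial profiles. The following is a minimal sketch, not the authors' implementation: the function names and all parameter values (heights, radius, standard deviations) are hypothetical and chosen only for illustration, and the polynomial-hyperbolic model is not attempted because its form is not given in this passage.

```python
import math


def plateau(r, height, radius):
    """Cylindric plateau: constant intensity inside the spot, zero outside."""
    return height if r <= radius else 0.0


def gaussian(r, height, sigma):
    """Isotropic 2D Gaussian, written as a function of the distance r."""
    return height * math.exp(-r * r / (2.0 * sigma * sigma))


def crater(r, h_outer, s_outer, h_inner, s_inner):
    """Crater: difference between two scaled isotropic Gaussians."""
    return gaussian(r, h_outer, s_outer) - gaussian(r, h_inner, s_inner)


# Radial profiles for illustrative (hypothetical) parameter values.
for r in range(7):
    print(
        r,
        plateau(r, 1.0, 4.0),
        round(gaussian(r, 1.0, 2.0), 3),
        round(crater(r, 1.0, 3.0, 0.8, 1.5), 3),
    )
```

With these values the crater profile is lower at the centre than a few pixels out, reproducing the volcano-like appearance that a single Gaussian cannot capture.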
0	Spot  shape  models  and  transformations
0	The models are applied to a dataset obtained with a specially designed spotted 50mer oligonucleotide microarray. Here, the expression of 452 selected genes in transgenic Arabidopsis plants is compared with that of the corresponding genes in wildtype plants. Data include scans with different photometric gains ranging from no saturation to heavy saturation.
0	where  1  >  0,  and  an  inverse  hyperbolic  sine  transformation
0	DATA,  TRANSFORMATIONS  AND  EXPLORATORY  ANALYSIS  Materials
0	Y  =  k  arsinh
0	SPOT  SHAPE  MODELS
0	Based  on  empirical  observations  of  spot  intensity  profiles  as  seen  in  Figure  1  as  well  as  in  Duggan  et  al.  (1999)  (Fig.  2)  and  Glasbey  and  Ghazal  (2003)  (Fig.  1),  we  desire  a  spatial  spot  shape  model  to  have  the  following  three  properties:  (i)  isotropic,  i.e.  that  the  average  intensity  at  a  pixel  x  only  depends  on  the  distance  from  x  to  the  spot  centre  and  not  on  the  direction  from  the  centre;  (ii)  should  allow  for  spot-shapes  resembling  both  `volcanos/craters/donuts'  and  `plateaus'.  Spot  intensities  are  often  highest  near  the  edge  of  the  spot  and  smaller  near  the  spot  centre  making  the  resulting  spot  shape  resemble  a  volcano  (middle  panel  of  Fig.  1);  and  (iii)  allow  for  spatial  correlation,  i.e.  pixels  close  together  and  with  the  same  distance  from  the  spot  centre  should  be  more  correlated  than  pixels  further  apart.
0	Let Z = Z(x) denote the intensity of a pixel x. Here, Z is a 16-bit integer, i.e. $0 \le Z \le 2^{16} - 1 = 65535$. Let Y(x) denote a transformation of Z(x), Y(x) = f(Z(x), ), (1)
0	where  f  (·,  )  is  a  family  of  transformation  depending  on  the  parameter  vector  .  In  the  following,  we  shall  consider  three  transformations:  A  logarithmic  transformation  Y  =  k  log(Z  +  1  ),  (2)
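The pixel-level transformations named in this paper (logarithmic, Box-Cox and inverse hyperbolic sine) can be sketched as follows. The exact parameterization used by the authors is not recoverable from the text here, so the forms below are the standard textbook ones, and the parameter names `k`, `shift`, `exponent` and `scale` are assumptions introduced for illustration.

```python
import math

SATURATION = 2**16 - 1  # upper limit of a 16-bit pixel value (65535)


def censor(z):
    """Mimic scanner saturation: values above the limit are recorded at it."""
    return min(z, SATURATION)


def log_transform(z, k=1.0, shift=1.0):
    """Shifted logarithm; `shift` must be positive so that z = 0 is allowed."""
    return k * math.log(z + shift)


def box_cox(z, exponent):
    """Box-Cox family; exponent 0.5 is a shifted, scaled square root."""
    if exponent == 0:
        return math.log(z)
    return (z**exponent - 1.0) / exponent


def arsinh_transform(z, k=1.0, scale=1.0):
    """Inverse hyperbolic sine (generalized logarithm)."""
    return k * math.asinh(z / scale)


for z in (0, 10, 1000, 70000):
    zc = censor(z)
    print(zc, round(log_transform(zc), 3), round(arsinh_transform(zc), 3))
```

The inverse hyperbolic sine is approximately linear near zero and approximately logarithmic for large intensities, which is why it is often described as a generalized logarithm.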
0	January  2003
0	The  Importance  of  Thermodynamic  Equilibrium  for  High  Throughput  Gene  Expression  Arrays
1	Gyan  Bhanot,*  Yoram  Louzoun,y  Jianhua  Zhu,z  and  Charles  DeLisiz
0	ABSTRACT  We  present  an  analysis  of  physical  chemical  constraints  on  the  accuracy  of  DNA  micro-arrays  under  equilibrium  and  nonequilibrium  conditions.  At  the  beginning  of  the  article  we  describe  an  algorithm  for  choosing  a  probe  set  with  high  specificity  for  targeted  genes  under  equilibrium  conditions.  The  algorithm  as  well  as  existing  methods  is  used  to  select  probes  from  the  full  Saccharomyces  cerevisiae  genome,  and  these  probe  sets,  along  with  a  randomly  selected  set,  are  used  to  simulate  array  experiments  and  identify  sources  of  error.  Inasmuch  as  specificity  and  sensitivity  are  maximum  at  thermodynamic  equilibrium,  we  are  particularly  interested  in  the  factors  that  affect  the  approach  to  equilibrium.  These  are  analyzed  later  in  the  article,  where  we  develop  and  apply  a  rapidly  executable  method  to  simulate  the  kinetics  of  hybridization  on  a  solid  phase  support.  Although  the  difference  between  solution  phase  and  solid  phase  hybridization  is  of  little  consequence  for  specificity  and  sensitivity  when  equilibrium  is  achieved,  the  kinetics  of  hybridization  has  a  pronounced  effect  on  both.  We  first  use  the  model  to  estimate  the  effects  of  diffusion,  crosshybridization,  relaxation  time,  and  target  concentration  on  the  hybridization  kinetics,  and  then  investigate  the  effects  of  the  most  important  kinetic  parameters  on  specificity.  We  find  even  when  using  probe  sets  that  have  high  specificity  at  equilibrium  that  substantial  crosshybridization  is  present  under  nonequilibrium  conditions.  
Although those complexes that differ from perfect complementarity by more than a single base do not contribute to sources of error at equilibrium, they slow the approach to equilibrium dramatically and confound interpretation of the data when they dissociate on a time scale comparable to the time of the experiment. For the best probe set, our simulation shows that steady-state behavior is obtained in a relaxation time of ~12-15 h for experimental target concentrations of ~($10^{-13}$-$10^{-14}$) M, but the time is greater for lower target concentrations in the range ($10^{-15}$-$10^{-16}$) M. The result points to an asymmetry in the accuracy with which up- and downregulated genes are identified.
0	INTRODUCTION Single assay characterization of the response of thousands of genes to environmental perturbations is altering the research paradigm in biomolecular science. Applications are increasing explosively in areas as wide ranging as gene expression and regulation (Lashkari et al., 1997), genotyping and resequencing, and drug discovery and disease stratification (Eisen et al., 1998). The potential impact of micro-arrays on basic and applied biology is so important that an entire industry has been spawned, using any of dozens of variants of two generic methods to fabricate arrays: either direct deposition of probes (Schena et al., 1998; DeRisi et al., 1996; Duggan et al., 1999) or covalent attachment by in situ synthesis (Hughes et al., 2001; LeProust et al., 2000; Lipshutz et al., 1999; Singh-Gasson et al., 1999). The former method allows a wide range of substances such as presynthesized oligomers, proteins, cloned DNA, etc., to be used as probes. The latter is generally restricted to oligonucleotides but offers higher specificity. The central theme of this article is the physical chemical limits of specificity, i.e., the conditions that allow the best specificity. We consider mainly, though not exclusively, arrays of probes 20-30 nucleotides long, manufactured by in situ synthesis. These conditions minimize false hybridizations resulting from the slow equilibration that is characteristic of long probes, and avoid competition between surface-bound and solubilized probes. Typically an array of tens to hundreds of thousands of different pixels, each consisting of a homogeneous set of 1-10 million oligonucleotide probes, is used to determine the expression levels of genes of known sequence.
The molecules to be assayed, e.g., cDNA, are hybridized, during a 12-15 h incubation, with probes chosen to be their reverse complements. The most common detection method relies on fluorescence. Usually molecules from the target and reference cells are labeled with red and green dyes respectively; pixels are then scanned at the two distinct wavelengths to determine expression changes. Genes that are up- or downregulated in response to drugs, hormones, or other environmental influences are thus quickly identified. Although micro-array assays are high throughput in the sense that in excess of 10,000 genes at a time are probed, the number of false-positives is high, even for arrays prepared by in situ synthesis. Increased specificity is typically achieved by sacrificing sensitivity: only genes with a pronounced change in expression level, typically in the fifth percentile, are scored as having changed. The screened set, or a select
0	group  of  the  screened  set,  is  then  investigated  further  using  traditional  methods  such  as  Northern  blotting.  Increased  throughput  is  generally  achieved  by  increased  array  density.  However,  as  the  above  remarks  imply,  a  substantial  increase  in  throughput  can  be  achieved  by  a  well  validated,  high-specificity  system.  To  increase  specificity  by  rational  design  procedures,  it  is  helpful  to  have  a  clear  understanding  of  the  physical  limitations  of  the  assay.  This  includes  understanding  the  conditions  that  will  provide  the  best  specificity,  the  robustness  to  deviations  from  optimal  conditions,  the  relation  of  optimal  conditions  to  those  prevalent  in  the  most  common  experimental  procedures,  and  strategies  for  optimization.  This  article  is  divided  into  two  broad  components:  equilibrium  and  kinetic.  In  the  first  section,  we  outline  the  thermodynamics  of  hybridization.  Specificity  and  sensitivity  are  maximum  when  equilibrium  has  been  achieved,  but  even  under  this  ideal  condition  the  method  used  to  select  probes  affects  the  formation  of  crosshybrids,  and  thus  it  affects  specificity.  Probe  selection  is  a  large  optimization  problem.  We  discuss  this  below,  and  present  a  new  probe  selection  method.  Further  below,  we  use  this  method  to  select  probes  for  the  full  set  of  yeast  genes  and  compare  the  specificities  obtained  at  equilibrium  where  both  specificity  and  sensitivity  are  maximum.  This  has  particular  implications  for  long  probes  inasmuch  as  length  substantially  reduces  the  rate  at  which  equilibrium  is  approached,  and  consequently  increases  false-positives  if  equilibrium  is  not  achieved.
0	melting temperature is easily obtained. Define b as the equilibrium constant for bimolecular nucleation (formation of the first bond) in units of inverse concentration, and let K be the (dimensionless) equilibrium constant for the formation of the remainder of the helix. For a helix with n bases, there will be n-1 stacking interactions. We write the sum of the standard Gibbs free energies for the n-1 stacks as $\Delta H - T\Delta S$, so that the corresponding intramolecular equilibrium constant is $K = e^{-(\Delta H - T\Delta S)/RT}$, where $\Delta H$ and $\Delta S$ are the sums of the standard enthalpies and entropies for base stacking, in accordance with the base sequence. The free energy of the nucleation event also, to some extent, depends on the basepairs that nucleate dimerization. If A is the free strand concentration and B the concentration of hybrids, and we assume the molecules are either fully hybridized or completely separated, then $B = bA^{2}K$. (1)
0	If $c_T$ is the total strand concentration, then by conservation $c_T = 2B + A$. In addition, at the melting temperature $T_m$ we have by definition $2B = A$. Substituting these relations in the equation for B, and utilizing the definition of K, we have that $T_m = \Delta H / [R\log(bc_T) + \Delta S]$. (2)
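The two-state melting temperature and the condition that defines it can be checked numerically. The sketch below is not from the article: the function names are hypothetical, and the values of the stacking enthalpy and entropy, the nucleation constant b and the total strand concentration are placeholders chosen only to give a plausible temperature. Conservation and the mass-action relation combine into the quadratic 2bK A^2 + A - c_T = 0, whose positive root gives the free strand concentration A.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)


def melting_temperature(dH, dS, b, c_total):
    """Two-state T_m (kelvin) from summed stacking enthalpy and entropy.

    dH in kcal/mol, dS in kcal/(mol K), b in 1/M, c_total in M.
    """
    return dH / (R * math.log(b * c_total) + dS)


def free_fraction(T, dH, dS, b, c_total):
    """Fraction A / c_total of strands that are free at temperature T."""
    K = math.exp(-(dH - T * dS) / (R * T))
    # Positive root of 2*b*K*A**2 + A - c_total = 0.
    A = (-1.0 + math.sqrt(1.0 + 8.0 * b * K * c_total)) / (4.0 * b * K)
    return A / c_total


# Hypothetical parameters for a short duplex (illustration only).
dH, dS, b, c_total = -180.0, -0.5, 1.0, 1e-6
Tm = melting_temperature(dH, dS, b, c_total)
print(round(Tm, 1), "K")
for T in (Tm - 10.0, Tm, Tm + 10.0):
    print(round(T, 1), round(free_fraction(T, dH, dS, b, c_total), 3))
```

At the computed melting temperature the free fraction is one half, as required by the definition, and it rises steeply on either side, which is the sharp transition described for the melting curve.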
0	The  presence  of  a  surface
0	Thermodynamics  of  hybridization
0	Melting  profiles
0	As temperature is increased, an initially fully intact hybrid will gradually destabilize, and at high enough temperature, the strands will separate. Approximately 90% of the transition occurs over a temperature range of ~10-15 degrees for 25-mers, with the range narrowing as length increases. The so-called melting curve, determined under equilibrium conditions, is cooperative and has an inflection point which is referred to as the melting temperature, Tm. The melting temperature is defined as the temperature at which half the total number of strands are free (i.e., not hybridized). In general the population of hybridized strands will have a distribution of intact basepairs, and the arrangement of a given number of pairs will also be distributed. The common practice of neglecting partially hybridized states reduces a very complex multistage model to a two-state model, eliminates the physical basis for cooperativity, and broadens the melting profile. For short chains, however, it has little effect on the midpoint of the transition, introducing an error that is within the error caused by experimental uncertainty in the stacking free energy. For this two-state model in which partially hybridized states are neglected, a sequence-dependent expression for the
0	The  formation  of  a  DNA  hybrid  consists  of  a  bimolecular  nucleation  event  followed  by  formation  of  a  double 
1	Arnold  Vainrub  B.  Montgomery  Pettitt
0	Surface  Electrostatic  Effects  in  Oligonucleotide  Microarrays:  Control  and  Optimization  of  Binding  Thermodynamics
0	retical  analysis  of  the  surface  electrostatic  effects,6  which  is  in  accord  with  recent  experiments,7  we  describe  here  the  effect  of  the  surface  charge  density  on  the  melting  curve  and  match/mismatch  discrimination  ratio  for  surface  hybridization,  and  predict  possible  substantial  improvements  in  several  properties  for  microarrays.  The  surface  material,  dielectric  or  metal,
0	and  the  surface  electrostatic  conditions  are  shown  to  be  critically  important  because  they  strongly  determine  the  yield  of  the  nucleic  acid  target  hybridization  to  the  surface-immobilized  oligonucleotide  probes.  We  propose  to  use  these  properties  for  control  and  enhancement  of  sensitivity  during  surface  hybridization.  In  particular,  an  equal  sensitivity  of  the  probes  with  different  base-pair  composition  may  be  achieved  by  adjustment  of  their  specific  linker  molecule  length  or  the  local  surface  charge.  Further,  we  suggest  enhancement  of  the  match/mismatch  discrimination  by  narrowing  the  melting  curve  by  optimizing  the  surface  charge.  Finally,  we  discuss  a  new  microarray  design  using  hybridization  at  low  salt  where  the  duplex  stability  is  achieved  by  the  positive  surface  charge.  Under  these  conditions  the  target's  secondary  structure  is  melted,  allowing  hybridization  to  most  of  the  target's  nucleotides  and  increasing  the  sequencing  information  up  to  tenfold.
0	RESULTS  AND  DISCUSSION  Statistical  Thermodynamics  of  Hybridization
0	THEORETICAL  MODEL  AND  CALCULATION  METHODS
0	where n is the fraction of the hybridized probes in equilibrium, $C_0$ is the concentration of the targets, and $\Delta G$ is the molar Gibbs free energy of the probe:target duplex formation. Equation (1) is valid under the condition that the target concentration is constant. For brevity, we omit a straightforward derivation for a general case when targets are depleted because of hybridization. Note that at constant temperature Eq. (1) corresponds to the well-known Langmuir adsorption isotherm equation, which is often used to interpret microarray experiments.3 For discussing the mechanism of the interaction below, we introduce here the interaction Gibbs free energy with the surface for the probe $V_p$, target $V_t$, and duplex $V_d$. This interaction impacts the hybridization equilibrium and therefore the parameters in Eq. (1) in several ways. First, the target concentrations on the surface $C_s$ and in solution $C_0$ vary according to the Boltzmann distribution formula $C_s = C_0 \exp(-V_t/RT)$ (2)
0	Second, the Gibbs free energy differences of the duplex formation on the surface $\Delta G_s$ and in solution $\Delta G$ differ by the change of the interaction energy after and before hybridization, $(V_d - V_p - V_t)$. Thus $\Delta G_s = \Delta G + V_d - V_p - V_t$ (3)
0	Equations (2) and (3) account for the target concentration and duplex binding strength changes near the surface, respectively. Substitution of Eqs. (2) and (3) in Eq. (1) gives the formula $n_s = 1/\{1 + C_0^{-1} \exp[(\Delta G + V_d - V_p)/RT]\}$, (4)
0	which describes the effect of surface interactions on the hybridization equilibrium. This equation differs from Eq. (1) for hybridization in bulk by addition of $(V_d - V_p)$ to the hybrid formation free energy. Hence, if duplex and probe are attracted to the surface ($V_d < 0$ and $V_p < 0$), the stronger attraction of the duplex for the surface, $V_d < V_p$, promotes duplex formation. In contrast, a stronger surface repulsion of the duplex than the probe shifts the hybridization equilibrium toward melting of duplexes into single strand targets and probes. This approach can also be used out of thermodynamic equilibrium when the target's concentration on the surface $C_s$ is determined not by the Boltzmann distribution Eq. (2), but rather by some steady state transport process. The corresponding $C_s$ and Eq. (3) should be substituted in Eq. (1) to obtain the equilibrium yield of the duplexes in surface hybridization, $n_s$. This is relevant to electronic DNA chips where the assayed nucleic acid is transported by electrokinetic drag13,14 and flow-through biochips.15
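The substitution that leads from the bulk isotherm to the surface expression can be verified numerically. This is a minimal sketch under one stated assumption: Eq. (1), which is not reproduced in this excerpt, is taken to have the Langmuir form n = 1/{1 + exp(dG/RT)/C0}, the form that the surface expression reduces to when the duplex and probe interact equally with the surface. Function names and all numerical values are hypothetical.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)


def bulk_fraction(c0, dG, T):
    """Assumed Langmuir form of Eq. (1): hybridized fraction in bulk."""
    return 1.0 / (1.0 + math.exp(dG / (R * T)) / c0)


def surface_concentration(c0, Vt, T):
    """Boltzmann-weighted target concentration at the surface, Eq. (2)."""
    return c0 * math.exp(-Vt / (R * T))


def surface_fraction(c0, dG, Vd, Vp, T):
    """Hybridized fraction on the surface, Eq. (4)."""
    return 1.0 / (1.0 + math.exp((dG + Vd - Vp) / (R * T)) / c0)


# Hypothetical values: energies in kcal/mol, concentration in M.
c0, dG, T = 1e-9, -12.0, 300.0
Vd, Vp, Vt = -2.0, -1.0, -1.5

# Substituting Eqs. (2) and (3) into the bulk isotherm reproduces Eq. (4):
# the target term Vt cancels between concentration and free energy.
via_substitution = bulk_fraction(
    surface_concentration(c0, Vt, T), dG + Vd - Vp - Vt, T
)
print(round(via_substitution, 4), round(surface_fraction(c0, dG, Vd, Vp, T), 4))
print(round(bulk_fraction(c0, dG, T), 4))  # bulk value for comparison
```

The target-surface energy cancels, so only the difference between the duplex and probe interaction energies shifts the equilibrium; a more strongly attracted duplex raises the hybridized fraction above the bulk value.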
0	Surface  Electrostatic  Interaction
0	In order to evaluate the hybridization with the surface tethered probes, one needs to know the probe $V_p$ and duplex $V_d$ interaction energies in Eq. (4). Recently, we calculated the oligonucleotide-surface interaction in electrolyte solution.6 We assumed the electrostatic interaction to be dominant since in microarray applications typically the oligonucleotide is tethered to the surface through a sufficiently long linker molecule, making the short-range van der Waals forces weak and therefore their effect small. The electrostatic Gibbs free energy was shown to be a sum of two components, V1 and V2. As depicted in Figure 1, V1 corresponds to the direct electrostatic interaction with the surface charge and is attractive (repulsive) for the positively (negatively) charged surface because of the negative charge of the nucleic acid target. V2 is the target's electrostatic free e
0	BGX:  a  fully  Bayesian  gene  expression  index  for  Affymetrix  GeneChip  data
1	By  ANNE-METTE  K.  HEIN
0	Department  of  Epidemiology  and  Public  Health,  Imperial  College,  Norfolk  Place,  London  W2  1PG,  UK
1	HELEN  C.  CAUSTON
0	Microarray  Centre,  MRC  Clinical  Sciences  Centre,  Imperial  College,  Hammersmith  Hospital,  London  W12  0NN,  UK
1	GRAEME  K.  AMBLER  and  PETER  J.  GREEN
0	Some  key  words:  Bayesian,  Affymetrix,  GeneChip,  probe-level  analysis,  gene  expression,  differential  expression,  MCMC
0	Introduction  Microarrays  are  one  of  the  new  technologies  that  have  developed  in  line  with  the  sequencing  of  the  human  and  other  genomes  and  developments  in  miniaturization  and  robotics.  They  permit
0	the  expression  profiles  of  tens  of  thousands  of  genes  to  be  measured  in  a  single  experiment  and  promise  to  revolutionize  the  biomedical  and  life  sciences.  This  is  partly  because  the  gene  expression  profiles  obtained  form  a  `signature'  --  a  molecular  phenotype  --  that  can  be  used  to  characterize  the  type,  age,  disease  state  and  growth  conditions  of  an  organism.  Affymetrix  are  one  of  the  leading  manufacturers  of  microarrays  (Affymetrix  gene  expression  arrays  are  also  referred  to  as  `GeneChips')  and  these  are  widely  used.  They  differ  from  many  other  array  types  in  that  a  single  labelled  extract  is  hybridized  to  each  array  and  because  they  contain  multiple  `match'  and  `mismatch'  sequences  for  each  transcript.  This  presents  particular  challenges  for  low-level  data  analysis  including  the  integration  of  data  from  the  multiple  probes  representing  each  transcript  on  an  array  to  provide  a  measure  that  represents  gene  expression  and  its  inherent  uncertainty,  and  the  bringing  into  par  (`normalization')  of  data  from  different  arrays.
0	Affymetrix  Oligonucleotide  arrays  The  oligonucleotide  array  technology  exploits  two  fundamental  biological  properties:  (a)  mRNA  is  an  intermediate  product  between  genes  encoded  in  DNA  and  their  protein  products,  so  mRNA  abundance  can  be  used  as  a  measure  of  gene  expression,  and  (b)  single  stranded  RNA  molecules  have  a  high  affinity  to  form  double  stranded  structures.  Pairing  between  RNA  strands  is  highly  specific  and  complementary  strands  have  particularly  high  binding  affinities.  Oligonucleotide  arrays  contain  hundreds  of  thousands  of  features.  A  feature  is  a  small  rectangular  area,  containing  a  large  number  of  identical  oligonucleotides.  In  general,  a  different  oligonucleotide  sequence  is  represented  at  each  feature.  The  features  on  oligonucleotide  arrays  are  referred  to  as  probes.  A  measure  of  the  abundance  of  a  particular  transcript  RNA  in  a  biological  sample  can  be  obtained  by  going  through  the  following  procedure:  isolating  RNA,  making  a  labelled  representation  of  it,  fragmenting  the  sample,  hybridizing  the  labelled,  fragmented  RNA  to  an  array,  washing  off  the  material  that  has  not  hybridized  and  scanning  the  array  to  obtain  fluorescence  intensities  at  each  probe  (Schena  et  al.,  1995).  The  abundance  of  a  transcript  is  related  to  the  intensity  measured  at  the  features  representing  the  complementary  RNA  sequence.  On  GeneChip  arrays  oligonucleotides  of  length  25  are  used.  However,  many  genes  are  similar,  sharing  common  motifs  or  subsequences,  and  cannot,  in  general,  be  uniquely  identified  by  a  single  sequence  of  length  25.  Therefore  each  gene  is  represented  by  a  probe  set,  consisting  of  a  number  of  probe  pairs.  A  probe  pair  consists  of  a  perfect  match  probe  (PM)  and  a  mismatch  probe  (MM).  
At  each  perfect  match  probe,  an  oligonucleotide  which  perfectly  matches  part  of  the  transcript  is  represented.  The  detection  of  transcripts  at  the  PMs  of  a  probe  set  indicates  that  the  gene  is  expressed,  and  the  level  of  detection  indicates  the  degree  of  expression.  However,  although  complementary  RNA  sequences  have  particularly  high  affinities,  sequences  that  are  complementary  over  only  part  of  the  length  of  the  sequence,  or  shorter  sequence  fragments,  may  also  hybridize.  We  refer  to  the  hybridization  of  non-complementary  transcripts  to  the  probes  as  non-specific  hybridization.  This  is  the  motivation  for  including  MM  probes.  The  oligonucleotides  represented  at  an  MM  probe  are  identical  to  those  at  the  corresponding  PM  probe,  except  that  the  middle  nucleotide  is  that  of  the  complementary  base.  The  intention  is  that,  since  PM  and  MM  probes  are  almost  identical,  equal  amounts  of  non-specific  hybridization  will  occur  at  these  probes.  Excess  hybridization  to  the  PM  probe,  relative  to  the  MM  probe  will  be  due  to  specific  hybridization,  that  is,  the  hybridization  of  complementary  transcripts.  A  probe  set  for  a  gene  typically  consists  of  11-20  PM  and  MM  probe  pairs,  and  these  represent  the  information  available  about  the  expression  of  the  gene.
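The PM/MM reasoning above can be illustrated with a deliberately naive summary: if non-specific hybridization is roughly equal at the two probes of a pair, then PM minus MM approximates the specific signal, and averaging over the probe pairs gives a crude per-gene value. The sketch below assumes exactly that; the function name and the intensities are hypothetical, and this simple average is not the model developed in this paper.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one probe set.

    pm, mm: fluorescence intensities of the perfect match and mismatch
    probes, in the same probe-pair order.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    # PM - MM removes the (assumed equal) non-specific hybridization.
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for one probe set with 11 probe pairs.
pm = [1200, 980, 1500, 1100, 1340, 900, 1250, 1010, 1420, 1180, 1300]
mm = [400, 350, 600, 420, 500, 380, 450, 390, 520, 410, 480]
signal = average_difference(pm, mm)
```

A point summary of this kind carries no measure of its own uncertainty, which is the gap the modelling approach described in the following sections is meant to address.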
0	BGX: a new gene expression index
0	1.2. Gene expression experiments and analysis
0	The generation of gene expression data is a multi-step process, and variability (from different sources) may be introduced at a number of experimental stages. The variability of interest is that of biological origin, e.g., variability in gene expression between experimental conditions, individuals or tissue types. Variability of non-biological origin may arise due to differences in the preparation of the biological samples to be hybridized, in the manufacture of the arrays, or in the process of scanning the arrays (see Hartemink et al. (2001) for a more detailed discussion). The replicability of raw gene expression data is low and gene expression data is notoriously noisy. This can be clearly demonstrated by hybridizing two technical replicates of the same biological sample on two arrays. The intensities obtained will often be found to differ (Figure 1).
0	FIGURE 1 ABOUT HERE
0	The analysis of gene expression data is usually treated as a multi-step process. The individual steps often consist of correcting the intensities for background noise, estimation of gene expression indices, normalization between samples, assessment of which genes are differentially expressed and clustering of genes or conditions with similar expression profiles or patterns. The focus of this paper is on the steps leading to the estimation of gene expression and on detection of differential expression. A drawback of splitting up the analysis of gene expression data into separate steps that are dealt with independently is that the error associated with each step is ignored in the downstream analysis. In assessing differential expression, it is clearly of interest to know how reliable the expression index of a gene is. 
In turn, in the estimation of the gene expression index, it is of interest to quantify the variability in the background corrected intensities, on which the estimation is based. A primary aim of the work presented here is to develop a statistically coherent framework for the analysis of Affymetrix GeneChip arrays, in which the splitting up of the analysis into separate steps is avoided.
0	1.3. Bayesian hierarchical modelling of Affymetrix gene expression data
0	In this paper we present Bayesian hierarchical models for the analysis of gene expression data, where all steps in the process, and thus the associated errors, are modelled simultaneously. For clarity, we first set out a model for estimating the expression of genes using data obtained from a single array. In the model, background correction for non-specific hybridization and calculation of gene expression indices are considered simultaneously. We base the inference on the full posterior distributions for the parameters, so that, in addition to point estimates of gene expression levels we obtain their credibility intervals. Next, we extend the model to encompass the more commonly encountered situation, in which different experimental conditions are considered, and where replicate arrays may be available under some or all of the conditions. Here all information is used simultaneously to make the relevant inferences: where replicate arrays are available, measures of the expression of genes are obtained from a simultaneous consideration of the probe sets for the genes on the arrays. When experimental conditions are compared it is often of interest to identify genes that are differentially expressed, and to rank the genes according to their degre
0	Short  Technical  Reports
0	Analysis of DNA Microarrays by Non-Destructive Fluorescent Staining Using SYBR® Green II
0	ABSTRACT
0	A simple, non-destructive procedure is described to determine the quality of DNA arrays before they are used. It consists of a preliminary staining step of the DNA microarray by using SYBR® Green II, a fluorophore with specific affinity for ssDNA, followed by a laser scan analysis. The surface quality, integrity and homogeneity of each DNA spot of the array can thus be assessed. After this preliminary control, which may avoid further analytical steps that lead to the waste of precious biological samples, a fully reversible staining procedure is performed that produces an array ready for subsequent use.
0	INTRODUCTION
0	The use of microarrays is growing exponentially (5). The technology consists of dense arrays of DNA spots deposited on suitably prepared surfaces, mainly glass. Several formats have been
0	BioTechniques
0	plate and primers used. A portion of each PCR amplification product (5 µL) was examined by agarose gel electrophoresis, followed by ethidium bromide staining. Only PCR products showing a clear and strong band on UV transillumination were recovered by ethanol precipitation and resuspension in 15 µL 3x standard saline citrate (SSC) (450 mM NaCl, 45 mM sodium citrate, pH 7.0). The DNA concentration was determined using PicoGreen® reagent (Molecular Probes), a fluorescent nucleic acid stain useful for quantitating dsDNA in solution. The final concentration of DNA averaged 50 ng/µL. Samples were transferred into 96-well plates, which were sealed and stored at -20°C until used.
0	Preparation of Polylysine-Coated Glass Slides
0	Standard glass microscope slides (Sigma Aldrich) were pre-cleaned by immersion for at least 2 h in an alkaline wash solution consisting of 10% (w/v) NaOH and 57% (v/v) ethanol, followed by rinsing five times in double-distilled
0	water. The slides were then gently shaken for 1 h in a coating solution consisting of 35 mL Poly-L-Lysine (Sigma Aldrich; 0.1% w/v in water), 35 mL filtered PBS and 280 mL double-distilled water. Coated slides were extensively washed with double-distilled water, centrifuged at low speed (80 x g), dried in a vacuum drying oven at 45°C for 10 min and then stored at room temperature in a tightly sealed slide box. Slides were used after at least two weeks to produce a sufficiently hydrophobic surface. This aging process is a key step in obtaining a suitable surface for array preparation.
0	Printing of DNA Microarrays
0	Target DNA samples in 3x SSC were spotted on the glass slides using a piezoelectric pipet (Nanoplotter System™, Gesim GmbH, Germany). The pipet was programmed to release about 10 nL DNA solution for each DNA spot. Spots were arrayed in a 20 x 20 arrangement (400 spots in a 1.5 x 1.5-cm square with a center-to-center spacing between spots of approximately 750 µm) or a 30 x 30 arrangement (900 spots in a 1.5 x 1.5-cm square with a center-to-center spacing of 500 µm). After deposition, arrayed DNA spots were completely dried by overnight incubation at room temperature in a covered box. Printed slides were rehydrated (DNA side down) in a plastic humid chamber (Sigma Aldrich) until spots glistened and then snap-dried at 100°C.
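As a quick consistency check on the printing parameters above, the following sketch works out the span of each spot grid and the approximate DNA mass deposited per spot. It assumes the nominal values quoted in the text (about 10 nL per spot, an average of 50 ng/µL, and the stated center-to-center spacings); the function names are illustrative only.

```python
def grid_span_mm(spots_per_side, pitch_um):
    """Distance between the centres of the first and last spot in a row (mm)."""
    return (spots_per_side - 1) * pitch_um / 1000.0


def dna_per_spot_ng(volume_nl, conc_ng_per_ul):
    """Approximate DNA mass deposited in one spot (1 uL = 1000 nL)."""
    return volume_nl / 1000.0 * conc_ng_per_ul


span_20 = grid_span_mm(20, 750)   # 19 gaps of 750 um
span_30 = grid_span_mm(30, 500)   # 29 gaps of 500 um
mass = dna_per_spot_ng(10, 50)    # nominal 10 nL at 50 ng/uL

print(f"20 x 20 grid spans {span_20} mm, 30 x 30 grid spans {span_30} mm")
print(f"about {mass} ng DNA per spot")
```

Both grids span a little under 15 mm from first to last spot centre, which agrees with the 1.5 x 1.5-cm square quoted for each arrangement.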
0	BMC  Bioinformatics
0	Methodology  article
0	BioMed  Central
0	Open  Access
0	In  silico  microdissection  of  microarray  data  from  heterogeneous  cell  populations
1	Harri Laehdesmaeki1, Ilya Shmulevich2, Valerie Dunmire2, Olli Yli-Harja1 and Wei Zhang*2
0	Background: Very few analytical approaches have been reported to resolve the variability in microarray measurements stemming from sample heterogeneity. For example, tissue samples used in cancer studies are usually contaminated with the surrounding or infiltrating cell types. This heterogeneity in the sample preparation hinders further statistical analysis, significantly so if different samples contain different proportions of these cell types. Thus, sample heterogeneity can result in the identification of differentially expressed genes that may be unrelated to the biological question being studied. Similarly, irrelevant gene combinations can be discovered in the case of gene expression based classification. Results: We propose a computational framework for removing the effects of sample heterogeneity by "microdissecting" microarray data in silico. The computational method provides estimates of the expression values of the pure (non-heterogeneous) cell samples. The inversion of the sample heterogeneity can be facilitated by providing accurate estimates of the mixing percentages of different cell types in each measurement. For those cases where no such information is available, we develop an optimization-based method for joint estimation of the mixing percentages and the expression values of the pure cell samples. We also consider the problem of selecting the correct number of cell types. Conclusion: The efficiency of the proposed methods is illustrated by applying them to carefully controlled cDNA microarray data obtained from heterogeneous samples. 
The  results  demonstrate  that  the  methods  are  capable  of  reconstructing  both  the  sample  and  cell  type  specific  expression  values  from  heterogeneous  mixtures  and  that  the  mixing  percentages  of  different  cell  types  can  also  be  estimated.  Furthermore,  a  general  purpose  model  selection  method  can  be  used  to  select  the  correct  number  of  cell  types.
0	Table 3: The measured mixing percentages (RKO/normal) in the five heterogeneous samples.
0	Sample #1: RKO 100, normal 0
0	Recent developments in high-throughput genomic technologies have revolutionized the approaches aimed at understanding biological systems and emphasized the need for computational and systems biology research. Microarray analysis, for instance, can provide massive amounts of information about a biological sample by simultaneously measuring thousands of transcript levels. Application of such methodologies has already yielded important molecular insight into cellular phenotypes under various experimental conditions [1] and provided new knowledge about the development and treatment of human diseases, such as cancers [2-4]. During the last several years, microarray technology has undergone continued improvement with better quality control in the overall measurement process, ranging from hybridization conditions to image processing techniques [5]. Nevertheless, to fully harness the power of the microarray technology to study biological materials such as cancer tissues, one has to deal with a source of measurement variability that comes from the biological materials themselves, which rarely consist of homogeneous cell populations. For example, except for a few types of immune-privileged tissues such as the brain, most solid tumor tissues contain infiltrating lymphocytes as a result of the immune response. Most tumor tissues also contain endothelial cells as part of the necessary vasculature systems that provide nutrients for the tumor cells. The complexity of this problem is that different tumor tissues contain different proportions of these non-tumor cells. Therefore, if tumor tissues are used without consideration of such a mixing phenomenon, measurement of differential gene expression will certainly be confounded by the heterogeneous cell populations. 
In  some  studies  [6],  pathologists  carefully  evaluated  the  tissues  and  only  selected  tissues  with  more  than  a  certain  percentage  of  tumor  cells.  This  prescreening  step,  however,  results  in  the  exclusion  of  many  tumor  tissues  for  the  study  and  contributes  to  the  small  sample  size  problem  in  some  of  the  studies.  Alternatively,  laser  capture  microdissection  (LCM)  technology  can  be  used  to  purify  the  tumor  cells  from  mixed  populations  [7].  This  approach  has  been  very  successful  in  DNA-based  studies  because  of  the  relatively  high  stability  of  DNA.  However,  for  microarray  studies,  which  require  less  stable  RNA,  LCM  has  seen  limited  success  because  it  is  much
0	more challenging to maintain RNA stability during the microdissection process. Other drawbacks of LCM are that such procedures are time-consuming and yield insufficient quantities of RNA, thus requiring multiple amplification steps that may confound quantitative inferences from gene expression data. A recent paper by Ghosh [8] introduced a mixture model based framework for determining differential expression in the presence of mixed cell populations. In this study, we aim at reconstructing the actual expression values of the pure cell types from the heterogeneous mixtures. That is, we develop a computational method that removes the effect of mixing from heterogeneous samples and microdissects microarray data in silico. Similar analytical approaches have been previously proposed by Lu et al. [9], Stuart et al. [10] and Venet et al. [11]. Lu et al. focused on estimating the fraction of cells in different phases of the cell cycle whereas Stuart et al. considered the problem of estimating the cell type specific expression patterns over all samples. Here we focus on estimating both the sample and cell type specific expression values using carefully controlled microarray experiments. The inversion of the 'cell mixing effect' can be made appreciably easier by providing estimates of the mixing percentages of different cell types in each measurement, which can be measured by an experienced pathologist. The entire process does not hinge upon such measurements, however, as the mixing percentages can be estimated within the modeling framework. Venet et al. [11] introduced some preliminary methods and results for tackling the same problem as we consider here. 
In particular, they used a regression-based framework similar to that of [10] and to the one we use. We also consider the problem of selecting the correct number of cell types using the cross-validation model selection framework.
0	The  microarray  data  to  which  we  apply  our  computational  methods  consists  of  five  different  heterogeneous  mixtures  of  lymph  node  and  colon  cancer  samples  which  are  hereafter  abbreviated  as  normal  and  RKO,  respectively.  For  more  details,  see  Materials  and  methods  Section.  Each
0	heterogeneous  mixture  consists  of  different  fractions  of  different  cell  samples,  see  Table  3.
0	Inversion  of  sample  heterogeneity  The  first  goal  is  to  invert  the  mixing  effect  caused  by  sample  heterogeneity.  We  apply  the  linear  model  developed  in  Materials  and  methods  Section  to  the  heterogeneous  microarray  data.  The  obtained  results  are  presented  below.
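To make the idea of inverting a linear mixing effect concrete, the sketch below treats one gene and two cell types. It assumes that the measured expression in each mixture is a fraction-weighted sum of the two pure expression values and that the mixing fractions are known; the pure values are then recovered by ordinary least squares. The function name, the fractions and the expression values are hypothetical, and the model detailed in the Materials and methods Section may differ (for example in scale or in the number of cell types).

```python
def unmix_two_types(fractions, measured):
    """Least-squares estimate of the pure expression values of one gene.

    fractions: fraction of cell type A in each mixture (type B is 1 - fraction)
    measured:  expression of the gene measured in each mixture
    Assumed model: measured = a * x_a + (1 - a) * x_b
    """
    # Entries of the 2 x 2 normal equations.
    saa = sum(a * a for a in fractions)
    sab = sum(a * (1 - a) for a in fractions)
    sbb = sum((1 - a) ** 2 for a in fractions)
    say = sum(a * y for a, y in zip(fractions, measured))
    sby = sum((1 - a) * y for a, y in zip(fractions, measured))

    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("mixing fractions do not identify both cell types")

    x_a = (sbb * say - sab * sby) / det
    x_b = (saa * sby - sab * say) / det
    return x_a, x_b


# Hypothetical, noise-free example: pure values 8.0 (type A) and 2.0 (type B).
fractions = [1.0, 0.75, 0.5, 0.25, 0.0]
measured = [8.0, 6.5, 5.0, 3.5, 2.0]
x_a, x_b = unmix_two_types(fractions, measured)
```

With noisy data the same routine returns the best linear fit rather than the exact pure values, and the estimate degrades as the mixing fractions become more similar to one another.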
0	clearly shows that the heterogeneous samples ('m1' through 'm5') are located almost on a straight line in the 2-dimensional PCA space. Furthermore, the line on which the heterogeneous samples lie is parallel to the first principal component, suggesting that the most significant variation in the data is due to the linear mixing effect. The estimated expression profiles of the pure colon cancer cells and lymphocytes are close to samples #1 and #5, respectively, indicating that the inversion of the mixing phenomenon produces reasonable results. The results are more easily appreciated when only the most significant PCA component is shown. As discussed above, the variation in the most significant PCA component is due to the mixing effect. The results in Figure 2 (a) are as in Figure 1, but now shown in one dimension in order to facilitate the interpretation. Results in Figure 2 (b), in turn, are as in Figure 2 (a) except that the inversion was done using only the samples #2, #3, and #4. This represents a more difficult and realistic case, since fewer mixtures are available. When comparing Figure 2 (a) with Figure 2 (b), one can conclude that the method performs slightly better when more samples are used to estimate the true expression profiles, as expected. Overall performance, however, is good in both cases. The est
0	BMC  Bioinformatics
0	ProbeMaker:  an  extensible  framework  for  design  of  sets  of  oligonucleotide  probes
1	Johan  Stenberg*,  Mats  Nilsson  and  Ulf  Landegren
0	Background:  Procedures  for  genetic  analyses  based  on  oligonucleotide  probes  are  powerful  tools  that  can  allow  highly  parallel  investigations  of  genetic  material.  Such  procedures  require  the  design  of  large  sets  of  probes  using  application-specific  design  constraints.  Results:  ProbeMaker  is  a  software  framework  for  computer-assisted  design  and  analysis  of  sets  of  oligonucleotide  probe  sequences.  The  tool  assists  in  the  design  of  probes  for  sets  of  target  sequences,  incorporating  sequence  motifs  for  purposes  such  as  amplification,  visualization,  or  identification.  An  extension  system  allows  the  framework  to  be  equipped  with  application-specific  components  for  evaluation  of  probe  sequences,  and  provides  the  possibility  to  include  support  for  importing  sequence  data  from  a  variety  of  file  formats.  Conclusion:  ProbeMaker  is  a  suitable  tool  for  many  different  oligonucleotide  design  and  analysis  tasks,  including  the  design  of  probe  sets  for  various  types  of  parallel  genetic  analyses,  experimental  validation  of  design  parameters,  and  in  silico  testing  of  probe  sequence  evaluation  algorithms.
0	Increasing numbers of methods are being developed for parallel nucleic acid analyses for different purposes. Many of these methods employ sets of oligonucleotide probes or probe pairs that hybridize to the sequences targeted for analysis, allowing the probe sequences to be acted upon by one or more enzymes, creating new molecular species that reflect the presence or nature of the different target sequences. The reaction products generally contain identifying sequences or other features that allow the separation of signals originating from different targets. This is the case in methods such as the multiplex oligonucleotide ligation assay (OLA) [1], the multiplex ligation-dependent probe amplification assay (MLPA) [2], the RNA- and cDNA-mediated annealing, selection, extension and ligation assays (RASL, DASL) [3,4], the GoldenGate genotyping assay [5], multiplex minisequencing [6], and the padlock or molecular inversion probe assay [7,8]. The latter method has been used to genotype more than 10,000 single nucleotide polymorphisms (SNPs) in multiplex. Another method that utilizes sets of oligonucleotide probes for multiplex processing of nucleic acid molecules is the selector amplification technique. This technique uses partially double-stranded oligonucleotides, called selectors, to circularize a selection of restriction fragments from total genomic DNA, and it incorporates a general sequence motif that allows parallel amplification of all circularized fragments using a single primer pair [9]. With molecular solutions to many tasks of highly parallel genetic analysis now at hand, other factors become limiting, such as the design and the synthesis of reagents. In the
0	work presented here, we address the problem of large-scale probe design. When large numbers of probes are combined, the risk for unintended interactions between probes and targets must be considered. This risk places strict requirements on the design of sets of probes to be used together. In particular, it is important that probes do not contain sequences that result in the production of detectable signal from any probe in the absence of its cognate target molecule, or that otherwise interfere with the activity of other probes in the set. Due to these and other constraints and the many possible alternative probe sequences to evaluate, the difficulty of designing probe sets increases rapidly with the size of the probe sets. Many computer programs exist for the design of oligonucleotide probes such as PCR primers [10-12], microarray probes [13,14], and more [15]. These programs define algorithms to evaluate the risk of primer or probe sequences being involved in undesired interactions such as probe homo- or heterodimer formation, cross-hybridization, false priming, etc. However, the available programs are generally limited in scope, and are not applicable to the task of designing sets of complex probes containing multiple sequence elements. The ProbeMaker software presented herein is a framework for computer-assisted design and analysis of sets of oligonucleotide probe sequences composed of several functional sequence elements. As the composition of probes and the constraints imposed on sets of probes vary between applications, this framework has been constructed to support the design of different types of probes using application-specific constraints, as defined by the user. 
ProbeMaker  takes  as  input  a  set  of  target  sequences  and  a  number  of  sets  of  so-called  'tag'  sequences.  These  tag  sequences  may  serve  as  targets  for  restriction  digestion,  as  binding  sites  for  amplification  primers  or  fluorescent  detection  probes,  or  as  identification  codes  for  individual  amplification  products  that  are  decoded  by  hybridization  to  oligonucleotide  arrays  [16].  Probes  are  designed  for  each  target  by  construction  of  target-specific  sequences  and  addition  of  tag  sequences  according  to  rules  specified  by  the  user.  Different  combinations  of  sequence  elements  are  evaluated  for  each  probe,  and  a  set  of  probe  sequences  is  created  that  satisfies  user-defined  criteria.
0	The main objectives in the development of ProbeMaker were to provide a framework that is flexible, in the sense that it should support design of oligonucleotide probes for different purposes, and extensible, in that it should be possible to add support for designing new types of probes and to add new types of design constraints. Furthermore, the software should be adaptable to new applications, and it should have the potential to import sequence data from a variety of sources. The flexibility is provided by the target and probe sequence data structures used. Each target defines two template sequences that are used to construct target-specific sequences (TSSs) to use in the corresponding probe. Each probe is made up of two such TSSs and a number of tag sequences, which may be located 5' of, between, or 3' of the TSSs. As TSSs may be of zero length, this system allows the design of many different types of probes. Support for more than two TSSs per probe was not deemed necessary as this is not used in any current methods. Furthermore, targets may be grouped, allowing the program to perform selection of tag sequences based on the relations of target sequences, for example variants of the same polymorphic sequence. The extensibility is realized by using an extension mechanism for much of the functionality. Extensions are constructed in the form of Java classes that implement defined interfaces and may be loaded into the framework at run-time. This mechanism allows the addition of new target types and support for different formats for sequence input and output, as well as design constraints and acceptor schemes, the function of which will be described below. ProbeMaker may be run through a graphical user interface or from the command line. For the graphical user interface, a set of target sequences and sets of tag sequences are provided as input by the user. Application-specific parameters for probe design and evaluation are set through the user interface. When running ProbeMaker from the command line, a project file defining all sequences and parameters is used as input. 
The potential for supporting different file formats is provided by using the sequence input system of the MolTools Java library [17]. A combination of components for sequence file parsing, sequence notation conversion, and post-import modifications are used to allow creation of sets of any type of target from a variety of sequence file formats, with the possibility to carry out other operations on the imported data, such as selecting which position within the target sequence to design probes for, or to group or sort sequences based on some particular property.
0	For  a  given  set  of  targets,  and  a  number  of  sets  of  tag  sequences,  ProbeMaker  performs  two  tasks  (Figure  1A).  Firstly,  TSSs  are  constructed  for  each  target  as  determined  by  the  target  type  in  use,  forming  the  basis  for  a  probe  for  that  target.  Secondly,  tag  sequences  are  added  to  each  probe  sequentially  in  a  pattern  specified  by  the  user.
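The two tasks just described can be sketched as a small data structure: a probe holds two TSSs and three tag slots (5' of, between, and 3' of the TSSs), and tags are appended one at a time according to a user-supplied pattern. This is an illustration in Python rather than the Java framework itself; the class, the function, the rule of taking the first bases of each template as the TSS, and all sequences are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Probe:
    """A probe made of two target-specific sequences (TSSs) and tags."""
    tss_5prime: str                       # may be of zero length
    tss_3prime: str                       # may be of zero length
    tags_5prime: List[str] = field(default_factory=list)
    tags_between: List[str] = field(default_factory=list)
    tags_3prime: List[str] = field(default_factory=list)

    def add_tag(self, tag: str, position: str) -> None:
        """Append a tag to one of the three slots."""
        slots = {
            "5prime": self.tags_5prime,
            "between": self.tags_between,
            "3prime": self.tags_3prime,
        }
        slots[position].append(tag)

    def sequence(self) -> str:
        """Full probe sequence written 5' to 3'."""
        parts = (self.tags_5prime + [self.tss_5prime] + self.tags_between
                 + [self.tss_3prime] + self.tags_3prime)
        return "".join(parts)


def design_probe(template_5prime: str, template_3prime: str, tss_length: int,
                 tag_pattern: List[Tuple[str, str]]) -> Probe:
    # Task 1: construct the TSSs (assumed rule: first tss_length bases).
    probe = Probe(template_5prime[:tss_length], template_3prime[:tss_length])
    # Task 2: add tags sequentially in the user-specified pattern.
    for tag, position in tag_pattern:
        probe.add_tag(tag, position)
    return probe


# Hypothetical probe with both tags placed between the two TSSs.
probe = design_probe("ACGTACGGTT", "TTGCAACCGG", 6,
                     [("GGATCC", "between"), ("CATG", "between")])
```

Because either TSS may be an empty string, the same structure also covers probe types that carry a single target-specific element.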
0	BMC  Genomics
0	Research  article
0	A  generic  approach  for  the  design  of  whole-genome  oligoarrays,  validated  for  genomotyping,  deletion  mapping  and  gene  expression  analysis  on  Staphylococcus  aureus
1	Yvan  Charbonnier*1,2,  Brian  Gettler1,2,  Patrice  Francois1,  Manuela  Bento1,  Adriana  Renzoni3,  Pierre  Vaudaux3,  Werner  Schlegel2  and  Jacques  Schrenzel1,4
0	Background: DNA microarray technology is widely used to determine the expression levels of thousands of genes in a single experiment, for a broad range of organisms. Optimal design of immobilized nucleic acids has a direct impact on the reliability of microarray results. However, despite small genome size and complexity, prokaryotic organisms are not frequently studied to validate selected bioinformatics approaches. Relying on parameters shown to affect the hybridization of nucleic acids, we designed freely available software and experimentally validated its performance on the bacterial pathogen Staphylococcus aureus. Results: We describe an efficient procedure for selecting 40-60 mer oligonucleotide probes combining optimal thermodynamic properties with high target specificity, suitable for genomic studies of microbial species. The algorithm for filtering probes from extensive oligonucleotide libraries fitting standard thermodynamic criteria includes positional information of predicted target-probe binding regions. This algorithm efficiently selected probes recognizing homologous gene targets across three different sequenced genomes of Staphylococcus aureus. BLAST analysis of the final selection of 5,427 probes yielded >97%, 93%, and 81% of Staphylococcus aureus genome coverage in strains N315, Mu50, and COL, respectively. A manufactured oligoarray including a subset of control Escherichia coli probes was validated for applications in the fields of comparative genomics and molecular epidemiology, mapping of deletion mutations and transcription profiling. Conclusion: This generic chip-design process merging sequence information from several related genomes improves genome coverage even in conserved regions.
0	Current hybridization technologies allow assaying thousands of nucleic acid sequences in a single reaction on a solid substrate. Such massively parallel systems offer unprecedented opportunities for basic research and diagnostic applications, including gene sequencing [1], detection of genetic polymorphisms [2], genome-composition analysis [3,4] and measurement of gene expression profiles in prokaryotes [5,6] or cancer cells [7]. Oligonucleotide probes (up to 70-mer) offer more flexibility than cDNA probes since they can be tailored according to optimal in silico physico-chemical and specificity properties, and applied to any sequence data. Early available probe design software identified sets of probes sharing homogeneous thermodynamic properties for probe-target hybridization [8]. More elaborate software tools include cross-homology testing of probes against a reference database by BLAST (Basic Local Alignment Search Tool) [9,10] or incorporate prediction of secondary structures into the thermodynamically based approach [11-14]. A frequent drawback of some of these algorithms is that they yield an excessive number of unprocessed BLAST outputs, which complicates final selection of the most specific probes. Furthermore, these approaches do not take into consideration probe interaction with the microarray surface, in particular the impact of the position of mismatches between the target and probes, as shown by Hughes et al. [15]. Designing reliable oligonucleotide probes with available software is quite difficult for bacterial genomes with low GC content [16], low complexity in sequence composition, or frequent conserved repeats leading to erroneous target identification by cross-hybridization. 
The reported method (OliCheck) implements an algorithm for filtering libraries of oligonucleotide probes that share homogeneous thermodynamic properties, using positional information on predicted target-probe binding regions. An additional feature of OliCheck is the annotation of probes recognizing highly conserved targets shared by different genomes. Staphylococcus aureus (S. aureus) was selected as a model organism for implementing and experimentally validating this approach. The choice of this clinically important pathogen for fundamental and applied genomic studies was prompted by the availability of several fully or partially sequenced strain genomes [16-18]. A set of feature elements was designed with OliCheck to yield extensive S. aureus genome coverage. This S. aureus-specific probe set, together with control probes, was used to manufacture an oligoarray that was extensively validated for comparative genomics, molecular epidemiology, mapping of deletion mutations, and transcription profiling applications. The specificity, signal-response linearity, and influence of hybridization temperature for transcript profiling are also described.
0	Further  genomic  oligoarrays  of  several  distinct  microbial  species  have  been  successfully  designed  using  this  generic  methodological  approach.
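The filtering idea described above (thermodynamic pre-selection followed by a position-aware check of predicted off-target binding regions) can be illustrated with a minimal Python sketch. This is not OliCheck's implementation: the `Hit` record, the GC window, the linear position weights and the `max_score` cutoff are all hypothetical choices made for illustration. The only ideas taken from the text are the 40- to 60-mer length range and the notion that the position of a binding region along the probe (surface end versus solution end) matters.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One predicted probe/non-target binding region (hypothetical record)."""
    start: int        # first probe position covered (0 = surface end)
    end: int          # one past the last covered position
    identity: float   # fraction of identical bases within the region


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cross_hyb_score(probe_len: int, hit: Hit) -> float:
    """Toy cross-hybridization score in [0, 1].

    Assumption (not from the paper): positions closer to the solution end
    contribute linearly more to binding than positions near the surface.
    """
    weights = [(i + 1) / probe_len for i in range(probe_len)]
    covered = sum(weights[hit.start:hit.end])
    return hit.identity * covered / sum(weights)


def keep_probe(seq, off_target_hits, gc_range=(0.3, 0.6), max_score=0.5):
    """Return True if the probe passes length, GC and off-target checks."""
    if not 40 <= len(seq) <= 60:          # length range quoted in the abstract
        return False
    if not gc_range[0] <= gc_fraction(seq) <= gc_range[1]:
        return False                       # hypothetical GC window
    return all(cross_hyb_score(len(seq), h) <= max_score
               for h in off_target_hits)
```

With these toy weights, a perfect 24-base off-target match at the solution end of a 48-mer rejects the probe, whereas the same match at the surface end does not. A real design tool would replace the linear weights with an empirically calibrated intensity prediction.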
0	In silico properties of the S. aureus oligoarray and manufacturing of StaphChip
0	The final set of 5,335 S. aureus OliCheck-filtered probes recognized 97.5, 93.0, and 81.0% of N315, Mu50, and COL ORFs, respectively. The low residual percentage of
0	[Figure: OliCheck probe-selection workflow, Steps A to D. Recoverable labels: BLAST of probes against N315 (2,593 ORFs); predicted hybridization intensities (%) between the surface end and the solution end of Probe A and Probe B.]
0	BMC  Genomics
0	BMC  Genomics  2002,  3
0	BioMed  Central
0	Methodology  article
0	Open  Access
0	Optimization  and  evaluation  of  T7  based  RNA  linear  amplification  protocols  for  cDNA  microarray  analysis
1	Hongjuan Zhao1, Trevor Hastie2, Michael L Whitfield3, Anne-Lise Børresen-Dale4 and Stefanie S Jeffrey*1
0	Background: T7 based linear amplification of RNA is used to obtain sufficient antisense RNA for microarray expression profiling. We optimized and systematically evaluated the fidelity and reproducibility of different amplification protocols using total RNA obtained from primary human breast carcinomas and high-density cDNA microarrays. Results: Using an optimized protocol, the average correlation coefficient of gene expression of 11,123 cDNA clones between amplified and unamplified samples is 0.82 (0.85 when a virtual array was created using repeatedly amplified samples to minimize experimental variation). Less than 4% of genes show changes in expression level of 2-fold or greater after amplification compared to unamplified samples. Most changes due to amplification are not systematic, either within one tumor sample or between different tumors. Amplification appears to dampen the variation of gene expression for some genes when compared to unamplified poly(A)+ RNA. The reproducibility between repeatedly amplified samples is 0.97 when performed on the same day, but drops to 0.90 when performed weeks apart. The fidelity and reproducibility of amplification are not affected by decreasing the amount of input total RNA in the 0.3-3 µg range. Adding template-switching primer, DNA ligase, or column purification of double-stranded cDNA does not improve the fidelity of amplification. The correlation coefficient between amplified and unamplified samples is higher when total RNA is used as template for both experimental and reference RNA amplification. Conclusion: T7 based linear amplification reproducibly generates amplified RNA that closely approximates the original sample for gene expression profiling using cDNA microarrays.
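The two summary statistics quoted above (a correlation coefficient between amplified and unamplified expression values, and the fraction of genes changing by 2-fold or more) can be sketched as follows. The abstract does not state which correlation coefficient was used or how the values were preprocessed, so this sketch assumes Pearson correlation on log2 expression ratios; the function names and the toy data in the usage note are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fraction_changed(log2_amp, log2_unamp, fold=2.0):
    """Fraction of genes whose log2 ratio differs by at least log2(fold).

    Assumes both inputs are log2 expression ratios for the same genes,
    in the same order.
    """
    threshold = math.log2(fold)
    changed = sum(1 for a, u in zip(log2_amp, log2_unamp)
                  if abs(a - u) >= threshold)
    return changed / len(log2_amp)
```

For example, with unamplified log2 ratios `[-2, -1, 0, 1, 2]` and amplified ratios `[-0.5, -0.5, 0.0, 0.5, 3.5]`, two of the five genes differ by at least one log2 unit, so `fraction_changed` returns 0.4.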
0	Gene  expression  profiling  using  complementary  DNA  (cDNA)  microarrays  is  being  applied  for  multiple  purposes  such  as  defining  the  taxonomy  of  different  molecular
0	subtypes  of  human  breast  and  other  cancers  [1-10]  and  discovering  biomarkers  and  therapeutic  targets  [11,12].  A  limitation  of  the  use  of  this  technology  is  that  small  specimens  of  human  tissue,  such  as  obtained  by  core  needle  or
0	fine needle aspiration (FNA) biopsies, may not be sufficient for microarray hybridization using direct labelling protocols. Typical microarray labelling procedures require 2-4 µg poly(A)+ RNA or 25-50 µg total RNA per cDNA microarray. This amount of poly(A)+ RNA or total RNA can be obtained from samples of human tissue that weigh more than 50-100 mg. However, core needle biopsies of breast cancers, for example, weigh in the 10-25 mg range and yield only 3-15 µg of total RNA. Small tumors identified using early detection strategies may thus be too small to provide a specimen with enough RNA for microarray analysis. A pilot study by Assersohn et al. [13] showed that only 15% of FNA samples from human breast cancers produced sufficient mRNA for expression array analysis. One approach to low specimen RNA input has been to use indirect labelling techniques, such as those based on aminoallyl nucleotides, to increase fluorescence signal intensity. Although indirect labelling is less expensive, we and other colleagues have found that it is not always as reliable as direct labelling. For valuable tumor specimens, reliability is paramount. A very recent report used amino C6dT-modified random hexamers to prime cDNA synthesis in conjunction with aminoallyl-dUTP and increased fluorescence intensity enough that as little as 1 µg of total RNA from cell lines gave sufficient signal for cDNA microarray hybridization [14]. The reliability of this method with human tumor specimens warrants further testing. RNA amplification techniques have been developed to address the need for sufficient RNA from tiny specimens for microarray hybridization.
Other examples of specimens requiring amplification for genome-wide characterization of gene expression include purified populations of cells obtained by flow cytometry, laser capture microdissection, breast ductal or bronchial lavage, or microendoscopy. Although one group has used unamplified total RNA extracted from ~2 × 10^4 microdissected cells for hybridization on 5000-clone membrane-based arrays [15], most groups perform RNA amplification for this purpose [16-18], especially when using high-density slide-based arrays. The most commonly used mechanism for RNA amplification is a T7 based linear amplification method first developed by Van Gelder, Eberwine and coworkers [19-21]. This method utilizes a synthetic oligo(dT) primer containing the phage T7 RNA polymerase promoter to prime synthesis of first strand cDNA by reverse transcription of the poly(A)+ RNA component of total RNA. Second strand cDNA is synthesized by degrading the poly(A)+ RNA strand with RNase H, followed by second strand synthesis with E. coli DNA polymerase I. Amplified antisense RNA (aRNA) is obtained from in vitro transcription of the double-stranded cDNA (ds cDNA) template using T7 RNA
0	Table  1:  Correlation  coefficients  of  amplified  and  unamplified  expression  levels  of  14,044  genes  selected  according  to  the  described  criteria.  Amplifications  with  or  without  TS  primer  and  with  two  different  ds  cDNA  cleanup  protocols  were  performed  on  BC91  total  RNA.
0	[Table 1 body not recovered. Surviving column headers: column for ds cDNA cleanup; reference RNA amplified; total RNA; poly(A)+ RNA; virtual; average.]
0	Stefan  Tomiuk  is  a  member  of  the  bioinformatics  group  at  MEMOREC,  a  Cologne-based  biotechnology  company  focusing  on  gene  discovery  and  expression  profiling  by  SAGE  and  cDNA  microarrays.  He  participates  in  building  up  the  company's  cDNA  collection  and  is  responsible  for  the  selection  of  DNA  fragments  suitable  for  microarray  application.  Kay  Hofmann  is  head  of  the  bioinformatics  group  at  MEMOREC.
0	Microarray  probe  selection  strategies
1	Stefan  Tomiuk  and  Kay  Hofmann
0	Keywords:  cDNA  microarray,  expression  profiling,  high  throughput,  clustering,  hybridisation
0	During  recent  years,  DNA  microarrays  have  become  the  method  of  choice  to  monitor  the  expression  level  of  a  large  number  of  genes.  Depending  on  the  focus  of  the  study  and  the  method  of  microarray  fabrication,  a  number  of  different  strategies  for  probe  selection  may  be  most  appropriate.  One  consideration  concerns  the  length  of  the  probe,  ranging  from  some  25  residues  used  for  oligonucleotide  arrays  to  complete  cDNAs.  Unless  resources  are  truly  unlimited,  an  important  decision  to  be  made  is  the  amount  of  effort  to  be  put  into  the  selection  of  genes  and  gene  fragments.  While  high-throughput  cDNA  arraying  projects  usually  will  select  from  a  collection  of  existing  cDNA  clones,  smaller  projects  focusing  on  a  number  of  selected  genes  can  afford  to  selectively  amplify  fragments  optimised  for  that  purpose.  This  paper  discusses  the  full  scope  of  probe  selection  strategies,  highlighting  the  problems  that  may  be  encountered  in  the  various  systems.
0	DNA microarrays are made up of a collection of distinct nucleic acid samples, arranged in a regular lattice of spots on a solid support generally made of coated glass. Arrays intended to monitor changes in the expression level of various genes use cDNA samples or synthetic oligonucleotides derived from cDNA sequences.1,2 Other possible array applications include the detection of mutations or copy number changes on the genome level 3-5 and thus use samples derived from genomic DNA. The successful application of each DNA microarray technique requires particular conditions and prerequisites, which impose certain criteria for selecting appropriate DNA probes. The following paragraphs focus on probe selection strategies for the more widely used expression arrays of both the oligonucleotide- and cDNA-using variety. Nevertheless, some of these criteria are also valid for mutation-detection arrays.
0	GENERAL  CONSIDERATIONS
0	When monitoring the expression level of a large number of genes, sufficient sensitivity and specificity of an array, as well as the broad coverage of all relevant genes, are of crucial importance. In addition, the quality of the array should guarantee the reproducibility of the results to ensure their statistical significance. A further prerequisite for a successful interpretation of the array results is a correct assignment and annotation of the DNA probes, providing an unambiguous link to the corresponding entries in gene and literature databases. Some aspects of probe design, including the fragment length, are influenced by the manufacturing process of the arrays. Photolithographic procedures allow a massively parallel production of oligonucleotide arrays, but are restricted to an oligonucleotide length of 20-25 nucleotides due to the high error rate of each extension cycle.6-8 Alternative methods for in situ oligonucleotide synthesis, employing high-precision delivery of chemical
0	Physical  properties  of  the  probe  influence  hybridisation  kinetics
0	High coverage but poor sample annotation in high-density arrays
0	Short vs. long array probes
0	reliable hybridisation properties but the increased viscosity might complicate the array manufacturing process. In addition, increasing the fragment length raises the danger of non-specific cross-hybridisation events. If fragments of very heterogeneous length are used, the comparability of the investigated genes and the robustness of the array might suffer from the different hybridisation kinetics. Oligonucleotide probes with a length of 50-60 nucleotides may not be suitable for reliably distinguishing single base mismatches, but show an improved specificity and sensitivity compared to shorter oligonucleotides.9,30
0	The most appropriate probe selection strategy depends primarily on the objective of the experiment. As summarised in Figure 1, there is a whole spectrum of different approaches, differing in aspects of throughput, accuracy and the necessary effort before and after the microarray experiment. In situations where little prior information on relevant genes is available, or where the prime motivation is an unbiased overview of global changes in gene expression patterns, the high-density method is the appropriate choice. Typically, samples are selected from a pre-existing collection of cDNA sequences or fragments, or they are synthesised by a method amenable to high throughput. The downside of this approach is a general lack of reliable sample annotation, shifting some of the necessary work to the post-hybridisation phase. These high-density microarrays, which aim to cover the complete transcriptome of a biological system,2,7 are in contrast to small but specialised arrays that are designed with a focus on defined subject areas such as, for example, genes relevant to a particular metabolic pathway or a particular tissue type.31,32 The limited number of DNA fragments on these low-density arrays allows a more thorough selection and annotation protocol. Obviously, there also exists a whole range of intermediates
0	The quality of EST-based arrays depends on the reliability of the library used
0	Spotting  without  prior  sequencing
0	PCR-amplification  is  the  most  reliable  but  most  expensive  probe  generating  method
0	between ultrahigh-density and high-accuracy arrays. In the following paragraphs, some common strategies for probe selection are discussed. The easiest and cheapest method consists of the spotting of clones from a library without prior sequencing. Only those clones that show differential expression after hybridisation are submitted to sequencing and further analysis. This strategy is particularly useful for arrays produced in small editions, since only a small fraction of presumably interesting genes must be annotated. The more frequently a particular array set-up is used, the less efficient becomes the deferment of the sequence analysis. Typical applications include high-throughput screens for potential new drug targets,33,34 or the analysis of `exotic' biological systems without any available sequence information. Owing to the frequent representation bias of some genes, a normalisation of the library used is strongly recommended for reaching a more equal distribution.35 A somewhat more refined strategy relies on available collections of sequenced cDNA clones. Most of the available clones have the status of ESTs (expressed sequence tags36), and their corresponding sequences are collected in the dbEST database.37 Access to the physical clones of most animal ESTs is provided by the IMAGE consortium (Integrated Molecular Analysis of Genomes and their Expression),38 and by
0	several  distributors.  Since  clones  from  this  exhaustive  collection  are  also  available  in  large  sets,  they  are  a  valuable  and  widely  used  source  for  microarray  probes.  For  plants  and  other  organisms,  similar  sources  exist.  A  comm
0	Research  Update
0	Genome  Analysis
0	Eubacterial  phylogeny  based  on  translational  apparatus  proteins
1	Celine  Brochier,  Eric  Bapteste,  David  Moreira  and  Herve  Philippe
0	Lateral gene transfers are frequent among prokaryotes, although their detection remains difficult. If all genes are equally affected, this calls into question the very existence of an organismal phylogeny. The complexity hypothesis postulates the existence of a core of genes (those involved in numerous interactions) that are unaffected by transfers. To test the hypothesis, we studied all the proteins involved in translation from 45 eubacterial taxa, and developed a new phylogenetic method to detect transfers. Few of the genes studied show evidence for transfer. The phylogeny based on the genes devoid of transfer is very consistent with the ribosomal RNA tree, suggesting that a eubacterial phylogeny does exist.
0	The  completion  of  many  genome  sequence  projects  has  revealed  the  fundamental  importance  of  lateral  gene  transfers
0	species and that have no (or very few) duplicated copies. We concatenated the sequences of the 57 genes into a large fusion (~9000 amino acid positions). The phylogeny based on this fusion is very similar to that inferred from rRNA and gene content. Detailed analysis revealed that 13 out of the 57 gene phylogenies were INCONGRUENT (see Glossary) with the phylogeny based on the fusion of the 57 genes, either due to methodological tree-reconstruction problems or to a few recent LGTs. A true organismal phylogeny for Bacteria seems to exist, which could be fully resolved by the analysis of a core group of very rarely transferred genes.
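The concatenation step mentioned above (joining per-gene alignments into one large fusion) can be sketched in Python as follows. This is an illustrative sketch, not the authors' pipeline: the dict-based alignment representation and the function name are assumptions, and padding a taxon that lacks a gene with gap characters is one common convention that the text does not confirm.

```python
def concatenate(alignments, taxa, gap="-"):
    """Concatenate per-gene alignments into one fused alignment.

    alignments: list of dicts mapping taxon name -> aligned sequence;
                all sequences within one gene must have equal length.
    taxa:       ordered list of all taxon names to include.
    Taxa missing from a gene are padded with gap characters (assumption).
    """
    fused = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            fused[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(parts) for taxon, parts in fused.items()}
```

For instance, fusing `{"A": "MK", "B": "MR", "C": "MK"}` with `{"A": "LLV", "C": "LIV"}` over taxa `["A", "B", "C"]` gives `"MKLLV"`, `"MR---"` and `"MKLIV"`, so every taxon ends up with the same fused length.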
0	Phylogenetic  analysis  of  a  large  protein  fusion
0	For  our  analysis,  we  retrieved  from  the  public  databanks  and  from  ongoing
0	Congruence and incongruence: Congruence is the agreement between phylogenies obtained using different datasets or different reconstruction methods. Trees are perfectly congruent if they display the same topology; that is, they reflect the same evolutionary history. By contrast, incongruent trees show conflicting robust nodes, which could be due to different evolutionary histories (e.g. lateral gene transfers) or tree reconstruction problems.
0	Γ law: Traditional models of sequence evolution assume that all positions in the sequences are equally likely to undergo a substitution, which reduces the complexity of these models. However, in reality, positions in sequences are more or less `free' to vary; that is, they have different probabilities of undergoing substitutions. This limits the biological realism of traditional models and their efficiency for phylogenetic reconstruction. The variation of substitution rates is commonly approximated using a gamma distribution, also known as a Γ law, which has a shape parameter α that specifies the range of rate variation [a]. Small α values result in an L-shaped distribution with extreme variation of rates (most sites are invariable, but a few have very high substitution rates). As α gets larger, the range of variation diminishes, until α approaches infinity and all sites have the same substitution rate.
0	HKY model: The Hasegawa, Kishino and Yano [b] model of sequence evolution is a merger of the Kimura two-parameter [d] and the Felsenstein [c] models, which allow transitions and transversions to occur at different rates and base frequencies to vary during the course of evolution, respectively.
0	Jack-knife analysis: A statistical method to evaluate the robustness of an inference. It is based on the construction of random sub-samples of the original alignment by taking a fraction of the positions without replacement (in contrast to the bootstrap method, which allows replacement). Usually, trees are reconstructed with the random sub-samples and the robustness of each node is estimated as the number of its occurrences among these trees [e].
0	Log-Det method: A method to evaluate evolutionary distances that are consistent for sequences with different nucleotide or amino acid composition [f]. This approach is required because other methods tend to group sequences on the basis of their composition, irrespective of their evolutionary history.
0	Kishino-Hasegawa test: A test used for the estimation of incompatibility between alternative tree topologies with the same taxonomic sampling but obtained using different datasets [g]. Two tree topologies are significantly different if the difference between their likelihood values (expressed as the lnL, where L is the likelihood) is larger than 1.96 standard errors of the likelihood estimate. For a recent criticism of this test see Ref. [h].
0	Principal component analysis (PCA): This involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Principal components are obtained by projecting the multivariate data vectors on the space spanned by the eigenvectors.
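The distinction the glossary draws between jack-knife and bootstrap resampling (sampling alignment positions without versus with replacement) can be made concrete with a short Python sketch. The function names are hypothetical and the default fraction of 0.5 is an assumption; the glossary says only that "a fraction" of the positions is taken. Tree reconstruction and the counting of node occurrences are outside the scope of the sketch.

```python
import random


def jackknife_columns(n_sites, fraction=0.5, rng=None):
    """Pick a fraction of column indices WITHOUT replacement."""
    rng = rng or random.Random()
    k = int(n_sites * fraction)
    return sorted(rng.sample(range(n_sites), k))


def bootstrap_columns(n_sites, rng=None):
    """Pick n_sites column indices WITH replacement."""
    rng = rng or random.Random()
    return sorted(rng.randrange(n_sites) for _ in range(n_sites))


def subsample(alignment, columns):
    """Build a new alignment (taxon -> sequence) from the chosen columns."""
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}
```

A jack-knife replicate therefore never repeats a column and is shorter than the original alignment, whereas a bootstrap replicate has the original length and usually contains repeated columns.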
0	[Figure: phylogenetic tree panels. Recoverable group labels: Proteobacteria, Spirochetes, Green sulfur, Chlamydiales, Mycoplasmas (low G+C Gram positives), D. radiodurans, low G+C Gram positives, high G+C Gram positives, Thermotogales, Aquificales.]
0	TRENDS  in  Genetics
0	genome projects sequences homologous to all Escherichia coli proteins classified as involved in translation in the Cluster of Orthologous Genes (COG) database [7], as well as the 16S and 23S rRNAs. We aligned 76 proteins from 45 bacterial species, having eliminated any proteins that are present only in a restricted sample of phyla (see http://sorex.snv.jussieu.fr/translation/translation.html). In addition, as a sample of transferred genes, we used the tRNA synthetases (tRS), most of which are known to have undergone numerous LGTs (perhaps related to antibiotic resistance [8,9]). The 76 genes were analysed individually, and 19 of them were excluded from further analyses because they were: (1) difficult to align reliably, (2) present in fewer than 42 of the 45 species, or (3) present in more than one copy in certain phyla (indicating possible ancient duplications and losses, and/or LGTs). The remaining 57 genes, after elimination of ambiguously aligned regions (alignments available on our website), were concatenated for the 45 bacterial species into a large fusion of 8857 amino acids (fusion P1). Most of
0	the  best-known  bacterial  phyla  were  represented,  of  which  we  had  a  broad  taxonomic  sampling  for  Proteobacteria  and  Gram-positive  bacteria.  We  do  not  use  Archa
0	Robustness,  Flexibility,  and  the  Role  of  Lateral  Inhibition  in  the  Neurogenic  Network
0	Summary  Background:  Many  gene  networks  used  by  developing  organisms  have  been  conserved  over  long  periods  of  evolutionary  time.  Why  is  that?  We  showed  previously  that  a  model  of  the  segment  polarity  network  in  Drosophila  is  robust  to  parameter  variation  and  is  likely  to  act  as  a  semiautonomous  patterning  module.  Is  this  true  of  other  networks  as  well?  Results:  We  present  a  model  of  the  core  neurogenic  network  in  Drosophila.  Our  model  exhibits  at  least  three  related  pattern-resolving  behaviors  that  the  real  neurogenic  network  accomplishes  during  embryogenesis  in  Drosophila.  Furthermore,  we  find  that  it  exhibits  these  behaviors  across  a  wide  range  of  parameter  values,  with  most  of  its  parameters  able  to  vary  more  than  an  order  of  magnitude  while  it  still  successfully  forms  our  test  patterns.  With  a  single  set  of  parameters,  different  initial  conditions  (prepatterns)  can  select  between  different  behaviors  in  the  network's  repertoire.  We  introduce  two  new  measures  for  quantifying  network  robustness  that  mimic  recombination  and  allelic  divergence  and  use  these  to  reveal  the  shape  of  the  domain  in  the  parameter  space  in  which  the  model  functions.  We  show  that  lateral  inhibition  yields  robustness  to  changes  in  prepatterns  and  suggest  a  reconciliation  of  two  divergent  sets  of  experimental  results.  Finally,  we  show  that,  for  this  model,  robustness  confers  functional  flexibility.  Conclusions:  The  neurogenic  network  is  robust  to  changes  in  parameter  values,  which  gives  it  the  flexibility  to  make  new  patterns.  Our  model  also  offers  a  possible  resolution  of  a  debate  on  the  role  of  lateral  inhibition  in  cell  fate  specification.  
Introduction
In this paper, we use a computer model to explore the properties of the neurogenic network, originally characterized in Drosophila melanogaster. This is but one example of the many networks of cross-regulatory genes at work in complex organisms. Other familiar examples include the networks of segment polarity genes, of cell cycle genes, of circadian clock genes, and so on. Each of these seems to have remained more or less intact through long periods of evolutionary time and across
0	embryos and imaginal disks. Figure 1 shows our summary of the core genes, their products, and their interactions. In crafting Figure 1, we approached the model-building process as a biochemist approaches in vitro reconstitution; by adding to the system piece by piece, we hope to figure out how each design feature contributes to the function of the essential core network. We rationalize our choice of this diagram in the Supplementary Material available with this article online, with a synopsis as follows (below, "ac" and "Ac" refer to the real achaete gene and its protein product, whereas "ac" and "AC" refer to corresponding nodes in the model): Delta (Dl) is a ligand for the receptor Notch (N). When Dl activates N, a cleaved-off cytoplasmic piece of N binds to the transcription factor Suppressor of Hairless (Su(H)), and that heterodimer activates Enhancer of split (E(spl)) complex genes. The proneural genes achaete (ac) and scute (sc) encode transcription factors that actually specify neural fate. Both Ac and Sc are autoactivating and cross-activating: they promote their own, and each other's, transcription. Thus, the proneural genes constitute a bistable switch at the heart of the neurogenic network. They also activate transcription of E(spl) and Dl. E(spl) in turn represses transcription of ac and sc. Thus, the loop works as follows: something activates ac and/or sc in the neural-competent cluster. They upregulate Dl, whose product activates N in neighboring cells, which, through Su(H), activates E(spl). E(spl) represses ac and sc in those neighboring cells.
To achieve a neural fate, a cell must upregulate ac and sc enough that their autoactivation overwhelms E(spl)-mediated repression due to neighboring cells signaling through N. We constructed three different models of the network in Figure 1, which we call "augmented", "standard", and "reduced". The standard network includes all components and interactions shown in Figure 1, except for cis-negative regulation of N activity by Dl and E(spl) autorepression (Figure 1 without red or blue connections). Experimental evidence for each of the latter interactions exists (see the Supplementary Material), but the literature has not given them much attention. Neither did we initially, but our results below regarding the augmented
0	network (which adds the red connections) indicate that these may indeed be important. Our reduced network eliminates intracellular negative feedback from AC and/or SC to suppress ac and sc transcription (blue connections replacing red and green connections and their E(spl) hub). Such a simplified network could have functioned in a precursor to the Drosophila network since the similar process of anchor cell specification in the worm Caenorhabditis elegans appears to take place without E(spl)-like genes or function (X. Karp and I. Greenwald, personal
0	Involvement  of  Putative  SNF2  Chromatin  Remodeling  Protein  DRD1  in  RNA-Directed  DNA  Methylation
0	Current  Biology  802
0	eling protein CHR35 (At2g16390) [15], which is a member of a previously uncharacterized SNF2-like protein subfamily that is unique to plants. The DRD1 subfamily can be defined by four ProDom [16] domains (Figure 5). These overlap with matches to the functional signatures SNF2_N and HELICc, which together constitute the SWI/SNF ATPase domain essential for chromatin remodeling activity [17]. The drd1-1 mutation consists of a G-to-R change in the putative Mg2+ binding site of SNF2_N. Five additional drd1 alleles (drd1-2, drd1-3, drd1-4, drd1-5, and drd1-6) were identified and sequenced. They all
0	contained  a  mutation  in  strongly  conserved  or  functionally  implicated  regions  of  the  SWI/SNF  ATPase  domain  (Figure  5).  The  DRD1  subfamily  comprises  six  additional  members,  including  a  clear  DRD1  homolog  in  rice  (BAC84084)  (Figure  S2).  CHR34  (At2g21450),  which  still  shares  all  six  ProDom  domains,  is  the  Arabidopsis  protein  most  similar  to  DRD1.  Another  rice  protein  (AAM15781)  is  highly  similar  to  DRD1  and  also  contains  all  six  domains.  The  remaining  three  members  [At1g05480,  T25N20.14  (Q9ZVY9,  similar  to  CHR31),  and  CHR40  (At3g24340)]  have  only  four  of  the  six  ProDom  domains  in  common
0	The  stability  of  proteins  in  extreme  environments  Rainer  Jaenicke*  and  Gerald  Boehm
0	Three  complete  genome  sequences  of  thermophilic  bacteria  provide  a  wealth  of  information  challenging  current  ideas  concerning  phylogeny  and  evolution,  as  well  as  the  determinants  of  protein  stability.  Considering  known  protein  structures  from  extremophiles,  it  becomes  clear  that  no  general  conclusions  can  be  drawn  regarding  adaptive  mechanisms  to  extremes  of  physical  conditions.  Proteins  are  individuals  that  accumulate  increments  of  stabilization;  in  thermophiles  these  come  from  charge  clusters,  networks  of  hydrogen  bonds,  optimization  of  packing  and  hydrophobic  interactions,  each  in  its  own  way.  Recent  examples  indicate  ways  for  the  rational  design  of  ultrastable  proteins.
0	Life on earth exhibits an enormous adaptive capacity. Except for centers of volcanic activity, the surface of our planet is `biosphere'. In quantitative terms, the limits of the biologically relevant physical variables are -40 to +115°C (in the stratosphere and hydrothermal vents, respectively), 120 MPa (for hydrostatic pressures in the deep sea), a water activity (aw) of about 0.6 (in salt lakes) and 1<pH<11 (for acidic or alkaline biotopes). During evolution, organisms achieved viability under extreme conditions either by `escaping' or `compensating' for the stress or by enhancing the stability of their cellular inventory. In the case of temperature and pressure, there is no alternative to mutative adaptation for survival [1]. Here, we shall review the recent progress in research on protein stabilization, focusing on thermophiles with optimum growth temperatures of more than 60°C (for hyperthermophiles, more than 80°C) and halophiles with optimum water activities around 0.6. Studies on proteins from acidophiles and alkalophiles have been scarce. Strict barophiles have recently been isolated; thousands of microbes were isolated from the first samples collected from the Challenger Deep at 110 MPa [2], but very few of them were truly barophilic [3·]. Their proteins are still terra incognita.
0	Limits of stability and growth
0	Proteins, independent of their mesophilic or extremophilic origin, consist exclusively of the 20 canonical natural amino acids. In the multicomponent system of the cytosol, these are known to undergo covalent modifications at extremes of temperature, pH and pressure (deamidation, elimination, disulfide interchange, oxidation, Maillard reactions, hydrolysis, etc. [4]). Extremophiles must compensate for amino acid degradation either by using compatible protectants or by enhanced synthesis and repair. Little is known about the chemistry involved, for example, in the hydrothermal decomposition of proteins, and even less is known about protection and repair. At temperatures beyond 100°C, the thermal stabilities of the common amino acids decrease in the order (Val,Leu)>Ile>Tyr>Lys>His>Met>Thr>Ser>Trp>(Asp,Glu,Arg,Cys). In many cases, the half-lives of the degradation reactions are significantly shorter than the generation time of hyperthermophilic microorganisms [5]; up to this limit, biomolecules could still be resynthesized at biologically feasible rates. The temperature at which ATP hydrolysis becomes the limiting factor for viability lies between 110 and 140°C [6]. This temperature limit coincides with the temperature range at which the hydrophobic hydration of proteins vanishes and water becomes an `ordinary solvent' [1]. Apparently, both the integrity of the natural amino acids and the formation of the hydrophobic core upon protein folding are essential for viability. Extrinsic factors and compatible solutes may enhance the stability and shift the limits of growth of prokaryotes as well as eukaryotes [7].
0	Fundamentals  of  protein  stability
0	Proteins  exhibit  marginal  stabilities  that  are  equivalent  to  only  a  small  number  of  weak  intermolecular  interactions  [1,8].  In  this  respect,  proteins  from  extremophiles  do  not  differ  strongly  from  their  mesophilic  counterparts.  Their  adaptation,  either  intrinsic  or  through  interaction  with  extrinsic  factors,  is  accompanied  by  only  marginal  increases  in  the  free  energy  of  stabilization.  No  general  strategy  of  stabilization  has  yet  been  established.  In  recent  years,  however,  well-defined  increments  of  stability  have  been  elucidated  by  analyzing  ultrastable  proteins  and  verifying  their  specific  anomalies  by  rational  design.  As  indicated  by  these  studies,  stabilization  may  involve  all  levels  of  the  hierarchy  of  protein  structure:  local  packing  of  the  polypeptide  chain,  secondary  and  supersecondary  structural  elements,  domains  and  subunits  [4].  Taking  thermal  stability  as  an  example,  several  experimental  approaches  have  been  used  to  assign  specific  structural  alterations  to  changes  in  stability:  selection  of  temperature-sensitive
0	mutants; systematic variations of amino acid residues in the core or in the periphery of model proteins; fragmentation of domain proteins or modifications of connecting peptides between domains; and alteration of subunit interactions by mutagenesis or solvent perturbation [1,9]. Stability refers to the maintenance of a defined functional state under extreme conditions. High-resolution structures in the crystalline state and in solution have shown that the atomic coordinates of proteins can be determined down to a resolution better than 1 Å. Even this precision, however, does not allow the calculation of the free energy of stabilization from coordinates, nor does it consider the dynamics as an essential prerequisite of protein function. The polypeptide chain may fluctuate between preferred conformations with amplitudes and angles up to 50 Å and 20°, respectively [10]. Considering extremophiles in comparison with their mesophilic counterparts, evolutionary adaptation is nothing more than the conservation of functionally important motions in such a way that, under altered physical conditions, the protein inventories of extremophiles and mesophiles are in `corresponding states' [1]. In this context, the stability of an individual protein refers to the native state, as well as the intermediates on its pathway from the nascent or unfolded ensemble of states (U) to the functional entity (N). Evidently, in order `to be extremophilic', a protein has to cope with the extreme conditions at all stages along its folding pathway.
0	Hypothetical temperature profile of the free energy of (a) mesophilic and (b-d) thermophilic proteins. ΔG is defined as the difference in the free energies between the native and denatured proteins. Tm and Tm′ are the melting temperatures of the mesophilic and thermophilic variants, respectively. The minimum of the ΔG parabola for a given protein (i.e. maximum stability) is observed at a temperature that is much below the optimal growth temperature (Topt and Topt′) of the respective mesophilic or thermophilic organism.
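The shape of such a stability profile can be illustrated with the standard Gibbs-Helmholtz expression for a two-state protein with a temperature-independent heat capacity change. The text does not name this formula, so its use here is an assumption, and all parameter values (`T_MELT`, `DH_MELT`, `DCP`) are hypothetical. The sketch uses the unfolding convention (ΔG of N→U, positive when the native state is stable), so the caption's `minimum' of the native-minus-denatured ΔG corresponds to the maximum computed below.

```python
import math


def delta_g_unfolding(t_kelvin, t_melt, dh_melt, dcp):
    """Free energy of unfolding (kJ/mol) at t_kelvin.

    Gibbs-Helmholtz form with a constant heat capacity change dcp:
    dG(T) = dH_m * (1 - T/T_m) - dCp * ((T_m - T) + T * ln(T/T_m))
    """
    return (dh_melt * (1.0 - t_kelvin / t_melt)
            - dcp * ((t_melt - t_kelvin) + t_kelvin * math.log(t_kelvin / t_melt)))


def temperature_of_max_stability(t_melt, dh_melt, dcp):
    """Temperature where d(dG)/dT = 0, i.e. where the unfolding entropy vanishes."""
    return t_melt * math.exp(-dh_melt / (t_melt * dcp))


# Hypothetical parameters for an illustrative protein (not taken from the source).
T_MELT = 333.15   # melting temperature, K (60 degrees C)
DH_MELT = 400.0   # unfolding enthalpy at T_MELT, kJ/mol
DCP = 8.0         # heat capacity change on unfolding, kJ/(mol K)

t_star = temperature_of_max_stability(T_MELT, DH_MELT, DCP)

if __name__ == "__main__":
    for t in (230.0, t_star, 310.0, T_MELT):
        dg = delta_g_unfolding(t, T_MELT, DH_MELT, DCP)
        print(f"T = {t:7.2f} K   dG_unfold = {dg:7.2f} kJ/mol")
```

With these assumed values the curve crosses zero twice, once at the melting temperature and once at a lower temperature, which is the heat and cold denaturation behaviour of a parabola-like profile. Raising `DH_MELT` lifts the whole curve, and lowering `DCP` flattens it.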
0	Stability  and  folding
0	heat and cold denaturation (Figure 1). Commonly, the latter becomes detectable only under moderately destabilizing conditions [14,15·]. In the case of proteins from thermophiles, the ΔG versus temperature profile is either flattened or increased to larger ΔG(N→U) levels, rather than being shifted to higher temperatures. The ΔG maximum is always far below the optimal growth temperature; this also holds true for the (hyper-)thermophilic proteins [12·,16]. In order to simulate the effect of temperature on folding, the in vitro denaturation/renaturation of hyperthermophilic glyceraldehyde 3-phosphate dehydrogenase (GAPDH) was studied at 0-100°C. Refolding over a wide temperature range was found to yield the native state, even beyond the physiological temperature range, indicati
0	Everything  in  moderation:  Archaea  as  `non-extremophiles'  Edward  F  DeLong
0	Well  characterized  and  cultivated  archaea  are  prokaryotic  specialists  that  thrive  in  habitats  of  elevated  temperature,  low  pH,  high  salinity,  or  strict  anoxia.  Recently,  however,  new  groups  of  abundant,  uncultivated  archaea  have  been  found  to  be  widespread  in  more  pedestrian  biotopes,  including  marine  plankton,  terrestrial  soils,  lakes,  marine  and  freshwater  sediments,  and  in  association  with  metazoa.  Research  efforts  are  presently  focused  on  characterizing  the  physiology,  biochemistry  and  genetics  of  these  abundant  and  cosmopolitan  but  poorly  understood  archaea.
0	Recent geological, microbiological, and ecological investigations have uncovered new extremes for the physicochemical limits of the biosphere. It is now well established that microbial life not only survives but actually flourishes at extremely high temperatures, low pH, high salinity and low water availability. The precise physicochemical limits for life and the absolute boundaries of the biosphere are not well defined; more certain is the remarkable pervasiveness of Earth's biota, in settings that chemically and physically challenge the very fabric of life. The record holders for growth at low pH (pH 0; [1]), high temperature (113°C; [2··]) or high-salt concentration (5 M NaCl) all belong to a distinctive prokaryotic lineage: Archaea. Archaea were first recognized as a coherent, monophyletic lineage by Woese and collaborators [3,4]. Initially, ribosomal RNA oligonucleotide catalogues, followed later by direct rRNA sequence comparisons, indicated that archaeal members belonged to a unique prokaryotic kingdom. This phylogenetic coherence was at first surprising, considering that archaeal phenotypes were then represented exclusively by extreme thermophiles, halophiles, and obligately anaerobic methanogens [4]. The distinctiveness of Archaea led to a recategorization of life into three major domains, comprising Eucarya (all eukaryotes) and the two prokaryotic domains, Archaea and Bacteria [5,6]. Recent whole-genome analyses support this tripartite organization to some extent but not without some ambiguity. Archaeal and bacterial metabolic genes frequently share a common evolutionary history, to the exclusion of eucaryal homologues [7·,8,9··], but proteins involved in transcription and translation specifically relate Archaea with Eucarya, to the exclusion of Bacteria [7·,8,9··]. The chimeric nature of archaeal genomes, with Bacteria-like metabolic genes on the one hand and Eucarya-like transcriptional and translational proteins on the other, confounds simple biological categorization. Woese postulates that this situation may be the result of early evolutionary processes that occurred when horizontal genetic exchange exceeded vertical inheritance, in a period of `pre-genealogical' evolution [9··]. The known phenotypic motifs of cultivated archaea are still largely represented by extreme halophiles, sulfur-metabolizing thermophiles, and methanogens. Judging solely from cultivated strains, archaeal phenotypic diversity appears limited, in comparison to the wide variety of phenotypes in the Bacteria [5]. Justifiably, it had been presumed that archaea were ecologically significant in only a few highly specialized (and predominantly anaerobic) habitats. This picture has altered considerably as molecular biological methods have been applied to the study of naturally occurring microorganisms [10··].
0	Archaea  in  the  mainstream:  widespread  occurrence  of  novel  archaeal  types
0	The presence of new uncultivated types of archaea was first suggested during molecular phylogenetic surveys of marine planktonic microorganisms. A preliminary survey of PCR-amplified small-subunit rRNA genes revealed archaeal-like rRNA sequences in seawater samples from 100 m and 500 m depths in the Pacific Ocean [11]. These oceanic archaeal rRNAs were most closely related to those of Crenarchaeota, a branch of archaea previously thought to consist solely of hyperthermophiles. At the same time, microorganisms collected in surface waters off the North American coast showed the presence of two new archaeal groups: one crenarchaeotal, one euryarchaeotal [12]. Initially, it had to be considered that the planktonic archaea might be allochthonous thermophiles, transported far from a putative hydrothermal vent habitat, but the widespread distribution and relatively high abundance of the planktonic archaea render a hydrothermal vent habitat for them unlikely. The discovery of high numbers of archaea in aerobic, Antarctic waters of -1.8°C [13], and the association of one crenarchaeal species with a marine sponge living at 10°C [14], provided further evidence that the new archaea were native to cold seawater biotopes.
0	Diversity  and  distribution  of  the  `nonextreme'  Archaea:  inferences  from  rRNA  gene  sequences
0	Genomes  and  evolution
0	Figure 1. Phylogenetic positions of the new uncultivated archaeal groups, based on small subunit rRNA sequences. Uncultivated, nonthermophilic archaeal groups are indicated in gray. Predominantly thermophilic lineages are indicated in black. pJP and pSL nodes represent rRNA sequences obtained from a hot spring in Yellowstone National Park [1], from presumptive hyperthermophilic archaea. Tree labels, Euryarchaeota: extreme halophiles, Methanomicrobiales, Picrophilus oshimae, Ferromonas metallovorans, Thermoplasma acidophilum, `Group 2' (marine plankton, anaerobic digestor), `Group 3' (marine sediments, marine plankton), Methanobacteriales, Methanothermus spp., Archaeoglobus spp., Methanococcales, Thermococcales, Methanopyrus kandleri. Tree labels, Crenarchaeota: Group 1 with subclusters 1.1a, 1.1b and 1.2 (habitat labels: marine plankton; soil, lake sediments, marine snow; forest soil; marine and lake sediments; lake sediments, paleosoils, anaerobic digestor), hot spring clones pSL12, pSL4, pSL78, pSL22, pJP89, pSL123, pSL17 and pJP41, pLaw4, Law1, and cultured Crenarchaeota.
0	Current  Opinion  in  Genetics  &  Development
0	1. Group I crenarchaeota appear to be the most widely distributed, abundant, and ecologically diverse of all known Archaea (Table 1 and references cited therein). 2. There is a tight phylogenetic coherence among all marine planktonic Group I rRNA sequences, whereas those from sediment and soil group into several different phylogenetic subclusters (Figure 1). 3. Different Group I subclusters appear related to specific archaeal rRNA genes recovered from terrestrial hot springs (Figure 1; [10··,26]). Current data suggest that several lineages of hyperthermophilic crenarchaeotes adapted to colder habitats independently [10··,26,36]. 4. In marine plankton, Group I and Group II archaea appear to reach maximal abundances at different depths in the water column. The planktonic Group I crenarchaeotes generally reach maximal abundance below 100 m depth [16·,18·].
0	Table 1. Summary of small subunit rRNA surveys of uncultivated archaeal diversity. Columns: Habitat type; Reports of archaeal group within habitat (Group I, Group II, Group III); References. Habitat types: Marine plankton; Marine fish; Marine sediment; Marine invertebrates; Lake plankton; Lake sediments; Deep paleosol; Forest soil; Agricultural soil; Anaerobic digestor.
0	In  the  absence  of  readily  available  pure  cultures,  what  progress  can  be  made  in  the  biological  characterization  of  uncultivated  prokaryotes?  The  development  of  general  approaches  for  characterizing  uncultivated  microbial  species  presents  a  major  challenge  for  contemporary  microbiologists.  Uncultivated  archaea  represent  a  good  test  bed  for  new  approaches  as  their  phylogenetic  diversity  is  reasonably  limited  and  unique  archaeal  biochemical  signatures  can  be  detected  in  mixed  populations.  Several  new  approaches  now  demonstrate  that  progress  can  be  made,  even  in  the  absence  of  axenic  cultures.
0	Genome  architecture
0	Advances  in  genomic  analysis  are  providing  new  technologies  that  may  be  useful  for  characterizing  uncultivated  prokaryotes.  The  main  requirement  is  the  availability  of  pure,  intact,  high  molecular  weight  genomic  DNA.  Large  DNA  fragments  can  be  recovered  from  mixed  microb
0	Experimental Design for Gene Expression Microarrays
0	Abstract  We  examine  experimental  design  issues  arising  with  gene  expression  microarray  technology.  Microarray  experiments  have  multiple  sources  of  variation,  and  experimental  plans  should  ensure  that  effects  of  interest  are  not  confounded  with  ancillary  effects.  A  commonly-used  design  is  shown  to  violate  this  principle  and  to  be  generally  inefficient.  We  explore  the  connection  between  microarray  designs  and  classical  block  design  and  use  a  family  of  ANOVA  models  as  a  guide  to  choosing  a  design.  We  combine  principles  of  good  design  and  A-optimality  to  give  a  general  set  of  recommendations  for  design  with  microarrays.  These  recommendations  are  illustrated  in  detail  for  one  kind  of  experimental  objective,  where  we  also  give  the  results  of  a  computer  search  for  good  designs.  Keywords:  Incomplete  block  design,  confounding,  robust  design,  A-optimality,  connected  design,  even  graph
0	Geneticists  are  very  interested  in  comparing  the  relative  quantities  of  mRNA  sequences  in  cell  populations.  Spotted  cDNA  microarrays  (Brown  and  Botstein  1999)  are  emerging  as  a  powerful  and  cost-effective  tool  for  quantifying  gene  transcription  for  thousands  of  genes  at  a  time.  In  the  first  step  of  the  technique,  samples  of  DNA  clones  with  known  sequence  content  are  spotted  and  immobilized  onto  a  glass  slide  or  other  substrate,  the  "microarray."  Next,  pools  of  purified  mRNA  from  cell  populations  under  study  are  reverse-transcribed  into  cDNA  and  labeled  with  one  of  two  fluorescent  dyes,  "red"  and  "green."  Two  pools  of  differentially  labeled  cDNA  are  combined  and  applied  to  a  microarray.  Strands  of  cDNA  in  the  pool  hybridize  to  complementary  sequences  on  the  array  and  any  unhybridized  cDNA  is  washed  off.  Although  hybridization  efficiency  can  vary  from  clone  to  clone,  the  efficiency  for  any  particular  clone  should  not  be  affected  by  the  type  of  the  dye  label.  The  "red"  and  "green"  signals  from  a  spot  indicate  the  relative  abundance  of  the  corresponding  mRNA  in  the  two  cell  populations.  Some  of  the  first  experiments  with  microarrays  were  time-series  studies.  DeRisi  et  al.  (1997)  studied  gene  expression  patterns  in  yeast  during  metabolic  shift  from  fermentation  to  respiration.  Chu  et  al.  (1998)  conducted  a  similar  study  of  yeast  during  sporulation.  The  approach  of  this  research  was  to  cluster  genes  according  to  their  patterns  of  expression  over  timepoints  of  a  biological  process.  The  general  idea  is  that  when  a  gene  of  unknown  function  ends  up  in  a  cluster  of  genes  with  known  function,  one  has  a  valuable  clue  as  to  the  function  of  the  unknown  gene.  
Clustering  ideas  have  similarly  been  used  to  classify  tissue  samples  according  to  their  global  patterns  of  gene  expression.  For  example,  Perou  et  al.  (1999)  used  gene  expression  patterns  to  classify  human  breast  cancers.  Ross  et  al.  (2000)  studied  gene  expression  variation  in  60  cancer  cell  lines  and  found  associations  between  gene  expression  patterns  as  well  as  other  properties  such  as  growth  rate.  Alizadeh  et  al.  (2000)  used  this  approach  to  identify  clinically  relevant  subtypes  of  B-cell  lymphoma.  These  experiments  are  just  the  beginning  of  the  projected  use  of  microarray  technology.  For  example,  Alon  et  al.  (1999)  used  a  related  technology  to  make  paired  comparisons  of  cancerous  tissue  samples  versus  normal  surrounding  tissues.  Microarray  experiments  will  soon  become  multi-factorial  in  nature.  For  example,  a  researcher  may  want  to  study  tissue  samples  from  male  and  female  mice  from  different  strain  backgrounds  raised  on  different  diets.  It  is  easy  to  imagine  a  rich  variety  of  experimental  scenarios  and  substantial  effort  will  be  required  to  develop  tools  for  higher-order  analyses  of  microarray  data.  Following  the  precedent  of  the  leading  experiments,  there  have  been  many  new  ideas  proposed  about  the  best  way  to  cluster  genes  (Ben-Dor  et  al.  1999,  Eisen  et  al.  1998,  Heyer  et  al.  1999,  Lazzeroni  and  Owen  2000,  Tamayo  et  al.  1999  to  name  a  few).  Yet  we  believe  that  some  fundamental  questions  still  lack  satisfactory  answers.  The  sources  of  variation  in  microarray  data  are  yet  to  be  completely  understood.  To  the  extent  that  sources  of  variation  are  known,  however,  they  should  be  considered  in  the  design  and  analysis  of  microarray  experiments.  
The  structure  of  microarray  data,  the  types  of  analyses  that  are  possible,  and  the  quality  of  the  results  are  determined  by  the  experimental  design.  We  believe  there  has  been  a  lack  of  healthy  skepticism  about  the  "right"  way  to  design  a  microarray  experiment,  and  that  this  is  an  area  that
0	deserves  careful  consideration  and  study.  Different  cDNAs  are  known  to  incorporate  dye  with  differential  efficiency  and  hybridize  with  their  target  spots  on  arrays  at  different  rates.  Further,  with  spotted  arrays  it  is  not  known  how  much  DNA  is  immobilized  on  the  array  in  any  particular  spot.  Therefore,  as  scientists  have  recognized,  a  single  fluorescent  intensity  measurement  from  a  spot  contains  little  useful  information  because  of  the  unknown  characteristics  of  the  spot  and  the  unknown  interpretation  of  a  unit  of  fluorescence  for  any  particular  gene.  This  realization  undoubtedly  motivated  the  two-dye  system  and  the  practice  of  calculating  the  ratio  of  the  pair  of  readings  from  a  spot.  There  is  meaningful  information  in  the  relative  red  and  green  intensities  from  a  spot.  Now  consider  an  experiment  from  the  archives  of  statistics.  If  an  agriculturalist  wants  to  measure  the  yields  of  strains  of  corn,  s/he  would  realize  that  different  plots  of  land  vary  in  soil  fertility,  amount  of  rainfall  and  sunlight,  etc.,  so  the  only  meaningful  direct  yield  comparisons  are  for  strains  grown  on  the  same  plot  of  land.  These  twentieth  century  agricultural  experiments  share  an  important  characteristic  with  twenty-first  century  microarray  experiments:  the  meaningful  interpretation  of  the  data  is  in  terms  of  relative  comparisons.  We  believe  there  are  valuable  lessons  to  be  learned  from  the  several  generations  of  scientists  and  statisticians  who  studied  experimental  designs  for  agriculture.  In  this  work,  we  explore  some  of  the  connections  between  classical  experimental  design  and  microarray  technology.  
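The argument for relative comparisons can be made concrete with a small sketch. It assumes a purely multiplicative signal model (signal = amount of DNA on the spot × expression level), which is an illustrative simplification and not a model stated in the text; the function names and the numbers are hypothetical.

```python
import math


def simulate_spot(spot_amount, expr_red, expr_green):
    """Return (red, green) signals under an assumed multiplicative model.

    spot_amount stands for the unknown quantity of DNA immobilized on the spot.
    """
    return spot_amount * expr_red, spot_amount * expr_green


def log_ratio(red, green):
    """Base-2 log of the red/green ratio from a single spot."""
    return math.log2(red / green)


# Same gene, same two samples, but very different (unknown) spot amounts.
spot_a = simulate_spot(0.5, 200.0, 100.0)
spot_b = simulate_spot(3.0, 200.0, 100.0)

if __name__ == "__main__":
    print("raw red signals:", spot_a[0], spot_b[0])   # differ six-fold
    print("log2 ratios:    ", log_ratio(*spot_a), log_ratio(*spot_b))
```

Under this assumption the raw red intensities differ six-fold between the two spots, yet the within-spot log ratio is identical, because the unknown spot amount cancels. The spot plays the role of the plot of land in the agricultural analogy.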
The cell populations under study are the factor of interest in a microarray experiment, but they are not the only sources of variation. The design of microarray experiments, that is, how the samples are paired onto arrays, should take this into account. Section 2 identifies the experimental design factors involved with this technology. To illustrate basic design ideas, a commonly used setup for microarray experiments is studied in Section 3 and an alternative is proposed. We introduce a family of ANOVA models, explore more examples, and give general design recommendations in Section 4. In Section 5 we consider general A-optimality and generalize classical results from experimental design to microarrays. In Section 6 we discuss a search for good designs for small (≤ 10) numbers of samples when one wants efficiency with respect to general A-optimality but also requires certain model-robustness properties. Section 7 concludes with a discussion of open questions for microarray experimental design.
0	Sources  of  Variation  in  Microarray  Experiments
0	The  simplest  microarray  experiment  looks  for  changes  in  gene  expression  across  a  single  factor  of  interest.  This  factor  might  be  the  timepoints  of  a  biological  process,  or  different  types  of  tissue,  or  drug  treatments.  We  generically  call  the  categories  of  a  factor  of  interest  varieties.  Fluorescent  intensities  clearly  also  depend  on  the  cDNA  sequence  spotted  on  the  arrays.  We  call  the  spotted  sequences  "genes"  whether  they  are  actually  genes,  ESTs,  or  DNA  from  another  source.  Further,  microarray  technology  makes  use  of  two  different  dyes  and  an  entire  experiment  uses  multiple  arrays.  Therefore  we  identify  four  basic  experimental  factors:  varieties,  genes,  dyes,  and  arrays.
0	With these four factors there are 2^4 = 16 possible experimental effects. Explicitly, there is the mean or baseline effect, four factor main effects for arrays (A), dyes (D), varieties (V), and genes (G), six two-factor interactions, four three-factor interactions, and one four-factor interaction. The first step in choosing a good design is to identify which effects might possibly contribute to variation in the data. Array main effects measure overall variation in fluorescent signal from array to array. These effects arise if, for example, arrays are probed under inconsistent conditions that increase or reduce hybridization efficiencies of labeled cDNA. Dye main effects measure differences in the two dye fluorescent labels. For example, one dye may be consistently "brighter" than the other. Gene main effects occur when certain genes emit a higher or lower fluorescent signal overall, compared to other genes. These effects arise because some genes have generally higher or lower levels of expression than others, and also because of differential hybridization efficiency and differential labeling efficiency for different sequences. Variety main effects occur when the varieties of the factor of interest have higher or lower overall expression levels for the genes spotted on the arrays. It is reasonable to suspect that all four of these main effects will contribute to variation in microarray data. For a particular tissue sample, red- and green-labeled cDNA is produced in separate runs of the reverse-transcription process. Differences in the runs can produce pools of cDNA of varying concentrations or quality. This results in experimental dye×variety (DV) interactions.
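The count of 16 effects can be checked by listing every subset of the four factors. The sketch below is illustrative only; the function name and the labelling of the empty subset as "mean" are choices made here, while the letters A, D, V and G follow the text.

```python
from collections import Counter
from itertools import combinations

FACTORS = ("A", "D", "V", "G")  # arrays, dyes, varieties, genes


def enumerate_effects(factors):
    """List every subset of the factors, labelling the empty subset 'mean'."""
    effects = []
    for order in range(len(factors) + 1):
        for combo in combinations(factors, order):
            effects.append("".join(combo) if combo else "mean")
    return effects


effects = enumerate_effects(FACTORS)
# Number of effects of each order: 0 = mean, 1 = main effects, 2+ = interactions.
order_counts = Counter(0 if e == "mean" else len(e) for e in effects)

if __name__ == "__main__":
    print(len(effects), "effects:", ", ".join(effects))
    print(dict(order_counts))
```

The tally by order reproduces the breakdown in the text: one mean, four main effects, six two-factor, four three-factor and one four-factor interaction. DV is among the two-factor terms.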
Array×gene interactions (AG) occur because spots for a given gene on the di
0	TRENDS  in  Biotechnology  Vol.xx  No.xx  Monthxxxx
0	Robotic  spotting  of  cDNA  and  oligonucleotide  microarrays
1	Richard  P.  Auburn1,  David  P.  Kreil1,2,  Lisa  A.  Meadows1,  Bettina  Fischer1,  Santiago  Sevillano  Matilla1  and  Steven  Russell1
0	DNA  microarrays  are  a  uniquely  efficient  method  for  simultaneously  assessing  the  expression  levels  of  thousands  of  genes.  Owing  to  their  flexibility  and  value,  mechanically  spotted  microarrays  remain  the  most  popular  platform.  Here,  we  review  recent  technological  advances  with  a  focus  on  spotted  arrays.  Robotic  spotting  still  poses  numerous  technical  challenges.  To  reduce  artefacts,  many  laboratories  have  recently  investigated  ways  of  improving  the  spotting  process.  We  compare  alternative  options  and  discuss  implications  for  next-generation  systems.  Together  with  modern  approaches  to  data  analysis,  such  developments  bring  greatly  improved  reliability  to  individual  microarray  experiments.  Advancing  towards  the  ultimate  goal  of  delivering  calibrated,  truly  quantitative  gene-expression  measurements  on  a  genomic  scale,  microarray  technology  remains  at  the  forefront  of  post-genomic  systems  biology.
0	Introduction  Genome  sequencing  confronts  us  with  the  unequivocal  fact  that  we  know  very  little  about  the  function  of  many  genes,  even  in  the  most  intensely  studied  genomic  regions  of  key  model  organisms  [1].  Generating  loss-of-function  or  gain-of-function  mutations  followed  by  detailed  phenotypic  analysis  has  traditionally  been  used  in  genetically  tractable  model  organisms.  However,  this  is  not  always  possible,  or  indeed  informative,  because  many  gene  mutations  have  no  obvious  phenotype.  In  such  instances,  gene  expression  patterns  can  be  used  to  suggest  more  appropriate  genetic,  molecular  or  biochemical  assays.  Gene  expression  patterns  can  also  be  used  to  search  for  co-regulated  groups  of  genes,  having  functional  implications  and  giving  valuable  insights  into  interactions  between  genes.  Microarrays  exploit  the  specificity  of  nucleic  acid  basepairing  during  hybridization  to  simultaneously  assess  the  expression  of  tens  of  thousands  of  genes  [2,3].  Without  microarrays,  gene  expression  analysis  would  be  limited  to  studies  of  one  or  a  few  genes  using,  for  example,  Northern  blots  and  real-time  quantitative  PCR,  or  to  time-consuming  and  costly  approaches  like  Serial  Analysis  of  Gene  Expression  (SAGE)  [4]  and  Massively  Parallel  Signature
0	Sequencing (MPSS) [5]. Microarray experiments are unique in offering cost-effective and efficient analysis of gene expression at the genomic level (Box 1). Although many of the protocols for microarray experiments are not new, some are highly technical and are widely considered to be challenging, most notably the production of the arrays. Because many manufacturing considerations apply similarly to all microarray applications (Box 1), this review focuses on the most popular one: microarrays for gene expression analysis. Microarrays can be manufactured by robotic spotting of gene-specific cDNAs or long oligonucleotides, or by in situ synthesis of short or long oligonucleotides. Barrett and Kawasaki have reviewed these established manufacturing processes [12]. More recent approaches include voltage-dependent nanopipettes [13], piezoelectric inkjets for non-contact printing [14] and maskless light-directed synthesis of oligonucleotides [15]. However, robotic spotting is still the most popular method because of its wide availability, high flexibility and low cost (Box 2). Although spotted microarrays can provide accurate measurements at the genomic level [16,17], as with other microarray platforms, their sensitivity is limited by high levels of
0	Spotted  microarrays  were  developed  for  identifying  differences  in  gene  expression  between  samples  based  on  the  relative  amounts  of  sample  bound  to  a  particular  spotted  probe  DNA  on  the  microarray  [2,3].  The  full  utility  of  the  technique  is  clearly  reflected  in  the  wide  variety  of  its  applications  [6],  ranging  from  gene  expression  analyses  to  studies  of  genomic  DNA.  Comparative  genomic  hybridization  (CGH),  for  example,  is  used  to  identify  allelic  differences  between  individuals  [7].  Chromatin  immunopurification  (ChIP)  microarrays,  or  `ChIP-chips',  locate  the  binding-sites  of  DNA-binding  proteins  [8].  The  samples  to  be  compared  are  each  labelled  with  a  different  fluorescent  dye  and  then  subjected  to  competitive  hybridization.  The  aim  of  any  spotted  microarray  experiment  is  to  generate  spot  fluorescence  measurements  that  reflect  how  much  sample  is  bound  to  a  spotted  probe  DNA.  These  measurements  are  derived  from  images  taken  by  laser  scanners  or  charge-coupled  device  (CCD)  cameras  [9].  Software  tools  locate  and  then  quantify  the  fluorescence  intensity  or  `spot  signal'  from  each  spotted  probe  [10,11].  Downstream  data  processing  varies  according  to  application.  In  gene  expression  analysis,  for  example,  typical  approaches  include  gene  selection  by  ranking  expression  ratios,  clustering  or  probabilistic  analysis,  with  the  aim  of  identifying  statistically  significant  differential  gene  expression  or  groups  of  co-regulated  genes.  This  permits  inferences  to  be  made  about  the  regulation  of  the  processes  being  investigated.
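The ranking of expression ratios mentioned above can be illustrated with a minimal sketch. It assumes two background-corrected spot signals per gene, one per dye channel. The function name `rank_by_log_ratio`, the `floor` parameter and the example values are hypothetical and are not taken from the review.

```python
import math

def rank_by_log_ratio(spots, floor=1.0):
    """Return (gene, log2 ratio) pairs sorted by decreasing absolute log2 ratio.

    spots: dict mapping gene name -> (signal_sample_a, signal_sample_b).
    floor: hypothetical lower bound applied to both channels so that
           near-zero signals do not produce extreme or undefined ratios.
    """
    ratios = []
    for gene, (a, b) in spots.items():
        ratio = math.log2(max(a, floor) / max(b, floor))
        ratios.append((gene, ratio))
    # Genes with the largest fold change in either direction come first.
    return sorted(ratios, key=lambda pair: abs(pair[1]), reverse=True)

# Made-up spot signals in arbitrary fluorescence units.
example = {
    "geneA": (800.0, 200.0),
    "geneB": (150.0, 150.0),
    "geneC": (50.0, 400.0),
}
ranked = rank_by_log_ratio(example)
```

Ranking by ratio alone says nothing about statistical significance; the clustering and probabilistic analyses mentioned in the text address that separately.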
0	TRENDS  in  Biotechnology  Vol.xx  No.xx  Monthxxxx
0	A recently introduced alternative uses probes obtained by PCR from shotgun genomic DNA libraries [36]. However, PCR amplification can suffer from frequent failures and variable DNA yields, and it carries the danger of cross-contamination [22]. Simply performing PCR on this scale and transferring the PCR products from 96- to 384-well plates for spotting is error-prone [37]. Indeed, resequencing of probe DNA has found that only 66-79% of probes were correctly annotated and free of contaminating sequences. Therefore, gel electrophoresis and resequencing of PCR amplicons before printing are highly recommended [38,39]. Because of their length, PCR-amplicon probes are highly sensitive and have an inherent tolerance to small sequence variations. They are thus the method of choice when interrogating samples from one species using probes of another [40]. This feature, however, reduces their ability to discriminate similar sequences within an organism. Many microarray users have started to spot single-stranded long oligonucleotide 
0	REVIEWS  Glutamine  repeats  and  neurodegenerative  diseases:  molecular  aspects
1	Max  F.  Perutz
0	Eight  severe  inherited  neurodegenerative  diseases  are  caused  by  expansion  of  glutamine  repeats  in  the  affected  proteins.  In  every  case,  proteins  with  repeats  of  fewer  than  38  glutamine  residues  are  harmless,  but  those  with  repeats  of  more  than  41  glutamine  residues  form  toxic  neuronal  nuclear  aggregates  in  the  affected  neurons.  Similarly,  proteins  that  have  repeats  of  fewer  than  37  glutamine  residues  are  soluble  in  vitro,  whereas  proteins  with  repeats  of  more  than  40  glutamine  residues  precipitate  as  insoluble  fibres,  apparently  because  of  a  structural  transition  associated  with  the  increased  length.
0	TIBS  24  -  FEBRUARY  1999
0	Some  properties  of  glutamine  repeats
0	In 1991, Fischbeck and his collaborators1 discovered that Kennedy disease, a sex-linked, late-onset, neurodegenerative disorder, is due to expansion of a CAG repeat in the gene that encodes the androgen receptor. Since then, seven dominant, autosomal, late-onset, neurodegenerative diseases have been found to be due to CAG expansion, which causes expansion of glutamine repeats in the affected proteins. Huntington's disease (HD), which has an incidence of 4 in 10^5 among European populations, is the most common of these diseases. It had long been known that these diseases are accompanied by progressive death of neurons but, until 1997, there was no indication of what killed the neurons. That year saw a dramatic turnaround. First in transgenic mice expressing an N-terminal fragment of the HD protein, then in spinocerebellar ataxia type 3 and in HD patients and, finally, in four of the other diseases, investigators discovered insoluble, granular and fibrous deposits in the cell nuclei of the affected neurons. In HD, these deposits took up immunostains specific for an N-terminal fragment of the affected protein, which includes the expanded glutamine repeat, and also stains specific for ubiquitin. In transgenic mice, formation of these deposits preceded the appearance of symptoms, which clearly linked cause and effect.
0	M.  F.  Perutz  is  at  the  MRC  Laboratory  of  Molecular  Biology,  Hills  Rd,  Cambridge,  UK  CB2  2QH.
0	The gene for HD consists of 67 exons, which are spread over 180 kb of DNA; it contains an open reading frame for a polypeptide of 3140 residues - one of the longest known2. The CAG repeat is part of the first exon and is followed by CCG, CCA and CCT repeats that encode prolines. Apart from these repeats, the amino acid sequence shows no homology to any known protein. The HD protein, now christened huntingtin, is essential for embryonic neurogenesis, but its function there is unknown3. With up to 37 CAG repeats in healthy individuals, human huntingtin has a glutamine repeat longer than that of the homologous protein of any other species analysed so far; however, the repeats might have no function, because mice develop normally with only seven and puffer fish develop normally with only four4. No case of HD has been reported in individuals who have fewer than 35 CAGs or 37 glutamine residues, nor has anyone with 41 glutamine residues been found to be free from the disease. With one exception (spinocerebellar ataxia 6, which is due to CAG expansion in the α1A subunit of a voltage-dependent calcium channel5-7), the same approximate range holds for the other CAG diseases. This striking observation implies that expansion of the number of glutamine repeats beyond about 40 must be accompanied by a change in structure, an implication that is corroborated by the isolation of a monoclonal antibody that recognizes expanded glutamine repeats specifically8.
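The repeat lengths discussed above can be measured directly from sequence. The following is a minimal sketch, assuming plain DNA or one-letter protein strings as input; the function names and the test sequences are hypothetical. The CAG search ignores reading frame, so it is only a rough screen. The text counts two more glutamines than CAGs (35 CAGs or 37 glutamine residues); since CAA also encodes glutamine, the glutamine run in a protein can be longer than the pure CAG run in its gene.

```python
import re

def longest_cag_run(dna):
    """Length, in codons, of the longest uninterrupted CAG run in a DNA string.

    Reading frame is not checked; this is an illustrative simplification.
    """
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)

def longest_residue_run(protein, residue="Q"):
    """Length of the longest run of one residue (default glutamine, Q)."""
    runs = re.findall(f"(?:{re.escape(residue.upper())})+", protein.upper())
    return max((len(run) for run in runs), default=0)

# Made-up sequences for illustration only.
dna_example = "ATG" + "CAG" * 5 + "CAACAG" + "CCG"
protein_example = "MKAF" + "Q" * 23 + "PPP"
cag_run = longest_cag_run(dna_example)
q_run = longest_residue_run(protein_example)
```

The same `longest_residue_run` helper applies to any homopolymeric strand, for example `residue="T"` for threonine repeats.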
0	Aggregation  of  huntingtin  in  neural  cell  nuclei
0	The  great  turnaround  began  with  the  discovery,  by  Bates  and  co-workers12,
0	See  front  matter  ©  1999,  Elsevier  Science.  All  rights  reserved.
0	that transgenic mice expressing the first exon of the human HD gene developed neurological symptoms. This exon includes the CAG repeat and the codons for a proline repeat. Lines of mice that carried 18 CAG repeats developed normally and remained healthy. By contrast, lines that possessed 115-156 repeats developed neurological symptoms at approximately two months, and the disease progressed rapidly during the following months. Because these lines also carried the normal mouse HD genes, the experiment provided evidence, if such was still needed, that the neurological symptoms of HD are due to gain, rather than loss, of function and that the glutamine repeats alone are sufficient to provoke the disease. The lengths of the CAG repeats in the mice also exhibited somatic and intergenerational instability similar to that found in humans. But why did expression of exon 1 provoke disease? Davies and others13 found the answer by performing an immunohistochemical analysis of the brains of affected and control mice, using antibodies that recognized exon-1-encoded peptides of mice and humans (Fig. 2). In control mice, the antibodies labelled isolated particles that were scattered throughout the cytoplasm and vesicular membranes, but not cell nuclei. By contrast, the neurons of the mice expressing the expanded CAG repeats contained, in addition to the scattered stains seen in control mice, prominent circular intranuclear inclusions and occasional filaments. Stain could also be seen in neurites and in nuclear pores. Antibodies to portions of huntingtin not coded for by exon 1 failed to stain the inclusions, which showed that the endogenous, normal mouse huntingtin was excluded. 
Nuclear  membranes  in  the  striatum  had  many  invaginations  and  more  pores  than  those  present  in  controls.  The  largest  inclusion  bodies  were  in  the  cerebral  cortex,  striatum,  Purkinje  cells  of  the  cerebellum  and  motor  neurons  of  the  spinal  cord  (Fig.  3).  As  in  the  plaques  and  fibres  of  amyloid  diseases,  the  inclusion  bodies  are  stained  by  antibodies  against  ubiquitin,  a  scavenging  protein  that  forms  covalent  links  with  lysine  residues  in  unfolded  or  incorrectly  folded  proteins.  Following  these  discoveries,  Di  Figlia  and  colleagues14  re-examined  postmortem  brains  from  HD  patients  and  found  nuclear  inclusions  similar  to  those  in  the  transgenic  mice,  which  were  not  seen  in  control  patients.  They  were  stained  by  an  antiserum  against  residues  1-17  of  huntingtin,  which  precede  the
0	[Figure: CI2 constructs. Panels: truncated wild-type CI2; loop-insertion mutant (...LPVGTIVTM-G-Q10-G-MEYRID...); loop-replacement mutant.]
0	glutamine  repeat,  and  by  antibodies  against  ubiquitin  (Fig.  3).  An  antibody  against  residues  about  one  fifth  of  the  way  along  the  huntingtin  chain  did  not  stain  the  nuclear  inclusion  bodies,  even  though  this  stain  was  taken  up  readily  by  huntingtin  in  the  cytoplasm  of  both  HD  and  normal  brains.  These  results  implied  that  only  N-terminal  fragments  of  huntingtin  had  entered  the  nuclei.  The  size  of  these  fragments  was  ~350  amino  acid  residues  -  about
0	plex may participate directly in this repression. Intriguingly, EED, the mammalian homolog of ESC and MES-6, is involved in maintaining X-chromosome inactivation in extraembryonic tissues of female mouse embryos (23). How might MES-4 participate in X-chromosome repression? MES-4 on the autosomes may protect them from the binding, spreading, or action of repressors, such as the MES-2/MES-3/MES-6 complex or histone-modifying enzymes. This would serve to focus repression on the X chromosomes, which lack MES-4 protection. This model for MES-4 action is consistent with several observations, including the following: (i) mes-4 mutants display the same sensitivity to X-chromosome dosage as mes-2, mes-3, and mes-6 mutants; and (ii) MES-4, like MES-2, MES-3, and MES-6, is required for repression of germline expression of transgenes present in repetitive arrays (24). The activation of transgenes in mes-4 mutants may be due to titration of limited levels of repressor by autosomal chromatin that in wild type does not bind the repressor. This scenario predicts that the X chromosomes are desilenced in mes-4 mutants, as we predicted occurs in mes-2, mes-3, and mes-6 mutants.
0	Sp1  and  TAFII130  Transcriptional  Activity  Disrupted  in  Early  Huntington's  Disease
1	Anthone  W.  Dunah,1  Hyunkyung  Jeong,1  April  Griffin,1  Yong-Man  Kim,2  David  G.  Standaert,1  Steven  M.  Hersch,1  M.  Maral  Mouradian,2  Anne  B.  Young,1  Naoko  Tanese,3  Dimitri  Krainc1*
0	Huntington's disease (HD) is an inherited neurodegenerative disease caused by expansion of a polyglutamine tract in the huntingtin protein. Transcriptional dysregulation has been implicated in HD pathogenesis. Here, we report that huntingtin interacts with the transcriptional activator Sp1 and coactivator TAFII130. Coexpression of Sp1 and TAFII130 in cultured striatal cells from wild-type and HD transgenic mice reverses the transcriptional inhibition of the dopamine D2 receptor gene caused by mutant huntingtin and protects neurons from huntingtin-induced cellular toxicity. Furthermore, soluble mutant huntingtin inhibits Sp1 binding to DNA in postmortem brain tissues of both presymptomatic and affected HD patients. Understanding these early molecular events in HD may provide an opportunity to interfere with the effects of mutant huntingtin before the development of disease symptoms. Huntington's disease (HD) is a dominantly inherited neurodegenerative disorder manifested by psychiatric, cognitive, and motor symptoms typically starting in midlife and progressing toward death. HD is caused by expansion of a polyglutamine tract in the huntingtin protein. The number of diseases caused by polyglutamine expansions continues to grow, and a common mechanism could underlie these disorders. One hypothesis suggests that expanded polyglutamines result in aberrant interactions with nuclear proteins and thereby lead to transcriptional dysregulation (1-7). If huntingtin is involved in regulating gene transcription, it is important to determine which genes may be affected by normal and/or mutant huntingtin. Some obvious candidates are genes whose expression is altered in HD patients or in animal models of HD. 
Neurotransmitter  receptor  alterations  have  been  described  in  early-stage  human  HD  autopsy  material,  and  many  of  these  changes  have  been  confirmed  in  transgenic  mouse  models  of  HD  (8,  9).  Gene  expression  assays  on  DNA  microarrays  have  shown  that
0	the scope of mRNA changes in transgenic HD mice involves several groups of genes, including neurotransmitter receptors and intracellular signaling systems (10). The known regulatory sequences of these genes contain binding sites for the transcription factor Sp1, suggesting that huntingtin may interfere with Sp1-mediated transcription. Sp1 is a ubiquitous transcriptional activator whose major function is recruitment of the general transcription factor TFIID to DNA (11). TFIID is a multisubunit complex made up of the TATA box-binding protein (TBP) and multiple TBP-associated factors (TAFs) (12). Involvement of one of the human TAFs, TAFII130, in activator-TAF interactions has been examined in detail (13, 14). TAFII130 interacts with various cellular activators, including Sp1 and CREB, suggesting that TAFII130 may be critical for the transcriptional activation function of these factors by bridging them to the basal machinery. Using the yeast two-hybrid system (15), we found that both Sp1 and TAFII130 interact with full-length huntingtin (Fig. 1). The interactions between Sp1 and huntingtin are stronger in the presence of an expanded polyglutamine repeat (HttQ75) as compared to the nonexpanded repeat length (HttQ17) (Fig. 1A), whereas the interactions between TAFII130 and huntingtin are not significantly influenced by the polyglutamine tract length (Fig. 1B). Although the glutamine-rich regions of Sp1 (Sp1AB) and TAFII130 (TAFII130-M) are sufficient for their interaction with huntingtin, the presence of the COOH-terminal DNA binding domain of Sp1
0	or the conserved COOH-terminal domain of TAFII130 results in stronger interaction. Because NH2-terminal fragments of mutant huntingtin can effectively induce cell death in both in vivo and in vitro models (16-19), we examined the interactions of Sp1 and TAFII130 with the 480-amino acid NH2-terminal fragment of huntingtin. Compared with the full-length protein, NH2-terminal fragments showed similar, polyglutamine length-dependent interactions with Sp1, whereas their interactions with TAFII130 were independent of polyglutamine length (Fig. 1, A and B). To further examine the strength of huntingtin/Sp1 and huntingtin/TAFII130 interactions in relation to polyglutamine length, we cotransfected HEK 293T cells with expression plasmids for normal (HttQ17) or mutant (HttQ75) full-length huntingtin and Flag-tagged Sp1 or hemagglutinin (HA)-tagged TAFII130 (15). Coimmunoprecipitations of the transfected proteins with antibodies to huntingtin showed that Sp1 preferentially interacted with mutant huntingtin (Fig. 1C), whereas TAFII130 bound similarly to both normal and mutant huntingtin (20). These results, together with the yeast two-hybrid data, indicate that polyglutamine expansion enhances the interaction of Sp1, but not TAFII130, with huntingtin. To establish whether huntingtin interacts with Sp1 and TAFII130 in the human brain, coimmunoprecipitation studies were performed using extracts from the caudate nucleus of grade 1 HD brain with antibodies to Sp1 (anti-Sp1) (Fig. 1D), to TAFII130 (anti-TAFII130) (Fig. 1E), or to huntingtin (anti-Htt) (15). Both anti-Sp1 and anti-TAFII130 precipitated huntingtin protein. In addition, anti-Htt coimmunoprecipitated substantial amounts of Sp1 and TAFII130 proteins. 
We  found  that  the  immunoprecipitated  complex,  in  addition  to  TAFII130,  contained  other  TAFs  (21),  suggesting  that  TAFII130  interacts  with  huntingtin  in  the  context  of  TFIID.  However,  because  we  found  TAFII130  to  be  expressed  at  higher  levels  in  HD  brain  tissue,  it  is  possible  that  huntingtin  interacts  with  free  TAFII130  as  well  (see  below).  Next,  we  tested  whether  mutant  huntingtin  affects  the  interactions  between  Sp1  and  TAFII130  in  HD  brain  tissue.  In  coimmunoprecipitation  experiments  using  anti-Sp1  and  anti-TAFII130,  we  found  a  decrease  in  the  interactions  between  Sp1  and  TAFII130  in  the  postmortem  human  HD  brain  as  compared  to  the  control  brain  (Fig.  1F). 
0	Control  of  Stochasticity  in  Eukaryotic  Gene  Expression
1	Jonathan  M.  Raser  and  Erin  K.  O'Shea*
0	rescent  proteins  (CFP  and  YFP)  from  identical  promoters,  integrated  at  the  same  locus  on  homologous  chromosomes  (Fig.  1A).  Two  types  of  noise  are  distinguished  in  our  analysis:  intrinsic  noise  attributable  to  stochastic  events  during  gene  expression,  and  extrinsic  noise  due  to  any  existing  cellular  heterogeneity  that  affects  gene  expression  or  to  stochastic  events  in  upstream  signal  transduction  (5).  For  each  population  of  cells,  we  calculated  the  variability  in  terms  of  two  metrics:  the  noise,  defined  as  the  standard  deviation  divided  by  the  mean,  which  we  present  to  convey  the  magnitude  of  variability  as  a  percentage  of  the  level  of  gene  expression;  and  the  noise  strength,  or  variance  divided  by  the  mean,  which  we  use  for  our  analysis  because  it  is  independent  of  population  mean  for  a  single  stochastic  process  (supporting  online  text).  We  induced  the  expression  of  CFP  and  YFP  from  the  budding  yeast  PHO5  promoter  and  measured  the  fluorescence  of  single  cells  in  random  subpopulations  at  multiple  times  after  induction  (Fig.  1B).  The  total  noise  of
0	time  points  (in  minutes)  are  indicated  with  different  colors.  Extrinsic  noise  is  manifested  as  scatter  along  the  diagonal  and  intrinsic  noise  as  scatter  perpendicular  to  the  diagonal.  AU,  arbitrary  units  of  fluorescence.  (C)  Total,  extrinsic,  and  intrinsic  noise  strength  as  functions  of  population  mean  for  (B).  The  solid  line  represents  expectations  for  a  single  stochastic  process,  and  error  bars  represent  bootstrap  values  (6).
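The two variability metrics defined above (noise as standard deviation over mean, noise strength as variance over mean) can be computed as in the sketch below. The intrinsic/extrinsic split is not spelled out in this excerpt; the estimator shown is a commonly used dual-reporter formula and is an assumption here, as are all function names and the example values. It also assumes the two reporter signals have been scaled to comparable means and uses population rather than sample statistics.

```python
from statistics import mean, pstdev, pvariance

def noise(values):
    """Noise: standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

def noise_strength(values):
    """Noise strength: variance divided by the mean."""
    return pvariance(values) / mean(values)

def dual_reporter_noise(cfp, yfp):
    """Return (intrinsic, extrinsic) noise from paired single-cell readings.

    Assumed estimator (not given in the excerpt):
      intrinsic^2 = <(c - y)^2> / (2 <c><y>)
      extrinsic^2 = (<c*y> - <c><y>) / (<c><y>)
    """
    n = len(cfp)
    mc, my = mean(cfp), mean(yfp)
    intrinsic_sq = sum((c - y) ** 2 for c, y in zip(cfp, yfp)) / n / (2 * mc * my)
    extrinsic_sq = (sum(c * y for c, y in zip(cfp, yfp)) / n - mc * my) / (mc * my)
    # Sampling error can push the extrinsic estimate slightly below zero.
    return intrinsic_sq ** 0.5, max(extrinsic_sq, 0.0) ** 0.5

# Made-up fluorescence values (arbitrary units) for eight cells.
cells = [2, 4, 4, 4, 5, 5, 7, 9]
total_noise = noise(cells)
strength = noise_strength(cells)
# Perfectly correlated reporters: all variability is extrinsic.
intrinsic, extrinsic = dual_reporter_noise(cells, cells)
```

With identical reporter readings, every point lies on the diagonal, so the intrinsic term vanishes and the extrinsic term equals the total noise.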
0	RESEARCH  ARTICLES
0	The  Transcriptional  Program  of  Sporulation  in  Budding  Yeast
1	S.  Chu,*  J.  DeRisi,*  M.  Eisen,  J.  Mulholland,  D.  Botstein,  P.  O.  Brown,  I.  Herskowitz
0	Diploid  cells  of  budding  yeast  produce  haploid  cells  through  the  developmental  program  of  sporulation,  which  consists  of  meiosis  and  spore  morphogenesis.  DNA  microarrays  containing  nearly  every  yeast  gene  were  used  to  assay  changes  in  gene  expression  during  sporulation.  At  least  seven  distinct  temporal  patterns  of  induction  were  observed.  The  transcription  factor  Ndt80  appeared  to  be  important  for  induction  of  a  large  group  of  genes  at  the  end  of  meiotic  prophase.  Consensus  sequences  known  or  proposed  to  be  responsible  for  temporal  regulation  could  be  identified  solely  from  analysis  of  sequences  of  coordinately  expressed  genes.  The  temporal  expression  pattern  provided  clues  to  potential  functions  of  hundreds  of  previously  uncharacterized  genes,  some  of  which  have  vertebrate  homologs  that  may  function  during  gametogenesis.  All  sexually  reproducing  organisms  have  a  specialized  developmental  pathway  for  gametogenesis,  in  which  diploid  cells  undergo  meiosis  to  produce  haploid  germ  cells.  Gametogenesis  in  yeast  (sporulation)  involves  two  overlapping  processes,  meiosis  and  spore  morphogenesis  (Fig.  1),  and  results  in  four  haploid  spores.  Each  spore  is  capable  of  germinating  and  fusing  with  a  cell  of  the  opposite  mating  type,  analogous  to  the  fusion  of  egg  and  sperm.  Sporulation  in  yeast  is  characterized  by  sequential  transcription  of  at  least  four  sets  of  genes--early,  middle,  mid-late,  and  late  (1).  Most  of  the  known  early  genes  are  involved  in  meiotic  prophase  (pairing  of  homologous  chromosomes  and  recombination).  The  Ume6/Ime1  complex,  which  recognizes  a  conserved  site  (URS1)  found  in  the  upstream  region  of  many  of  the  known  early  genes,  appears  to  be  the  major  transcriptional  regulator  of  this  class  (2,  3).  
Products  of  the  known  middle  genes  are  required  for  the  concomitant  events  of  meiotic  nuclear  division  and  spore  formation  (4-6).  Ndt80,  a  meiosis-specific  transcription  factor,  has  been  shown  to  be  important  in  inducing  transcription  of  middle  genes  at  the  end  of  meiotic  prophase,  binding  to  the  middle  gene
0	patterns  for  hundreds  of  genes  in  a  compact  graphical  format  suitable  for  this  report,  measured  changes  in  mRNA  levels  were  shown  in  a  tabular  form  (Fig.  3B),  with  rows  corresponding  to  individual  genes  and  columns  corresponding  to  the  successive  intervals  during  the  sporulation  program  at  which  mRNA  levels  were  measured.  The  changes  in  expression  of  each  gene  are  represented  in  the  table  not  as  numbers,  but  by  mapping  the  numerical  values  onto  a  color  scale.  Increases  in  expression  relative  to  vegetative  cells  are  represented  as  graded  shades  of  red,  and  decreases  as  graded  shades  of  green.  The  pattern  of  expression  as  assayed  by  Northern  analysis  (Fig.  3A)  was  very  similar  to  that  determined  by  microarray  analysis  (Fig.  3B).
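The color mapping described above can be sketched as a small function. The excerpt does not state the numerical scale, so this sketch assumes log2 expression ratios, a black color for unchanged genes and a hypothetical `saturation` value beyond which the color no longer brightens.

```python
def ratio_to_rgb(log2_ratio, saturation=3.0):
    """Map a log2 expression ratio to an (R, G, B) tuple, 0-255 per channel.

    Increases are shown as graded red, decreases as graded green.
    saturation: hypothetical |log2 ratio| at which the color is brightest.
    """
    level = min(abs(log2_ratio) / saturation, 1.0)
    intensity = round(255 * level)
    if log2_ratio > 0:
        return (intensity, 0, 0)
    if log2_ratio < 0:
        return (0, intensity, 0)
    return (0, 0, 0)

# One row of a hypothetical table: one gene over successive time points.
row = [0.0, 1.0, 3.0, -6.0]
colors = [ratio_to_rgb(value) for value in row]
```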
0	Sequential  Induction  of  Genes  During  Sporulation
0	Of  the  about  6200  protein-encoding  genes  in  the  yeast  genome,  more  than  1000  showed  significant  changes  in  mRNA  levels  during  sporulation  (20).  About  half  of  these  genes  were  induced  during  sporulation,  and  half  were  repressed.  To  facilitate  the  visualization  and  interpretation  of  the  gene  expression  program  represented  in  this  very  large  body  of  data,  we  have  used  the  method  of  Eisen  et  al.  (21,  22)  to  order  genes  on  the  basis  of  similarities  in  their  expression  patterns  and  display  the  results  in  a  compact  graphical  format  (Fig.  4A).  The  relatively  small  number  of  genes  (about  50)  whose  transcription  has  been  studied  previously  had  defined  four  temporal  classes  of  sporulation-specific  genes  (1).  These  classes  were  evident  in  this  analysis  but  were  not  sufficient  to  represent  the  diversity  of  observed  expression  patterns.  We  found  it  useful  to  distinguish  seven  temporal  patterns  of  induced  transcription  that  reflect  sequential  progression  through  this  program,  even  though  well-defined  boundaries  between  temporal  classes  could  not  be  determined.  (Increased  synchrony  and  more  frequent  time  points  might  sharpen  these  boundaries  and  reveal  more  classes.)  For  each  of  these  seven  temporal  patterns,  a  small,  representative  set  of  genes  was  hand-picked  and  used  to  define  a  model  expression  profile  (Fig.  4B).  A  variety  of  temporal  expression  patterns  were  also  observed  for  the  genes  whose  mRNA  transcripts  decreased  during  sporulation.  To  display  the  results  as  shown  in  Fig.  5A,  correlation  coefficients  were  computed,  relating  the  expression  profiles  of  each  induced  gene  to  each  of  the  seven  model  profiles  in  Fig.  4B.  
Genes  were  then  grouped  according  to  the  model  profile  that  gave  the  highest  correlation  coefficient.  The  seven  groups  were  placed  in  a  sequence  that  reflected  the  time  of  initial  induction.  Genes  assigned  to  each  group  were  then  further  ordered  on  the  basis  of  the
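The grouping step described above, in which each induced gene is assigned to the model profile with the highest correlation coefficient, can be sketched as follows. This assumes the Pearson correlation coefficient; the excerpt does not name the variant used. The profile names and values are made up, and only two model profiles are shown for brevity.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Profiles with zero variance are not handled and raise ZeroDivisionError.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def assign_to_models(profiles, models):
    """Assign each gene to the model profile giving the highest correlation."""
    assignment = {}
    for gene, profile in profiles.items():
        best = max(models, key=lambda name: pearson(profile, models[name]))
        assignment[gene] = best
    return assignment

# Hypothetical model profiles and gene profiles over seven time points.
models = {
    "early": [0, 2, 3, 2, 1, 0, 0],
    "late": [0, 0, 0, 1, 2, 3, 3],
}
profiles = {
    "gene1": [0.1, 1.8, 2.9, 2.2, 0.8, 0.2, 0.1],
    "gene2": [0.0, 0.1, 0.2, 0.9, 2.1, 2.8, 3.1],
}
groups = assign_to_models(profiles, models)
```

Because correlation ignores amplitude, a weakly induced gene and a strongly induced gene with the same timing fall into the same group.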
0	Generating  Protein  Interaction  Maps  from  Incomplete  Data:  Application  to  Fold  Assignment
1	Michael Lappe1, Jong Park1,3, Oliver Niggemann2 and Liisa Holm1
0	ABSTRACT Motivation: We present a framework to generate comprehensive overviews of protein-protein interactions. In the post-genomic view of cellular function, each biological entity is seen in the context of a complex network of interactions. Accordingly, we model functional space by representing protein-protein interaction data as undirected graphs. We suggest a general approach to generate interaction maps of cellular networks in the presence of huge amounts of fragmented and incomplete data, and to derive representations of large networks which hide clutter while keeping the essential architecture of the interaction space. This is achieved by contracting the graphs according to domain-specific hierarchical classifications. The key concept here is the notion of induced interaction, which allows the integration, comparison and analysis of interaction data from different sources and different organisms at a given level of abstraction. Results: We apply this approach to compute the overlap between the DIP compendium of interaction data and a dataset of yeast two-hybrid experiments. The architecture of this network is scale-free, as frequently seen in biological networks, and this property persists through many levels of abstraction. Connections in the network can be projected from higher levels of abstraction down to the level of individual proteins. As an example, we describe an algorithm for fold assignment by network context. This method currently predicts protein folds at 30% accuracy without any requirement of detectable sequence similarity of the query protein to a protein of known structure. 
We  used  this  algorithm  to  compile  a  list  of  structural  assignments  for  previously  unassigned  genes  from  yeast.  Finally  we  discuss  ways  forward  to  use  interaction  networks  for  the  prediction  of  novel  protein-protein  interactions.
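The idea of contracting a graph according to a classification, so that protein-level edges induce interactions between classes, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the protein labels and the family labels are hypothetical, and a single flat mapping stands in for one level of a hierarchical classification.

```python
def contract(edges, classification):
    """Contract protein-level edges to class-level induced interactions.

    edges: iterable of (protein_v, protein_w) pairs, treated as undirected.
    classification: dict mapping protein -> class label (e.g. a family).
    Returns a set of frozensets; a one-element frozenset represents an
    interaction between two members of the same class.
    Proteins without a class label are skipped.
    """
    induced = set()
    for v, w in edges:
        if v in classification and w in classification:
            induced.add(frozenset((classification[v], classification[w])))
    return induced

# Hypothetical proteins P1..P4 and families famA, famB.
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P9")]
families = {"P1": "famA", "P2": "famB", "P3": "famB", "P4": "famA"}
induced = contract(edges, families)
```

Here four protein-level edges collapse to two class-level interactions, which is the sense in which contraction hides clutter while keeping the connections between classes.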
0	OPERATIONALIZING THE NOTION OF FUNCTION As more experimental data on protein interaction becomes available, it will be of critical importance to integrate and compare the data derived from different sources. The analysis of interaction data aims to reveal the organizational principles of cellular networks and to describe the architecture of biochemical and genetic networks. A key difficulty on the way is the incomplete experimental characterization of most biological systems. To fill the information gaps, we need to find ways of generalizing from individual experimental evidence (e.g. protein-protein interactions) to higher level biological entities (e.g. protein families) in order to generate structural and functional annotations. The notion that cellular functions are the outcome of molecular interactions among biological entities is reflected in the post-genomic approach to cellular function, where molecular biological entities are seen as nodes in a complex network of interactions (Eisenberg, 2000). This view helps to operationalize the notion of function, since it allows us to build models of the cellular circuitry as undirected graphs G = (V, E). Here the set of vertices V represents proteins connected by a set of edges E, which represent the interactions. A big problem with available interaction data is that it is fractionated and the data files come in the form of binary sets. Each data record represents a single edge e = (v, w) ∈ E in our graph and denotes that protein v interacts with protein w. Experimental techniques such as yeast two-hybrid experiments are suitable for genome-wide screening but yield large numbers of both false positives and false negatives. 
M. Lappe et al.
Therefore, our knowledge of functional space in terms of protein interactions is still far from being complete. Clearly, in order to generate a comprehensive interaction map from fractionated and incomplete data, the input sets of binary interactions have to be merged in a meaningful way by joining their nodes. Within a given organism, it is easily possible to link the given set of edges (interactions) via identical node labels, as was done for yeast (Schwikowski et al., 2000). However, for many genomes we know only the gene complement (set of nodes) and need to infer the interaction network by mapping information from other sources and organisms. We were motivated to the present work by the observation that even for such a simple model organism as yeast, with 6000 genes, the current interaction network is far too complex (densely connected) to be perceived as a whole (Mayer & Hieter, 2000). Therefore, especially for the upcoming amounts of experimental data from yeast two-hybrid, gene expression, co-immunoprecipitation, TAP and protein array experiments etc., it is important to have means to condense the interaction data into functional modules in a comprehensive way. It is intuitively clear that, like in aerial archaeology, where the structure of an ancient settlement is invisible from the ground and only becomes apparent from aerial photographs, we have to take a step back to get an idea of the bigger picture. In this paper, we present a general framework that is able to integrate data at different levels of abstraction coming from different sources and different species, and is able to condense the amount of data in a meaningful way. As a result, we get a glimpse at the overall architecture and topology of cellular networks.
CLUSTERING OF INTERACTION INFORMATION (CONTRACTION)
Here we apply a graph-theoretic framework to generate protein interaction maps from a given set of binary interactions obtained from experimental data. Such a set can be seen as a graph $G = (V, E)$. Initially, each node $v \in V$ is connected to just one other node $w \in V$, representing experimental evidence that the protein represented by $v$ interacts with the protein represented by $w$. So how do we go about joining these binary interactions in order to get an overview? It is fairly obvious that for interaction data derived from the same species it is possible to assign a finite set of protein names $L$ to the interacting partners represented as the set of nodes $V$. This defines a labeling function $l : V \to L$, where $l$ represents our knowledge about the identity of proteins within the proteome. Then it is straightforward to link the given interactions via nodes with identical labels. The method described above does not work across species, unless we have a way of identifying the same (homologous) proteins in different species.
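The joining of binary interactions via identical node labels can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `merge_binary_interactions`, the protein names `p1` to `p4`, and the use of upper-casing as the labeling function are all assumptions made for the example.

```python
from collections import defaultdict


def merge_binary_interactions(records, label=lambda node: node):
    """Join binary interaction records into one undirected graph.

    records: iterable of (v, w) pairs, one pair per experimental observation.
    label:   stand-in for the labeling function l: V -> L; nodes that receive
             the same label are merged into a single node.
    Returns a dict mapping each label to the set of labels it interacts with.
    """
    graph = defaultdict(set)
    for v, w in records:
        a, b = label(v), label(w)
        graph[a].add(b)
        graph[b].add(a)  # undirected: store the edge in both directions
    return dict(graph)


# Hypothetical records from two data sets whose names differ only in case.
records = [("p1", "p2"), ("P1", "p3"), ("p2", "p4")]
network = merge_binary_interactions(records, label=str.upper)
```

With the identity labeling, `p1` and `P1` would stay separate nodes; a labeling function that maps both to the same name links the two records, which is the step that fails across species unless homologous proteins can be identified.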
Generating Protein Interaction Maps from Incomplete Data
INDUCED INTERACTIONS AND THE LEVEL OF ABSTRACTION
Let $G = (V, E)$ be the graph abstraction of the biological system under consideration (the interaction network). Then $\mathcal{C}(G) = (C_1, \ldots, C_n)$ is a decomposition of $G$ into $n$ subgraphs induced on the $C_i$, if $\bigcup_{C_i \in \mathcal{C}} C_i = V$ and $C_i \cap C_j = \emptyset$ for all $j \neq i$. The induced subgraphs $G(C_i)$ are called clusters. The set of edges $E_c \subseteq E$ consists of the set of edges between
0	Statistical  Applications  in  Genetics  and  Molecular  Biology
0	Parameter  estimation  for  the  calibration  and  variance  stabilization  of  microarray  data
1	Wolfgang  Huber  Anja  von  Heydebreck  Holger  Sueltmann  Annemarie  Poustka  Martin  Vingron
0	Huber  et  al.:  Variance  stabilization  of  microarray  data
0	The  model
A microarray consists of a set of probes immobilised on a solid support. The probes are chosen such that they bind to specific sample molecules; for DNA arrays, this is ensured by the sequence-specificity of the hybridization reaction between complementary DNA strands. The interesting fraction from the biological sample is prepared in solution, labeled with fluorescent dye and allowed to bind to the array. The abundance of sample molecules can then be compared through comparing the fluorescence intensities at the matching probe sites. The measured intensity $y_{ki}$ of probe $k = 1, \ldots, n$ for sample $i = 1, \ldots, d$ may be decomposed into a specific and an unspecific part,

$$y_{ki} = \alpha_{ki} + \beta_{ki} x_{ki}. \qquad (1)$$
Here, $x_{ki}$ is the abundance of the transcript represented by probe $k$ in the sample $i$, $\beta_{ki}$ is a proportionality factor, and $\alpha_{ki}$ subsumes unspecific signal contributions which may be caused by effects such as non-specific hybridization, cross-hybridization or background fluorescence. The offsets $\alpha_{ki}$ and gain factors $\beta_{ki}$ are usually not known, but microarray technologies are designed in such a way that their values for different $k$ and $i$ are related. This makes it possible to infer statements about the concentrations $x_{ki}$ from the measured data $y_{ki}$. Relations between the offsets and gain factors for different $k$ and $i$ can be expressed in terms of a further decomposition,

$$\beta_{ki} = \beta_i \gamma_k e^{\eta_{ki}}, \qquad (2)$$
$$\alpha_{ki} = a_i + \tilde{\nu}_{ki}. \qquad (3)$$

Produced by The Berkeley Electronic Press, 2003
Thus, the gain factor is the product of a probe affinity $\gamma_k$, which is the same for all measurements involving probes of type $k$, times a normalization factor $\beta_i$, which applies to all measurements from sample $i$. The remainder $\beta_{ki}/(\beta_i \gamma_k)$ is accounted for by $e^{\eta_{ki}}$. One can choose the units of $\beta_i$ and $\gamma_k$ such that $\sum_k \eta_{ki} = \sum_i \eta_{ki} = 0$. The unspecific signal contribution $\alpha_{ki}$ can be decomposed into a per-sample offset $a_i$ and a remainder $\tilde{\nu}_{ki}$ with $\sum_k \tilde{\nu}_{ki} = 0$.

The probe affinity $\gamma_k$ may depend, for example, on the probe sequence, secondary structure and the abundance of probe molecules on the array. The normalization factor $\beta_i$ may depend, for example, on the amount of mRNA in the sample, on the labeling efficiency, and on dye quantum yield. The idea behind the decompositions (1)-(3) is that while the individual values of $\eta_{ki}$ and $\tilde{\nu}_{ki}$ may fluctuate around zero, they do so in an unsystematic, random manner. Thus, for example, we assume that there are no systematic non-linear effects, which would imply trends in the $\eta_{ki}$ or $\tilde{\nu}_{ki}$ dependent on the value of $x_{ki}$.

Now one can reduce the parameter complexity of Eqn. (1) through the following three modeling steps:
1. Do not try to explicitly determine the probe affinities $\gamma_k$. They can be absorbed into $m_{ki} = \gamma_k x_{ki}$, which may be considered a measure of the abundance of transcript $k$ in sample $i$ in probe-specific units.
2. Treat $\eta_{ki}$ and $\tilde{\nu}_{ki}$ as "noise terms" coming from appropriate probability distributions.
3. Estimate the values of the normalization factors $\beta_i$ and offsets $a_i$, as well as parameters of the probability distributions, from the data.

Thus, Eqn. (1) leads to the following stochastic model:

$$\frac{Y_{ki} - a_i}{\beta_i} = m_{ki} e^{\eta_{ki}} + \nu_{ki}, \qquad \eta_{ki} \sim L_\eta, \quad \nu_{ki} \sim L_\nu. \qquad (4)$$
Here, $\nu_{ki} = \tilde{\nu}_{ki}/\beta_i$ is the additive noise scaled by the normalization factor $\beta_i$. The right hand side of Eqn. (4) is a combination of an additive and a multiplicative error term. It was proposed by Rocke and Durbin [5], using normal distributions $L_\eta = N(0, \sigma_\eta^2)$ and $L_\nu = N(0, \sigma_\nu^2)$. In the following, we will consider distributions $L_\eta$ and $L_\nu$ that are unimodal, roughly symmetric, and have mean zero and variances $\sigma_\eta^2$ and $\sigma_\nu^2$, respectively, but we do not rely on the assumption of a normal distribution. The left hand side describes the calibration of the microarray intensities $Y_{ki}$ through subtraction of offsets $a_i$ and scaling by normalization factors $\beta_i$ [7, 3]. According to Eqn. (4), the variance of the random variable $Y_{ki}$ is related to its mean through

$$\mathrm{Var}(Y_{ki}) = c^2 \left(E(Y_{ki}) - a_i\right)^2 + \beta_i^2 \sigma_\nu^2, \qquad (5)$$

where $c^2 = \mathrm{Var}(e^{\eta})/E^2(e^{\eta})$ is a parameter of the distribution $L_\eta$. In the log-normal case, $c^2 = \exp(\sigma_\eta^2) - 1$. Thus, the relationship of the variance to the mean is a strictly positive, quadratic function. For a highly expressed gene, the variance $\mathrm{Var}(Y_{ki})$ is dominated by the quadratic term and the coefficient of variation of $Y_{ki}$ is approximately $c$, independent of $k$ and $i$. For a weakly expressed or unexpressed gene, the variance $\mathrm{Var}(Y_{ki})$ is dominated by the constant term and the standard deviation of $Y_{ki}$ is approximately $\beta_i \sigma_\nu$, which may be interpreted as the background noise level for the $i$-th sample, and is independent of $k$.
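The two regimes of the variance-to-mean relationship can be checked numerically. The following sketch evaluates the quadratic relationship for the log-normal case; it assumes the form "quadratic term in the offset-corrected mean plus a constant background term" described above. The function name `intensity_variance`, the argument names, and the parameter values are illustrative choices, not values from the paper.

```python
import math


def intensity_variance(mean, offset, norm_factor, sd_additive, sd_log):
    """Variance of a measured intensity as a function of its mean.

    Assumes log-normal multiplicative noise, so that
    c^2 = exp(sd_log^2) - 1, and a constant background term
    (norm_factor * sd_additive)^2.
    """
    c2 = math.exp(sd_log ** 2) - 1.0
    return c2 * (mean - offset) ** 2 + (norm_factor * sd_additive) ** 2


# Illustrative parameters (assumed for this example only).
offset, norm_factor, sd_additive, sd_log = 100.0, 2.0, 5.0, 0.2
c = math.sqrt(math.exp(sd_log ** 2) - 1.0)

# Unexpressed gene: the mean equals the offset, so only background noise remains.
background_sd = math.sqrt(
    intensity_variance(offset, offset, norm_factor, sd_additive, sd_log)
)

# Highly expressed gene: the coefficient of variation approaches c.
high_mean = offset + 1.0e6
high_cv = math.sqrt(
    intensity_variance(high_mean, offset, norm_factor, sd_additive, sd_log)
) / (high_mean - offset)
```

Here `background_sd` equals the product of the normalization factor and the additive noise level, and `high_cv` is numerically indistinguishable from `c`, matching the two limiting cases stated in the text.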
0	Variance  stabilizing  transformations
Consider a random variable $X$ with expectation value 0 and a differentiable function $h$ defined on the range of $X$. Then

$$h(X) = h(0) + h'(0)\, X + r(X)\, X, \qquad (6)$$

where $r$ is a continuous function with $r(0) = 0$, and

$$\mathrm{Var}(h(X)) = h'(0)^2\, \mathrm{Var}(X) + \mathrm{Var}(r(X)\, X) + 2 h'(0)\, E(r(X)\, X^2). \qquad (7)$$
If $h$ does not deviate from linearity too strongly within the range of typical values of $X$, then $r(X)$ is small and the terms involving $r(X)$ on the right hand side of Eqn. (7) are negligible. Thus, for a family of random variables $Y_u$ with expectation values $E(Y_u) = u$ and variances $\mathrm{Var}(Y_u) = v(u)$,

$$\mathrm{Var}(h(Y_u)) \approx h'(u)^2\, v(u). \qquad (8)$$
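As a worked instance of Eqn. (8), consider the simplified case of a purely multiplicative error, in which the variance is proportional to the squared mean, $v(u) = c^2 u^2$. This special case is an assumption made here for illustration; it drops the offset and the additive noise term. Applying Eqn. (8) to the logarithm gives a variance that no longer depends on $u$:

```latex
h(y) = \log y, \qquad h'(u) = \frac{1}{u}
\quad\Longrightarrow\quad
\operatorname{Var}\bigl(h(Y_u)\bigr) \approx h'(u)^2 \, v(u)
  = \frac{1}{u^2} \, c^2 u^2 = c^2 .
```

This is the familiar reason why log-transformed intensities have roughly constant variance for highly expressed genes, where the quadratic term dominates.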
An approximately variance-stabilizing transformation can be obtained by finding a function $h$ for which the right hand side is constant, that is, by integrating $h'(u) =$
0	Note  that  if  h  is  approximately  variance-stab
1	Wlad  Kusnezow  Anette  Jacob  Alexandra  Walijew  Frank  Diehl  Joerg  D.  Hoheisel  Functional  Genome  Analysis,  Deutsches  Krebsforschungszentrum,  Heidelberg,  Germany
0	Antibody  microarrays:  An  evaluation  of  production  parameters
0	Antibody  microarrays  could  have  an  enormous  impact  on  the  functional  analysis  of  cellular  activity  and  regulation,  especially  at  the  level  of  protein  expression  and  protein-protein  interaction,  and  might  become  an  invaluable  tool  in  disease  diagnostics.  The  array  surface  is  bound  to  have  a  tremendous  influence  on  the  findings  from  such  studies.  Apart  from  the  basic  issue  of  how  to  attach  antibodies  optimally  without  affecting  their  function,  it  is  also  important  that  the  cognate  antigens,  applied  within  a  complex  protein  mixture,  all  bind  to  the  arrayed  antibodies  irrespective  of  their  enormous  variety  in  structure.  In  this  study,  various  factors  in  the  production  of  antibody  microarrays  on  glass  support  were  analysed:  the  modification  of  the  glass  surface;  kind  and  length  of  cross-linkers;  composition  and  pH  of  the  spotting  buffer;  blocking  reagents;  antibody  concentration  and  storage  procedures,  in  order  to  evaluate  their  effect  on  array  performance.  Altogether,  data  from  more  than  700  individual  array  experiments  were  taken  into  account.  In  addition  to  home-made  slides,  commercially  available  systems  were  also  included  in  the  analysis.
0	Keywords:  Antibody  /  Cross-linker  /  Glass  slide  /  Microarrays  /  Surface  modification  PRO  0357
0	Introduction
DNA microarrays have become an essential tool in the functional interpretation of sequence information yielded by the various genome projects. Many aspects of modulation and regulation of cellular activity at the level of nucleic acids can be investigated with this technology. A major area of analysis is the study of variations in gene expression by comparing transcript levels present in cells from different tissues or growth conditions. However, the data provide only a limited insight into the process of actual protein expression and even less information on protein-protein interaction or the proteins' biochemical activity. Consequently, there is a strong demand for analysis procedures at the protein level that correspond in performance to the kind of studies possible on DNA microarrays [1-3]. As a matter of fact, even higher capabilities will be required from such techniques. The human proteome is much more complex in composition than the coding portion of the genome. Estimates range
0	WILEY-VCH  Verlag  GmbH  &  Co.  KGaA,  Weinheim
0	Producing  antibody  microarrays
ester, 6-maleimidohexanoic acid N-hydroxysuccinimide ester, 11-maleimidoundecanoic acid N-hydroxysuccinimide ester were obtained from Sigma. Immunoglobulins and corresponding antigens were obtained from the following companies: monoclonal anti-green fluorescent protein (GFP) antibody (IgG1κ isotype) from Hoffmann-La Roche (Mannheim, Germany); monoclonal anti-human interferon-γ antibody (I5521; IgG2a isotype) and recombinant human interferon-γ (I3265) from Sigma-Aldrich. Keyhole limpet hemocyanin (KLH), polyclonal anti-KLH antibody, thyroglobulin and polyclonal anti-thyroglobulin antibody were a kind donation of Eurogentec (Seraing, Belgium); monoclonal anti-p16 antibodies (IgG1 isotype) and recombinant p16 were a gift from MTM Laboratories (Heidelberg, Germany).
0	Surface  derivatisation  of  glass  slides
Untreated slides were washed with ethanol and then etched by immersion in 10% NaOH at room temperature for 1 h. Subsequently, the slides were placed again in 10% NaOH and cleaned by sonication for 15 min. They were rinsed four times in water, washed twice in ethanol and derivatised in the appropriate solution at room temperature for 1 h, again followed by a sonication step. The following derivatisation solutions were used: GPTS slides: 2.5% GPTS, 10 mM acetic acid in ethanol; APTES slides: 5% APTES in 95% ethanol/water; MPTS slides: 1% MPTS, 10 mM acetic acid in ethanol; poly-L-lysine slides: 0.01% poly-L-lysine solution, 0.1× PBS buffer (1× PBS: 137 mM NaCl, 2.7 mM KCl, 10 mM Na2HPO4, 2 mM KH2PO4, pH 7.4). After silanisation, GPTS-treated slides were washed thoroughly with ethanol, while MPTS slides were additionally rinsed with 16 mM acetic acid in ethanol. APTES and poly-L-lysine slides were washed first with water and then twice with ethanol. All slides were dried with nitrogen. The APTES slides were finally baked at 110 °C for 15 min, poly-L-lysine slides at 45 °C for 30 min.
0	Materials  and  methods
0	Materials
0	All  chemicals  and  solvents  were  purchased  from  Fluka  (Taufkirchen,  Germany),  Sigma-Aldrich  (Munich,  Germany)  or  SDS  (Peypin,  France),  unless  stated  otherwise,  and  used  without  additional  purification.  Untreated  slides  were  purchased  from  Menzel-Glaeser  (Braunschweig,  Germany);  amino-silanised  slides  from  Sigma  and  Corning  (Schiphol-Rijk,  The  Netherlands);  FAST  slides  from  Schleicher  &  Schuell  (Einbeck,  Germany);  QMT  epoxy  slides  from  Quantifoil  Micro  Tools  (Jena,  Germany);  aldehyde  slides  and  ArrayIt  spotting  solution  from  TeleChem  (TeleChem  International  Sunnyvale,  CA,  USA).  (3-glycidoxypropyl)trimethoxy  silane  (GPTS),  (3-aminopropyl)trimethoxy  silane  (APTES),  (3-mercaptopropyl)trimethoxy  silane  (MPTS),  BSA,  milk  powder  and  TopBlock  solution  were  obtained  from  Sigma-Aldrich.  4-[N-maleimidomethyl]cyclohexane-1-carboxylhydrazide  dioxane  and  succinimidyl4-[N-maleimidomethyl]-cyclohexane-1-carboxy-[6-amidocaproate]  were  purchased  from  Pierce  (Rockford,  IL,  USA);  3-maleimidopropionic  acid  N-hydroxysuccinimide
0	Addition  of  cross-linkers
Aminosilane (APTES), mercaptosilane (MPTS) and poly-L-lysine slides were additionally derivatised with different cross-linkers. All cross-linkers were diluted in DMF and stored at a concentration of 200 mM at 4 °C. Prior to use, the cross-linkers were diluted in DMF to a final concentration of 20 mM. Fifty µL of the respective cross-linker solution were pipetted onto the slide surface and covered with a glass coverslip that had been cleaned with ethanol. The slides were incubated at room temperature for 3 h. Subsequently, excess cross-linker was removed by washing twice with DMF and twice wit
0	Integrated  Graphical  Analysis  of  Protein  Sequence  Features  Predicted  From  Sequence  Composition
1	Erik  L.L.  Sonnhammer1,2*  and  John  C.  Wootton2  Center  for  Genomics  and  Bioinformatics,  Karolinska  Institutet,  Stockholm,  Sweden  2  Computational  Biology  Branch,  National  Center  for  Biotechnology  Information,  National  Library  of  Medicine,  National  Institutes  of  Health,  Bethesda,  Maryland
Key words: sequence analysis; graphical visualization; dot-plot; database search viewing; sequence complexity; transmembrane; coiled-coil; protein structure; nonglobular proteins; algorithms; data definition format

INTRODUCTION
Any protein sequence, as typically inferred from a genomic or mRNA sequence, potentially represents a rich mosaic of molecular properties reflecting structure, dynamics, interactions, and roles in cellular machinery. Interpretation and annotation of such a sequence is a complex conceptual task, which is usually achieved by a synthesis of algorithmic analysis and expert judgment. Individual algorithms vary in their ability to diagnose or classify
0	various  sequence  features,  and  knowledgeable  human  interpretation  is  generally  considered  to  be  essential.  Even  seemingly  straightforward  outputs,  such  as  database  sequence  similarity  search  results  using  conservative  cutoffs,  are  frequently  greatly  enriched  by  human  abilities  to  perceive  context,  associations,  and  unexpected  pitfalls.  In  all  cases,  graphical  display  can  dramatically  improve  envisioning  and  comprehension  of  the  interrelated  sets  of  data,  and  most  sequence  analysis  software  packages  include  graphical  tools.  In  addition  to  comparative  analysis  of  conserved  domains  and  sequence  motifs  by  means  of  database  searches,  several  algorithms  have  been  designed  to  predict  certain  protein  features  primarily  from  attributes  of  composition  and  repetitiveness.  Such  features  include  secondary  structure  elements,  transmembrane  segments,  signal  peptides,  low-complexity  regions,  coiled-coils,  other  nonglobular  domains,  and  intrinsically  unstructured  regions.  These  results  are  typically  interpreted,  together  with  regions  of  sequence  conservation,  to  infer  a  provisional  map  of  the  possible  structural  and  functional  regions  of  a  protein.  This  task  presents  several  difficulties  and  requires  critical  evaluation  of  results  from  various  compositional,  alignment,  and  modeling  algorithms.  To  assist  these  tasks,  adaptable  software  is  needed  to  take  the  results  of  different  amino  acid  sequence  feature  analysis  programs  and  use  them  as  inputs  into  graphics  programs  designed  for  integrated  visualization.  Also  needed  is  the  ability  to  run  each  program  with  different  parameter  sets  and  compare  the  results  graphically.  
Weighing  the  significance  of  different  types  and  levels  of  evidence  together  usually  leads  to  a  more  accurate  analysis  than  running  each  prediction  program  separately  with  default  parameters.  In  addition,  integrated  analyses  of  this  type  are  valuable  in  calibrating  parameters  during  development  of  computational  methods,  for  example,  to  use  them  in  large-scale  genomic  analysis.  Many  analysis  programs  are  provided  with  very  permissive  default  parameters  to  minimize  false  negatives,  whereas  in  genomewide  analysis,  it  is  often  important  to  use  nondefault  conservative  parameters  to  limit  the  number  of  false  positives.
0	INTEGRATED  GRAPHICAL  SEQUENCE  ANALYSIS
It is desirable, therefore, to view the combined output from several approaches, algorithms, and parameter sets, in many cases juxtaposed with database matches. Here, we describe a flexible software system that meets these various needs and illustrate some of its applications. Because it is impossible to define exact rules on how to interpret such multifaceted data, we provide a set of typical examples that illustrate how logical reasoning based on the combined output of many different analyses can lead to a correct interpretation, or at least avoidance of an incorrect one.

DATA TYPES AND FORMATS
There are in principle two primary types of data for describing sequence features: segments and curves. Segments are defined by a start and an end sequence coordinate. Typically, the sequence between these coordinates is assigned a certain property algorithmically, such as a low-complexity region. Curves (or "profiles"), in contrast, consist of an array of scores, each score being assigned by an algorithm to a single residue. We here use the term "curve" because the term "profile" is mainly used in sequence analysis to denote a matrix of numbers along the sequence. Segments frequently have a score too and may have associations with other pieces of data, particularly if they are "matching segments" that can be aligned by similarity to other sequences or sequence models.
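The distinction between the two data types can be made concrete with a small Python sketch. It represents a segment as a start and end coordinate with an optional score, a curve as one score per residue, and shows one simple way a prediction program might derive segments from an internal curve, namely by thresholding. The class name `Segment`, the function `segments_from_curve`, the threshold rule and the example scores are assumptions for illustration; they do not describe the authors' software or any particular prediction program.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    start: int                     # first residue, 1-based, inclusive
    end: int                       # last residue, 1-based, inclusive
    label: str                     # property assigned to the region
    score: Optional[float] = None  # segments frequently carry a score too


def segments_from_curve(curve: List[float], threshold: float, label: str) -> List[Segment]:
    """Report runs of consecutive residues whose score is >= threshold.

    `curve` holds one score per residue; curve[0] belongs to residue 1.
    Each segment is scored with the maximum curve value inside it.
    """
    segments = []
    start = None
    for pos, value in enumerate(curve, start=1):
        if value >= threshold and start is None:
            start = pos                              # a run begins here
        elif value < threshold and start is not None:
            run = curve[start - 1:pos - 1]           # residues start..pos-1
            segments.append(Segment(start, pos - 1, label, max(run)))
            start = None
    if start is not None:                            # run reaches the last residue
        segments.append(Segment(start, len(curve), label, max(curve[start - 1:])))
    return segments


# Hypothetical per-residue scores for an eight-residue sequence.
curve = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.95, 0.99]
found = segments_from_curve(curve, threshold=0.5, label="example_feature")
```

Running the function again with a different `threshold` on the same curve is a small-scale version of comparing parameter sets: the curve stays fixed while the reported segments change.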
It is often advantageous to browse matching segments from database searches at the level of aligned residues; a special viewer for this purpose is Blixem [1]. Data sets of both segment and curve types can be obtained either by parsing the output of available sequence analysis programs or by independent calculation from the sequence being analyzed. Many prediction programs not only produce a set of segments as output but also calculate a profile internally, according to some mathematical function or empirical scale, as part of the algorithm. This is the case in, for instance, the SEG complexity analysis [2,3], most transmembrane segment prediction programs, and secondary-structure prediction methods. Generally, in these cases, the underlying profile may be readily calculated by using the appropriate function, independently of the program. Some programs report both the segments and the underlying profile, for instance COILS2 [4], which predicts α-helical coiled-coils. A number of established database and visualization systems exist that include built-in functions for sequence segment display. These include ChromoScope [5], bioWidgets [6], APIC [7], the BDGP java sequence viewer [8], GAIA [9], and ACEDB [10]. These are relatively large software suites that require a significant investment in knowledge to become operational, usually due to the intricacies of specifying a practical data model. For instance, the data definition languages (e.g., ACEDB and ASN.1) were designed to store biological objects in a rigorous way. Generating and parsing data in such formats involves supporting a substantial framework of semantic rules. For data consisting only of segments or curves, the complications of conforming to such a format are unwarranted, and a
0	simple  tabular  format  is  adequate.  Furthermore,  many  of  the  available  visualization  systems  have  various  limitations,  depending  on  their  history  of  development,  which  in  many  cases  was  oriented  toward  displaying  genetic  or  physical  maps,  and  thus  have  no  facility  for  curve  data.  To  our  knowledge,  only  the  commercial  APIC  system  was  designed  to  handle  curve  data  in  a  generic  way.  In  contrast  to  these  large,  comprehensive  systems,  our  goal  is  to  provide  simple,  yet  powerful,  generic  tools  that  allow  any  sequence  crunching  program  to  communicate  its  results  to  any  graphical  viewer.  At  the  core  is  a  simple  data  format  for  sequence  feature  series,  which  we  call  SFS.  Sequence  analysis  programs  typically  produce  data  that  are  compatible  with  the  present  SFS  data  model,  but  it  is  also  extensible  to  incorporate  features  that  may  need  special  treatment  in  the  future.  SFS  achieves  a  logical  separation  of  prediction-calculation  programs  and  viewers  and  thus  removes  the  need  for  special  visualization  tools  for  each  individual  program.  Viewers  can  then  become  more  powerful  and  evolved  tools,  whereas  the  algorithmic  implementations  can  be  developed  without  the  extra  burden  of  building  visualization  tools.  The  overhead  for  both  viewers  and  calculation  programs  to  support  the  lightweight  SFS  format  is  minimal.  The  two  core  data  types  in  the  SFS  format  are  segments  and  XY  curves.  An  XY  curve  is  a  two-dimensional  plot  of  a  series  of  X  and  Y  value  pairs,  where  X  is  the  sequence  residue  coordinate.  The  information  stored  is  very  reduced  but  is  sufficient  for  generating  a  rich  and  easily  interpretable  graphical  representation.  In  addition  to  the  coordinates  and  sc
MINIREVIEW Tumor Necrosis Factor (TNF)-α and TNF Receptors in Viral Pathogenesis (44487)
1	GEORGES  HERBEIN*,1
1	WILLIAM  A.  O'BRIEN
Members of the TNF ligand and receptor family act via a common set of signaling molecules to regulate cell differentiation, activation, and viability. Among TNF family members, the first discovered, TNF-α, formerly cachectin, is a proinflammatory cytokine that plays a key role in both inflammatory and infectious diseases, especially in viral infections (1-3). TNF binds to two TNF receptors, TNF-R1 and TNF-R2, that transduce intracellular signals when expressed on the cell surface, while blocking TNF signaling when released as soluble decoys in body fluids. TNF interferes with viral replication in several ways. TNF enhances or inhibits viral replication depending on the virus involved and the cell type infected. The binding of TNF to the TNF receptors can activate, differentiate, or kill target cells, thereby interfering with the viral life cycle. In contrast, viruses have evolved to appropriate the TNF/TNFR pathway to evade immune responses and favor viral dissemination. TNF is also involved in a network of cytokines and chemokines that stimulate the recruitment of immune cells in the infectious foci, thereby enhancing the spread of the viral infection. TNF can also block viral replication by interfering with the viral life cycle, especially viral entry. Thus an intricate balance between the viral life cycle and the cytokine network, especially the TNF/TNFR pathway, is a key component that will influence the pathogenesis of many viral diseases. We will define the role of both TNF and TNFR in immune modulation, describe their signaling pathway, and delineate their role in viral pathogenesis.
0	TNF  and  TNF  Receptors  in  the  Immune  Response
TNF-α binds to two distinct TNFRs, called TNFR1 and TNFR2, that belong to a broader group of related proteins, the TNFR family. The members of the TNFR family have a
0	TNF,  TNF  RECEPTORS,  AND  VIRUSES  241
0	Significance  and  statistical  errors  in  the  analysis  of  DNA  microarray  data
1	James  P.  Brody*,  Brian  A.  Williams,  Barbara  J.  Wold,  and  Stephen  R.  Quake*
DNA microarrays are important devices for high throughput measurements of gene expression, but no rational foundation has been established for understanding the sources of within-chip statistical error. We designed a specialized chip and protocol to investigate the distribution and magnitude of within-chip errors and discovered that, as expected from theory, measurement errors follow a Lorentzian-like distribution, which explains the widely observed but unexplained ill-reproducibility in microarray data. Using this specially designed chip, we examined a data set of repeated measurements to extract estimates of the distribution and magnitude of statistical errors in DNA microarray measurements. Using the common "ratio of medians" method, we find that the measurements follow a Lorentzian-like distribution, which is problematic for subsequent analysis. We show that a method of analysis dubbed "median of ratios" yields a more Gaussian-like distribution of errors. Finally, we show that the bootstrap algorithm can be used to extract the best estimates of the error in the measurement. Quantifying the statistical error in such measurements has important applications for estimating significance levels, clustering algorithms, and process optimization.
a modified algorithm (median of ratios), the distribution became more Gaussian-like and we obtained more consistent results. We describe a method for estimating the error in the measured ratio by using the bootstrap method (3). The bootstrap is an algorithm used to estimate confidence intervals of an arbitrary parameter estimated from a population of measurements. It does this by repeatedly randomly sampling from the population and calculating the parameter of interest. We evaluated this method of error estimation by comparing the actual differences in multiple measurements of the ratio (the median of the ratios) to the estimated error for a single measurement. There is good agreement between the two, leading us to conclude that the bootstrap can give reliable error estimates.

Methods
A test slide was constructed containing 100 spots representing cDNA cloned from mouse glycerol-3-phosphate dehydrogenase (G3PDH). The series of spots were from a single preparation of cDNA. Arrays were hybridized to mRNA from C2C12 and 10T1/2 cell lines. Results are shown in Fig. 1; all 100 points are represented. A 4,608 spot DNA microarray representing 1,152 mouse genes each repeated four times was constructed. mRNA was extracted from a whole adult mouse liver (Cy5) and a C2C12 mouse myoblast cell line (Cy3) and hybridized to the microarray. The slide was scanned and spots were grouped by the cDNA clone they represent. The commonly used measure of signal is the log2 transform of the ratio of medians.
The ratio of medians is defined as "the ratio of the median intensities of each feature for each wavelength, with the median background subtracted." We found that the median of ratios, defined as "the median of pixel-by-pixel ratios of pixel intensities, with the median background subtracted," provided a more consistent measurement. A scatter plot, presented in Fig. 2, was constructed by taking all possible pairs of measurements and plotting them against each other. Points which had background values greater than foreground values in either the Cy3 or Cy5 channel were excluded from the analysis. The ratios were transformed by taking the log2 and normalized. Values are reported in Fig. 2. Numbers were extracted from the image by using GENEPIX software (Axon Instruments, Foster City, CA). We used a computer algorithm to calculate the bootstrap median and confidence levels in the median. The bootstrap algorithm works as follows. A list of measured ratios, one from each pixel in a spot, was compiled. A new list was created by sampling (with replacement) from this list. The median value of the new list was computed and recorded on a list of medians. This procedure was repeated as many times as there were pixels in the spot. The mean and 90% confidence interval in the mean were computed from the list of medians. In the bootstrap algorithm, these represent the best estimate of the median and 90%
0	Any measurement is only an estimate of a physical value, but to be useful the measurement should be accompanied by an estimate of the error. The error in a single measurement can be estimated by examining a histogram of many independently repeated measurements. Typically, a histogram of many measurements will form a normal (i.e., Gaussian) distribution whose mean value is taken as the best estimate of the true value. The standard deviation of this distribution is an estimate of the error in a single measurement. The measurement of ratios poses special statistical problems. The distribution of the ratio x/y of two Gaussian random variables x and y is not necessarily Gaussian. In the case of noisy measurements, where the standard deviation is a significant fraction of the measured value, the distribution of the ratio approaches a Lorentzian or Cauchy distribution (1). In the case of non-noisy measurement, where the standard deviation is a small fraction of the mean, the distribution of the ratio will follow a Gaussian distribution. Loosely speaking, Lorentzian distributions have longer tails than Gaussian distributions. This means that points sampled from a Lorentzian distribution will have more frequent "outliers" than points sampled from a similar Gaussian distribution. The mean, standard deviation, and higher moments of the Lorentzian distribution are undefined. The measurement of ratios can give wide tails and nonsensical error estimates unless the data are handled properly. Thus, one needs to turn to other statistical tools for measurement and error estimates rather than the mean and standard error in the mean. 
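The contrast between noisy and non-noisy ratios described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the means, standard deviations, sample size, and the robust outlier criterion (five median-absolute-deviation-based sigmas) are illustrative choices, not values from the paper.

```python
import random
import statistics


def ratio_samples(mean, sd, n, seed=0):
    """Draw n ratios x/y of two independent Gaussians with the same mean and sd."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) / rng.gauss(mean, sd) for _ in range(n)]


def outlier_fraction(values, k=5.0):
    """Fraction of values more than k robust sigmas away from the median.

    The robust sigma is 1.4826 * MAD, which equals the standard deviation
    for Gaussian data but stays finite for heavy-tailed data.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    robust_sd = 1.4826 * mad
    return sum(abs(v - med) > k * robust_sd for v in values) / len(values)


# Hypothetical settings: sd is 2% of the mean ("non-noisy") or 50% ("noisy").
quiet = ratio_samples(mean=100.0, sd=2.0, n=20000)
noisy = ratio_samples(mean=100.0, sd=50.0, n=20000)

print("outlier fraction, non-noisy ratio:", outlier_fraction(quiet))
print("outlier fraction, noisy ratio:    ", outlier_fraction(noisy))
```

The median and MAD are used here instead of the mean and standard deviation because, as the text notes, the moments of a Lorentzian distribution are undefined.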
To examine the statistical reliability of measurements from DNA microarrays, we examined microarrays with multiply repeated spots and looked at differences in the measured values. We analyzed data from experiments that measure a large number (1,152) of mRNAs four different times on a single slide. When the ratio measurements are extracted using one common method [the ratio of medians (2)], the distribution of deviations follows a Lorentzian-like distribution rather than a normal (Gaussian) distribution. When we re-analyzed the data by using
0	October  1,  2002
0	confidence  level  of  the  estimate.  This  is  reported  in  Table  1  and  shown  graphically  in  Fig.  3.  Results
0	The  Efficiency  of  Hybridization  on  DNA  Spots  Varies  Over  a  Wide  Range.  This  has  been  known  since  the  first  paper  on  spotted  DNA
0	microarrays  (4,  5);  we  reproduce  it  here  to  show  the  magnitude  of  the  variation.  The  wide  variation  requires  the  use  of  an  internal  control  on  each  DNA  spot.  The  control  and  sample  are  labeled  with  different  fluorophores  and  the  ratio  of  intensities  between  the  sample  and  control  is  reported.  As  is  shown  in  Fig.  1,  the  ratio  between  the  two  measurements  is  considerably  more  consistent  than  the  absolute  intensity  of  either  one.
0	repeated  four  times  show  that  the  measured  values  follow  a  Lorentzian-like  distribution.  Measurements  extracted  using  the  ratio  of  means  algorithm  give  similar  results.  This  indicates  that  approximately  one  in  five  of  the  genes  that  appear  to  have  significant  changes  in  expression  level  do  not;  they  are  statistical  outliers  that  are  an  artifact  of  the  data  analysis  method.
0	Measurements Extracted from Images by Using the Median of Pixel-by-Pixel Ratios Follow a Gaussian-Like Distribution. By examining a
0	population  of  pixel-by-pixel  ratio  measurements  at  each  spot  and  selecting  the  median  of  the  population,  the  distribution  of  deviations  follows  a  Gaussian  distribution,  with  a  significantly  smaller  width  (see  Fig.  4).
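The two spot-level measures compared in this paper, as defined in the Methods, can be sketched as follows. The function names, the toy pixel intensities, and the rule of skipping pixels whose background-subtracted Cy3 signal is not positive are assumptions made for illustration; the paper states only that spots with background above foreground were excluded.

```python
import statistics


def ratio_of_medians(cy5, cy3, bg5, bg3):
    """Ratio of the median channel intensities, each with median background subtracted."""
    return (statistics.median(cy5) - bg5) / (statistics.median(cy3) - bg3)


def median_of_ratios(cy5, cy3, bg5, bg3):
    """Median of the pixel-by-pixel ratios of background-subtracted intensities."""
    ratios = [
        (r - bg5) / (g - bg3)
        for r, g in zip(cy5, cy3)
        if g - bg3 > 0  # assumption: skip pixels with no usable Cy3 signal
    ]
    return statistics.median(ratios)


# Toy spot with four pixels; the last Cy5 pixel is an outlier (e.g. a dust speck).
cy5_pixels = [120, 150, 180, 400]
cy3_pixels = [70, 80, 95, 100]
bg5 = bg3 = 20

print(ratio_of_medians(cy5_pixels, cy3_pixels, bg5, bg3))
print(median_of_ratios(cy5_pixels, cy3_pixels, bg5, bg3))
```

In practice the log2 of either quantity would be taken afterwards, as described in the Methods.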
0	The  Error  on  an  Individual  Spot  Can  Be  Estimated  by  Using  the  Bootstrap  Algorithm  on  the  Ratios  of  Individual  Pixels  Within  a  Spot.
0	Confidence  levels  (90%)  in  the  median  for  each  spot  were  estimated  using  the  bootstrap  algorithm.  These  errors  agreed  well  with  the  observed  spread  in  measurements  across  different  s
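The bootstrap procedure described in the Methods (resample the pixel ratios with replacement, record the median, repeat as many times as there are pixels) can be sketched as below. The percentile rule used to read a 90% interval off the sorted list of medians is an assumption, since the text does not say how the interval was computed, and the synthetic pixel ratios are purely illustrative.

```python
import math
import random
import statistics


def bootstrap_median(ratios, n_resamples=None, level=0.90, seed=0):
    """Return (mean of bootstrap medians, lower bound, upper bound)."""
    rng = random.Random(seed)
    n = len(ratios)
    if n_resamples is None:
        n_resamples = n  # the text repeats the procedure once per pixel
    medians = sorted(
        statistics.median(rng.choices(ratios, k=n))  # sample with replacement
        for _ in range(n_resamples)
    )
    # Assumed percentile interval: cut (1 - level) / 2 from each tail.
    tail = (1.0 - level) / 2.0
    lower = medians[math.floor(tail * (n_resamples - 1))]
    upper = medians[math.ceil((1.0 - tail) * (n_resamples - 1))]
    return statistics.mean(medians), lower, upper


# Synthetic spot: 200 pixel ratios scattered around 2.0 (hypothetical values).
demo_rng = random.Random(1)
pixel_ratios = [demo_rng.gauss(2.0, 0.3) for _ in range(200)]
estimate, low, high = bootstrap_median(pixel_ratios)
print(estimate, low, high)
```

Using more resamples than pixels would tighten the percentile estimate; the default here simply follows the count stated in the text.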
0	PLoS  BIOLOGY
0	Similarities  and  Differences  in  Genome-Wide  Expression  Data  of  Six  Organisms
1	Sven  Bergmann,  Jan  Ihmels,  Naama  Barkai*
0	Departments  of  Molecular  Genetics  and  Physics  of  Complex  Systems,  Weizmann  Institute  of  Science,  Rehovot,  Israel
0	Comparing  genomic  properties  of  different  organisms  is  of  fundamental  importance  in  the  study  of  biological  and  evolutionary  principles.  Although  differences  among  organisms  are  often  attributed  to  differential  gene  expression,  genome-wide  comparative  analysis  thus  far  has  been  based  primarily  on  genomic  sequence  information.  We  present  a  comparative  study  of  large  datasets  of  expression  profiles  from  six  evolutionarily  distant  organisms:  S.  cerevisiae,  C.  elegans,  E.  coli,  A.  thaliana,  D.  melanogaster,  and  H.  sapiens.  We  use  genomic  sequence  information  to  connect  these  data  and  compare  global  and  modular  properties  of  the  transcription  programs.  Linking  genes  whose  expression  profiles  are  similar,  we  find  that  for  all  organisms  the  connectivity  distribution  follows  a  power-law,  highly  connected  genes  tend  to  be  essential  and  conserved,  and  the  expression  program  is  highly  modular.  We  reveal  the  modular  structure  by  decomposing  each  set  of  expression  data  into  coexpressed  modules.  Functionally  related  sets  of  genes  are  frequently  coexpressed  in  multiple  organisms.  Yet  their  relative  importance  to  the  transcription  program  and  their  regulatory  relationships  vary  among  organisms.  Our  results  demonstrate  the  potential  of  combining  sequence  and  expression  data  for  improving  functional  gene  annotation  and  expanding  our  understanding  of  how  gene  expression  and  diversity  evolved.
0	indirect and noisy information about the regulatory relationships between genes. Second, while the genomic sequence is essentially complete, expression profiles only cover a subset of all possible cellular conditions and thus provide only partial information about the underlying regulatory program. Moreover, this subset is typically very different for each organism, reflecting distinct physiologies as well as different research foci. One way to circumvent this problem is to restrict the data to a small subset of similar conditions, such as timepoints along the cell cycle (Alter et al. 2003). Such an approach, however, drastically reduces the size of the dataset and limits the scope of comparison. Here, we present a comparative analysis of large sets of expression data from six evolutionarily distant organisms (Table 1). We integrate the expression data with genomic sequence information to address three biological issues. First, we verify that coexpression is often conserved among organisms and propose a method for improving functional gene annotations using this conservation. We provide a Web-based application suitable for this purpose. Second, we compare the regulatory relationships between particular functional groups in the different organisms, giving initial insights into the extent of conservation of the gene regulatory
0	Comparative  Analysis  of  Expression  Data
0	architecture.  Interestingly,  we  find  that  while  functionally  related  genes  are  frequently  coexpressed  in  several  organisms,  their  organization  and  relative  contribution  to  the  overall  expression  program  differ.  Finally,  we  compare  global  topological  properties  of  the  transcription  networks  derived  from  the  expression  data,  using  a  graph  theoretical  approach.  This  analysis  reveals  that  despite  the  differences  in  the  regulation  of  individual  gene  groups,  the  expression  data  of  all  organisms  share  large-scale  properties.
0	Results  and  Discussion  Combining  Sequence  and  Expression  Data  for  Improving  Functional  Gene  Annotations
0	With  the  rapid  increase  in  the  number  of  sequenced  genomes,  assigning  function  to  novel  ORFs  has  become  a  major  computational  challenge.  Functional  links  are  often  imputed  based  on  sequence  similarity  with  genes  of  known  functions.  Despite  the  large  success  of  this  approach,  it  has  several  well-recognized  limitations.  Foremost,  an  ORF  can  have  several  close  homologues,  some  of  which  may  be  related  to  different  functions.  Furthermore,  the  sequence  of  an  ORF  may  have  diverged  beyond  recognition  although  the  gene  maintained  its  function.  Gene  expression  analysis  can  provide  functional  links  for  new  ORFs  based  on  their  coexpression  with  known  genes.  However,  in  this  case,  only  links  between  genes  of  the  same  organism  can  be  established.  Moreover,  owing  to  biological  interference  and  the  noise  in  the  expression  data,  the  inferred  coexpression  could  be  accidental  and  may  not  necessarily  reflect  similar  function.  Combining  expression  and  sequence  data  may  help  to  overcome  the  abovementioned  limitations.  Specifically,  homologous  genes  whose  function  has  been  preserved  are  expected  to  be  coregulated  with  genes  related  to  that  function.  Conserved  coexpression  could  thus  distinguish  them  from  homologues  whose  function  diverged.  This  can  be  done,  for  example,  by  focusing  on  a  group  of  functionally  related  genes  in  a  characterized  genome,  identifying  simultaneously  all  the  respective  homologues  in  a  second  genome,
0	and then examining which of the homologues are indeed coexpressed (Figure 1A). Importantly, restricting the search for coexpressed genes to a limited set of candidates provides an effective means to overcome the noise in the expression data (Ihmels et al. 2002). Conserved coregulation of functionally related genes. To explore systematically the utility of this approach, we first examined to what extent coexpression is conserved among different organisms. We performed a statistical analysis comparing the pairwise correlations between genes in one organism to the correlations between their respective homologues. Indeed, a significant fraction of such correlations were similar (see Figure S8). The strongest conservation of coexpression was found between pairs of genes associated with particular cellular processes, such as core metabolic functions or central complexes (e.g., ribosome and proteasome) (lists of gene pairs with conserved coexpression are available at http://barkai-serv.weizmann.ac.il/ComparativeAnalysis). Next, we examined whether coexpression is conserved among groups of genes that are associated with the same cellular function. To this end, we used as a benchmark coexpressed groups of genes (termed transcription modules; see Materials and Methods for a precise definition) that we extracted from the Saccharomy
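The pairwise comparison described above (correlations between genes in one organism versus correlations between their homologues in another) can be sketched as follows. This is not the authors' statistical analysis: the Pearson correlation, the fixed threshold of 0.8, the function names, and the toy gene identifiers and profiles are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length expression profiles."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def conserved_pairs(expr_a, expr_b, homolog, threshold=0.8):
    """Gene pairs coexpressed in organism A whose homologues are coexpressed in B."""
    pairs = []
    for g1, g2 in combinations(sorted(homolog), 2):
        r_a = pearson(expr_a[g1], expr_a[g2])
        r_b = pearson(expr_b[homolog[g1]], expr_b[homolog[g2]])
        if r_a >= threshold and r_b >= threshold:
            pairs.append((g1, g2))
    return pairs


# Toy data: the two organisms are profiled under different sets of conditions.
expr_a = {"a1": [1, 2, 3, 4], "a2": [2, 4, 6, 8.5], "a3": [4, 1, 3, 2]}
expr_b = {"b1": [5, 6, 8, 9, 11], "b2": [1, 2, 3, 4, 6], "b3": [1, 2, 3, 4, 5]}
homolog = {"a1": "b1", "a2": "b2", "a3": "b3"}  # hypothetical homology map

print(conserved_pairs(expr_a, expr_b, homolog))
```

Because correlations are computed within each organism separately, the two datasets do not need to share conditions, which matches the situation the text describes.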
0	Microarray  data  quality  analysis:  lessons  from  the  AFGC  project
1	David Finkelstein1, Rob Ewing1, Jeremy Gollub1, Fredrik Sterky1, J. Michael Cherry2 and Shauna Somerville1
0	Carnegie
0	Key  words:  Arabidopsis,  annotation,  microarray  functional  genomics,  normalization
0	Abstract Genome-wide expression profiling with DNA microarrays has provided, and will continue to provide, a great deal of data to the plant scientific community. However, reliability concerns have required the development of data quality tests for common systematic biases. Fortunately, most large-scale systematic biases are detectable and some are correctable by normalization. Technical replication experiments and statistical surveys indicate that these biases vary widely in severity and appearance. As a result, no single normalization or correction method currently available is able to address all the issues. However, careful sequence selection, array design, experimental design and experimental annotation can substantially improve the quality and biological utility of microarray data. In this review, we discuss these issues with reference to examples from the Arabidopsis Functional Genomics Consortium (AFGC) microarray project.
0	Introduction Genome-wide gene expression profiling can be performed with many methods. To date these methods include sequence tags (e.g. serial analysis of gene expression (SAGE); Velculescu et al., 1995), expressed sequence tags (ESTs) (Adams et al., 1993), and various hybridization-based methods such as photolithographic oligonucleotide arrays (Lockhart et al., 1996), inkjet microarrays (Medlin, 2001), nylon membrane macroarrays (Desprez et al., 1998), and DNA microarrays (Schena et al., 1995). These methods differ in scale, economy and sensitivity. In terms of the hybridization methods, the two-color DNA spotted array is the most accessible, economical and flexible method currently available to plant biologists. The basic principles of the technique are reviewed in Figure 1. Microarray techniques are steadily improving. As our understanding improves, array design, noise reduction, and data interpretation also improve. Our choice of clones, printing, labeling and hybridization methods influences the quality and utility of the data.
0	Fortunately,  we  can  benefit  from  the  efforts  of  our  predecessors.  For  example,  the  once  tedious  process  of  extracting  data  from  scanner-generated  images  of  dye-labeled  DNA  spots  is  now  efficiently  handled  by  commercial  software.  Also  of  use  are  data  quality  statistical  tests  that  measure  systematic  biases.  Also,  many  useful  normalization  strategies  have  been  developed.  However,  no  single  normalization  method  has  become  the  standard.  This  review  provides  an  overview  of  spotted  DNA  microarray  technology  and  data  generation,  with  specific  reference  to  plants  and  the  Arabidopsis  Functional  Genomics  Consortium  (AFGC)  microarray  project  (Wisman  and  Ohlrogge,  2000).  Second,  we  examine  bias  in  microarray  expression  data  in  detail  and  describe  methods  for  detection,  quantification  and  removal  of  biases.  Whenever  possible,  we  use  real  examples,  drawn  from  our  experience  at  AFGC.
0	Data generation Probe selection and design Microarray designs depend on the aims of the researcher. One important decision is whether to design a genome-wide array or a smaller specialty array. A genome-wide microarray should maximize the number of genes while minimizing redundancy. The choice of the physical DNA spotted on the array (hereafter referred to as the probe) influences cost, handling, data interpretation and normalization. Experiments on genome-wide arrays may provide the preliminary attribution of a function to unknown ESTs. However, large arrays require tracking, handling, and maintaining quality control for large numbers of clones. Alternatively, if genes of interest are already known, a custom microarray can be generated. Custom or
0	specialty arrays cost less, but they also have the disadvantage of limited scope. In both types of array, probe selection and design are important. Genome-wide arrays are designed for general purposes from model organisms like Arabidopsis or rice. For model organisms, there is a wide selection of genes available, either as clones or as annotated genomic sequences. For other plant species, specialty arrays must be designed from candidate clones selected from any cDNA library. Subtraction libraries are used to enrich for genes regulated by the process of interest. Subtraction library clones are also useful as probes for northern blots to verify microarray results (Xu et al., 2000). Whenever possible, genes should be included beyond the set of genes of interest. If a specialty array is too focused, new phenomena may go undetected. For example, low phosphate levels have a role in triggering cold acclimation (Hurry et al., 2000). Without prior knowledge, an array that specialized in nutrient stress may exclude cold acclimation genes. Also, if insufficient precautions were taken in the preparation of the biological sample, the array results may be influenced
0	by unintended stresses. For example, subtle abiotic stresses, such as touch, induce the expression of calmodulin (Braam and Davis, 1990) and other related genes (Ichimura et al., 2000) in Arabidopsis. This accidental effect may go undetected on a specialized array, and may lead to misinterpretation. Specialty array designers may address this concern by including genes known to monitor pathways outside of the experimental focus of the array. These monitor genes are selected for their sensitive response to a given stress. The ability to select specific and sensitive monitor genes is one of the expected outcomes of the complete analysis of the AFGC arrays. Currently, stress-specific monitor genes are not yet well identified. Ideally, each specialty array designer should include genes that monitor metabolic processes that influence their process of interest. Gene selection alone does not solve some problems such as alternative splicing and cross-hybridization. Cross-hybridization to family members or alternative splicing products may mask changes in transcript levels (Girke et al., 2000). The selection of non-redundant cDNA clones does not eliminate cross-hybridization. Each spotted cDNA may still detect a group of closely related sequences rather than a single transcript. Only sequence-specific probes can distinguish between the expression patterns of similar sequences. To accurately track alternative splicing, exon-specific sequences are used. Either oligonucleotide probes or fragments amplified directly from genomic DNA (Penn et al., 2000) can achieve exon specificity. Specific exons, 3′-UTR sequences or even polymorphisms can be printed (Okamoto et al., 2000). 
Since 3′-UTR sequences are more likely to be gene-specific than coding sequences, probes that are primarily composed of 3′-UTR sequences are used. Probes can be designed to represent each exon, intron and exon-intron junction of a given transcript so that alternative splicing events can be identified. While current commercial Arabidopsis photolithographic oligoarrays do not address alternative splicing, they do effectively address cross-hybridization. Alternatives to commercial arrays do exist. Sharing resources lightens the burden of oligo design. For example, the AFGC has designed genomic PCR primers that are then synthesized and tested by other groups within the Arabidopsis community. For plant species where sequence information is less complete, gene-specific oligonucleotide a
0	letters  to  nature
0	Genomic  binding  sites  of  the  yeast  cell-cycle  transcription  factors  SBF  and  MBF
1	Vishwanath R. Iyer, Christine E. Horak, Charles S. Scafe, David Botstein, Michael Snyder & Patrick O. Brown
0	Summing  up  the  noise  in  gene  networks
1	Johan  Paulsson
0	Random  fluctuations  in  genetic  networks  are  inevitable  as  chemical  reactions  are  probabilistic  and  many  genes,  RNAs  and  proteins  are  present  in  low  numbers  per  cell.  Such  `noise'  affects  all  life  processes  and  has  recently  been  measured  using  green  fluorescent  protein  (GFP).  Two  studies  show  that  negative  feedback  suppresses  noise,  and  three  others  identify  the  sources  of  noise  in  gene  expression.  Here  I  critically  analyse  these  studies  and  present  a  simple  equation  that  unifies  and  extends  both  the  mathematical  and  biological  perspectives.
0	Intracellular randomness has long been predicted from basic physical principles1 and observations of phenotypic heterogeneity2,3. In the last few years it has also been visualized directly using fluorescent probes. The first quantitative studies4-8 collectively examined the noise associated with the principal steps of the central dogma of molecular biology; that is, replication, gene activation, transcription, translation and the enslaving intracellular environment. They also suggested how autorepression of replication and transcription suppresses noise, and how eukaryotes differ from prokaryotes. This analysis connects the different studies to a simple variant of the fluctuation-dissipation theorem and uses the experimental controls to extend or reinterpret many of the conclusions.
0	The  fluctuation-dissipation  theorem
0	where $\sigma_1^2/\langle n_1\rangle^2 \approx (\langle n_1\rangle H_{11})^{-1}$. Intrinsic noise depends on the average number of molecules and how systematic adjustments (rate $H_{22}/\tau_2$) quench spontaneous fluctuations (rate $1/\tau_2$). The normalized adjustment rate $H_{22}$ can also be interpreted as the statistical bias to return to the average rather than deviate further: a 1% increase in $n_2$ gives an $H_{22}$ per cent increase in $R_2^-/R_2^+$. Extrinsic noise instead depends on the magnitude of $n_1$ fluctuations and how strongly $n_1$ affects $n_2$. The normalized susceptibility factor $H_{21}/H_{22}$ reflects that a 1% increase in $n_1$ gives an $H_{21}$ per cent increase in $R_2^-/R_2^+$, which makes $n_2$ adjust towards an $H_{21}/H_{22}$ per cent lower average quasi-steady-state. When $n_1$ changes rapidly (high $H_{11}/\tau_1$) or $n_2$ adjusts slowly (low $H_{22}/\tau_2$), $n_2$ does not have time to reach its quasi-steady-state before $n_1$ changes anew. Consecutive ups and downs in $n_1$ then cancel out and $n_2$ time-averages over the recent history of $n_1$ fluctuations. The effect of cell growth and division is qualitatively accounted for by adding first-order elimination terms to $R_1^-$ or $R_2^-$. The method behind equation (1) can be extended to any chemical system, providing a basis for a stochastic Biochemical Systems Theory (J.P., manuscript in preparation).
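The roles of the quantities in this paragraph can be made concrete with a short numerical sketch. Equation (1) itself is not reproduced in this excerpt, so the code assumes it has the two-term form the text describes: an intrinsic term $1/(\langle n_2\rangle H_{22})$ plus an extrinsic term equal to the upstream noise $1/(\langle n_1\rangle H_{11})$ times the squared susceptibility $(H_{21}/H_{22})^2$ times a time-averaging factor $(H_{22}/\tau_2)/(H_{11}/\tau_1 + H_{22}/\tau_2)$. The function name and all parameter values are hypothetical.

```python
def noise_terms(n1, n2, H11, H21, H22, tau1, tau2):
    """Return (intrinsic, extrinsic) contributions to the normalized variance of n2.

    Assumed two-term form of equation (1); see the lead-in for the assumptions.
    """
    intrinsic = 1.0 / (n2 * H22)                 # small numbers, weak adjustment -> noisy
    upstream = 1.0 / (n1 * H11)                  # normalized variance of n1
    susceptibility = (H21 / H22) ** 2            # how strongly n1 shifts the n2 steady state
    time_averaging = (H22 / tau2) / (H11 / tau1 + H22 / tau2)
    extrinsic = upstream * susceptibility * time_averaging
    return intrinsic, extrinsic


# Equal time scales: n2 tracks half of the n1 fluctuations.
fast = noise_terms(n1=10, n2=100, H11=1, H21=1, H22=1, tau1=1, tau2=1)

# n2 adjusts nine times more slowly, so it time-averages over n1 fluctuations.
slow = noise_terms(n1=10, n2=100, H11=1, H21=1, H22=1, tau1=1, tau2=9)

print("fast adjustment:", fast)
print("slow adjustment:", slow)
```

With these numbers the intrinsic term is unchanged by the slower adjustment while the extrinsic term falls, which is the time-averaging effect the paragraph describes.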
0	Noise  in  the  central  dogma
0	Nature  Publishing  Group
0	To suppress the noise in any of these systems, cells commonly use autorepression that increases and decreases synthesis at low and high concentrations respectively. This has been studied extensively using macroscopic models17,32, and the stochastic principles are closely related. Autorepression can raise the effective $H_{22}$, and thus: (1) increase the adjustment rate $H_{22}/\tau_2$ relative to the rate $1/\tau_2$ of spontaneous randomization and thereby suppress intrinsic noise around a given average number of molecules; (2) increase the adjustment rate $H_{22}/\tau_2$ relative to the rate $H_{11}/\tau_1$ of environmental changes and thereby amplify extrinsic noise by preventing time-averaging; (3) decrease the susceptibility $H_{21}/H_{22}$ of the quasi-steady-state, which typically overcompensates for the impaired time-averaging and produces a net decrease in extrinsic noise. This may explain the popularity of autorepression in transcription networks32-34 and its ubiquity in replication control of chromosomes and plasmids where it has been similarly described5,19. Equation (1) thus unifies models of autoreplication, constitutive transcription and translation, and the stabilizing effect of autorepression, all in disordered environments. This makes it ideally suited also to unify the GFP studies that examine these aspects experimentally.
0	Terminology  and  measures
0	rapid  adjustments  reduce  intrinsic  noise  around  a  given  average,  the  effect  is  the  opposite  for  extrinsic  noise  (compare  points  (1)  and  (2)  above).  This  reveals  an  interesting  discrepancy  between  the  studies:  if  the  noise  came  from  gene  expression,  GFP  would  not  measure  plasmid  copy  numbers,  and  if  it  came  from  fluctuations  in  plasmid  copy  numbers,  a  protein  that  adjusted  more  rapidly  to  its  meandering  steady  state  would  inherit  more  noise,  not  less.  Thanks  to  the  many  experimental  controls,  this  issue  can  be  at  least  partially  settled  using  equation  (1).
0	Comparing different systems requires consistency in definitions and measures. The terms `intrinsic' and `extrinsic' generically distinguish between the origin and propagation of noise, and their biological meaning is always defined in conjunction with a specified component or process. For example, if a gene for a transcriptional repressor spontaneously switches on and off, thereby enslaving the encoded protein and transmitting the fluctuations to repressed genes, the noise is intrinsic to the number of active repressor genes and extrinsic to all affected components. If the corresponding noise in a repressed protein was instead assigned to its transcription, just because transcription transmits the noise from the repressor gene, then by the same logic it must also be assigned to translation or to any other step in the cascade. This relates directly to the mathematical measures. For intrinsic noise it is convenient to use $\sigma_2^2/\langle n_2\rangle$ for a size-independent comparison, but this artificially forces extrinsic noise to increase with $\langle n_2\rangle$, as in $\sigma_2^2/\langle n_2\rangle \approx H_{22}^{-1} + E\langle n_2\rangle$, where $E$ is the second term in equation (1). Unless the measure matches the noise, scale artefacts may thus completely distort interpretations of dynamics. For instance, if all protein ($X_2$) noise came from fluctuations in protease levels, $\sigma_2^2/\langle n_2\rangle^2$ would typically be independent of transcription and translation rates. But because both of these processes affect $\langle n_2\rangle$, measuring noise strength by $\sigma^2$
0	Evidence  for  lateral  gene  transfer  between  Archaea  and  Bacteria  from  genome  sequence  of  Thermotoga  maritima
1	Karen  E.  Nelson,  Rebecca  A.  Clayton,  Steven  R.  Gill,  Michelle  L.  Gwinn,  Robert  J.  Dodson,  Daniel  H.  Haft,  Erin  K.  Hickey,  Jeremy  D.  Peterson,  William  C.  Nelson,  Karen  A.  Ketchum,  Lisa  McDonald,  Teresa  R.  Utterback,  Joel  A.  Malek,  Katja  D.  Linher,  Mina  M.  Garrett,  Ashley  M.  Stewart,  Matthew  D.  Cotton,  Matthew  S.  Pratt,  Cheryl  A.  Phillips,  Delwood  Richardson,  John  Heidelberg,  Granger  G.  Sutton,  Robert  D.  Fleischmann,  Jonathan  A.  Eisen,  Owen  White,  Steven  L.  Salzberg,  Hamilton  O.  Smith,  J.  Craig  Venter  &  Claire  M.  Fraser
0	The  Institute  for  Genomic  Research,  9712  Medical  Center  Drive,  Rockville,  Maryland  20850,  USA
0	Thermotoga maritima, a non-spore-forming, rod-shaped bacterium belonging to the order Thermotogales, was originally isolated from geothermal heated marine sediment at Vulcano, Italy1, and has an optimum growth temperature of 80 °C. T. maritima metabolizes many simple and complex carbohydrates including glucose, sucrose, starch, cellulose and xylan1,2. Both cellulose and xylan, through conversion to fuels (such as H2), have great potential as renewable carbon and energy sources. T. maritima is also of evolutionary significance, because small-subunit ribosomal RNA (SSU rRNA) phylogeny has placed this bacterium as one of the deepest and most slowly evolving lineages in the Eubacteria3. To elucidate further its unique metabolic properties and evolutionary relationship to other microbial species, we sequenced the genome of the type strain T. maritima MSB8 using the whole-genome random-sequencing method previously described4,5.
0	General  features  of  the  genome
0	Fourth  circle,  Archaea-like  islands  on  the  genome.  Fifth  circle,  small  repeats.  Sixth  circle,  large  repeats  (black),  large  repeats  associated  with  small  repeats  (red).  Seventh  and  eighth  circles,  rRNAs  and  tRNAs,  respectively.
0	Macmillan  Magazines  Ltd
0	Table  1  General  features  of  the  T.  maritima  MSB8  genome
0	General  features
0	Length of sequence; G + C ratio; Total no. of sequences; Average read length (bp); Open reading frames; Protein coding regions; Ribosomals; tRNAs
0	Chromosomal coding sequences
0	No. similar to known proteins; No. of conserved hypotheticals; No. similar to proteins of unknown function; No. without a database match; Total
0	Repeats
0	Class | Database match
0	SR-01 | tttccatacctctaaggaattattgaaaca
0	LR-01 | hypothetical protein
0	LR-02 | -glucosidase
0	LR-03 | putative transposase
0	LR-04 | methyl-accepting chemotaxis protein
0	LR-05 | putative transposase
0	LR-06 | helicase
0	LR-07 | excinuclease
0	LR-08 | putative transposase
0	Solute  uptake  and  metabolism
0	DESIGN  ISSUES  FOR  cDNA  MICROARRAY  EXPERIMENTS
1	Yee  Hwa  Yang*  and  Terry  Speed*§
0	Microarray  experiments  are  used  to  quantify  and  compare  gene  expression  on  a  large  scale.  As  with  all  large-scale  experiments,  they  can  be  costly  in  terms  of  equipment,  consumables  and  time.  Therefore,  careful  design  is  particularly  important  if  the  resulting  experiment  is  to  be  maximally  informative,  given  the  effort  and  the  resources.  What  then  are  the  issues  that  need  to  be  addressed  when  planning  microarray  experiments?  Which  features  of  an  experiment  have  the  most  impact  on  the  accuracy  and  precision  of  the  resulting  measurements?  How  do  we  balance  the  different  components  of  experimental  design  to  reach  a  decision?  For  example,  should  we  replicate,  and  if  so,  how?
0	NATURE  REVIEWS  |  GENETICS
0	a cDNA microarray experiment is a competitive hybridization between a sample that is labelled with the red-fluorescent dye Cyanine 5 (Cy5) and a sample that is labelled with the green-fluorescent dye Cyanine 3 (Cy3). Unlike gene-expression data from nylon membranes (filter) or GeneChip (Affymetrix), cDNA microarray data are inherently comparative. This is because the filter or Affymetrix data measure gene-expression levels for each sample separately, whereas, in the case of cDNA experiments, the pairing of target samples for hybridization leads to relative expression values and constrains the types of design that can be considered. So, each cDNA microarray experiment gives us the relative abundance of two sets of mRNA. The principles of design for comparative experiments of this kind are not new. They first arose in agricultural research many years ago with Ronald A. Fisher8, who studied yields from different plant varieties (see also REFS 9,10). The varieties to be compared were grown on the same land, as the variation between plots was substantial. This planting arrangement is conceptually equivalent to using COMPETITIVE HYBRIDIZATION to compensate for the variation between glass-slide microarrays. This similarity means that the design and analysis of comparative experiments can be accommodated in a classical statistical framework, an important point to which we return in later sections. 
In  cDNA  microarray  experiments,  we  see  more  variation  between  slides  than  within  slides  (for  further  discussion,  see  the  section  on  Variability  and  replication),  and  so  the  most  important  design  issue  is  to  determine  which  mRNAs  are  to  be  labelled  with  which  fluor,  and  which  are  to  be  hybridized  together  on  the  same  slide.  In  addition,  there  can  be  constraints  on  the  number  of  slides,  the  amount  of  RNA  available,  or  other  cost  considerations,  all  of  which  will  affect  the  experimental  design.
0	Graphical  representation  of  designs
0	COMPETITIVE  HYBRIDIZATION
0	A  mixture  of  differently  labelled  target  cDNA  fragments  that  are  hybridized  together  in  the  presence  of  a  common  probe  or  collection  of  probes.
0	LOG  RATIO
0	The logarithm, usually to the base 2, of the ratio of the measured signal intensities in the two channels of a two-colour microarray experiment. If we denote these two signals by R (red channel) and G (green channel), then their log ratio is $\log_2(R/G)$.
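As a small illustration of the log-ratio definition, the following Python sketch computes the base-2 log ratio for one spot. The function name `log_ratio` and the example intensities are hypothetical and are not taken from the review; the sketch assumes both intensities are already background-corrected and strictly positive.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Return log2(R/G) for one spot of a two-colour microarray.

    Assumes both intensities are background-corrected and strictly
    positive; non-positive values are rejected rather than guessed at.
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be strictly positive")
    return math.log2(red / green)


# Hypothetical intensities: a four-fold excess in the red channel gives +2,
# and the same excess in the green channel gives -2.
up = log_ratio(2000.0, 500.0)
down = log_ratio(500.0, 2000.0)
```

On this scale a gene that is over-represented by a given factor in one channel and a gene over-represented by the same factor in the other channel lie at equal distances from zero, which is one reason the base-2 logarithm is convenient.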
0	This  review  describes  the  experimental  design  and  related  issues  that  are  important  for  carrying  out  cDNA  microarray  experiments.  In  addition,  we  hope  to  facilitate  discussion  and  understanding  between  biologists  who  do  the  experiments  and  statisticians  or  others  who  do  the  analyses.  We  first  describe  the  objectives  of  experimental  design  in  the  context  of  microarray  experiments.  The  next  section  introduces  the  reader  to  a  display  that  summarizes  the  hybridizations  that  are  carried  out  in  an  experiment.  Furthermore,  we  discuss  how  scientific  aims  affect  the  choice  of  design,  and  how  practical  issues  constrain  our  design  options.  Finally,  we  use  three  case  studies  to  illustrate  the  ways  in  which  scientific  and  physical  constraints  can  be  used  to  choose  a  design.
0	Why  experimental  design?
0	The  objective  of  experimental  design  is  to  make  the  analysis  of  the  data  and  the  interpretation  of  the  results  as  simple  and  as  powerful  as  possible,  given  the  purpose  of  the  experiment  and  the  constraints  of  the  experimental  material.  As  described  in  BOX  1,  the  underlying  idea  of
0	technology  review
0	Navigating  gene  expression  using  microarrays  --  a  technology  review
1	Almut  Schulze  and  Julian  Downward
0	Parallel  quantification  of  large  numbers  of  messenger  RNA  transcripts  using  microarray  technology  promises  to  provide  detailed  insight  into  cellular  processes  involved  in  the  regulation  of  gene  expression.  This  should  allow  new  understanding  of  signalling  networks  that  operate  in  the  cell  and  of  the  molecular  basis  and  classification  of  disease.  But  can  the  technology  deliver  such  far-reaching  promises?
0	Upstream  considerations:  microarray  technology
0	[Figure: schematic comparing cDNA microarrays and high-density oligonucleotide microarrays. Array preparation (cDNA collection, insert amplification by PCR, printing, coupling, denaturing; in situ synthesis by photolithography from the mRNA reference sequence, with perfect-match and mismatch probe sets), target preparation (total or polyA+ RNA from cells/tissue, cDNA synthesis, in vitro transcription, Cy3- or Cy5-labelled cDNA, biotin-labelled cRNA), hybridization, and readout as ratio Cy5/Cy3 or ratio array 1/array 2.]
0	intensities and ratios of mRNA abundance for the genes represented on the array. b, High-density oligonucleotide microarrays. Array preparation: sequences of 16-20 short oligonucleotides (typically 25-mers) are chosen from the mRNA reference sequence of each gene, often representing the most unique part of the transcript in the 5′-untranslated region. Light-directed, in situ oligonucleotide synthesis is used to generate high-density probe arrays containing over 300,000 individual elements. Target preparation: polyA+ RNA from different tissues or cell populations is used to generate double-stranded cDNA carrying a transcriptional start site for T7 RNA polymerase. During in vitro transcription, biotin-labelled nucleotides are incorporated into the synthesized cRNA molecules. Each target sample is hybridized to a separate probe array and target binding is detected by staining with a fluorescent dye coupled to streptavidin. Signal intensities of probe array element sets on different arrays are used to calculate relative mRNA abundance for the genes represented on the array.
0	Nature  Publishing  Group  http://biotech.nature.com
0	RESEARCH  ARTICLE
0	Expression  profiling  using  microarrays  fabricated  by  an  ink-jet  oligonucleotide  synthesizer
1	Timothy  R.  Hughes1,  Mao  Mao1,  Allan  R.  Jones1,  Julja  Burchard1,  Matthew  J.  Marton1,  Karen  W.  Shannon2,  Steven  M.  Lefkowitz2,  Michael  Ziman1,  Janell  M.  Schelter1,  Michael  R.  Meyer1,  Sumire  Kobayashi1,  Colleen  Davis1,  Hongyue  Dai1,  Yudong  D.  He1,  Sergey  B.  Stephaniants1,  Guy  Cavet1,  Wynn  L.  Walker1,  Anne  West1,  Ernest  Coffey1,  Daniel  D.  Shoemaker1,  Roland  Stoughton1,  Alan  P.  Blanchard1,  Stephen  H.  Friend1,  and  Peter  S.  Linsley1*
0	We  describe  a  flexible  system  for  gene  expression  profiling  using  arrays  of  tens  of  thousands  of  oligonucleotides  synthesized  in  situ  by  an  ink-jet  printing  method  employing  standard  phosphoramidite  chemistry.  We  have  characterized  the  dependence  of  hybridization  specificity  and  sensitivity  on  parameters  including  oligonucleotide  length,  hybridization  stringency,  sequence  identity,  sample  abundance,  and  sample  preparation  method.  We  find  that  60-mer  oligonucleotides  reliably  detect  transcript  ratios  at  one  copy  per  cell  in  complex  biological  samples,  and  that  ink-jet  arrays  are  compatible  with  several  different  sample  amplification  and  labeling  techniques.  Furthermore,  results  using  only  a  single  carefully  selected  oligonucleotide  per  gene  correlate  closely  with  those  obtained  using  complementary  DNA  (cDNA)  arrays.  Most  of  the  genes  for  which  measurements  differ  are  members  of  gene  families  that  can  only  be  distinguished  by  oligonucleotides.  Because  different  oligonucleotide  sequences  can  be  specified  for  each  array,  we  anticipate  that  ink-jet  oligonucleotide  array  technology  will  be  useful  in  a  wide  variety  of  DNA  microarray  applications.
0	DNA microarrays provide a means to quantify tens of thousands of discrete sequences in a single assay. Among the most widespread uses of microarrays is expression profiling1,2, which has found many applications including discovery of gene functions3,4, drug evaluation4-6, pathway dissection7, and classification of clinical samples8-10. Two major platforms for high-density microarray manufacture are in common use. The first involves 25-mer oligonucleotides made by a photolithographic process similar to manufacture of computer chips11. The second utilizes robotic deposition or "spotting" of DNA molecules1. Spotted arrays are commonly referred to as "cDNA microarrays", although clones, PCR products, or oligonucleotides can all be spotted. Oligonucleotides offer greater specificity than cDNAs or PCR products, having the capacity to distinguish single-nucleotide polymorphisms12 and discern splice variants. In both systems, creation of a new array design is relatively inconvenient and/or expensive, requiring either a new set of masks (for photolithography) or new samples to deposit (for spotted cDNA arrays). Flexibility to create new arrays is becoming increasingly important as more genomes are sequenced and more applications for microarrays are described. One possible solution to this need is a recently described system for maskless light-directed oligonucleotide synthesis using a micromirror array13. Here, we describe a flexible platform for microarray expression profiling, centered around an in situ oligonucleotide synthesis method in which the ink-jet printing process is modified to accommodate delivery of phosphoramidites to directed locations on a glass surface14. 
Using  the  flexibility  offered  by  this  system,  we  have  characterized  the  importance  of  various  experimental  parameters  to  hybridization  specificity  and  sensitivity.  We  show  that  results  obtained  are  in  good  overall  agreement  with  spotted  cDNA  microarrays,  that  oligonucleotides  have  a  superior  ability  to  distinguish  closely  related  sequences,  and  that  a  single  oligonucleotide  is  suitable  for  detection  of  single-copy  genes  in  human  cells.
0	Results  and  discussion
0	[Figure: plots of background-subtracted intensity (arbitrary units), GCN4 average sensitivity, GCN4 average specificity, and specific/non-specific intensity against oligonucleotide length (nt), formamide concentration (%), and tiling start position (nt).]
0	Data  extraction  from  composite  oligonucleotide  microarrays
1	Ilya  Shmulevich*,  Jaakko  Astola1,  David  Cogdell,  Stanley  R.  Hamilton  and  Wei  Zhang
0	ABSTRACT Microarray or DNA chip technology is revolutionizing biology by empowering researchers in the collection of broad-scope gene information. It is well known that microarray-based measurements exhibit a substantial amount of variability due to a number of possible sources, ranging from hybridization conditions to image capture and analysis. In order to make reliable inferences and carry out quantitative analysis with microarray data, it is generally advisable to have more than one measurement of each gene. The availability of both between-array and within-array replicate measurements is essential for this purpose. Although statistical considerations call for increasing the number of replicates of both types, the latter is particularly challenging in practice due to a number of limiting factors, especially for in-house spotting facilities. We propose a novel approach to design so-called composite microarrays, which allow more replicates to be obtained without increasing the number of printed spots. INTRODUCTION Oligonucleotide arrays (1,2), both synthesized and spotted, enjoy several advantages over cDNA-based arrays (3,4), such as simpler methodology to obtain DNA and better quality control, options to select high-specificity sequences to avoid cross-hybridization, and the potential to detect alternative spliced variants of genes (5). It is known that microarray gene expression measurements exhibit both between-slide and within-slide variability (6) and that apart from making efforts to improve the technology, having replicate measurements is essential for improving the reliability of subsequent quantitative analysis. Dealing with between-slide variability involves repeating entire microarray experiments. 
There  exist  some  limitations,  however,  such  as  availability  of  RNA  as  well  as  cost  factors.  To  address  within-slide  variability,  the  typical  approach  entails  printing  replicate  spots  on  the  same  slide.  However,  spotting  robots  typically  have  a  limitation  on  the  number  of  spots  that  can  be  reliably  printed.  Thus,  increasing
0	each  well  were  resuspended  in  1  ml  of  50%  DMSO  array  buffer  (50  mM  for  each  oligo).  Spotting  Oligos  were  spotted  onto  poly-L-lysine  glass  slides  by  a  G3  solid  pin  spotter  (Genomic  Solutions,  Ann  Arbor,  MI,  USA),  baked  at  65°C  for  90  min,  and  crosslinked  with  65  mJ  of  ultraviolet  radiation.  Probe  labeling,  hybridization  and  quantification
0	or  more  oligos  into  the  same  spot.  The  challenge  then  is  to  recover  the  individual  gene  intensities  by  observing  the  intensities  of  the  mixtures.  This  is,  in  fact,  conceptually  simpler  than  the  blind  source  separation  problem  because  we  know  exactly  which  genes  are  present  in  which  spots  and  because  intensities  are  simply  scalars  and  not  time-varying  signals.  In  addition,  the  contributions  from  the  mixed  oligos  are  expected  to  be  mutually  independent,  as  they  are  designed  to  be  non-homologous  to  each  other,  which  is  a  fundamental  assumption  of  all  oligonucleotide  microarrays.  The  obvious  benefit  of  this  approach  is  that  each  gene  is  given  an  opportunity  to  make  several  contributions  in  different  spots,  each  time  with  a  different  partner,  and  therefore,  is  also  a  type  of  replication.  The  question  is  whether  the  original  gene  expressions  can  be  reliably  recovered  from  such  mixtures.
0	The microarray experiments were performed as described previously (13). Briefly, triplicate reverse transcription reactions using 100 mg of total RNA from RKO cells incorporated Cy3 d-CTP into cDNA. After G50 column purification, replicates were combined for uniformity and distributed to three identical microarray slides. Each slide was hybridized overnight at 60°C in a humid incubator, then washed at 37°C with increasing stringency until 0.1× SSC was used. Slides were scanned on a LSIV laser scanner (Genomic Solutions, Ann Arbor, MI, USA) and quantified using ArrayVision software (Imaging Research, Inc, St Catherine's, Ontario, Canada). RESULTS Our experiment consisted of designing a spotted microarray containing 30 genes represented in 50 bp oligos that are expressed at different levels in RKO colon cancer cells based on our prior experiments. Those genes were spotted individually five times each, as well as mixtures of all possible pairs of genes, for a total of (30 × 29)/2 = 435 pairs. Thus, each of the 30 genes appeared 29 times with different partner genes. Finally, each mixture was replicated five times to facilitate statistical analysis. Total RNA was isolated from RKO colon cancer cells and used for microarray experiments. As a first step, we proceeded to discover how the intensities of signals of the mixtures are related to signal intensities of the individual genes. Prior to any experimentation, it was expected that the intensity of the mixture should be at least an increasing function of the individual intensities. In other words, the higher the expression of the two genes, the higher is the signal from their mixture. 
It was further anticipated that the mixture would be a linear combination of the individual gene intensities. That is, if $x_i$ is the individual intensity of gene $i$, $x_j$ is the intensity of gene $j \neq i$, and $y_{k(i,j)}$ is the intensity of the mixture of genes $i$ and $j$, then $y_{k(i,j)} = a(x_i + x_j) + n$, $i, j = 1, \ldots, 30$, for some scalar $a$ and additive error component $n$. Here, $k(i,j)$ is simply an index that counts from 1 to 435, so $k(1,2) = 1$, $k(1,3) = 2$, ..., $k(29,30) = 435$. Note that since genes are simply mixed in equal proportions, there is no notion of `first' or `second' gene and thus, we would not expect different weights $a_i$ and $a_j$ for genes $x_i$ and $x_j$. Also, for the least-squares approach that we use below, no statistical description of the error component $n$ is required. Rewriting the above relationship in vector-matrix notation, we have: $\mathbf{y} = a\mathbf{A}\mathbf{x} + \mathbf{n}$, where $\mathbf{y}$ is a $435 \times 1$ vector of mixtures, $\mathbf{x}$ is a $30 \times 1$ vector of individual gene intensities, and $\mathbf{A}$ is a binary matrix of size $435 \times 30$ in which row $k(i,j)$ contains ones in the $i$th and $j$th positions and zeros elsewhere.
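The linear mixture model above can be sketched numerically. The Python example below builds the 435 × 30 binary pair matrix in the same row order as the index k(i, j) and recovers the individual intensities by ordinary least squares. It is only an illustration under stated assumptions: the function names, the simulated intensities, the value of the scalar a and the noise level are all hypothetical, and the authors' actual estimation procedure may differ in detail.

```python
import itertools

import numpy as np


def pair_design_matrix(n_genes: int) -> np.ndarray:
    """Binary matrix with one row per unordered gene pair (i, j), i < j.

    Rows follow the order (1,2), (1,3), ..., (n-1,n), matching k(i, j).
    Each row holds ones in the columns of the two mixed genes.
    """
    pairs = list(itertools.combinations(range(n_genes), 2))
    design = np.zeros((len(pairs), n_genes))
    for row, (i, j) in enumerate(pairs):
        design[row, i] = 1.0
        design[row, j] = 1.0
    return design


def recover_intensities(y: np.ndarray, design: np.ndarray, a: float) -> np.ndarray:
    """Least-squares estimate of x in y = a * design @ x + n."""
    x_hat, *_ = np.linalg.lstsq(a * design, y, rcond=None)
    return x_hat


# Simulated data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n_genes = 30
A = pair_design_matrix(n_genes)                   # shape (435, 30)
x_true = rng.uniform(100.0, 5000.0, size=n_genes)
a = 0.5                                           # assumed mixing scalar
y = a * A @ x_true + rng.normal(0.0, 20.0, size=A.shape[0])
x_hat = recover_intensities(y, A, a)
```

Because every gene appears in 29 different pairs, the system is heavily overdetermined (435 equations for 30 unknowns), which is what allows the noise in individual mixture spots to average out in the estimate.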
0	MATERIALS  AND  METHODS  Oligonucleotide  design  For  the  proof-of-principle  experiments,  we 
0	A  novel  sensitive  microarray  approach  for  differential  screening  using  probes  labelled  with  two  different  radioelements
1	H.  Salin,  T.  Vujasinovic,  A.  Mazurie,  S.  Maitrejean1,  C.  Menini,  J.  Mallet  and  S.  Dumas*
0	LGN,  UMR  7091,  CNRS,  Batiment  CERVI,  5eme  Etage,  Hopital  Pitie  Salpetriere,  83  boulevard  de  l'Hopital,  F-75013  Paris,  France  and  1Biospace  Mesures,  10  rue  Mercoeur,  F-75011  Paris,  France
0	ABSTRACT We have developed a novel microarray approach for differential screening using probes labelled with two different radioelements. The complementary DNAs from the reverse transcription of mRNAs from two different biological samples were labelled with radioelements of significantly different energies (³H and ³⁵S or ³³P). Radioactive images corresponding to the expressed genes were acquired with a MicroImager, a real time, high resolution digital autoradiography system. An algorithm was used to process the data such that the initially acquired radioactive image was filtered into two subimages, each representative of the hybridisation result specific for one probe. The simultaneous screening of gene expression in two different biological samples requires <100 ng mRNA without any amplification. In such conditions, the technique is sensitive enough to directly quantify the amount of mRNA even when present in small amounts: 10⁷ molecules in the probe as assessed with an added control sequence and 2 × 10⁵ molecules with an endogenous tyrosine hydroxylase mRNA. This novel technique of double radioactive labelling on a microarray is thus suitable for the comparison of gene expression in two different biological samples available in only small quantities. Consequently, it has great potential for various biological fields, such as neuroscience. INTRODUCTION DNA array technology is increasingly used for large-scale screening of gene expression. 
The  availability  of  laser  devices  that  can  differentiate  between  several  fluorescent  dyes  has  led  to  most  development  efforts  being  concentrated  on  fluorescent  labelling  of  probes  to  be  hybridised  onto  DNA  arrays  (the  immobilised  nucleic  acid  is  called  the  `target'  and  the  free  nucleic  acid  is  called  the  `probe').  The  use  of  two  different  fluorescent  dyes,  one  to  label  probes  from  a  control  tissue  and  one  to  label  probes  from  a  tissue  of  interest,  allows  normalised  quantification  of  gene  expression.  For  example,  standard  high
0	of starting material required for radioactive labelling is only 2-400 ng mRNA to detect 2 × 10⁷ molecules (12). Previously, such analyses were possible only for one mRNA sample at a time. A technique comparing several mRNA samples on the same high density array but attaining the sensitivity discussed above would be of great value. For example, the results could be normalized, each RNA sample being used as a control for the other, on each target of the microarray, as is possible with double fluorescent labelling (2). These considerations led us to develop a technique for simultaneous hybridisation of two differently labelled radioactive probes on the same glass support microarray and detection of the hybridisation result for each probe separately. The development of this procedure required a device for detection of radioactive emission that could discriminate between different radioactive emission spectra and also with a spatial discrimination appropriate for the microarray density. The MicroImager has these properties. We have previously shown the potential of this device in the discrimination of the radioactive emissions of two different radioelements for in situ hybridisation of two probes on a single tissue section (13,14). Here we describe methods of labelling and hybridisation allowing work with two radioactive probes simultaneously on a single glass support microarray. The sensitivity of this method was analysed and we demonstrate the potential of this novel approach in cases where only small samples are available. 
MATERIALS  AND  METHODS  Gene  array  PCR  products  300-1500  bp  long  were  purified  using  the  concert  nucleic  acid  purification  system  and  then  spotted  with  an  arrayer  (Genetix)  onto  polylysine-coated  slides  (15).  The  cDNA  clones  used  were  obtained  from  adult  rat  brains  by  RT-PCR,  from  a  positive  and  exogenous  control  luciferase  cDNA  sequence  (572  bp  insert)  in  the  pGEM-T  easy  vector  (Promega,  France)  and  from  a  negative  and  exogenous  control  neomycin  phosphotransferase  cDNA  sequence  (738  bp  insert)  in  the  pGEM-T  easy  vector  (Promega).  A  total  of  384  clones  were  spotted  onto  the  microarray.  The  microarray  plan  was  made  up  of  four  blocks  of  four  rows  and  24  columns  (as  shown  in  Fig.  2).  This  plan  was  in  duplicate  on  every  microarray.  Preparation  of  the  luciferase  RNA  The  luciferase  RNA  was  prepared  from  the  luciferase  cDNA  described  above  using  the  riboprobe  combination  system  T7  (Promega).  RNA  extraction  mRNA  was  directly  isolated  from  crude  extracts  of  rat  brain  tissues  on  magnetic  beads  [oligo(dT)25  Dynabeads;  Dynal].  All  experimental  procedures  were  carried  out  in  accordance  with  the  European  Communities  Council  Directive  (24.xi.1986)  and  with  the  guidelines  of  the  CNRS  and  the  French  Agricultural  and  Forestry  Ministry  (decree  87848,  licence  number  A91429).  All  efforts  were  made  to  minimise  animal  suffering  and  to  use  only  the  number  of  animals  necessary  to  produce  reliable  scientific  data.
0	Sample preparation for hybridisation Aliquots of 100 ng mRNA were mixed with 0.1 µg random hexamers from a Superscript First-Strand Synthesis System for RT-PCR (Life Technologies, France), heated to 70°C for 10 min and cooled on ice. Probe synthesis and labelling were then performed in the presence of 5 mM MgCl2, 1x reverse transcription buffer (Life Technologies), 10 mM dithiothreitol, 100 U RNaseOUT RNase inhibitor (Life Technologies), 0.05 mM ddTTP, 0.5 mM dGTP and dTTP, 100 U Superscript II reverse transcriptase (Life Technologies) and 10 µCi [³⁵S]dATP (Amersham) and 0.5 mM dCTP or 20 µCi [³H]dCTP (Amersham) and 0.5 mM dATP for the phosphorylated and tritiated probes, respectively, by incubation of the mixtures at 42°C for 50 min. RNA was eliminated by heating at 70°C for 15 min and treatment with 2 U RNase H (Life Technologies) at 37°C for 20 min. Unincorporated nucleotides were removed by passage through a P10 column (Bio-Rad). Hybridisation The probes were added to the hybridisation buffer (3.5x SSC, 0.3% SDS), heated to 95°C for 2 min, cooled to room temperature and then put on the microarray under parafilm (Fuji). Hybridisation was performed in a cassette chamber (Telechem) submerged in a water bath at 60°C for 16-17 h. Following hybridisation, arrays were rinsed at room temperature in 2x SSC, 0.1% SDS, then 2x SSC, then 0.2x SSC, each washing step lasting 2 min. Acquisition of radioactive images with a MicroImager (Biospace Mesures, Paris, France) A thin foil of scintillating paper was placed in contact with the microarrays. β-Particles emitted by the hybridised probes were identified by acquisition of the light spot emissions in the scinti
0	Sensitivity  and  Specificity  of  Photoaptamer  Probes*
1	Drew  Smith§,  Brian  D.  Collins,  James  Heil,  and  Tad  H.  Koch¶
0	Proteomics,  the  study  of  protein  expression  at  the  scale  of  cell,  tissue,  or  organism  (1,  2),  has  been  defined  by  a  single  technology:  two-dimensional  gel  separation  followed  by  mass  spectrometric  analysis  (3,  4).  Although  this  technology  is  mature,  powerful,  and  wonderfully  sophisticated,  it  suffers  from  evident  limitations  in  speed  and  sensitivity.  Several  days  are  required  to  process  a  single  sample,  and  only  1000  of  the  most  abundant  proteins  can  be  detected  (5).  The  ideal  proteomic  technology  would  process  samples  in  minutes  or  hours  and  be  able  to  quantify  even  the  most  weakly  expressed  proteins.  Two-dimensional  gels  and  chromatographic  methods  separate  and  identify  proteins  on  the  basis  of  their  physical  characteristics.  An  alternative  approach  is  to  identify  proteins  by  specific  recognition.  The  potential  advantage  of  this  approach  is  that  proteins  that  have  similar  size  and  charge  but  which
0	The  abbreviations  used  are:  SELEX,  systematic  evolution  of  ligands  by  exponential  enrichment;  A,  aptamer;  aFGF,  acidic  fibroblast  growth  factor;  bFGF,  basic  fibroblast  growth  factor;  NHS,  N-hydroxysuccinimide;  PDGF,  platelet-derived  growth  factor;  T,  target  protein;  HIV,  human  immunodeficiency  virus.
0	Molecular  &  Cellular  Proteomics  2.1
0	under the harshest and most stringent conditions necessary to reduce background and improve signal. What is not established is the effect of photocross-linking on the specificity of the capture step. We set out to characterize, systematically and quantitatively, a set of photocross-linking aptamers, photoaptamers, with regard to their sensitivity and specificity. The photoreactive unit incorporated into our photoaptamers is 5-bromodeoxyuridine (BrdUrd), used for decades in protein-nucleic acid cross-linking studies. Rather than use short wave (254 or 266 nm) UV light for cross-linking, however, we irradiate at 308 nm using a XeCl excimer laser. This technique was developed by Koch and colleagues (12-16) and has been shown to result in specific and high yield cross-linking reactions. Light at 308 nm induces photoelectron transfer from a nearby electron donor to the bromouracil base via either excitation of the BrdUrd, excitation of the electron donor, or excitation of a BrdUrd-electron donor charge transfer state (17, 18). Amino acid residues that can serve as electron donors in BrdUrd photocross-linking include Tyr, Trp, His, Phe, Cys, Cys-Cys, and Met, of which only Tyr and Trp are excited at 308 nm (16-20). Cross-linking results from subsequent reaction of the resulting radical ion pair. In the absence of an electron donor the BrdUrd efficiently relaxes back to ground state (17). We hypothesized that photocross-linking via photoelectron transfer would actually enhance the specificity of the aptamer-protein capture reaction: although a protein might bind an aptamer nonspecifically, the probability that an appropriate amino acid would be positioned to cross-link with a BrdUrd residue would be low. 
Some  evidence  for  this  view  has  been  presented  by  Golden  and  co-workers  (9),  who  showed  that  basic  fibroblast  growth  factor  (bFGF)  photoaptamers  could  cross-link  picomolar  concentrations  of  target  in  the  presence  of  serum  with  very  little  nonspecific  cross-linking.  Using  these  bFGF  photoaptamers  and  a  new  photoaptamer  raised  against  the  HIV  coat  protein  gp120MN  we  evaluated  both  the  equilibrium  binding  constant  and  the  relative  rate  of  cross-linking  to  target  proteins.  We  then  compared  these  values  to  the  values  for  a  set  of  non-target  proteins.  These  non-target  proteins  were  chosen  to  provide  an  exacting  test  of  specificity:  1)  aFGF  and  gp120SF2  are  the  commercially  available  proteins  most  closely  related  to  the  target  proteins;  2)  platelet-derived  growth  factor  (PDGF)  is  a  highly  basic  heparinbinding  growth  factor  that  is  notorious  for  its  nonspecific  DNA  binding;  and  3)  thrombin  is  another  heparin-binding  protein.  These  experiments  confirm  the  specificity  of  the  photocross-linking  reaction  in  the  solution  phase.  We  extend  these  results  to  microarray  format  by  measuring  cross-linking  of  immobilized  photoaptamers  to  target  protein.  We  find  that  the  sensitivity  and  specificity  of  photocross-linking  are  maintained  in  this  format:  target  proteins  can  be  detected  at  subnanomolar  concentrations  in  buffer  and  at  nanomolar  concentrations  when  spiked  into  serum.
0	EXPERIMENTAL  PROCEDURES
0	Revealing  Global  Regulatory  Features  of  Mammalian  Alternative  Splicing  Using  a  Quantitative  Microarray  Platform
0	Molecular Cell
0	sive use of the latter approach was the application of "exon-junction" microarrays for the discovery of exon skipping events in human tissues and cell lines (Johnson et al., 2003). These authors used custom microarrays containing oligonucleotide probes complementary to mapped exon-exon junction sequences in RefSeq genes for the main purpose of discovering new AS events in human transcripts. Despite the progress described above, a system has not yet been described that permits the large-scale quantitative profiling of alternative splicing in mammalian cell and tissue sources. This is primarily due to limitations stemming from the design of existing microarrays and the lack of suitable algorithms for data analysis. In this paper, we describe a microarray platform that permits the simultaneous quantification of the levels of thousands of alternative exons in mammalian cell and tissue sources. We have applied this system to the analysis of the regulation of 3126 sequence-verified AS events in diverse mouse tissues. The resulting data have generated hundreds of new inferences for functional roles of tissue-specific AS, insights into how the evolutionary origins of alternative exons relate to their inclusion levels in normal tissues, and information on global features of AS that underlie tissue-type specificity. This study therefore demonstrates the utility of a quantitative microarray platform for generating fundamental new insights into the global regulation of alternative splicing in mammals.
Results
A Custom Microarray for Quantitative Profiling of AS in Mammalian Cells
In order to perform large-scale quantitative analyses of functionally diverse AS events in mammalian tissues, we developed a custom microarray to represent sequence-validated AS events mined from mouse cDNA and EST sequence databases (refer to Experimental Procedures). To minimize representation of possible splicing errors or relatively low-abundance transcripts, we selected "cassette-type" AS events with the highest numbers of supporting cDNA and EST sequences from different cell and tissue sources. To enhance the sensitivity of detection and quantification of inclusion/exclusion levels of alternative exons, each AS event was measured by using six different oligonucleotide probes: one body probe for each exon sequence, designated as "C1, A and C2" probes (C, constitutive; A, alternative), and one junction probe for each of the three splice-junction sequences generated by AS, designated as "C1-A, A-C2 and C1-C2" probes (Figure 1A). In addition, a control probe specific to each intron sequence (located between C1 and A) was included to permit detection of unspliced pre-mRNA and/or contaminating genomic DNA in the hybridizations. From an initial starting set of 4892 AS events in our database, 3126 AS events were selected for monitoring on a single ink-jet printed microarray, manufactured by Agilent Technologies (Figure 1B). The vast majority of the AS events correspond to cassette-type alternative exons, and additional events may correspond to mutually exclusive alternative exons. The 3126 AS events are
0	represented  by  2647  distinct  genes,  with  413  of  the  genes  containing  two  or  more  AS  events.  In  addition,  54  of  the  AS  events  represented  on  the  microarray  are  duplicates  and  were  monitored  by  sets  of  probes  that  in  some  cases  are  complementary  to  different  sequences  within  the  same  exons.  These  served  as  reproducibility  controls  (see  below).  The  2647  AS  genes  represented  on  the  microarray  are  associated  with  1118  distinct  Gene  Ontology  Biological  Process  (GO-BP)  categories  among  a  total  set  of  2362  GO-BP  categories  assigned  to  10,361  Mouse  Gene  Informatics  (MGI)  markers  (refer  to  Experimental  Procedures;  see  below).  This  indicates  that  the  AS  genes  represented  on  the  microarray  are  associated  with  a  diverse  range  of  biological  functions  in  mammalian  cells.  Quantitative  Microarray  Profiling  of  Alternative  Splicing  in  Mouse  Tissues  In  order  to  assess  the  performance  of  our  microarray  system  and  to  reveal  global  properties  of  alternative  splicing  in  mammalian  tissues,  we  hybridized
0	Molecular  Cancer  Therapeutics
0	Transcriptome  analysis  of  endometrial  cancer  identifies  peroxisome  proliferator-activated  receptors  as  potential  therapeutic  targets
1	Cathrine  M.  Holland,1,2  Samir  A.  Saidi,2  Amanda  L.  Evans,1  Andrew  M.  Sharkey,1  John  A.  Latimer,2  Robin  A.F.  Crawford,2  D.  Stephen  Charnock-Jones,2  Cristin  G.  Print,1  and  Stephen  K.  Smith1,2
0	Endometrial  carcinoma  is  the  most  common  gynecologic  malignancy  and  comprises  97%  of  all  uterine  cancers  (1).
0	There is a peak incidence between ages 55 and 65 years, with <5% of endometrial cancers occurring below age 40 years (2). The majority are of an endometrioid histologic subtype and display an association with obesity and diabetes mellitus (2). There is a pressing need to better understand the molecular basis for this disease, as 25% of women present with extrauterine disease with 5-year survival rates of ~31% and 10% for Federation Internationale des Gynaecologistes et Obstetristes stages 3 and 4 disease, respectively (2). An improved understanding of events at a molecular level is essential in the development of targeted therapy, with a view to improving survival and cure rates. There are increasing efforts to gain a more global view of the multiple, interrelated molecular changes that occur during tumorigenesis (3-6). The gene microarray is a high-throughput technology able to interrogate multiple genetic changes within tissues and cells (7-9). Consequently, there has been a marked increase in the use of microarrays to interrogate cancers at the genomic level. In addition to screening for candidate genes, microarrays may provide molecular diagnoses, thus avoiding some of the weaknesses of conventional diagnostic techniques (4, 10). Despite the increasing use of microarray technology in cancer research, there have been difficulties obtaining meaningful biological information. The cost of genome-wide, commercially available arrays may prohibit large experimental samples, and there are multiple sources of variation in experimental results complicating data analysis and interpretation (11).
Large-scale  gene  expression  analyses  of  endometrial  cancer  have  mostly  been  confined  to  small  sample  sets  and  cell  lines  (12,  13)  and  have  employed  genome-wide,  commercially  available  microarray  systems  (12).  Previous  microarray  studies  in  endometrial  cancer  have  highlighted  differences  in  the  abundance  of  individual  genes  between  benign  and  malignant  tissues  (12,  13),  although  there  has  been  little  advance  in  the  understanding  of  pathway-specific  alterations  that  may  contribute  to  endometrial  tumorigenesis.  Independent  component  analysis  (ICA)  is  a  sophisticated  statistical  method  that  aims  to  identify  patterns  of  coregulated  genes  rather  than  individual  transcript  changes  (14).  We  previously  have  applied  high-density  cDNA  microarrays  to  determine  gene  transcript  abundance  in  epithelial  ovarian  cancer  (14).
0	Materials  and  Methods
0	Tumor  Samples  and  RNA  Preparation  Twenty  frozen  endometrial  carcinoma  tissues,  three  atypical  complex  hyperplasias,  and  eight  postmenopausal  benign  endometrial  control  tissues  (four  atrophic  and  four
0	PPARα Is a Molecular Target in Endometrial Cancer
0	quantitative, real-time PCR experiments were done in the ABI PRISM 7700 Sequence Detector (Applied Biosystems) according to the manufacturer's instructions and were done in triplicate. The resultant data were averaged for each sample. No-template controls were included in each experiment. Specific oligonucleotide primers and probes were used. These were designed for each of five genes [cyclooxygenase-2 (COX-2), vascular endothelial growth factor-B (VEGF-B), PPARα, PPARγ, and retinoid X receptor β (RXRβ)] using Primer Express 1.5 software (Applied Biosystems). Sequences are given below: (a) COX-2 5′-TGATCCCCAGGGCTCAAA-3′ (forward primer), 5′-ATCTGTCTTGAAAAACTGATGCGT-3′ (reverse primer), 5′-6FAM-TGATGTTTGCATTCTTTGCCCAGCACT-TAMRA-3′ (probe); (b) VEGF-B 5′-AGCACCAAGTCCGGATG-3′ (forward primer), 5′-GTCTGGCTTCACAGCACTG-3′ (reverse primer), 5′-6FAM-AGATCCTCATGATCCGGTACCCGT-TAMRA-3′ (probe); (c) PPARα 5′-GACGTGCTTCCTGCTTCATAGA-3′ (forward primer), 5′-CACCATCGCGACCAGATG-3′ (reverse primer), 5′-6FAM-TGGAGCTCGGCGCACAACCA-TAMRA-3′ (probe); (d) PPARγ 5′-CAGAGCAAAGAGGTGGCCAT-3′ (forward primer), 5′-GCTTTTGGCATACTCTGTGATCTC-3′ (reverse primer), 5′-6FAM-CATCTTTCAGGGCTGCCAGTTTCGC-TAMRA-3′ (probe); (e) RXRβ 5′-CCATCCGCAAAGACCTTACATAC-3′ (forward primer), 5′-GTTCCGCTGGCGCTTG-3′ (reverse primer), 5′-6FAM-TGCCGGGACAACAAAGACTGCACA-TAMRA-3′ (probe). Results for gene abundance in each sample were normalized to abundance of an endogenous control gene. 18S rRNA was used as an endogenous control for all genes, with the exception of VEGF-B for which β-actin was used. Preliminary experiments to determine tha
0	Patterns  of  Temperature  Adaptation  in  Proteins  from  Methanococcus  and  Bacillus
1	John  H.  McDonald,*  Alicia  M.  Grasso,*  and  Lidia  K.  Rejto
0	McDonald  et  al.
0	Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization
1	Paul  T.  Spellman,*  Gavin  Sherlock,*  Michael  Q.  Zhang,  Vishwanath  R.  Iyer,§  Kirk  Anders,*  Michael  B.  Eisen,*  Patrick  O.  Brown,§  David  Botstein,*¶  and  Bruce  Futcher
0	INTRODUCTION  In  1981  Hereford  and  coworkers  discovered  that  yeast  histone  mRNAs  oscillate  in  abundance  during  the  cell  division  cycle  (Hereford  et  al.,  1981).  To  date  104  messages  that  are  cell  cycle  regulated  have  been  identified  using  traditional  methods,  and  it  was  estimated  that  some  250  cell  cycle-regulated  genes  might  exist  (Price  et  al.,  1991).  There  are  several  reasons  why  genes  might  be  regulated  in  a  periodic  manner  coincident  with  the  cell  cycle.  Such  regulation  might  be  required  for  the  proper  functioning  of  mechanisms  that  maintain  order  during  cell  division.  Alternatively,  regulation  of  these  genes  could  simply  allow  conservation  of  resources.  Much  of  the  literature  has  focused  on  the
0	posttranscriptional mechanisms that control the basic timing of the cell cycle. However, there is also clear evidence that trans-acting factors play a critical role in the regulation of the abundance of many cell cycle-regulated transcripts. Most identified cell cycle controls that exert influence over mRNA levels do so at the level of transcription. Three major types of cell cycle transcription factors are known in yeast, the MBF and SBF factors, Mcm1p-containing factors, and Swi5p/Ace2p (Table 1). Many genes expressed at about the G1/S transition contain MCB or SCB elements in their promoters to which MBF and SBF bind respectively (for review, see Koch and Nasmyth, 1994). It is now apparent that SBF is not as specific for SCBs as was originally thought but, rather, can bind, at least in some cases, to motifs more closely matching the MCB consensus (Partridge et al., 1997). MBF and SBF are activated posttranslationally by Cln3p-Cdc28p, and SBF, at least, is inactivated by Clb2p-Cdc28p (Amon et al., 1993). It is this cyclin-dependent activation and inactivation that causes MBF- and SBF-mediated transcription to be cell cycle regulated. Mcm1p can bind with other DNA binding proteins to mediate a specific biological effect. In cooperation with Ste12p, Mcm1p directs the cell cycle expression of some genes in early G1 phase (Oehlen et al., 1996). In cooperation with an uncloned factor called "Swi five factor" (SFF), it induces the expression of CLB1, CLB2, BUD4, and SWI5 in M (Lydall et al., 1991; Sanders and Herskowitz, 1996). Finally, possibly acting without a partner, it induces transcription of CLN3, SWI4, and CDC6 at the M/G1 boundary (McInerny et al., 1997). The Mcm1p SFF combination is interesting, because it is somehow activated by Clb2p-Cdc28p, and Mcm1p SFF then induces further transcription of CLB2. Thus, Mcm1p is part of a positive feedback loop for CLB2 transcription. Finally, Swi5p and Ace2p, which are transcriptionally controlled by Mcm1p and SFF, are responsible for the expression of many genes in M and M/G1 (Kovacech et al., 1996). Some of these genes are responsible for inactivating Clb2p and promoting cytokinesis, thus allowing exit from mitosis, and allowing the cycle to begin anew. Many cell cycle-regulated genes are involved in processes that occur only once per cell cycle. Such processes include DNA synthesis, budding, and cytokinesis. Additionally many of these genes are involved in controlling the cell cycle itself, although in most cases it is unclear whether their regulated transcription is absolutely required. The cell division cycle is thus a complex self-regulating program, such that
0	by The American Society for Cell Biology
0	P.T. Spellman et al.
0	Table 1. Transcription factors that regulate the cell cycle
Complex   Composition    Site name   Site               Reference
SBF       Swi6p, Swi4p   SCB         CACGAAA            Nasmyth, 1985; Andrews and Herskowitz, 1989
MBF       Swi6p, Mbp1p   MCB         ACGCGT             Lowndes et al., 1991; McIntosh et al., 1991; Koch et al., 1993
Mcm1p     Mcm1p          MCM1        TTACCNAATTNGGTAA   Acton et al., 1997
SFF       SFF            SFF         GTMAACAA           Althoefer et al., 1995
Ace2p     Ace2p          SWI5        ACCAGC             Dohrmann et al., 1996
Swi5p     Swi5p          SWI5        ACCAGC             Knapp et al., 1996
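The consensus sites listed in Table 1 lend themselves to a simple promoter scan. The following is a minimal sketch, not the authors' method: it assumes the standard IUPAC reading of the ambiguity codes (M = A or C, N = any base), and the names `SITES` and `find_sites` are hypothetical helpers introduced only for illustration.

```python
import re

# IUPAC codes needed for the Table 1 consensus sites (M = A or C, N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Consensus sequences copied from Table 1.
SITES = {
    "SCB": "CACGAAA",
    "MCB": "ACGCGT",
    "MCM1": "TTACCNAATTNGGTAA",
    "SFF": "GTMAACAA",
    "SWI5": "ACCAGC",
}

def find_sites(promoter, consensus):
    """Return (forward_hits, reverse_hits) as 0-based forward-strand start positions."""
    promoter = promoter.upper()
    # A lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in consensus) + "))")
    forward = [m.start() for m in pattern.finditer(promoter)]
    # Scan the reverse complement, then map hits back to forward coordinates.
    rc = promoter.translate(COMPLEMENT)[::-1]
    reverse = [len(promoter) - m.start() - len(consensus)
               for m in pattern.finditer(rc)]
    return forward, sorted(reverse)

if __name__ == "__main__":
    toy_promoter = "GGTTTCGTGCCTTACGCGTAA"  # made-up sequence, not a real yeast promoter
    for name, consensus in SITES.items():
        print(name, find_sites(toy_promoter, consensus))
```

Exact consensus matching of this kind is deliberately strict; the text notes that SBF can bind motifs closer to the MCB consensus, so a real analysis would likely need a more tolerant model.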
0	Strains  used  in  this  study  are  shown  in  Table  2.
0	Media  and  Growth  Conditions
0	YEP  medium  (Sherman,  1991)  was  used  in  all  experiments,  supplemented  with  an  appropriate  carbon  source.  Carbon  sources  are  indicated  in  the  descriptions  of  each  experiment  and  were  used  at  a
0	Molecular  Biology  of  the  Cell
0	Microarray  Manufacture
0	Yeast ORFs were amplified using gene PAIRS primers (Research Genetics, Huntsville, AL). One hundred-microliter PCR reactions were performed in 96-well PCR plates using each primer pair with the following reagents: 1 μM each primer, 200 μM each dATP, dCTP, dTTP, and dGTP, 1× PCR buffer (Perkin Elmer-Cetus, Norwalk, CT), 2 mM MgCl2, and 2 U of Taq DNA polymerase (Perkin Elmer-Cetus). Thermalcycling was performed in Perkin Elmer-Cetus 9600 thermalcyclers with a 5-min denaturation step at 94°C, followed by 30 cycles with melting, annealing, and extension temperatures and times of 94°C, 30 s; 54°C, 45 s; and 72°C, 3 min 30 s, respectively. Production of the correct PCR product was verified by gel electrophoresis. Products deemed to have failed were reamplified either by repeating the PCR reaction with the gene PAIRS primers, ordering custom primers, or using the yeast ORF DNA (Research Genetics) as a template. Reamplification of failed PCRs used the same protocol as initial amplification. DNAs were prepared and printed onto microarrays as described previously (Shalon et al., 1996; DeRisi et al., 1997 [http://cmgm.stanford.edu/pbrown/]; Eisen and Brown, 1999) with 190-μm spacing between the centers of each element. Each microarray was visually inspected, and all microarrays used in this study were estimated to be missing 1% of all elements except for arrays used in the cdc15 experiments, which were missing 3% of all elements.
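As a rough check on the cycling program described above, the programmed hold times can simply be added up. This is a minimal sketch; it ignores ramp time between temperatures, which the text does not specify, and the variable names are illustrative only.

```python
# Programmed hold times from the protocol above, in seconds.
INITIAL_DENATURATION_S = 5 * 60      # 5 min at 94°C
CYCLES = 30
STEPS_S = {
    "melt_94C": 30,
    "anneal_54C": 45,
    "extend_72C": 3 * 60 + 30,
}

def total_hold_seconds(initial_s, cycles, steps_s):
    """Sum of programmed holds only; ramp time is not included."""
    return initial_s + cycles * sum(steps_s.values())

total_s = total_hold_seconds(INITIAL_DENATURATION_S, CYCLES, STEPS_S)
print(f"{total_s} s = {total_s / 60:.1f} min")
```

Each cycle holds for 285 s, so 30 cycles plus the initial denaturation come to 8850 s, or about 2.5 h per plate before ramping is counted.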
0	Size-based  Synchronization
0	Nine  l
0	DNA  Microarrays  of  the  Complex  Human  Cytomegalovirus  Genome:  Profiling  Kinetic  Class  with  Drug  Sensitivity  of  Viral  Gene  Expression
1	JAMES  CHAMBERS,1  ANA  ANGULO,2  DHAMMIKA  AMARATUNGA,1  HONGQING  GUO,1  YING  JIANG,1  JACKSON  S.  WAN,1  ANTON  BITTNER,1  KLAUS  FRUEH,1  MICHAEL  R.  JACKSON,1  PER  A.  PETERSON,1  MARK  G.  ERLANDER,1  AND  PETER  GHAZAL2*  Departments  of  Immunology  and  Molecular  Biology,  Division  of  Virology,  The  Scripps  Research  Institute,  La  Jolla,  California  92037,2  and  The  R.  W.  Johnson  Pharmaceutical  Research  Institute,  San  Diego,  California  921211
0	MATERIALS AND METHODS
0	Selection and synthesis of oligonucleotides for DNA microarrays. The complete set of ORFs from the HCMV genome was analyzed with a custom sequence analysis program that selected a 75-base sequence to be used as a microarray deposition target. The analysis preferentially selects unique sequences with a 3′ gene bias and a G-C content of 40 to 60% and rejects sequences that contain homopolymeric stretches and potential hairpin structures. The 3′ gene bias is preferred, as fluorescently labelled cDNA prepared for hybridization is generated by using oligo(dT) to prime poly(A) tails of mRNA. The selected target sequences were synthesized by using a PE Perseptive BioSystem (Framingham, Mass.) Expedite MOSS DNA synthesizer with membrane columns. Synthesized gene target oligonucleotides were cleaved, deprotected, and purified by standard procedures. Target oligonucleotides were transferred in triplicate to 96-well master plates at a concentration of 1 μg/μl (in 3× SSC [1× SSC is 0.15 M NaCl plus 0.015 M sodium citrate]) for robotic deposition. The sequence of oligonucleotides comprising the deposited HCMV ORF microarray is shown in Fig. 1. The small ORF UL48/49 (8) and the UL74 ORF described by Huber and Compton (13) were not included in the present chip design. Also shown in Fig. 1 is a subset of cellular genes that were included as internal controls for normalization between chips, as follows: elongation factor 1-alpha (accession no. M29548), human acidic ribosomal phosphoprotein (RiboPO; accession no. M17885), alpha tubulin (accession no. K00558), glyceraldehyde-3-phosphate deh
0	CHAMBERS ET AL.
0	J. VIROL.
0	GTACCGTTGTACGCATTACAC-3′) and 18120 (5′-GACGAAGATGCCGATGTGTGAC-3′). The resulting PCR fragments were isolated from agarose gels and then radiolabelled with [α-32P]dATP by the random-primed labelling method (Boehringer, Mannheim, Germany) according to the manufacturer's protocol. For TRL8-IRL8, TRL9-IRL9, UL15, UL31, UL48, UL66, and UL73, the corresponding oligonucleotides shown in Fig. 1 were used as probes, after being [γ-32P]ATP end labelled with polynucleotide kinase (Stratagene). Oligonucleotide probes were hybridized to the filters for 1 h at 45°C by using Quick Hybridization solutions (Stratagene) under conditions recommended by the manufacturer. PCR-generated probes were hybridized with the filters for 12 h at 65°C in 1× Denhardt's solution, 6× SSC, and 100 μg of denatured salmon sperm DNA/ml. Filters were washed to a stringency of 0.1% sodium dodecyl sulfate (SDS) at 60°C or 1% SDS at 42°C depending on whether PCR-generated DNA fragments or oligonucleotides, respectively, were used during the hybridization. Hybridization signals were quantitated by using a Molecular Dynamics PhosphorImager system with ImageQuant software. MEME analysis of the upstream noncoding DNA sequences. The computer program Multiple EM for Motif Elicitation (MEME) was used to search for sequence motifs in 500 bp of noncoding sequences upstream of the initiation codon. MEME analysis was performed by using the sequence of strain AD169 of HCMV.
The 5′ noncoding regions were categorized according to class of expression as follows: E (TRL4-IRL4, UL104-5, UL11, UL112, UL124, UL13, UL16-7, UL24, UL26-7, UL35, UL4-5, UL45, UL53-7, UL77-9, US8-14, US16-7, US19, US23-4, US26, US28, and US30), early-late (E-L) (TRL-IRL6, TRL-IRL10, TRL-IRL12, TRL-IRL13, UL1, UL106, UL130, UL40, UL44, UL46-7, UL49, UL72, UL83-5, UL95-8, US6-7, and US29), and L (TRL-IRL8, TRL-IRL11, TRL-IRL14, UL100, UL103, UL111A, UL117, UL119, UL131, UL14, UL18, UL2-3, UL7, UL9, UL25, UL29, UL32-3, UL43, UL48, UL52, UL59, UL67, UL73, UL80, UL82, UL91-3, UL99, US18, and US27). By using MEME, 30 motifs (10 of 8 bases in length, 10 of 10 bases in length or longer, and 10 of 12 bases in length or longer) were derived from each gene set. The distribution of the combined 90 patterns was identified, allowing for 10% mismatch. MEME is available on the World Wide Web (20a). The resulting motifs that developed a significant polarized distribution pattern are summarized in Table 2. In addition, the transcription factor database (TFD) was used to search for known regulatory sequences. The TFD was downloaded from the National Center for Biotechnology Information.
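The oligonucleotide selection rules described in this section (a 75-base target, a 3′ bias, 40 to 60% G-C content, rejection of homopolymeric stretches) can be sketched as a simple window scan. This is only an illustration under stated assumptions, not the custom program used in the study: the homopolymer cutoff `max_run` is an assumed value, the function names are hypothetical, and the uniqueness and hairpin checks mentioned in the text are omitted.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def pick_target(orf, length=75, gc_min=0.40, gc_max=0.60, max_run=5):
    """Return (start, window) for the 3'-most window that passes the filters.

    max_run is an assumed cutoff; the source does not state one.
    Returns None if the ORF is too short or no window qualifies.
    """
    orf = orf.upper()
    # Walk from the 3' end toward the 5' end so the first hit is the most 3'.
    for start in range(len(orf) - length, -1, -1):
        window = orf[start:start + length]
        if not gc_min <= gc_fraction(window) <= gc_max:
            continue
        if longest_homopolymer(window) > max_run:
            continue
        return start, window
    return None

if __name__ == "__main__":
    toy_orf = "ACGT" * 30 + "A" * 20   # made-up sequence with an A-rich 3' end
    start, window = pick_target(toy_orf)
    print(start, len(window), round(gc_fraction(window), 3))
```

Scanning from the 3′ end mirrors the stated preference for targets near the poly(A) tail, since the labelled cDNA is primed with oligo(dT).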
0	Accounting  Units  in  DNA
1	S.  J.  BELL  AND  D.  R.  FORSDYKE*
0	Chargaff's first parity rule (%A = %T and %G = %C) is explained by the Watson-Crick model for duplex DNA in which complementary base pairs form individual accounting units. Chargaff's second parity rule is that the first rule also applies to single strands of DNA. The limits of accounting units in single strands were examined by moving windows of various sizes along sequences and counting the relative proportions of A and T (the W bases), and of C and G (the S bases). Shuffled sequences account, on average, over shorter regions than the corresponding natural sequence. For an E. coli segment, S base accounting is, on average, contained within a region of 10 kb, whereas W base accounting requires regions in excess of 100 kb. Accounting requires the entire genome (190 kb) in the case of Vaccinia virus, which has an overall "Chargaff difference" of only 0.086% (i.e. only one in 1162 bases does not have a potential pairing partner in the same strand). Among the chromosomes of Saccharomyces cerevisiae, the total Chargaff differences for the W bases and for the S bases are usually correlated. In general, Chargaff differences for a natural sequence and its shuffled counterpart diverge maximally when 1 kb sequence windows are employed. This should be the optimum window size for examining correlations between Chargaff differences and sequence features which have arisen through natural selection. We propose that Chargaff's second parity rule reflects the evolution of genome-wide stem-loop potential as part of short- and long-range accounting processes which work together to sustain the integrity of various levels of information in DNA.
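The moving-window procedure summarized above can be sketched in a few lines. The exact formula for the Chargaff difference is not given in this excerpt, so the definition used here (the absolute excess of A over T, or of C over G, as a percentage of the window length) is an assumption, as are the function names and the fixed shuffling seed.

```python
import random

def chargaff_differences(window):
    """Return (W, S) differences for one window, as percentages.

    Assumed definition: |A - T| and |C - G| divided by the window length.
    """
    window = window.upper()
    n = len(window)
    w_diff = abs(window.count("A") - window.count("T")) / n * 100
    s_diff = abs(window.count("C") - window.count("G")) / n * 100
    return w_diff, s_diff

def sliding_chargaff(seq, window_size, step):
    """Yield (start, W difference, S difference) for each window along seq."""
    for start in range(0, len(seq) - window_size + 1, step):
        w_diff, s_diff = chargaff_differences(seq[start:start + window_size])
        yield start, w_diff, s_diff

def shuffled(seq, seed=0):
    """Shuffled control with the same overall base composition."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)

if __name__ == "__main__":
    toy = "AAAATTTTCCGG" * 50          # made-up sequence, not real genomic data
    for label, s in (("natural", toy), ("shuffled", shuffled(toy))):
        w_values = [w for _, w, _ in sliding_chargaff(s, 100, 100)]
        print(label, round(sum(w_values) / len(w_values), 2))
```

Comparing the mean window difference of a sequence with that of its shuffled counterpart, across several window sizes, is the kind of contrast the abstract describes.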
0	Academic  Press
0	Introduction When the base composition of natural duplex DNA is determined it is found that the quantities of A and T are equal and the quantities of C and G are equal. This is Chargaff's famous first parity rule (Chargaff, 1951). If a long DNA duplex is cut into two and the base composition of each part determined, the rule is found to hold precisely for the two parts, as for the duplex of origin. This division of the duplex can be continued down to individual bases (pairing with their complementary bases on the opposite strand of the duplex). Again Chargaff's parity rule is obeyed precisely (Watson & Crick, 1953). Disregarding nearest-neighbour influences (Turner, 1996), single base pairs can be regarded as fundamental "accounting units". The summation of these individual accounting units results in the precise A = T and C = G equivalences of duplex DNA sequences. That the equivalences have arisen, and are maintained, because they are of adaptive value to an
0	expected  to  resemble  that  resulting  from  the  tossing  of  a  biased  coin  for  which  heads  (A  or  C)  would  be  slightly  favoured/disfavoured  over  tails  (T  or  G),  respectively,  depending  on  their  relative  proportions  in  the  total  segment.  The  base  composition  o
0	Review:  Proteins  with  Repeated  Sequence--Structural  Prediction  and  Modeling
1	Andrey  V.  Kajava
0	The  relationship  between  the  amino  acid  sequence  and  the  three-dimensional  structure  of  proteins  with  internal  repeats  is  discussed.  In  particular,  correlations  between  the  amino  acid  composition  and  the  ability  to  fold  in  a  unique  structure,  as  well  as  classification  of  the  structures  based  on  their  repeat  length,  are  described.  This  analysis  suggests  rules  that  can  be  used  for  the  structural  prediction  of  repeat-containing  proteins.  The  paper  is  focused  on  prediction  and  modeling  of  solenoid-like  proteins  with  the  repeat  length  ranging  between  5  and  40  residues.  The  models  of  leucine-rich  repeat  proteins  and  bacterial  proteins  with  pentapeptide  repeats  are  examined  in  light  of  the  recently  solved  structures  of  the  related  molecules.  ©  2001  Academic  Press  Key  Words:  classification;  molecular  modeling;  prediction;  tandem  repeats;  structural  bioinformatics.
0	Copyright  ©  2001  by  Academic  Press  All  rights  of  reproduction  in  any  form  reserved.
0	REVIEW:  STRUCTURAL  PREDICTION  OF  REPEAT-CONTAINING  PROTEINS
0	their number has grown to about 40 since then (Groves and Barford, 1999; Kobe and Kajava, 2000). Despite this progress, these proteins are still underrepresented in the structural databases (about 0.5% of all structures), compared with sequence databases (about 5%). This lack of structural information is explained by the fact that the large molecular weight and the elongated shape of these molecules hamper X-ray and NMR studies. These difficulties add importance to the theoretical approaches. In this article, molecular modeling of several solenoid-like proteins will be described and some rules will be formulated for the theoretical prediction and modeling of these types of repetitive proteins.
0	IS  A  PROTEIN  WITH  REPEATS  STRUCTURED  OR  UNSTRUCTURED?
0	This is the first question to answer when approaching a repetitive protein to predict its 3D structure. Most protein molecules fold into only one particular conformation determined by their amino acid sequence. This is especially correct for proteins with aperiodic sequences that fold into globular structures. Unstructured fragments of globular proteins, if any, represent only a minor part of the molecules and are located in loops or connections between stable structural domains. In contrast, proteins with repeats frequently do not have unique stable 3D structures. For example, experimental studies have failed to demonstrate the presence of a unique 3D structure for elastin (Urry et al., 1995), small proline-rich proteins of cell envelopes (Steinert et al., 1999), the circumsporozoite protein of Plasmodium falciparum (Esposito et al., 1989; Dyson et al., 1990), glutenin from wheat (Van Dijk et al., 1997), the serine-rich domain of rtoA protein from Dictyostelium discoideum (Brazill et al., 2000), histidine-proline-rich glycoprotein (Borza et al., 1996), and H1 histones (Hartman et al., 1977). The elastin molecules containing a set of repeats, e.g., VGVAPG and GFGVGAGVP, are unstructured and covalently cross-linked to generate an elastic meshwork that enables tissues such as arteries and lungs to deform and stretch without damage (Urry et al., 1995). The small proline-rich 3 protein of the human cell envelope having GxTKVPEP repeats (here and further in the text, "x" indicates a position with any residue) adopts a loose structure with some regions of protein occasionally folding in β-turn conformations (Steinert et al., 1999). The circumsporozoite protein from P. falciparum, an agent of malaria, comprises a long tandem array of NANP repeats. This repetitive region can be elongated and flexible and may function similarly to the outer cell carbohydrates. The H1 histone molecules are thought to be responsible for pulling chromatin nucleosomes
0	The  Comparative  Genomics  of  Polyglutamine  Repeats:  Extreme  Difference  in  the  Codon  Organization  of  Repeat-Encoding  Regions  Between  Mammals  and  Drosophila
1	M. Mar Albà,1 Mauro F. Santibáñez-Koref,2 John M. Hancock2,*
0	Abstract.  Polyglutamine  repeats  within  proteins  are  common  in  eukaryotes  and  are  associated  with  neurological  diseases  in  humans.  Many  are  encoded  by  tandem  repeats  of  the  codon  CAG  that  are  likely  to  mutate  primarily  by  replication  slippage.  However,  a  recent  study  in  the  yeast  Saccharomyces  cerevisiae  has  indicated  that  many  others  are  encoded  by  mixtures  of  CAG  and  CAA  which  are  less  likely  to  undergo  slippage.  Here  we  attempt  to  estimate  the  proportions  of  polyglutamine  repeats  encoded  by  slippage-prone  structures  in  species  currently  the  subject  of  genome  sequencing  projects.  We  find  a  general  excess  over  random  expectation  of  polyglutamine  repeats  encoded  by  tandem  repeats  of  codons.  We  nevertheless  find  many  repeats  encoded  by  nontandem  codon  structures.  Mammals  and  Drosophila  display  extreme  opposite  patterns.  Drosophila  contains  many  proteins  with  polyglutamine  tracts  but  these  are  generally  encoded  by  interrupted  structures.  These  structures  may  have  been  selected  to  be  resistant  to  slippage.  In  contrast,  mammals  (humans  and  mice)  have  a  high  proportion  of  proteins  in  which  repeats  are  encoded  by  tandem  codon  structures.  In  humans,  these  include  most  of  the  triplet  expansion  disease  genes.
0	Key  words:  Glutamine  repeats  --  Replication  slippage  --  Comparative  genome  analysis  --  Repeat  evolution  --  Triplet  expansion  diseases  --  Triplet  repeats  --  Genome  evolution
0	quences encoding polyglutamine repeats in the yeast genome (Alba et al. 1999a) indicated that the majority does not consist of long runs of single codons, suggesting that in yeast point mutation is an important process in generating polyglutamine repeats. These observations raise the question to what extent the contribution of point mutation and slippage to the evolution of these structures differs in different evolutionary lineages. To study this we have analyzed large protein data sets from a further four model organisms that are currently the subjects of genome sequencing projects (Escherichia coli, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster) and compared them with S. cerevisiae, Mus musculus, and Homo sapiens repeats. The results show similarities and differences between species. For most of the eukaryotic species there is an overrepresentation of tracts encoded by long CAG tandem repeats, supporting the idea that recent slippage has been involved in the generation of a significant proportion of the tracts. However, on average about 70% of the tracts do not show evidence of recent slippage, and in D. melanogaster there is no clear evidence of a strong contribution from slippage. Furthermore, in the two mammalian species about one-third of the tracts are exclusively encoded by CAG and the length of the tracts is on average much longer than in other species. This suggests that slippage has played a more important role in the evolution of polyglutamine regions in mammals than in other taxa.
0	Methods
0	Database Searches
0	BLASTP (Altschul et al. 1990) at the NCBI was used to find all GenBank entries which contained genes encoding long polyglutamine tracts (≥6 glutamines) from E. coli, S. cerevisiae, C. elegans, A. thaliana, D. melanogaster, M. musculus, and H. sapiens. Redundancy in the primary data sets was eliminated by running FASTA within the GCG package (Pearson and Lipman 1988; GCG 1997). Sequences with 95% identity were considered redundant, and only one representative sequence was used in the subsequent analysis. Where there was a discrepancy in the length of the polyglutamine tract in nearly identical sequences, we took the sequence with the longest tract.
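The tract criterion used in the database search can be illustrated with a minimal sketch. It assumes that the threshold quoted in the text is a minimum of six consecutive glutamines and that proteins are given as plain one-letter strings. The function names and the example sequence are hypothetical; this is not the BLASTP/FASTA pipeline the authors ran, only the run-finding step in isolation.

```python
import re

MIN_TRACT = 6  # assumed reading of the text: at least six consecutive glutamines


def find_polyq_tracts(protein, min_len=MIN_TRACT):
    """Return (start, length) for every run of >= min_len consecutive Q residues."""
    pattern = re.compile("Q{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(protein.upper())]


def longest_tract(protein, min_len=MIN_TRACT):
    """Length of the longest qualifying tract, or 0 if there is none."""
    return max((length for _, length in find_polyq_tracts(protein, min_len)), default=0)


# Hypothetical sequence: one run of 8 Q, a short run of 3 Q, one run of 6 Q.
example = "MAT" + "Q" * 8 + "LPQQQA" + "Q" * 6 + "K"
print(find_polyq_tracts(example))  # [(3, 8), (17, 6)]
print(longest_tract(example))      # 8
```

Taking the longest tract per sequence mirrors the rule stated above for nearly identical entries, where the sequence with the longest tract was retained.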
0	Analysis  of  Codon  Repeats
0	We  used  statistical  analysis  to  analyze  two  properties  of  polyglutamine  repeat-encoding  regions.  The  first  was  the  extent  of  deviation  of  the  codon  organization  within  these  regions  from  random.  This  was  measured  by  considering  the  deviation  of  the  length  of  the  longest  run  of  each  codon  type  from  chance  expectation  (Alba  et  al.  1999a,b).  The  second  property  was  the  over-  or  underrepresentation  of  tandem  codon  repeats  of  a  particular  length  in  the  whole  set  of  polyglutamine-coding  regions  in  a  given  species.  Length  of  the  Longest  Homogeneous  Run.  As  described  previously  (Alba  et  al.  1999a,b)  the  organizational  homogeneity  or  otherwise  of  a  region  encoding  a  polyglutamine  repeat  has  to  be  considered  in  the
0	Table 1. Polyglutamine tracts in different species
0	Length of polyglutamine tract
0	Species                    S. cerevisiae  C. elegans  A. thaliana  D. melanogaster  M. musculus  H. sapiens
0	CAG relative frequency^a
0	  Genome                   0.307          0.331       0.442        0.716            0.743        0.674
0	  Tracts                   0.450*         0.430*      0.465 (NS)   0.728 (NS)       0.824*       0.830*
0	Pure codon tracts
0	  CAG                      4.7%           2.2%        2.2%         7.3%             37.2%        26.2%
0	  CAA                      5.4%           5.8%        11.3%        0%               0%           0%
0	Chi-square  test  of  the  r
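The two codon-level quantities discussed under Analysis of Codon Repeats (the longest homogeneous run of a codon, and whether a tract is encoded by a single codon type as in the "pure codon tracts" rows of Table 1) can be sketched as follows. All function names are hypothetical. The Monte Carlo estimate of the chance expectation is an assumption made for illustration; it is not the statistic of Alba et al. (1999a,b), whose details are not given in this text.

```python
import random


def split_codons(dna):
    """Split an in-frame coding sequence into codons."""
    dna = dna.upper()
    if not dna or len(dna) % 3 != 0:
        raise ValueError("sequence must be non-empty and a multiple of 3 long")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def longest_run(codons, codon):
    """Length of the longest uninterrupted run of `codon`."""
    best = run = 0
    for c in codons:
        run = run + 1 if c == codon else 0
        best = max(best, run)
    return best


def classify_tract(dna):
    """Label a glutamine-coding region as pure CAG, pure CAA, or mixed."""
    kinds = set(split_codons(dna))
    if not kinds <= {"CAG", "CAA"}:
        raise ValueError("region contains non-glutamine codons")
    if kinds == {"CAG"}:
        return "pure CAG"
    if kinds == {"CAA"}:
        return "pure CAA"
    return "mixed"


def chance_longest_run(n_codons, p_cag, trials=2000, seed=0):
    """Assumed null model: each codon is CAG with probability p_cag, independently."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sim = ["CAG" if rng.random() < p_cag else "CAA" for _ in range(n_codons)]
        total += longest_run(sim, "CAG")
    return total / trials


region = "CAGCAGCAACAGCAGCAG"  # hypothetical six-codon tract
print(classify_tract(region))                     # mixed
print(longest_run(split_codons(region), "CAG"))   # 3
```

Comparing the observed longest run with `chance_longest_run(n, p)` for the species' CAG relative frequency gives a rough sense of whether a tract is more homogeneous than an independent mixture of codons would predict.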
0	Tendency  for  Local  Repetitiveness  in  Amino  Acid  Usages  in  Modern  Proteins
1	Kazuhisa  Nishizawa1*,  Manami  Nishizawa1  and  Ki  Seok  Kim2
0	Systematic analyses of human proteins show that neural and immune system-specific, and therefore relatively "modern", proteins have a tendency for repetitive use of amino acids at a local scale (~1-20 residues), while ancient proteins (human homologues of Escherichia coli proteins) do not. Those protein subsegments which are unique based on homology search account for the repetitiveness. Simulation shows that such repetitiveness can be maintained by frequent duplication on a very short scale (one to two codons) in the presence of substitutive point mutation, while the latter tends to mitigate the repetitiveness. DNA analyses also show the presence of cryptic (i.e. "out of the codon frame") repetitiveness, which cannot fully be explained by features in protein sequences. Simulative modification of the amino acid sequences of immune system-specific proteins estimates that 2.4 duplication events occur during the period equivalent to ten events of substitution mutation. It is also suggested that the repetitiveness leads to longitudinal unevenness within a given peptide domain. Those peptide motifs which contain similarly charged residues are likely to be generated more frequently in the presence of the tendency for repetitiveness than in its absence. Therefore, the neutral propensity of DNA for duplication, which can also tend to generate repetitiveness in amino acid sequences, seems to be manifested primarily when the constraints on amino acid sequences are relatively weak, and yet may be positively contributing to generation of unevenness in modern proteins.
0	Academic  Press
0	Keywords:  microsatellite;  coding  regions;  peptide  motif;  triplet  repeat
0	Repetitive  Use  of  Amino  Acids
0	Results  and  Discussion
0	Identifying  Differentially  Expressed  Genes  in  cDNA  Microarray  Experiments
0	ABSTRACT A major goal of microarray experiments is to determine which genes are differentially expressed between samples. Differential expression has been assessed by taking ratios of expression levels of different samples at a spot on the array and flagging spots (genes) where the magnitude of the fold difference exceeds some threshold. More recent work has attempted to incorporate the fact that the variability of these ratios is not constant. Most methods are variants of Student's t-test. These variants standardize the ratios by dividing by an estimate of the standard deviation of that ratio; spots with large standardized values are flagged. Estimating these standard deviations requires replication of the measurements, either within a slide or between slides, or the use of a model describing what the standard deviation should be. Starting from considerations of the kinetics driving microarray hybridization, we derive models for the intensity of a replicated spot, when replication is performed within and between arrays. Replication within slides leads to a beta-binomial model, and replication between slides leads to a gamma-Poisson model. These models predict how the variance of a log ratio changes with the total intensity of the signal at the spot, independent of the identity of the gene. Ratios for genes with a small amount of total signal are highly variable, whereas ratios for genes with a large amount of total signal are fairly stable. Log ratios are scaled by the standard deviations given by these functions, giving model-based versions of Studentization. An example is given. Key words: beta-binomial model, microarray replication.
0	BAGGERLY  ET  AL.
0	INTRODUCTION
0	The human biological system is under the control of perhaps 40,000 genes. Genes are the encoded blueprints for the proteins that perform cellular functions. In going from genes to proteins, there is an intermediate step in which DNA is transcribed to single-stranded messenger RNA (mRNA). It is through mRNA that genes produce protein. Most of the time, the levels of mRNA reflect the abundance of the corresponding proteins in the cell. Perturbations of the cellular environment by such factors as radiation, heat, food intake, or genetic mutation lead to altered expression in a specific group of genes. A goal of functional genomics is to apply high-throughput technologies to identify, from the vast number of genes, the few genetic and molecular changes associated with a defined phenotype. Identification of these genes can help us diagnose disease, identify targets for specific therapeutic intervention, or simply understand the basis of the underlying biological processes. A primary tool for functional genomics is the complementary DNA (cDNA) microarray, which is commonly used to measure the relative expression levels of thousands of genes in a given cell population. Using this approach, researchers have successfully found disease-related genes (Bittner et al., 2000; Clark et al., 2000; Fuller et al., 1999), and have developed new molecular classification schemes for cancers (Bittner et al., 2000; Golub et al., 1999). Microarrays are produced in a laboratory by placing thousands of different cDNA clones onto a solid surface: a nylon membrane or a chemically coated glass microscopy slide. 
For example, in a typical experiment we print 4,800 spots in a 4 × 12 format of patches, where each patch contains 100 different spots arranged in a 10 × 10 grid. At each spot, approximately 2 nanograms of a specific gene are deposited by a robotic arrayer. Once on the slide, the originally double-stranded DNA is denatured so that it splits into single strands which are bound to the surface. These single strands are then available to serve as specific attractants to the complementary single-stranded DNA molecules, a process called hybridization. To assess the expression levels of the genes in a given cell population, the cells are broken apart chemically (lysed) and total RNA is isolated according to a standard procedure. Then reverse transcriptase is used to convert the mRNA back into single-stranded complementary DNA, which is more stable than RNA. During the process of reverse transcription, fluorescent dyes or radioactively labeled nucleotides can be incorporated, providing a signal that can be monitored by detectors. Further, two or more different fluorescent dyes can be used to label different samples, thus allowing simultaneous monitoring of two samples on the same microarray. After the labeled cDNA in a solution is obtained, it is placed onto the microarray surface and incubated to allow specific binding to the different DNA molecules bound to the array. We customarily call the immobilized DNA on the microarray "probe" and the labeled DNA in solution "target." (This target/probe dichotomy is, unfortunately, not set; the literature contains both this usage and the converse. 
We have chosen to follow the definition adopted in the January 1999 supplement to Nature Genetics, "The Chipping Forecast.") The amount of probe on the array is assumed to be vastly in excess of the amount of target, so that the amount binding to the probe is a function of the target copy number in the mixture. After washing to remove the nonspecific binding, the hybridized microarray is scanned using a laser scanner (for fluorescence) or a phosphorimager (for radioactive labels). We will focus on fluorescent labeling on glass slides in this paper, but the model proposed also holds for radioactive labeling, since the hybridization kinetics are similar. Both scanners produce computer images of the entire array whose pixel values are processed to estimate the rough amounts associated with individual spots. Unfortunately, these measurements do not correspond perfectly to the true expression levels. Reverse transcription and label incorporation work with different efficiencies for different mRNA sequences, so the relative expression levels of different genes within a sample cannot be measured reliably. However, the relative expression levels of the same gene in two different samples can be measured, as the reverse transcription efficiencies should be about the same. Comparing the images introduces two types of offset that must be corrected for. First, there is a multiplicative offset, a normalization factor, associated with scans being made using different gain settings or using different amounts of raw material in the two samples. Second, there is a background level associated with the nonspot portions of the image, which must be subtracted before comparisons are made. 
Estimating  and  correcting  for  these  offsets  introduces  variation,  which  we  shall  address  below.  For  more  detailed  descriptions  of  the  experimental  protocols  used  in  microarray  preparation,  the  reader  is  referred  to  some  of  the  papers  addressing  protocols  (Eisen  and  Brown,  1999;  Hedge  et  al.,  2000).
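The two offsets described above, an additive background and a multiplicative normalization factor, can be sketched in a few lines. This is an illustration under stated assumptions and not the authors' procedure: the normalization factor is taken here to be the ratio of total background-corrected signal in the two samples, a small floor guards against non-positive intensities, and all names and numbers are hypothetical.

```python
import math


def normalization_factor(spots):
    """Assumed estimator: total corrected signal of sample A over sample B."""
    total_a = sum(sig_a - bg_a for sig_a, bg_a, _, _ in spots)
    total_b = sum(sig_b - bg_b for _, _, sig_b, bg_b in spots)
    return total_a / total_b


def corrected_log_ratio(sig_a, bg_a, sig_b, bg_b, factor=1.0, floor=1.0):
    """log2 ratio after background subtraction and rescaling of sample B."""
    a = max(sig_a - bg_a, floor)             # additive offset removed
    b = max((sig_b - bg_b) * factor, floor)  # multiplicative offset removed
    return math.log2(a / b)


# Each spot: (signal A, background A, signal B, background B), hypothetical values.
spots = [(1200.0, 200.0, 350.0, 100.0), (500.0, 100.0, 550.0, 100.0)]
factor = normalization_factor(spots)
ratios = [corrected_log_ratio(*spot, factor=factor) for spot in spots]
print(factor)     # 2.0
print(ratios[0])  # 1.0
```

Because both the background and the factor are themselves estimated from the image, each correction adds variability to the resulting log ratio, which is the point the text goes on to address.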
0	IDENTIFYING  DIFFERENTIALLY  EXPRESSED  GENES
0	Thus, cDNA microarrays allow us to compare genetic profiles of different samples (Schena et al., 1995, 1996). We may be able to use these profiles to identify genetic markers associated with various diseases by contrasting diseased and healthy tissue. Further, we may arrive at a more objective method of pathology that allows us to identify molecularly distinct subcategories of diseases, paving the way for more focused treatments. Some of this potential is beginning to be realized (Alizadeh et al., 1999; Alon et al., 1999; DeRisi et al., 1997; Eisen and Brown, 1999; Golub et al., 1999; Hughes et al., 2000b; Lee et al., 2000; Pollack et al., 1999; Ross et al., 2000; Scherf et al., 2000). Books on the methodology (Schena, 1999, 2000) are beginning to appear. From a statistical point of view, the initial question to be addressed in comparing relative expression levels is whether an observed difference corresponds to a real difference or simply a statistical fluctuation: How do we assess significance? Early papers (Schena et al., 1995, 1996; DeRisi et al., 1996) focused on sets of genes exhibiting more than a k-fold difference in expression level between samples, where the value of k was chosen more or less arbitrarily. Focusing on fold differences reduces to focusing on ratios, or equivalently log ratios, of expression levels. We prefer log ratios because they visually emphasize the equal importance of ratios of k and 1/k; on the log scale these have the same magnitude and differ only in sign.
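The symmetry of k and 1/k on the log scale, and the fixed k-fold rule of the early papers, can be made concrete with a short sketch. The function names and the choice of base 2 are assumptions made for illustration.

```python
import math


def log_ratio(x, y):
    """log2 of the expression ratio x / y."""
    return math.log2(x / y)


def exceeds_fold(x, y, k):
    """True when x and y differ by at least k-fold in either direction."""
    return abs(log_ratio(x, y)) >= math.log2(k)


print(log_ratio(400.0, 100.0))  # 2.0  (4-fold up)
print(log_ratio(100.0, 400.0))  # -2.0 (4-fold down: same magnitude, opposite sign)
print(exceeds_fold(300.0, 100.0, 2))  # True
print(exceeds_fold(150.0, 100.0, 2))  # False
```

A single threshold on the absolute log ratio therefore treats up- and down-regulation alike, whereas on the raw ratio scale the two cutoffs, k and 1/k, look very different.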
0	Assessing significance: Historical background
0	In the first statistical attack on the problem of assessing when a log ratio is "significant" (Chen et al., 1997), the use of a fixed fold-difference is restated by assuming that the coefficient of variation associated with each signal is constant, but the fold multiple for significance thresholding is chosen in a less ad hoc fashion. The authors assess the overall level of variability associated with the log ratio measurements for a few "housekeeping" genes whose level of expression is assumed to be constant across samples an
0	General  nonlinear  framework  for  the  analysis  of  gene  interaction  via  multivariate  expression  arrays
1	Seungchan  Kim  Edward  R.  Dougherty
1	Michael  L.  Bittner  Yidong  Chen
0	National Institutes of Health National Human Genome Research Institute Laboratory for Cancer Genetics
1	Krishnamoorthy  Sivakumar
1	Paul  Meltzer  Jeffrey  M.  Trent
0	National Institutes of Health National Human Genome Research Institute Laboratory for Cancer Genetics
0	Abstract.  A  cDNA  microarray  is  a  complex  biochemical-optical  system  whose  purpose  is  the  simultaneous  measurement  of  gene  expression  for  thousands  of  genes.  In  this  paper  we  propose  a  general  statistical  approach  to  finding  associations  between  the  expression  patterns  of  genes  via  the  coefficient  of  determination.  This  coefficient  measures  the  degree  to  which  the  transcriptional  levels  of  an  observed  gene  set  can  be  used  to  improve  the  prediction  of  the  transcriptional  state  of  a  target  gene  relative  to  the  best  possible  prediction  in  the  absence  of  observations.  The  method  allows  incorporation  of  knowledge  of  other  conditions  relevant  to  the  prediction,  such  as  the  application  of  particular  stimuli  or  the  presence  of  inactivating  gene  mutations,  as  predictive  elements  affecting  the  expression  level  of  a  given  gene.  Various  aspects  of  the  method  are  discussed:  prediction  quantification,  unconstrained  prediction,  constrained  prediction  using  ternary  perceptrons,  and  design  of  predictors  given  small  numbers  of  replicated  microarrays.  The  method  is  applied  to  a  set  of  genes  undergoing  genotoxic  stress  for  validation  according  to  the  manner  in  which  it  points  toward  previously  known  and  unknown  relationships.  The  entire  procedure  is  supported  by  software  that  can  be  applied  to  large  gene  sets,  has  a  number  of  facilities  to  simplify  data  analysis,  and  provides  graphics  for  visualizing  experimental  data,  multiple  gene  interaction,  and  prediction  logic.  ©  2000  Society  of  Photo-Optical  Instrumentation
0	Sequences and clones for over a million expressed sequenced tagged sites (ESTs) are currently widely available. Characterization of these genes lies behind the ability to collect them. Only 14% of identified clusters contain genes even tenuously associated with a known functionality. One way of gaining insight into a gene's role in cellular activity is to study its expression pattern in a variety of circumstances and contexts, as it responds to its environment and to the action of other genes. Recent methods facilitate large scale surveys of gene expression in which transcript levels can be determined for thousands of genes simultaneously. In particular, cDNA microarrays result from a complex biochemical-optical system incorporating robotic spotting and computer image formation and analysis.1-5 Since transcription control is accomplished by a method which interprets a variety of inputs,6-8 we require analytical tools for expression profile data that can detect the types of multivariate influences on decision making produced by complex genetic networks. In this paper we discuss a statistical-operational framework for finding associations between expression patterns of genes by determining whether knowledge of the transcriptional levels of a small
0	gene set can be used to predict the transcriptional state of another gene. A feature of the method is that it allows one to incorporate knowledge of other conditions, such as the application of particular stimuli or the presence of inactivating gene mutations, as predictive elements, thereby broadening the classes of information that can be simultaneously evaluated in modeling biological decision making. Our focus is on a general framework: the determination-prediction paradigm for analysis of gene interaction, comparison of constrained and unconstrained prediction in the face of limited microarray replications, estimation of the degree of determination given limited replications, interpretation of the results, and software to assist interpretation. Experimental results will be given for the purposes of explanation and verification. A particular instance of the general methodology has been applied in a separate biological paper (see Sec. 4).9 A methodological perspective is important for appreciating the range of applicability of the proposed framework, which is not limited to cDNA microarrays, but can be used for studying interaction in the context of other kinds of arrays. The mechanism of intergene association is not a factor in statistical prediction. The only factor is the ability to predict the target level from the predictor levels. The predictor genes may be upstream or downstream from the target gene in the
0	SPIE
0	October  2000
0	actual genetic network, some may be upstream and some downstream, or they may be distributed about the network in such a way that their relation to the target gene is based on chains of interaction among various intermediate genes. Whatever the relationship of the predicting genes to the predicted, if knowledge of their states allows us to better predict the expression level of the target gene, then we infer there is some relationship: the better the prediction, the stronger the relation. As the first step in carrying out nonlinear genomic prediction on gene expression profiles, data complexity is reduced by thresholding the changes in transcript level into ternary expression data: -1 (down regulated), +1 (up regulated), or 0 (invariant). This simplification is motivated by the way in which analysis is carried out on cDNA microarrays and by the need to collect many samples where gene expression levels vary due to altered cellular states. To find connections between genes, enough conditions must be sampled to detect the independent functioning of different genetic networks. This amount of sampling requires data from numerous arrays. When viewed across many arrays, the absolute intensity of signal detected by each element of the detector in this hybridization-based assay can be seen to vary based both on the process of preparing and printing the EST elements, and the processes of preparing and labeling the cDNA representations of the RNA pools. This problem is solved via internal standardization. 
An algorithm that first calibrates the data internally to each microarray and statistically determines whether the data justify the conclusion that expression is up regulated or down regulated with 99% confidence is used to detect significant changes in the transcript level.10 Requiring a high confidence level ensures that the logical values -1 and +1 represent significant down and up regulation, and do not result from experimental variability.
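The ternary coding and the idea of measuring how much a predictor gene set improves prediction of a target gene can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: a fixed log-ratio cutoff stands in for the 99% confidence calibration, prediction error is measured as a misclassification rate, the predictor is a simple lookup table (majority target value per observed input pattern), and the error is estimated on the same data used to build the table, which is optimistic when replicates are few. The coefficient is computed as the relative decrease in error, (e0 - e) / e0, where e0 is the error of the best prediction made without observing the predictors.

```python
from collections import Counter, defaultdict


def ternarize(log_ratio, cutoff):
    """Map a log ratio to -1 (down), +1 (up), or 0 (invariant); cutoff is assumed."""
    if log_ratio <= -cutoff:
        return -1
    if log_ratio >= cutoff:
        return 1
    return 0


def coefficient_of_determination(predictors, target):
    """Relative error reduction from observing the predictors (resubstitution estimate)."""
    n = len(target)
    # Best prediction without observations: always guess the most common target value.
    err0 = (n - Counter(target).most_common(1)[0][1]) / n
    if err0 == 0:
        return 0.0  # target never changes, so there is nothing to improve on
    # Best table predictor: majority target value for each observed input pattern.
    by_pattern = defaultdict(Counter)
    for x, y in zip(predictors, target):
        by_pattern[tuple(x)][y] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_pattern.values())
    err = (n - correct) / n
    return (err0 - err) / err0


# Hypothetical ternary data for one predictor gene and one target gene.
predictors = [(1,), (1,), (-1,), (-1,), (0,), (0,)]
target = [1, 1, -1, -1, 0, 0]
print(coefficient_of_determination(predictors, target))  # 1.0
```

A value near 1 means the predictor genes almost fully determine the target's state in the sampled conditions, and a value near 0 means they add nothing over guessing the most common state. As the text stresses, neither outcome says anything about the direction or mechanism of the relationship.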
0	Nonlinear  Multivariate  Prediction
0	The purpose of nonlinear multivariate prediction (filtering) is to predict (estimate) the output of a nonlinear system. Consider a system $S$ having inputs $X_1, X_2, \ldots, X_m$ to be observed and measured, along with other inputs, which we may have no way of measuring, and may not even be able to identify (Figure 1). We do not assume a known mechanism by which the output is determined, nor is there an assumption of causality. The prediction problem is to estimate the output of $S$ given only the inputs $X_1, X_2, \ldots, X_m$. As indicated in Figure 1, we view $X_1, X_2, \ldots, X_m$ as input variables to a logical system $L$ that yields a logical value $Y_{\mathrm{pred}}$ that best predicts the value $Y$ that $S$ would provide, given the knowledge of the inputs $X_1, X_2, \ldots, X_m$. Statistical training uses only the fact that $X_1, X_2, \ldots, X_m$ are among the inputs to $S$, the output $Y$ of $S$ can be measured, and a logical system $L$ can be constructed whose output $Y_{\mathrm{pred}}$ statistically approximates $Y$. The underlying scientific assumption is that the full system $S$ is beyond the reach of current technology and our knowledge of $S$ is derived from its effect on observable input variables. The logic of $L$ represents an operational model of our understanding. It is crucial to recognize that this operational model is contingent on existing technology, which determines the inputs that can be observed, the manner in which the inputs are
0	A  Comprehensive  View  of  Regulation  of  Gene  Expression  by  Double-stranded  RNA-mediated  Cell  Signaling*
1	Gary  Geiss§,  Ge  Jin§¶,  Jinjiao  Guo¶,  Roger  Bumgarner,  Michael  G.  Katze,  and  Ganes  C.  Sen¶
0	Double-stranded  (ds)  RNA,  a  common  component  of  virus-infected  cells,  is  a  potent  inducer  of  the  type  I  interferon  and  other  cellular  genes.  For  identifying  the  full  repertoire  of  human  dsRNA-regulated  genes,  a  cDNA  microarray  hybridization  screening  was  conducted  using  mRNA  from  dsRNA-treated  GRE  cells.  Because  these  cells  lack  all  type  I  interferon  genes,  the  possibility  of  gene  induction  by  autocrine  actions  of  interferon  was  eliminated.  Our  screen  identified  175  dsRNA-stimulated  genes  (DSG)  and  95  dsRNA-repressed  genes.  A  subset  of  the  DSGs  was  also  induced  by  different  inflammatory  cytokines  and  viruses  demonstrating  interconnections  among  disparate  signaling  pathways.  Functionally,  the  DSGs  encode  proteins  involved  in  signaling,  apoptosis,  RNA  synthesis,  protein  synthesis  and  processing,  cell  metabolism,  transport,  and  structure.  Induction  of  such  a  diverse  family  of  genes  by  dsRNA  has  major  implications  in  host-virus  interactions  and  in  the  use  of  RNAi  technology  for  functional  ablation  of  specific  genes.
0	Double-stranded  (ds)1  RNA  is  not  a  major  constituent  of  mammalian  cells,  but  many  viruses  produce  it  during  their  replication  cycle  as  either  an  essential  intermediate  for  RNA  synthesis  or  a  byproduct  generated  by  annealing  of  complementary  mRNAs  encoded  by  the  opposite  strands  of  a  DNA  virus  genome  (1).  In  addition,  some  viruses  encode  RNA  species,  such  as  VA  RNA  or  EBER  RNA,  which  have  considerable  ds  structures.  Virtually  nothing  is  known  about  how  dsRNA  affects  viral  and  cellular  gene  expression  and  functions  in  a  virally  infected  cell,  although  the  role  of  PKR,  the  dsRNA-activated  protein  kinase,  in  inhibiting  protein  synthesis  has  been  studied  in  cells  infected  with  a  variety  of  viruses  (2).  In  the  host-virus  interaction  context,  dsRNA  is  closely  associated  with  the  interferon  (IFN)  system.  dsRNA  is  a  potent  inducer  of  type  I  IFN  synthesis  and  is  believed  to  be  the  primary  viral  gene  product  that  causes  IFN  production  by
0	infected cells (3). dsRNA has important roles in IFN actions as well. It is the obligatory activator of two classes of IFN-induced enzymes: PKR, the IFN-induced protein kinase, and 2-5(A) synthetases, whose products activate the latent ribonuclease, RNaseL. Moreover, transcription of some IFN-stimulated genes (ISGs) is also induced by dsRNA (4). That this induction is direct and not mediated by induced IFN was convincingly demonstrated in IFN unresponsive cells and in cells that are devoid of the IFN gene locus (5, 6). Direct induction of some ISGs by dsRNA suggests that the encoded proteins will be induced in virally infected cells without any involvement of IFNs. Thus regulation of viral gene expression by these proteins is relevant for all infected cells, even in the absence of IFN treatment. Several transcription factors, such as NF-κB, IRF-3, and ATF-1, are known to be activated by dsRNA (7). Their activation is mediated by protein kinases including PKR, p38, JNK2, and IKK (7, 8), although the pathways of activation are not completely understood. For genes that are induced by either IFN or dsRNA, the same cis-element regulates their induction by both reagents, but entirely different signaling pathways and transcription factors are used by the two inducers (5). There has not been any attempt to systematically define the full repertoire of dsRNA-regulated genes. Identification of these genes is required not only for revealing the nature of all signaling pathways used by dsRNA but also for defining the set of proteins that are induced by dsRNA or virus infection. 
In  the  current  study,  we  started  this  investigation  using  a  cDNA  microarray  hybridization  analysis  of  RNA  isolated  from  dsRNA-treated  and  -untreated  GRE  cells  that  are  devoid  of  the  type  I  IFN  locus  and  cannot  synthesize  IFNs.  Using  this  approach,  in  the  current  study  we  have  identified  more  than  a  hundred  DSGs,  only  a  few  of  which  were  previously  known  to  be  dsRNA-inducible.  Furthermore  we  also  identified  multiple  down-regulated  genes.  These  genes  were  induced  or  repressed  by  dsRNA  strongly,  rapidly,  and  transiently.  The  encoded  proteins  are  involved  in  a  broad  range  of  cellular  functions  and  metabolic  pathways.
0	EXPERIMENTAL  PROCEDURES
0	dsRNA-regulated  Gene  Expression
0	Identification of dsRNA-regulated Genes (DRGs)--For undertaking a systematic analysis of human DRGs, we chose to use the glioma cell line, GRE (5). These cells lack the type I IFN locus and hence cannot synthesize IFN-β or any of the multiple IFN-α species in response to dsRNA or other stimuli. Because dsRNA treatment of GRE cells cannot induce IFNs, the possibility
0	of secondary induction of the IFN-stimulated genes by autocrine actions of IFNs was eliminated. This consideration was highly pertinent because dsRNA is known to be a potent inducer of IFNs, and several DSGs are known to be induced by IFN as well. GRE cells were treated with the dsRNA, poly(I)·poly(C), for 6 h, and poly(A)+ RNA was isolated from treated and untreated cells. We chose a treatment length of 6 h because our previous studies have shown that this is the optimum time for induction of 561 mRNA, which encodes the 56 kDa protein, P56 (5). The two sets of
0	Copyright  1997  by  the  American  Chemical  Society
0	The  Efficiency  of  Light-Directed  Synthesis  of  DNA  Arrays  on  Glass  Substrates
1	Glenn  H.  McGall,*  Anthony  D.  Barone,  Martin  Diggelmann,  Stephen  P.  A.  Fodor,  Erik  Gentalen,  and  Nam  Ngo
0	building  blocks  in  combination  with  polymeric  semiconductor  photoresist  films  as  the  photoimageable  component.3  The  development  of  chemistry  and  processes  for  DNA  array
0	Scheme 1
0	(acetic  anhydride/1-methylimidazole/2,6-lutidine/THF)  and  oxidation  (I2/pyridine-H2O).7  After  removing  the  acyl  protecting  groups  from  the  bound  fluorescein,  relative  densities  of  hydroxyl  groups  in  different  regions  of  the  support  could  then  be  determined  from  surface  fluorescence  intensities.
0	For the purpose of this study, it was not necessary to achieve an absolute measure of the amount of bound fluorescein in any given region of the substrate, although the photon-counting capability of the fluorescence microscope would, in principle, enable one to do so. Instead, differences in surface fluorescence were used to obtain relative values for surface density, providing a simple, internally consistent method for measuring chemical and photochemical efficiencies.
0	Beaucage,  S.  L.  In  Protocols  for  Oligonucleotides  and  Analogs;  Agrawal,  S.,  Ed.;  Humana  Press:  Totowa,  New  Jersey,  1993;  pp  33-61.
0	Scheme 2
0	One potential source of interference with this kind of analysis is fluorescence quenching due to energy transfer interactions between adjacent fluorophores on the surface. The initial density of surface functional groups on the silanated glass substrates that were used in this work has been estimated to be in the range of 10-30 pmol/cm² (ref 6). Assuming that the initial silanation of the support g
0	AAAI  Press
0	The  value  of  prior  knowledge  in  discovering  motifs  with  MEME
1	Timothy  L.  Bailey  and  Charles  Elkan
0	MEME  is  a  tool  for  discovering  motifs  in  sets  of  protein  or  DNA  sequences.  This  paper  describes  several  extensions  to  MEME  which  increase  its  ability  to  find  motifs  in  a  totally  unsupervised  fashion,  but  which  also  allow  it  to  benefit  when  prior  knowledge  is  available.  When  no  background  knowledge  is  asserted,  MEME  obtains  increased  robustness  from  a  method  for  determining  motif  widths  automatically,  and  from  probabilistic  models  that  allow  motifs  to  be  absent  in  some  input  sequences.  On  the  other  hand,  MEME  can  exploit  prior  knowledge  about  a  motif  being  present  in  all  input  sequences,  about  the  length  of  a  motif  and  whether  it  is  a  palindrome,  and  (using  Dirichlet  mixtures)  about  expected  patterns  in  individual  motif  positions.  Extensive  experiments  are  reported  which  support  the  claim  that  MEME  benefits  from,  but  does  not  require,  background  knowledge.  The  experiments  use  seven  previously  studied  DNA  and  protein  sequence  families  and  75  of  the  protein  families  documented  in  the  Prosite  database  of  sites  and  patterns,  Release  11.1.
0	The new sequence model type allows each sequence in the training set to have exactly zero or one occurrences of each motif. This type of model is ideally suited to discovering multiple motifs in the majority of cases encountered in practice. The motif-width heuristic allows MEME to automatically discover several motifs of differing, unknown widths in a single DNA or protein dataset. We also describe an improved method of finding multiple, different motifs in a single dataset.
0	Overview  of  MEME
0	The  principal  input  to  MEME  is  a  set  of  DNA  or  protein  sequences.  Its  principal  output  is  a  series  of  probabilistic  sequence  models,  each  corresponding  to  one  motif,  whose  parameters  have  been  estimated  by  expectation  maximization  (Dempster,  Laird,  &  Rubin  1977).  In  a  nutshell,  MEME's  algorithm  is  a  combination  of  expectation  maximization  (EM),
0	OOPS,  ZOOPS,  and  TCM  models
0	The  different  types  of  sequence  model  supported  by  MEME  make  differing  assumptions  about  how  and  where  motif  occurrences  appear  in  the  dataset.  We  call  the  simplest  model  type  OOPS  since  it  assumes  that  there  is  exactly  one  occurrence  per  sequence  of  the  motif  in  the  dataset.  This  type  of  model  was  introduced  by  Lawrence  &  Reilly  (1990).  This  paper  describes  for  the  first  time  a  generalization  of  OOPS,  called  ZOOPS,  which  assumes  zero  or  one  motif  occurrences  per  dataset  sequence.  Finally,  TCM  (two-component  mixture)  models  assume  that  there
0	Supported  by  NIH  Genome  Analysis  Pre-Doctoral  Training  Grant  No.  HG00005.
0	MEME  is  an  unsupervised  learning  algorithm  for  discovering  motifs  in  sets  of  protein  or  DNA  sequences.  This  paper  describes  the  third  version  of  MEME.  Earlier  versions  were  described  previously  (Bailey  &  Elkan  1994),  (Bailey  &  Elkan  1995a).  The  MEME  extensions  on  which  this  paper  focuses  are  methods  of  incorporating  background  knowledge,  or  coping  with  its  lack.  For  incorporating  background  knowledge,  these  innovations  include  automatic  detection  of  inverse-complement  palindromes  in  DNA  sequence  datasets,  and  using  Dirichlet  mixture  priors  with  protein  sequence  datasets.  Dirichlet  mixture  priors  bring  information  about  which  amino  acids  share  common  properties  and  thus  are  likely  to  be  interchangeable  in  a  given  position  in  a  protein  motif.  This  paper  also  describes  a  new  type  of  sequence  model  and  a  new  heuristic  for  automatically  determining  the  width  of  a  motif  which  remove  the  need  for  the  user  to  provide  two  types  of  information.
0	an  EM-based  heuristic  for  choosing  the  starting  point  for  EM,  a  maximum  likelihood  ratio-based  (LRT-based)  heuristic  for  determining  the  best  number  of  model  free  parameters,  multistart  for  searching  over  possible  motif  widths,  and  greedy  search  for  finding  multiple  motifs.
0	for  .  The  last  column  is  an  inverted  version  of  the  first  column,  the  second  to  last  column  is  an  inverted  version  of  the  second  column,  and  so  on.  As  will  be  described  below,  MEME  automatically  chooses  whether  or  not  to  enforce  the  palindrome  constraint,  doing  so  only  if  it  improves  the  value  of  the  LRT-based  objective  function.
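The palindrome constraint described above can be illustrated with a minimal Python sketch. It is not MEME's implementation: the function names are hypothetical, and tying mirrored columns by simple averaging is an assumption made only for illustration. The sketch shows what "inverse complement" means for a sequence, and what it means for the last motif column to be an inverted version of the first.

```python
# Watson-Crick complements for the DNA alphabet.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def inverse_complement(seq):
    """Return the inverse (reverse) complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindrome(seq):
    """A DNA palindrome equals its own inverse complement."""
    return seq == inverse_complement(seq)


def enforce_palindrome(motif):
    """Tie mirrored motif columns together (illustrative averaging rule).

    motif is a list of W columns; each column maps a letter to its
    probability. In the result, the probability of letter a in column j
    equals the probability of the complement of a in column W-1-j.
    """
    width = len(motif)
    tied = []
    for j in range(width):
        mirror = motif[width - 1 - j]
        tied.append(
            {a: 0.5 * (motif[j][a] + mirror[COMPLEMENT[a]]) for a in "ACGT"}
        )
    return tied
```

For example, `is_palindrome("GAATTC")` holds because reversing the string and complementing each base reproduces the original. Each tied column still sums to one, since it is the average of two probability distributions.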
0	Expectation  maximization
0	Consider searching for a single motif in a set of sequences by fitting one of the three sequence model types to it. The dataset of $n$ sequences, each of length $L$, will be referred to as $X$. There are $m = L - W + 1$ possible starting positions for a motif occurrence of width $W$ in each sequence. The starting point(s) of the occurrence(s) of the motif, if any, in each of the sequences are unknown and are represented by the variables $Z = \{Z_{i,j}\}$ (called the "missing information"), where $Z_{i,j} = 1$ if a motif occurrence starts in position $j$ in sequence $X_i$, and $Z_{i,j} = 0$ otherwise. The user selects one of the three types of model and MEME attempts to maximize the likelihood function of a model of that type given the data, $\Pr(X \mid \phi)$, where $\phi$ is a vector containing all the parameters of the model. MEME does this by using EM to maximize the expectation of the joint likelihood of the model given the data and the missing information, $\Pr(X, Z \mid \phi)$. This is done iteratively by repeating the following two steps, in order, until a convergence criterion is met.
0	E-step: compute $Z^{(t)} = E_{(Z \mid X, \phi^{(t)})}[Z]$
0	M-step: solve $\phi^{(t+1)} = \arg\max_{\phi} E_{(Z \mid X, \phi^{(t)})}[\log \Pr(X, Z \mid \phi)]$
0	DNA  palindromes
0	where $\phi$ is a vector containing all the parameters of the model. This process is known to converge (Dempster, Laird, & Rubin 1977) to a local maximum of the likelihood function $\Pr(X \mid \phi)$. Joint likelihood functions. MEME assumes each sequence in the training set is an independent sample from a member of either the OOPS, ZOOPS or TCM model families and uses EM to maximize one of the following likelihood functions. The logarithm of the joint likelihood for models
0	It  is  not  necessary  that  all  of  the  sequences  be  of  the  same  length,  but  this  assumption  will  be  made  in  what  follows  in  order  to  simplify  the  exposition  of  the  algorithm.  In  particular,  under  this  assumption,  .
0	That  is,
0	A  DNA  palindrome  is  a  sequence  whose  inverse  complement  is  the  same  as  the  original  sequence.  DNA  binding  sites  for  proteins  are  often  palindromes.  MEME  models  a  DNA  palindrome  by  constraining  the  parameters  of  corresponding  columns  of  a  motif  to  be  the  same:
0	Here, $p_{a,j}$ is the probability of letter $a$ occurring at either a background position ($j = 0$) or at position $j$ of a motif occurrence ($1 \le j \le W$), $\theta_0$ is the parameters of the background component of the sequence model, and $\theta_1$ is the parameters of the motif component. Formally, the parameters of an OOPS model are the letter frequencies for the background and each column of the motif, and the width $W$ of the motif. The ZOOPS model type adds a new parameter, $\gamma$, which is the prior probability of a sequence containing a motif occurrence. A TCM model, which allows any number of (non-overlapping) motif occurrences to exist within a sequence, replaces $\gamma$ with $\lambda$, where $\lambda$ is the prior probability that any position in a sequence is the start of a motif occurrence.
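To make the roles of the background and motif letter frequencies concrete, the following Python sketch runs one EM iteration for an OOPS-type model (exactly one motif occurrence per sequence). It is a simplified illustration under stated assumptions, not MEME's code: the function name `em_step_oops`, the uniform prior over start positions, and the fixed additive pseudocount are all assumptions introduced here, and all background frequencies are assumed to be non-zero.

```python
def em_step_oops(seqs, background, motif, pseudo=0.01):
    """One EM iteration for a one-occurrence-per-sequence motif model.

    seqs:       list of equal-length strings over the model alphabet
    background: dict letter -> probability (all assumed non-zero)
    motif:      list of W dicts, one per motif column, letter -> probability
    pseudo:     additive pseudocount (an assumption of this sketch)
    Returns (z, new_background, new_motif), where z[i][j] is the posterior
    probability that the occurrence in sequence i starts at position j.
    """
    width = len(motif)
    alphabet = sorted(background)

    # E-step: with a uniform prior over start positions, the posterior is
    # proportional to the motif/background likelihood ratio of each window.
    z = []
    for seq in seqs:
        ratios = []
        for start in range(len(seq) - width + 1):
            ratio = 1.0
            for k in range(width):
                letter = seq[start + k]
                ratio *= motif[k][letter] / background[letter]
            ratios.append(ratio)
        total = sum(ratios)
        z.append([r / total for r in ratios])

    # M-step: re-estimate letter frequencies from expected counts.
    motif_counts = [{a: 0.0 for a in alphabet} for _ in range(width)]
    letter_counts = {a: 0.0 for a in alphabet}
    for seq, posterior in zip(seqs, z):
        for letter in seq:
            letter_counts[letter] += 1.0
        for start, weight in enumerate(posterior):
            for k in range(width):
                motif_counts[k][seq[start + k]] += weight

    new_motif = []
    for column in motif_counts:
        norm = sum(column.values()) + pseudo * len(alphabet)
        new_motif.append({a: (column[a] + pseudo) / norm for a in alphabet})

    # Background counts are whatever the motif columns did not claim.
    bg_counts = {
        a: letter_counts[a] - sum(col[a] for col in motif_counts) + pseudo
        for a in alphabet
    }
    bg_norm = sum(bg_counts.values())
    new_background = {a: bg_counts[a] / bg_norm for a in alphabet}
    return z, new_background, new_motif
```

Repeating this step until the parameters stop changing gives the iterative scheme described in the text. A ZOOPS or TCM variant would additionally carry a prior probability that a sequence (or a position) contains an occurrence, which this sketch omits.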
0	rGFd
0	are  zero  or  more  non-overlapping  occurrences  of  the  motif  in  each  sequence  in  the  dataset,  as  described  by  Bailey  &  Elkan  (1994).  Each  of  these  types  of  sequence  model  consists  of  two  components  which  model,  respectively,  the  motif  and  nonmotif  ("background")  positions  in  sequences.  A  motif  is  modeled  by  a  sequence  of  discrete  random  variables  whose  parameters  give  the  probabilities  of  each  of  the  different  letters  (4  in  the  case  of  DNA,  20  in  the  case  of  proteins)  occurring  in  each  of  the  different  positions  in  an  occurrence  of  the  motif.  The  background  positions  in  the  sequences  are  modeled  by  a  single  discrete  random  variable.  If  the  width  of  the  motif  is  ,  and  the  alphabet  for  sequences  is  ,  we  can  describe  the  parameters  of  the  two  components  of  each  of  the  three  model  types  in  the  same  way  as
0	For  a  ZOOPS  model,  the  joint  log  likelihood  is
0	For  a  ZOOPS  model,
0	For  a  TCM  model,
0	The  M-step.  The  M-step  of  EM  in  MEME  reestimates  using  the  following  formula  for  models  of  all  three  types:
0	if  otherwise.
0	Finding  multiple  motifs
0	All  three  sequence  model  types  supported  by  MEME  model  sequences  containing  a  single  motif  (albeit  a  TCM  model  can  describe  sequences  with  multiple  occurrences  of  the  same  motif).  To  find  multiple,  non-overlapping,  different  motifs  in  a  single  dataset,  MEME  uses  greedy  search.  It  incorporates  information  about  the  motifs  already  discovered  into  the  current  model  to  avoid  rediscovering  the  same  motif.  The  process  of  discovering  one  motif  is  called  a  pass  of
0	The conditional probability of a length-$W$ subsequence generated according to the background or motif component of a TCM model is defined to be
0	is  a  vector-valued  indicator  variable  of  lengt
0	New  topical  antiandrogenic  formulations  can  stimulate  hair  growth  in  human  bald  scalp  grafted  onto  mice
1	Amnon  Sintov  a,*,  Sima  Serafimovich  b,  Amos  Gilhar  b
0	Keywords:  Androgenetic  alopecia;  Flutamide;  Finasteride;  Topical  drug  delivery;  Skin  permeation;  Mice
0	Introduction  Testosterone  metabolites  exert  a  significant  hormonal  influence  on  hair  growth  by  interacting  with  receptors  at  the  follicular  papilla.  It  has  long  been  known  that  an  increased  susceptibility  of
0	scalp follicles to these androgens is the main cause of androgenetic alopecia (or male-pattern baldness) in genetically predisposed individuals (Imperato-McGinley et al., 1974; Ebling et al., 1991). In this type of alopecia, scalp follicles exhibit increased levels and activity of scalp 5α-reductase isoenzyme, which converts testosterone (T) to dihydrotestosterone (DHT) (Bingham and Shaw, 1973; Schweikert and Wilson, 1974). Taken together, increased conversion of T to DHT and
0	increased DHT binding capacity in bald scalp as compared to hairy scalp (Sawaya et al., 1989) provide a mechanistic explanation for androgenetic alopecia. DHT shortens the hair cycle and progressively miniaturizes scalp follicles. The miniaturized follicles all remain present and thus the possibility of reversal by re-enlargement exists. It is reasonable, therefore, to suppose that administration of 5α-reductase inhibitors and/or non-steroidal antiandrogens should bring about this reversal. Finasteride, a 4-azasteroid inhibitor of 5α-reductase, was introduced by Merck in 1989. Finasteride is known to inhibit the prostate 5α-reductase isoenzyme type 2 more effectively than the type 1 isoenzyme predominantly found in the skin of the scalp. However, while type 1 isoenzyme is located in the sebaceous glands, there is still significant activity of type 2 isoenzyme in the hair follicles (Sawaya and Price, 1997). This explains why finasteride decreased the level of DHT in bald scalps after long-term oral administration (Diani et al., 1992; Dallob et al., 1994); it also provides the justification for the topical mode of delivery. It should be emphasized that oral finasteride has already been introduced as an effective hair growth treatment, with only minor systemic adverse effects. Nevertheless, systemic therapy for a disorder such as male-pattern baldness is not the treatment of choice if topical delivery is an available option. Another agent with hair growth potential is the nonsteroidal anti-androgen flutamide. This drug, produced by Schering-Plough, was introduced as a new potent compound for treatment of prostatic carcinoma (Martindale, 1993).
The systemic administration of flutamide causes several unwanted side effects, such as reducing libido and impairing spermatogenesis in men and feminizing male fetuses in pregnant women. Topical administration, therefore, is an important goal for such a drug, especially if indicated for skin disorders. In a comparative study, Chen et al. (1995) showed that topical administration of finasteride (in ethanol/propylene glycol vehicle) caused local inhibition of androgen-controlled sebaceous gland growth in hamster flank organ and that it had a
0	similar action to that of the same doses of flutamide. To date, no clinical studies have tested the efficacy of topical flutamide in male-pattern baldness. It is likely that the success of this drug (i.e. efficacy with minimal systemic exposure) would depend on a well-designed vehicle that increases skin accumulation and decreases percutaneous absorption. In this paper, we present a new topical base formulation for finasteride and flutamide (representing two anti-DHT categories). We studied the effect of the topical preparations of these two compounds on the growth of human hair in a murine transplantation model. The effect was monitored in scalp skin biopsies taken from bald subjects before plastic surgery procedures. This model, which has been described previously by Gilhar et al. (1988), Van Neste (1996) and De Brouwer et al. (1997), is specific to male-pattern baldness, in which hairs of the bald skin graft do not re-enlarge after transplantation, while the hairs of grafts taken from patients with alopecia areata (an auto-immune problem) begin to grow shortly after transplantation (Gilhar and Krueger, 1987). To correlate the pharmacological efficacy of the new drug-vehicle system with its cutaneous penetration properties, topical preparations containing flutamide were tested in vitro using excised hairless mouse skin.
0	Materials  and  methods
0	Formulation
0	Gel preparations containing 1% of flutamide (Eulexin, Schering-Plough Lab., Belgium) or finasteride (Proscar®, Merck Sharp & Dohme, UK) were produced as follows. The drug was dissolved in ethyl alcohol (30% w/w in the final gel for flutamide, and 58% w/w in the final gel for finasteride); then 1% glyceryl oleate (as an enhancer) and distilled water were added gradually with mixing. The solutions were finally gelled by adding 4% hydroxypropyl methylcellulose (for flutamide) or ethylcellulose (for finasteride). A vehicle corresponding to the flutamide formulation
0	but containing no drug was prepared for the purpose of in vivo comparison. In addition, a 1% flutamide formulation without enhancer was prepared and tested in vitro together with the formulation containing the enhancer (as described above) and a hydroalcoholic formulation (1:1 ethanol-water).
0	the  subcutaneous  tissue  over  the  lateral  thoracic  cage  of  each  mouse,  and  covered  with  a  standard  band  aid  dressing.  The  dressing  was  removed  on  day  7,  and  the  grafts,  which  were  located  at  the  surface,  were  treated  from  day  8  for  60  days  as  described  below.  The  procedure  protocol  related  to  animals  was  reviewed  and  approved  by  the  Institutional  Animal  Care  and  Use  Committee.
0	Animals
0	Severe combined immune deficient mice (male Prkdc SCID-Charles River, UK), 2-3 months of age, were used in this study. The mice were grown in a pathogen-free animal facility.
0	Treatment
0	Specimens of each topical preparation, 20-30 mg, were spread gently over each transplanted
0	Skin  grafting
0	Punch grafts, 0.5 mm², obtained from scalp skin of five bald men were used for transplantation to the SCID mice (three grafts per mouse). The transplantation procedure was performed as previously described (Gilhar et al., 1988). Each graft was inserted, through an incision in the skin, into
0	Table 1. Distribution of the histological hair structures in the treated grafts
0	Treatment            Anagen (%)   Catagen (%)   Telogen (%)
0	Before treatment     0            35.7          64.2
0	Finasteride          30.4         22.8          46.8
0	Flutamide            47.0         26.5          26.5
0	Vehicle (control)    10.5         24.6          64.9
0	a No difference between groups was found for T or DHT (P > 0.05).
0	scopically  in  the  horizontal  sections  with  the  aid  of  a  calibrated  ocular  micrometer.  Hair  structures  in  the  histological  specimens  were  counted.
0	In vitro permeation testing
0	The  in  vitro  diffusion  of  a  topical  drug  through  skin  (in  which  the  flux  of  the  drug  molecules  through  human  cadaver  or  animal  skin  is  determined)  was  performed  basically  according  to  the  FDA  guidelines  (Skelly  et  al.,  1987).  Bas
0	Ecdysone-regulated  puff  genes  2000
1	C.S.  Thummel
0	Keywords:  Ecdysone;  Drosophila  metamorphosis;  Gene  regulation
0	these  hormones  could  act  directly  on  the  nucleus,  triggering  a  complex  regulatory  cascade  of  gene  expression  (Yamamoto  and  Alberts,  1976).  Through  a  series  of  detailed  and  elegant  studies,  Ashburner  and  co-workers  proposed  a  model  for  the  regulation  of  gene  expression  by  20-hydroxyecdysone  (referred  to  hereafter  as  ecdysone)  (Fig.  1).  Briefly,  this  model  proposed  that  ecdysone,  bound  to  its  specific  receptor,  directly  induces  the  expression  of  a  small  set  of  early  regulatory  genes.  The  protein  products  of  these  genes,  in  turn,  repress  their  own  expression  and  induce  a  much  larger  set  of  late  target  genes.  It  was  assumed  that  these  late  genes  would  function  as  effectors  that  directly  or  indirectly  control  the  appropriate  biological  responses  to  the  pulse  of  ecdysone.  Ashburner  and  colleagues  also  determined  that  the  late  puffs  could  be  divided  into  two  classes,  based  on  their  regulation  by  ecdysone  (Ashburner  and  Richards,  1976).  The  early-late  puffs  are  induced  relatively  rapidly  after  the  addition  of  hormone  and  require  the  continuous  presence  of  ecdysone  for  their  activity,  much  like  the  early  puffs.  The  late-late  puffs,  in  contrast,  are  induced  at  later  times  and  are  prematurely  induced  upon  ecdysone  withdrawal.  This  latter  result  was  interpreted  to  mean  that  the  ecdy-
0	E63-1: an ecdysone-inducible calcium binding protein that can regulate salivary gland glue secretion
0	Molecular analysis of the 63F early puff provided the first evidence that not all early puffs encode transcriptional regulators. This work identified a pair of divergently transcribed ecdysone-inducible genes: E63-1 and E63-2 (Andres and Thummel, 1995). E63-2 produces a single 1.2 kb mRNA with no extended open reading frames. Genetic studies indicate that this gene has no essential functions during development, suggesting that it may only be expressed due to its proximity to E63-1 (Vaskova et al., 2000). In contrast, E63-1 encodes a calcium-binding protein with four EF hands, most closely related to calmodulin. The regulation of E63-1 provides a further departure from prior studies of early puff genes, in that it is induced by ecdysone in a tissue-specific manner. Low to moderate levels of E63-1 are widely expressed in third instar larvae, prior to the late larval ecdysone pulse. Only in the salivary gland is E63-1 transcription rapidly and directly induced by the hormone at puparium formation (Andres and Thummel, 1995). This restricted pattern of induction, combined with the known role of calcium-binding proteins in regulating secretion, led to the proposal that E63-1 might contribute to the physiology of the salivary gland by regulating ecdysone-induced secretion. Although loss-of-function mutants provide an ideal means of testing this model, inactivation of the E63-1 gene has no detectable effect on viability or reproduction (Vaskova et al., 2000). In retrospect, this is not surprising, given that other calcium-binding proteins are encoded by the Drosophila genome.
Consistent with possible functional redundancy in this pathway, recent studies have shown that salivary glands compromised for both calmodulin and E63-1 are defective in glue secretion (T.V. Do and A.J. Andres, personal communication). In addition, ectopic expression of E63-1 in transgenic animals is sufficient to trigger glue secretion if the intracellular calcium levels are elevated (A. Biyasheva et al., 2001). Moreover, ecdysone alone can lead to increased levels of intracellular calcium in larval salivary glands, with a detectable increase after 2 h of exposure. Ecdysone thus leads to two responses that can synergistically trigger salivary gland glue secretion: increased levels of E63-1 expression and increased cytoplasmic calcium levels (Fig. 2). Although the time frame for calcium elevation suggests that this is a secondary response to the hormone, the mechanism by which calcium levels are affected remains to be determined. E63-1 protein shows dynamic changes in its subcellular distribution as the salivary glands secrete glue, providing further evidence of a possible role in glue secretion (Vaskova et al., 2000). Initially, before the glue is secreted, E63-1 is localized to cell membranes, in the
0	The E23 early puff gene may regulate ecdysone responses by controlling intracellular hormone concentrations
0	The 23E ecdysone-inducible puff is among the last early puffs described by Ashburner to be
0	Special  Feature
0	Signalling  by  CD95  and  TNF  receptors:  Not  only  life  and  death
0	Walter  and  Eliza  Hall  Institute  of  Medical  Research,  Royal  Melbourne  Hospital,  Parkville,  Victoria,  Australia
0	Summary
0	Members of the TNF family of receptors play important roles in normal physiology and in defence. The recent rapid progress in the understanding of the mechanisms of apoptosis has been accompanied by assumptions that TNF family receptors such as CD95 (Fas/APO-1) only have a role in regulating cell survival. While regulation of cell death is one important function of TNF family receptors, they are capable of activating signal transduction pathways that have many other effects. The present review will focus on signalling of some TNF family receptors in the immune system, not only for apoptosis, but also for survival or activation.
0	Key words: apoptosis, CD95, NF-κB, signal transduction, TNF receptors.
0	TNF  receptor  family
0	The  tumour  necrosis  factor  receptor  (TNFR)/nerve  growth  factor  receptor  (NGFR)  family  of  molecules  regulate  a  number  of  biological  functions,  such  as  growth,  differentiation  and  apoptosis  in  multiple  cell  types.  In  the  immune  system,  members  of  this  receptor  family  are  involved  in  the  development  of  peripheral  lymphoid  organs,  regulation  of  induced  inflammatory  responses  and  removal  of  cells  at  the  end  of  an  immune  response.  The  TNFR  family  consists  of  more  than  15  different  molecules.  Most  are  type  I  membrane  proteins  which  resemble  each  other  largely  in  their  extracellular  regions,  which  all  contain  2-6  characteristic  cysteine-rich  domains.1  The  TNF  family  receptors  are  activated  upon  binding  of  their  cognate  ligands,  most  of  which  are  trimers  with  a  structure  similar  to  TNF.  Sometimes  the  ligands  are  cell  bound  type  II  membrane  proteins,  but  several  are  cleaved  off  and  appear  as  soluble  trimers.  Induction  of  trimers  or  higher  order  complexes  of  the  TNF  family  of  receptors  allows  their  cytoplasmic  domains  to  aggregate  intracytoplasmic  signalling  molecules.
0	so-called  because  it  is  required  for  these  receptors  to  transmit  apoptotic  signals.  The  DD  is  a  protein-protein  interaction  motif  consisting  of  six  alpha  helices  that  allow  two  proteins  with  DD  to  bind  to  each  other.  Structurally  the  DD  is  related  to  two  other  homotypic  interaction  domains,  the  death  effector  domain  (DED),  and  the  caspase  recruitment  domain  (CARD).2
0	Death  domain  adaptors:  TRADD,  FADD,  RIP  and  RAIDD
0	Binding of TNF to TNFR1 induces recruitment of the DD-containing protein TRADD to the DD of TNFR1.3 Overexpression of TRADD alone also induces the TNF-regulated responses of apoptosis and activation of the transcription factor NF-κB and of Jun kinase (JNK), presumably because TRADD provides docking sites for downstream signalling proteins to the receptor complex.4 Two of the proteins that TRADD recruits to the signalling complex also bear death domains. One of these, RIP, has an N-terminal DD and a C-terminal kinase domain. Knockout studies have shown that RIP is required for induction of NF-κB by TNF.5 The other, Fas-associated protein with death domain (FADD), has a C-terminal DD and an N-terminal DED. FADD is required for cell death signalling by TNFR1 and also by CD95, to which it binds directly via its death domain.6-8 The DED of FADD allows it to bind to the DED in the pro-domain of caspase 8. Through these interactions, ligation of TNFR1 or CD95 can result in the formation of a death-inducing signalling complex, which leads to activation of caspase 8, a cell death effector protease. Once activated, caspase 8 cleaves and activates downstream caspases, such as caspase 3, ultimately leading to cell death. Because cells from mice lacking caspase 8 are resistant to death induced by TNF receptors, CD95 and DR3, apoptosis triggered by all of these receptors must converge on this caspase.9 However, FADD must have other functions because FADD knockout mice die during embryogenesis, and lymphocytes from FADD-dominant negative transgenic mice do not proliferate normally in response to T cell mitogens in vitro.10-12
0	Signalling  pathways  controlled  by  TNF  receptors
0	The  cytoplasmic  domains  of  the  TNFR  family,  which  are  more  diverse  than  the  extracellular  portions,  do  not  have  any  intrinsic  enzymatic  activity,  hence  they  signal  by  inducing  aggregation  of  intracellular  adaptor  molecules  (Fig.  1).
0	Death  domains
0	The  cytoplasmic  domains  of  TNFR1  (p55),  CD95  (Fas/  APO-1),  NGFR  (p75),  death  receptor  (DR)  3,  TRAIL-R1  and  TRAIL-R2  all  bear  a  motif  termed  a  `death  domain'  (DD),
1	C  Magnusson  and  DL  Vaux
0	The TNF receptor-associated factors (TRAF) interact with members of the TNFR family. Six TRAF proteins have been identified to date: TRAF1, TRAF2, TRAF3 (CRAF, LAP-1, CD40-bp), TRAF4 (CART1), TRAF5 and TRAF6 (review18). With the exception of TRAF4, TRAF proteins interact with receptor molecules either directly, or indirectly through binding to other TRAF, or through binding to TRADD. The TNFR2 (p75), CD40, CD30 and lymphotoxin-β receptor (LTβR) contain conserved, cytoplasmic TRAF binding motifs and are able to bind directly to TRAF proteins. Because TRAF2 can bind to TRADD, which in turn can associate with TNFR1, TRAF2 can indirectly participate in signalling from this receptor as well. The TRAF molecules share similar C-terminal domains, designated the TRAF domain, which is involved in protein-protein interactions. TRAF2, TRAF3, TRAF5 and TRAF6 also bear an N-terminal RING finger, a zinc binding motif found in several types of intracellular proteins.19-23 TNF receptor-associated factor proteins interact as homodimers or in heterodimeric complexes. For example, TRAF2 binds to TRADD, the TNFR2, LTβR, CD40 or CD30 via its C-terminal TRAF domain, probably as a heterodimeric complex with TRAF1 or TRAF5, or as a homodimer.18,19 It has also been shown that TRAF proteins may signal from other receptors in addition to TNFR family molecules.
TRAF6,  which  binds  to  CD40,  is  also  involved  in  IL-1  receptor  signalling  through  interaction  with  IRAK,  a  serine/  threonine  kinase  that  also  has  a  DD.24  Studies  of  TRAF2  and  TRAF3  knockout  mice  have  shown  that  TRAF  proteins  are  required  for  activation  of  Jun/AP-1  signalling  by  TNF  receptors,  and  have  important  roles  for  normal  development,  since  these  mice  die  during  early  life.25,26
0	RIP is an adaptor protein with a C-terminal death domain that can associate with the DD in the cytoplasmic domain of CD95. Via TRADD, RIP can also associate with the TNFR1.4 Cells from RIP knockout mice show increased susceptibility to TNF-mediated killing and fail to activate NF-κB in response to TNF.5 This indicates that RIP is required for NF-κB activation by TNF. Because RIP is a serine/threonine kinase, it is likely to phosphorylate, and thereby activate, kinases that phosphorylate the inhibitor of NF-κB, IκB.13 Interestingly, RIP knockout mice also have abnormal development of lymph nodes, similar to those in lymphotoxin β (LTβ) receptor-deficient mice.14,15 Therefore it is possible that RIP also takes part in signalling from these receptors. However, because the LTβR lacks a DD, if it does signal via RIP then it must do so indirectly (see following). Another DD-bearing adaptor molecule implicated in TNF signalling of apoptosis is `RIP-associated ICH-1/CED-3-homologous protein with a death domain' (RAIDD). In addition to the DD, RAIDD has a CARD which allows it to bind to the CARD of procaspase 2.16 Overexpression of RAIDD in vitro induces apoptosis, suggesting that this interaction is functional. However, the significance of this pathway for induction of cell death is uncertain because neither CD95 ligand (CD95L) nor TNF is able to induce apoptosis in mice lacking FADD or caspase 8. In these mice, RAIDD and caspase 2 would presumably be able to function normally. Furthermore, TNF-α was still able to induce cell death in the absence of caspase 2.17
0	Inhibitor-of-apoptosis  proteins
0	In  some  cell  types  in  vitro,  ligation  of  CD95  is  able  to  activate  the  JNK/SAPK  pathway.  A  candidate  for  mediating  this
0	CD95  and  TNF  receptor  signalling
0	activity  is  the  CD95  `death  domain-associated  protein'  Daxx,  which  was  identified  in  yeast  two-hybrid
0	Springer-Verlag  1997
1	Russell  L.  Margolis  ·  Meena  R.  Abraham  ·  Shawn  B.  Gatchell  ·  Shi-Hua  Li  ·  Arif  S.  Kidwai  ·  Theresa  S.  Breschel  ·  O.  Colin  Stine  ·  Colleen  Callahan  ·  Melvin  G.  McInnis  ·  Christopher  A.  Ross
0	cDNAs  with  long  CAG  trinucleotide  repeats  from  human  brain
0	Trinucleotide repeat expansion mutation is now known to cause 12 diseases, most with neuropsychiatric features (Linblad and Schalling 1996; Paulson and Fischbeck 1996; Ross 1995; Zoghbi 1996). Seven of these are known as the type 1 disorders: spinocerebellar ataxia type 1 (SCA1, Orr et al. 1993), SCA2 (Imbert et al. 1996; Pulst et al. 1996; Sanpei et al. 1996), Machado-Joseph disease (MJD or SCA3, Kawaguchi et al. 1994), SCA6 (Zhuchenko et al. 1997), dentatorubral pallidoluysian atrophy (DRPLA, Koide et al. 1994; Nagafuchi et al. 1994), Huntington's disease (HD, Huntington's Disease Collaborative Research Group 1993), and spinal and bulbar muscular atrophy (SBMA, La Spada et al. 1991). Each is caused by a (CAG)n expansion in an open reading frame, resulting in an expanded glutamine repeat. The properties of the repeats in the other (type 2) expansion mutation diseases vary widely. Myotonic dystrophy is caused by a 3′ untranslated (CTG)n expansion (Brook et al. 1992; Fu et al. 1992; Mahadevan et al. 1992), the A and E forms of fragile X syndrome (Fu et al. 1991; Knight et al. 1993; Kremer et al. 1991; Verkerk et al. 1991) and some cases of Jacobsen's syndrome (Jones et al. 1995) result from 5′ untranslated region (CCG)n expansions, and Friedreich's ataxia is caused by an intronic (GAA)n expansion (Campuzano et al. 1996). Expandable trinucleotide repeats therefore are found in translated, transcribed but untranslated, and intronic regions; they may be G-C or A-T rich and range from minimal to highly variable in length in the normal population. At least four lines of evidence indicate that additional disorders may arise from trinucleotide repeat expansion mutations.
First,  an  antibody  (IC2)  that  specifically  recognizes  expanded  glutamine  repeats  detects  an  expansion  segregating  with  SCA7  (Trottier  et  al.  1995).  Second,  indirect  evidence  of  CAG  expansion  has  been  detected  using  rapid  expansion  detection  (RED,  Schalling  et  al.  1993)  in  a  pedigree  with  SCA7,  and  less  clearly  in  heterogeneous  populations  of  patients  with  bipolar  affective
0	disorder and schizophrenia (Linblad et al. 1996; Linblad and Schalling 1996; O'Donovan et al. 1995). Third, several neurodegenerative disorders, including SCA4, SCA5, SCA7, and familial Parkinson disease, are phenotypically similar to the type 1 expansion mutation disorders. Fourth, anticipation, the phenomenon of increasing phenotypic severity or decreasing age of onset in successive generations affected by a disease (McInnis 1996; Ross et al. 1993), is found in most of the expansion mutation diseases. Anticipation has been detected in a disparate group of other diseases, including affective disorder (Engstrom et al. 1995; McInnis et al. 1993; Nylander et al. 1994), schizophrenia (Chotai et al. 1995; Gorwood et al. 1996; Stober et al. 1995; Thibaut et al. 1995), autism (Stine 1993), familial Parkinsonism (Bonifati et al. 1995; Markopoulou et al. 1995; Payami et al. 1995; Plante-Bordeneuve et al. 1995), familial leukemias (Horwitz et al. 1996), Crohn's disease (Polito et al. 1996), Meniere's disease (Morrison 1995), torsion dystonia (LaBuda et al. 1993), rheumatoid arthritis (McDermott et al. 1996), facioscapulohumeral muscular dystrophy (Tawil et al. 1996), Holt-Oram syndrome (Newbury-Ecob et al. 1996), and familial spastic paraplegia (Raskind et al. 1997). We have sought to identify candidate genes for these disorders by screening cDNA libraries for the presence of DNA fragments containing CAG, CCG, CCA, and AAT trinucleotide repeats (Li et al. 1993; Margolis et al. 1995 a, b).
Our description of CTG-B37, a cDNA fragment with a highly polymorphic CAG repeat located within an open reading frame on chromosome 12, directly led to the finding that an expansion mutation within the CTG-B37 repeat causes DRPLA (Koide et al. 1994; Nagafuchi et al. 1994). This same strategy of screening cDNA libraries for trinucleotide repeats was later employed to identify the MJD gene (Kawaguchi et al. 1994) and the SCA6 gene (Zhuchenko et al. 1997). Screening genomic contigs for trinucleotide repeats was used to clone the gene for SCA2 (Pulst et al. 1996). Based on the repeats that expand to cause disease, repeats with the highest likelihood of undergoing expansion mutation consist of at least six consecutive CAG or CTG triplets in the transcribed portions of genes expressed in brain. To identify genes with these features, we have screened human adult frontal cortex and fetal brain cDNA libraries at high stringency for the presence of CAG or CTG repeats. We now report the identification and mapping of 19 of these cDNA fragments.
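The selection criterion stated above (at least six consecutive CAG or CTG triplets) can be illustrated in silico. The following is a minimal sketch, not the hybridization-based library screen used in the study: the function name `find_repeats` and the example sequence are hypothetical, and only uninterrupted runs are reported.

```python
import re

# Threshold taken from the text: at least six consecutive triplets.
MIN_TRIPLETS = 6
REPEAT_PATTERN = re.compile(
    r"(?:CAG){%d,}|(?:CTG){%d,}" % (MIN_TRIPLETS, MIN_TRIPLETS)
)


def find_repeats(sequence):
    """Return (start, triplet, count) for each uninterrupted CAG or CTG run."""
    hits = []
    for match in REPEAT_PATTERN.finditer(sequence.upper()):
        run = match.group(0)
        hits.append((match.start(), run[:3], len(run) // 3))
    return hits


# Hypothetical sequence for illustration; it is not one of the reported clones.
example = "ATG" + "CAG" * 8 + "TTC" + "CTG" * 5 + "GGA" + "CTG" * 6
print(find_repeats(example))
```

Both CAG and CTG are searched because a cDNA clone may be read from either strand, and CTG is the reverse complement of CAG. The run of five CTG triplets in the example falls below the threshold and is not reported.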
0	Materials  and  methods
0	cDNA  cloning  Adult  human
0	Effects of a motilin receptor agonist (ABT-229) on upper gastrointestinal symptoms in type 1 diabetes mellitus: a randomised, double blind, placebo controlled trial
1	N J Talley, M Verlinden, D J Geenen, R B Hogan, D Riff, R W McCallum, R J Mack
0	Motilin is a 22 amino acid peptide hormone that is expressed throughout the gut.1 Motilin stimulates interdigestive antral contractions promoting gastric emptying; the receptor has recently been identified.2 Erythromycin is a potent motilin agonist, inducing phase 3 of the migrating motor complex1; it accelerates gastric emptying in healthy volunteers as well as in patients with diabetic gastroparesis or those post-vagotomy.3 4 Dyspepsia is a common problem in patients with diabetes mellitus.5 6 Between 27% and 58% of type 1 diabetics are reported to have gastroparesis, usually affecting solids but less often liquids.7 8 Symptoms of diabetic gastroparesis include postprandial distress, early satiety, bloating, fullness, and nausea and vomiting, but while gastroparesis is common, only a minority have overt symptomatology.7 8 Moreover, these symptoms also occur frequently in diabetics who do not have objective evidence of gastroparesis.6 The underlying mechanisms remain in dispute but disturbed vagal parasympathetic function and poor glycaemic control may both be important.8 9 In addition, increased levels of motilin have been observed in diabetic gastroparesis, which is likely to be a compensatory mechanism as motilin levels decreased with the introduction of a prokinetic.10 A prokinetic agent in diabetic gastroparesis has the potential to increase gastric emptying, improve dyspepsia, and better control plasma glucose levels. There has therefore been considerable interest in developing new prokinetics for gastroparesis, including motilin agonists that lack antibiotic activity.
ABT-229 has potent motilin agonist activity with essentially no antibiotic action.11 12 It dose dependently accelerates gastric emptying, and has a half life of 20 hours.11 12 Multidose studies have shown that the maximally effective dose was 5 mg twice daily for accelerating gastric emptying and 2.5 mg twice daily retained a modest but significant prokinetic effect.12 We aimed to test the hypothesis that ABT-229 would relieve postprandial symptoms in patients with diabetes mellitus. We further hypothesised that the maximum therapeutic gain over placebo would be observed in patients with diabetic gastroparesis on higher doses of ABT-229. To test these hypotheses, we conducted a randomised, placebo controlled,
0	Abbreviations  used  in  this  paper:  HbA1c,  glycated  haemoglobin.
0	dose ranging trial in North American patients with type 1 diabetes mellitus.
0	Methods
0	The trial was approved by the local institutional review boards, and all patients gave informed consent.
0	PATIENT  SELECTION
0	Ambulatory patients at least 18 years of age with documented type 1 diabetes were eligible to be enrolled. All patients were by definition insulin dependent. A minimum three month history of chronic upper abdominal discomfort (that is, one or more of postprandial fullness, bloating, epigastric discomfort, early satiety, belching after meals, postprandial nausea, vomiting, or epigastric pain) was required. A total of 383 patients were screened (by 33 investigators in the USA and three in Canada between June 1997 and August 1998) (fig 1). Patients were required to have a normal upper endoscopy (that is, no ulcers or erosions in the oesophagus and gastroduodenum) in the three months before randomisation. Furthermore, during the baseline evaluation over 14 days, patients had to have experienced one or more symptoms of postprandial upper abdominal discomfort on three or more days per week and on average have sufficiently severe symptoms (defined as an upper abdominal discomfort severity score of >149 mm and a postprandial fullness severity score of >29 mm on visual analogue scales, as described below). Patients were only enrolled if there were no serious comorbid illnesses and screening laboratory values were normal. Excluded were patients with gastro-oesophageal reflux disease, based on a normal endoscopy (only erythema was permitted), and
0	[Figure 1. Patient flow: 383 patients screened; 113 screening failures; 270 patients randomised; 1 patient did not receive study drug; 269 intent to treat patients; 15 prematurely discontinued; 254 completed the trial.]
0	Each site was supplied with separate sets of study drug for the gastric emptying strata (normal and delayed); to ensure random assignment, patients in each stratum were given a number in sequential order from a separate computer generated randomisation list. A total of 270 patients were randomised but one was lost to follow up after the drug was dispensed and this patient was excluded. Patients treated (n=269) were randomly assigned to receive ABT-229 1.25 mg (n=55), 2.5 mg (n=58), 5 mg (n=53), 10 mg (n=55), or placebo (n=48) twice daily before breakfast and dinner for four weeks. These four doses were chosen based on the gastrokinetic effects of ABT-229 administered in healthy subjects.12 The 2.5 mg twice daily dose was only marginally significantly superior to placebo as it accelerated gastric emptying of the evening meal only. The maximally effective dose in healthy subjects was 5 mg twice daily. As the gastrokinetic effects of ABT-229 were largest in those with slower gastric emptying, a 1.25 mg dose was included in the trial. To account for the possibility that patients with diabetic gastroparesis might be more resistant to therapy and require a higher dose, 10 mg was also included. Overall, 15 patients prematurely discontinued; the reasons were adverse events (n=10), treatment failure (n=2), lost to follow up (n=1), or other reasons (n=2), and the distribution was similar in each arm (fig 1). In total, 254 patients completed the trial.
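The allocation scheme described above (a separate, sequentially consumed randomisation list per gastric emptying stratum) can be sketched as follows. This is an illustration under stated assumptions, not the trial's actual procedure: the use of permuted blocks of five, the block count, the seeds, and the names `make_randomisation_list` and `assign` are all hypothetical. The final lines only re-check the patient counts reported in the text.

```python
import random

ARMS = ["ABT-229 1.25 mg", "ABT-229 2.5 mg", "ABT-229 5 mg",
        "ABT-229 10 mg", "placebo"]


def make_randomisation_list(n_blocks, seed):
    """Build one allocation list from permuted blocks of the five arms.

    Permuted blocks are an assumption; the text only says the lists
    were computer generated.
    """
    rng = random.Random(seed)
    allocation = []
    for _ in range(n_blocks):
        block = ARMS[:]
        rng.shuffle(block)
        allocation.extend(block)
    return allocation


# One separate list per gastric emptying stratum (seeds are arbitrary).
lists = {
    "normal": make_randomisation_list(n_blocks=4, seed=1),
    "delayed": make_randomisation_list(n_blocks=4, seed=2),
}
next_index = {"normal": 0, "delayed": 0}


def assign(stratum):
    """Give a new patient the next sequential entry of the stratum's list."""
    arm = lists[stratum][next_index[stratum]]
    next_index[stratum] += 1
    return arm


# Consistency check of the counts reported in the text.
reported_arms = {"1.25 mg": 55, "2.5 mg": 58, "5 mg": 53, "10 mg": 55, "placebo": 48}
total_treated = sum(reported_arms.values())
discontinued = 10 + 2 + 1 + 2  # adverse events, treatment failure, lost, other
completed = total_treated - discontinued
print(total_treated, discontinued, completed)
```

Keeping one list per stratum means the arms stay balanced within the normal and delayed emptying groups separately, which is the stated purpose of supplying separate drug sets for each stratum.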
0	The  placebo  was  identical  in  appearance  to  active  therapy.  All  medication  was  supplied  in  double  blinded  multidose  bottles.  An  administrative  blind  break  occurred  for  one  patient.
0	Compliance,  measured  by  a  tablet  count  at  week  4,  was  excellent.  A  minimum  of  97%  of  patients  in  each  treatment  arm  were  at  least  75%  compli
0	Quality  Indicators  Increase  the  Reliability  of  Microarray  Data
1	Wolfgang  Raffelsberger,1  Doulaye  Dembele,1  Mike  G.  Neubauer,2  Marco  M.  Gottardis,3  and  Hinrich  Gronemeyer1,*
0	1Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS/INSERM/ULP, B.P. 10142, F-67404 Illkirch Cedex, C. U. de Strasbourg, France; Departments of 2Applied Genomics and 3Oncology Drug Discovery, Bristol-Myers Squibb Pharmaceutical Research Institute, Princeton, New Jersey 08543-4000, USA
0	Large-scale gene expression profiling with DNA microarrays opens new dimensions to molecular biology but still lacks the overall precision of traditional low-scale techniques. We developed a novel strategy of data processing linking search stringency to quality indicators for efficient detection of low-level, regulated genes. Using retinoid-induced differentiation of NB-4 promyelocytic cells, the variation of expression profiles between biological duplicates was studied and compared with the changes induced by all-trans retinoic acid (atRA) treatment. An analysis of 4320 genes showed that retinoic acid has mainly gene-activating function in NB-4 cells. Treatment with atRA for 18 hours induced metabolic genes that may be associated with cell differentiation and signaling factors triggering later events leading to apoptosis; cytokine genes were among the highest stimulated by atRA. Notably, we identified a regulatory loop inhibiting MYC action: as MYC was downregulated, a cognate repressor of MYC was upregulated. Key Words: retinoic acid, cell differentiation, gene expression profiling, biostatistics
0	Until  recently  only  a  limited  number  of  genes  were  accessible  to  gene  expression  profiling,  as  northern  blot,  RT-PCR,  and  ribonuclease  protection  assays  are  designed  for  single  genes  or  small  groups  of  genes  at  a  time.  During  the  course  of  the  human  genome  project,  comprehensive  cDNA  libraries  became  available  allowing  the  development  of  techniques  for  massive  parallel  expression  profiling.  Two  types  of  microarrays  emerged  either  using  oligonucleotides  directly  synthesized  on  a  chip  surface  (Affymetrix)  [reviewed  in  1,2]  or  depositing  cDNA  PCR  products  on  glass  slides  [reviewed  in  1,3].  In  parallel,  clustering  algorithms  for  data  analysis  have  been  developed  [4-7].  High-density  microarrays  allowed  genome-wide  screening  programs  for  identification  of  target  genes  or  expression  profiles  in  disease  and  cancer  [reviewed  in  8-10].  Large  amounts  of  data  have  been  generated  quickly,  but  several  types  of  problems  encourage  the  development  of  novel  concepts  for  data  evaluation.  Large  data  sets  with  intrinsic  variation  ("noisy  data")  have  to  be  interpreted  by  recognizing  and  excluding  outlier  data  from  subsequent  analysis  in  an  automated  and  highly  reliable  way.
0	Edge Effect and Normalization
0	The microarrays used had a considerable edge effect: spots located close to the edge of a slide displayed lower fluorescence signals than duplicate spots in the center of the slide. For each column a correction factor was introduced minimizing the normalized differences between spot duplicates (left/right). As low spot intensity values have 3- to 10-fold elevated deviation (Fig. 1A), only the 60% most intense spot pairs were used. Spots at saturation were excluded. All normalizations between replicate slides or subsequently between different samples were based on the assumption that there are no major changes in expression levels for the bulk part of the genes tested. This assumption is supported by near-identical shapes of cumulative frequency histograms of fluorescence intensities for different slides after median normalization (Fig. 1B).
0	Comparison with Quantitative RT-PCR and Previous Results Obtained with Affymetrix GeneChips
0	From preliminary experiments 18 genes were selected and their atRA-induced expression was assessed by real-time PCR. In general, most results were in agreement with the
0	arrays  revealed  upregul
0	Assessing  the  Drosophila  melanogaster  and  Anopheles  gambiae  Genome  Annotations  Using  Genome-Wide  Sequence  Comparisons
1	Olivier Jaillon,1 Carole Dossat,1 Ralph Eckenberg,1 Karin Eiglmeier,2 Beatrice Segurens,1 Jean-Marc Aury,1 Charles W. Roth,2 Claude Scarpelli,1 Paul T. Brey,2 Jean Weissenbach,1 and Patrick Wincker1,3
0	1Genoscope/Centre National de Séquençage and CNRS UMR 8030, 91057 Evry Cedex, France; 2Unité de Biochimie et Biologie Moléculaire des Insectes, Institut Pasteur, Paris 75724 Cedex 15, France
0	We performed genome-wide sequence comparisons at the protein coding level between the genome sequences of Drosophila melanogaster and Anopheles gambiae. Such comparisons detect evolutionarily conserved regions (ecores) that can be used for a qualitative and quantitative evaluation of the available annotations of both genomes. They also provide novel candidate features for annotation. The percentage of ecores mapping outside annotations in the A. gambiae genome is about fourfold higher than in D. melanogaster. The A. gambiae genome assembly also contains a high proportion of duplicated ecores, possibly resulting from artefactual sequence duplications in the genome assembly. The occurrence of 4063 ecores in the D. melanogaster genome outside annotations suggests that some genes are not yet or only partially annotated. The present work illustrates the power of comparative genomics approaches towards an exhaustive and accurate establishment of gene models and gene catalogues in insect genomes.
0	nome  annotations.  We  therefore  carried  out  this  type  of  global  comparison  between  these  two  insect  genomes.
0	RESULTS  AND  DISCUSSION
0	The  Drosophila  Annotation
0	Genome  Research
0	Ecores  47,134  n.d.  46,742  n.d.
0	Genes  13,468  n.d.  13,666  n.d.
0	Exons  54,771  n.d.  61,085  n.d.
0	Ecores/gene 3.17 n.d. 3.2 n.d.
0	Genes  and  exons  stand  for  annotated  genes  and  exons  in  the  corresponding  versions.
0	Several explanations that are not mutually exclusive may account for this observation. The high number of ecores could be the consequence of (1) an increased coding capacity in the genome of Anopheles, or (2) a larger number of pseudogenes or unmasked transposable elements in Anopheles, or (3) problems in the sequence assembly. Explanations (1) and (2) were not supported by a previous comparative analysis (Zdobnov et al. 2002). The presence of at least two different haplotypes in the A. gambiae strain sequenced is known to have int
0	How  many  replicates  of  arrays  are  required  to  detect  gene  expression  changes  in  microarray  experiments?  A  mixture  model  approach
1	Wei  Pan*,  Jizhen  Lin  and  Chap  T  Le*
0	Microarrays  are  used  to  measure  the  (relative)  expression  levels  of  thousands  of  genes  (or  expressed  sequence  tags).  A  comparison  of  gene  expression  in  cells  or  tissues  from  two  conditions  may  provide  useful  information  on  important  biological  processes  or  functions  [1,2].  The  challenge  now  is  how  to  detect  those  genuine  changes  from  noisy  data.  It  is  now  known  that  simply  using  fold  changes,  as  in  the  earlier  days,  is  unreliable  and  inefficient  [3,4].  More  sophisticated  statistical  methods  are  called  for.  Many  proposals  have  appeared  in  the  literature  [3-10].  In  particular,  it  has  been  noticed  that  it  may  be  necessary  to  design  an  experiment  that  uses  multiple  arrays  (or  multiple  spots  on  each  array)  containing  multiple  measurements  for  each  gene  under  each
0	condition. One reason is that because of a high noise-to-signal ratio, a single array may not provide enough information that can be reliably extracted [11]. More important, multiple measurements from each gene make it possible to assess the potentially different variability of genes. The problem then seems to fall within the traditional two-sample comparison in statistics. Two of the best known two-sample statistical tests are the two-sample t-test and the Wilcoxon test (or equivalently, Mann-Whitney test). The t-test is parametric and is based on the assumption that the gene-expression levels have normal distributions. In contrast, the Wilcoxon test is nonparametric and is based on the ranks of observed gene-expression levels. Although the t-test is robust to departures from normality and the Wilcoxon test
0	Genome  Biology
0	Results  and  discussion
0	A  statistical  model
0	We consider a generic situation in which, for each gene $i$, $i = 1, 2, \ldots, N$, we have (relative) expression levels $X_{1i}, \ldots, X_{mi}$ from $m$ microarrays under condition 1, and $Y_{1i}, \ldots, Y_{mi}$ from $m$ arrays under condition 2. We need to assume that $m$ is an even integer. A general statistical model is assumed for gene expression data: $X_{ji} = \mu_{(1),i} + \varepsilon_{ji}$, $Y_{li} = \mu_{(2),i} + e_{li}$,
0	where $\mu_{(1),i}$ and $\mu_{(2),i}$ are the mean expression levels for gene $i$ under the two conditions respectively, and $\varepsilon_{ji}$ and $e_{li}$ are independent random errors with means and variances $E(\varepsilon_{ji}) = E(e_{li}) = 0$, $\mathrm{Var}(\varepsilon_{ji}) = \sigma^2_{(1),i}$, $\mathrm{Var}(e_{li}) = \sigma^2_{(2),i}$.
0	depend on the mean expression $\mu_{(c),i}$. Also, we do not even need to assume that $\sigma^2_{(1),i} = \sigma^2_{(2),i}$ unless $\mu_{(1),i} = \mu_{(2),i}$. A goal is to detect all genes with $\mu_{(1),i} \neq \mu_{(2),i}$. This can be accomplished through statistical hypothesis testing.
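The model and the testing goal above can be made concrete with a small simulation. The sketch below is an illustration only: it draws the errors from normal distributions (an assumption the model itself does not make), uses hypothetical parameter values, and applies Welch's unequal-variance t-statistic as one conventional two-sample test, which is not necessarily the statistic developed in the paper.

```python
import math
import random
import statistics


def simulate_gene(mu1, mu2, sigma1, sigma2, m, rng):
    """Draw m replicate expression levels per condition for one gene.

    Normal errors are an assumption made for this sketch; the model in
    the text only requires zero-mean independent errors.
    """
    x = [mu1 + rng.gauss(0.0, sigma1) for _ in range(m)]
    y = [mu2 + rng.gauss(0.0, sigma2) for _ in range(m)]
    return x, y


def welch_t(x, y):
    """Two-sample t-statistic that does not assume equal variances."""
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


# Hypothetical gene with different means and different variances; m is even.
rng = random.Random(0)
x, y = simulate_gene(mu1=0.0, mu2=1.0, sigma1=0.5, sigma2=0.8, m=4, rng=rng)
print(welch_t(x, y))
```

A statistic far from zero suggests that the two condition means differ for that gene. Because the variances are allowed to differ between conditions, the unpooled standard error is used rather than a pooled one.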
0	nonparametrically.  T
0	Copyright  2004  by  the  Genetics  Society  of  America  DOI:  10.1534/genetics.104.026658
0	The  DrosDel  Collection:  A  Set  of  P-Element  Insertions  for  Generating  Custom  Chromosomal  Aberrations  in  Drosophila  melanogaster
1	Edward Ryder,* Fiona Blows,* Michael Ashburner,* Rosa Bautista-Llacer,* Darin Coulson,* Jenny Drummond,* Jane Webster,* David Gubb,* Nicola Gunton,* Glynnis Johnson,* Cahir J. O'Kane,* David Huen,* Punita Sharma,* Zoltan Asztalos,* Heiko Baisch, Janet Schulze, Maria Kube, Kathrin Kittlaus, Gunter Reuter, Peter Maroy, Janos Szidonya, Asa Rasmuson-Lestander,§ Karin Ekstrom,§ Barry Dickson,** Christoph Hugentobler, Hugo Stocker, Ernst Hafen, Jean Antoine Lepesant, Gert Pflugfelder,§§ Martin Heisenberg,*** Bernard Mechler, Florenci Serras, Montserrat Corominas, Stephan Schneuwly,§§§ Thomas Preat,**** John Roote* and Steven Russell*,1
0	GENETICALLY tractable model organisms are valuable research tools for uncovering basic biological principles that are conserved through evolution. Many molecular pathways, such as signaling cascades, gene regulatory pathways, and cell cycle control circuits, were first characterized genetically in model systems. The subsequent molecular cloning of the genes involved in such pathways has shown how evolution has utilized basic molecular building blocks to control a wide variety of biological processes. Key to the success of such approaches has been the ability to carry out genetic screens
0	for  components  that  function  in  particular  pathways  and  characterize  how  individual  genes  participate  in  such  pathways.  The  fruit  fly,  Drosophila  melanogaster,  is  one  such  tractable  model  that  has  been  used  extensively  to  elucidate  many  conserved  genetic  hierarchies.  One  particularly  powerful  approach  with  Drosophila  is  the  ability  to  rapidly  carry  out  focused  genome-wide  screens  for  pathway  components  by  identifying  loci  that  modify  specific  phenotypes  (see  St.  Johnston  2002  for  review).  In  this  approach,  a  sensitized  genetic  background,  most  commonly  exhibiting  an  easily  scored  adult  phenotype  such  as  rough  eyes  or  a  wing  defect,  is  used  to  search  for  mutations  in  genes  that  make  the  phenotype  more  severe  (enhancer)  or  more  like  wild  type  (suppressor).  Mutation-bearing  chromosomes  are  introduced  into  the
0	specific  recombinase  (FRT  site)  placed  within  intron  one.  In  the  case  of  RS3,  a  second  FRT  site  is  placed  upstream  of  the  first  of  the  mini-white  exons;  in  the  case  of  RS5  the  second  FRT  site  is  located  downstream  of  the  mini-white  exons.  Golic  and  Golic  demonstrated  how  a  pair  of  RS3  and  RS5  elements  can  be  used  to  generate  chromosome  rearrangements  by  design.  These  chromosome  rearrangements  include  both  deficiencies  and  duplications  (Figure  6).  Since  the  insertion  site  of  any  P  element  can  be  precisely  mapped  to  the  genomic  sequence,  the  end  points  of  any  chromosome  aberration  derived  from  a  pair  of  these  RS  elements  can  be  determined  with  single-base-pair  resolution.  The  problem  of  genetic  background  heterogeneity  is  less  easily  overcome.  Powerful  genetic  methods  are  available  with  D.  melanogaster  to  construct  "isogenic"  lines  and  we  have  used  these  methods  in  our  current  screen  (Ashburner  1989).  However,  in  the  absence  of  practical  methods  to  preserve  these  lines  cryogenically,  there  is  no  way  to  prevent  the  slow,  but  inevitable,  divergence  of  these  lines  in  subsequent  years.  While  this  may  be  a  drawback  in  the  long  term,  there  can  be  no  doubt  that,  in  the  medium  term,  a  deficiency  kit  in  a  homogeneous  genetic  background  will  be  of  considerable  utility  in  genome-scale  analysis  of  Drosophila.  We  describe  here  the  construction  of  a  set  of  isogenic  lines  that  form  the  basis  for  a  mobilization  screen  with  RS  elements.  We  describe  the  isolation  and  mapping  of  3000  new  P-element-insertion  lines  on  this  background  and  demonstrate  their  utility  for  generating  deletions  precisely  mapped  onto  the  genome  sequence.  
This  work  is  a  prelude  to  an  ongoing  effort  to  generate  a  precisely  mapped  deletion  kit  that  will  cover  as  much  of  the  genome  of  D.  melanogaster  as  is  possible.  In  addition,  we  have  constructed  a  genetic  and  computational  toolkit  that  allows  individual  researchers  to  design  and  synthesize  deletions  in  regions  of  particular  interest.  The  materials  we  have  generated  are  all  publicly  available.
0	MATERIALS  AND  METHODS  Genetic  nomenclature  is  according  to  FlyBase  (2003).  The  FM7  balancer  stocks  were  ob
0	Steroid  signaling  in  plants  and  insects--common  themes,  different  pathways
1	Carl  S.  Thummel1  and  Joanne  Chory2,3
0	Outside  of  mammals,  two  model  systems  have  been  the  focus  of  intensive  genetic  studies  aimed  at  defining  the  molecular  mechanisms  of  steroid  hormone  action--the  flowering  plant,  Arabidopsis  thaliana,  and  the  fruit  fly,  Drosophila  melanogaster.  Studies  in  Arabidopsis  have  benefited  from  a  detailed  description  of  the  brassinosteroid  (BR)  biosynthetic  pathway,  allowing  the  effects  of  mutations  to  be  linked  to  specific  enzymatic  steps.  More  recently,  the  signaling  cascade  that  functions  downstream  from  BR  production  has  been  defined,  revealing  for  the  first  time  how  the  hormone  can  exert  its  effects  on  gene  expression  through  a  cell  surface  receptor  and  phosphorylation  cascade.  In  contrast,  studies  of  steroid  hormone  action  in  Drosophila  began  in  the  nucleus,  with  a  detailed  description  of  the  transcription  puffs  activated  by  the  steroid  hormone  20-hydroxyecdysone  (20E)  in  the  giant  polytene  chromosomes.  Subsequent  genetic  studies  have  revealed  that  these  effects  are  exerted  through  nuclear  receptors,  much  like  mammalian  hormone  signaling.  Most  recently,  genetic  studies  have  begun  to  elucidate  the  ecdysteroid  biosynthetic  pathway  which,  until  recently,  remained  largely  undefined.  Our  current  understanding  of  steroid  hormone  signaling  in  Arabidopsis  and  Drosophila  provides  a  number  of  intriguing  parallels  as  well  as  distinct  differences.  At  least  some  of  these  differences,  however,  appear  to  be  due  to  deficiencies  in  our  understanding  of  these  pathways.  Below  we  discuss  recent  breakthroughs  in  defining  the  molecular  mechanisms  of  BR  biosynthesis  and  signaling  in  plants,  and  we  compare  and  contrast  this  pathway  with  what  is  known  about  the  mechanisms  of  ecdysteroid  action  in  Drosophila.  
We  raise  some  current  questions  in  these  fields,  the  answers  to  which  may  reveal  other  similarities  in  steroid  signaling  in  plants  and  animals.  Brassinosteroid  biosynthesis  and  homeostasis  Although  plants  and  animals  diverged  more  than  1  billion  years  ago,  it  is  remarkable  that  polyhydroxylated
0	steroidal  molecules  are  used  as  hormones  in  both  of  these  kingdoms,  as  well  as  in  algae  and  fungi.  Brassinosteroids  (BRs),  a  class  of  plant-specific  steroid  hormones,  control  many  of  the  same  developmental  and  physiological  processes  as  their  animal  and  fly  counterparts,  including  regulation  of  gene  expression,  cell  division  and  expansion,  differentiation,  programmed  cell  death,  and  homeostasis.  The  regulation  of  these  processes  by  BRs,  acting  together  with  other  plant  hormones,  leads  to  the  promotion  of  stem  elongation  and  pollen  tube  growth,  leaf  bending  and  epinasty,  root  growth  inhibition,  proton-pump  activation,  and  xylem  differentiation  (Mandava  1988;  Clouse  and  Sasse  1998).  In  addition,  useful  agricultural  applications  have  been  found  such  as  increasing  yield  and  improving  stress  resistance  of  several  major  crop  plants  (Ikebawa  and  Zhao  1981;  Cutler  et  al.  1991).  Although  the  existence  and  biological  activity  of  these  plant  steroids  had  been  described  in  a  large  body  of  literature,  they  only  found  their  way  into  the  mainstream  of  plant  hormone  biology  a  few  years  ago,  when  the  available  biochemical  and  physiological  data  were  complemented  by  the  identification  of  BR-deficient  mutants  of  Arabidopsis  (Clouse  et  al.  1996;  Kauschmann  et  al.  1996;  Li  et  al.  1996;  Szekeres  et  al.  1996),  pea  (Nomura  et  al.  1999),  and  tomato  (Bishop  et  al.  1999;  Koka  et  al.  2000).  Mutations  in  8  loci  of  Arabidopsis  and  several  additional  loci  in  tomato  and  pea  result  in  plants  with  reduced  levels  of  BR  biosynthetic  intermediates  and  lead  to  distinct  phenotypes  (Bishop  et  al.  1996;  Li  et  al.  1996;  Szekeres  et  al.  1996;  Choe  et  al.  1998a,b,  1999a,b,  2000;  Klahre  et  al.  1998;  Nomura  et  al.  1999;  Kang  et  al.  2001).  
In  Arabidopsis,  loss-of-function  mutations  in  these  genes  have  pleiotropic  effects  on  development.  In  the  dark,  the  mutants  are  short,  have  thick  hypocotyls  and  open,  expanded  cotyledons,  develop  primary  leaf  buds,  and  inappropriately  express  light-regulated  genes.  In  the  light,  these  mutants  are  dark  green  dwarfs,  have  reduced  apical  dominance  and  male  fertility,  display  altered  photoperiodic  responses,  show  delayed  chloroplast  and  leaf  senescence,  have  reduced  xylem  content,  and  respond  improperly  to  fluctuations  in  their  light  environment
0	(Chory et al. 1991, 1994; Millar et al. 1995; Szekeres et al. 1996; Fig. 1). Such phenotypic differences between BR-deficient mutants and wild-type Arabidopsis plants indicate that these genes (and by inference, BRs) play an important role throughout Arabidopsis development. Exogenous application of brassinolide (BL, the most active BR, and generally thought to be the endpoint of the biosynthetic pathway) leads to the normalization of the mutant phenotypes. A biosynthetic pathway derived solely from biochemical studies provided an excellent framework for the characterization of these mutants, and was in turn confirmed and refined by their analysis (for review, see Clouse and Sasse 1998; Noguchi et al. 2000; Friedrichsen and Chory 2001; Fig. 1). Because of their striking mutant phenotypes, which led to the identification of most BR biosynthetic genes, considerable progress has been made in understanding the mechanisms of BR homeostasis. Multiple control mechanisms for regulating the levels of BRs in plants have been identified, including regulation of biosynthesis, inactivation, and feedback regulation from the signaling pathway. BR-deficient mutants have helped to determine that BL is not synthesized via a simple linear biosynthetic pathway. Recently, two pathways, the early C-6 oxidation and late C-6 oxidation pathways, were proposed for the biosynthesis of BL (Choi et al. 1996, 1997). In the early C-6 oxidation pathway, hydroxylation of the side chain occurs after C-6 oxidation, whereas in the late C-6 oxidation pathway the hydroxylation of the side chain occurs before position 6 of the B-ring is oxidized. Feeding experiments with intermediates of both pathways
0	provided strong genetic evidence that both pathways operate in Arabidopsis (Fujioka et al. 1997; Choe et al. 1998a). A study with dwf4 mutants suggests that 6-deoxo-cathasterone is a starting point for a new subpathway, as this compound is able to rescue dwf4 mutants (Choe et al. 1998a). Of note, DWF4, a C-22 hydroxylase, appears to catalyze the major rate-limiting step in the BR biosynthetic pathway, based on feeding studies and overexpression of DWF4 in transgenic plants (Choe et al. 2001). Similarly, 6α-hydroxycampestanol could also be a starting point for a different subpathway whose intermediates act as "bridging molecules" between the early and late C-6 oxidation pathways. One simple explanation for plants having multiple pathways of BL biosynthesis is that these subpathways might be differentially regulated by various environmental or developmental signals. A possible point for light-regulation of BR biosynthesis has very recently been identified and is indicated in red in Figure 1 (Kang et al. 2001). In addition, feeding experiments using det2 and dwf4 mutants have shown that BRs in the late C-6 oxidation pathway are more effective in rescuing light phenotypes, whereas the BRs in the early C-6 oxidation pathway show stronger activity in promoting hypocotyl elongation of dark-grown seedlings (Fujioka et al. 1997; Choe et al. 1998a). Endogenous levels of BRs are increased in BR-signaling mutants, such as Arabidopsis bri1 and its orthologous mutants in tomato, pea, and rice (discussed below; Noguchi et al. 1999; Yamamuro et al. 2000; Bishop and Yokota 2001). These BR-insensitive mutants show the largest increases in the early C-6 oxidation BRs. In Ara-
0	GENES  &  DEVELOPMENT
1	Fredj  Tekaia  a,*,  Edouard  Yeramian  b,  Bernard  Dujon  a
0	Keywords:  Hyperthermophiles;  Mesophiles;  Thermostability;  Amino  acid  composition;  Evolution;  Multivariate  analyses
0	Introduction  One  major  aim  of  large-scale  genomic  projects  is  to  reach  a  global  understanding  of  the  physiological  functioning  of  living  organisms.  Such  understanding  must  encompass  the
0	puzzling discovery that certain organisms live in extreme conditions of temperature, pressure, and salinity, which were originally thought to be incompatible with life (for a recent review see Rothschild and Mancinelli, 2001, and references therein). With the genomic sequences of these organisms becoming available, it is rather surprising that no striking genomic counterparts seem to be associated with such extreme lifestyles. For example, at the DNA level, an
0	GENERAL  AND  COMPARATIVE
0	Yolk  steroid  hormones  and  sex  determination  in  reptiles  with  TSD
0	Abstract In reptiles with temperature-dependent sex determination (TSD), the temperature at which the eggs are incubated determines the sex of the offspring. The molecular switch responsible for determining sex in these species has not yet been elucidated. We have examined the dynamics of yolk steroid hormones during embryonic development in the snapping turtle, Chelydra serpentina, and the alligator, Alligator mississippiensis, and have found that yolk estradiol (E2) responds differentially to incubation temperature in both of these reptiles. Based upon recently reported roles for E2 in modulation of steroidogenic factor 1, a transcription factor known to be significant in the sex differentiation process, we hypothesize that yolk E2 is a link between temperature and the gene expression pathway responsible for sex determination and differentiation in at least some of these species. Here we review the evidence that supports our hypothesis. © 2003 Elsevier Science (USA). All rights reserved.
0	Temperature-dependent sex determination Sex determination is thought to occur in two fundamentally different modes: genetic sex determination (GSD), in which sex chromosomes determine the sex of the individual, and environmental sex determination (ESD), in which environmental factors determine sex. In one form of ESD, temperature-dependent sex determination (TSD), the temperature at which the eggs are incubated determines the sex of the hatchlings. Three different patterns, or temperature profiles, have been described for TSD species: male-female (MF), female-male (FM), and female-male-female (FMF). In the MF pattern, low temperatures produce a majority of males, high temperatures produce mostly females, and intermediate temperatures produce mixed ratios of males to females. The intermediate temperature that produces a 1:1 ratio of males to females is referred to as the pivotal temperature for the species. Several turtle species have been reported to show this profile, including the painted turtle, Chrysemys picta, and the red-eared slider turtle, Trachemys scripta (Ewert et al., 1994). In the FM pattern, the temperature regimen is reversed, with high
0	temperatures producing mainly males, low temperatures producing primarily females, and again, intermediate temperatures producing mixed ratios of males to females. This pattern has been reported for some lizards (Viets et al., 1994), including the skink, Eulamprus tympanum, the only viviparous TSD lizard reported to date (Robert and Thompson, 2001). In the third TSD pattern, FMF, females are produced at low temperatures, a majority of males are produced at an intermediate temperature, and predominantly females are produced again at high temperatures. In this system there are two pivotal temperatures at which mixed ratios of males to females are produced. This pattern is displayed in all the crocodilians studied to date, including the American alligator, Alligator mississippiensis (Lang and Andrews, 1994). In the snapping turtle, Chelydra serpentina, the usual TSD pattern is FMF (Ewert et al., 1994); however, the TSD pattern in some populations of snapping turtles varies slightly from that described, being MF, with males predominating at lower temperatures, females at higher temperatures, and a single pivotal temperature range. The period of development during which sex is determined, the thermosensitive period (TSP), falls within the middle one-third to one-half of the total incubation time (Wibbels et al., 1991a), and temperature influences the rate of development as well as the sex of the hatchling.
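The three TSD patterns described above can be summarised as a small lookup. The following is only an illustrative sketch: the function name `majority_sex` and the pivotal temperatures passed to it are hypothetical and are not taken from the studies cited; real pivotal temperatures are species- and population-specific, and mixed ratios occur over a range rather than at a single point.

```python
def majority_sex(pattern, temp_c, pivots):
    """Return the sex expected to predominate at incubation temperature temp_c.

    pattern: 'MF', 'FM' or 'FMF' (the three TSD patterns in the text).
    pivots:  sorted pivotal temperatures in degrees C (hypothetical values);
             one value for MF/FM, two values for FMF.
    """
    if pattern in ("MF", "FM"):
        (pivot,) = pivots  # exactly one pivotal temperature expected
        if temp_c == pivot:
            return "mixed"
        # MF: males at low, females at high temperatures; FM is the reverse.
        low, high = ("male", "female") if pattern == "MF" else ("female", "male")
        return low if temp_c < pivot else high
    if pattern == "FMF":
        lower, upper = pivots  # two pivotal temperatures expected
        if temp_c in (lower, upper):
            return "mixed"
        # Males predominate between the two pivots, females outside them.
        return "male" if lower < temp_c < upper else "female"
    raise ValueError(f"unknown TSD pattern: {pattern!r}")


# Hypothetical pivotal temperatures, chosen only for illustration.
print(majority_sex("MF", 24.0, [28.0]))
print(majority_sex("FMF", 27.0, [24.0, 30.0]))
```

The sketch ignores clutch effects and the thermosensitive period; it only encodes the direction of each pattern.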
0	Temperature is apparently not the only factor influencing sex determination, at least in some of these species. There are reports of large variations in the ratios of males to females produced among clutches of eggs laid by different females at the pivotal temperature, where one would expect to see a 1:1 ratio (Rhen and Lang, 1998, Fig. 1). This would indicate that other factors, perhaps some maternal contribution, could influence the outcome of the sex-determining process. Clutch identity or "clutch effects" have also been reported to influence other aspects of offspring fitness, including residual yolk mass, fat body mass and total mass of hatchling snapping turtles (Rhen and Lang, 1999). Moreover, studies of post-hatch growth of snapping turtles showed significant clutch effects in growth rates that were independent of egg mass (Rhen and Lang, 1995). These differences could also be due to differential hormone deposition in yolk, as has been reported in some avian species (Frank et al., 1991; Schwabl, 1996; Schwabl et al., 1997).
0	Gene expression patterns during sex differentiation of TSD reptiles What is known about the sex differentiation process in reptiles with TSD? The gene expression pattern that leads to sex determination and subsequent testis or ovary differentiation has been defined best in mammalian species, which utilize GSD. SRY (Sex-determining region of the Y chromosome) is thought to be the primary determinant of testis differentiation in mouse and human systems (reviewed by Koopman et al., 2001), but there is no known homologue of SRY in TSD reptiles. There are a number of candidate genes that are present
0	but  since  the  embryonic  adrenal  gland  is  extremely  active,  these  results  do  not  accurately  reflect  activity  of  the  gonad  alone  (T.  Wibbels,  personal  communication).  Since  in  mammalian  species  SF-1  works  in  conjunction  with  SOX9  to  up-regulate  AMH  for  male  differentiation,  SF-1  must  participate  in  completely  different  interactions  in  chickens  and  alligators,  where  it  is  upregulated  in  females.  Recent  reports  indicate  that  DAX1,  an  orphan  nuclear  receptor,  inhibits  the  expression  of  genes  in  the  male  differentiation  pathway  possibly  by  modulating  the  activity  of  SF-1  (reviewed  by  Parker  and  Schimmer,  2002).  DAX1  also  has  reported  interactions  with  estrogen  receptors  and  is  thought  to  act  as  a  corepressor,  so  could  play  a  role  in  estrogen  signaling  pathways  (Zhang  et  al.,  2000).  Cytochrome  P450  aromatase  expression,  a 
0	FEBS  23893
0	Gene  expression  data  analysis
1	Alvis  Brazma*,  Jaak  Vilo
0	what are the functional roles of different genes and in what cellular processes do they participate; how are genes regulated, how do genes and gene products interact, what are these interaction networks; how does gene expression level differ in various cell types and states, how is gene expression changed by various diseases or compound treatments.
0	Knowing  the  gene  transcript  abundance  in  various  tissues,  developmental  stages  and  under  various  conditions  is  important  for  attacking  these  questions.  Although  mRNA  is  not  the
0	ultimate product of a gene, transcription is the first step in gene regulation, and information about the transcript levels is needed for understanding gene regulatory networks. Moreover, the measurement of mRNA levels currently is considerably cheaper and can be done in a more high-throughput way than direct measurements of the protein levels. The correlation between the mRNA and protein abundance in the cell may not be straightforward; nevertheless, the absence of mRNA in a cell is likely to imply a low level of the respective protein, and thus at least qualitative estimates about the proteome can be based on the transcriptome information. The mRNA and protein level correlation studies are under way (see [1]). Monitoring gene expression at the transcript level has become possible due to the advent of DNA microarray technologies (see [2]). A microarray is a glass slide, onto which single-stranded DNA molecules are attached at fixed locations (spots). There may be tens of thousands of spots on an array, each related to a single gene. Microarrays exploit the preferential binding of complementary single-stranded nucleic acid sequences. There are several variations of microarray technologies, each used in a specific way. One of the most popular experimental platforms is used for comparing mRNA abundance in two different samples (or a sample and a control). RNA from the sample and control cells is extracted and labeled with two different fluorescent labels, e.g. a red dye for the RNA from the sample population and a green dye for that from the control population. Both extracts are washed over the microarray. 
Gene sequences from the extracts hybridize to their complementary sequences in the spots. To measure the relative abundance of the hybridized RNA, the array is excited by a laser. If the RNA from the sample population is in abundance, the spot will be red; if the RNA from the control population is in abundance, it will be green. If sample and control bind equally, the spot will be yellow, while if neither binds, it will not fluoresce and will appear black. Thus, from the fluorescence intensities and colors for each spot, the relative expression levels of the genes in the sample and control populations can be estimated. By measuring transcription levels of genes in an organism under various conditions, at different developmental stages and in different tissues, we can build up `gene expression profiles' which characterize the dynamic functioning of each gene in the genome. We can imagine the expression data represented in a matrix with rows representing genes, columns representing samples (e.g. various tissues, developmental stages and treatments), and each cell containing a number characterizing the expression level of the particular gene in the particular sample. We will call such a table a gene expression
0	matrix. Building up a database of such matrices will help us to understand gene regulation, metabolic and signaling pathways, the genetic mechanisms of disease, and the response to drug treatments. For instance, if overexpression of certain genes is correlated with a certain cancer, we can explore which other conditions affect the expression of these genes and which other genes have similar expression profiles. We can also investigate which compounds (potential drugs) lower the expression level of these genes. 2. From raw data to gene expression matrix Like many experimental technologies, microarrays measure the target quantity (i.e. relative or absolute mRNA abundance) indirectly by measuring another physical quantity: the intensity of the fluorescence of the spots on the array for each fluorescent dye, i.e. for each optical wavelength
0	(so-called channel). Therefore the raw data produced by microarrays are in fact monochrome images (Fig. 1). Transforming these images into the gene expression matrix is a nontrivial process: the spots corresponding to genes on the microarray should be identified, their boundaries determined, and the fluorescence intensity from each spot measured and compared to the background intensity and to the corresponding intensities for other channels. The software for this initial image processing is often provided with the image scanner, since it will depend on particular properties of the hardware. Often laborious manual adjustment of the grid for spots is used. We will not discuss the raw data processing in detail in this paper; a survey of image analysis software can be found at http://cmpteam4.unil.ch/biocomputing/array/software/MicroArray_Software.html. In any physical experiment it is important to know not only the value of the measurement, but also the standard error or
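The step from per-spot channel intensities to a gene expression matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular scanner software: the function names `log_ratio` and `build_matrix`, the input layout, the background-subtraction floor, and the choice of a log2 sample/control ratio are all assumptions made for the example.

```python
import math


def log_ratio(sample_fg, sample_bg, control_fg, control_bg, floor=1.0):
    """log2 ratio of background-subtracted sample vs control intensity.

    The floor (an assumed value) avoids taking the log of zero or negative
    numbers when a spot is no brighter than its background.
    """
    sample = max(sample_fg - sample_bg, floor)
    control = max(control_fg - control_bg, floor)
    return math.log2(sample / control)


def build_matrix(spots):
    """Build a gene expression matrix (rows = genes, columns = samples).

    spots: {gene: {sample_name: (sample_fg, sample_bg, control_fg, control_bg)}}
    Returns (genes, samples, matrix); missing measurements are stored as None.
    """
    genes = sorted(spots)
    samples = sorted({name for per_gene in spots.values() for name in per_gene})
    matrix = [
        [log_ratio(*spots[g][s]) if s in spots[g] else None for s in samples]
        for g in genes
    ]
    return genes, samples, matrix


# Made-up intensities for two hypothetical genes in two hypothetical samples.
example = {
    "geneA": {"tissue1": (900, 100, 300, 100), "tissue2": (300, 100, 300, 100)},
    "geneB": {"tissue1": (200, 100, 500, 100)},
}
genes, samples, matrix = build_matrix(example)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

A positive entry corresponds to a spot that would appear red (sample in excess), a negative entry to a green spot, and a value near zero to a yellow spot.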
0	Nutrient  control  of  gene  expression  in  Drosophila:  microarray  analysis  of  starvation  and  sugar-dependent  response
1	Ingo Zinke, Christina S. Schütz, Jörg D. Katzenberger, Matthias Bauer and Michael J. Pankratz1
0	Institut für Genetik, Forschungszentrum Karlsruhe, Postfach 3640, D-76021 Karlsruhe, Germany
0	We  have  identified  genes  regulated  by  starvation  and  sugar  signals  in  Drosophila  larvae  using  whole-genome  microarrays.  Based  on  expression  profiles  in  the  two  nutrient  conditions,  they  were  organized  into  different  categories  that  reflect  distinct  physiological  pathways  mediating  sugar  and  fat  metabolism,  and  cell  growth.  In  the  category  of  genes  regulated  in  sugar-fed,  but  not  in  starved,  animals,  there  is  an  upregulation  of  genes  encoding  key  enzymes  of  the  fat  biosynthesis  pathway  and  a  downregulation  of  genes  encoding  lipases.  The  highest  and  earliest  activated  gene  upon  sugar  ingestion  is  sugarbabe,  a  zinc  finger  protein  that  is  induced  in  the  gut  and  the  fat  body.  Identification  of  potential  targets  using  microarrays  suggests  that  sugarbabe  functions  to  repress  genes  involved  in  dietary  fat  breakdown  and  absorption.  The  current  analysis  provides  a  basis  for  studying  the  genetic  mechanisms  underlying  nutrient  signalling.  Keywords:  fat/feeding/microarrays/starvation/sugar
0	Halaas, 1998). Malfunctioning of physiological pathways underlying nutrient signalling and energy homeostasis can have major consequences for human health, and modern society is facing an ever-increasing number of cases of physiological disturbances such as eating disorders, diabetes and obesity. As the dietary requirement for sugars, fats and amino acids is essentially universal, many aspects of the basic logic of nutrient signalling should be conserved. The finding that both Drosophila and Caenorhabditis elegans possess components of insulin signalling supports this view (Lehner, 1999; Brogiolo et al., 2001; Gems and Partridge, 2001). As part of our analysis of Drosophila larval feeding behaviour, we previously identified lipase 3 (lip3) and phosphoenolpyruvate carboxykinase (pepck) as being upregulated upon starvation (Zinke et al., 1999). Upon addition of sugar, this upregulation was completely suppressed for lip3, but not for pepck. These results demonstrated that different nutrient conditions can have very specific effects on gene expression patterns in Drosophila larvae. We have now used Affymetrix microarrays to identify genes regulated by starvation and by sugar in order to study the mechanisms underlying nutrient signalling. Based on the pattern of response to different nutrient conditions and on existing knowledge of metabolic pathways, we could categorize the identified genes into groups that reflect distinct physiological functions. We have further characterized a zinc finger transcription factor that is one of the earliest and most strongly upregulated genes upon sugar ingestion. 
Identification  of  potential  target  genes  indicates  that  this  transcription  factor  functions  to  repress  genes  involved  in  dietary  fat  breakdown  and  absorption.
0	Drosophila  larvae  are  continuous  feeders  and  show  large  growth  in  a  relatively  short  time  period.  About  5  days  after  egg  laying  (AEL),  they  stop  feeding,  leave  the  food  to  enter  the  wandering  stage  and  pupariate  shortly  thereafter  (Figure  1A).  Within  this  normal  developmental  progression,  there  are  several  notable  variations  that  become  apparent  under  different  environmental  conditions.  One  intriguing  observation  was  made  by  Beadle  et  al.  (1938).  When  larvae  are  starved  before  70  h  AEL,  they  die  within  several  days,  whereas  if  they  are  starved  after  this  time  point,  they  do  not  grow,  but  still  survive  and  differentiate  to  give  rise  to  small  adult  flies.  The  authors  concluded  that  some  `organizational  change  occurs  in  larvae  at  about  70  h'  and  termed  this  the  `70  h  change'  (Beadle  et  al.,  1938).  This  survival  after  the  70  h  change  period  is  independent  of  whether  the  larvae  are  starved  or  placed  on  sugar;  however,  before  the  70  h,  larvae  placed  in  sugar  live  for  much  longer  than  those  under  starvation  conditions  (over  a
0	© European Molecular Biology Organization
0	week  as  compared  with  ~2  days;  see  also  Britton  and  Edgar,  1998;  Zinke  et  al.,  1999).  Clearly,  there  is  a  difference  in  the  metabolic  programme  that  becomes  activated  across  this  point  upon  change  in  nutrient  status.  As  the  period  before  70  h  is  critical  for  survival,  we  decided  to  perform  the  experiments  prior  to  this  point.  For  each  time  and  nutrient  condition,  two  chips  were  used  with  each  chip  being  hybridized  to  the  samples  collected  independently  (Figure  1B).
0	Categorization  of  nutrient-dependent  genes
0	Mechanisms  for  differences  in  monozygous  twins
1	Paul Gringras a,*, Wai Chen b,c
0	Keywords:  Twin;  Monozygous;  Genetic  mechanisms
0	Introduction  Over  200  pairs  of  twins  are  assessed  each  year  at  the  Multiple  Births  Foundation,  London.  Despite  often  appearing  indistinguishable  to  strangers,  no  `identical'  twins  assessed  are  so  alike  that  their  mothers  fail  to  distinguish  them  accurately.  Physical  differences  may  be  as  subtle  as  one  small  mole,  or  a  differently  positioned  hair  crown;
0	but  still,  they  exist  and  are  unmistakable  once  identified.  Many  parents  can  also  differentiate  their  `identical'  twins  by  their  personalities,  some  even  claim  from  a  very  early  age.  Physical  similarities  between  MZ  twins  are  well  recognised;  and  these  similarities  have  long  formed  the  basis  of  many  instruments  and  clinical  methods  designed  to  classify  zygosity,  such  as  questionnaires  and  physical  examinations.  Even  the  most  experienced  practitioners  can,  however,  `misclassify'  zygosity  in  about  6%  of  cases  [1],  and  molecular  genetic  methods  are  now  the  preferred  method  for  establishing  zygosity  [2].  The  term  `identical'--although  frequently  used--is  not  synonymous  with  `monozygous'  (MZ).  Most  MZ  twins  are  phenotypically  very  similar,  yet  there  are  significant  numbers  of  MZ  pairs  who  are  neither  phenotypically  nor  genotypically  identical.  Even  if  one  assumes  a  completely  equal  `apportioning'  of  genetic  endowment  when  twinning  occurs,  the  twin  pair  will  only  remain  identical  if  post-zygotic  genetic,  post-zygotic  epi-genetic  and  post-zygotic  environmental  factors  affect  each  twin  equally.  Given  the  extent  of  these  influences  and  many  potential  opportunities  for  disruption  during  the  long  and  complex  intrauterine  development,  it  is  perhaps  surprising  that  so  many  MZ  twins  do  turn  out  to  be  so  alike.  Nevertheless,  it  is  these  anomalous  cases  of  discordant  twins  that  have  taught  us  much  about  human  genetics,  development  and  twinning  in  the  past.  It  is  likely  that  they  will  continue  to  do  so  when  new  technologies  are  applied  to  future  research  in  this  area.  
This  review  summarises  some  past  findings  of  well  established  studies,  and  also  some  from  more  recent  exploratory  studies  using  more  experimental  techniques  and  designs.  We  will  first  consider  the  ante-natal  environmental  factors  and  their  effects,  and  then  the  genetic  factors  that  contribute  to  discordance  in  MZ  twins.  Some  examples  of  discordancy  do  not  necessarily  fit  into  the  above  neat  categories.  For  convenience,  they  have  been  grouped  together  and  discussed  in  the  final  section  on  `discordancies  of  unknown  origin'.
0	Timing of monozygous twinning
0	Monozygous (MZ) twinning occurs when a single fertilised egg gives rise to two separate embryos. The timing of this division can be an important contributory factor in determining post-zygotic discordance in MZ twins. This timing can be characterised by differences in the formation of the amniotic sac, chorion and placenta [3]. In principle, the earlier twinning occurs, the fewer supportive structures the twins share; the later it occurs, the more they share. The extreme example of late twinning is conjoined twins, who even share some somatic organs. If twinning takes place within the first 4 days after conception, two separate placentas and sets of membranes are formed: that is, one set for each embryo. Such twins are called dichorionic (DC) MZ twins, and they account for about one third of all MZ twins. After the fourth day, the progenitor cells of the placenta become separated from the inner cell mass of the embryo. As a result, when twinning occurs after this point, only a single placenta develops. This single monochorionic (MC) placenta serves both
0	Table 1
Chorionicity      Amnionicity      Frequency
Dichorionic       Diamniotic       One-third of monozygous twins
Monochorionic     Diamniotic       Approximately two-thirds of monozygous twins
Monochorionic     Monoamniotic     Five percent of monozygous twins
Conjoined twins
Timing for conjoined twins is theoretical and only suggested by animal models.
0	embryos,  and  in  the  majority  of  cases,  contains  anastomoses  of  blood  vessels  that  connect  the  embryos.  After  about  the  eighth  day,  the  MC  MZ  pair  will  share  a  common  amniotic  sac,  in  addition  to  the  common  MC  placenta  [4].  About  5%  of  MZ  twins  are  monochorionic  (MC)  and  monoamniotic  (MA).  Twinning  after  the  second  week  results  in  the  very  rare  phenomenon  of  conjoined  twins  (see  Table  1).  All  MC  twins  are  MZ  by  definition,  and  this  is  still  the  `gold  standard'  when  defining  monozygosity.  Although  often  seen  in  animals,  vascular  communications  in  dichorionic  placentae  in  man  are  extremely  rare  [5].  The  combination  of  monochorionicity  and  arterioarterial  anastomoses  is  a  better  proof  of  monozygosity  than  any  genetic  test  currently  available.  If  placentation  has  not  already  been  established  by  ultrasound  in  the  first  trimester,  it  relies  on  placental  examination  by  pathologists;  unfortunately,  this  still  has  not  become  routine  clinical  practice  in  most  hospitals,  despite  numerous  pleas  in  the  literature  [6,7].
0	Ante-natal environmental factors
0	3.1. Chorionicity, twin-twin transfusion syndrome and discordant birth weight
0	Anastomotic connections between foetal circulations are present in around 90% of MC placentas. These anastomoses can result in the `twin to twin transfusion syndrome' (TTTS) [8], which takes the form of either a chronic ante-partum transfusion or an acute intrapartum transfusion. In the former event, growth discordance occurs and there are risks for both the donor and the recipient: the donor may become malnourished and growth retarded, while the recipient is at risk of cardiac hypertrophy, polycythaemia and hydramnios. In general, mortality and morbidity rates for both twins in this situation are high without intervention [9]. The acute transfusion syndrome occurs intrapartum and causes increased mortality and morbidity, through hypovolaemia and hypotension in one twin and polycythaemia in the other. Even without TTTS, discordant birth weight in MZ twins remains common as a result of: (1) unequal in-utero blood supply, and hence growth; and perhaps (2) in theory, unequal division of the inner cell mass at twinning. Although such differences may diminish
0	with age, there is a growing body of evidence that a significant discrepancy in birth weight may lead to long-lasting physiological changes in both twins. The concept of `foetal programming' proposes that intrauterine growth affects long-term growth and metabolism in later life. Epidemiological studies linking low birth weight with hypertension and coronary artery disease in adult life suggest that undernutrition before birth `programmes' later cardiovascular outcome [10]. Associations of `small for dates' babies with later insulin resistance and cardiovascular disease are consistent with the hypothesis that late gestation may be a window of sensitivity to nutrition in terms of its influence on later cardiovascular disease. In twins discordant for the development of non-insulin-dependent diabetes mellitus (NIDDM), birth weight has been found to be lower in the affected twin [11]. Investigators continue to use twins with discordant birth weight as a means to test the `foetal programming' hypothesis, on the assumption that the twin pair shares common confounding variables such as social class, genetic endowment and post-natal environment. Two teams have recently reported the importance of birth weight in twins, independent of genetic differences, in influencing their blood pressure as adults [12]. Evidence for `foetal programming' has even been found in early infancy: in a small cohort of MZ twins in which a twin-twin transfusion had occurred, differences in arterial distensibility were found in the donor twin when compared to the recipient [13]. Appealing though the findings from twin studies may be, the extent to which they are generalisable to the singleton population is un
0	Genome-wide  identification  of  in  vivo  Drosophila  Engrailed-binding  DNA  fragments  and  related  target  genes
1	Pascal  Jean  Solano1,*,  Bruno  Mugat1,*,  David  Martin2,  Franck  Girard1,  Jean-Marc  Huibant1,  Conchita  Ferraz1,  Bernard  Jacq2,  Jacques  Demaille1  and  Florence  Maschat1,
0	1 Institut de Génétique Humaine (UPR 1142), 141 rue de la Cardonille, 34396 Montpellier, France
0	2 Laboratoire de Génétique et Physiologie du Développement (UMR 6545), IBDM, Parc Scientifique de Luminy, 13288 Marseille Cedex 9, France
0	SUMMARY  Chromatin  immunoprecipitation  after  UV  crosslinking  of  DNA/protein  interactions  was  used  to  construct  a  library  enriched  in  genomic  sequences  that  bind  to  the  Engrailed  transcription  factor  in  Drosophila  embryos.  Sequencing  of  the  clones  led  to  the  identification  of  203  Engrailed-binding  fragments  localized  in  intergenic  or  intronic  regions.  Genes  lying  near  these  fragments,  which  are  considered  as  potential  Engrailed  target  genes,  are  involved  in  different  developmental  pathways,  such  as  anteroposterior  patterning,  muscle  development,  tracheal  pathfinding  or  axon  guidance.  We  validated  this  approach  by  in  vitro  and  in  vivo  tests  performed  on  a  subset  of  Engrailed  potential  targets  involved  in  these  various  pathways.  Finally,  we  present  strong  evidence  showing  that  an  immunoprecipitated  genomic  DNA  fragment  corresponds  to  a  promoter  region  involved  in  the  direct  regulation  of  frizzled2  expression  by  engrailed  in  vivo.
0	Key  words:  Engrailed,  Chromatin  immunoprecipitation,  In  vivo  targets,  Drosophila
0	INTRODUCTION  Identification  of  target  genes  that  are  directly  regulated  by  transcription  factors  is  a  key  issue  in  developmental  biology,  and  has  been  the  purpose  of  several  recent  studies.  Indeed,  the  genome-wide  location  of  DNA-binding  proteins  using  genomic  microarrays  has  been  performed  in  yeast  (Iyer  et  al.,  2001;  Lieb  et  al.,  2001;  Ren  et  al.,  2000).  In  mammalian  cells,  CpG  island  microarrays  have  allowed  the  identification  of  promoter  regions  capable  of  binding  to  the  E2F  transcription  factor  (Weinmann  et  al.,  2002).  Recently,  whole-genome  microarray  assays  associated  with  bioinformatic  methods  have  also  been  successfully  performed  to  identify  direct  target  genes  of  the  Dorsal  transcription  factor  in  Drosophila  (Markstein  et  al.,  2002;  Stathopoulos  et  al.,  2002).  Identifying  the  genes  that  are  directly  regulated  by  transcription  factors,  rather  than  merely  in  the  downstream  pathways,  remains  essential  for  understanding  gene  function  (Liang  and  Biggin,  1998;  Mannervik,  1999;  Furlong  et  al.,  2001;  Egger  et  al.,  2002).  Homeodomain  transcription  factors  play  key  roles  during  development  by  coordinating  the  behavior  of  most  cells  within  their  domains  of  expression  (Garcia-Bellido,  1975;  Lawrence  and  Morata,  1992),  and  identifying  their  target  genes  is  challenging  (Biggin  and  McGinnis,  1997).  Interestingly,  whereas  homeodomain  proteins  recognize  closely  related  binding  sites,  they  are  involved  in  specific  genetic  pathways  and  their  absence  produces  very  specific  phenotypic  effects
0	Weinmann et al., 2001; Weinmann et al., 2002). However, UV light is believed to be more efficient in fixing proteins that are directly bound to DNA (Toth and Biggin, 2000). In the present report, we constructed a library enriched in genomic sequences that bind Engrailed protein in Drosophila embryos, by using UV crosslinking and chromatin immunoprecipitation (UV-X-ChIP). Systematic sequencing of the recovered clones led to the identification of 203 potential direct targets of engrailed, and evidence is presented to show that some of them represent bona fide engrailed targets. MATERIALS AND METHODS
0	Tissue-Specific  Gene  Expression  and  Ecdysone-Regulated  Genomic  Networks  in  Drosophila
0	Developmental  Cell  60
0	midgut,  larval  epidermal  cells  and  adult  epidermal  progenitor  cells  (midgut  imaginal  islands),  respond  in  opposite  ways  to  ecdysone.  The  larval  epidermal  cells  initiate  the  process  of  programmed  cell  death,  while  the  imaginal  cells  proliferate  and  form  the  adult  midgut.  These  diverse  responses  to  a  single  hormone  offer  an  opportunity  to  study  tissue-specific  genomic  activity  during  a  developmental  process  that  is  coordinately  regulated  throughout  the  animal.  We  define  the  complements  of  genes  expressed  during  the  process  of  metamorphosis  in  specific  tissues.  We  show  that  computational  analysis  of  genome-wide  gene  expression  patterns  can  facilitate  the  identification  of  cis-regulatory  elements  and  a  cognate  transcription  factor.  We  also  show  that  the  network  that  controls  metamorphosis  can  be  extended  beyond  the  ecdysone-regulatory  cascade  to  include  components  of  other  well-studied  signaling  pathways.
0	Results  Identification  of  Transcripts  Enriched  in  Different  Tissues  and  Organs  Delineating  networks  on  a  genome-wide  scale  requires  a  catalog  of  gene  expression  patterns  in  each  tissue  or  organ.  Of  particular  interest  are  those  genes  that  have  high  levels  of  expression  in  only  certain  tissues  or  times  during  development.  We  isolated  five  different  organs  and  tissues  from  the  Drosophila  melanogaster  Canton-S  strain  (Figure  1A).  Samples  were  collected  in  triplicate  approximately  18  hr  before  puparium  formation  (BPF),  when  larvae  are  at  the  end  of  their  feeding  and  growing  phase  but  have  not  yet  begun  metamorphosis  (Riddiford,  1993).  We  compared  RNA  isolated  from  each  organ  or  tissue  to  a  common  reference  RNA  sample  taken  from  identically  staged  whole  animals.  The  use  of  a  linear  amplification  protocol  enabled  small  amounts  of  sample
0	BMC  Bioinformatics
0	BioMed  Central
0	Open  Access
0	Array-A-Lizer:  A  serial  DNA  microarray  quality  analyzer
1	Andreas  Petri*,  Jan  Fleckner  and  Mads  Wichmann  Matthiessen
0	Petri  et  al;  licensee  BioMed  Central  Ltd.  This  is  an  Open  Access  article:  verbatim  copying  and  redistribution  of  this  article  are  permitted  in  all  media  for  any  purpose,  provided  this  notice  is  preserved  along  with  the  article's  original  URL.
0	Background: The proliferation of DNA microarray results has made it necessary to implement uniform and quick quality control of experimental results to ensure the consistency of data across multiple experiments prior to actual data analysis. Results: Array-A-Lizer is a small and convenient stand-alone tool providing the necessary initial analysis of hybridization quality of an unlimited number of microarray experiments. The experiments are analyzed for even hybridization across the slide and between fluorescent dyes in two-color experiments in spotted DNA microarrays. Conclusions: Array-A-Lizer allows the expedient determination of the quality of multiple DNA microarray experiments, allowing for a rapid initial screening of results before progressing to further data analysis. Array-A-Lizer is directed towards speed and ease-of-use, allowing both the expert and non-expert microarray researcher to rapidly assess the quality of multiple microarray hybridizations. Array-A-Lizer is available from the Internet both as source code and as a binary installation package.
0	The ongoing development of DNA microarray analysis equipment has diminished both the price and the workload associated with microarray experiments, leading to the generation of data at a tremendous rate. It is not unusual for a group of researchers to be able to produce and scan 50-100 microarray slides per week. The processing of such large amounts of experimental data first requires verification of the overall quality of the experiments. Array-A-Lizer employs two tests to monitor the quality of the hybridization with respect to uniformity across the slide as well as the relative intensity of the fluorescent dyes in two-color experiments: 1) spectrum analysis of the signal across the microarray slide and 2) comparison of the two dyes that are used in two-color experiments (for instance Cy3 and Cy5).
0	The Array-A-Lizer graphical user interface (GUI) is created in Borland Delphi and the statistical calculations are carried out in the R-project statistical scripting language [1]. Array-A-Lizer includes a microdistribution of the R-project and contains options for specifying the graphical output type as either bitmaps or postscript. Array-A-Lizer supports experiment files from GenePixPro and Spotfinder through an open architecture, which can be extended to include other file formats. Array-A-Lizer runs on the Microsoft Windows platform.
0	Results  and  discussion
0	Array-A-Lizer is an application for rapid quality control of large DNA microarray experiments. The program consists of a collection of scripts that are contained and accessed
0	BMC  Bioinformatics  2004,  5
0	through  a  GUI  to  ease  their  use  (figure  1).  The  main  advantage  of  the  program  is  the  rapid  processing  of  an  unlimited  number  of  experiments.  Array-A-Lizer  generates  reports  with  a  graphical  analysis  of  each  experiment,  providing  the  researcher  with  a  rapid  survey  of  the  quality  of  experiments  (figures  2  and  3).  Additionally,  the  program  returns  an  overview  of  the  results  in  the  system  browser  with  hyperlinks  to  each  analysis  report  (figure  4).  Array-A-Lizer  facilitates  the  generation  of  several  plots  that  detail  the  quality  of  the  experiments.  Two  different  analysis  modes  can  be  chosen,  resulting  in  either  a  set  of  diagnostic  plots  or  a  spatial  representation  of  the  data.  In  comparison  to  existing  analysis  packages,  Array-A-Lizer  is  both  quick  and  easy  to  use.  It  is  a  stand-alone  application  that  can  be  installed  on  any  desktop  computer  running  MS  Windows.  It  is  intended  for  easy  visualization  of  microarray  data  allowing  both  the  expert  and  non-expert  microarray  researcher  to  assess  the  quality  of  multiple  microarray  hybridizations.
0	Diagnostic  report  In  this  mode,  the  experimental  data  are  used  to  generate  several  diagnostic  plots  (figure  2)  as  well  as  statistics  on
0	the identified spots. The Array-A-Lizer diagnostic report includes both MvA plots (figure 2A left) [2] and red/green scatter plots (figure 2A right), both of which show spot intensities after local background subtraction. MvA plots display the log intensity ratio $M = \log_2(R/G)$ versus the mean log intensity $A = \log_2\sqrt{RG}$. This plot type is widely used to visualize array data because it directly displays the red-to-green ratios, which are often the quantities of interest. Furthermore, MvA plots make it easy to identify intensity-dependent biases in the data (i.e. curvature or 'banana shape'). In scatter plots, the intensities from the green channel are plotted against those from the red channel after log2 transformation. Genes displaying different signal intensities in the two channels are plotted off the diagonal, and genes showing similar intensities are plotted close to the diagonal. A common source of variation in microarray data acquisition is incorrectly balanced photomultiplier tube (PMT) settings during scanning. This results in overall differences in the signal intensities obtained from the two channels and a shift of the data from the x-axis (M = 0) or
0	the diagonal (red = green) of the ideal MvA plot and scatter plot, respectively (figure 2B). Finally, the diagnostic analysis generates histograms of the log2-transformed data for comparison of the distribution of intensities between the two channels. The histograms display the signal intensities across the slide (figure 2C). Overamplified channels (PMT levels set too high) result in many saturated spots, which is revealed as an overrepresentation of high intensity values (figure 2D). The diagnostic report includes information on which files were used for the analysis, the number of saturated spots, and the number of negative values, i.e. the number of spots where the background intensity was higher than the foreground intensity.
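The quantities in the diagnostic report can be illustrated with a short Python sketch. This is not Array-A-Lizer's own code (which is written in Delphi and R); the function names, the input tuple layout and the 16-bit saturation ceiling of 65535 are assumptions made for illustration only.

```python
import math
import statistics

SATURATION = 65535  # assumed 16-bit scanner ceiling; not stated in the text


def mva(red, green):
    """Return (M, A) for one spot from background-subtracted intensities.

    M = log2(R/G) and A = log2(sqrt(R*G)). Returns None when either
    channel is non-positive, because the logarithm is undefined there.
    """
    if red <= 0 or green <= 0:
        return None
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a


def diagnose(spots):
    """Summarise (fg_red, bg_red, fg_green, bg_green) tuples, one per spot."""
    points, saturated, negative = [], 0, 0
    for fg_r, bg_r, fg_g, bg_g in spots:
        if fg_r >= SATURATION or fg_g >= SATURATION:
            saturated += 1  # at least one channel hit the ceiling
        if bg_r > fg_r or bg_g > fg_g:
            negative += 1  # background higher than foreground
        point = mva(fg_r - bg_r, fg_g - bg_g)
        if point is not None:
            points.append(point)
    # A median M far from 0 hints at unbalanced PMT settings.
    median_m = statistics.median(m for m, _ in points) if points else None
    return {"points": points, "saturated": saturated,
            "negative": negative, "median_m": median_m}
```

Calling `diagnose` on the spots of one slide returns the (M, A) pairs for plotting, together with the saturated-spot and negative-value counts described above.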
0	Spatial report The spatial analysis results in a graphical representation of microarray data according to the location on the slide (figure 3). From each channel, three different plots are generated showing the log2 transformed foreground intensities, the background intensities, and a plot showing the location of negative values (background higher than foreground).
0	This analysis method can be used to identify spatial effects on the hybridized arrays such as fading or illumination at the edges due to cover-slip effects (figure 3A and 3B) or scratches and artifacts resulting from inadequate washing of slides (figure 3C and 3D). The cut-off values on the background plot can be set from the GUI prior to starting the analysis. Keeping these limits fixed will allow easy detection of pronounced fluctuations in background intensities both between and within slides.
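The three per-channel views of the spatial report can be sketched as follows in Python. This is an illustrative sketch rather than the program's implementation: the record layout `(row, col, foreground, background)`, the function name and the default cut-off values are hypothetical.

```python
import math


def spatial_grids(spots, n_rows, n_cols, bg_limits=(0.0, 500.0)):
    """Build per-slide grids from (row, col, foreground, background) records.

    Returns three n_rows x n_cols nested lists: log2 foreground,
    background clipped to the fixed bg_limits, and a boolean map of
    negative spots (background higher than foreground). Positions with
    no spot stay None.
    """
    lo, hi = bg_limits  # hypothetical cut-offs, kept fixed across slides
    log_fg = [[None] * n_cols for _ in range(n_rows)]
    bg_clip = [[None] * n_cols for _ in range(n_rows)]
    neg = [[None] * n_cols for _ in range(n_rows)]
    for row, col, fg, bg in spots:
        log_fg[row][col] = math.log2(fg) if fg > 0 else None
        bg_clip[row][col] = min(max(bg, lo), hi)
        neg[row][col] = bg > fg
    return log_fg, bg_clip, neg
```

Because the same `bg_limits` are applied to every slide, a background grid that saturates at the upper limit on one slide but not on others stands out immediately, which is the point of keeping the limits fixed.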
0	With  the  reduced  cost  and  labor  of  DNA  m
0	TECHNICAL  REPORTS
0	CA).  Touchdown  PCR  amplifications  were  performed  as  recommended18.  Cycle  sequencing  protocols  were  used  with  ABI  sequencers  at  the  Hutchinson  Center  Biotechnology  Facility.  DHPLC.  Mutation  detection  was  performed  using  the  Transgenomic  WAVE  system.  Following  PCR  amplification,  the  Pfu  polymerase  was  inactivated,  and  the  DNA  samples  were  heated  and  cooled  to  form  heteroduplexes18.  For  most  fragments,  the  predicted  WAVE  (v.3.5)  melting  temperatures  and  separation  gradients  were  used19.
0	We  thank  Bruce  Draper  for  helpful  discussions.  This  work  was  supported  by  grant  RO1  GM29009  (to  S.H.)  from  the  National  Institutes  of  Health.  S.H.  is  an  investigator  of  the  Howard  Hughes  Medical  Foundation,  which  also  provided  support  for  Karen  Wolfe  of  the  James  Roberts  lab,  whom  we  thank  for  helping  us  with  the  screen.
0	High-fidelity  mRNA  amplification  for  gene  profiling
1	Ena  Wang1,3,  Lance  D.  Miller2,3,  Galen  A.  Ohnmacht1,  Edison  T.  Liu2,  and  Francesco  M.  Marincola1*
0	QUANTITATIVE  TRAIT  LOCI  IN  DROSOPHILA
1	Trudy  F.  C.  Mackay
0	Phenotypic variation for quantitative traits results from the simultaneous segregation of alleles at multiple quantitative trait loci. Understanding the genetic architecture of quantitative traits begins with mapping quantitative trait loci to broad genomic regions and ends with the molecular definition of quantitative trait loci alleles. This has been accomplished for some quantitative trait loci in Drosophila. Drosophila quantitative trait loci have sex-, environment- and genotype-specific effects, and are often associated with molecular polymorphisms in non-coding regions of candidate genes. These observations offer valuable lessons to those seeking to understand quantitative traits in other organisms, including humans.
0	INTROGRESSION
0	Transfer of genetic material from one strain to another by repeated backcrosses. With marker-assisted introgression, markers that distinguish the parental strains are used to track the desired interval and select against the undesired genotype.
0	The  ease  with  which  Mendelian  and  quantitative  traits  give  up  their  genetic  secrets  is  inversely  proportional  to  the  relative  importance  of  the  two  classes  of  trait  for  human  health,  agriculture,  evolution  and  even  functional  genomics.  Although  devastating  to  the  possessor,  highly  deleterious  alleles  that  cause  inborn  errors  of  metabolism  and  other  single  gene  disorders  are  rare  in  the  general  population.  By  contrast,  susceptibility  to  common  diseases  such  as  atherosclerosis,  arthritis,  diabetes,  hypertension  and  schizophrenia  is  affected  by  multiple  genetic  factors  and  by  the  environment.  These  diseases  are  therefore  quantitative  traits  (FIG.  1),  and  affect  a  large  proportion  of  the  human  population.  Similarly,  individuals  vary  quantitatively  in  their  response  to  drug  therapy.  There  is  great  excitement  in  the  human  genetics  community  and  the  pharmaceutical  industry  that  susceptibility  loci  for  common  diseases  and  individual  variation  in  drug  response  can  be  identified  and  the  molecular  basis  for  this  variation  determined.  This  knowledge  will  herald  a  new  era  of  personalized  medicine  in  which  environment-specific  risk  factors  for  common  diseases  are  assessed  for  individual  genotypes  (and  hopefully  avoided  by  the  patient)  and  pharmaceutical  treatment  is  genotype  dependent.  Similar  arguments  apply  to  the  agriculture  industry,  in  which  most  characters  of  economic  importance  in  domestic  animal  and  crop  species  are  quantitative.  There  is  a  long  history  of  success  in  improving  productivity  traits
0	by  selective  breeding  for  favourable  phenotypes.  Knowledge  of  the  allelic  status  at  each  locus  affecting  these  traits  will  greatly  facilitate  this  process,  and  will  enable  INTROGRESSION  of  favourable  alleles  from  other  strains,  while  simultaneously  eliminating  deleterious  alleles.  Variation  for  quantitative  traits  is  the  raw  material  on  which  the  forces  of  evolution  act  to  produce  phenotypic  diversity  and  adaptation.  Major  research  efforts  in  evolutionary  quantitative  genetics  are  aiming  to  determine  how  genetic  variation  for  adaptive  quantitative  traits  is  maintained  in  natural  populations;  whether  the  loci  at  which  variation  occurs  within  a  population  are  the  same  as  those  that  cause  divergence  between  populations  and  species;  and  how  the  answers  to  these  questions  depend  on  the  relationship  of  the  trait  to  the  ultimate  quantitative  trait  --  reproductive  fitness.  So  a  comprehensive  understanding  of  the  evolutionary  process  is  contingent  on  a  detailed  description  of  the  molecular  genetic  basis  of  variation  for  quantitative  traits  in  natural  populations.  The  complete  genome  sequences  of  the  yeast  Saccharomyces  cerevisiae1,  the  nematode  Caenorhabditis  elegans2  and  the  fruitfly  Drosophila  melanogaster3  reveal  that  a  large  fraction  of  these  genomes  is  uncharted  phenotypic  territory.  In  Drosophila,  for  example,  only  2,500  of  the  13,600  genes  and  predicted  genes  (18%)  have  been  characterized  by  classic  genetic  and  molecular  methods3.  An  important  challenge  for  the  future  is  to  devise  ways  of  determining  the  phenotypic  effects  of
0	NATURE  REVIEWS  |  GENETICS
0	Macmillan  Magazines  Ltd
0	[Figure residue: genotypes A1A1, A1A2 and A2A2 plotted against phenotype and frequency; axis: phenotypic value. Panel titles: No GEI (parallel reaction norms); GEI (reaction norms cross); GEI (change of variance).]
0	ANTAGONISTIC  PLEIOTROPY
0	Alternative  homozygous  genotypes  (A1A1,  A2A2)  have  opposite  phenotypic  effects  under  different  conditions.
0	CONDITIONAL  NEUTRALITY
0	The  difference  between  quantitative  trait  loci  genotypes  is  only  expressed  under  some  conditions.
0	VARIANCE
0	A statistic that quantifies dispersion about the mean. In quantitative genetics, the phenotypic variance, VP, is the observed variation of the trait in a population. VP is partitioned into components due to additive (VA), dominance (VD) and epistatic (VI) genetic variance, the variance attributable to the environment (VE), and gene-environment correlations and interactions.
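The partition described in this definition is commonly written as below. The notation for the covariance and interaction terms is an assumption taken from the standard quantitative-genetics convention, not from the original text, which names these components only in words.

\[
V_P = V_G + V_E + 2\,\mathrm{Cov}(G,E) + V_{G \times E},
\qquad
V_G = V_A + V_D + V_I
\]

Here $\mathrm{Cov}(G,E)$ stands for the gene-environment correlation and $V_{G \times E}$ for the gene-environment interaction; when both are negligible the expression reduces to $V_P = V_A + V_D + V_I + V_E$.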
0	uncharacterized  and  predicted  genes.  Conventional  screens  for  mutations  with  large  phenotypic  effects  can  lead  to  the  identification  of  function  for  a  biased  sample  of  genes  --  mutating  one  gene  in  a  pathway  in  which  there  is  functional  redundancy  might  not  cause  a  major  effect  on  the  phenotype.  Furthermore,  homozygous  lethal  mutations  define  loci  that  are  essential  for  viability,  but  less  severe  mutations  at  these  loci  may  have  unknown  and  unexpected  pleiotropic  effects  on  morphology,  physiology  and  behaviour.  So,  genetic  screens  for  mutations  with  subtle,  quantitative  effects  and  genetic  analysis  of  naturally  occurring  variation  for  quantitative  traits  will  be  important  components  of  the  functional  genomics  tool  kit.  Until  very  recently,  the  genetic  basis  of  variation  for  quantitative  traits  was  inferred  solely  from  statistical  estimates  of  correlations  between  relatives,  response  to  artificial  selection  and  changes  of  mean  and  VARIANCE  of  the  trait  on  inbreeding  and  crossing4,5.  To  reap  the  benefits  of  a  thorough  understanding  of  quantitative  traits,  we  must  lift  this  statistical  fog6  and  describe  quantitative  genetic  variation  in  terms  of  complex  genetics  (FIG.  1).  Specifically,  a  full  understanding  of  the  genetic  architecture  of  a  quantitative  trait  will  require  answers  to  the  following  questions.  What  are  the  loci  at  which  mutational  variation  affecting  the  trait  occurs?  What  are  the  spontaneous  mutation  rates  at  these  loci?  What  loci  affect  naturally  occurring  variation  within  and  between  populations  of  a  single  species,  and  between  species?  What  are  the  homozygous  and  heterozygous  effects  of  alleles  at  these  loci?  
Are  the  effects  of  the  individual  loci  on  the  final  phenotype  independent  (additive),  or  is  the  effect  of  multiple  loci  on  the  phenotype  nonlinear  (epistasis)?  What  is  the  effect  of  quantitative  trait  locus  (QTL)  alleles  on  multiple  quantitative  traits,  including
0	reproductive fitness (pleiotropy)? How do the homozygous, heterozygous, epistatic and pleiotropic QTL effects vary between the sexes and in a range of ecologically relevant environments? What defines a QTL allele at the molecular level? What are QTL allele frequencies within and between populations? At present, detailed genetic dissection of quantitative traits is most feasible in genetically tractable and well-characterized model systems. Drosophila melanogaster is one of the model organisms that provides us with all the tools necessary for identifying QTL and characterizing them at the molecular level7 (FIG. 2). Over eight decades of research on this organism have provided us with a library of stocks that bear mutations at single loci and deficiency chromosomes that cover around 70% of the genome. The P transposable element has been harnessed as a transformation vector and modified for efficient insertional mutagenesis, analysis of tissue-specific expression patterns, general and targeted overexpression, and, most recently, homologous rec
0	review
0	In  control:  systematic  assessment  of  microarray  performance
1	Harm  van  Bakel  &  Frank  C.P.  Holstege+
0	Expression profiling using DNA microarrays is a powerful technique that is widely used in the life sciences. How reliable are microarray-derived measurements? The assessment of performance is challenging because of the complicated nature of microarray experiments and the many different technology platforms. There is a mounting call for standards to be introduced, and this review addresses some of the issues that are involved. Two important characteristics of performance are accuracy and precision. The assessment of these factors can be either for the purpose of technology optimization or for the evaluation of individual microarray hybridizations. Microarray performance has been evaluated by at least four approaches in the past. Here, we argue that external RNA controls offer the most versatile system for determining performance and describe how such standards could be implemented. Other uses of external controls are discussed, along with the importance of probe sequence availability and the quantification of labelled material. Keywords: expression profiling; external controls; microarray; performance; quality; spikes
0	DNA  microarrays  are  universal  tools  that  can  be  applied  throughout  the  life  sciences  (Brown  &  Botstein,  1999;  Lockhart  &  Winzeler,  2000;  Young,  2000).  mRNA-expression  profiling  is  the  most  frequent  application.  Such  microarray  hybridizations  determine  changes  in  mRNA  levels  between  two  samples  or  result  in  an  absolute  quantification  that  is  correlated  to  mRNA  levels.  How  reliable  are  these  measurements?  Given  the  widespread  interest,  it  is  surprising  that  there  have  been  relatively  few  systematic  analyses  of  microarray  performance.  One  reason  for  this  lack  of  assessment  is  the  complicated  nature  of  microarray  technology;  there  is  no  single  `microarray  technology',  but  rather  a  collection  of  different  technology  platforms.  Established  platforms  include  Affymetrix  GeneChips  (Santa  Clara,  CA,  USA),  PCR-product-based  cDNA  arrays  and  long  oligomer  arrays  that  are  manufactured  in-house  or  by  Agilent  (Palo  Alto,  CA,  USA).  New  platforms  are  still  being  introduced,  such  as  the  Illumina  Beadarray
0	(San Diego, CA, USA; Fan et al, 2004) or the Universal Hexamer Array from Agilix (New Haven, CT, USA; Roth et al, 2004). To complicate matters further, many technical alternatives are possible within each platform for each of the numerous steps between sample preparation and data analysis. These include diverse methods of generating labelled material, various hybridization conditions, different microarray scanners and settings, a range of image-quantification techniques, and several approaches for determining statistically and biologically significant differential gene expression. Microarray technology is therefore an amalgamation of many different techniques, even within individual technology platforms. This complexity makes the need for comparing performance even stronger, whilst confounding such comparisons. Determining reliability is a complicated undertaking if all aspects are to be assessed in a non-arbitrary way across the different platforms and their variants. In addition, reliability is a sensitive issue for those groups that provide the technology. Finally, not every application requires reliable estimates of mRNA level changes. This should be interpreted as an indication of the power of microarray technology, as even lower quality data can yield important results. Improved performance would nevertheless benefit all applications. A high degree of reliability is a requirement if certain fields, such as systems biology (Ideker et al, 2001) or diagnostic mRNA-expression profiling (van de Vijver et al, 2002), are to mature.
A  strong  argument  can  be  made  for  investigating  how  the  technology  can  be  systematically  assessed,  given  its  increased  usage,  the  costs  that  are  involved  and  the  fact  that  the  aim  is  to  determine  the  mRNA  levels  of  all  genes,  including  those  that  are  expressed  at  nearly  zero  levels.  Here,  we  describe  approaches  for  determining  microarray  performance  and  propose  that  the  use  of  external  control  RNAs  is  a  versatile  and  robust  method  for  achieving  this  goal.
0	Accuracy  and  precision
0	Which  performance  parameters  should  be  assessed?  The  two  main  characteristics  of  data  quality  are  accuracy  and  precision.  Whereas  accuracy  refers  to  how  close  a  measurement  is  to  the  real  value,  precision  indicates  how  often  a  measurement  yields  the  same  result  (Fig  1).  When  microarray  data  are  discussed,  the  focus  is  often  on  precision;  that  is,  reproducibility  rather  than  accuracy.  Reproducibility  is  easier  to  assess,  by  taking  repeated  measurements.  Previous  reviews  have  discussed  the  pitfalls  that  are  involved  in  determining  reproducibility,  such  as  the  confusion  between
0	EUROPEAN  MOLECULAR  BIOLOGY  ORGANIZATION
0	Controlling  microarray  performance  H.  van  Bakel  &  F.C.P.  Holstege
0	[Fig 1 panel labels: measured mean versus real value]
0	mized. Confounding artefacts are still being uncovered (Diehl et al, 2001; Ramdas et al, 2001; Chuaqui et al, 2002; Fare et al, 2003; Martinez et al, 2003; Raghavachari et al, 2003; 't Hoen et al, 2003; Lyng et al, 2004). Therefore, monitoring quality would benefit individual hybridizations and projects. This could also aid in analyses of the data that are now being collected in public databases (Edgar et al, 2002; Brazma et al, 2003). In these cases, internal quality control would allow the refinement of decisions about which data to use, depending on the requirement for different quality parameters.
0	Approaches  to  determining  performance
0	One method that can be used to optimize protocols is to measure and increase the signal intensity (Rickman et al, 2003; Wrobel et al, 2003). The underlying assumption is that increased signal-to-noise ratios will yield better quality hybridizations. However, an increase in signal might be nonspecific; for example, owing to increased cross-hybridization or the nonspecific binding of fluorophores to nucleic-acid probes (Chuaqui et al, 2002). It is therefore risky to optimize signal-to-noise ratios without knowing whether specificity is being maintained. A second approach is to determine the correlation between new methods and an approach that is already in use. Different amplification and labelling techniques are usually assessed by comparison to a standard cDNA-synthesis protocol (Mahadevappa & Warrington, 1999; Manduchi et al, 2002; Gupta et al, 2003; 't Hoen et al, 2003; Kenzelmann et al, 2004). A correlation coefficient only shows how similarly two protocols behave; it does not give information on their individual accuracy. A high correlation (Barczak et al, 2003) might therefore mean that the technologies that are being compared both suffer from the same error. Moreover, a low correlation (Tan et al, 2003) still leaves open the question of which technique is better. Another use of correlation is to monitor reproducibility; for example, between the two dye channels of cDNA arrays. The drawback is that the technology is being optimized for yielding identical intensities, rather than for accurately reporting what most users are interested in: differences in mRNA levels.
Perfectly  tight  same-versus-same  scatter  plots,  which  are  often  touted  in  publications  or  advertisements  as  proof  of  superior  performance,  should  be  treated  with  caution.  Optimization  that  is  based  on  achieving  tight  scatter  plots  can  lead  to  a  decreased  ability  to  report  changes  in  mRNA  levels.  Ideally,  optimization  should  focus  on  reporting  relative  or  absolute  mRNA  levels  and  should  take  into  account  the  entire  range  of  expression  levels.  A  third  method  for  performance  evaluation  is  to  use  an  established  cell-culture  experiment  in  which  changes  in  mRNA  levels  are  verified  by  other  means,  such  as  northern  blotting  analysis  or  quantitative  reverse  transcription  (RT)-PCR  (Taniguchi  et  al,  2001;  Yuen  et  al,  2002;  Polacek  et  al,  2003;  Loguinov  et  al,  2004;  Roth  et  al,  2004).  Using  such  established  differentials  is  a  good  method  because  it  optimizes  the  reporting  of  differences  in  expression,  which  is  the  goal  of  most  microarray  hybridizations.  One  disadvantage  is  that  verification  and  optimization  are  driven  by  the  differences  that  are  reported  by  the  microarrays,  rather  than  by  all  of  the  mRNA-level  differences  that  are  present  in  the  experimental  system.  There  is  no  test  for  false-negative  differentials  unless  RT-PCR,  for  example,  is  carried  out  on  many  hundreds  of  genes  that  are  not  reported  as  being  differentially  expressed  in  the  microarray  experiment.  A  further  drawback  is  that  this  method,  similar  to  those  described  above,  does  not  lend  itself  to  the  routine  assessment  of  each  individual  microarray  hybridization  before  optimization.
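The point made above about correlation coefficients can be illustrated with a small simulation. The following is a minimal sketch, not taken from the review: the "true" log ratios, the shared compression factor of 0.5 and the noise level are all hypothetical values chosen for illustration. Two protocols that share the same systematic error correlate almost perfectly with each other, yet both misreport the true differences.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical true log2 ratios for 1000 genes.
true_m = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Assumption: both protocols compress ratios by the same factor
# (a shared systematic error) and add a little independent noise.
compression = 0.5
protocol_a = [compression * m + rng.gauss(0.0, 0.05) for m in true_m]
protocol_b = [compression * m + rng.gauss(0.0, 0.05) for m in true_m]

# Agreement between the protocols (precision-like measure).
r_ab = pearson(protocol_a, protocol_b)

# Distance of protocol A from the truth (accuracy-like measure).
mean_abs_error_a = statistics.fmean(
    [abs(a - m) for a, m in zip(protocol_a, true_m)]
)

print(f"correlation A vs B: {r_ab:.3f}")
print(f"mean absolute error of A vs truth: {mean_abs_error_a:.3f}")
```

In this toy setting the correlation is high while the error against the true values remains large, which is why a known reference (such as external controls of known concentration) is needed to assess accuracy.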
0	Genome-Wide  Location  and  Function  of  DNA  Binding  Proteins
1	Bing Ren,1* François Robert,1* John J. Wyrick,1,2* Oscar Aparicio,2,4 Ezra G. Jennings,1,2 Itamar Simon,1 Julia Zeitlinger,1 Jörg Schreiber,1 Nancy Hannett,1 Elenita Kanin,1 Thomas L. Volkert,1 Christopher J. Wilson,5 Stephen P. Bell,2,3 Richard A. Young1,2
0	Understanding  how  DNA  binding  proteins  control  global  gene  expression  and  chromosomal  maintenance  requires  knowledge  of  the  chromosomal  locations  at  which  these  proteins  function  in  vivo.  We  developed  a  microarray  method  that  reveals  the  genome-wide  location  of  DNA-bound  proteins  and  used  this  method  to  monitor  binding  of  gene-specific  transcription  activators  in  yeast.  A  combination  of  location  and  expression  profiles  was  used  to  identify  genes  whose  expression  is  directly  controlled  by  Gal4  and  Ste12  as  cells  respond  to  changes  in  carbon  source  and  mating  pheromone,  respectively.  The  results  identify  pathways  that  are  coordinately  regulated  by  each  of  the  two  activators  and  reveal  previously  unknown  functions  for  Gal4  and  Ste12.  Genome-wide  location  analysis  will  facilitate  investigation  of  gene  regulatory  networks,  gene  function,  and  genome  maintenance.  Many  proteins  bind  to  specific  sites  in  the  genome  to  regulate  genome  expression  and  maintenance.  Transcriptional  activators,  for  example,  bind  to  specific  promoter  sequences  and  recruit  chromatin  modifying  complexes  and  the  transcription  apparatus  to  initiate  RNA  synthesis  (1-3).  The  reprogramming  of  gene  expression  that  occurs  as  cells  move  through  the  cell  cycle,  or  when  cells  sense  changes  in  their  environment,  is  effected  in  part  by  changes  in  the  DNA  binding  status  of  transcriptional  activators.  Distinct  DNA  binding  proteins  are  also  associated  with  origins  of  DNA  replication,  centromeres,  telomeres,  and  other  sites,  where  they  regulate  chromosome  replication,  condensation,  cohesion,  and  other  aspects  of  genome  maintenance  (4,  5).  Our  understanding  of  these  proteins  and  their  functions  is  limited  by  our  knowledge  of  their  binding  sites  in  the  genome.  
The  genome-wide  location  analysis  method  we  have  developed  allows  protein-DNA  interactions  to  be  monitored  across  the  entire  yeast  genome  (6).  The  method  combines  a  modified  chromatin  immunoprecipitation  (ChIP)  procedure,  which  has  been  previously  used  to  study  protein-DNA  interactions  at  a  small  number  of
0	in  galactose  using  our  analysis  criteria  (Fig.  2A).  These  included  seven  genes  previously  reported  to  be  regulated  by  Gal4  (GAL1,  GAL2,  GAL3,  GAL7,  GAL10,  GAL80,  and  GCY1).  The  MTH1,  PCL10,  and  FUR4  genes  were  also  bound  by  Gal4  and  activated  in  galactose.  Each  of  these  results  was  confirmed  by  conventional  ChIP  analysis  (Fig.  2B)  (6),  and  MTH1,  PCL10,  and  FUR4  activation  in  galactose  was  found  to  be  dependent  on  Gal4  (Fig.  2C).  Both  microarray  and  conventional  ChIP  showed  that  Gal4  binds  to  GAL1,  GAL2,  GAL3,  and  GAL10  promoters  under  glucose  and  galactose  conditions,  but  the  binding  was  generally  weaker  in
0	specific DNA sites (7), with DNA microarray analysis. Briefly, cells were fixed with formaldehyde, harvested, and disrupted by sonication. The DNA fragments cross-linked to a protein of interest were enriched by immunoprecipitation with a specific antibody. After reversal of the cross-links, the enriched DNA was amplified and labeled with a fluorescent dye (Cy5) with the use of ligation-mediated-polymerase chain reaction (LM-PCR). A sample of DNA that was not enriched by immunoprecipitation was subjected to LM-PCR in the presence of a different fluorophore (Cy3), and both immunoprecipitation (IP)-enriched and -unenriched pools of labeled DNA were hybridized to a single DNA microarray containing all yeast intergenic sequences (Fig. 1). A single-array error model (8) was adopted to handle noise associated with low-intensity spots and to permit a confidence estimate for binding (P value). When independent samples of 1 ng of genomic DNA were amplified with the LM-PCR method, signals for greater than 99.8% of genes were essentially identical within the error range (P value threshold of 10⁻³). The IP-enriched/unenriched ratio of fluorescence intensity obtained from three independent experiments was used with a weighted average analysis method to calculate the relative binding of the protein of interest to each sequence represented on the array. To investigate the accuracy of the genome-wide location analysis method, we used it to identify sites bound by the transcriptional activator Gal4 in the yeast genome. Gal4 activates genes necessary for galactose metabolism and is among the best characterized transcriptional activators (1, 9). We found 10 genes to be bound by Gal4 (P value threshold of 0.001) and induced
0	glucose (6). The consensus Gal4 binding sequence that occurs in the promoters of these genes (CGGN₁₁CCG) can also be found at many sites through the yeast genome where Gal4 binding is not detected; therefore, sequence alone is not sufficient to account for the specificity of Gal4 binding in vivo. Previous studies of Gal4-DNA binding have suggested that additional factors such as chromatin structure contribute to specificity in vivo (10, 11). The identification of MTH1, PCL10, and FUR4 as Gal4-regulated genes reveals previously unknown functions for Gal4 and explains how regulation of several different metabolic pathways can be coordinated (Fig. 2D). MTH1 encodes a transcriptional repressor of certain HXT genes involved in hexose transport (12). Our results suggest that the cell responds to galactose by increasing the concentration of its galactose transporter at the expense of other transporters. In other words, while Gal4 activates expression of the galactose transporter gene GAL2, Gal4 induction of the MTH1 repressor gene leads to reduced levels of glucose transporter expression. The Pcl10 cyclin associates with Pho85p and appears to repress the formation of glycogen (13). Thus, the observation that PCL10 is Gal4-activated suggests that reduced glycogenesis occurs to maximize the energy obtained from galactose metabolism. FUR4 encodes a uracil permease (14), and its induction by Gal4 may reflect a need to increase intracellular pools of pyrimidines to permit efficient uridine 5′-diphosphate (UDP) addition to galactose catalyzed by Gal7. We next investigated the genome-wide binding profile of the transcription activator Ste12, which functions in the response of haploid yeast to mating pheromones (15).
Activation of the pheromone-response pathway by mating pheromones causes cell cycle arrest and transcriptional activation of more than 200 genes in a Ste12-dependent fashion (8, 15). However, it is not clear which of these genes are directly regulated by Ste12 and which are regulated by other ancillary factors. The genome-wide binding profile of epitope-tagged Ste12, determined before and after pheromone treatment in three independent experiments, indicates that 29 pheromone-induced genes are regulated directly by Ste12. Figure 3A lists the yeast genes whose promoter regions are bound by Ste12 at the 99.5% confidence level (i.e., P value threshold of 0.005) and whose expression is induced by α factor. These 29 genes are likely to be directly regulated by Ste12 because (i) all have promoter regions bound by Ste12, (ii) exposure to pheromone causes an increase in their transcription, and (iii) pheromone induction of transcription is dependent on Ste12. Of the genes that are directly regulated by Ste12, 11 are already known to participate in various steps of the mating process (Fig. 3B). FUS3 and STE12 encode components of the signal transduction pathway involved in the response to pheromone (16); AFR1 and GIC2 are required for the formation of mating projections (17-19); FIG2, AGA1, FIG1, and FUS1 are involved in cell fusion (20-23); and CIK1
0	The  End  of  the  Microarray  Tower  of  Babel:  Will  Universal  Standards  Lead  the  Way?
1	Ernest  S.  Kawasaki
0	NCI Advanced Technology Center, Bethesda, MD
0	A Proliferation of Microarray Platforms and Associated Technologies
0	Table  1  gives  a  list  of  sources  for  obtaining  whole  genome  arrays,  which  are  defined  as  arrays  that  have  approximately  the  entire  gene  complement  of  the  genome  represented  on  one  slide  or  chip.  You  will  note  that  there  are  large  differences  in  the  size  of  the  probes,  the  number  of  probe  sets,  and  the  total  number  of  probes  per  array.  This  and  many  other  technological  differences  found  in  these  platforms  will  be  enumerated,  with  pointers  as  to  how  or  why
0	The probe size, number of probe sets and total number of probes per array are indicated.
0	these  differences  can  cause  discordant  results  between  platforms.  The  nomenclature  convention  followed  here  is  that  the  "probe"  is  the  gene  sequence  arrayed  on  the  chip,  and  the  "target"  is  the  RNA  sequence  to  be  labeled  and  hybridized  to  the  probes.  Probe  manufacture.  The  probes  for  the  arrays  may  be  made  in  situ  by  photolithographic  or  ink-jet  methods,  or  by  standard  oligonucleotide  synthesis  protocols  followed  by  attachment  to  various  substrates.3  Because  the  methods  are  so  varied,  it  is  difficult  to  estimate  the  purity  of  the  probes  or  their  true  sizes,  and  large  differences  in  these  parameters  can  have  a  great  influence  on  signal  intensi-
0	A decade of microarray publications. The number of publications per year derived from PubMed using the terms "microarray" or "microarrays" is shown.
0	for detecting mRNAs of low abundance than the long probe arrays. Thus, probe size can be a confounding factor when comparing the same genes across many platforms (Table 1). Probe element size and concentration. The element or spot size diameters range from 11 microns to ~200 microns in the different platforms. The size of the array elements (spots), their area in µm², and their concentration in number of molecules per spot are given in Table 2. There is also a large difference in the number of probe molecules per spot, with estimates from several million to hundreds of millions of molecules. This can heavily influence the kinetics of hybridization, signal quantification, and signal intensities of the probes, and these important factors will vary from platform to platform. Probe number per array. The number of probe sets may vary from 30,000 to 54,000, but the total number of probes per array actually ranges from about 30,000 to greater than 500,000 (Table 2). Microarrays may contain one probe per gene or up to twenty probes per gene. This fact alone can make it difficult to directly compare the data from platforms with such a wide range in the number of probes per mRNA sequence. Proper probe annotation. This is an intense area of investigation.6-8 The sequence databases for expressed genes are still in a state of flux, such that probe sequences derived from older databases may be dramatically different from the latest version. It has been found that some probe sequences no longer exist in the database, or were not annotated properly and now have different IDs or names.
Thus,  platforms  may  have  probe  sequences  that  do  not  exist  in  the  genome  or  have  the  incorrect  designation,  and  this  has  been  an  important  source  of  confusion  in  the  analysis  of  array  data.  Target  preparation.  There  is  no  standard  way  of  isolating  RNA  for  target  labeling,  although  almost  all  microarray  experimentalists  follow  the  rule  of  analyzing  the  integrity  of  their  RNA  samples  before  beginning  labeling  steps.  Many  expression  profiling  experiments  in  the  past  were  uninterpretable  simply  because  of  poor  RNA  quality.  A  common  method  to  test  RNA  integrity  is  through  the  use  of  an  Agilent  2100  Bioanalyzer,  which  provides  an  electrophoretic  tracing  and  a  RNA  integrity  number  (RIN)  for  judging  RNA  quality.9  Target  synthesis.  Targets  are  commonly  synthesized  via  cDNA  reactions  on  total  RNA  or  by  in  vitro  synthesis  of  linearly  amplified  RNA  using  T7  RNA  polymerase  technologies.10  The  cDNA  targets  are  thought  to  faithfully  represent  the  original  concentrations  of  the  mRNA  in  the  sample,  but  linearly  amplif
0	BIOINFORMATICS  APPLICATIONS  NOTE
0	arrayMagic:  two-colour  cDNA  microarray  quality  control  and  preprocessing
1	Andreas  Buness,  Wolfgang  Huber,  Klaus  Steiner,  Holger  Sueltmann  and  Annemarie  Poustka
0	that  can  at  any  time  be  re-run  or  extended.  The  compendium  technology  (Gentleman,  2004)  can  be  used  to  produce  distributable  objects  containing  the  data  as  well  as  revivable  documents  reporting  the  processing.  We  aimed  to  integrate  normalization  methods,  quality  scores  and  visualizations  that  had  been  reported  previously.  In  addition,  we  provide  tools  for  dealing  with  different  microarray  layouts  within  one  experiment  and  for  merging  data  from  replicate  probes  or  hybridizations.  The  researcher  obtains  an  instant  overview  on  the  quality  of  the  experiment.
0	Normalization strategies for two-colour microarrays can be divided into two groups: adjustment of the colour channels or of the log-ratios. Moreover, depending on the experimental design and the objectives either a single channel intensity or a log-ratio-based analysis might be more appropriate. The tool offers log-ratio-based normalization by means of the loess method (Yang et al., 2002) and direct intensity-based normalization by means of vsn (Huber et al., 2002) and quantile normalization (Bolstad et al., 2003) methods. We will also use the terms `log-ratios' and `log-transformed intensities' for the data resulting from the vsn method. Groups of hybridizations, subsets of spots, e.g. by grid, print-tip or PCR plate, as well as colour channels can be normalized separately. Plots characterizing the distributions of the log-ratios and colour channels before and after normalization were generated (Fig. 1b).
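As an illustration of the intensity-based option mentioned above, the basic idea of quantile normalization can be sketched as follows. This is a minimal sketch in Python, not the arrayMagic implementation (which is an R package); the function name and the toy data are hypothetical, and ties are simply broken by position rather than averaged.

```python
def quantile_normalize(columns):
    """Force every array (column) to share the same intensity distribution.

    columns: list of equal-length lists, one list per array or channel.
    Returns a new list of lists in the same layout.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across arrays at each rank.
    reference = [
        sum(col[rank] for col in sorted_cols) / len(columns)
        for rank in range(n)
    ]
    normalized = []
    for col in columns:
        # Indices of this array's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace value by the reference quantile
        normalized.append(out)
    return normalized


# Toy data: three arrays, four probes each (hypothetical intensities).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for col in normalized:
    print(col)
```

After normalization every array contains exactly the same set of values, and only the rank order of probes within each array is retained.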
0	Two-colour cDNA microarray technology has evolved into a routine laboratory procedure. Our motivation in implementing arrayMagic was to deal with the large amount of data generated by microarray projects in an efficient, reliable and reproducible manner. We focused on preprocessing and quality assurance, leaving out high-level analysis which has to be addressed specifically. The main design goal was to allow for the rapid construction of customized quality assessment and control (QA/QC) and preprocessing pipelines for such projects from a small set of building blocks. arrayMagic bridges the gap between the image quantification software and subsequent statistical and explorative analyses like testing for differential expression or classification. It simplifies the task of building processing pipelines that are reproducible, which means that even for idiosyncratic experimental designs and non-trivial combinations and selections of the data the whole procedure from raw data to normalized, quality-controlled, annotated and summarized data is documented in a not too verbose script
0	QUALITY  CONTROL  AND  ASSESSMENT
0	Quality-assured data are a prerequisite for any reliable high-level analysis. In addition, quality control allows laboratory procedures to be monitored and improved. The quality of hybridizations is best assessed in the context of normalization. In a model-based approach like vsn, the model is a summary of past experience and our expectations of the data. Thus, it can be used to identify hybridizations or groups of measurements that do not fit. Other methods
0	like loess or quantile normalization place more emphasis on making the data conform in any situation. In these cases, statistics of the data distribution can be calculated (e.g. location and scale of the distribution of normalized log-ratios) and compared against expectations. Moreover, as long as the majority of the data are assumed to be acceptable, outlier detection methods can be used for quality control. Visual inspection of the data is supported by spatial false-colour representations of foreground and background intensities and the log-ratios. This allows scratches and artefacts to be detected (Fig. 1a). Most notably, the spatial plots of the normalized data are useful for assessing the necessity of background correction and for assuring spatial homogeneity of the data. Several quality scores are calculated, stored in a report file and visualized in part. These scores include spot replicate concordance, the correlation of the two colour channels and a robust measure of noise $W$ for each hybridization. $W$ is defined as the median absolute deviation of the normalized log-ratios $q_i$, i.e. $W = \mathrm{mad}_i(q_i) = \mathrm{median}_i(|q_i - \mathrm{median}_j(q_j)|)$. A minority of differentially expressed genes should not disturb $W$. We do not find it practical to define universally applicable thresholds on quality scores. They should be evaluated not on the level of a single hybridization, but in the context of all data in the experiment. In our experience this has been very useful in detecting outliers in large-scale experiments. In particular, a global view on all pairwise similarities between all hybridizations shown in Figure 1c has proved to be useful.
For two arrays $a$ and $b$, we define a similarity score $S_{ab} = \mathrm{mad}_i(x_{ia} - x_{ib})$, where $x_{ia}$ can be the log-ratio of the $i$-th probe on the $a$-th array, or the log-transformed normalized intensity of an individual colour channel. Especially in the
0	case  of  biologically  related  samples,  this  is  an  informative  measure  of  similarity.
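The noise score W and the pairwise similarity score defined above can be computed in a few lines. The following is a minimal sketch in Python rather than the package's own R code; the function names and example values are hypothetical. It follows the formulas as written, so no consistency constant is applied (R's `mad` function, by contrast, multiplies by 1.4826 by default).

```python
import statistics


def mad(values):
    """Unscaled median absolute deviation about the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def noise_w(log_ratios):
    """Robust noise measure W for one hybridization."""
    return mad(log_ratios)


def similarity(x_a, x_b):
    """Similarity score S_ab between arrays a and b (smaller = more similar)."""
    return mad([a - b for a, b in zip(x_a, x_b)])


# Hypothetical normalized log-ratios; the last probe is strongly
# differentially expressed but barely affects W.
q = [0.1, -0.2, 0.0, 0.3, 2.5]
print(noise_w(q))

# Two hypothetical arrays measured on the same three probes.
array_a = [1.0, 2.0, 3.0]
array_b = [1.5, 2.5, 3.5]
print(similarity(array_a, array_b))
```

Because both scores are built on medians, a minority of extreme probes leaves them largely unchanged, which matches the robustness property the text describes.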
0	The  open  source  software  tool  arrayMagic  facilitates  the  analysis  of  two  colour  cDNA  microarray  data.  It  aims  to  provide  quality  assured  and  normalized  data.  The  scriptbased  pipeline  supports  reproducible  batch-like  processing.  The  workflow  starts  with  quantified  image  scan  result  files.  Several  quality  scores  and  diagnostics  are  calculated  and  visualized,  which  offer  a  broad  view.  The  processed  data  can  be  exported  as  HTML-file  or  as  tab-delimited  file  with  spot  and  sample  annotation  and  may  serve  as  input  for  follow-up  analysis  in  commonly  used  tools  of  choice.  Naturally,  high-level  follow-up  analysis  in  the  framework  of  R  and  Bioconductor  is  supported  by  adequate  representation  of  the  data.  Documentation  of  all  functionality  and  a  step-by-step  example  following  a  typical  workflow  is  part  of  the  package.
0	Gentleman,R.  (2004)  Reproducible  research:  a  bioinformatics  case  study.  Stat.  Appl.  Genet.  Mol.  Biol.,  3.  Gentleman,R.,  Carey,V.J.,  Bates,D.J.,  Bolstad,B.M.,  Dettling,M.,  Dudoit,S.,  Ellis,B.,  Gautier,L.,  Ge,Y.,  Gentry,J.  et  al.  (2004)  Bioconductor:  open  software  development  for  computational  biology  and  bioinformatics.  Bioconductor  Project  Working  Papers.  Working  Paper  1.  Huber,W.,  von  Heydebreck,A.,  Sueltmann,H.,  Poustka,A.  and  Vingron,M.  (2002)  Variance  stabilization  applied  to  microarray
0	Normalization  of  microarray  data  using  a  spatial  mixed  model  analysis  which  includes  splines
1	David  Baird1,,  Peter  Johnstone2  and  Theresa  Wilson3
0	AgResearch,
0	techniques for normalization have been suggested, including linear regression (Hedenfalk et al., 2001), ratio statistics (Chen et al., 1997), local smoothing (Yang et al., 2002) and analysis of variance (Kerr et al., 2000; Chu et al., 2002). Yang et al. (2002) compared these approaches and suggested a method which allows for differences induced by different print tips. We extend this idea to model the rows and columns over the whole slide and within the print tips and also autocorrelation in the printing order. This differs from other methodology in that we are able to correct unwanted variation arising from unevenness of the slide surface and scanning efficiency. The usual statistical modelling approach is taken where all possible sources of noise are jointly fitted in one model, with the need for each term being assessed using statistical significance of the reduction in remaining unexplained variation. Model terms can be added or removed as required. The fitted model then indicates where useful modification of our protocols and equipment would help minimize variation in future experiments.
0	METHODS  Amplification  of  ESTs
0	Microarray technology has been used extensively to survey patterns of gene expression in a range of biological models. Using our own collection of bovine expressed sequence tags (ESTs) we have constructed large cDNA arrays (up to 22 000 ESTs) for use in several of our research projects. For such large arrays it is essential to identify sources of variation and correct for them to allow for robust use of this technology. Through normalization procedures, such variations can be identified and removed to obtain data for follow-on research. The analysis of the microarrays is a two-step analysis: a within-slide analysis aimed at normalization and, if required, standardization, and then a between-slide analysis to estimate the differences between targets and their consistency. Various
0	Mixed  models  using  splines  for  microarray  data
0	C,  washed  for  5  min  each  in  (1)  2  x  SSC,  0.1%  SDS,  (2)  1  x  SSC  and  (3)  0.1  x  SSC,  centrifuged  at  500  g  for  5  min,  dried  and  scanned.
0	Allocation  of  probes  to  slides
0	Randomization  is  a  well-known  device  used  to  ensure  the  valid  application  of  significance  tests  and  confidence  intervals  (Fisher,  1951).  Randomization  also  disarms  critics  who  suggest  an  allocation  of  experimental  units  has  been  chosen  which  is  favourable  to  an  author's  hypothesis  (Cox,  1992).  Because  of  these  properties,  it  is  routine  in  traditional  experiments  to  randomly  allocate  treatments  to  the  experimental  units.  In  microarray  experiments  the  physical  constraints  imposed  by  the  storage  of  probes  in  96-well  plates  and  by  the  microarray  printing  robots  ensure  that  a  fully  randomized  layout  is  not  possible.  However,  printing  the  96-well  plates  in  random  order  is  possible  and  is  justified  in  that  some  randomization  is  better  than  the  alternative  of  no  randomization.
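The plate-level randomization described above can be illustrated with a minimal sketch. It assumes the plates are simply numbered 1 to n; the function name `random_print_order`, the plate count and the seed are hypothetical choices for illustration and are not taken from the authors' protocol.

```python
import random


def random_print_order(n_plates, seed=None):
    """Return the plate numbers 1..n_plates in a random printing order.

    Recording the seed makes the allocation reproducible, so the
    printing order can later be used as a term in the analysis.
    """
    rng = random.Random(seed)           # independent generator, not the global one
    order = list(range(1, n_plates + 1))
    rng.shuffle(order)                  # in-place uniform random permutation
    return order


# Hypothetical example: 12 plates, arbitrary seed
order = random_print_order(12, seed=2003)
print("Print plates in this order:", order)
```

Only the order of whole plates is randomized here; the positions of probes within a plate stay fixed, which matches the constraint that a fully randomized layout is not possible.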
0	ANALYSIS  Measure  of  differential  expression  in  probes
0	that  the  value  M  will  be  randomly  distributed  around  a  mean  value  of  0.  Other  approaches  to  handling  values  close  to  or  below  background  can  be  used.  One  option  is  to  make  no  background  correction,  which  will  shrink  all  values  of  M  towards  zero,  with  large  reductions  for  spots  of  low  intensity  and  minimal  reductions  for  spots  of  high  intensity.  This  has  the  advantage  of  reducing  the  variation  of  low-intensity  spots,  but  the  disadvantage  of  reducing  the  sensitivity  for  identifying  differentially  expressed  ESTs  with  low  expression  levels.  Any  spatial  trends  not  eliminated  in  the  log  ratios  due  to  trends  in  the  background  can  be  estimated  and  removed  as  part  of  the  spatial  model,  as  explained  later  in  this  paper.  Another  alternative,  suggested  by  Durbin  and  Rocke  (2003)  in  the  context  of  transforming  the  expression  of  a  single  channel,  is  to  add  a  constant  to  all  values  in  each  channel  as  part  of  a  more  complex  transformation.  The  constant  to  be  added  in  the  Durbin  and  Rocke  approach  is  estimated  as  that  giving  the  best  stabilized  error  variance.  For  large  expression  values,  these  approaches  have  virtually  no  effect  on  the  log  ratio,  but  for  values  just  below  and  above  the  minimum  cut-off,  the  relative  differences  between  the  approaches  may  be  substantial.  The  advantage  of  using  logs  over  more  complicated  transforms  is  that  the  resulting  values  are  more  naturally  interpreted  by  the  experimenter.  Which  approach  is  best  in  terms  of  giving  unbiased  results  can  only  be  ascertained  by  a  uniform  study,  which  is  not  available  in  our  current  datasets.
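The shrinkage of M when no background correction is made can be seen in a small worked example. This is a sketch under stated assumptions: M is taken as the plain base-2 log ratio of the two channels, the intensities and the background of 100 are invented for illustration, and the helper `log_ratio` is hypothetical. It also omits the k-based down-weighting described in the text and simply returns a missing value when either corrected channel is non-positive.

```python
import math


def log_ratio(red_fg, green_fg, red_bg=0.0, green_bg=0.0):
    """Return M = log2(R / G) after subtracting the given backgrounds.

    Simplification (assumption): if either corrected intensity is not
    positive, M is treated as missing and None is returned.
    """
    red = red_fg - red_bg
    green = green_fg - green_bg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)


# Invented foreground intensities (red, green); both spots have a true
# two-fold difference once a background of 100 is removed.
spots = {"low": (300.0, 200.0), "high": (20100.0, 10100.0)}
background = 100.0

results = {}
for name, (red_fg, green_fg) in spots.items():
    corrected = log_ratio(red_fg, green_fg, background, background)
    uncorrected = log_ratio(red_fg, green_fg)
    results[name] = (corrected, uncorrected)
    print(f"{name:>4}: M corrected = {corrected:.3f}, M uncorrected = {uncorrected:.3f}")
```

With these numbers the uncorrected M of the low-intensity spot falls well below the corrected value, while the high-intensity spot is almost unchanged, which is the pattern the paragraph describes.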
0	Within  slide  dye  bias
0	It  is  typically  found  that  the  mean  of  M  at  a  certain  level  of  log-intensity  depends  on  the  level  of  intensity  of  the  probe.  If  we  define  A,  the  average  log-intensity  of  the  probe  as  A=
0	We  have  used  a  value  of  0.5  for  k,  but  have  tried  values  between  0.1  and  1.0.  The  value  of  k  controls  how  much  the  information  on  the  probe  is  down-weighted,  with  larger  values  reducing  the  value  of  M  towards  0.  If  both  dyes  have  negative  corrected  intensities,  then  there  is  no  information  in  the  probe,  and  M  is  set  to  be  a  missing  value.  It  is  expected  that  the  majority  of  probes  in  the  sample  will  show  no  differential  expression,  and
0	then  a  plot  of  M  versus  A  [an  MA  plot  (Dudoit  et  al.,  2002)],  often  shows  a  departure  from  the  zero  reference  line.  It  is  expected  that  the  level  of  differential  expression  is  independent  of  the  brightness  of  the  probes.  Figure  1  shows  the  MA  plot  for  one  of  our  microarrays.  It  can  be  seen  that  the  mean  location  of  the  M  values  is  below  zero  for  A  between  8  and  11  and  above  zero  for  A  >  11,  falling  back  to  zero  as  A  approaches  16.  The  points  falling  on  the  two  lines  at  the  left  of  the  plot  are  due  to  the  truncation  of  the  intensity  of  the  dyes  to  the  minimum  values.  Figure  3  shows  an  MA  plot  o
0	Developmental  roles  and  molecular  characterization  of  a  Drosophila  homologue  of  Arabidopsis  Argonaute1,  the  founder  of  a  novel  gene  superfamily
1	Youhei  Kataoka1,  Masatoshi  Takeichi2  and  Tadashi  Uemura2,a,*
0	Background:  Arabidopsis  Argonaute1  (AGO1)  is  the  founder  of  a  novel  gene  superfamily  that  is  conserved  from  fission  yeasts  to  humans.  AGO1  and  several  other  members  of  this  superfamily  are  necessary  for  stem  cell  renewal  or  RNA  interference.  However,  little  has  been  reported  about  their  roles  in  animal  development  or  about  the  molecular  activities  of  any  of  the  members.  Results:  We  have  isolated  a  Drosophila  homologue  of  AGO1,  dAGO1,  in  our  attempt  to  search  genetically  for  regulators  of  Wingless  (Wg)  signal  transduction.  dAGO1  is  broadly  expressed  in  the  embryo  and  the  imaginal  disc.  dAGO1  over-expression  at  wing  margins  suggested  that  it  behaves  as  a  positive  regulator  in  the  genetic  background  employed.  Loss-of-function  mutations  of  dAGO1,  unexpectedly,  did  not  give  typical  segment  polarity  phenotypes  of  the  wg  class;  instead,  dAGO1  maternal  and  zygotic  mutant  embryos  showed  developmental  defects,  with  malformation  of  the  nervous  system  being  the  most  prominent.  The  mutants  showed  decreased  numbers  of  several  types  of  neurones  and  glia  examined.  The  dAGO1  protein  was  distributed  in  the  cytoplasm  and  co-sedimented  with  poly(U)-  or  poly(A)-conjugated  beads.  Conclusion:  Our  results  suggest  that  the  dAGO1  protein  exerts  its  developmental  functions  by  binding  to  RNA  either  directly  or  indirectly.
0	Cells  are  endowed  with  a  variety  of  mechanisms  to  repress  the  translocation  of  signalling  components  to  nuclei  in  the  absence  of  extracellular  stimuli.  One  straightforward  strategy  to  inhibit  translocation  is  to  destroy  key  components,  such  as  transcription  factors,  before  they  enter  the  nucleus.  Pioneering  examples  of  such  targets  of  proteolysis  are  β-catenin  and  its  Drosophila  homologue,  Armadillo  (Arm;  McCrea  et  al.  1991;  Peifer  et  al.  1992).  Unless  a  cell  receives  secreted  proteins  of  the  Wnt  family,  β-catenin/Arm  are  degraded  in  the  cytoplasm  (Orsulic  &  Peifer  1996;  Pai  et  al.  1997).  Following  the  binding  of  Wnt  to  its  receptor  in  the  Frizzled  family,  the  proteolytic  mechanism  is  inactivated,  and  β-catenin  enters  nuclei,  leading  to  the  transcription  of  target  genes  (Cadigan  &  Nusse  1997;  Wodarz  &  Nusse  1998;  Peifer  &  Polakis
0	©  Blackwell  Science  Limited
0	Most  Wnt  proteins  evoke  the  β-catenin-mediated  signalling  cascade,  which  plays  important  roles  in  cell  proliferation  and  fate  determination  in  animal  development.  Besides  its  role  as  a  transcriptional  activator  in  Wnt  signal  transduction,  Arm/β-catenin  binds  cadherin,  and  this  complex  is  essential  for  cell  adhesion  at  cell-cell  junctions  (Oda  et  al.  1994;  Cox  et  al.  1996;  Müller  &  Wieschaus  1996;  Iwai  et  al.  1997).  Curiously,  these  two  functions  of  Arm/β-catenin  are  separable  (Orsulic  &  Peifer  1996).  Because  of  the  dual  functions  of  β-catenin,  overproduction  of  cadherin  sequesters  β-catenin  and  blocks  Wnt  signalling  in  Xenopus  (for  example,  see  Heasman  et  al.  1994).  Similarly,  cadherin  overproduction  in  Drosophila  wings  mimics  one  of  the  loss-of-function  phenotypes  of  wingless  (wg),  one  of  the  best  characterized  Wnt  genes  in  terms  of  developmental  roles  (Sanson  et  al.  1996;  this  study).  During  the  third  instar  larval  stage,  Wg  is  produced  in  a  stripe  of  cells  in  the  developing  wing  blade,  and  these  cells  are  responsible  for  patterning  the  margin  of  the  adult  wing  (Phillips  &  Whittle  1993;  Couso  et  al.  1994).  Without  this  Wg  function  in  late  stages  of  disc  development,  the
0	wings  lose  their  marginal  structures,  which  can  be  reproduced  with  high  penetrance  by  DE-cadherin  overproduction  (compare  Fig.  3A  with  3B).  Transgenic  flies  that  overproduce  DE-cadherin  along  their  wing  margin  are  healthy  and  fertile.  Thus,  the  strain  provides  an  appropriate  tool  for  conducting  genetic  searches  for  new  regulators  of  Wg  signalling,  as  has  been  previously  attempted  (Greaves  et  al.  1999).
0	Our  search  allowed  us  to  identify  a  Drosophila  homologue  of  Arabidopsis  Argonaute1  (AGO1),  which  is  required  for  the  dorsoventral  identity  of  the  leaf,  development  of  the  axial  meristem  (a  group  of  undifferentiated,  dividing  cells),  and  post-transcriptional  gene  silencing  (Bohmert  et  al.  1998;  Lynn  et  al.  1999;  Fagard  et  al.  2000).  AGO1  is  the  founder  of  a  novel  gene  superfamily  that  is  incredibly  well  conserved  among
0	fission  yeast,  plants  and  animals,  which  is  designated  the  AGO1  gene  superfamily  in  this  article.  To  clarify  the  developmental  roles  of  dAGO1,  we  examined  both  loss-of-function  and  over-expression  phenotypes  in  the  embryo  and  in  the  imaginal  disc.  Although  the  amino  acid  sequences  of  the  proteins  of  this  superfamily  do  not  predict  their  molecular  activities,  our  results  suggested  that  the  dAGO1  protein  binds  to  RNA,  either  directly  or  indirectly.
0	level  of  mRNA  (compare  Fig.  4A  with  4B)  and  was  used  in  subsequent  studies.
0	Subdivision  of  the  AGO1  superfamily
0	At  least  three  alternatively  spliced  transcripts  are  made  from  dAGO1,  and  we  focused  on  one  of  them,  CT42236,  which  is  equivalent  to  the  EST  clone  LD09501  (Fig.  1A;  Adams  et  al.  2000).  The  predicted  dAGO1  protein  consisted  of  950  amino  acids,  and  its  molecular  weight  was  estimated  as  106  kDa.  As  in  the  case  of  all  members  of  the  AGO1  superfamily,  amino  acid  sequences  of  dAGO1  provided  no  definite  information  about  its  molecular  activity.  Phylogenetic  trees  and  multiple  alignments  of  amino  acid  sequences  suggest  that  this  superfamily  consists  of  two  distinct  subfamilies  and  several  orphans  (Fig.  1B,C).  We  named  one  of  these  subfamilies  the  AGO1  subfamily,  which  includes  AGO1,  dAGO1  and  an  S.  pombe  protein,  SPCC736.11.  The  other  subfamily  was  designated  as  the  PIWI  subfamily,  because  the  founder  is  a  Drosophila  protein  of  the  piwi  gene,  which  controls  the  division  of  germ-line  stem  cells  (Cox  et  al.  1998).  The  orphans  whose  mutants  were  isolated,  are  C.  elegans  rde-1,  which  is  required  for  RNA  interference  (Tabara  et  al.  1999),  and  Neurospora  QDE-2,  which  is  required  for  quelling,  a  phenomenon  similar  to  co-suppression  (Cogoni  &  Macino  1997;  Fagard  et  al.  2000).  Every  member  of  the  superfamily  shares  a  conserved  box  of  43  residues  near  the  carboxy  terminal  (Cox  et  al.  1998),  and  proteins  of  the  AGO1  subfamily  share  a  longer  stretch  of  86  residues  on  average  (the  AGO1  box;  Fig.  1C,  D).  What  distinguishes  the  two  subfamilies  most  is  the  presence  or  absence  of  a  region  that  is
0	Identification  of  a  Drosophila  AGO1  homologue  essential  for  viability
0	To  identify  new  components  of  the  Wg  signal  transduction  pathway,  we  performed  a  genetic  screen  for  dominant  modifiers  of  the  wing-margin  phenotype  caused  by  the  over-expression  of  DE-cadherin  (see  details  in  Experimental  procedures).  We  focused  on  a  P-element  insertion  line,  l(2)k08121  (Spradling  et  al.  1995,  1999),  in  which  we  found  that  the  transposon  was  inserted  into  gene  CG6671  (Fig.  1A;  Adams  et  al.  2000).  This  gene  is  homologous  to  Arabidopsis  AGO1  (Bohmert  et  al.  1998)  as  described  below;  therefore  we  designated  this  Drosophila  gene  dAGO1.  The  lethality  of  l(2)k08121  was  due  to  a  loss  of  dAGO1  function,  as  shown  by  the  fact  that  remobilization  of  the  P-element  recovered  the  lethality  and  that  expression  of  a  cDNA  clone  (LD09501;  Rubin  et  al.  2000)  under  a  heat-shock  promoter  made  l(2)k08121  homozygotes  and  l(2)k08121/Df  develop  to  adulthood.  l(2)k08121  is  a  strong  allele,  as  was  shown  by  a  great  reduction  in  the
0	Open  Access
0	Computational  identification  of  Drosophila  microRNA  genes
1	Eric  C  Lai¤,  Pavel  Tomancak¤,  Robert  W  Williams  and  Gerald  M  Rubin
0	These  authors  contributed  equally  to  this  work.
0	Background:  MicroRNAs  (miRNAs)  are  a  large  family  of  21-22  nucleotide  non-coding  RNAs  with  presumed  post-transcriptional  regulatory  activity.  Most  miRNAs  were  identified  by  direct  cloning  of  small  RNAs,  an  approach  that  favors  detection  of  abundant  miRNAs.  Three  observations  suggested  that  miRNA  genes  might  be  identified  using  a  computational  approach.  First,  miRNAs  generally  derive  from  precursor  transcripts  of  70-100  nucleotides  with  extended  stem-loop  structure.  Second,  miRNAs  are  usually  highly  conserved  between  the  genomes  of  related  species.  Third,  miRNAs  display  a  characteristic  pattern  of  evolutionary  divergence.  Results:  We  developed  an  informatic  procedure  called  'miRseeker',  which  analyzed  the  completed  euchromatic  sequences  of  Drosophila  melanogaster  and  D.  pseudoobscura  for  conserved  sequences  that  adopt  an  extended  stem-loop  structure  and  display  a  pattern  of  nucleotide  divergence  characteristic  of  known  miRNAs.  The  sensitivity  of  this  computational  procedure  was  demonstrated  by  the  presence  of  75%  (18/24)  of  previously  identified  Drosophila  miRNAs  within  the  top  124  candidates.  In  total,  we  identified  48  novel  miRNA  candidates  that  were  strongly  conserved  in  more  distant  insect,  nematode,  or  vertebrate  genomes.  We  verified  expression  for  a  total  of  24  novel  miRNA  genes,  including  20  of  27  candidates  conserved  in  a  third  species  and  4  of  11  high-scoring,  Drosophila-specific  candidates.  Our  analyses  lead  us  to  estimate  that  drosophilid  genomes  contain  around  110  miRNA  genes.  
Conclusions:  Our  computational  strategy  succeeded  in  identifying  bona  fide  miRNA  genes  and  suggests  that  miRNAs  constitute  nearly  1%  of  predicted  protein-coding  genes  in  Drosophila,  a  percentage  similar  to  the  percentage  of  miRNAs  recently  attributed  to  other  metazoan  genomes.
0	Although  the  analysis  of  sequenced  genomes  to  date  has  focused  most  heavily  on  the  protein-coding  set  of  genes,  all  genomes  also  contain  a  constellation  of  non-coding  RNA  genes.  With  the  exception  of  certain  classes  of  RNAs  with  strongly  conserved  sequences  and/or  structures,  such  as  ribosomal  and  transfer  RNAs,  identification  of  most  non-coding  RNAs  has  historically  been  a  relatively  serendipitous  affair.  Only  very  recently  have  there  been  concerted  efforts  to  identify  such  genes  systematically,  using  both  experimental  and  computational  approaches  [1].  Our  collective  ignorance  of  the  totality  of  non-coding  RNA  genes  was  laid  bare  by  recent  work  on  microRNAs  (miRNAs),
0	Genome  Biology  2003,  4:R42
0	an  abundant  family  of  21-22  nucleotide  non-coding  RNAs  [2,3].  The  founding  members  of  this  family,  lin-4  and  let-7,  were  identified  through  forward  analysis  of  extant  Caenorhabditis  elegans  mutants  [4,5].  Both  of  these  RNAs  are  post-transcriptional  regulators  of  developmental  timing  that  function  by  binding  to  the  3'  untranslated  regions  (3'  UTRs)  of  target  genes  [5-8].  Although  they  were  long  regarded  as  genetic  curiosities  possibly  specific  to  nematodes,  let-7  was  subsequently  found  to  be  broadly  conserved  across  bilaterian  evolution  [9]  and  miRNA  genes  are  now  recognized  as  a  pervasive  and  widespread  feature  of  animal  and  plant  genomes  [10-16].  In  general,  it  is  thought  that  miRNA  biogenesis  proceeds  via  intermediate  precursor  transcripts  of  more  than  70  nucleotides  that  have  the  capacity  to  form  an  extended  stem-loop  structure  (pre-miRNA),  although  at  least  some  pre-miRNAs  are  further  derived  from  even  longer  transcripts  (primary  miRNA  transcripts,  or  pri-miRNAs).  These  can  exist  as  long  individual  pre-miRNA  precursor  transcripts,  as  operon-like  multiple  pre-miRNA  precursors,  or  even  as  part  of  primary  mRNA  transcripts.  Processing  of  pri-miRNA  into  the  pre-miRNA  stem-loop  occurs  in  the  nucleus,  while  subsequent  processing  of  pre-miRNA  into  21-22  mers  is  a  cytoplasmic  event  mediated  by  the  RNAse  III  enzyme  Dicer  [17-20];  Dicer  is  also  responsible  for  cleavage  of  long  perfectly  double-stranded  RNA  into  21-22  nucleotide  fragments  during  RNA  interference  (RNAi)  [2,21].  These  latter  molecules,  known  as  silencing  RNA  (siRNA),  bind  to  and  trigger  the  degradation  of  perfectly  homologous  mRNA  molecules  via  RISC,  a  double-strand  RNA-induced  silencing  complex  containing  nuclease  activity  [22,23].  
Although  the  in  vivo  function  of  only  a  few  miRNAs  is  known  so  far,  it  is  believed  that  the  vast  majority  are  likely  to  participate  in  post-transcriptional  gene  regulation  of  complementary  mRNA  targets.  Interestingly,  perfect  or  near-perfect  target  complementarity  is  associated  with  mRNA  degradation  [24-26],  similar  to  the  effects  of  siRNA,  whereas  imperfect  base-pairing  is  associated  with  regulation  by  translational  inhibition  [6,27].  Recently,  siRNAs  with  imperfect  match  to  target  mRNA  were  observed  to  function  as  translational  inhibitors  [28],  suggesting  that  the  type  of  21-22  nucleotide  RNA-mediated  regulation  may  be  largely  determined  by  the  quality  of  target  complementarity.  The  vast  majority  of  the  approximately  300  miRNAs  currently  known  were  identified  through  direct  cloning  of  short  RNA  molecules.  Although  this  method  has  been  quite  successful  thus  far,  its  practicality  is  limited  by  the  necessity  for  a  considerable  amount  of  RNA  as  raw  material  for  cloning,  and  cloned  products  are  often  dominated  by  a  few  highly  expressed  miRNAs.  For  example,  41%  of  miRNAs  cloned  from  HeLa  cells  are  variants  of  let-7,  28%  of  human  brain  miRNAs  are  variants  of  miR-124,  and  45%  of  miRNAs  cloned  from  human  heart  and  32%  of  those  cloned  from  early
0	Drosophila  embryos  are  miR-1  [10,29].  In  fact,  it  has  been  opined  that  few  additional  mammalian  miRNAs  will  be  easily  identified  by  the  direct  cloning  method  [30].  As  a  complementary  approach  to  miRNA  identification,  we  developed  an  informatic  strategy  ('miRseeker')  and  applied  it  to  the  completed  genomes  of  Drosophila  melanogaster  and  D.  pseudoobscura,  which  are  some  30  million  years  diverged.  miRseeker  subjects  conserved  intronic  and  intergenic  sequences  to  an  RNA  folding  and  evaluation  procedure  to  identify  evolutionarily  constrained  hairpin  structures  with  features  characteristic  of  known  miRNAs.  The  sensitivity  of  this  computational  procedure  was  shown  by  the  presence  of  18  out  of  24  reference  miRNAs  within  the  top  124  candidates.  We  identified  a  total  of  48  novel  miRNA  candidates  whose  existence  was  strongly  supported  by  conservation  in  other  insect,  nematode  or  vertebrate  genomes.  Expression  of  24  novel  miRNA  genes  was  verified  by  northern  analysis  (including  20  out  of  27  candidates  that  were  supported  by  third-species  conservation  and  4  out  of  11  high-scoring  predictions  specific  to  Drosophila),  demonstrating  that  the  bioinformatic  screen  was  successful.  As  might  be  expected,  the  newly  verified  miRNA  genes  vary  tremendously  with  respect  to  abundance  and  developmental  expression  profile,  suggesting  diverse  roles  for  these  genes.  Inference  of  our  false-positive  prediction  and  false-negative  verification  rates  (based  on  our  ability  to  identify  known  miRNAs  and  detect  the  expression  of  highly  conserved,  and  thus  presumed  genuine,  novel  miRNAs)  leads  us  to  estimate  that  drosophilid  genomes  contain  around  110  miRNA  genes,  or  nearly  1%  of  the  number  of  predicted  protein-coding  genes.  
In  combination  with  other  concurrent  genomic  analyses  [31-34],  it  is  likely  that  most  miRNAs  in  completed  animal  genomes  have  now  been  identified.  Collectively,  this  sets  the  stage  for  both  genome-wide  and  targeted  studies  of  this  functionally  elusive  family  of  regulators.
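The proportions quoted in this paragraph can be checked with a few lines of arithmetic. The counts below are those given in the text. The figure of about 13,600 predicted protein-coding genes in Drosophila is an outside assumption used only to illustrate the "nearly 1%" statement, and all variable names are hypothetical.

```python
# Counts stated in the text
known_found, known_total = 18, 24                 # reference miRNAs in the top 124 candidates
conserved_verified, conserved_tested = 20, 27     # candidates conserved in a third species
specific_verified, specific_tested = 4, 11        # high-scoring Drosophila-specific candidates
estimated_mirna_genes = 110                       # authors' estimate for drosophilid genomes

# Assumption (not from the text): approximate number of predicted
# protein-coding genes in the D. melanogaster genome.
assumed_protein_coding_genes = 13600

recovery_rate = known_found / known_total
conserved_rate = conserved_verified / conserved_tested
specific_rate = specific_verified / specific_tested
verified_total = conserved_verified + specific_verified
fraction_of_genes = estimated_mirna_genes / assumed_protein_coding_genes

print(f"Known miRNAs recovered:         {recovery_rate:.1%}")
print(f"Conserved candidates verified:  {conserved_rate:.1%}")
print(f"Specific candidates verified:   {specific_rate:.1%}")
print(f"Novel miRNA genes verified:     {verified_total}")
print(f"miRNA genes / protein-coding:   {fraction_of_genes:.2%}")
```

The recovery rate matches the 75% stated in the abstract, and the two verified groups add up to the 24 novel genes reported. How the authors combine the false-positive and false-negative rates into the estimate of about 110 genes is not spelled out in this passage, so it is not reproduced here.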
0	Evolutionarily  conserved  characteristics  of  miRNA  genes
0	[Figure  labels:  Unstructured  sequence;  Conserved  stem-loop]
0	Evaluation  of  cadmium-induced  transcriptome  alterations  by  three  color  cDNA  labeling  microarray  analysis  on  a  T-cell  line
1	George  Th.  Tsangaris  *,  Athanassios  Botsonis,  Ioannis  Politis,  Fotini  Tzortzatou-Stathopoulou
0	Keywords:  Cadmium;  Heavy  metals;  cDNA  microarray;  Gene  regulation;  Toxicogenomics;  Apoptosis
0	Introduction  The  massive  and  rapid  increase  in  human  genome-scale  DNA  sequencing  and  the  concomitant  development  of  methods  and  technologies  for  the  exploitation  of  this  information,  have  recently  indicated  that  reliable  predictions  should  not  be  based  on  any  single  gene,  but  on  multi-gene  ex-
0	has  been  shown  that  Cd  compounds  induced  tumors  in  lungs,  testes,  prostate  as  well  as  hematopoietic  system  malignancies  (Degraeve,  1981;  IARC,  1993;  Waalkes  and  Rehm,  1994),  while  in  cultured  mammalian  cells  they  induced  morphological  transformations,  chromosomal  aberrations  and  gene  mutations  (DiPaolo  and  Castro,  1979;  Ochi  and  Ohsawa,  1983;  Ochi  et  al.,  1984;  Yang  et  al.,  1996;  Hwua  and  Yang,  1998).  Previous  work  on  a  human  T-cell  line  (CEM-C12)  has  shown  that  Cd  exerts  its  toxic  effect  via  apoptosis  (el  Azzouzi  et  al.,  1994),  while  a  comparative  study  of  the  apoptotic  effect  of  Cd  in  cell  lines  of  the  immune  system  has  shown  a  differential  Cd-induced  apoptosis,  which  may  disturb  the  immune  system's  normal  growth  and  development  (Tsangaris  and  Tzortzatou-Stathopoulou,  1998a).  On  the  cellular  level,  Cd  is  highly  reactive  with  sulfhydryl  groups  of  proteins  and  can  substitute  for  zinc  in  certain  enzymes  (Vallee  and  Ulmer,  1972;  Figueiredo-Pereira  et  al.,  1998)  and,  acting  through  an  orphan  zinc  receptor,  can  provoke  the  production  of  inositol  triphosphate  and  subsequent  release  of  calcium  from  internal  stores,  thereafter  stimulating  protein  kinase  C  (Block  et  al.,  1992;  Smith  et  al.,  1994).  Cd  has  also  been  reported  to  activate  p38  and  extracellular  regulated  kinase  (ERK)  in  rat  brain  tumor  cells  (Hung  et  al.,  1998)  and  c-Jun  N-terminal  kinase  (JNK)  in  porcine  renal  epithelial  cells  (Matsuoka  and  Igisu,  1998).  On  the  molecular  level,  Cd  has  been  shown  to  induce  mRNA  levels  of  several  genes  such  as  c-jun,  c-myc  (Jin  and  Ringertz,  1990),  c-fos  (Wang  and  Templeton,  1998),  metallothionein  (MT)  (Karin  et  al.,  1987)  and  heme  oxygenase  1  (HMOX1)  (Alam  et  al.,  1989;  Takeda  et  al.,  1994).  
We  and  others  have  shown  that  in  nucleated  blood  cells,  and  particularly  lymphocytes,  Cd  time-  and  dose-dependently  activates  transcription  of  both  metallothionein-IIA  (MT-IIA)  and  heat  shock  protein  70  (HSP70)  genes  (Pellegrini  et  al.,  1994a,b).  Thus,  after  exposure  to  low  Cd  concentrations,  MT-IIA  is  induced,  in  contrast  to  higher  concentrations  in  which  HSP70  is  induced.  In  the  present  study,  we  investigated  by  cDNA  microarrays  the  cadmium-induced  transcriptome  alterations  on  the  immature  T-cell  line  CCRF-CEM,  analyzing  1455  genes,  after  incubation  of
0	the  cells  for  6  and  24  h  with  two  different  Cd  concentrations  (10  and  20  µM),  applying  for  the  first  time  three-fluorescent-dye  cDNA  labeling,  followed  by  simultaneous  three-laser  analysis,  on  the  same  microarray  slide.
0	ml  per  well  of  acid  isopropanol  (0.04  N  HCl)  and  the  plates  were  read  on  an  Elisa  reader  (Stat-Fax  2100,  Awareness  Technology,  Palm  City,  FL).  The  data  were  expressed  as  the  percentage  of  the  number  of  viable  cells  in  cadmium-treated  cells  as  compared  to  untreated  cells  (control).
0	Materials  and  methods
0	Quantification  of  apoptotic  cells
0	The  detection  and  quantification  of  apoptosis  was  performed  as  previously  described  (Tsangaris  and  Tzortzatou-Stathopoulou,  1996).  Briefly,  after  the  exposure  of  the  cells  (2  ×  10⁶  cells/ml)  for  6  or  24  h  to  various  Cd²⁺  concentrations,  8  µl  of  the  cell  suspension  were  mixed  with  2  µl  of  a  fluorescent  EtBr-containing  dye  (0.1  mg/ml  EtBr,  1.5%  NP40,  in  PBS).  This  suspension  was  placed  on  a  microscope  slide  and  covered  with  a  coverslip.  Fluorescent-stained  cells  were  examined  with  an  Epi-Fluorescence  Microscope  (Optiphot-2,  Nikon,  Japan).  The  cells  were  scored  and  categorized  as  normal,  apoptotic  or  necrotic  and  the  results  were  expressed  as  the  percentage  of  each  cell  type  among  the  total  counted  cells.  For  each  Cd²⁺  concentration  at  each  time  point,  more  than  five  slides  were  prepared  and  more  than  500  cells/slide  were  examined.
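The scoring step above reduces to a simple percentage calculation per slide, followed by averaging over slides. The sketch below illustrates that calculation only. The cell counts are invented placeholders and not data from the study, and the helper name `category_percentages` is hypothetical.

```python
def category_percentages(counts):
    """Convert raw cell counts per category into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no cells were counted on this slide")
    return {category: 100.0 * n / total for category, n in counts.items()}


# Invented counts for six slides of one condition (each with more than 500 cells)
slides = [
    {"normal": 410, "apoptotic": 95, "necrotic": 15},
    {"normal": 398, "apoptotic": 102, "necrotic": 20},
    {"normal": 421, "apoptotic": 88, "necrotic": 11},
    {"normal": 405, "apoptotic": 99, "necrotic": 16},
    {"normal": 415, "apoptotic": 90, "necrotic": 13},
    {"normal": 402, "apoptotic": 104, "necrotic": 18},
]

per_slide = [category_percentages(slide) for slide in slides]

# Mean percentage of each category across the slides
mean_pct = {
    category: sum(p[category] for p in per_slide) / len(per_slide)
    for category in per_slide[0]
}

for category, value in mean_pct.items():
    print(f"{category:>9}: {value:.1f}%")
```

Averaging the per-slide percentages treats each slide equally; pooling all counts before dividing would instead weight slides by the number of cells scored. The text does not say which convention was used.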
0	Media  and  reagents
0	The  medium  for  cell  cultures  was  RPMI  1640,  supplemented  with  10%  heat-inactivated  fetal  bovine  serum  (FBS,  Invitrogen/Life  Technologies  International,  Paisley,  England),  100  U/ml  penicillin,  100  µg/ml  streptomycin,  2  mM  L-glutamine  and  20  mM  HEPES  buffer  (serum  medium)  (all  obtained  from  Biochrom,  Berlin,  Germany).  Cadmium  chloride  (Cd²⁺)  (Sigma  Chem.  Co.,  St.  Louis,  MO)  was  dissolved  in  water  at  10  mM,  stored  at  4  °C  (stock  solution)  and  diluted  to  appropriate  concentrations  immediately  before  use  in  culture  medium  without  FBS.
0	Cell  cultures
0	The  CCRF-CEM  human  immature  T-cell  line  was  obtained  from  the  European  Collection  of  Cell  Cultures  (ECACC,  Salisbury,  UK).  Cells  (3  ×  10⁵  cells/ml)  were  cultured  in  serum  medium  at  37  °C  in  a  humidified  atmosphere  containing  5%  CO₂  in  air,  and  the  medium  was  changed  every  3  days.  For  each  experiment,  cells  (1  ×  10⁶  cells/ml)  were  harvested  at  the  exponential  growth  phase  and  resuspended  in  10%  serum  medium  in  the  presence  of  Cd²⁺  for  6  or  24  h  in  Falcon  75  cm²  flasks  (Becton  Dickinson,  Oxnard,  CA).
0	RNA  isolation  and  cDNA  production
0	After  the  incubation  of  the  cells  for  6  or  24  h,  with  or  without  Cd²⁺,  10  ×  10⁶  cells  were  centrifuged  (270  ×  g,  10  min,  4  °C)  and  the  pellets  were  washed  twice  with  ice-cold  normal  saline.  The  cell  pellets 
0	Research  article
0	Identification  of  Pax2-regulated  genes  by  expression  profiling  of  the  mid-hindbrain  organizer  region
1	Maxime  Bouchard1,2,*,  David  Grote1,2,*,  Sarah  E.  Craven3,  Qiong  Sun1,  Peter  Steinlein1  and  Meinrad  Busslinger1
0	The  paired  domain  transcription  factor  Pax2  is  required  for  the  formation  of  the  isthmic  organizer  (IsO)  at  the  midbrain-hindbrain  boundary,  where  it  initiates  expression  of  the  IsO  signal  Fgf8.  To  gain  further  insight  into  the  role  of  Pax2  in  mid-hindbrain  patterning,  we  searched  for  novel  Pax2-regulated  genes  by  cDNA  microarray  analysis  of  FACS-sorted  GFP+  mid-hindbrain  cells  from  wild-type  and  Pax2-/-  embryos  carrying  a  Pax2GFP  BAC  transgene.  Here,  we  report  the  identification  of  five  genes  that  depend  on  Pax2  function  for  their  expression  in  the  mid-hindbrain  boundary  region.  These  genes  code  for  the  transcription  factors  En2  and  Brn1  (Pou3f3),  the  intracellular  signaling  modifiers  Sef  and  Tapp1,  and  the  non-coding  RNA  Ncrms.  The  Brn1  gene  was  further  identified  as  a  direct  target  of  Pax2,  as  two  functional  Pax2-binding  sites  in  the  promoter  and  in  an  upstream  regulatory  element  of  Brn1  were  essential  for  lacZ  transgene  expression  at  the  mid-hindbrain  boundary.  Moreover,  ectopic  expression  of  a  dominant-negative  Brn1  protein  in  chick  embryos  implicated  Brn1  in  Fgf8  gene  regulation.  Together,  these  data  defined  novel  functions  of  Pax2  in  the  establishment  of  distinct  transcriptional  programs  and  in  the  control  of  intracellular  signaling  during  mid-hindbrain  development.
0	Key  words:  Mid-hindbrain  development,  Pax2-regulated  genes,  Sef,  Tapp1,  Ncrms,  En2,  Brn1,  Fgf8  regulation,  Mouse
0	The  midbrain  and  cerebellum  develop  from  an  organizing  center  that  is  formed  at  the  junction  between  the  embryonic  midbrain  and  hindbrain,  known  as  the  isthmus.  This  isthmic  organizer  (IsO)  was  discovered  because  of  its  property  of  inducing  an  ectopic  midbrain  or  cerebellum,  when  transplanted  into  the  chick  diencephalon  or  hindbrain,  respectively  (reviewed  by  Liu  and  Joyner,  2001a;  Wurst  and  Bally-Cuif,  2001).  The  IsO  activity  recruits  the  surrounding  tissue  into  either  a  midbrain  or  cerebellum  fate  by  controlling  cell  survival,  proliferation  and  differentiation  along  the  anteroposterior  axis  of  the  mid-hindbrain  region.  The  formation  of  the  IsO  is  the  result  of  complex  cross-regulatory  interactions  between  transcription  factors  (Otx,  Gbx,  Pax  and  En)  and  secreted  proteins  (Wnts  and  Fgfs),  culminating  in  the  expression  of  the  signaling  molecule  Fgf8  at  the  mid-hindbrain  boundary  (Liu  and  Joyner,  2001a;  Wurst  and  Bally-Cuif,  2001;  Ye  et  al.,  2001).  Fgf8  is  the  central  mediator  of  IsO  activity,  as  it  is  both  necessary  and  sufficient  for  inducing  midbrain  and  cerebellum  development  (Crossley  et  al.,  1996;  Chi  et  al.,  2003).  Once  formed,  the  IsO  is  maintained  by  a  positive  feedback  loop  involving  multiple  mid-hindbrain-specific  regulators.  Consequently,  the  IsO  is  lost  upon  individual
0	mutation  of  these  regulators,  whereas  ectopic  expression  of  a  single  factor  activates  most  other  components  of  the  regulatory  cascade  (Nakamura,  2001).  Owing  to  this  interdependence,  the  hierarchical  relationship  among  the  different  regulators  remains  largely  elusive  during  the  maintenance  phase  of  IsO  activity  (Liu  and  Joyner,  2001a;  Wurst  and  Bally-Cuif,  2001).  The  initiation  of  IsO  development  crucially  depends  on  the  transcription  factor  Pax2  (Favor  et  al.,  1996;  Brand  et  al.,  1996),  which  shares  similar  DNA-binding  and  transactivation  functions  with  Pax5  and  Pax8  of  the  same  paired  domain  protein  subfamily  (Kozmik  et  al.,  1993;  Doerfler  and  Busslinger,  1996).  Pax2  is  the  earliest  known  gene  to  be  expressed  throughout  the  prospective  mid-hindbrain  region  in  late  gastrula  embryos  (Rowitch  and  McMahon,  1995).  The  initially  broad  expression  pattern  of  Pax2  is  progressively  refined  to  a  narrow  ring  centered  at  the  mid-hindbrain  boundary  by  embryonic  day  9.5,  while  the  related  Pax5  and  Pax8  genes  are  activated  in  the  same  region  at  3-4  and  6-7  somites,  respectively  (Urbanek  et  al.,  1994;  Rowitch  and  McMahon,  1995;  Pfeffer  et  al.,  1998).  Consistent  with  this  sequential  gene  induction,  mutation  of  the  Pax2  gene  leads  to  the  loss  of  the  midbrain  and  cerebellum  in  mouse  and  zebrafish  embryos  (Favor  et  al.,  1996;  Brand  et  al.,  1996;  Bouchard  et  al.,  2000),  whereas  the  inactivation  of  Pax5  or  Pax8  results  in  a  mild
0	Development 132 (11)
0	cerebellar midline defect or no brain phenotype at all (Urbanek et al., 1994; Mansouri et al., 1998). The severe mid-hindbrain deletion is, however, only observed in Pax2-/- mouse embryos on the C3H/He genetic background (Bouchard et al., 2000), where the compensating Pax5 and Pax8 genes fail to be activated at the mid-hindbrain boundary (Pfeffer et al., 2000; Ye et al., 2001), similar to the Pax2.1 (noi) mutant embryos of the zebrafish (Pfeffer et al., 1998). In the absence of Pax2, Otx2, Gbx2 and Wnt1 are normally transcribed at early somite stages, while the expression of En1 is reduced in the developing mid-hindbrain region (Ye et al., 2001). Importantly, Fgf8 expression is never initiated at the mid-hindbrain boundary of Pax2-/- C3H/He embryos (Ye et al., 2001), resulting in the complete absence of IsO activity and subsequent apoptotic loss of the mid-hindbrain tissue starting at the 12-somite stage (Pfeffer et al., 2000; Chi et al., 2003). To further investigate the role of Pax2 at the onset of mid-hindbrain development, we searched for novel Pax2-regulated genes by gene expression profiling of mid-hindbrain cells isolated by FACS sorting from wild-type and Pax2-/- E8.5 embryos. This unbiased approach identified the En2, Brn1 (Pou3f3 - Mouse Genome Informatics), Sef (Il17rd - Mouse Genome Informatics), Tapp1 (Plekha1 - Mouse Genome Informatics) and non-coding Ncrms genes as genetic Pax2 targets that are totally dependent on Pax2 function for their expression in the mid-hindbrain region.
The transcription factors En2 and Brn1, as well as the signaling modifiers Sef and Tapp1, implicate Pax2 in the establishment of distinct transcriptional programs and the control of intracellular signaling during mid-hindbrain development. Biochemical and transgenic analyses demonstrated that Pax2 directly activates the mid-hindbrain-specific expression of Brn1 by interacting with two functional Pax2/5/8-binding sites in the promoter and an upstream regulatory element of the Brn1 gene. Moreover, ectopic expression of a dominant-negative Brn1 protein in chick embryos implicated Brn1 as a novel regulator of Fgf8 expression. The identification of new Pax2-regulated genes has thus provided important insight into the role of Pax2 in mid-hindbrain development.
0	Research  article
0	Review  articles
0	Genetic  modules  and  networks  for  behavior:  lessons  from  Drosophila
1	Robert  R.H.  Anholt
0	The  aim  of  this  review  is  not  to  provide  an  exhaustive  review  of  the  literature,  as  this  would  be  a  near  impossible  task,  but  rather  to  highlight  fundamental  principles  using  Drosophila  as  a  model  organism  with  examples  from  recent  studies.  It  should  be  noted  that,  while  the  focus  of  this  article  is  on  the  genetic  architecture  of  behavior,  similar  principles  apply  to  other  complex  traits  as  well.  Behaviors  as  complex  traits  Behaviors  show  all  the  hallmarks  of  quantitative  traits.  They  arise  from  the  coordinated  actions  of  multiple  genes  and  their  phenotypes  are  significantly  affected  by  genome-environment  interactions.(1,2)  Consequently,  neurogenetic  studies  of  behaviors  face  the  typical  challenges  characteristic  of  quantitative  traits,  often  hard  to  control  environmental  variation  and  a  vast  number  of  independently  segregating  genes  with  both  additive  and  epistatic  interactions  that  render  it  difficult  to  predict  phenotypic  values  from  one  generation  to  the  next.  To  dissect  the  genetic  architecture  of  such  traits,  it  is  desirable  to  minimize  environmental  variation  and  essential  to  precisely  control  the  genetic  background.  This  is  difficult  to  achieve  in  human  populations  and,  although  inbred  strains  of  mice  have  been  used  successfully  in  gene  mapping  studies,  such  studies  are  laborious  and  often  are  limited  by  their  ability  to  define  only  large  chromosomal  regions  that  harbor  possible  candidate  genes  (quantitative  trait  loci,  QTL).(3)  Furthermore,  different  QTL  are  often  identified  in  different  environmental,  physiological  or  developmental  conditions,  which  further  complicates  efforts  to  understand  the  genetic  architecture  of  the  behavior  under  study.  
Whereas  considerable  advances  have  been  made  in  the  study  of  neurogenetics  of  behavior  using  mouse  model  systems,  obtaining  a  comprehensive  description  of  the  genetic  architecture  of  even  a  single  behavioral  trait  appears  to  be  a  gargantuan  task  for  every  behavioral  trait  examined  to  date.  Most  behavioral  genes  in  mice  have  been  identified  as  a  consequence  of  spontaneous  mutations  or  as  a  result  of  homologous  recombination  studies,  which,  however,  do  not  always  yield  unambiguously  interpretable  phenotypes.(4)  Furthermore,  genetic  background  variation  and/or  restricted  sample  sizes  often  limit  resolution  of  such  studies  to  identifying  only  genes  with  large  effects.  Nonetheless,  knockout  mice  have  confirmed  yet  again,  one  gene  at  a  time,  the  polygenic
0	Introduction Behaviors are the quintessential unifying feature of all animal life forms and are essential for survival and procreation. Behaviors are the ultimate expression of the nervous system and depend on the coordinated expression of ensembles of genes. This article seeks to describe how our views of the genetic architecture of behavior have evolved from attempts to connect individual mutations as isolated pieces of a complex puzzle into the current realization of dynamic multidimensional networks of interacting pleiotropic genes. An appreciation of the genetic architecture of any complex trait demands attention to genetic background and sex effects, and incorporates interactions between the genome and both the physical and, in the case of behavioral phenotypes, social environment.(1,2)
0	BioEssays  26.12
0	Drosophila-related information, is publicly available (http://flybase.bio.indiana.edu/). Genetic networks A diverse spectrum of behaviors has been studied in Drosophila, including courtship and mating behavior,(18,19) circadian behavior(20-22) and sleep,(23) general locomotor activity,(24) geotaxis,(25) grooming behavior,(26) chemosensory responsiveness,(27-29) foraging behavior,(30,31) aggression,(32) and memory and learning.(33,34) Mutations affecting critical genes have been identified for many of these traits. Traditionally, mutant screens identify genes that affect the trait one at a time and subsequently attempt to place these loci into pathways that subserve the behavior under study. Recent applications of functional genomic approaches to behavior have transformed the traditional view of simple linear genetic pathways, in which a single mutation has a restricted effect on a specialized function, into a more complex concept of plastic genetic networks.(35) This was illustrated by transcriptional profiling studies of circadian genes, which identified a large and diverse group of oscillating genes that are co-regulated under the control of the Clock gene.(36,37) Using high-density oligonucleotide microarrays, McDonald and Rosbash identified in wild-type flies 134 cycling genes, which included not only known members of the circadian clock, but also a large number of genes not previously known to cycle, encoding detoxification enzymes, ligand carrier proteins, neuropeptide modulators, proteins involved in cuticle formation, genes involved in immune defense, a diverse array of miscellaneous enzymes as well as predicted proteins of unknown function.
A larger group of 267 genes with altered transcriptional regulation was identified when Clk mutants were analyzed. Such Clk-regulated genes included unexpected co-regulated genes with 17 genes encoding antimicrobial peptides and 9 encoding pheromone or odorant-binding proteins, indicating that the Clk mutation has widespread direct and indirect effects throughout the transcriptome.(36) Similar results were obtained simultaneously and independently by Clar
0	BMC  Genomics
0	Research  article
0	BioMed  Central
0	Open  Access
0	Performance  evaluation  of  commercial  short-oligonucleotide  microarrays  and  the  impact  of  noise  in  making  cross-platform  correlations
1	Richard  Shippy1,  Timothy  J  Sendera*1,  Randall  Lockner1,  Chockalingam  Palaniappan1,  Tamma  Kaysser-Kranich1,  George  Watts2  and  John  Alsobrook3
0	There  are  several  commercial  microarray  systems  currently  available  on  the  market  for  genome-scale  gene  expression  analysis.  Different  microarray  manufacturers  provide  distinct  underlying  technologies,  protocols  and  reagents  specific  to  each  system  [1].  Despite  the  widespread  use  of  microarrays,  much  ambiguity  regarding  data  analysis,  interpretation  and  correlation  of  the  different  technologies  exists.  There  is  a  need  for  standardization  that  will  facilitate  comparison  of  microarray  data  from  different  platforms  [2].  Comparison  and  cross-validation  between  microarray  platforms  would  greatly  increase  the  understanding  and  value  of  the  wealth  of  data  generated  from  each  microarray  experiment  [3].  A  number  of  cross  platform  comparisons  have  reported  a  failure  to  demonstrate  an  acceptable  level  of  correlation  between  different  microarray  technologies  [4-7].  Some  of  the  difficulties  in  correlating  data  can  be  attributed  to  fundamental  differences  between  cDNA  and  oligonucleotide  based  microarray  technologies.  For  example,  target  preparation  differences  and  single  vs.  dual  labeling  techniques  complicate  the  comparisons.  Furthermore,  cDNA  arrays  have  difficulty  in  distinguishing  between  splice  variants  and  highly  homologous  genes,  while  oligonucleotide  arrays  can  make  these  distinctions  if  designed  appropriately.  However,  when  considering  oligonucleotide  platforms,  which  have  widespread  popularity,  direct  comparisons  between  different  platforms  should  be  less  complex  and  more  direct.  We  assert  that  differences  in  platform  sensitivity,  reproducibility  and  annotation  cross-referencing  accuracy  account  for  a  majority  of  the  irreconcilable  differences  previously  reported  between  different  platforms  [4-7].  
When  considering  these  factors  we  demonstrate  a  strong  correlation  between  expression  ratio  data  from  two  different  commercially  available  short  oligonucleotide  based  microarray  technologies.  This  paper  provides  a  comprehensive  guideline  for  microarray  analysis,  interpretation  and  cross-platform  correlation.  There  are  two  commercially  available  high-density  microarray  platforms  that  use  short  oligonucleotides  for  expression  profiling.  CodeLink  (GE  Healthcare  formerly  Amersham  Biosciences,  Chandler,  AZ)  and  GeneChip  (Affymetrix,  Santa  Clara,  CA)  microarray  platforms  utilize  oligonucleotide  gene  target  probes  of  30  and  25  bases,  respectively.  Some  of  the  notable  differences  between  the  GeneChip  and  CodeLink  systems  are,  respectively,  multiple  probes  vs.  one  pre-validated  probe  per  gene  target,  two-dimensional  surface  vs.  three-dimensional  array  matrix,  and  in  situ  synthesized  oligonucleotides  vs.  presynthesized,  non-contact  oligonucleotide  deposition.  We  restricted  our  comparative  analysis  to  these  two  platforms  because  these  systems  are  most  similar  with  respect  to  oligonucleotide  length,  target  preparation,  and  single  color  indirect  labeling  methodology.  Since  these  commercial
0	assays  are  similar,  and  systematic  variables  were  isolated  by  using  the  same  total  RNA  starting  material  for  all  target  preparations,  we  expected  disparity  in  performance  to  reflect  differences  inherent  to  the  microarray  platforms.  To  provide  data  for  comparison  of  the  platforms,  five  technical  replicates  of  brain  and  pancreas  were  processed  on  each  platform  and  the  results  were  compared  for  reproducibility,  sensitivity,  and  similarity  of  results.  Standard  manufacturer-recommended  protocols  and  settings  were  employed  to  obtain  the  raw  data  from  each  platform.  In  the  case  of  Affymetrix  GeneChip,  a  recent  cross-platform  microarray  evaluation  [7]  used  two  additional  algorithms  [8,9]  for  analysis  of  the  GeneChip  data  and  found  the  same  level  of  discordance  across  the  three  analysis  algorithms  as  was  observed  in  the  cross-platform  microarray  comparisons  [7].  We  therefore  restricted  our  analysis  of  the  GeneChip  data  to  the  Affymetrix  recommended  MAS  5.0  software  [10].  This  methodology  was  followed  to  simulate  the  results  users  would  achieve  by  following  current  protocols  supplied  with  each  microarray  system.
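The comparison of expression ratios between the two platforms can be sketched as follows. This is a minimal illustration, not the authors' analysis pipeline: the function names, the input layout (one dictionary per platform mapping a gene identifier to its brain and pancreas replicate signals) and the use of a Pearson coefficient on log2 ratios are all assumptions made for the example.

```python
import math
from statistics import mean


def log2_ratio(tissue_a, tissue_b):
    """log2 of the mean replicate signal in tissue A over that in tissue B."""
    return math.log2(mean(tissue_a) / mean(tissue_b))


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cross_platform_correlation(platform_1, platform_2):
    """Correlate brain/pancreas log2 ratios over genes present on both platforms.

    Each argument is a hypothetical dict: gene -> (brain_signals, pancreas_signals).
    """
    shared = sorted(set(platform_1) & set(platform_2))
    ratios_1 = [log2_ratio(*platform_1[g]) for g in shared]
    ratios_2 = [log2_ratio(*platform_2[g]) for g in shared]
    return pearson(ratios_1, ratios_2)
```

Working with ratios rather than raw signals removes any platform-specific scale factor, which is one reason ratio data are a natural basis for cross-platform comparison.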
0	Two  different  tissue  types  were  compared  in  this  study  to  ensure  a  large  number  of  differentially  expressed  genes,  and  provide  expression  ratios  across  a  wide  dynamic  range  for  derivation  of  the  correlation  coefficient  between  the  two  platforms.  The  array-to-array  precision  of  each  microarray  platform  was  calculated  from  the  five  replicates  within  each  tissue  sample.  The  pair-wise  array-to-array  precision  of  each  microarray  platform  is  illustrated  in  Figure  1  with  respective  noise  levels  for  both  CodeLink  and  GeneChip.  In  these  graphs  all  10,763  uniquely  represented  genes,  common  between  both  microarray  platforms,  are  illustrated.  The  GeneChip  comparisons  display  a  wider  distribution  relative  to  CodeLink  at  the  lower  end  of  the  fluorescence  detection  range.  While  this  wider  distribution  could  be  interpreted  as  indicating  a  lower  level  of  precision  relative  to  CodeLink,  precision  should  only  be  assessed  for  the  population  of  genes  with  expression  values  above  the  noise  calculation  (i.e.  'present'  on  the  arrays  being  considered).  Due  to  the  variation  in  noise  and  specificity  level  between  expression  detection  systems,  each  system  must  individually  define  its  own  threshold  level  cutoff  for  resultant  confidence  in  signals  above  technical  noise.  In  addition,  in  a  multi-oligonucleotide  detection  system,  a  defined  algorithm  must  be  set  to  determine  the  method  for  combining  individual  probe  data  to  yield  a  final  gene  expression  level.  Therefore,  we  used  each  manufacturer's  indications  for  gene  signals  that  should  be  considered  confidently  above  system  noise.  The  wider  distribution  observed  in  the  GeneChip  platform  is  within  the  noise  population  and  therefore  should  not  penalize  the  overall  precision  measurements.  
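The idea of assessing precision only for genes above system noise can be sketched as below. The sketch assumes a single numeric noise threshold per platform and summarizes precision as the median coefficient of variation (CV) across replicates; both choices, and all names, are illustrative assumptions rather than the manufacturers' detection calls or the statistic used in the study.

```python
from statistics import mean, median, stdev


def replicate_cv(signals):
    """Coefficient of variation (sample SD / mean) of one gene's replicates."""
    return stdev(signals) / mean(signals)


def precision_above_noise(replicates_by_gene, noise_threshold):
    """Median replicate CV over genes whose replicates all exceed the threshold.

    replicates_by_gene: hypothetical dict mapping gene -> list of replicate signals.
    noise_threshold: hypothetical platform-specific cutoff for a 'present' signal.
    """
    cvs = [
        replicate_cv(signals)
        for signals in replicates_by_gene.values()
        if min(signals) > noise_threshold
    ]
    if not cvs:
        raise ValueError("no genes above the noise threshold")
    return median(cvs)
```

Because genes inside the noise population are excluded before the CV is computed, a wide scatter at the low end of the detection range does not affect the result.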
Qualitatively,  CodeLink  and  GeneChip  showed  similar
0	Genotyping  by  apyrase-mediated  allele-specific  extension
1	Afshin  Ahmadian,  Baback  Gharizadeh,  Deirdre  O'Meara,  Jacob  Odeberg  and  Joakim  Lundeberg*
0	Center  for  Physics,  Astronomy  and  Biotechnology,  Department  of  Biotechnology,  The  Royal  Institute  of  Technology  (KTH),  Roslagstullsbacken  21,  SE-106  91  Stockholm,  Sweden
0	ABSTRACT This report describes a single-step extension approach suitable for high-throughput single-nucleotide polymorphism typing applications. The method relies on extension of paired allele-specific primers and we demonstrate that the reaction kinetics were slower for mismatched configurations compared with matched configurations. In our approach we employ apyrase, a nucleotide-degrading enzyme, to allow accurate discrimination between matched and mismatched primer-template configurations. This apyrase-mediated allele-specific extension (AMASE) protocol allows incorporation of nucleotides when the reaction kinetics are fast (matched 3′-end primer) but degrades the nucleotides before extension when the reaction kinetics are slow (mismatched 3′-end primer). Thus, AMASE circumvents the major limitation of previous allele-specific extension assays in which slow reaction kinetics will still give rise to extension products from mismatched 3′-end primers, hindering proper discrimination. It thus represents a significant improvement of the allele-extension method. AMASE was evaluated by a bioluminometric assay in which successful incorporation of unmodified nucleotides is monitored in real time using an enzymatic cascade. INTRODUCTION Genome analysis techniques have increasingly been adapted to identify and score single-nucleotide polymorphisms (SNPs) to elucidate the genetics of individual differences in drug response and disease susceptibility. A number of different techniques have been proposed to scan sequence variations in a high-throughput fashion. Many of these methods are based on hybridization techniques, which discriminate between allelic variants.
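The kinetic competition described in the abstract, in which apyrase removes nucleotides before a slowly extending mismatched primer can use them, can be illustrated with a toy model. The model is an assumption made for this sketch and is not taken from the report: nucleotide concentration decays by first-order apyrase degradation, N(t) = N0·exp(-k_apyrase·t), primer extension proceeds at a rate proportional to N(t), and nucleotide consumption by the polymerase is neglected. Under those assumptions the fraction of primers eventually extended is 1 - exp(-k_ext·N0/k_apyrase). All rate constants below are hypothetical.

```python
import math


def fraction_extended(k_ext, k_apyrase, n0=1.0):
    """Long-time fraction of primers extended in the toy competition model.

    k_ext: hypothetical extension rate constant (per unit nucleotide, per time).
    k_apyrase: hypothetical first-order nucleotide degradation rate constant.
    n0: initial nucleotide concentration (arbitrary units).
    """
    return 1.0 - math.exp(-k_ext * n0 / k_apyrase)


# Hypothetical rate constants: matched primers extend 100-fold faster.
matched = fraction_extended(k_ext=10.0, k_apyrase=1.0)
mismatched = fraction_extended(k_ext=0.1, k_apyrase=1.0)
print(f"matched: {matched:.3f}, mismatched: {mismatched:.3f}")
```

In this sketch, lowering `k_apyrase` toward zero lets both configurations approach complete extension, which mirrors the limitation of apyrase-free allele-specific extension noted above.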
High-throughput hybridization of allele-specific oligonucleotides can be performed on microarray chips (1), microarray gels (2) or by using allele-specific probes (molecular beacons) in the PCR (3). Other technologies suitable for SNP genotyping are mini-sequencing (4), mass
0	Development  and  Validation  of  a  Diagnostic  DNA  Microarray  To  Detect  Quinolone-Resistant  Escherichia  coli  among  Clinical  Isolates
1	Xiaolei  Yu,1  Milorad  Susa,2  Cornelius  Knabbe,2  Rolf  D.  Schmid,1  and  Till  T.  Bachmann1*
0	J.  CLIN.  MICROBIOL.
0	detection  of  quinolone  resistance.  Although  there  are  several  platforms  available  for  array-based  single-nucleotide  polymorphism,  e.g.,  allele-specific  hybridization  (34),  single-base  primer  extension  (26),  allele-specific  amplification  (1),  or  allele-specific  oligonucleotide  ligation  (13),  we  chose  allele-specific  hybridization  because  its  robust  performance  should  be  suitable  for  routine  clinical  application.  In  contrast  to  the  above-mentioned  genotyping  methods,  the  use  of  allele-specific  hybridization  allowed  not  only  the  identification  of  the  mutated  amino  acid  but  also  the  exact  substitution,  which  could  have  different  contributions  to  resistance  and  can  be  used  as  a  marker  in  epidemiological  studies.
0	MATERIALS AND METHODS Strains. In total, 30 E. coli clinical isolates from four different hospitals in Germany (Backnang, Stuttgart, Schorndorf, and Winnenden) (referred to here as E. coli 1 to 30) were used for this study. These strains were isolated from urine (n = 20), swabs (n = 7), secretions (n = 2), and blood (n = 1) of patients. Susceptibility to quinolones was determined according to NCCLS guidelines by using either ciprofloxacin alone (n = 23) or both ciprofloxacin and levofloxacin (n = 7). The genomic DNA was isolated from a bacterial pure culture by using a QIAamp DNA minikit (Qiagen, Hilden, Germany) according to the manufacturer's protocol. DNA sequencing. For the DNA sequencing, a 418-bp fragment of E. coli, which included the QRDRs, was amplified by PCR with primers described previously (35). The 50-µl PCR mixture included approximately 80 ng of template (genomic DNA of E. coli), a 0.4 pM concentration of each primer, 0.25 mM deoxynucleoside triphosphates, 1.5 mM Mg2+, and 2.5 U of Taq polymerase (Eppendorf, Hamburg, Germany). The PCRs were performed in a thermocycler (Mastercycler gradient) (Eppendorf) with the following parameters: 94°C for 5 min; 30 cycles at 94°C for 1 min, 52°C for 1 min, and 72°C for 1 min; and a final elongation at 72°C for 10 min. The amplified fragment, which was purified with a QIAquick PCR purification kit (Qiagen) according to the manual provided by the manufacturer, was used for direct sequencing. The sequencing was done with the same primer pairs, a Big-Dye terminator cycle sequencing kit (Applied Biosystems, Darmstadt, Germany), and a Prism 377 DNA sequencer (Applied
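The thermocycler parameters quoted above can be written down as a small data structure, which makes the programmed hold time easy to check. This is only a sketch: the class and variable names are hypothetical, and the total excludes temperature ramping between steps, which depends on the instrument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    temp_c: int   # hold temperature in degrees Celsius
    minutes: int  # hold time in minutes


# Program as stated in the text.
INITIAL_DENATURATION = Step(94, 5)
CYCLE = (Step(94, 1), Step(52, 1), Step(72, 1))  # denature, anneal, extend
N_CYCLES = 30
FINAL_ELONGATION = Step(72, 10)


def total_hold_minutes():
    """Sum of programmed hold times, ignoring ramp time between steps."""
    per_cycle = sum(step.minutes for step in CYCLE)
    return (
        INITIAL_DENATURATION.minutes
        + N_CYCLES * per_cycle
        + FINAL_ELONGATION.minutes
    )


print(total_hold_minutes())  # 5 + 30 * 3 + 10 = 105
```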
0	QUARTERLY
0	DNA  microarrays,  a  novel  approach  in  studies  of  chromatin  structure.
1	Piotr Widlak
0	Department  of  Experimental  and  Clinical  Radiobiology,  Center  of  Oncology,  Gliwice,  Poland
0	Key  words:  DNA  microarray,  genomics,  epigenomics,  chromatin,  nucleosomes  The  DNA  microarray  technology  delivers  an  experimental  tool  that  allows  surveying  expression  of  genetic  information  on  a  genome-wide  scale  at  the  level  of  single  genes  --  for  the  new  field  termed  functional  genomics.  Gene  expression  profiling  --  the  primary  application  of  DNA  microarrays  technology  --  generates  monumental  amounts  of  information  concerning  the  functioning  of  genes,  cells  and  organisms.  However,  the  expression  of  genetic  information  is  regulated  by  a  number  of  factors  that  cannot  be  directly  targeted  by  standard  gene  expression  profiling.  The  genetic  material  of  eukaryotic  cells  is  packed  into  chromatin  which  provides  the  compaction  and  organization  of  DNA  for  replication,  repair  and  recombination  processes,  and  is  the  major  epigenetic  factor  determining  the  expression  of  genetic  information.  Genomic  DNA  can  be  methylated  and  this  modification  modulates  interactions  with  proteins  which  change  the  functional  status  of  genes.  Both  chromatin  structure  and  transcriptional  activity  are  affected  by  the  processes  of  replication,  recombination  and  repair.  Modified  DNA  microarray  technology  could  be  applied  to  genome-wide  study  of  epigenetic  factors  and  processes  that  modulate  the  expression  of  genetic  information.  Attempts  to  use  DNA  microarrays  in  studies  of  chromatin  packing  state,  chromatin/DNA-binding  protein  distribution  and  DNA  methylation  pattern  on  a  genome-wide  scale  are  briefly  reviewed  in  this  paper.
0	Completion  of  the  Human  Genome  Project  has  opened  a  new  era  in  studies  of  functions  of  cells  and  organisms.  Identification  of  the
0	thousands  of  genes  forming  genomes  brings  us  to  the  next  frontier:  elucidation  of  the  functions  of  these  genes  and  their  interactions  --
0	plate,  is  typical  for  regions  where  active  (or  potentially  active)  genes  are  located.  On  the  other  hand,  non-active  repressed  genes  are  located  primarily  in  regions  of  packed/condensed  chromatin  (heterochromatin)  (reviewed  in:  Groudine  &  Felsenfeld,  2003;  Fry  &  Peterson,  2001).  Because  of  technical  limitations,  the  knowledge  about  the  actual  state  of  chromatin  packing/condensation  and  its  relationship  to  transcriptional  activity  was  until  recently  restricted  to  a  small  number  of  genes  studied  in  a  few  model  organisms.  The  DNA  microarray  technology  delivered  the  unique  opportunity  to  survey  the  chromatin  structure  on  a  genome-wide  scale  at  the  resolution  of  single  genes.  In  fact,  modified  DNA  microarray  technology  has  already  been  applied  to  genome-wide  study  of  epigenetic  factors  and  processes  that  regulate  the  expression  of  genetic  information  (reviewed  in:  Pollack  &  Iyer,  2002).  This  new  field  could  be  termed  "epigenomics"  (Novik  et  al.,  2002).  This  paper  briefly  describes  attempts  to  use  DNA  microarrays  in  studies  of  chromatin  structure  on  a  genome-wide  scale.
0	ized  to  a  DNA  microarray,  either  "standard"  or  "specialized"  (e.g.  microarrays  of  promoter  sequences  or  CpG  islands).  DNA  could  be  fluorescence  labeled  either  during  PCR  amplification  or  without  amplification.  The  most  essential  step  in  such  "structural"  array  protocols  is  initial  isolation/fractionation  of  genomic  DNA  in  a  way  that  would  reflect  the  problem  to  be  analyzed.  Several  principles  that  lie  behind  such  fractionation  procedures  are  listed  below.
0	Differential  physicochemical  characteristics  of  nucleoprotein  complexes
0	The  initial  implementation  of  DNA  microarray  technology  into  genome  structural  research  was  comparative  genomic  hybridization  (CGH)  array,  which  allowed  high  resolution  analysis  of  gene  copy  number  (Solinas-Toldo  et  al.,  1997;  Pinkel  et  al.,  1998).  The  primary  difference  between  gene  expression  microarrays  and  the  CGH  array  is  replacement  of  RNA  samples  with  DNA  ones  as  a  starting  material.  Two  DNA  samples  are  labeled  with  different  fluorophores  and  co-hybridized  to  a  DNA  microarray,  and  their  fluorescence  ratio  represents  the  relative  DNA  copy  number.  Similar  strategies  could  be  applied  to  study  other  aspects  of  genome  structure:  "test"  and  "reference"  DNA  samples  that  are  differentially  labeled  and  co-hybrid-
0	One  of  such  strategies,  originally  described  by  Garrard  and  coworkers  (reviewed  in:  Huang  &  Garrard,  1988),  has  been  used  to  fractionate  chromatin  based  on  differential  solubility  of  histone  H1-containing  and  histone  H1-free  nucleosomes.  Isolated  nuclei  were  briefly  incubated  at  "physiological"  ionic  strength  with  micrococcal  nuclease,  which  specifically  cleaves  internucleosomal  linker  DNA.  That  treatment  solubilized  10-20%  of  the  chromatin,  which  was  collected  as  the  first  supernatant  fraction  termed  S1.  After  removal  of  salt  an  additional  50-60%  of  the  chromatin  was  solubilized,  which  was  collected  as  the  second  supernatant  fraction  termed  S2.  The  S1  fraction  contained  primarily  mononucleosomes  lacking  histone  H1  while  S2  consisted  of  histone  H1-containing  oligonucleosomal  particles.  Another  strategy  to  fractionate  genomic  DNA  based  on  specific  nucleoprotein  complexes  that  seems  to  be  potentially  applicable  to  DNA  microarray  analysis  would  be  isolation  of  nuclear  matrix-attached  DNA  (Sumer  et  al.,  2003).  The  nuclear  matrix  is  a  putative  skeletal  structure  isolated  from  nuclei  after  removal
0	TECHNICAL  REPORTS
0	Rapid  analysis  of  the  DNA-binding  specificities  of  transcription  factors  with  DNA  microarrays
0	We developed a new DNA microarray-based technology, called protein binding microarrays (PBMs), that allows rapid, high-throughput characterization of the in vitro DNA binding-site sequence specificities of transcription factors in a single day. Using PBMs, we identified the DNA binding-site sequence specificities of the yeast transcription factors Abf1, Rap1 and Mig1. Comparison of these proteins' in vitro binding sites with their in vivo binding sites indicates that PBM-derived sequence specificities can accurately reflect in vivo DNA sequence specificities. In addition to previously identified targets, Abf1, Rap1 and Mig1 bound to 107, 90 and 75 putative new target intergenic regions, respectively, many of which were upstream of previously uncharacterized open reading frames. Comparative sequence analysis indicated that many of these newly identified sites are highly conserved across five sequenced sensu stricto yeast species and, therefore, are probably functional in vivo binding sites that may be used in a condition-specific manner. Similar PBM experiments should be useful in identifying new cis regulatory elements and transcriptional regulatory networks in various genomes. The interactions between transcription factors and their DNA binding sites are an integral part of transcriptional regulatory networks. They control the coordinated expression of thousands of genes during normal growth and in response to external stimuli. Much progress has been made recently in the identification and analysis of mRNA transcript profiles1,2, locations of in vivo binding sites of transcription factors3-6 and protein-protein interactions7-10.
But  many  transcription  factors  still  have  unknown  DNA  binding  specificities  and  regulatory  roles.  Earlier  technologies  aimed  at  characterizing  DNA-protein  interactions  are  time-consuming  and  not  scalable.  Microarray-based  readout  of  chromatin  immunoprecipitation  (ChIP-chip),  or  genome-wide  location  analysis,  is  currently  the  most  widely  used  high-throughput  method  for  identifying  in  vivo  genomic  binding  sites  for  transcription  factors3-6.  But  some  ChIP-chip  experiments  do  not  result  in  significant  enrichment  of  bound  fragments  in  the  immunoprecipitated  sample.  In  addition,  there  may  be  transcription  factors  of  interest  for  which  a  specific  antibody  is  not  available  or  for  which  the  culture  conditions  or  time  points  that  allow  its  expression  and  activity  are  not  known.  We  previously  developed  a  spotted  microarray  technology  that  used  primer-extended,  double-stranded  synthetic  DNAs  to  quantify  the  differences  in  binding  affinities  for  various  DNA  binding-sequence  variants.  This  technology  allowed  us  to  distinguish  proteins  with  similar  binding-site  preferences  and  to  determine  the  binding  specificities  of  proteins  with  degenerate  sequence  preference11.  Another  group  recently  extended  this  technology  to  use  surface  plasmon  resonance12.  Although  surface  plasmon  resonance  can  provide  kinetic  data,  it  is  not  currently  scalable  to  a  large  number  of  samples.  Here  we  developed  a  new  in  vitro  DNA  microarray  technology  for  genome-scale  characterization  of  the  sequence  specificities  of  DNA-protein  interactions.  
This protein-binding microarray (PBM) technology allows the determination of in vitro binding specificities of individual transcription factors in a single day, by assaying the sequence-specific binding of those individual transcription factors directly to double-stranded DNA microarrays spotted with a large number of potential DNA-binding sites. A DNA-binding protein of interest is expressed with an epitope tag, purified and then bound directly to a double-stranded DNA microarray. The PBM is then washed to remove any nonspecifically bound protein and labeled with a fluorophore-conjugated antibody specific for the epitope tag (Fig. 1a). We focused our efforts on the genome of the yeast Saccharomyces cerevisiae because of its usefulness as a model organism for both experimental and computational studies. Binding-site data from PBMs on yeast transcription factors corresponded well with binding-site specificities determined from ChIP-chip. Moreover, comparative
0	NUMBER  12
0	DECEMBER  2004
0	TECHNICAL  REPORTS
0	[Figure 1a, PBM workflow (panel labels: dsDNA microarrays, GST, SybrGreen I): bind epitope-tagged TF to dsDNA microarrays; label with fluorophore-tagged antibody to epitope; scan triplicate microarrays; calculate normalized PBM data.]
0	sequence analysis of the PBM-derived binding sites indicated that many of the sites bound in PBMs, including some not identified by ChIP-chip, are highly conserved in other sensu stricto yeast genomes and therefore are probably functional in vivo binding sites that potentially are used in a condition-specific manner. Our PBM technology should aid in the annotation of many regulatory proteins whose DNA-binding specificities have not been characterized and in the construction of gene regulatory networks.
0	RESULTS
0	PBM experiments
0	As a validation of this approach, we bound CBP-FLAG-Rpn4 fusion protein to microarrays spotted with positive and negative control spots for binding by Rpn4. We labeled the protein-bound array with Cy3-conjugated M2 primary antibody to FLAG (Sigma) and scanned it with a microarray scanner (GSI Lumonics ScanArray). Only the spots that contain good matches to the binding-site motif for Rpn4 have high signal intensity (Supplementary Fig. 1 online). As we previously found that higher signal intensity is generally indicative of higher DNA-protein binding affinity [11], this CBP-FLAG-Rpn4 PBM indicates that our PBM technology is successful in identifying sequence-specific transcription factor binding. Next, we applied the PBM technology on a genome-wide scale by using whole-genome yeast intergenic arrays in PBM experiments to identify the sequence specificities and target genes of three yeast transcription factors: Abf1, Rap1 and Mig1. Abf1 has a zinc-finger DNA-binding domain, binds origins of replication and regulates ribosome synthesis.
Rap1 binds DNA through a Myb-like helix-turn-helix DNA-binding domain and, in addition to regulating ribosome synthesis [13], regulates telomere length and expression at the silent mating-type loci HML and HMR [14]. Mig1 has a zinc-finger DNA-binding domain and is involved in the repression of glucose-repressed genes [15]. We used Abf1, Rap1 and Mig1, dually tagged at the N terminus with glutathione S-transferase (GST) and His6, in PBM experiments
0	using microarrays spotted with essentially all the intergenic regions in the yeast genome [3]. The washed, protein-bound microarrays were labeled with Alexa 488-conjugated antibody to GST (Molecular Probes) and scanned with a microarray scanner. The microarray TIF images were quantified using GenePix Pro version 3.0 software. A whole-genome yeast intergenic microarray that was used in a PBM experiment with Rap1 is shown in Figure 1b,c. Negative control PBMs did not show sequence-specific DNA binding (Supplementary Fig. 2 online). For each transcription factor, experiments were done in triplicate. We found that the PBM data were highly reproducible, with most spots having a coefficient of variation (i.e., s.d. divided by the mean) <0.3 (Supplementary Fig. 3 online). To normalize the PBM data by relative DNA concentration, we stained separate microarrays from the same print run with SybrGreen I (Molecular Probes), which is specific for double-stranded DNA. The distribution of the log ratios of mean PBM to mean SybrGreen I signal intensities for the set of triplicate Rap1 PBM experiments is shown in Figure 2a. The spots on the left, whose distribution is fit well by a Gaussian function, are bound nonspecifically by the transcription factor. Conversely, the heavy upper tail of the distribution corresponds to spots that are bound specifically by the transcription factor. For each spot, we calculated a P value for specific binding based on the magnitude of its log ratio relative to the standard deviation of the Gaussian distribution. The numbers of unique spots that pass a P-value threshold of 0.05, 0.01 or 0.001 for t
0	Issues  in  cDNA  microarray  analysis:  quality  filtering,  channel  normalization,  models  of  variations  and  assessment  of  gene  effects
1	George  C.  Tseng1,  Min-Kyu  Oh2,  Lars  Rohlin2,  James  C.  Liao2  and  Wing  Hung  Wong1,3,*
0	1Department  2Department
0	ABSTRACT
0	We consider the problem of comparing the gene expression levels of cells grown under two different conditions using cDNA microarray data. We use a quality index, computed from duplicate spots on the same slide, to filter out outlying spots, poor quality genes and problematical slides. We also perform calibration experiments to show that normalization between fluorescent labels is needed and that the normalization is slide dependent and non-linear. A rank invariant method is suggested to select non-differentially expressed genes and to construct normalization curves in comparative experiments. After normalization the residuals from the calibration data are used to provide prior information on variance components in the analysis of comparative experiments. Based on a hierarchical model that incorporates several levels of variations, a method for assessing the significance of gene effects in comparative experiments is presented. The analysis is demonstrated via two groups of experiments with 125 and 4129 genes, respectively, in Escherichia coli grown in glucose and acetate.
0	INTRODUCTION
0	Although cDNA microarrays have been used for global monitoring of gene expression in many areas of biomedical research (1), methods for analysis of the resulting data are only beginning to be addressed systematically (2-7). We have performed a series of calibration and comparative experiments to address several important issues in data analysis and study design of microarray experiments. In each calibration experiment we purified total RNA from Escherichia coli cells and divided the sample into two aliquots for labeling by Cy3 and Cy5.
The  two  separately  labeled  samples  were  then  pooled  and  subdivided  into  hybridization  solutions  for  hybridization  to  multiple
0	slides. In the first group of experiments each slide had 125 E.coli genes multiply spotted (4 spots/gene) on it, while in the second each slide had 4129 genes singly spotted. The first and second groups of experiments will be called the 125 and 4129 gene projects, respectively, hereafter. Several levels of replication are embedded in the design of these calibration experiments and the resulting data provide information on the relative importance of variations due to spots, labels and slides. Based on this information, we formulate an approach to the analysis of comparative experiments where the samples to be compared are differentially labeled. The main components are as follows. (i) Detect and filter out poor quality genes on a slide using measurements from multiple spots. This procedure is not applicable in singly spotted designs. (ii) Perform slide-dependent non-linear normalization of the log ratios of the two channels. (iii) Apply hierarchical model-based analysis to the normalized log ratio scale, where assessment of the significance of gene effects is aided by statistical information obtained from calibration experiments, if they are available. Details of the experiments are given below and the analysis methodology is developed, justified and illustrated. A discussion of other important issues, such as why a two label design is useful and whether gene-label interaction is an important consideration, is also provided.
0	MATERIALS AND METHODS
0	Preparation of the DNA array
0	In the 125 gene project, to ensure uniform quality and quantity of the DNA probes, we constructed a gene library consisting of 125 genes each cloned into pBluescript II KS+ (Stratagene, La Jolla, CA) as previously reported (8,9).
These genes are involved in various aspects of E.coli physiology, including glycolysis, the TCA cycle, the pentose phosphate pathway, fermentation pathways, the heat shock response, major biosynthetic pathways and the respiratory system. The gene probes used in microarray construction were obtained by PCR amplifying the inserted genes using pBluescript II KS+-specific primers (Genosys, The Woodlands, TX),
0	5′-GGCCGCTCTAGAACTAGTGGAT-3′ and 5′-CTCGAGGTCGACGGTATCGATA-3′. PCR products were precipitated with ethanol and redissolved in 15 µl of 350 mM sodium bicarbonate/carbonate buffer, pH 9.0. Each gene was spotted four times on a slide to analyze the reliability and variability. In the 4129 gene project we performed the PCR reactions using Genosys E.coli ORFmers (the entire genome of E.coli) and an Eppendorf MasterTaq kit (Westbury, NY). Among 4290 primers, 161 failed to make products or proper sized products. The 4129 PCR products, representing 96% of the predicted open reading frames (10), were precipitated with propanol twice and then dissolved in 10 µl of 350 mM sodium bicarbonate/carbonate buffer, pH 9.0. They were arrayed with single spotting on each slide. All resulting slides with DNA probes underwent post-processing according to the protocol suggested by Eisen and Brown (11).
0	RNA purification and labeling
0	Escherichia coli strain MC4100 [F- araD139 (argF-lac) U169 rpsL150 relA1 flb5301 deoC1 ptsF25 rbsR] was cultured in shake flasks using M9 minimal medium (12) containing either 0.5% glucose or acetate as carbon source supplemented with 125 mg/l (w/v) arginine. When the optical density of the culture reached 0.4-0.6 at 550 nm total RNA was purified from 1 × 10^9 cells using the RNeasy Midi kit from Qiagen (Valencia, CA). The resulting RNA solution was incubated at 37°C with 100 U DNase (Gibco BRL, Rockville, MD) and 40 U RNasin RNase inhibitor (Promega, Madison, WI) for 30 min, extracted with phenol/chloroform and then precipitated with ethanol. After dissolution in 10-20 µl of RNase-free water, 30 µg total RNA was labeled with either Cy3 or Cy5 during reverse transcription.
The reverse transcription cocktail included 200 U Superscript RNase H- reverse transcriptase (Gibco BRL), E.coli gene-specific C-terminal primers (Genosys), 0.5 mM dATP, dTTP and dGTP, 0.2 mM dCTP and 0.1 mM Cy3- or Cy5-labeled dCTP (Amersham Pharmacia, Piscataway, NJ). After reverse transcription the RNA was degraded by adding 5 µl of 1 N NaOH and incubating at 65°C for 40 min. The resulting cDNA, labeled with either Cy3 or Cy5, was diluted with 60 µl of TE buffer, pH 8.0, and then mixed together. The labeled cDNA mixture was then concentrated to 1-2 µl using Micron50 (Millipore, Bedford, MA).
0	Hybridization and scanning
0	The concentrated Cy3- and Cy5-labeled cDNA was resuspended in 10 µl of hybridization solution, consisting of 50% formamide, 3x SSC, 1% SDS, 5x Denhardt's solution, 0.1 mg/ml salmon sperm DNA and 0.05 mg/ml yeast total RNA. Hybridization solution without 5x Denhardt's solution was also used for comparison. The labeled cDNA was denatured at 95°C for 3 min then quickly chilled on ice. The cDNA was then placed on the slide and covered by a coverslip. The slide was assembled with a hybridization chamber (Corning, Charlotte, NC) and hybridized for 14-20 h at 42°C. The hybridized slide was washed in 2x SSC, 0.1% SDS for 5 min at room temperature and then 0.2x SSC for 5 min prior to scanning. After drying, the hybridized slides were scanned with an Affymetrix 418 scanner (Santa Clara, CA) and the scanned images analyzed with the software program Imagene (Biodiscovery, Santa Monica, CA). The median intensities of
0	spot areas were calculated and imported into the program S-Plus (MathSoft, Cambridge, MA).
0	Description of experiments
0	We performed four calibration experiments and two comparative experiments in the 125 gene project, two calibration and two comparative ones in the 4129 gene project. Calibration experiments used the same mRNA pool divided into two aliquots and labeled separately with two different dyes in order to investigate variations in this technology. Some calibration experiments used genes from E.coli grown in acetate, while the others used E.coli grown in glucose. The comparative experiments labeled mRNA from E.coli grown in acetate with Cy3 and mRNA from E.coli grown in glucose with Cy5. Different slides in the same experiment were hybridized with the same pool of labeled cDNA and different experiments in the same project redid the whole experiment with the same pool of mRNA. We will use C, R and S to denote the calibration experiment, comparative (real) experiment and slide, respectively, and suffix numbers to indicate the sequence in the two projects. For example, C3S2 indicates slide 2 in the third calibration experiment and R1S2 slide 2 in the first comparative experiment. Some slides did not use Denhardt's solution during hybridization while others did. Detailed information concerning experimental design is listed in Table 1.
0	RESULTS AND DISCUSSION
0	Outline of analysis procedure
0	The steps of the proposed analysis are herein briefly described. The motivation and justification of each step will be given in subsequent sub-sections. To analyze a calibration experiment: (i) compute a quality measure for ea
0	TRENDS  in  Biochemical  Sciences
0	Standardization  of  protocols  in  cDNA  microarray  analysis
1	Vladimir Benes and Martina Muckenthaler
0	European  Molecular  Biology  Laboratory,  Meyerhofstrasse  1  D-69117  Heidelberg,  Germany
0	Here, we list the points to consider during a cDNA microarray experiment starting from gene, to spot, to insight:
0	Genome-wide expression profiling vs specialized microarrays
0	Selection and sequence verification of cDNA samples
0	Establishment of the technological microarray platform
0	Synthesis and purification of gene fragments
0	Surface chemistry
0	Spotting conditions
0	Array design
0	Preparation of the experimental and the reference sample
0	High quality RNA extraction from cultured cells, tissues, patient biopsies, laser capture microdissection, taking into consideration that the experimental and the reference samples must be treated identically
0	Choice of methodology for the synthesis of fluorescent-labelled cDNA
0	Yield of purified total RNA
0	Accuracy, sensitivity, background noise
0	Labour intensity and working time
0	Financial aspects
0	Implementation of controls (non-specific background, normalization and ratio)
0	Number and type of replicates (technical, biological)
0	Data acquisition and evaluation
0	Data normalization (global, intensity-dependent)
0	Interpretation of the microarray data (comparison, clustering, self-organizing maps)
0	Independent validation of the data (quantitative reverse transcriptase real-time PCR, Northern blot, in situ hybridization)
0	numerous  variations  that  can  occur  at  each  step  (Box  1).  Generally,  experimental  and  systematic  variations  can  be  distinguished:  experimental  variability  can  be  controlled  by  careful  experimental  design  [8]  and  through  a  sufficient  number  of  experimental  repeats;  systematic  variations  have  to  be  addressed  by  controls  on  the  array.  A  possible  source  for  systematic  variations  can  be  the  irregular  deposition  of  PCR  amplified  cDNAs  on  the  glass  surface  by  different  printing  pins  (including  `carry-over'  of  the  samples  between  adjacent  sample  wells  caused  by  inferior  washing  of  the  pins)  or  biases  associated  with  different  fluorescent  dyes.  It  has  been  recognized  that  fluorescent
0	dyes  such  as  Cy3  and  Cy5  exhibit  different  quantum  yields  and  are  differentially  sensitive  to  photobleaching  [9,10].  Depending  upon  the  type  of  the  activated  surface,  these  dyes  also  show  varying  background  levels  (E.  Furlong,  pers.  commun.).  Although  this  phenomenon  has  not  been  thoroughly  studied,  it  has  been  indicated  that  the  direct  incorporation  of  Cy3  and  Cy5  modified-nucleotide  analogues  into  the  cDNA  might  introduce  sequence-specific  artefacts  [11,12].  This  is  likely  to  be  caused  by  the  variable  and  differing  rates  by  which  these  bulky  nucleotide  analogues  are  in
0	Significance  analysis  of  microarrays  applied  to  the  ionizing  radiation  response
1	Virginia  Goss  Tusher*,  Robert  Tibshirani,  and  Gilbert  Chu*
0	Microarrays  can  measure  the  expression  of  thousands  of  genes  to  identify  changes  in  expression  between  different  biological  states.  Methods  are  needed  to  determine  the  significance  of  these  changes  while  accounting  for  the  enormous  number  of  genes.  We  describe  a  method,  Significance  Analysis  of  Microarrays  (SAM),  that  assigns  a  score  to  each  gene  on  the  basis  of  change  in  gene  expression  relative  to  the  standard  deviation  of  repeated  measurements.  For  genes  with  scores  greater  than  an  adjustable  threshold,  SAM  uses  permutations  of  the  repeated  measurements  to  estimate  the  percentage  of  genes  identified  by  chance,  the  false  discovery  rate  (FDR).  When  the  transcriptional  response  of  human  cells  to  ionizing  radiation  was  measured  by  microarrays,  SAM  identified  34  genes  that  changed  at  least  1.5-fold  with  an  estimated  FDR  of  12%,  compared  with  FDRs  of  60  and  84%  by  using  conventional  methods  of  analysis.  Of  the  34  genes,  19  were  involved  in  cell  cycle  regulation  and  3  in  apoptosis.  Surprisingly,  four  nucleotide  excision  repair  genes  were  induced,  suggesting  that  this  repair  pathway  for  UV-damaged  DNA  might  play  a  previously  unrecognized  role  in  repairing  DNA  damaged  by  ionizing  radiation.
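The scoring and permutation idea summarized above can be illustrated with a short sketch. This is a simplified reading, not the SAM implementation: the score is the difference of group means divided by a pooled standard deviation plus a small constant `s0` (as defined later in the text), and falsely called genes are estimated by simply relabeling the samples and counting how many genes still exceed a fixed threshold. The function names, the fixed-threshold rule and the exhaustive relabeling scheme are assumptions made for illustration.

```python
import itertools
import math
import statistics


def relative_difference(state_i, state_u, s0):
    """Score for one gene: mean difference over (pooled s.d. + s0).

    s0 should be positive so that the denominator cannot be zero.
    """
    n1, n2 = len(state_i), len(state_u)
    mean_i = statistics.fmean(state_i)
    mean_u = statistics.fmean(state_u)
    # Sum of squared deviations of each measurement from its own group mean.
    ss = sum((x - mean_i) ** 2 for x in state_i)
    ss += sum((x - mean_u) ** 2 for x in state_u)
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    return (mean_i - mean_u) / (math.sqrt(a * ss) + s0)


def estimate_fdr(genes, n1, threshold, s0):
    """Estimate the share of called genes expected by chance (illustrative).

    genes: list of per-gene measurement lists; the first n1 values of each
    list belong to state I, the remaining values to state U.
    Returns (number called, mean number called under relabeling, FDR).
    """
    n = len(genes[0])
    original = set(range(n1))
    mirror = set(range(n1, n))  # swapping the two groups gives the same |d|
    called = sum(
        abs(relative_difference(g[:n1], g[n1:], s0)) >= threshold for g in genes
    )
    false_counts = []
    for idx in itertools.combinations(range(n), n1):
        chosen = set(idx)
        if chosen == original or chosen == mirror:
            continue  # skip the true labeling and its mirror image
        count = 0
        for g in genes:
            group1 = [g[j] for j in idx]
            group2 = [g[j] for j in range(n) if j not in chosen]
            if abs(relative_difference(group1, group2, s0)) >= threshold:
                count += 1
        false_counts.append(count)
    expected_false = statistics.fmean(false_counts)
    fdr = expected_false / called if called else 0.0
    return called, expected_false, fdr
```

Raising the threshold calls fewer genes and usually lowers the estimated FDR; lowering it does the opposite, which is the trade-off the adjustable threshold is meant to expose.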
0	Microarray Hybridization. Each gene in the microarray was represented by 20 oligonucleotide pairs, each pair consisting of an oligonucleotide perfectly matched to the cDNA sequence, and a second oligonucleotide containing a single base mismatch. Because gene expression was computed from differences in hybridization to the matched and mismatched probes, expression levels were sometimes reported by the GENECHIP ANALYSIS SUITE software as negative numbers.
0	Northern Blot Hybridization. Total RNA (15 µg) was resolved by agarose gel electrophoresis, transferred to a nylon membrane, and hybridized to specific radiolabeled DNA probes, which were prepared by PCR amplification.
0	DNA microarrays contain oligonucleotide or cDNA probes for measuring the expression of thousands of genes in a single hybridization experiment. Although massive amounts of data are generated, methods are needed to determine whether changes in gene expression are experimentally significant. Cluster analysis of microarray data can find coherent patterns of gene expression (1) but provides little information about statistical significance. Methods based on conventional t tests provide the probability (P) that a difference in gene expression occurred by chance (2, 3). Although P = 0.01 is significant in the context of experiments designed to evaluate small numbers of genes, a microarray experiment for 10,000 genes would identify 100 genes by chance. This problem led us to develop a statistical method adapted specifically for microarrays, Significance Analysis of Microarrays (SAM). SAM identifies genes with statistically significant changes in expression by assimilating a set of gene-specific t tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene. Genes with scores greater than a threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements. The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set. To demonstrate its utility, SAM was used to analyze a biologically important problem: the transcriptional response of lymphoblastoid cells to ionizing radiation (IR).
0	Materials and Methods
0	Results
0	RNA was harvested from wild-type human lymphoblastoid cell lines, designated 1 and 2, growing in an unirradiated state (U) or in an irradiated state (I) 4 h after exposure to a modest dose of 5 Gy of IR. RNA samples were labeled and divided into two identical aliquots for independent hybridizations, A and B. Thus, data for 6,800 genes on the microarray were generated from eight hybridizations (U1A, U1B, U2A, U2B, I1A, I1B, I2A, and I2B). We scaled the data from different hybridizations as follows. A reference data set was generated by averaging the expression of each gene over all eight hybridizations. The data for each hybridization were compared with the reference data set in a cube root scatter plot. We chose the cube root scatter plot because it resolved the vast majority of genes that are expressed at low levels and permitted the inclusion of negative levels of expression that are sometimes generated by the GENECHIP software. A linear least-squares fit to the cube root scatter plot was then used to calibrate each hybridization. After scaling, a linear scatter plot was generated for average gene expression in the four A aliquots (U1A, I1A, U2A, and I2A) vs. the average in the four B aliquots (U1B, I1B, U2B, and I2B), a partitioning of the data that eliminates biological changes in gene expression (Fig. 1A). The linear scatter plot confirmed that the data were generally reproducible but failed to resolve genes expressed at low levels. Better resolution of these genes was achieved by the cube root scatter plot (Fig.
1B),  which  revealed  three  salient  features:  the  large  percentage  of  genes  (24%)  assigned  negative  levels  of  expression,  the  large  percentage  of  genes  with  low  levels  of  expression,  and  the  low  signal-to-noise  ratio  at  low  levels  of  expression.  To  assess  the  biological  effect  of  IR,  a  scatter  plot  was  generated  for  average  gene  expression  in  the  four  irradiated  states  vs.  the  four  unirradiated  states  (compare  Fig.  1  B  and  C).  A  few  of  the  potentially  significant  changes  in  gene  expression  are  indicated  by  arrows  in  Fig.  1C,  but  the  effect  was  not  easily  quantified,  and  a  method  was  needed  to  identify  changes  with  statistical  confidence.
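The cube root calibration described above can be sketched as follows. The text does not spell out how the fitted line was applied, so this is one plausible reading under stated assumptions: both the hybridization and the reference are cube-root transformed (keeping the sign so that negative expression levels are allowed), a straight line is fitted by least squares, and the fitted values are cubed to return to the original scale. All function names are hypothetical.

```python
import math


def signed_cbrt(x):
    """Cube root that keeps the sign, so negative expression levels are allowed."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(hybridization, reference):
    """Map one hybridization onto the reference scale (assumed procedure)."""
    x = [signed_cbrt(v) for v in hybridization]
    y = [signed_cbrt(v) for v in reference]
    slope, intercept = fit_line(x, y)
    # Apply the fitted line on the cube root scale, then cube to undo the transform.
    return [(slope * xi + intercept) ** 3 for xi in x]
```

For example, a hybridization whose values are uniformly twice the reference is mapped back onto the reference, because on the cube root scale that distortion is exactly a straight line through the origin.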
0	Abbreviations:  SAM,  significance  analysis  of  microarrays;  FDR,  false  discovery  rate;  IR,  ionizing  radiation;  FWER,  family-wise  error  rate.
0	Preparation of RNA. Human lymphoblastoid cell lines GM14660 and GM08925 (Coriell Cell Repositories, Camden, NJ) were seeded at 2.5 × 10^5 cells/ml and exposed to IR 24 h later. RNA was isolated, labeled, and hybridized to the HUGENEFL GENECHIP microarray according to manufacturer's protocols (Affymetrix, Santa Clara, CA).
0	where $\sum_m$ and $\sum_n$ are summations of the expression measurements in states I and U, respectively, $a = (1/n_1 + 1/n_2)/(n_1 + n_2 - 2)$, and $n_1$ and $n_2$ are the numbers of measurements in states I and U (four in this experiment). To compare values of d(i) across all genes, the distribution of d(i) should be independent of the level of gene expression. At low expression levels, variance in d(i) can be high because of small values of s(i). To ensure that the variance of d(i) is independent of gene expression, we added a small positive constant s0 to the denominator of Eq. 1. The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s0 was chosen to minimize the coefficient of variation. For the data in this paper, this computation yielded s0 = 3.3. Scatter plots of d(i) vs. s(i) are shown in Fig. 2. The scatter plot for relative difference between states I and U is shown in Fig. 2A. By contrast, the scatter plot for relative difference between cell lines 1 and 2 shows more marked changes in Fig. 2B. These relative differences exceeded random fluctuations in the data, as measured by the relative difference between hybridizations A and B in Fig. 2C. Although the relative difference computed from hybridizations A and B provided a control for random fluctuations, additional controls were needed to assign statistical significance to the biological effect of IR. Instead of performing more experiments, which
0	Tusher  et  al.
0	April  24,  2001
0	where $\bar{x}_I(i)$ and $\bar{x}_U(i)$ are defined as the average levels of expression for gene (i) in states I and U, respectively. The "gene-specific scatter" s(i) is the standard deviation of repeated expression measurements:
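The displayed equations that these definitions refer to are missing from this copy. Based only on the surrounding definitions (a relative difference whose denominator is s(i) plus the constant s0, a pooled standard deviation over both states, and the factor a), they can plausibly be reconstructed as below. The typesetting and the index symbols m and n are assumptions.

```latex
% Eq. 1: relative difference for gene i
d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}

% Eq. 2: gene-specific scatter for gene i
s(i) = \sqrt{a \left\{ \sum_m \left[ x_m(i) - \bar{x}_I(i) \right]^2
                     + \sum_n \left[ x_n(i) - \bar{x}_U(i) \right]^2 \right\}},
\qquad a = \frac{1/n_1 + 1/n_2}{n_1 + n_2 - 2}
```

Here m runs over the measurements in state I and n over those in state U, so with n_1 = n_2 = 4 the factor is a = (1/4 + 1/4)/6 = 1/12.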
0	Our  approach  was  based  on  analysis  of  random  fluctuations  in  the  data.  In  general,  the  signal-to-noise  ratio  decreased  with  decreasing  gene  expression  (Fig.  1).  However,  even  for  a  given  level  of  expression,  we  found  that  fluctua
0	Nonparametric  methods  for  identifying  differentially  expressed  genes  in  microarray  data
1	Olga  G.  Troyanskaya  1,  Mitchell  E.  Garber  1,  Patrick  O.  Brown  2,  3,  David  Botstein  1,  and  Russ  B.  Altman  1,
0	Department
0	BACKGROUND
0	DNA microarray technology allows for the monitoring of expression levels of thousands of genes under a variety of conditions. A major question in microarray studies is how to select genes associated with specific physiological states or clinical parameters, for example genes whose expression in a tumor sample is related to a specific tumor subtype or patient survival. In a clinical context, such differentially expressed genes are often referred to as clinical markers. Clinical markers can form the basis for diagnostic tests, particularly if they can be assayed in reliable and inexpensive ways. Identification of clinical markers may lead to improved diagnosis and treatment guidance, early disease detection, and clinical outcomes prediction. While routine clinical use of microarrays is still not feasible, they may provide methods for fast, accurate, and systematic identification of biomedical markers from the data generated by gene expression experiments. Clinicians can then assay the expression of one or a few such markers by immunohistochemistry or quantitative PCR (Kim, 2001). Moreover, relating specific groups of genes with specific biological correlates is a critical step toward understanding the underlying molecular mechanisms and identifying novel therapeutic targets. The most commonly used tools for identification of differentially expressed genes include qualitative observation (usually following some form of clustering of expression patterns), heuristic rules, and model-based probabilistic analysis. The simplest heuristic is setting cutoffs for gene expression changes over a background expression level. In an early gene expression study, Iyer et al. (1999) sought genes whose expression changed by a factor of 2.20 or more in at least two of the experiments. DeRisi et al.
(1997)  looked  for  2-fold  induction  of  gene  expression  compared  to  baseline.  Xiong  et  al.  (2001)  identified  indicator  genes  based  on  classification  errors  by  feature  wrappers  (including  linear  discriminant  analysis,  logistic  regression,  and  support  vector  machines).  Although  this  approach  is  not  based  on  specific  data  modeling  assumptions,  the  results  are  affected  by  assumptions  behind  the  specific  classification  methods  used  for  scoring.
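The cutoff heuristics mentioned above are simple enough to state in a few lines. The sketch below is illustrative only and assumes that each gene is described by positive expression ratios relative to baseline, one per experiment, and that a "change by a factor" counts in either direction (induction or repression). The function name and default parameters are hypothetical, with the defaults chosen to mirror the 2.20-fold, at-least-two-experiments rule.

```python
def passes_fold_change(ratios, cutoff=2.2, min_experiments=2):
    """Return True if enough experiments show a change of at least `cutoff`-fold.

    ratios: expression relative to baseline, one positive number per experiment.
    A ratio counts as a hit if it is >= cutoff (induction) or <= 1/cutoff
    (repression); counting repression is an assumption of this sketch.
    """
    hits = sum(1 for r in ratios if r >= cutoff or r <= 1.0 / cutoff)
    return hits >= min_experiments
```

A 2-fold induction rule in a single experiment corresponds to `passes_fold_change(ratios, cutoff=2.0, min_experiments=1)`, although that call would also accept 2-fold repression.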
0	sum  test)  with  heuristic-based  inference.  We  evaluate  the  performance  of  these  methods  on  generated  expression  data  as  well  as  on  real  biological  data  sets.
0	METHODS
0	Experimental methods
0	We implemented and evaluated three methods for model-free identification of differentially expressed genes in microarray analysis: a nonparametric t-test, a Wilcoxon rank sum test, and a heuristic idealized discriminator method. The evaluation included applications to both simulated data and real biological data. By using simulated data, we could first evaluate the methods on data sets with known differentiator genes in the context of different noise levels. The simulated data were generated to create plausible distributions of microarray expression values while not perfectly matching any particular data set. From qualitative comparisons of distribution histograms and quantile-quantile plots of several biological data sets (Alizadeh et al., 2000; Garber et al., 2001; Gasch et al., 2000), we found that normally distributed data with added noise drawn from uniform distributions ranging from U(-0.01, 0.01) to U(-0.1, 0.1) approximated the true distributions reasonably well. Such an approximate fit to biological data is similar to the differences in data distributions between real microarray experiments. To test the methods, we generated ten simulated data sets (5000 genes by 40 experiments each) at each of the six noise levels (U(-0.01, 0.01), U(-0.05, 0.05), U(-0.1, 0.1), U(-0.5, 0.5), U(-0.75, 0.75), U(-1.0, 1.0)). Increasing noise levels in the data sets allowed us to test robustness of the methods on very noisy data. Each data set included twenty predictor genes (markers), whose values were generated from two different normal distributions: group 1 (20 experiments) and group 2 (20 experiments).
The  rest  of  the  genes,  for  which  all  values  were  generated  from  one  normal  distribution  per  gene,  were  considered  nonpredictors.  The  means  of  each  normal  distribution  were  generated  from  a  random  normal  distribution  with  a  mean  of  0  and  standard  deviation  of  0.25  for  nonpredictors  and  standard  deviation  of  0.5  for  predictors.  Each  of  the  methods  was  then  applied  to  each  simulated  data  set,  and  true  positive  rate  (TPR)  and  false  positive  rate  (FPR)  were  calculated  according  to  the  following  formulae.
0	TPR  =  number  of  predictors  identified
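The simulation design and the two rates can be sketched as follows. This is a hedged illustration only: the function names `simulate` and `rates` are hypothetical, the within-gene standard deviation `within_sd` is not stated in the text and is an assumed value, and the denominators used for TPR and FPR are the standard definitions (total predictors and total nonpredictors), which the truncated formula above does not confirm.

```python
import numpy as np

def simulate(n_genes=5000, n_per_group=20, n_predictors=20,
             noise=0.1, within_sd=0.25, seed=0):
    """Generate one simulated data set (genes x experiments).

    within_sd is an assumption; the text gives only the spread of the means.
    """
    rng = np.random.default_rng(seed)
    n = 2 * n_per_group
    # Nonpredictors: one mean per gene, drawn from N(0, 0.25).
    means = rng.normal(0.0, 0.25, size=(n_genes, 1)) * np.ones((1, n))
    # Predictors: a separate mean for each group, drawn from N(0, 0.5).
    means[:n_predictors, :n_per_group] = rng.normal(0.0, 0.5, size=(n_predictors, 1))
    means[:n_predictors, n_per_group:] = rng.normal(0.0, 0.5, size=(n_predictors, 1))
    data = rng.normal(means, within_sd)
    # Additive uniform noise U(-noise, noise), as in the six noise levels.
    data += rng.uniform(-noise, noise, size=data.shape)
    is_predictor = np.zeros(n_genes, dtype=bool)
    is_predictor[:n_predictors] = True
    return data, is_predictor

def rates(selected, is_predictor):
    """Return (TPR, FPR) for a boolean selection mask over genes."""
    selected = np.asarray(selected, dtype=bool)
    tpr = (selected & is_predictor).sum() / is_predictor.sum()
    fpr = (selected & ~is_predictor).sum() / (~is_predictor).sum()
    return float(tpr), float(fpr)
```

Any of the three identification methods can then be scored by passing its boolean selection of genes to `rates` together with the known `is_predictor` mask.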
0	A  spline  function  approach  for  detecting  differentially  expressed  genes  in  microarray  data  analysis
1	Wenqing  He
0	Prosserman Center for Health Research, Samuel Lunenfeld Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada M5G 1X5
0	Microarray  technology  has  been  increasingly  used  in  medical  studies  such  as  cancer  research.  This  technology  makes  it  possible  to  measure  the  expressions  of  thousands  of  genes  simultaneously  under  a  variety  of  conditions.  The  objectives  of  microarray  studies  often  include  finding  genes  which  have  different  expressions  between  conditions  and  making  predictions  on  outcomes  such  as  tumor  types  in  cancer  research.  In  most  cases,  the  predictions  are  based  on  those  genes  that  are  differentially  expressed,  and  therefore,  detection  of  differentially  expressed  genes  plays  an  important  role.  Commonly  used  methods  for  identification  of  differentially  expressed  genes  include  qualitative  observations,  heuristic  rules  such  as  cutoff  settings,  and  model-based  probability
0	analyses. Iyer et al. (1999) chose genes whose expression in at least two arrays changed by more than 2.20 times the baseline expression. DeRisi et al. (1997) selected genes with at least 2-fold changes over their baseline expression. These heuristic rules consider only the absolute expression changes of genes and do not account for variation in gene expression. Moreover, the cutoff values for identifying differentially expressed genes are arbitrary. Thus, these methods have not been used widely. Several probabilistic approaches have been proposed to detect differentially expressed genes. One intuitive method is the two-sample t-test, which selects genes that have significantly different means between conditions. One problem with two-sample t-tests is that some genes with small differences between conditions may be selected because of their very small within-group variation. To correct for the effect of small variance, Tusher et al. (2001) proposed a modified t-statistic in which a constant is added to the denominator of the traditional t-statistic. As microarray data commonly contain various types of variation, the normality assumption for expression measurements is often not adequate (Hunter et al., 2001), and therefore normal-distribution-based inference may not be valid. In this context, non-parametric methods are more attractive because no specific distributions of data are required. Dudoit et al. (2002) used a non-parametric t-test with a corrected family-wise error rate to detect differentially expressed genes. Tusher et al. (2001) discussed significance analysis of microarrays (SAM), in which repeated measurements are permuted to estimate the false discovery rate of differentially expressed genes. Efron and Tibshirani (2002) considered a Wilcoxon statistic and estimated the associated distributions using an empirical Bayes approach. Pan et al. (2002) applied a normal mixture approach to a t-type statistic when the sample size under each condition is even. Zhao and Pan (2003) further proposed a modified statistic which overcomes the disagreement between the null statistic and the test statistic under the null hypothesis (no differential expression here), and
0	their method can be used for data without even numbers of samples. The basic idea of those non-parametric approaches is to construct a null statistic and a test statistic which have the same distribution under the null hypothesis; deviation of the distribution of the test statistic under the alternative hypothesis is then used to identify differentially expressed genes. The distributions of the null and test statistics under the null and alternative hypotheses are estimated non-parametrically. Although non-parametric methods have the advantage of not requiring a specific distributional form, there are some drawbacks. The inferential procedures based on non-parametric methods are generally complex (Efron et al., 2000; http://www-stat.stanford.edu/tibs/research.html). Non-parametric estimates may not be as efficient as parametric estimates, and therefore the tests for differentially expressed genes may not have adequate power. The non-parametric Wilcoxon test, for example, is rank based and does not make use of all available information for genes; thus it may have low power to identify differentially expressed genes (Thomas et al., 2001). Furthermore, as pointed out in Pan (2002), the Wilcoxon test is not applicable when the expression levels of a gene may have unequal variances under the two experimental conditions. In this paper, we propose to use a weakly parametric approach to characterize the density functions for both differentially and non-differentially expressed genes. Specifically, we consider a spline function approach. This approach is widely used in survival analysis to model hazard functions (e.g. He and Lawless, 2003). It has the appeal that no strong assumptions about the underlying distributions are needed, and the inferences are likelihood based and therefore straightforward. We use maximum likelihood methods to estimate the parameters involved in the density functions and the prior probability of differentially expressed genes. The posterior probability is then used to identify differentially expressed genes. The proposed method is applied to a real data set, and the results are compared with those obtained by some existing methods. A simulation study is also conducted to assess the performance of the proposed method. We end with concluding remarks.
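The small-variance problem of the two-sample t-test, and the remedy of adding a constant to the denominator, can be illustrated with a short sketch. The function `modified_t` and the value of the constant `s0` are assumptions chosen for the example; how the constant is selected in practice is not described in the text, and this is not the spline method proposed in this paper.

```python
import numpy as np

def modified_t(y1, y2, s0=0.0):
    """Per-gene t-type statistic with a constant s0 added to the denominator.

    y1, y2 : genes x arrays matrices for conditions 1 and 2
    s0     : hypothetical stabilizing constant; s0 = 0 gives the ordinary
             pooled-variance two-sample t-statistic
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n1, n2 = y1.shape[1], y2.shape[1]
    diff = y1.mean(axis=1) - y2.mean(axis=1)
    pooled = ((n1 - 1) * y1.var(axis=1, ddof=1)
              + (n2 - 1) * y2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return diff / (se + s0)
```

With `s0 = 0`, a gene with a tiny mean difference but very small within-group variation can outrank a gene with a large difference; a positive `s0` damps that effect.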
0	METHODS
0	Microarray data
0	Let the matrix $[Y_{ij}]$ denote a microarray data set of gene expressions, $i = 1, \ldots, N$, $j = 1, \ldots, n$, with rows being genes and columns being arrays (samples). Without loss of generality, consider two different experimental conditions, and let expression measurements for microarrays under conditions 1 and 2 be indexed by $j = 1, \ldots, n_1$ and $j = n_1 + 1, \ldots, n_1 + n_2$, respectively, where $n_1 + n_2 = n$. The entries of the matrix may be the log ratios in cDNA microarrays, or summary differences of the perfect match (PM) and mismatch (MM) scores from oligonucleotide arrays.
0	The primary interest here is to detect genes which are differentially expressed under the two conditions. In many applications, the focus is on identifying genes based on different mean expressions. For gene $i$, $i = 1, \ldots, N$, assume that gene expressions follow the model $Y_{ij} = \mu_{i1} + \epsilon_{ij}$ and $Y_{ik} = \mu_{i2} + \epsilon_{ik}$, where $\mu_{i1}$ and $\mu_{i2}$ are the mean expressions of gene $i$ under conditions 1 and 2, respectively, and $\epsilon_{ij}$, $j = 1, \ldots, n_1$, and $\epsilon_{ik}$, $k = n_1 + 1, \ldots, n_1 + n_2$, are independent random errors with mean 0 and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. $\sigma_1^2$ and $\sigma_2^2$ are not necessarily equal. It is a common assumption that random errors are symmetric. Note that the normality assumption is not made here. It is of interest to test the null hypothesis $H_0: \mu_{i1} = \mu_{i2}$, i.e. whether or not gene $i$ is differentially expressed under the two conditions. This may appear to be a two-sample comparison problem. However, the characteristics of microarray data limit the direct application of traditional statistical tests. The total number $N$ of genes is large, usually several thousand or more, whereas the numbers of arrays ($n_1$ and $n_2$ here) are usually small (<100; in some cases, the array numbers are <20). These features make traditional t-tests or non-parametric rank-based tests infeasible (Pan, 2003). Furthermore, when multiple comparisons are needed, it is difficult to specify various significance levels. To utilize the large size of $N$ and information between genes, a plausible way is to select differentially expressed genes based on the distributions of some statistics related to all gene expression levels $\{Y_{i1}, \ldots, Y_{in_1}\}$ and $\{Y_{i,n_1+1}, \ldots, Y_{in}\}$ for $i = 1, \ldots, N$. For gene $i$ let $z_i$ and $Z_i$ be statistics that have the same distribution under the null hypothesis $H_0: \mu_{i1} = \mu_{i2}$. Under the alternative hypothesis $H_a: \mu_{i1} \neq \mu_{i2}$, however, the distribution of $Z_i$ deviates from its distribution under the null hypothesis, whereas the distribution of $z_i$ does not change. $z_i$ and $Z_i$ are often called the null and test statistics. Several authors discussed the formulation of such summary statistics. The Wilcoxon statistic was discussed in Efron and Tibshirani (2002). Pan et al. (2002) considered
0	Microarrays  permit  the  analysis  of  gene  expression,  DNA  sequence  variation,  protein  levels,  tissues,  cells  and  other  biological  and  chemical  molecules  in  a  massively  parallel  format.  Robust  microarray  manufacture,  hybridization,  detection  and  data  analysis  technologies  permit  novice  users  to  adapt  this  exciting  technology  readily,  and  more  experienced  users  to  push  the  boundaries  of  discovery.
0	Trends  in  microarray  analysis
0	Figure (workflow diagram; labels only). Separate hybridizations: purify mRNA, label cDNA, hybridize, wash and scan each sample, then superimpose. Mixed hybridization: purify mRNA, label cDNA, mix, hybridize and wash, scan and superimpose.
0	electricity, organic vapors and biological contaminants can improve the quality of microarray manufacture in all settings, ranging from the smallest research laboratories to the largest commercial facilities (see Supplementary Note online). Fluorescent probes for expression profiling are typically prepared from total RNA or messenger RNA (mRNA) by reverse transcription, although many different labeling strategies are available. Methods that use T7 RNA polymerase produce large amounts of amplified RNA and are widely used to generate probes from small amounts of sample. Because amplified RNA is produced by linear amplification with T7 polymerase, population skewing and the loss of quantitation are minimal. Control and experimental samples can be labeled separately with fluors that have non-overlapping emission spectra, including cyanine, Alexa, and other fluorescent derivatives. Two samples labeled with different fluors can be hybridized to a single chip to derive absolute and comparative expression information in the two samples.
0	to allow their import into software programs for data mining and modeling24. Transformed and normalized data are represented and modeled using a variety of software tools, including scatter plots, principal component analysis (PCA), cluster diagrams, self-organizing maps (SOMs), neural networks and other algorithms25-29. Although the mathematical and statistical basis of the computational tools is complex, each endeavors to provide functionally relevant relationships between genes and gene products, assign putative function to unknown sequences, identify potential disease markers, elucidate the biochemical basis of drug and hormone action, and so forth (see Supplementary Table F online). The experimental aspects of microarray analysis are linked to data extraction, analysis and modeling in the microarray workflow process (Fig. 2). Intranets and the Internet, together with relational database warehouses, figure centrally in generating, mining, storing and retrieving microarray data (Fig. 2). Downloadable software (`shareware') packages are available free of charge to microarray researchers worldwide (see Supplementary Note online). Forums on microarray data analysis, such as the Critical Assessm
0	MIAME,  we  have  a  problem
1	Robert  Shields
0	Trends  in  Genetics,  Elsevier,  84  Theobald's  Road,  London,  UK,  WC1X  8RR
0	consistency is improved because the same cross-hybridizing sequences are then detected by all platforms [3]? As if the problems associated with different platforms were not enough, a recent trio of articles [4-7] showed not only inconsistencies across platforms but also inconsistencies among laboratories that were using the same platform, and even using the same RNA samples. Matters were improved by the use of common protocols for RNA work-up and also, and the importance of this is not widely appreciated, common methods of data handling and analysis. If scientists are to create gene expression databases that incorporate results from multiple laboratories, it is simply not good enough to adhere to the minimum information about a microarray experiment (MIAME) guidelines, which only focus on the documentation of experimental details, while failing to address real problems with the technology and how it is used. Equally depressing is the rush to apply microarrays to obtain `gene signatures' to aid disease diagnosis and prognosis. Again, results from different groups studying ostensibly the same disease are frequently non-concordant [7,8]. The use of different microarray platforms is partly to blame for this. But perhaps most of the problem comes from lack of `inferential literacy' meeting lack of epidemiological savvy. The Toxicogenomics Research Consortium suggested that more-consistent results would be achieved not with signatures from individual genes but by examining the gene ontology (GO) categories of the differentially expressed genes [6]. Perhaps, but it is a sobering comment that when two RNA samples were compared in different laboratories, on different platforms and analysed in the same way, gene-by-gene list comparisons varied. All that could be agreed on were the changes in different GO categories - representative of the tissue of origin of the samples [6]. If scientists in different laboratories cannot agree on an ordered list of gene-expression differences when presented with the same two RNA samples, we really do have a problem. So what is the solution? Obviously, putting the right probes on the array would be a start - interrogating the same transcript or splice form is important. Consistent standards between laboratories would help improve the consistency of results - but consistency is not enough - after all, the results within a laboratory were all consistent, but results can be consistently wrong. What we need is a proper evaluation of microarrays (including sample extraction and work-up, data handling and analysis) and an understanding of what is important to achieve consistent, accurate and reproducible results across laboratories. But perhaps
0	most  important  is  that  scientists  understand  the  nature  of  the  technology  they  are  using  -  including  experimental  design,  execution  and  analysis.  We  need  to  go  beyond  MIAME.
0	1 Miron, M. and Nadon, R. (2006) Inferential literacy for experimental high-throughput biology. Trends Genet. 22, (this issue, February 2006) doi: 10.1016/j.tig.2005.12.001
0	2 Draghici, S. et al. (2006) Reliability and reproducibility issues in DNA microarray measurements. Trends Genet. 22, (this issue, February 2006) doi: 10.1016/j.tig.2005.12.005
0	Project  Creates  Repository  for  Microarray  Datasets
0	NEWS
0	NCBI  GEO:  mining  millions  of  expression  profiles--database  and  tools
1	Tanya  Barrett,  Tugba  O.  Suzek,  Dennis  B.  Troup,  Stephen  E.  Wilhite,  Wing-Chi  Ngau,  Pierre  Ledoux,  Dmitry  Rudnev,  Alex  E.  Lash,  Wataru  Fujibuchi  and  Ron  Edgar*
0	National  Center  for  Biotechnology  Information,  National  Library  of  Medicine,  National  Institutes  of  Health,  45  Center  Drive,  Bethesda,  MD,  USA
0	INTRODUCTION  Since  2000,  the  Gene  Expression  Omnibus  (GEO)  has  served  as  a  public  repository  for  high-throughput  molecular  abundance  experimental  data,  providing  free  distribution  and  shared  access  to  comprehensive  datasets  (1).  These  data  include  single  and  multiple  channel  microarray-based  experiments
0	The principal architecture of the GEO database remains as described previously (1). Briefly, data submitted to GEO are stored in a relational database partitioned into three upper-level entity types: Platform, Sample and Series. A Platform describes the list of elements (e.g. oligonucleotide probesets, cDNAs, SAGE tags, antibodies) being assayed or
0	A  Drosophila  full-length  cDNA  resource
1	Mark Stapleton*, Joe Carlson*, Peter Brokstein*, Charles Yu*, Mark Champe*§, Reed George*, Hannibal Guarin*, Brent Kronmiller*¶, Joanne Pacleb*, Soo Park*, Ken Wan*, Gerald M Rubin*¥# and Susan E Celniker*
0	Background:  A  collection  of  sequenced  full-length  cDNAs  is  an  important  resource  both  for  functional  genomics  studies  and  for  the  determination  of  the  intron-exon  structure  of  genes.  Providing  this  resource  to  the  Drosophila  melanogaster  research  community  has  been  a  long-term  goal  of  the  Berkeley  Drosophila  Genome  Project.  We  have  previously  described  the  Drosophila  Gene  Collection  (DGC),  a  set  of  putative  full-length  cDNAs  that  was  produced  by  generating  and  analyzing  over  250,000  expressed  sequence  tags  (ESTs)  derived  from  a  variety  of  tissues  and  developmental  stages.  Results:  We  have  generated  high-quality  full-insert  sequence  for  8,921  clones  in  the  DGC.  We  compared  the  sequence  of  these  clones  to  the  annotated  Release  3  genomic  sequence,  and  identified  more  than  5,300  cDNAs  that  contain  a  complete  and  accurate  protein-coding  sequence.  This  corresponds  to  at  least  one  splice  form  for  40%  of  the  predicted  D.  melanogaster  genes.  We  also  identified  potential  new  cases  of  RNA  editing.  Conclusions:  We  show  that  comparison  of  cDNA  sequences  to  a  high-quality  annotated  genomic  sequence  is  an  effective  approach  to  identifying  and  eliminating  defective  clones  from  a  cDNA  collection  and  ensure  its  utility  for  experimentation.  Clones  were  eliminated  either  because  they  carry  single  nucleotide  discrepancies,  which  most  probably  result  from  reverse  transcriptase  errors,  or  because  they  are  truncated  and  contain  only  part  of  the  protein-coding  sequence.
0	One  of  the  goals  of  the  Berkeley  Drosophila  Genome  Project  is  to  define  experimentally  the  transcribed  portions  of  the  genome  by  producing  a  collection  of  fully  sequenced  cDNAs.  We  have  previously  reported  the  construction  of  cDNA
0	libraries  from  a  variety  of  tissues  and  developmental  stages;  these  libraries  were  used  to  generate  over  250,000  expressed  sequence  tags  (ESTs),  corresponding  to  approximately  70%  of  the  predicted  protein-coding  genes  in  the  Drosophila  melanogaster  genome  [1,2].  We  used  computational  analysis
0	Genome  Biology
0	of  these  ESTs  to  establish  a  collection  of  putative  full-length  cDNA  clones,  the  Drosophila  Gene  Collection  (DGC)  [1,2].  Here,  we  describe  the  process  by  which  we  sequenced  the  full  inserts  of  8,921  cDNA  clones  from  the  DGC,  describe  the  methods  by  which  we  assess  each  clone's  likelihood  of  containing  a  complete  and  accurate  protein-coding  region,  and  illustrate  how  these  data  can  be  used  to  uncover  additional  cases  of  RNA  editing.  We  have  confirmed  the  identification  of  5,375  cDNA  clones  that  can  be  used  with  confidence  for  protein  expression  or  genetic  complementation.
0	Results  and  discussion
0	Sequencing  strategy
0	Current  approaches  to  full-insert  sequencing  of  cDNA  clones  include  concatenated  cDNA  sequencing  [3],  primer  walking  [4],  and  strategies  using  transposon  insertion  to  create  priming  sites  [5-9].  We  adopted  a  cDNA  sequencing  strategy  that  relies  on  an  in  vitro  transposon  insertion  system  based  on  the  MuA  transposase,  combined  with  primer  walking  (see  Materials  and  methods  for  details).  The  production  of  full-insert  sequences  from  DGC  cDNAs  is  summarized  in  Tables  1  and  2.  For  DGCr1,  clones  were  sized  before  sequencing.  Small  clones  (<  1.4  kilobases  (kb))  were  sequenced  with  custom  primers  and  larger  clones  were  sequenced  using  either  mapped  or  unmapped  transposon  insertions.  For  DGCr2,  clones  were  not  sized  and  a  set  of  unmapped  transposon  insertions  was  sequenced  to  generate  an  average  of  5x  sequence  coverage.  For  both  DGCr1  and  r2,  custom  oligonucleotide  primers  designed  using  Autofinish  [10]  were  used  to  bring  the  sequences  to  high  quality.  To  date,  we  have  completed  sequencing  93%  of  the  DGCr1  clone  set  and  80%  of  the  DGCr2  clone  set.  The  strategy  used  for  sequencing  DGCr1  clones  appears  to  be  more  efficient,  because  on  average  they  required  fewer  sequencing  reads  than  DGCr2  clones.  However,  we  were  able  to  reduce  cycle  time  and  increase  throughput  using  the  shotgun  strategy  adopted  for  sequencing  the  DGCr2  clones.  The  average  insert  size  of  the  8,770  high-quality  cDNA  sequences  that  have  been  submitted  to  GenBank  is  2  kb  and  they  total  17.5  megabases  (Mb)  of  sequence.  The  largest  clone  (SD01389)  is  8.7  kb  and  is  derived  from  a  gene  (CG10011)  that  encodes  a  2,119-amino-acid  ankyrin  repeat-containing  protein.
0	Evaluating  the  coding  potential  of  each  cDNA  on  the  basis  of  its  full-insert  sequence
0	For  many  potential  uses  in  proteomics  and  functional  genomics  [11-13],  it  is  important  to  establish  cDNA  collections  comprised  only  of  cDNAs  with  complete  and  uncorrupted  open  reading  frames  (ORFs).  To  determine  which  of  our  sequenced  clones  meet  this  standard,  we  compared  them  to  the  annotated  Release  3  genome  sequence  [14,15]  using  a  combination  of  BLAST  [16]  and  Sim4  [17]  alignments  (see  Materials  and  methods  for  details).
0	We  grouped  the  cDNAs  into  four  categories  (Table  3).  The  first  category  contains  a  total  of  5,916  cDNA  clones,  or  68%  of  the  sequenced  clones.  We  are  confident  that  5,375  of  these  clones  contain  a  complete  and  accurate  ORF,  as  they  precisely  match  the  Release  3  predicted  protein  for  the  corresponding  gene.  An  additional  541  clones  are  from  the  SD,  GM  and  AT  libraries,  which  were  generated  from  fly  strains  that  are  not  isogenic  with  the  strain  used  to  produce  the  genome  sequence.  The  predicted  ORFs  from  clones  from  these  libraries  were  required  to  be  identical  in  length  to  the  Release  3  predicted  protein  with  less  than  2%  amino-acid  difference  to  be  placed  in  this  category.  We  cannot  at  present  distinguish  whether  these  differences  result  from  strain  polymorphisms  or  reverse  transcriptase  (RT)  errors.  However,  our  own  internal  estimates  of  RT  errors  (see  below),  based  on  the  observed  nucleotide  substitution  rate  in  cDNAs  derived  from  the  same  strain  as  the  genomic
0	Table 3 cDNA analysis DGCr1 Clones that encode complete ORFs ORFs identical to the Release 3 predicte
0	Donor/Acceptor  Interactions  in  Systematically  Modified  RuII-OsII  Oligonucleotides
1	Dennis  J.  Hurley  and  Yitzhak  Tor*
0	Abstract: Donor/acceptor (D/A) interactions are studied in a series of doubly modified 19-mer DNA duplexes. An ethynyl-linked RuII donor nucleoside is maintained at the 5′ terminus of each duplex, while an ethynyl-linked OsII nucleoside, placed on the complementary strands, is systematically moved toward the other terminus in three base pair increments. The steady-state RuII-based luminescence quenching decreases from 90% at the shortest separation of 16 Å (3 base pairs) to 11% at the largest separation of 61 Å (18 base pairs). Time-resolved experiments show a similar trend for the RuII excited-state lifetime, and the decrease in the averaged excited-state lifetime for each duplex is linearly correlated with the fraction quenched obtained by steady-state measurements. Analysis according to the Förster dipole-dipole energy transfer mechanism shows a reasonable agreement. Deviation from idealized behavior is primarily attributed to uncertainty in the orientation factor, κ². Analyzing D/A interactions in an analogous series of doubly modified oligonucleotides, where the ethynyl-linked RuII center is replaced with a saturated two-carbon linked complex, yields an excellent correlation with the Förster mechanism. As this simple change partially relaxes the rigid geometry of the donor chromophore, these results suggest that the deviation from idealized Förster behavior observed for the duplexes containing the rigidly held RuII center originates, at least partially, from ambiguities in the orientation factor. Surprisingly, analyzing both quenching data sets according to the Dexter mechanism also shows an excellent correlation. Although this can be interpreted as strong evidence for a Dexter triplet energy transfer mechanism, it does not imply that this electron exchange mechanism is operative in these D/A duplexes. Rather, it suggests that systems that transfer energy via the Förster mechanism can under certain circumstances exhibit Dexter-like "behavior", thus illustrating the danger of imposing a single physical model to describe D/A interactions in such complex systems. While we conclude that the Förster dipole-dipole energy transfer mechanism is the dominant pathway for D/A interactions in these modified oligonucleotides, a minor contribution from the Dexter electron exchange mechanism at short distances is likely. This complex behavior distinguishes DNA-bridged RuII/OsII dyads from their corresponding low molecular-weight and covalently attached counterparts.
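The two distance dependences compared in the abstract can be written down in a few lines. The sketch below uses the textbook forms of the Förster efficiency and of a Dexter-type exponential rate; the function names and the parameters `r0`, `k0` and `L` are illustrative assumptions and do not come from the measurements reported here.

```python
import math

def forster_efficiency(r, r0):
    """Foerster transfer efficiency E = 1 / (1 + (r / r0)**6).

    r  : donor-acceptor separation (e.g. in angstroms)
    r0 : Foerster radius in the same units (hypothetical value in examples)
    """
    return 1.0 / (1.0 + (r / r0) ** 6)

def dexter_rate(r, k0, L):
    """Dexter-type exchange rate k = k0 * exp(-2 r / L).

    k0 : prefactor lumping the spectral-overlap terms (assumed constant)
    L  : effective orbital radius sum, same units as r
    """
    return k0 * math.exp(-2.0 * r / L)

def efficiency_from_lifetime(tau, tau0):
    """Fraction quenched from donor lifetimes with (tau) and without (tau0)
    the acceptor: E = 1 - tau / tau0."""
    return 1.0 - tau / tau0
```

The sixth-power law falls off more gently at long range than the exponential, yet over a limited window of separations (16 to 61 Å here) both forms can track the same data reasonably well, which is consistent with the caution expressed above about imposing a single model.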
0	The DNA double helix has been shown to be an intriguing medium for exploring charge transfer phenomena.1 The intricacies of these processes have widely been probed using photoactive and redox-active transition metal coordination compounds.2 Much less attention has been given, however, to energy transfer processes in similarly metal-modified DNA oligonucleotides. The relatively complex excited-state manifold of polypyridine RuII and OsII compounds can be engaged in multiple relaxation mechanisms, including dipole-dipole (Förster) and electron exchange (Dexter) energy transfer processes (Figure 1).3,4 In simple heteronuclear RuII-OsII dyads, the mode of the
0	10.1021/ja020172r  CCC:  $22.00  ©  2002  American  Chemical  Society
0	Hurley  and  Tor
0	pend on Hec1 may signal checkpoint activation through diffusible Mad2 complexes. In Hec1-depleted cells, this activity could be generated through CENP-E or BubR1. Because kinetochores were not stretched in Hec1-depleted cells (30), it is plausible that persistent checkpoint activity was caused by lack of tension. Injection of antibodies to Hec1 into bladder carcinoma cells was reported to cause aberrant mitotic progression and cell death but no checkpoint arrest (23). This result could be explained if these tumor cells were checkpoint-deficient or if the injected antibodies interfered with checkpoint signaling. In Saccharomyces cerevisiae, mutations in the Hec1 homolog Ndc80 caused chromosome segregation defects without activating the checkpoint (24, 26). This may relate to the fact that kinetochores in budding yeast bind only a single MT, whereas those in vertebrate cells capture multiple MTs (8, 9). Furthermore, kinetochore-MT interactions and checkpoint signaling in vertebrates may involve two distinct pathways: one centered on Hec1 interacting with Mad1/Mad2 and the other on CENP-E interacting with CENP-F and BubR1, both pathways converging onto APC/C (35, 36). Yeast has a clear counterpart of Hec1 but lacks an obvious homolog of CENP-E. The human kinetochore protein Hec1 was required, together with Mps1, for recruiting the Mad1/Mad2 complex to kinetochores. Moreover, Hec1-depleted cells displayed persistent spindle checkpoint activity although they lacked significant amounts of Mad1 or Mad2 at kinetochores.
This latter observation contrasts with models emphasizing the importance of high steady-state levels of kinetochore-associated Mad1/Mad2 complexes in checkpoint signaling and instead suggests that some protein that does not depend on Hec1 for kinetochore localization is able to communicate with diffusible Mad2 complexes. Many tumor cells are thought to be defective in the spindle checkpoint (37). Any interference with Hec1 function in checkpoint-deficient cells, be it through siRNA or other specific inhibitors, is predicted to result in mitotic catastrophe, thereby causing the demise of most progeny. In contrast, normal checkpoint-proficient cells may arrest transiently in response to reversible Hec1 inhibition. Thus, Hec1 may be an attractive target for therapeutic intervention in cancer and other hyperproliferative diseases.
0	Gene  Expression  During  the  Life  Cycle  of  Drosophila  melanogaster
0	Molecular  genetic  studies  of  Drosophila  melanogaster  have  led  to  profound  advances  in  understanding  the  regulation  of  development.  Here  we  report  gene  expression  patterns  for  nearly  one-third  of  all  Drosophila  genes  during  a  complete  time  course  of  development.  Mutations  that  eliminate  eye  or  germline  tissue  were  used  to  further  analyze  tissue-specific  gene  expression  programs.  These  studies  define  major  characteristics  of  the  transcriptional  programs  that  underlie  the  life  cycle,  compare  development  in  males  and  females,  and  show  that  large-scale  gene  expression  data  collected  from  whole  animals  can  be  used  to  identify  genes  expressed  in  particular  tissues  and  organs  or  genes  involved  in  specific  biological  and  biochemical  processes.  Molecular  studies  of  development  in  multicellular  organisms  have  gone  through  two  major  phases  during  the  past  three  decades.  Initially,  solution  hybridization  studies  quantitated  transcript  abundance  and  showed  that  large-scale  changes  in  gene  expression  accompany  development  (1).  In  Drosophila,  such  studies  suggested  that  5000  to  7000  different  polyadenylated  RNA  species  are  produced  at  each  stage  of  the  life  cycle  and  that  the  composition  of  this  set  of  RNAs  shifted  during  development  (1).  These  analyses  gave  an  overview  of  genome  activity  during  development,  but  they  could  not  follow  the  expression  of  individual  genes  or  reveal  their  identities.  Later,  when  it  became  possible  to  clone  individual  genes  (2,  3),  RNA  blots  and  in  situ  hybridization  revealed  when  and  where  individual  genes  were  active.  This  second  phase  of  analysis  allowed
0	an initial determination of the links between molecules and developmental functions. This gene-by-gene approach has dominated developmental biology for the past two decades. DNA microarrays extend the single-gene approach to the genome level by measuring the transcript levels of thousands of genes simultaneously (4-6). Here we present the transcriptional profiles for about one-third of all predicted Drosophila genes (7) throughout the life cycle, from fertilization to aging adults. cDNA microarrays were used to analyze the RNA expression levels of 4028 genes in wild-type flies examined during 66 sequential time periods beginning at fertilization and spanning the embryonic, larval, and pupal periods and the first 30 days of adulthood, when males and females were sampled separately (Fig. 1A). Early embryos change rapidly, so overlapping 1-hour periods were sampled; adults were sampled at multiday intervals (Fig. 1A) (8). We compared each experimental sample to a common reference sample made from pooled mRNA representing all stages of the life cycle, allowing us to measure each transcript's relative abundance (8). We refer to this relative abundance at each time as a gene's transcript or expression level, and to each gene's overall pattern of expression during development as its transcript or expression profile. Expression of most genes assayed (3483 out of 4028, 86%) changed significantly [P < 0.001, analysis of variance (ANOVA)] during the 40-day period surveyed (8). Of these, 3219 genes exhibited at least a fourfold difference between their highest and lowest levels of expression (Fig. 1B and table S1).
The  vast  majority  of  these  developmentally  modulated  genes  (  88%)  are  expressed  during  the  first  20  hours  of  development,  before  the  end  of  embryogenesis  (Fig.  1,  B  and  C).  To  identify  patterns  of  gene  reexpression  during  development,  we  applied  a  peak-finding  algorithm  (8)  to  each  gene's  expression  profile.  We  found  that  36.3%  of  the  genes  (1169  genes)  showed  a  single  major  peak  of  expression  (Fig.  1D,  left  panels),  whereas  40.3%  (1298)  showed  two  peaks  (Fig.  1D,  right  panels)  and  23.4%  (752)  showed  three  or  more  peaks  (fig.  S1  and  tables  S2  to  S6).  Many  genes  are  expressed  in  two  waves
0	BMC  Genomics
0	Methodology  article
0	BioMed  Central
0	Open  Access
0	Utilization  of  a  labeled  tracking  oligonucleotide  for  visualization  and  quality  control  of  spotted  70-mer  arrays
1	Martin  J  Hessner*1,2,  Vineet  K  Singh3,  Xujing  Wang1,2,  Shehnaz  Khan2,  Michael  R  Tschannen2  and  Thomas  C  Zahrt3
0	Hessner  et  al;  licensee  BioMed  Central  Ltd.  This  is  an  Open  Access  article:  verbatim  copying  and  redistribution  of  this  article  are  permitted  in  all  media  for  any  purpose,  provided  this  notice  is  preserved  along  with  the  article's  original  URL.
0	Spotted oligonucleotide arrays, 70-mers, gene expression analysis
0	Background: Spotted 70-mer oligonucleotide arrays offer potentially greater specificity and an alternative to expensive cDNA library maintenance and amplification. Since microarray fabrication is a considerable source of data variance, we previously directly tagged cDNA probes with a third fluorophore for prehybridization quality control. Fluorescently modifying oligonucleotide sets is cost prohibitive; therefore, a co-spotted Staphylococcus aureus-specific fluorescein-labeled "tracking" oligonucleotide is described to monitor fabrication variables of a Mycobacterium tuberculosis oligonucleotide microarray. Results: Significantly (p < 0.01) improved DNA retention was achieved printing in 15% DMSO/1.5 M betaine compared to the vendor-recommended buffers. Introduction of tracking oligonucleotide did not affect hybridization efficiency or introduce ratio measurement bias in hybridizations between M. tuberculosis H37Rv and M. tuberculosis mprA. Linearity between the mean log Cy3/Cy5 ratios of genes differentially expressed from arrays either possessing or lacking the tracking oligonucleotide was observed ($R^2$ = 0.90, p < 0.05) and there were no significant differences in Pearson's correlation coefficients of ratio data between replicates possessing (0.72 ± 0.07), replicates lacking (0.74 ± 0.10), or replicates with and without (0.70 ± 0.04) the tracking oligonucleotide. ANOVA confirmed that the tracking oligonucleotide introduced no bias.
Titrating target-specific oligonucleotide (40 µM to 0.78 µM) in the presence of 0.5 µM tracking oligonucleotide revealed a fluorescein fluorescence inversely related to target-specific oligonucleotide molarity, making tracking oligonucleotide signal useful for quality control measurements and for differentiating false negatives (synthesis failures and mechanical misses) from true negatives (no gene expression). Conclusions: This novel approach enables prehybridization array visualization for spotted oligonucleotide arrays and sets the stage for more sophisticated slide qualification and data filtering applications.
0	BMC  Genomics  2004,  5
0	variable DNA probe deposition and retention on the solid support surfaces. To minimize variations using this fabrication platform, a number of approaches have been described that allow direct visualization of array integrity following printing and blocking procedures. Commonly used methods include the staining of microarrays with DNA-binding fluorescent dyes, or the hybridization of "universal" targets (i.e. random 9-mers) to the spotted DNA elements [15,16]. While these techniques provide useful information regarding the physical characteristics of the array, its integrity may be compromised during subsequent de-staining or stripping procedures required prior to hybridization of labeled targets [16]. Consequently, investigators typically only examine one or a few representative slides to assess the quality of a printed batch. Previously, we have reported the development and use of a novel three-color cDNA array platform that allows immobilized probes to be directly visualized [17-19]. Utilizing this format, oligonucleotide primers used to amplify cDNA targets are labeled at their 5' end with fluorescein, a dye compatible with commonly used cyanine labeling dyes and confocal laser scanners possessing narrow bandwidths [18,20]. Element/array morphology, surface DNA deposition/retention, and surface background can be monitored on each slide. Thus, in our laboratory, all cDNA arrays are imaged for quality control prior to hybridization, maximizing the use of quality arrays for subsequent experimental procedures. It is likely that many or all of the benefits of using a directly coupled fluorophore are also applicable to oligonucleotide-based microarrays; however, synthesis costs make this approach unfeasible.
In  this  report,  we  describe  the  use  and  evaluation  of  a  Staphylococcus  aureus-specific  fluorescein-labeled  70-mer  "tracking"  oligonucleotide  as  a  third-color  quality  control  measure  of  a  Mycobacterium  tuberculosis-specific  oligonucleotide-based  microarray.
0	Results  and  Discussion
0	Variation  in  gene  expression  within  and  among  natural  populations
1	Marjorie  F.  Oleksiak1,  Gary  A.  Churchill2  &  Douglas  L.  Crawford1
0	Evolution  may  depend  more  strongly  on  variation  in  gene  expression  than  on  differences  between  variant  forms  of  proteins1.  Regions  of  DNA  that  affect  gene  expression  are  highly  variable,  containing  0.6%  polymorphic  sites2.  These  naturally  occurring  polymorphic  nucleotides  can  alter  in  vivo  transcription  rates3-7.  Thus,  one  might  expect  substantial  variation  in  gene  expression  between  individuals.  But  the  natural  variation  in  mRNA  expression  for  a  large  number  of  genes  has  not  been  measured.  Here  we  report  microarray  studies  addressing  the  variation  in  gene  expression  within  and  between  natural  populations  of  teleost  fish  of  the  genus  Fundulus.  We  observed  statistically  significant  differences  in  expression  between  individuals  within  the  same  population  for  approximately  18%  of  907  genes.  Expression  typically  differed  by  a  factor  of  1.5,  and  often  more  than  2.0.  Differences  between  populations  increased  the  variation.  Much  of  the  variation  between  populations  was  a  positive  function  of  the  variation  within  populations  and  thus  is  most  parsimoniously  described  as  random.  Some  genes  showed  unexpected  patterns  of  expression--  changes  unrelated  to  evolutionary  distance.  These  data  suggest  that  substantial  natural  variation  exists  in  gene  expression  and  that  this  quantitative  variation  is  important  in  evolution.
0	each of 907 genes. The loop design is substantially different from the most commonly used `reference microarray' design, in which each RNA sample of interest is used to probe the same reference sample and all values are expressed as ratios of the sample signal to the reference signal. We proposed to answer two questions. First, what proportion of genes are differentially expressed between individuals within the same population? Second, how many genes are differentially expressed between populations? To address these questions, we applied ANOVA methods to the $\log_e$-normalized data18. Unlike most microarray strategies (but similar to one previous study19), ours did not depend on assessing ratios of fluorescent signals, whereby only large differences can be detected. Instead, we investigated which genes showed statistically significant variations in expression. The expression levels of 161 genes (18%) were significantly different between individuals within the same population at the nominal P value of 0.01 (Fig. 2), as determined using standard statistical tables or permutation analyses within each gene. This number of significant genes is 18 times larger than the nine false positives expected under the null hypothesis when P = 0.01. To provide tighter control of type I errors (falsely rejecting the null hypothesis), we considered applying a multiple-testing adjustment to these tests20. Experiment-wide control of type I error at the 5% level corresponds to an individual test P value of $6 \times 10^{-5}$. Only 37 of the 161 genes showed significant differences in expression between individuals at this level of stringency, which may be overly conservative.
We chose to use the significance level of P = 0.01 and accept a greater type I error in our analyses.
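The arithmetic behind these thresholds can be checked with a short sketch. It assumes that the experiment-wide adjustment is a simple Bonferroni division of the 5% level by the number of genes tested; the text does not name the adjustment, although this assumption does reproduce the quoted per-test value. The function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Number of tests expected to pass by chance when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha):
    """Per-test P value giving experiment-wide type I error control (Bonferroni assumption)."""
    return family_alpha / n_tests


N_GENES = 907        # genes on the array, from the text
N_SIGNIFICANT = 161  # genes significant at P = 0.01, from the text

false_positives = expected_false_positives(N_GENES, 0.01)
print(f"expected false positives at P = 0.01: {false_positives:.1f}")
print(f"observed / expected: {N_SIGNIFICANT / false_positives:.1f}")
print(f"Bonferroni per-test threshold at 5%: {bonferroni_threshold(N_GENES, 0.05):.1e}")
```

With 907 genes, about nine false positives are expected at P = 0.01, and the Bonferroni threshold rounds to the per-test value of $6 \times 10^{-5}$ given above.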
0	The  proportion  (18%)  of  loci  differing  significantly  in  expression  between  individuals  within  the  same  population  is  similar  to  the  percentage  of  loci  that  differ  significantly  in  expression  between  different  strains  of  yeast21  (24%)  and  the  percentage  of  loci  that  show  non-zero  variance  in  Drosophila  melanogaster19  (25%),  as  determined  by  previous  studies.  These  studies  by  necessity  used  pooled  samples,  and  thus  could  not  measure  variation  in  expression  between  individuals  in  natural  populations.  In  humans  there  is  a  large  variation  in  gene  expression  between  individuals;  in  a  global  comparison  of  mRNA  levels  of  chimpanzees  and  humans,  there  was  greater  variation  within  the  human  population  than  between  human  and  chimpanzee  populations22.  These  results  support  our  finding  of  large  variation  in  gene  expression  between  individuals  and  emphasize  the  importance  of  examining  individual  variation.  An  ANOVA  analysis  calculates  significance  using  an  F  statistic,  and  significant  F  values  require  that  the  variation  between  samples  is  significantly  larger  than  the  residual  variation  within  samples20.  Thus,  finding  significant  differences  between  individuals  requires  that  the  variation  between  individuals  be  larger  than  experimental  variation  (for  example,  variation  due  to  printing,  hybridization,  array  differences  and  other  factors).  One  measure  of  the  experimental  variation  is  the  coefficient  of  variation  (c.v.)  of  gene  expression  for  each  individual  among  the  eight  replicates,  which  equals  the  standard  deviation  divided  by  the  mean,  expressed  as  a  percentage.  Nearly  all  (99%)  of  the  genes  for  each  individual  had  a  c.v.error  of  less  than  15%  (Fig.  2).  
The  statistical  significance  of  the  differences  in  expression  of  161  genes  depended  on  this  small  experimental  error.  We  minimized  experimental  error  by  using  eight  replicate  measures  per  individual  for  each  gene  and  using  normalized  data  rather  than  the  ratio  typically  used  in  a  reference  design.  Ratios  of  two  values,  each  having  its  own  variation,  have  larger  experimental  variation20.  Not  surprisingly,  genes  for  which  there  was  little  experimental  variation  (low  c.v.error  values)  showed  the  greatest  significant  differences  in  expression  between  individuals  within  the  same  population,  and  genes  with  large  experimental  variation  values  did  not  differ  significantly  (Fig.  2).
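The coefficient of variation described above (standard deviation divided by the mean, as a percentage) can be sketched as follows. The replicate values are invented for illustration, and the use of the sample standard deviation is an assumption; the text does not say which estimator was used.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage.

    Assumption: the sample (n - 1) standard deviation is used.
    """
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100.0 * statistics.stdev(values) / mean


# Hypothetical normalized signals for one gene in one individual (eight replicates).
replicates = [98.0, 103.0, 101.0, 95.0, 107.0, 99.0, 102.0, 96.0]
cv = coefficient_of_variation(replicates)
print(f"c.v. = {cv:.1f}% (below the 15% level quoted in the text: {cv < 15.0})")
```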
0	Regulation  of  noise  in  the  expression  of  a  single  gene
1	Ertugrul  M.  Ozbudak1,  Mukund  Thattai1,  Iren  Kurtser2,  Alan  D.  Grossman2  &  Alexander  van  Oudenaarden1
0	Nature  Publishing  Group  http://genetics.nature.com
0	Stochastic mechanisms are ubiquitous in biological systems. Biochemical reactions that involve small numbers of molecules are intrinsically noisy, being dominated by large concentration fluctuations1-3. This intrinsic noise has been implicated in the random lysis/lysogeny decision of bacteriophage-λ4, in the loss of synchrony of circadian clocks5,6 and in the decrease of precision of cell signals7. We sought to quantitatively investigate the extent to which the occurrence of molecular fluctuations within single cells (biochemical noise) could explain the variation of gene expression levels between cells in a genetically identical population (phenotypic noise). We have isolated the biochemical contribution to phenotypic noise from that of other noise sources by carrying out a series of differential measurements. We varied independently the rates of transcription and translation of a single fluorescent reporter gene in the chromosome of Bacillus subtilis, and we quantitatively measured the resulting changes in the phenotypic noise characteristics. We report that of these two parameters, increased translational efficiency is the predominant source of increased phenotypic noise. This effect is consistent with a stochastic model of gene expression in which proteins are produced in random and sharp bursts. Our results thus provide the first direct experimental evidence of the biochemical origin of phenotypic noise, demonstrating that the level of phenotypic variation in an isogenic population can be regulated by genetic parameters.
0	We selected as our reporter system a single-copy chromosomal gene with an inducible promoter. As an estimated 50-80% of bacterial genes are transcriptionally regulated8, this system typifies the majority of naturally occurring genes, allowing our results to be extended to natural systems. We incorporated a single copy of our reporter, the green fluorescent protein gene (gfp), into the chromosome of B. subtilis. We chose to integrate gfp into the chromosome itself, rather than in the form of plasmids, as variation in plasmid copy number9,10 can act as an additional and unwanted source of noise. Transcriptional efficiency was regulated by using an isopropyl-β-D-thiogalactopyranoside (IPTG)-inducible promoter, Pspac, upstream of gfp, and varying the concentration of IPTG in the growth medium. Translational
0	Table 1 · Translational mutants: point mutations in the RBS and initiation codon of gfp
Strain   Ribosome binding site          Initiation codon   Translational efficiency
ERT25    GGG AAA AGG AGG TGA ACT ACT    ATG                1.00
ERT27    GGG AAA AGG AGG TGA ACT ACT    TTG                0.87
ERT3     GGG AAA AGG TGG TGA ACT ACT    ATG                0.84
ERT29    GGG AAA AGG AGG TGA ACT ACT    GTG                0.66
0	efficiency was regulated by constructing a series of B. subtilis strains (Table 1) that contained point mutations in the ribosome binding site (RBS) and initiation codon of gfp11. The use of two different strategies to regulate transcriptional and translational processes introduces a potential bias in the relative contributions of these processes to biochemical noise. As a control, we constructed four additional strains (Table 2) whose transcription rates were altered by mutations in the promoter region of the reporter gene. As described below, both strategies of transcriptional regulation produced similar results. We measured expression of green fluorescent protein (GFP) for single cells in a bacterial population using flow cytometry. Variation in GFP expression from cell to cell (phenotypic noise) is seen in a histogram (Fig. 1a) showing the protein expression levels (p) measured during a typical experiment. The histogram is characterized by a mean value $\langle p \rangle$ and a standard deviation $\sigma_p$. The phenotypic noise strength, defined as the quantity $\sigma_p^2/\langle p \rangle$ (variance/mean), is sensitive to the biochemical sources of stochasticity that we wished to study and is therefore the unit in which we report our results. We measured phenotypic noise strength for the four different translational strains as we varied IPTG concentration between 30 µM (near-basal transcription) and 1 mM (full operon induction). For example, Fig. 1b shows flow cytometer results for the four strains at full induction, whereas Fig. 1c shows the results from a series of flow cytometer experiments conducted on a single strain (ERT3) as IPTG concentration was varied. A summary of all of our experimental results (Fig. 2a) shows the measured noise strength as a simultaneous function of both transcriptional efficiency (varying [IPTG] in the growth medium) and translational efficiency (using different strains with mutations in the RBS and initiation codon). As the addition of IPTG and mutations in the gfp RBS are not expected to affect normal cellular processes, all contributions to phenotypic noise remained unchanged throughout our experiment, except fluctuations in rates of transcription and translation. The response of phenotypic noise strength to a change in either translational efficiency (Fig. 2b) or transcriptional efficiency (Fig. 2c) indicates the isolated contribution of that parameter to the phenotypic noise.
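The noise strength used throughout this section is the variance of the single-cell expression levels divided by their mean. A minimal sketch of that calculation is given below. The function name and the fluorescence values are hypothetical, and the use of the population variance is an assumption; the text does not specify how the variance was estimated from the flow cytometry histograms.

```python
import statistics


def noise_strength(levels):
    """Variance divided by mean of single-cell expression levels.

    Assumption: the population variance is used, treating the measured
    cells as the full distribution of interest.
    """
    mean = statistics.fmean(levels)
    if mean == 0:
        raise ValueError("noise strength is undefined for a zero mean")
    return statistics.pvariance(levels) / mean


# Hypothetical per-cell fluorescence readings (arbitrary units).
cells = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(f"noise strength = {noise_strength(cells):.2f} fluorescence units")
```

Because the ratio carries the units of the measurement, values obtained this way are comparable only between samples recorded on the same fluorescence scale, which is why the comparisons in the text are made between strains and induction levels within one experiment.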
0	Table 2 · Transcriptional mutants: point mutations in the Pspac promoter
Strain   -10 regulatory region (-10 ... +1)   Transcriptional efficiency
ERT57    CAT AAT GTG TGT AAT                  6.63
ERT25    CAT AAT GTG TGG AAT                  1.00
ERT53    CAT AAT GTG TGC AAT                  0.79
ERT51    CAT AAT GTG TGA AAT                  0.76
ERT55    CAT AAT GTG TAA AAT                  0.76
0	[Figure residue: axis labels "number of cells", "p (fluorescence units)", "$\langle p \rangle$ (fluorescence units)" and "$\sigma_p^2/\langle p \rangle$ (fluorescence units)"; panel labels [IPTG] = 30 µM, 75 µM and 1 mM.]
0	We  find  that  the  phenotypic  noise  strength  shows  a  strong  positive  correlation  with  translational  efficiency  (Fig.  2b,  slope=21.8),  in  contrast  to  the  weak  positive  correlation  observed  for  transcriptional  efficiency  (Fig.  2c,  slope=6.5).  Switching  from  the  ERT27  strain  to  the  ERT25  strain  (an  increase  in  translational  efficiency  of  about  15%;  Table  1)  increases  the  noise  strength  from  32  to  35  units;  the  same  effect  is  achieved  only  upon  doubling  transcriptional  efficiency  (a  100%  increase)  from  the  half-induction  to  the  full-induction  level.  Experiments  involving  the  control  strains,  in  which  transcription  rates  were  altered  by  mutation  rather  than  by  operon  induction,  supported  the  weak  correlation  between  noise  strength  and  transcriptional  efficiency  (Fig.  2c  inset,  slope=7.3).  The  differential  nature  of  our  measurements  (investigating  changes  rather  than  absolute  values)  makes  our  results  independent  of  the  specific  properties  of  the  reporter  protein,  such  as  gene  locus  or  folding  characteristics.  This  suggests  that
0	increased translational efficiency will strongly increase the variation in the expression of any naturally occurring gene. A stochastic model for the expression of a single gene (Fig. 3a) predicts that the noise strength ($\sigma_p^2/\langle p \rangle$) is greater than Poissonian ($\sigma_p^2/\langle p \rangle = 1$) and is simply an increasing function of translational efficiency12:
$$\frac{\sigma_p^2}{\langle p \rangle} \cong 1 + b$$
0	Here, $b = k_P/\gamma_R$ is the average number of proteins synthesized per mRNA transcript; these proteins are injected into the cytoplasm in sharp bursts (Fig. 3b). The measured asymmetry between the noise contributions of transcriptional and translational parameters is consistent with this prediction and provides evidence of
0	[Figure residue: noise strength $\sigma_p^2/\langle p \rangle$ (fluorescence units) plotted against translational efficiency and transcriptional efficiency.]
0	Fundamentals  of  experimental  design  for  cDNA  microarrays
1	Gary  A.  Churchill
0	Sources  of  variation  in  microarray  experiments  The  design  of  a  two-color  microarray  experiment  can  be  considered  as  having  three  layers.  Figure  1  shows  an  example  of  an  experiment  that  compares  the  effects  of  two  treatments--A  and  B--on  gene-expression  profiles  in  a  mouse  tissue.  At  the  top  layer  of  the  experiment  are  the  experimental  units,  the  two  mice  to  whom  each  treatment  is  applied.  The  term  `treatment'  pertains  to  any  attribute,  such  as  the  sex  or  strain  of  the  organism,  of  primary  interest  in  the  experiment.  The  mice  were  selected  to  be  representative  of  a  population  of  mice  and,  if  possible,  the  treatment  should  be  assigned  using  a  randomizing  device  such  as  a  coin  toss.  Assigning  at  least  two  mice  to  each  treatment  group  ensures  that  there  is  biological  replication  in  the  experiment.  In  the  middle  layer,  two  RNA  samples  are  obtained  from  each  mouse.  These  technical  replicates  may  be  two  independent  RNA  extractions  or  two  aliquots  of  the  same  extraction.  The  RNA  samples  are  assigned  to  two  different  dye  labels,  indicated  by  the  red  and  green  test  tubes.  They  are  then  paired  (one  red  and  one  green)  and  mixed  for  co-hybridization  on  microarray  slides.  The  bottom  layer  of  the  experiment  involves  the  arrangement  of  array  elements  on  the  slides.  In  this  example,  duplicate  spots  of  each  cDNA  clone  have  been  printed  side  by  side.  The  many  sources  of  variation  in  a  microarray  experiment  can  be  partitioned  along  these  three  layers.  Biological  variation  (top  layer)  is  intrinsic  to  all  organisms;  it  may  be  influenced  by  genetic  or  environmental  factors,  as  well  as  by  whether  the  samples  are  pooled  or  individual.  
Technical variation (middle layer) is introduced during the extraction, labeling and hybridization of samples. Measurement error (bottom layer) is associated with reading the fluorescent signals, which may be affected by factors such as dust on the array. Valid statistical tests for differential expression of a gene across the samples can be constructed on the basis of any of these variance components, but there are important distinctions in how the different types of tests should be interpreted. If we are interested in determining how the treatments affect different biological populations represented in our samples, statistical tests should be based on the biological variance. If our interest is to detect variations within treatment groups, the tests should be based on technical variation. For example, Oleksiak et al.1 employed both types of tests to look at variation between and within natural populations. Tests
0	based  on  measurement  error  variance  can  also  be  constructed  but  are  of  limited  utility2.  For  most  questions  of  interest,  the  higher  two  levels  of  variation  are  appropriate  for  constructing  tests,  and  hence  good  designs  should  incorporate  replication  at  the  higher  layers.
0	Experimental  units  and  treatments  The  correlation  observed  between  ratios  of  fluorescent  intensity  from  duplicate  spots  on  a  single  microarray  slide  will  typically  exceed  95%.  This  is  often  interpreted  as  a  demonstration  that  microarray  assays  are  reproducible.  However,  if  the  same  target  sample  is  divided  and  hybridized  to  two  different  microarray  slides,  the  correlation  across  hybridizations  is  likely  to  fall  to  the  60  to  80%  range,  somewhat  lower  if  the  dye  labeling  is  reversed.  Correlations  between  samples  obtained  from  individual  inbred  mice  may  be  as  low  as  30%.  If  the  experiments  are  carried  out  in  different  laboratories,  the  correlations  may  be  lower  still.  These  decreasing  correlations  reflect  the  cumulative  contributions  of  multiple  sources  of  variation.  It  is  tempting  to  avoid  biological  replication  in  an  experiment  because  results  will  appear  to  be  more  reproducible.  The  apparent  increase  in  statistical  power  is  illusory,  however,  and  significant  findings  may  simply  reflect  chance  fluctuations  in  the  particular  animals  chosen  for  the  experiment.  In  general,  it  is  appropriate  to  take  steps  to  vary  the  conditions  of  the  experiment--for  example,  by  assaying  multiple  animals--to  ensure  that  the  effects  that  do  achieve  statistical  significance  are  real  and  will  be  reproducible  in  different  settings3.  Identifying  the  independent  units  in  an  experiment  is  a  prerequisite  for  a  proper  statistical  analysis,  as  any  hidden  correlations  in  the  data  can  lead  to  bias  and  inflated  levels  of  statistical  significance.  Statistical  independence  is  a  relative  concept.  
For example, hybridizations of the same target sample to multiple slides may be viewed as independent replicates if the intent is to characterize that sample accurately. However, in an experiment where the question of interest concerns a biological comparison at the whole-organism level (for example, a comparison of gene-expression profiles between genetically altered and control animals), the technical replicates from any one sample may no longer be regarded as independent. Details of how individual animals and samples were handled throughout the course of an experiment can be important to
0	Allocating resources in a microarray experiment
0	The precision of estimated quantities depends on the variability of the experimental material, the number of experimental units, the number of repeated observations per unit and the accuracy of the primary measurements4. The basis for drawing inferential conclusions is the residual error (or mean squared error, MSE), which quantifies the precision of estimates and thus allows one to determine whether estimated quantities are significantly different in the statistical sense. In a microarray experiment, the residual error can be decomposed into three components of variance corresponding to the three layers of the design (Fig. 1). The first component is the intrinsic variation of the biological units within a treatment group, which we will denote by $\sigma_B^2$.
0	are  multiple  treatment  factors).  If  there  are  no  degrees  of  freedom  left,  there  may  be  no  information  available  to  estimate  the  biological  variance,  the  statistical  tests  will  rely  on  technical  variance  alone,  and  the  scope  of  the  conclusions  will  be  limited  to  the  samples  in  hand.  If  there  are  5  df  or  more,  you  are  in  good  shape  (see  Box).  In  some  circumstances,  a  large  number  of  experimental  units  may  be  available,  perhaps  more  than  can  be  measured  individually,  in  which  case  we  have  the  option  to  form  pools  of  individual  samples.  In  other  cases,  pooling  may  be  a  necessity  owing  to  the  limited  availability  of  RNA.  Pooling  the  original  experimental  units  creates  new  units,  the  pools.  Pooling  can  reduce  the  biological  component  of  variation,  but  it  cannot  reduce  the  variability  due  to  sample  handling  or  measurement  error.  In  a  two-sample  comparison,  we  could  consider  making  two  large  pools  of  all  available  units  and  measuring  each  pool  multiple  times.  This  is  a  poor  design,  as  it  does  not  allow  estimation  of  the  between-pool  variance.  By  pooling  all  the  available  samples  together  we  have  minimized  the  biological  variance,  but  we  have  also  eliminated  all  independent  replication.  It  is  better  to  use  several  pools  and  fewer  technical  replicates.
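The trade-off between pooling and replication described above can be sketched numerically. The following is a minimal illustration, not a method taken from the source. It assumes that a pool behaves like the average of its members, that every hybridization adds independent technical variance, and that degrees of freedom are counted as independent units (pools) minus treatment groups. The function name `pooled_design` and the variance values are hypothetical.

```python
def pooled_design(individuals_per_group, pools_per_group, tech_reps_per_pool,
                  var_bio, var_tech, n_groups=2):
    """Return (variance of a group mean, df for between-pool variance).

    Assumed model (not from the source): each pool behaves like the average
    of its members, so its biological variance is var_bio / pool_size, and
    every hybridization adds independent technical variance var_tech.
    """
    pool_size = individuals_per_group // pools_per_group
    var_mean = (var_bio / (pool_size * pools_per_group)
                + var_tech / (pools_per_group * tech_reps_per_pool))
    # Degrees of freedom: independent units (pools) minus treatment groups.
    df = n_groups * pools_per_group - n_groups
    return var_mean, df


# Twelve animals and twelve hybridizations per group, arranged two ways.
one_big_pool = pooled_design(12, 1, 12, var_bio=1.0, var_tech=0.5)
four_pools = pooled_design(12, 4, 3, var_bio=1.0, var_tech=0.5)
print("one pool, 12 technical replicates:", one_big_pool)
print("four pools, 3 technical replicates:", four_pools)
```

Under these assumptions both layouts give the same variance for the group mean, but only the design with several pools leaves degrees of freedom for estimating the between-pool variance.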
0	Pairing  samples  for  hybridizations  The
0	The  effect  of  replication  on  gene  expression  microarray  experiments
1	Paul Pavlidis, Qinghong Li and William Stafford Noble
0	Columbia
0	Replication  is  a  straightforward  method  for  improving  the  quality  of  inferences  made  from  experimental  studies.  However,  replication  increases  the  cost  of  experiments  and,  typically,  the  amount  of  material  needed.  In  general,  it  makes  sense  to  do  as  much  replication  as  is  necessary  to  achieve  a  desired  level  of  sensitivity  and  specificity,  but  not  much  more.  This  trade-off  between  cost  and  statistical  power  arises  frequently  in  gene  expression  microarray  experiments.  Replication  is  clearly  necessary  in  this  domain  (Lee  et  al.,  2000;  Novak  et  al.,  2002),  but  microarray  experiments  are  costly  and  involve  RNA  samples  that  are  often  difficult  to  obtain.  We  therefore  need  techniques  for  estimating  in  advance  how  many  replicates  should  be  performed  in  a  given  study.
0	A standard approach to the problem of estimating the statistical properties of a planned set of data is `power analysis'. Power analysis estimates the probability of correctly rejecting the null hypothesis in favor of a specific alternative while maintaining a particular Type I error rate. For the situations we consider here, the alternative hypothesis is usually expressed in terms of `effect size', the actual difference in the group means (relative to the variance) that is desired to be detected. A mathematical model of the data is then used to estimate how many replicates are needed to achieve the desired Type I and Type II error rates. Certain parameters for the modeled data (most critically, the expected variability) are often estimated from real data, perhaps from a pilot study. Although clearly a useful tool, power analysis comes with some caveats. First, the estimated variability is critically dependent on the assumptions of the model and the quality of the input parameter estimates. A second set of assumptions enters into the statistical test that is used to evaluate the null hypothesis. In addition, for gene expression studies, power analysis is potentially extremely complex, with a separate set of parameters for each gene, not to mention the need to account for complex interactions among genes. To our knowledge such a complete power calculation has not been attempted, though some papers have used simpler power analyses to study microarray expression data (Zien et al., 2002; Hwang et al., 2002; Pan et al., 2002).
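For orientation, the simplest form of such a power calculation can be sketched as follows. This is an illustrative approximation only, not the procedure used in any of the cited papers. It assumes a single gene, a two-sided two-sample comparison with equal group sizes, normally distributed data, and an effect size expressed as the difference in group means divided by the standard deviation. The function name `replicates_per_group` and the default error rates are assumptions.

```python
from math import ceil
from statistics import NormalDist


def replicates_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sample comparison.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized effect size (mean difference / SD).
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # controls the Type I error rate
    z_beta = z.inv_cdf(power)             # controls the Type II error rate
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# One gene tested at alpha = 0.05, and the same gene with a stricter
# per-gene threshold such as might be used when thousands of genes are tested.
print(replicates_per_group(1.0))
print(replicates_per_group(1.0, alpha=0.05 / 6000))
```

The normal approximation tends to understate the requirement slightly for small groups, and the second call shows how quickly the required replication grows once the per-gene threshold is tightened to allow for many simultaneous tests.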
In  this  paper,  we  study  the  effect  of  increasing  (or  decreasing)  replication  on  the  detection  of  differentially  expressed  genes  in  real  data  sets,  avoiding  the  assumptions  required  to  simulate  data.  However,  because  in  real  data  sets  we  do  not  know  which  genes  truly  show  differential  expression,  we  cannot  directly  assess  power.  Instead,  we  examine  aspects  of  the  results  which  are  of  interest  to  biologists  and  which  complement  traditional  power  analyses.  We  make  our  findings  as  general  as  possible  by  analyzing  many  data  sets.  We  consider  a  simple  general  type  of  experiment,  the  goal  of  which  is  to  identify  genes  that  are  differentially  expressed  between  two  experimental  groups  (for  example,  tumor  and  normal  tissue).  The  two  groups  each  contain  a  number  of
0	replicate  samples.  These  replicates  are  derived  from  different  biological  sources,  as  opposed  to  so-called  `technical  replicates',  in  which  the  same  biological  sample  is  tested  multiple  times.  Differentially  expressed  genes  are  identified  by  a  statistical  test  for  group  comparison  (such  as  a  t-test),  where  the  null  hypothesis  is  equality  of  the  group  means.  A  p-value  threshold  is  applied  following  the  test  to  establish  a  desired  Type  I  error  rate.  The  final  result  obtained  from  this  hypothetical  experiment  is  a  list  of  genes  that  are  differentially  expressed  at  a  particular  level  of  statistical  confidence.  To  study  various  levels  of  replication,  we  use  a  random  sampling  approach.  Given  a  real  data  set,  we  simulate  smaller  data  sets  of  various  sizes  by  randomly  selecting  samples  from  it.  For  example,  if  we  start  with  a  data  set  containing  at  least  12  replicates  in  each  group,  then  we  can  make  data  sets  of  any  level  of  replication  (up  to  12)  by  randomly  selecting  from  the  real  samples  (Fig.  1).  We  then  examine  properties  of  each  of  these  sampled  data  sets  with  methods  described  below.  We  repeat  this  procedure  on  many  data  sets,  for  every  possible  level  of  replication,  for  many  random  samples,  to  generate  a  large  set  of  statistics  on  the  properties  of  data  sets  of  various  sizes.  We  consider  two  qualities  of  each  sampled  data  set.  The  first  and  most  important  is  the  ability  to  obtain  any  results  at  all,  that  is,  to  find  genes  that  meet  our  statistical  criteria.  We  refer  to  this  property  as  `apparent  power'  to  distinguish  it  from  power  in  the  strict  sense.  
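The random sampling procedure can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation. Genes are selected by a cutoff on the absolute Welch t statistic rather than by a p-value threshold, because the Python standard library has no t distribution. The data layout (a dictionary from gene name to per-sample values) and the names `t_statistic` and `apparent_power` are hypothetical.

```python
import random
import statistics
from math import sqrt


def t_statistic(x, y):
    """Welch t statistic for two lists of expression values."""
    se = sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se if se > 0 else 0.0


def apparent_power(group_x, group_y, r, t_cutoff, trials, seed=0):
    """Mean number of genes passing the cutoff when r replicates are drawn.

    group_x, group_y: dict mapping gene name -> list of values, one value
    per replicate sample (the same sample order for every gene).
    """
    rng = random.Random(seed)
    n_x = len(next(iter(group_x.values())))
    n_y = len(next(iter(group_y.values())))
    counts = []
    for _ in range(trials):
        # Draw whole samples (arrays), so every gene uses the same replicates.
        ix = rng.sample(range(n_x), r)
        iy = rng.sample(range(n_y), r)
        selected = [
            gene for gene in group_x
            if abs(t_statistic([group_x[gene][i] for i in ix],
                               [group_y[gene][i] for i in iy])) >= t_cutoff
        ]
        counts.append(len(selected))
    return statistics.mean(counts)
```

Calling `apparent_power` for each replication level r and plotting the result against r gives a curve of the kind the text calls apparent power. The choice of cutoff and number of trials is left to the user in this sketch.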
Because  increasing  sample  size  will  essentially  always  increase  power,  it  might  be  reasonable  for  an  experimenter  to  choose  a  level  of  replication  that  is  sufficient  to  yield  `enough'  high-confidence  candidates,  where  `enough'  must  be  defined  by  the  needs  of  the  experiment.  The  second  quality  that  we  consider  is  the  stability  of  the  results.  Note  that  stability  is  only  meaningful  if  some  genes  have  met  our  statistical  criteria.  We  define  stability  as  the  tendency  for  the  results  to  remain  the  same  as  the  replication  level  is  changed.  We  define  two  metrics  of  stability,  which  differ  in  their  stringency.  First,  we  consider  the  stability  of  the  identities  of  the  genes  that  meet  the  statistical  criteria.  Second,  we  consider  the  rank  order  of  those  genes.  Details  of  our  metrics  are  provided  in  the  methods  section,  below.  Our  goal  is  to  identify,  for  each  data  set,  a  level  of  replication  that  yields  good  performance  according  to  our  metrics,  but  without  requiring  an  unreasonably  large  number  of  replicates.  We  wish  to  ask,  `Can  we  find  useful  results  with  only  a  few  replicates?'  and  at  the  other  extreme,  `Do  we  need  30  replicates?'  Although  the  experimental  design  used  here  is  simple--identifying  differentially  expressed  genes  across  two  conditions--the  techniques  that  we  describe  could  be  applied  to  a  wide  range  of  situations.  Our  results  suggest  that  while  statistical  power  is  a  critical  consideration  in  experimental  design,  researchers  should  also  consider  the  stability  of  the  results  they  obtain.  While  the  specific  findings  are  data  dependent,  we  found  that  good  apparent  power  and  stability  can  usually  be  obtained  with  fewer  than
0	[Figure 1. Schematic of the sampling procedure. From the full data set (tissue classes X and Y; n = 8 per group in the example), samples (S; the 'gold standard') and sample-tests (Stest) are drawn at random for each r = 3...n (r = 6 in the example), with up to 100 and up to 10 random trials. T-test ranking and a threshold yield the selected sets Ssel and Stestsel, which are compared for stability determination.]
0	replicates,  and  often  with  fewer  than  10.  On  the  other  hand,  using  fewer  than  five  replicates  almost  always  results  in  poor  apparent  power  and  low  stability.  The  methods  we  present  can  be  used  i
0	Error-correcting  microarray  design
1	Arshad H. Khan, Alex Ossadtchi, Richard M. Leahy and Desmond J. Smith
0	Abstract  We  describe  a  microarray  design  based  on  the  concept  of  error-correcting  codes  from  digital  communication  theory.  Currently,  microarrays  are  unable  to  efficiently  deal  with  "drop-outs,"  when  one  or  more  spots  on  the  array  are  corrupted.  The  resulting  information  loss  may  lead  to  decoding  errors  in  which  no  quantitation  of  expression  can  be  extracted  for  the  corresponding  genes.  This  issue  is  expected  to  become  increasingly  problematic  as  the  number  of  spots  on  microarrays  expands  to  accommodate  the  entire  genome.  The  error-correcting  approach  employs  multiplexing  (encoding)  of  more  than  one  gene  onto  each  spot  to  efficiently  provide  robustness  to  drop-outs  in  the  array.  Decoding  then  allows  fault-tolerant  recovery  of  the  expression  information  from  individual  genes.  The  error-correcting  method  is  general  and  may  have  important  implications  for  future  array  designs  in  research  and  diagnostics.  ©  2003  Elsevier  Science  (USA).  All  rights  reserved.
0	Keywords:  Efficiency;  Error-correcting  codes;  Fault-tolerance;  Microarrays;  Overhead
0	Relative  expression  levels  for  two  different  biological  samples  can  be  measured  simultaneously  for  several  thousand  genes  using  cDNA  microarrays  [1].  The  arrays  are  created  robotically  using  pins  to  spot  different  cDNAs  as  a  2D  grid  on  a  treated  glass  slide.  The  RNA  from  the  two  samples  is  labeled  using  fluorescent  dyes  with  distinct  spectra  and  cohybridized  to  the  array.  A  photomultiplier  tube  (PMT)  is  then  used  to  collect  an  image  of  the  stimulated  fluorescence  for  each  of  the  two  fluorophores  at  every  spot.  Relative  transcript  abundances  for  each  gene  are  quantitated  as  the  log-ratio  of  the  fluorescence  intensities.  Many  factors  can  affect  the  accuracy  of  microarrays:  spot  size,  pin  effects,  hybridization  efficiency,  the  response  of  the  PMT,  and  the  quality  of  the  labeled  RNA  [2].  Taking  the  ratio  for  the  two  fluorophores  at  each  spot  to  compute  relative  expression  can  help  mitigate  effects  common  to  both  samples,  such  as  spot  size  and  hybridization  efficiency.  However,  poor  spot  formation  and  neighborhood  background  fluorescence  may  be  so  severe  that  little  or  no  useful  information  can  be  extracted  from  affected  spots.  As  the
0	number  of  spots  on  microarrays  expands  to  accommodate  the  entire  genome,  the  occurrence  of  such  "drop-outs"  will  tend  to  increase.  Current  microarray  designs  are  not  robust  to  these  errors  and  are  susceptible  to  loss  of  experimental  information  from  genes  that  may  be  essential  for  a  particular  study.  Error-correcting  codes  play  a  fundamental  role  in  reducing  inaccuracies  during  data  transmission  in  digital  communication  systems  [3].  An  important  concept  in  these  codes  is  overhead,  the  percentage  of  transmitted  bits  employed  for  error  correction.  The  converse  quantity  is  efficiency.  The  simplest  approach  to  error  correction  employs  replication  of  all  bits;  however,  this  carries  considerable  overhead  (low  efficiency),  and  much  more  economical  and  elegant  schemes  have  been  devised.  In  this  report,  we  describe  a  new  approach  to  microarray  design  that  employs  the  concepts  of  error-correcting  codes.  The  approach  is  thus  fault  tolerant,  and  expression  levels  for  each  gene  can  be  estimated  in  the  presence  of  corrupted  spots.  The  design  is  based  on  the  use  of  a  binary  encoding  scheme  in  which  two  or  more  genes  are  multiplexed  onto  each  spot.  Using  a  decoding  procedure,  the  expression  level  for  each  gene  can  then  be  recovered.  In  the  case  in  which
0	one  or  more  spots  are  corrupted,  the  decoder  discards  these  data  and  computes  the  expression  level  for  each  gene  using  the  remaining  spots.  The  coding  scheme  has  greater  efficiency  (less  overhead)  than  simple  approaches  such  as  duplication  of  all  spots,  an  important  consideration  since  it  is  necessary  to  keep  array  sizes  within  bounds.  We  first  describe  the  error-correcting  approach  and  then  studies  of  error-correcting  performance,  linearity,  and  sensitivity.  In  the  first  set  of  investigations,  four  genes  are  encoded  using  six  spots,  providing  robustness  to  loss  of  two  spots.  However,  higher  degrees  of  multiplexing  can  be  used  to  reduce  the  total  number  of  spots  (greater  efficiency,  less  overhead)  while  still  providing  error-correcting  capabilities.  In  additional  implementations,  we  demonstrate  the  utility  of  this  principle.
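The idea of discarding corrupted spots and decoding from the remainder can be illustrated with a small linear sketch. It is not the authors' encoding matrix or decoder. It assumes that each spot signal is simply the sum of the levels of the genes multiplexed onto it, and it recovers the levels by ordinary least squares on the surviving spots. The matrix `G` below (every pair of four genes on one of six spots) is a hypothetical choice made only for this example; with it, losing two spots that share a gene remains decodable, whereas losing two spots that share no gene does not.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular: too many spots were lost")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def decode(G, y, corrupted=()):
    """Least-squares estimate of gene levels from the uncorrupted spots."""
    keep = [i for i in range(len(G)) if i not in corrupted]
    Gk = [G[i] for i in keep]
    yk = [y[i] for i in keep]
    n = len(G[0])
    # Normal equations: (Gk^T Gk) x = Gk^T yk
    GtG = [[sum(row[p] * row[q] for row in Gk) for q in range(n)] for p in range(n)]
    Gty = [sum(row[p] * v for row, v in zip(Gk, yk)) for p in range(n)]
    return solve_linear(GtG, Gty)


# Hypothetical 6 x 4 binary encoding: each spot carries two of four genes.
G = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1],
     [1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
x_true = [1.0, 2.0, 3.0, 4.0]                      # made-up gene levels
y = [sum(g * x for g, x in zip(row, x_true)) for row in G]
print(decode(G, y, corrupted={0, 1}))              # spots 0 and 1 discarded
```

When the retained rows have full column rank, the least-squares solution coincides with multiplying the measurements by the pseudoinverse of the reduced encoding matrix.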
0	Results
0	Error-correcting codes
0	Error-correcting codes are formulated for finite alphabets and are based on the introduction of redundancy into data transmitted over a channel, in the case of block codes by using k codeword bits to encode n source bits, where $k > n$. Redundancy in the code allows detection and correction of errors. The microarray problem differs in a fundamental way, since gene expression levels are continuously variable. Consequently, it is not possible to work in the finite field framework. Nevertheless, because of the impracticality of combining fractional amounts of cDNA for different genes, use of a binary encoding matrix is appropriate. We denote by x the vector of RNA levels corresponding to a set of n genes. We will assume that hybridization rates are unaffected by the multiplexing process. Then the total concentration of RNA y at k multiplexed spots can be written as
0	$$y = TGSx, \qquad (1)$$
0	where G is the $k \times n$ binary encoding matrix, S is a diagonal matrix with elements s(j,j) denoting the affinity of RNA from the jth gene to cDNA on the array, and T is a diagonal matrix with elements t(i,i) denoting spot-specific effects, such as size, that are not included in S. The ith row of the encoding matrix G has n entries of value 1 and 0, indicating which of the n genes are encoded in the ith spot through inclusion of their cDNA. The encoding matrix is chosen to maximize error-correcting capabilities while minimizing propagation of noise effects. Let us assume for now that the entire process is linear, concentration levels are measured directly, T and S are identity matrices, and measurement noise is identical and independent at each spot. The expression levels can be computed to minimize error variance by multiplying the measurements y by the pseudoinverse $G^{+}$ of G [4]. It is then possible to design G to minimize the noise variance
0	Slide reading and decoding
0	The multiplex spot signal
0	Differences in affinity and spot sizes from gene to gene make absolute quantitation extremely difficult using cDNA microarrays. Consequently, ratios of intensity between two fluorescence images are typically used to determine relative expression [1]. Let the vector $x^{\mathrm{Cy5}}$ denote the concentration of RNA corresponding to the n genes labeled with Cy5. Let $y^{\mathrm{Cy5}} = TGSx^{\mathrm{Cy5}}$ denote the vector representing the concentrations of labeled RNA hybridized to the k multiplex spots. Similarly, define vectors $x^{\mathrm{Cy3}}$ and $y^{\mathrm{Cy3}}$ for concentrations of Cy3-labeled RNA. The quantity to be extracted from the microarray data is thus the ratio
0	$$r_i = \log\left(x_i^{\mathrm{Cy5}} / x_i^{\mathrm{Cy3}}\right), \quad i = 1, \ldots, n. \qquad (5)$$
0	We assume the response of the scanner PMT used to measure fluorescence is linear, so that the measured image intensity can be written as
0	$$I^{\mathrm{Cy5}} = aTGSx^{\mathrm{Cy5}}, \quad I^{\mathrm{Cy3}} = aTGSx^{\mathrm{Cy3}}, \qquad (6)$$
0	where a is the calibration factor. From these measurements we compute the vector z of log ratios of the multiplexed gene expression levels, i.e.,
0	$$z_i = \log\left(I_i^{\mathrm{Cy5}} / I_i^{\mathrm{Cy3}}\right), \quad i = 1, \ldots, k, \qquad (7)$$
0	and from these we estimate the expression ratios $r_j$, $j = 1, \ldots, n$, as defined in Eq. (5).
0	Nonlinear decoding algorithm
0	We use a nonlinear decoding algorithm to estimate the relative expression levels for each gene. We first identify and discard any corrupted spots to leave the index set of retained spots, a subset of $\{1, \ldots, k\}$. The remaining spots are then processed by numerically minimizing the function $J(\hat{x}^{\mathrm{Cy3}}, \hat{x}^{\mathrm{Cy5}})$
0	assess  the  quantitative  performance  and  sensitivity  of  the  microarrays  over  a  large  dynamic  range,  10  different  amounts  of  kidney  RNA  were  cohybridized  to  each  microarray  in  the 
0	Exploring  the  Metabolic  and  Genetic  Control  of  Gene  Expression  on  a  Genomic  Scale
1	Joseph  L.  DeRisi,  Vishwanath  R.  Iyer,  Patrick  O.  Brown*
0	DNA  microarrays  containing  virtually  every  gene  of  Saccharomyces  cerevisiae  were  used  to  carry  out  a  comprehensive  investigation  of  the  temporal  program  of  gene  expression  accompanying  the  metabolic  shift  from  fermentation  to  respiration.  The  expression  profiles  observed  for  genes  with  known  metabolic  functions  pointed  to  features  of  the  metabolic  reprogramming  that  occur  during  the  diauxic  shift,  and  the  expression  patterns  of  many  previously  uncharacterized  genes  provided  clues  to  their  possible  functions.  The  same  DNA  microarrays  were  also  used  to  identify  genes  whose  expression  was  affected  by  deletion  of  the  transcriptional  co-repressor  TUP1  or  overexpression  of  the  transcriptional  activator  YAP1.  These  results  demonstrate  the  feasibility  and  utility  of  this  approach  to  genomewide  exploration  of  gene  expression  patterns.
0	The  complete  sequences  of  nearly  a  dozen
0	microbial  genomes  are  known,  and  in  the  next  several  years  we  expect  to  know  the  complete  genome  sequences  of  several  metazoans,  including  the  human  genome.  Defining  the  role  of  each  gene  in  these  genomes  will  be  a  formidable  task,  and  understanding  how  the  genome  functions  as  a  whole  in  the  complex  natural  history  of  a  living  organism  presents  an  even  greater  challenge.  Knowing  when  and  where  a  gene  is  expressed  often  provides  a  strong  clue  as  to  its  biological  role.  Conversely,  the  pattern  of  genes  expressed  in  a  cell  can  provide  detailed  information  about  its  state.  Although  regulation  of  protein  abundance  in  a  cell  is  by  no  means  accomplished  solely  by  regulation  of  mRNA,  virtually  all  differences  in  cell  type  or  state  are  correlated  with  changes  in  the  mRNA  levels  of  many  genes.  This  is  fortuitous  because  the  only  specific  reagent  required  to  measure  the  abundance  of  the  mRNA  for  a  specific  gene  is  a  cDNA  sequence.  DNA  microarrays,  consisting  of  thousands  of  individual  gene  sequences  printed  in  a  high-density  array  on  a  glass  microscope  slide  (1,  2),  provide  a  practical  and  economical  tool  for  studying  gene  expression  on  a  very  large  scale  (3-6).  Saccharomyces  cerevisiae  is  an  especially
0	favorable  organism  in  which  to  conduct  a  systematic  investigation  of  gene  expression.  The  genes  are  easy  to  recognize  in  the  genome  sequence,  cis  regulatory  elements  are  generally  compact  and  close  to  the  transcription  units,  much  is  already  known  about  its  genetic  regulatory  mechanisms,  and  a  powerful  set  of  tools  is  available  for  its  analysis.  A  recurring  cycle  in  the  natural  history  of  yeast  involves  a  shift  from  anaerobic  (fermentation)  to  aerobic  (respiration)  metabolism.  Inoculation  of  yeast  into  a  medium  rich  in  sugar  is  followed  by  rapid  growth  fueled  by  fermentation,  with  the  production  of  ethanol.  When  the  fermentable  sugar  is  exhausted,  the  yeast  cells  turn  to  ethanol  as  a  carbon  source  for  aerobic  growth.  This  switch  from  anaerobic  growth  to  aerobic  respiration  upon  depletion  of  glucose,  referred  to  as  the  diauxic  shift,  is  correlated  with  widespread  changes  in  the  expression  of  genes  involved  in  fundamental  cellular  processes  such  as  carbon  metabolism,  protein  synthesis,  and  carbohydrate  storage  (7).  We  used  DNA  microarrays  to  characterize  the  changes  in  gene  expression  that  take  place  during  this  process  for  nearly  the  entire  genome,  and  to  investigate  the  genetic  circuitry  that  regulates  and  executes  this  program.  Yeast  open  reading  frames  (ORFs)  were  amplified  by  the  polymerase  chain  reaction  (PCR),  with  a  commercially  available  set  of  primer  pairs  (8).  DNA  microarrays,  containing  approximately  6400  distinct  DNA  sequences,  were  printed  onto  glass  slides  by
0	using a simple robotic printing device (9). Cells from an exponentially growing culture of yeast were inoculated into fresh medium and grown at 30°C for 21 hours. After an initial 9 hours of growth, samples were harvested at seven successive 2-hour intervals, and mRNA was isolated (10). Fluorescently labeled cDNA was prepared by reverse transcription in the presence of Cy3 (green)- or Cy5 (red)-labeled deoxyuridine triphosphate (dUTP) (11) and then hybridized to the microarrays (12). To maximize the reliability with which changes in expression levels could be discerned, we labeled cDNA prepared from cells at each successive time point with Cy5, then mixed it with a Cy3-labeled "reference" cDNA sample prepared from cells harvested at the first interval after inoculation. In this experimental design, the relative fluorescence intensity measured for the Cy3 and Cy5 fluors at each array element provides a reliable measure of the relative abundance of the corresponding mRNA in the two cell populations (Fig. 1). Data from the series of seven samples (Fig. 2), consisting of more than 43,000 expression-ratio measurements, were organized into a database to facilitate efficient exploration and analysis of the results. This database is publicly available on the Internet (13). During exponential growth in glucose-rich medium, the global pattern of gene expression was remarkably stable. Indeed, when gene expression patterns between the first two cell samples (harvested at a 2-hour interval) were compared, mRNA levels differed by a factor of 2 or more for only 19 genes (0.3%), and the largest of these differences was only 2.7-fold (14).
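Counting genes whose mRNA levels differ by a given factor can be sketched as follows. This is an illustrative fragment, not the analysis code behind the reported numbers. It assumes each gene has a single Cy5/Cy3 intensity ratio, with Cy5 marking the later time point and Cy3 the reference. The gene names and ratio values are invented placeholders, and the function name `classify_fold_changes` is hypothetical.

```python
def classify_fold_changes(ratios, fold=2.0):
    """Split genes into induced and repressed sets by a fold-change cutoff.

    ratios: dict mapping gene name -> Cy5/Cy3 ratio (sample / reference).
    A gene is 'induced' if ratio >= fold and 'repressed' if ratio <= 1/fold.
    """
    induced = sorted(g for g, r in ratios.items() if r >= fold)
    repressed = sorted(g for g, r in ratios.items() if r <= 1.0 / fold)
    return induced, repressed


# Made-up ratios for four placeholder genes (not data from the study).
example = {"geneA": 4.5, "geneB": 1.1, "geneC": 0.2, "geneD": 2.0}
induced, repressed = classify_fold_changes(example, fold=2.0)
print("induced:", induced)
print("repressed:", repressed)
print("fraction changed:", (len(induced) + len(repressed)) / len(example))
```

Applied to the first two time points, a count of this kind corresponds to the 19 genes out of roughly 6400 (about 0.3%) quoted above.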
However,  as  glucose  was  progressively  depleted  from  the  growth  media  during  the  course  of  the  experiment,  a  marked  change  was  seen  in  the  global  pattern  of  gene  expression.  mRNA  levels  for  approximately  710  genes  were  induced  by  a  factor  of  at  least  2,  and  the  mRNA  levels  for  approximately  1030  genes  declined  by  a  factor  of  at  least  2.  Messenger  RNA  levels  for  183  genes  increased  by  a  factor  of  at  least  4,  and  mRNA  levels  for  203  genes  diminished  by  a  factor  of  at  least  4.  About  half  of  these  differentially  expressed  genes  have  no  currently  recognized  function  and  are  not  yet  named.  Indeed,  more  than  400  of  the  differentially  expressed  genes  have  no  apparent  homology
0	to any gene whose function is known (15). The responses of these previously uncharacterized genes to the diauxic shift therefore provide the first small clue to their possible roles. The global view of changes in expression of genes with known functions provides a vivid picture of the way in which the cell adapts to a changing environment. Figure 3 shows a portion of the yeast metabolic pathways involved in carbon and energy metabolism. Mapping the changes we observed in the mRNAs encoding each enzyme onto this framework allowed us to infer the redirection in the flow of metabolites through this system. We observed large inductions of the genes coding for the enzymes aldehyde dehydrogenase (ALD2) and acetyl-coenzyme A (CoA) synthase (ACS1), which function together to convert the products of alcohol dehydrogenase into acetyl-CoA, which in turn is used to fuel the tricarboxylic acid (TCA) cycle and the glyoxylate cycle. The concomitant shutdown of transcription of the genes encoding pyruvate decarboxylase and induction of pyruvate carboxylase rechannels pyruvate away from acetaldehyde, and instead to oxalacetate, where it can serve to supply the TCA cycle and gluconeogenesis. Induction of the pivotal genes PCK1, encoding phosphoenolpyruvate carboxykinase, and FBP1, encoding fructose-1,6-bisphosphatase, switches the directions of two key irreversible steps in glycolysis, reversing the flow of metabolites along the reversible steps of the glycolytic pathway toward the essential biosynthetic precursor, glucose-6-phosphate. Induction of the genes coding for the trehalose synthase and glycogen synthase complexes promotes channeling of glucose-6-phosphate into these carbohydrate storage pathways.
Just  as  the  changes  in  expression  of  genes  encoding  pivotal  enzymes  can  provide  insight  into  metabolic  reprogramming,  the  behavior  of  large  groups  of  functionally  related  genes  can  provide  a  broad  view  of  the  systematic  way  in  which  the  yeast  cell  adapts  to  a  changing  environment  (Fig.  4).  Several  classes  of  genes,  such  as  cytochrome  c-related  genes  and  those  involved  in  the  TCA/glyoxylate  cycle  and  carbohydrate  storage,  were  coordinately  induced  by  glucose  exhaustion.  In  contrast,  genes  devoted  to  protein  synthesis,  including  ribosomal  proteins,  tRNA  synthetases,  and  translation,  elongation,  and  initiation  factors,  exhibited  a  coordinated  decrease  in  expression.  M
0	adulthood,  specific  combinations  of  tumor  suppressor  genes  may  cooperate  to  control  proliferation,  differentiation,  and  survival  in  different  cell  lineages.
0	Microarray  Analysis  of  Drosophila  Development  During  Metamorphosis
1	Kevin  P.  White,*  Scott  A.  Rifkin,  Patrick  Hurban,  David  S.  Hogness
0	Metamorphosis  is  an  integrated  set  of  developmental  processes  controlled  by  a  transcriptional  hierarchy  that  coordinates  the  action  of  hundreds  of  genes.  In  order  to  identify  and  analyze  the  expression  of  these  genes,  high-density  DNA  microarrays  containing  several  thousand  Drosophila  melanogaster  gene  sequences  were  constructed.  Many  differentially  expressed  genes  can  be  assigned  to  developmental  pathways  known  to  be  active  during  metamorphosis,  whereas  others  can  be  assigned  to  pathways  not  previously  associated  with  metamorphosis.  Additionally,  many  genes  of  unknown  function  were  identified  that  may  be  involved  in  the  control  and  execution  of  metamorphosis.  The  utility  of  this  genome-based  approach  is  demonstrated  for  studying  a  set  of  complex  biological  processes  in  a  multicellular  organism.  The  generation  of  vast  amounts  of  DNA  sequence  information,  coupled  with  advances  in  technologies  developed  for  the  e
0	A  common  reference  for  cDNA  microarray  hybridizations
1	Ellen  Sterrenburg,  Rolf  Turk,  Judith  M.  Boer,  Gertjan  B.  van  Ommen  and  Johan  T.  den  Dunnen*
0	ABSTRACT  Comparisons  of  expression  levels  across  different  cDNA  microarray  experiments  are  easier  when  a  common  reference  is  co-hybridized  to  every  microarray.  Often  this  reference  consists  of  one  experimental  control  sample,  a  pool  of  cell  lines  or  a  mix  of  all  samples  to  be  analyzed.  We  have  developed  an  alternative  common  reference  consisting  of  a  mix  of  the  products  that  are  spotted  on  the  array.  Pooling  part  of  the  cDNA  PCR  products  before  they  are  printed  and  their  subsequent  amplification  towards  either  sense  or  antisense  cRNA  provides  an  excellent  common  reference.  Our  results  show  that  this  reference  yields  a  reproducible  hybridization  signal  in  99.5%  of  the  cDNA  probes  spotted  on  the  array.  Accordingly,  a  ratio  can  be  calculated  for  every  spot,  and  expression  levels  across  different  hybridizations  can  be  compared.  In  dye-swap  experiments  this  reference  shows  no  significant  ratio  differences,  with  95%  of  the  spots  within  an  interval  of  T0.2-fold  change.  The  described  method  can  be  used  in  hybridizations  with  both  amplified  and  non-amplified  targets,  is  time  saving  and  provides  a  constant  batch  of  common  reference  that  lasts  for  thousands  of  hybridizations.  INTRODUCTION  cDNA  microarraying  is  currently  widely  used  to  assess  differential  gene  expression  (1).  Simultaneous  hybridization  of  two  samples  labeled  with  different  fluorescent  dyes  provides  an  intensity  ratio  that  reflects  the  relative  mRNA  levels  (2).  Though  adequate  for  comparison  of  two  samples,  assessment  of  expression  levels  across  multiple  samples,  for  example  in  a  time  series,  becomes  complicated.  For  multiarray  comparisons,  hybridization  of  a  common  reference  sample  simultaneously  with  each  experimental  sample  is  recommended  (3,4).  
Initially one sample, e.g. mRNA originating from one cell line or time point zero, was used as a common reference (5-7). A disadvantage of this approach is that the control sample does not provide a signal in all spots and, since for these no ratio can be calculated, they are usually
0	disregarded in the analysis. Sometimes these gaps are filled in by applying a program that is designed to estimate missing values (8). However, to avoid using an estimation program or other alternatives, the ideal reference should ensure consistent and non-zero values for all probes on the array, guaranteeing that no information is lost when the ratios are calculated (4). A reference consisting of a labeled PCR product from a part of the vector that all the spotted probes have in common, as has been described for filter hybridizations, meets this criterion (9). However, it will not compete with the target cDNA for hybridization to the specific sequence of the probe. Consequently, the ratios obtained from such a hybridization may not always reflect the amount of RNA present in the experimental sample (e.g. saturated spots). Another described common reference consists of a pool of RNA originating from different cell lines (3,10-12). This approaches the ideal situation, but cell culturing is very time- and space-consuming. In addition, gene expression in the pooled cell lines may not represent all genes present on the array and it may change over time under even slightly different growth conditions and other variables like passage number. Furthermore, it is difficult to repeatedly quantify and pool large amounts of RNAs from multiple sources in a reliable and reproducible way. Bergstrom et al. used such a common reference and reported a coverage of 90% of the array by the reference (13). An alternative to this method, which does provide signal in all spots that need to be analyzed, is pooling part of the RNA of all the experimental samples (e.g.
cell  lines  or  biopsies)  which  will  be  used  in  that  particular  experiment  (4,14).  The  disadvantage  here  is  that  this  approach  is  experiment  specific  and  each  time  a  new  experiment  is  performed,  a  new  reference  pool  has  to  be  made.  Furthermore,  if  the  amount  of  experimental  samples  is  limiting,  it  is  not  possible  to  use  part  of  it  for  the  common  reference  and  if  one  wants  to  study  individual  samples  (e.g.  new  incoming  patients),  there  is  no  reference  sample  present.  The  experiments  presented  here  demonstrate  the  use  of  a  common  reference  for  cDNA  microarrays  consisting  of  a  mix  of  all  probes  spotted  on  the  array.  The  PCR  reference  is  made  by  pooling  a  fraction  of  all  amplified  probes  before  they  are  printed.  Single-stranded  products  are  synthesized  in  a  subsequent  in  vitro  transcription  reaction  and  the  product  is  labeled  in  parallel  with  the  experimental  target.  The  method  can  be  used  in  hybridizations  with  both  amplified  and  non-amplified
0	dichloromethane was used. After extraction, the aqueous layer was transferred to a fresh tube and purified and concentrated by ethanol precipitation. Antisense cRNA transcripts were generated using the Ampliscribe Sp6 High Yield Transcription kit (Epicentre), starting with 1 µg of pooled PCR product (Fig. 1). In addition to the protocol, 1 µl of RNasin (Fermentas) was added and the reaction was incubated at 42°C for 3 h. The generated cRNA was washed three times with 450 µl of diethylpyrocarbonate-treated water using a Microcon-100 column (Millipore). cRNA (750 ng) was reverse transcribed with random hexamers, and labeled through incorporation of Renaissance cyanine 5-dUTP (Cy5) or Renaissance cyanine 3-dUTP (Cy3) (NEN) according to the protocols of Ross et al. (12) with the following modifications: 8 µg of random hexamer primers were used in the reaction and before incubation at 42°C the mixture was incubated at room temperature for 10 min.
Target preparation
Human fibroblast cultures were grown in DMEM without phenol red (Gibco BRL) supplemented with 1% glucose, 2% glutamax, 100 U/ml penicillin, 100 µg/ml streptomycin and 10% heat-inactivated fetal bovine serum (Gibco BRL). Cells were coll
0	BETWEEN  GENOTYPE  AND  PHENOTYPE:  PROTEIN  CHAPERONES  AND  EVOLVABILITY
1	Suzanne  L.  Rutherford
0	Protein  chaperones  direct  the  folding  of  polypeptides  into  functional  proteins,  facilitate  developmental  signalling  and,  as  heat-shock  proteins  (HSPs),  can  be  indispensable  for  survival  in  unpredictable  environments.  Recent  work  shows  that  the  main  HSP  chaperone  families  also  buffer  phenotypic  variation.  Chaperones  can  do  this  either  directly  through  masking  the  phenotypic  effects  of  mutant  polypeptides  by  allowing  their  correct  folding,  or  indirectly  through  buffering  the  expression  of  morphogenic  variation  in  threshold  traits  by  regulating  signal  transduction.  Environmentally  sensitive  chaperone  functions  in  protein  folding  and  signal  transduction  have  different  potential  consequences  for  the  evolution  of  populations  and  lineages  under  selection  in  changing  environments.
0	The  heat-shock  proteins  (HSPs)  are  highly  conserved  families  of  enzymes  and  CHAPERONES  that  are  involved  in  the  folding  and  degradation  of  damaged  proteins.  They  are  rapidly  and  concertedly  mobilized  in  large  numbers  by  cells  that  are  under  stress.  The  mobilization  of  HSPs  is  an  important  component  of  a  universal  and  tightly  orchestrated  stress  response  that  has  probably  allowed  organisms  to  survive  otherwise  lethal  temperatures  throughout  evolution1,2.  Even  at  normal  temperatures,  several  HSP  chaperones  are  essential  for  viability,  and  promote  the  successful  folding  and  activity  of  many  cellular  proteins2-4.  Recent  reports  document  further  roles  of  some  of  the  constitutively  important  chaperone  families  that  are  expressed  at  the  population  level5-8.  Genetic  or  pharmacological  manipulation  of  these  chaperones  alters  the  expression  of  genetic  variation  in  several  systems.  Therefore,  as  well  as  having  a  vital  role  in  stress  physiology,  chaperones  also  provide  a  plausible  molecular  mechanism  for  regulating  the  capacity  of  populations  and  lineages  for  evolutionary  adaptation  to  changing  environments  --  EVOLVABILITY.  It  is  thought  that  during  periods  of  environmental  stress,  competition  for  chaperones  by  stress-damaged  proteins  compromises  the  ability  of  the  chaperones  to  protect  or  fold  their  usual  targets,  thereby  reducing  the  activities  of  most  target  proteins9,10.  According  to  recent  studies,  the  modulation  of  chaperone  and  target  functions  in  response  to  stress  would  alternately  mask  and  expose  phenotypic  variation,  depending  on  the  degree  of  stress  and  the  availability  of  free  chaperones11-14.  
This  indicates  that  chaperones  control  a  reserve  of  neutral  genetic  variation,  which  builds  up  in  populations  under  normal  conditions  and  could  be  expressed  as  heritable  phenotypic  variation  during  periods  of  environmental  change.  As  the  rate  of  evolution  is  limited  by  heritable  variation  in  fitness,  this  chaperone-mediated  mechanism  might  allow  populations  and  lineages  to  better  adapt  to  severe  environmental  change.  The  expression  of  random  genetic  variation  is  expected  to  be  largely  deleterious  to  individual  fitness.  However,  both  individual  organisms  and  interbreeding  groups  of  organisms  produce  the  differential  `births'  (new  individuals  or  groups)  and  `deaths'  (loss  of  reproductive  fitness  or  extinction)  that  are  required  for  evolution.  Under  certain  circumstances,  population-level  traits  can  increase  group  fitness  more  than  they  decrease  individual  fitness,  even  though  the  evolutionary  forces  that  operate  at  each
0	CHAPERONES
0	A class of proteins that, by preventing improper associations, assist in the correct folding or assembly of other proteins in vivo, but that are not a part of the mature structure.
0	NATURE  REVIEWS  |  GENETICS
0	Nature  Publishing  Group
0	EVOLVABILITY
0	The ability of random genetic variation to produce phenotypic changes that can increase fitness (intrinsic evolvability) or the ability of a population to respond to selection (extrinsic evolvability). Extrinsic evolvability depends on intrinsic evolvability, as well as on external variables such as the history, size and structure of the population.
0	GROUP  SELECTION
0	Selection  on  traits  that  increase  the  relative  fitness  of  populations  or  lineages  of  organisms  at  some  fitness  cost  to  individuals.  All  of  the  feasible  mechanisms  require  selection  on  lineages  or  small  interbreeding  groups  of  related  individuals  in  subdivided  populations.
0	MUTATION  LOAD
0	The  accumulated  deleterious  alleles  that  are  carried  by  a  population  at  any  given  time.
0	EXPRESSED  MUTATION  RATE
0	The  rate  of  phenotypic  change  that  results  from  the  continuing  accumulation  of  new  mutations  (expressed  mutation  rate  =  total  mutation  rate  -  neutral  mutation  rate).
0	THRESHOLD  TRAITS
0	Quantitative  traits  that  are  discretely  expressed  in  a  limited  number  of  phenotypes  (usually  two),  but  which  are  based  on  an  assumed  continuous  distribution  of  factors  that  contribute  to  the  trait  (underlying  liability).
0	evolutionary  time6.  This  work  attracted  the  attention  of  biologists  ranging  from  protein  biochemists  to  ecologists  and  evolutionary  biologists17-20.  Recent  experiments  indicate  that  p
0	USING  DROSOPHILA  AS  A  MODEL  INSECT
1	David  Schneider
0	The  fruitfly  Drosophila  melanogaster  has  become  such  a  popular  model  organism  for  studying  human  disease  that  it  is  often  described  as  a  little  person  with  wings.  This  view  has  been  strengthened  with  the  sequencing  of  the  Drosophila  genome  and  the  discovery  that  60%  of  human  disease  genes  have  homologues  in  the  fruitfly.  In  this  review,  I  discuss  the  approach  of  using  Drosophila  not  only  as  a  model  for  metazoans  in  general  but  as  a  model  insect  in  particular.  Specifically,  I  discuss  recent  work  on  the  use  of  Drosophila  to  study  the  transmission  of  disease  by  insect  vectors  and  to  investigate  insecticide  function  and  development.
0	Insects  transmit  pathogens  that  sicken  and  kill  millions  of  people  annually.  Between  300  and  500  million  people  are  infected  each  year,  and  more  than  a  million  die  from  malaria  alone  (WHO  2000  report  on  health).  To  put  these  numbers  into  perspective,  the  number  of  people  killed  by  malaria  in  1998  was  comparable  to  the  number  of  people  killed  by  breast  and  prostate  cancer,  melanoma  and  leukaemia  combined.  Although  malaria  is,  by  far,  the  most  serious  insect-borne  disease,  there  are  still  other  arthropod-transmitted  illnesses,  such  as  Chagas  disease,  leishmaniasis,  sleeping  sickness  and  river  blindness,  which  infect  hundreds  of  thousands  of  people  each  year.  Insects  are  vectors  for  many  animal,  as  well  as  human,  diseases.  Furthermore,  insects  affect  human  health  by  damaging  our  food  supply  and,  by  eating  and  damaging  crops,  insects  also  function  as  vectors  for  various  plant  diseases.  Because  of  the  development  of  insecticide  resistance  in  vector  insects  and  insect  pests,  and  because  of  the  development  of  antibiotic  resistance  in  disease-causing  organisms,  it  is  essential  to  continuously  develop  new  methods  to  fight  these  scourges.  In  addition,  we  must  devise  effective  approaches  to  fight  the  spread  of  disease  where  none  has  existed  before.  This  will  involve  developing  new  pesticides  and  antibiotics  to  keep  ahead  of  resistance,  as  well  as  new,  unique  approaches.  For  example,  modern  molecular  biology  has  led  to  the  development  of  transgenic  crops  that  are  resistant  to  insects.  This  approach  can  narrow  the  target  range  of  control  techniques  and  limit  our  dependence  on  chemical  insecticides.  Similarly,  creative  applications  on  the  basis  of  knowledge  of  insect  biology  should  yield  even  more  results.  
The  massive  amount  of  information  known  about  Drosophila  should  be  put  to  use  in  this  endeavour.  This  review  is  divided  into  three  parts.  First,  I  discuss  briefly  how  the  fruitfly  has  been  used  to  solve  general  problems  in  insect  biology.  Second,  I  discuss  how  insects  act  as  vectors  for  human  diseases  and  how  our  understanding  of  Drosophila  biology  has  contributed  to  this  field1  (see  link  to  human  homologues  in  the  fruit  fly).  Last,  I  discuss  how  Drosophila  can  help  us  to  understand  and  control  agricultural  pests.  This  is  not  intended  to  be  a  global  review  of  modern  approaches  to  vector  biology  or  pesticide  research.  This  review  focuses  on  how  Drosophila  has  informed  or  could  inform  work  in  the  fields  of  vector  biology  and  pesticide  research.
0	The  fruitfly  as  a  model  insect
0	The  fruitfly  has  been  a  general  testing  ground  for  genetic  concepts  and  techniques  that  have  applications  for  both  vector  biology  and  pest  control.  For  example,  a  promising  twist  on  the  `sterile-male'  technique,  used  to  reduce  insect  population  size,  has  been  modelled  in  Drosophila2.  Typically,  sterile-male  projects  involve  isolating  large  numbers  of  male  insects  and  then  sterilizing  them  using  radiation.  The  males  are  then  released  into  the  wild,  where  they  overwhelm  local  males,  and  prevent  productive  matings  from  occurring.  Although  this
0	Macmillan  Magazines  Ltd
0	approach has been used successfully to reduce populations of screw worms and the tsetse fly3, its effectiveness is limited by the ability to isolate large numbers of homogeneous male populations and by the reduced viability of irradiated males. The new technique, developed in the fruitfly, takes advantage of a tetracycline-repressible transcription transactivator (TRTT). The first step is to create fruitflies that express the TRTT-encoding gene under the control of a yolk promoter, so that expression is limited to females. The fly is also made transgenic for a dominant-lethal gene that is expressed under the control of TRTT. This permits easy sorting of males because, in the absence of tetracycline, all female offspring die, whereas males are unaffected by tetracycline treatment because they never express the transactivator. The technique also results in non-productive matings, as all female offspring die and all male offspring carry, and will transmit, the lethal constructs. The net result is a simple method of producing male fruitflies and a simple method of sterilizing a population. Now that the value of this approach has been shown in the fruitfly, the procedure should be applied to other insects as genetic transformation becomes more readily available. A second example of how advances in our understanding of Drosophila biology have improved our ability to manipulate insects is the development of methods for transforming genes into other insects4,5. The fruitfly has functioned both as a source of transposable elements and as a system for developing transformation techniques. Both avenues have led to the genetic transformation of mosquitoes6,7.
Transgenic tools will facilitate the dissection of mosquito-parasite interactions and could lead to the development of parasite-resistant vectors. These two examples show the usefulness of the fruitfly in pioneering technologies that should be central in understanding and controlling the spread of insect-borne diseases and insect pests in general. In both examples, the fruitfly is used not because we are interested in studying it but because Drosophila is the simplest insect to manipulate.
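The logic of the tetracycline-repressible sorting scheme described in this section can be summarized as a small truth table. The Python sketch below is only a schematic reading of the text: it assumes that the transactivator is expressed in females alone (yolk promoter), that tetracycline switches it off, and that the dominant-lethal gene is active whenever the transactivator is active. The function names are hypothetical.

```python
def lethal_gene_on(sex: str, tetracycline: bool) -> bool:
    """Return True if the dominant-lethal gene is expressed.

    Assumptions taken from the text: the transactivator (TRTT) is driven
    by a yolk promoter and is therefore made only in females, and
    tetracycline represses its activity.
    """
    transactivator_expressed = sex == "female"
    transactivator_active = transactivator_expressed and not tetracycline
    return transactivator_active


def survives(sex: str, tetracycline: bool) -> bool:
    """An animal survives unless the dominant-lethal gene is switched on."""
    return not lethal_gene_on(sex, tetracycline)


# Print the full truth table for both sexes, with and without tetracycline.
for sex in ("female", "male"):
    for tet in (True, False):
        condition = "with tetracycline" if tet else "without tetracycline"
        outcome = "survives" if survives(sex, tet) else "dies"
        print(f"{sex:6s} {condition:22s} {outcome}")
```

Only the combination of female and no tetracycline is lethal, which is why withdrawing tetracycline leaves a male-only population and why matings with released males produce no surviving daughters.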
0	Insects  as  vectors  of  human  disease
0	ANOPHELINE
0	A mosquito of the subfamily that includes the genus Anopheles, and which may transmit malaria.
0	CULICINE
0	A mosquito of the subfamily that includes the genera Mansonia, Aedes and Culex, and which may transmit several diseases.
0	There  is  a  large  variety  of  vector-borne  diseases  (TABLE  1).  From  bacteria  to  viruses,  and  protozoans  to  worms,  almost  every  type  of  pathogen  has  adapted  to  use  insects  as  vectors  (FIG.  1).  Vectors  provide  a  means  of  getting  in  and  out  of  the  vertebrate  host  by  hitching  a  ride  in  a  blood  meal.  In  practice,  however,  insects  are  not  usually  passive  carriers  when  transmitting  disease  from  animal  to  animal;  instead,  parasites  must  overcome  many  barriers  to  colonize  the  insect  host8,9.  There  are  situations  where  passive  transmission  occurs  but,  for  the  diseases  listed  in  TABLE  1,  biological  transmission  is  the  rule10.  This  review  focuses  on  malaria  because  this  is,  by  far,  the  most  life-threatening  of  all  insect-borne  diseases.  In  humans,  malaria  is  caused  by  four  species  of  the  protozoan  genus  Plasmodium11  (FIG.  2),  of  which  a  single  species,  P.  falciparum,  is  responsible  for  most  malarial  deaths.  There  is  stringent  host-parasite  specificity  for  most  species  of  plasmodia  when  interacting  with  both  their  vertebrate  and  insect  hosts.  For  example,  ANOPHELINE  mosquitoes  are  the  insect  vectors  for  all  human-specific  plasmodia  whereas  Plasmodium  gallinaceum,  which  infects  ground  fowl,  uses  CULICINE  mosquitoes  as  vectors12.  Multifaceted  approaches,  such  as  the  coordinated  use  of  vaccines,  antibiotics  and  public  health  measures  have  been  important  in  limiting  disease.  Unfortunately,  few  of  these  tools  are  available  to  fight  malaria.  There  is,  at  present,  no  vaccine  against  any  of  the  plasmodia  strains  that  infect  humans.  Furthermore,  parasites  have  developed  resistance  to  many  of  the  drugs  available  to  fight  the  disease13-15  and  probably  will  develop  resistance  to  new  drugs  as  they  are  introduced.  Because
0	Table  1  |  Arthropod-borne  diseases
Disease              Vector
Viruses
  Dengue fever       Mosquito
  West Nile fever    Mosquito
  Yellow fever       Mosquito
Bacteria
  Plague             Flea
  Typhus             Louse
  Lyme disease       Tick
Protozoa
  Malaria            Mosquito
  Leishmaniasis      Sand fly
  Sleeping sickness  Tsetse
  Chagas disease     Kissing bug
Worms
  River blindness    Black fly
  Filariasis         Mosquito
0	[Figure labels: Chagas disease, typhus; Leishmania, plague, sleeping sickness; malaria, filariasis, arbovirus]
0	most  people  afflicted  with  malaria  reside  in  developing  cou
0	The  DNA  sequence  and  analysis  of  human  chromosome  6
1	A.  J.  Mungall*,  S.  A.  Palmer,  S.  K.  Sims,  C.  A.  Edwards,  J.  L.  Ashurst,  L.  Wilming,  M.  C.  Jones,  R.  Horton,  S.  E.  Hunt,  C.  E.  Scott,  J.  G.  R.  Gilbert,  M.  E.  Clamp,  G.  Bethel,  S.  Milne,  R.  Ainscough,  J.  P.  Almeida,  K.  D.  Ambrose,  T.  D.  Andrews,  R.  I.  S.  Ashwell,  A.  K.  Babbage,  C.  L.  Bagguley,  J.  Bailey,  R.  Banerjee,  D.  J.  Barker,  K.  F.  Barlow,  K.  Bates,  D.  M.  Beare,  H.  Beasley,  O.  Beasley,  C.  P.  Bird,  S.  Blakey,  S.  Bray-Allen,  J.  Brook,  A.  J.  Brown,  J.  Y.  Brown,  D.  C.  Burford,  W.  Burrill,  J.  Burton,  C.  Carder,  N.  P.  Carter,  J.  C.  Chapman,  S.  Y.  Clark,  G.  Clark,  C.  M.  Clee,  S.  Clegg,  V.  Cobley,  R.  E.  Collier,  J.  E.  Collins,  L.  K.  Colman,  N.  R.  Corby,  G.  J.  Coville,  K.  M.  Culley,  P.  Dhami,  J.  Davies,  M.  Dunn,  M.  E.  Earthrowl,  A.  E.  Ellington,  K.  A.  Evans,  L.  Faulkner,  M.  D.  Francis,  A.  Frankish,  J.  Frankland,  L.  French,  P.  Garner,  J.  Garnett,  M.  J.  R.  Ghori,  L.  M.  Gilby,  C.  J.  Gillson,  R.  J.  Glithero,  D.  V.  Grafham,  M.  Grant,  S.  Gribble,  C.  Griffiths,  M.  Griffiths,  R.  Hall,  K.  S.  Halls,  S.  Hammond,  J.  L.  Harley,  E.  A.  Hart,  P.  D.  Heath,  R.  Heathcott,  S.  J.  Holmes,  P.  J.  Howden,  K.  L.  Howe,  G.  R.  Howell,  E.  Huckle,  S.  J.  Humphray,  M.  D.  Humphries,  A.  R.  Hunt,  C.  M.  Johnson,  A.  A.  Joy,  M.  Kay,  S.  J.  Keenan,  A.  M.  Kimberley,  A.  King,  G.  K.  Laird,  C.  Langford,  S.  Lawlor,  D.  A.  Leongamornlert,  M.  Leversha,  C.  R.  Lloyd,  D.  M.  Lloyd,  J.  E.  Loveland,  J.  Lovell,  S.  Martin,  M.  Mashreghi-Mohammadi,  G.  L.  Maslen,  L.  Matthews,  O.  T.  McCann,  S.  J.  McLaren,  K.  McLay,  A.  McMurray,  M.  J.  F.  Moore,  J.  C.  Mullikin,  D.  Niblett,  T.  Nickerson,  K.  L.  Novik,  K.  Oliver,  E.  K.  Overton-Larty,  A.  Parker,  R.  Patel,  A.  V.  Pearce,  A.  I.  Peck,  B.  Phillimore,  S.  Phillips,  R.  W.  
Plumb,  K.  M.  Porter,  Y.  Ramsey,  S.  A.  Ranby,  C.  M.  Rice,  M.  T.  Ross,  S.  M.  Searle,  H.  K.  Sehra,  E.  Sheridan,  C.  D.  Skuce,  S.  Smith,  M.  Smith,  L.  Spraggon,  S.  L.  Squares,  C.  A.  Steward,  N.  Sycamore,  G.  Tamlyn-Hall,  J.  Tester,  A.  J.  Theaker,  D.  W.  Thomas,  A.  Thorpe,  A.  Tracey,  A.  Tromans,  B.  Tubby,  M.  Wall,  J.  M.  Wallis,  A.  P.  West,  S.  S.  White,  S.  L.  Whitehead,  H.  Whittaker,  A.  Wild,  D.  J.  Willey,  T.  E.  Wilmer,  J.  M.  Wood,  P.  W.  Wray,  J.  C.  Wyatt,  L.  Young,  R.  M.  Younger,  D.  R.  Bentley,  A.  Coulson,  R.  Durbin,  T.  Hubbard,  J.  E.  Sulston,  I.  Dunham,  J.  Rogers  &  S.  Beck*
0	The  Wellcome  Trust  Sanger  Institute,  Wellcome  Trust  Genome  Campus,  Hinxton,  Cambridge  CB10  1SA,  UK
0	Following  the  announcement  of  the  completion  of  the  human  genome  project  on  14  April  2003,  we  present  here  our  findings  on  the  mapping,  sequencing  and  analysis  of  chromosome  6.  Chromosome  6  was  best  known  for  the  major  histocompatibility  complex  (MHC),  a  region  of  3.6  megabases  (Mb)  on  band  6p21.3  of  the  short  arm.  The  MHC  has  an  essential  role  in  the  innate  and  adaptive  immune  system,  and  is  characterized  by  high  gene  density,  high  polymorphism  and  high  linkage  disequilibrium.  Much  of  what  we  know  today  about  genetic  variation  and  the  organization  of  haplotypes  was  first  discovered  from  studies  of  this  region.  At  a  time  when  genetic  variation  was  assessed  by  serology  rather  than  sequence,  the  term  `haplotype'  was  first  introduced  to  describe  "the  combination  of  individual  antigenic  [MHC]  determinants  that  are  positively  controlled  by  an  allele"1.  Because  of  its  crucial  role  in  immunity  and  its  association  with  many  common  diseases,  the  MHC  was  sequenced  well  ahead  of  the  rest  of  chromosome  6  (ref.  2).  Particular  care  was  taken  to  ensure  that  the  highest  quality  was  achieved  for  the  sequence,  analysis  and  annotation  of  chromosome  6.  The  annotation  of  all  gene  structures  was  manually  checked  and,  in  some  cases,  led  to  the  correction  of  known  reference  genes.  In  addition  to  the  genome  sequences  of  Mus  musculus  and  Tetraodon  nigroviridis,  the  comparative  analysis  was  enhanced  by  the  inclusion  (for  the  first  time  in  the  analysis  of  human  chromosomes)  of  the  recently  assembled  genomes  of  Rattus  norvegicus,  Fugu  rubripes  and  Danio  rerio.  Our  analysis  is  available  through  the  new  vertebrate  genome  annotation  (VEGA)  database  (http://vega.sanger.ac.uk/),
0	making  the  chromosome  6  annotation  a  high-quality  and  instantly  available  resource.
0	Clone  map  and  sequence  map
0	Bacterial clone contigs were assembled using restriction enzyme fingerprinting and sequence-tagged site (STS) content analysis of the clones, anchored to a radiation hybrid (RH) map with a marker density of 16 per Mb. A tiling path of 1,797 clones and polymerase chain reaction (PCR) fragments (see Supplementary Table S1) was selected for sequencing, spanning the chromosome in nine contigs separated by gaps of 50-200 kilobases (kb), as estimated by DNA fibre fluorescence in situ hybridization (FISH) (see Supplementary Table S2). All but two gaps (gaps 2 and 6) reside in the pericentromeric or sub-telomeric chromosomal regions. We assessed the chromosome coverage in several ways. First, 38% of the clones selected for sequencing were hybridized to metaphase chromosomes using FISH. This provided independent support of the map construction and also highlighted the presence of intra- and interchromosomal repeats. Next we identified known chromosome 6 markers in both genetic (deCODE [ref. 3] and Marshfield comprehensive genetic maps [ref. 4]) and RH maps (n = 3,036). D6S1694 was the only genetic marker found to be absent from the sequence. The position of D6S1694 on these maps indicates that it is likely to reside within gap 6, between the sequences AL135906 and AL731777. We also accounted for all RefSeq genes mapping to chromosome 6. In the final sequence, no RefSeq gene was entirely missing. Three RefSeq
0	Nature  Publishing  Group
0	MICROARRAY  TECHNOLOGIES  Creation  of  a  minimal  tiling  path  of  genomic  clones  for  Drosophila:  provision  of  a  common  resource
1	Volker  Hollich1,  Eric  Johnson2,  Eileen  E.  Furlong3,  Boris  Beckmann1,  Joseph  Carlson4,  Susan  E.  Celniker4,  and  Joerg  D.  Hoheisel1
0	INTRODUCTION Representing the entire genome of an organism on DNA microarrays, rather than the coding regions only, is a prerequisite for various functional analyses, such as chromatin immunoprecipitation experiments (1). Even for transcriptional profiling analyses it could be advantageous, since comprehensive coverage would by definition represent a complete and normalized gene repertoire irrespective of the status of sequence annotation. To produce a genomic tiling path, a large set of PCR primers is typically designed on the basis of the genome sequence. A recent publication (2) reports on experiments performed on a relatively small set of such fragments that represent in total about 3 Mb of the Drosophila chromosomes 2 and 3. However, this approach is rather time-consuming and expensive. For coverage of the entire 115-Mb Drosophila sequence with 3-kb non-overlapping fragments, more than 76,000 primer molecules would be needed. Alternatively, the very DNA fragments on which the sequencing process was performed could be utilized to such an end. Since shotgun clones usually form the basis of large-scale sequencing projects, all fragments could be readily amplified with a single primer pair, thus creating enormous savings in time and expense. Slightly disadvantageous is the fact that the fragments cannot be placed end-to-end, but would overlap in part. Thus, slightly more fragments would be needed to cover a genome. However, a certain degree of redundancy in coverage may prove to be beneficial for analytical purposes. Adopting the latter strategy, we set out to cover the genome of Drosophila melanogaster by selecting a minimal tiling path across the entire genome from the bacterial artificial chromosome (BAC)-based subclone libraries used in the sequencing project (3).
0	282 BioTechniques
0	MATERIALS AND METHODS Clone Selection Based on the sequencing data, a minimal tiling path was calculated for each subclone contig. This was accomplished by construction of a directed acyclic graph for every contig. Within this graph, each clone is represented by a vertex, and the set of vertices within the contig is called V. An edge between two vertices is intro-
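The primer estimate quoted in the Introduction can be checked with a short calculation. The Python sketch below assumes two primers (one forward, one reverse) per fragment, which is an assumption made here for illustration; the genome and fragment sizes are the figures stated in the text.

```python
GENOME_BP = 115_000_000    # 115 Mb, as stated in the text
FRAGMENT_BP = 3_000        # 3-kb non-overlapping fragments
PRIMERS_PER_FRAGMENT = 2   # assumption: one forward and one reverse primer


def primers_needed(genome_bp, fragment_bp, per_fragment=PRIMERS_PER_FRAGMENT):
    """Return (number of fragments, number of primers) for full coverage."""
    fragments = -(-genome_bp // fragment_bp)  # ceiling division
    return fragments, fragments * per_fragment


fragments, primers = primers_needed(GENOME_BP, FRAGMENT_BP)
print(fragments, primers)  # 38334 76668
```

Under this assumption about 38,000 fragments and roughly 76,700 primers would be required, in line with the figure of more than 76,000 primer molecules given in the text, whereas shotgun clones sharing one vector need only a single primer pair.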
0	ing  project--sublibraries  covering  regions  1-11  of  chromosome  X  and  all  of  the  left  arm  of  chromosome  3  as  well  as  a  global  shotgun  library--had  been  destroyed  prior  to  the  start  of  this  initiative.  To  construct  the  tiling  path,  we  initially  determined  the  sequence  positions  of  the  subclones  within  the  regions  that  are  defined  by  638  BAC  clones  (5).  This  included  not  only  subclones,  which  had  been  produced  from  the  respective  BAC,  but  also  subclones  derived  from  P1  clones  generated  during  an  earlier  phase  of  the  sequencing  project.  The  D.  melanogaster  chromosome  arms  of  euchromatic  sequence  Release  3  (6)  had  been  constructed  by  joining  the  individual  sequences  that  represent  the  BAC  clone  inserts.  As  a  result,  the  location  of  each  BAC  within  an  arm  is  known  precisely.  As  a  control,  we  compared  the  distance  of  the  BAC  end  sequences  within  the  genomic  sequence  and  the  actual  length  of  each  BAC  insert  used  in  our  analysis.  On  the  template  of  overlapping  BAC  sequences,  the  position  of  the  shotgun  clones  was  extracted  from  the  Phrap  sequence  assembly,  thus  defining  the  start  and  end  of  each  subclone  insert.  Subsequently,  overlapping  subclones  were
0	combined  into  contigs.  Because  of  both  unfinished  BACs  and  missing  shotgun  clones,  however,  2641  gaps  remained  in  addition  to  the  absent  X(1-11)  and  3L  areas  (Figure  1).  These  gaps  could  not  be  filled  with  2-kb  clones  from  the  whole  genome  shotgun  approach,  since  these  clones  were  not  available  either.  Since  the  tiling  path  is  based  on  randomly  produced  fragments,  there  is  bound  to  be  some  overlap  between  them.  However,  as  known  from  earlier  analyses  (e.g.,  References  2  and  7),  this  is  rather  an  advantage  (e.g.,  increasing  resolution  and  providing  some  degree  of  redundancy).  In  the  selection  process  of  a  minimal  path,  minimizing  the  degree  of  overlap  between  clones  on  2L  gave  rise  to  1.1%  more  clones,  whereas  aiming  at  a  minimal  total  of  clones  resulted  in  39.2%  more  overlap.  This  is  due  to  the  variation  in  clone  lengths.  Thus,  the  clone  with  the  least  overlap  might  be  shorter  than  another  clone,  which  spans  further.  Analyses  on  the  other  arms  led  to  similar  results.  As  the  overlap-optimized  path  has  only  a  small  percentage  of  additional  clones,  we  decided  to  base  our  minimal  tiling  path  on  this  selection  process,  resulting  in  a  set  of  25,135  clones.  In  Figure  1,  the  coverage  of  the  chromo-
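The trade-off between the two selection criteria discussed above (fewest clones versus least overlap) can be illustrated with a small shortest-path search over a directed acyclic graph. The Python sketch below is not the authors' implementation. The edge rule (a clone is linked to any clone that starts within it and extends beyond it), the cost functions and the example coordinates are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Clone = Tuple[int, int]  # (start, end) position of a subclone insert in a contig


def tiling_path(clones: List[Clone],
                edge_cost: Callable[[Clone, Clone], float]) -> List[Clone]:
    """Return the cheapest chain of overlapping clones spanning the contig."""
    order = sorted(set(clones))            # sorting by start is a topological order
    left = order[0][0]
    right = max(end for _, end in order)
    inf = float("inf")
    best = [0.0 if start == left else inf for start, _ in order]
    prev = [-1] * len(order)
    for j, v in enumerate(order):
        for i in range(j):
            u = order[i]
            # Assumed edge rule: v starts inside u and reaches further than u.
            if best[i] < inf and u[0] < v[0] <= u[1] and v[1] > u[1]:
                cost = best[i] + edge_cost(u, v)
                if cost < best[j]:
                    best[j], prev[j] = cost, i
    ends = [j for j, (_, end) in enumerate(order)
            if end == right and best[j] < inf]
    if not ends:
        raise ValueError("clones do not form a single contig")
    j = min(ends, key=lambda k: best[k])
    path = []
    while j != -1:                          # walk back from the last clone
        path.append(order[j])
        j = prev[j]
    return path[::-1]


def fewest_clones(u: Clone, v: Clone) -> float:
    """Every additional clone costs the same."""
    return 1.0


def least_overlap(u: Clone, v: Clone) -> float:
    """Cost is the number of positions shared by consecutive clones."""
    return float(u[1] - v[0])


# Hypothetical contig with clones of varying length.
clones = [(0, 100), (90, 150), (50, 300), (140, 300), (290, 400)]
print(tiling_path(clones, fewest_clones))  # [(0, 100), (50, 300), (290, 400)]
print(tiling_path(clones, least_overlap))  # [(0, 100), (90, 150), (140, 300), (290, 400)]
```

In this toy contig the overlap-optimized path uses one more clone but halves the total overlap (30 instead of 60 positions), which mirrors the reasoning in the text: because clone lengths vary, the clone with the least overlap is not necessarily the one that spans furthest.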
0	MICROARRAY  TECHNOLOGIES
0	Open  Access
1	M  Hild¤*,  B  Beckmann¤,  SA  Haas¤,  B  Koch*,  V  Solovyev§,  C  Busold,  K  Fellenberg,  M  Boutros¶,  M  Vingron,  F  Sauer*¥,  JD  Hoheisel  and  R  Paro*
0	An  integrated  gene  annotation  and  transcriptional  profiling  approach  towards  the  full  gene  content  of  the  Drosophila  genome
0	Hild et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
0	Background:  While  the  genome  sequences  for  a  variety  of  organisms  are  now  available,  the  precise  number  of  the  genes  encoded  is  still  a  matter  of  debate.  For  the  human  genome  several  stringent  annotation  approaches  have  resulted  in  the  same  number  of  potential  genes,  but  a  careful  comparison  revealed  only  limited  overlap.  This  indicates  that  only  the  combination  of  different  computational  prediction  methods  and  experimental  evaluation  of  such  in  silico  data  will  provide  more  complete  genome  annotations.  In  order  to  get  a  more  complete  gene  content  of  the  Drosophila  melanogaster  genome,  we  based  our  new  D.  melanogaster  whole-transcriptome  microarray,  the  Heidelberg  FlyArray,  on  the  combination  of  the  Berkeley  Drosophila  Genome  Project  (BDGP)  annotation  and  a  novel  ab  initio  gene  prediction  of  lower  stringency  using  the  Fgenesh  software.  Results:  Here  we  provide  evidence  for  the  transcription  of  approximately  2,600  additional  genes  predicted  by  Fgenesh.  Validation  of  the  developmental  profiling  data  by  RT-PCR  and  in  situ  hybridization  indicates  a  lower  limit  of  2,000  novel  annotations,  thus  substantially  raising  the  number  of  genes  that  make  a  fly.  Conclusions:  The  successful  design  and  application  of  this  novel  Drosophila  microarray  on  the  basis  of  our  integrated  in  silico/wet  biology  approach  confirms  our  expectation  that  in  silico  approaches  alone  will  always  tend  to  be  incomplete.  The  identification  of  at  least  2,000  novel  genes  highlights  the  importance  of  gathering  experimental  evidence  to  discover  all  genes  within  a  genome.  
Moreover,  as  such  an  approach  is  independent  of  homology  criteria,  it  will  allow  the  discovery  of  novel  genes  unrelated  to  known  protein  families  or  those  that  have  not  been  strictly  conserved  between  species.
0	Genome  Biology  2003,  5:R3
0	Results  and  discussion
0	Combined  annotation
0	To overcome the known limitations in gene prediction, we constructed our Drosophila transcriptome microarray by first combining the BDGP Drosophila genome annotation Release 2 with the BDGP cDNA collection Release 1 [15] and then adding an ab initio prediction based on the Fgenesh software [16]. We merged the combined BDGP set with the 20,622 Fgenesh-predicted genes (Heidelberg Prediction, Heidelberg Collection (HDC)), on the assumption that predictions sharing more than 30% of their exon sequences represent the same gene; this yielded a set of 21,396 potential genes (Figure 1). The fact that nearly 97% of the BDGP genes were also predicted by Fgenesh validates our overlap criterion, yet we still found a further 7,464 predicted genes (36.2%; HDC unique) that are not represented in the BDGP annotation.
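The merging rule described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the exon coordinates and gene names are hypothetical, and taking the 30% threshold relative to the shorter of the two predictions is an assumption, since the text does not state the denominator.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # half-open interval [start, end) on one strand of one arm


def exon_length(exons: List[Exon]) -> int:
    """Total number of exonic bases in a prediction."""
    return sum(end - start for start, end in exons)


def shared_exon_bases(a: List[Exon], b: List[Exon]) -> int:
    """Bases covered by exons of both predictions.

    Assumes exons within a single prediction do not overlap each other.
    """
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a
        for s2, e2 in b
    )


def same_gene(a: List[Exon], b: List[Exon], threshold: float = 0.30) -> bool:
    # Assumption: the overlap fraction is measured against the shorter prediction.
    shorter = min(exon_length(a), exon_length(b))
    return shorter > 0 and shared_exon_bases(a, b) / shorter > threshold


def merge_predictions(
    reference: Dict[str, List[Exon]],
    candidates: Dict[str, List[Exon]],
    threshold: float = 0.30,
) -> Tuple[int, List[str]]:
    """Return (size of the merged set, names of candidates with no reference match)."""
    unique = [
        name
        for name, exons in candidates.items()
        if not any(same_gene(exons, ref, threshold) for ref in reference.values())
    ]
    return len(reference) + len(unique), unique


# Hypothetical toy data: one reference gene and three ab initio predictions.
reference = {"ref_gene": [(100, 200), (300, 400)]}
candidates = {
    "pred_1": [(150, 200), (300, 380)],   # shares most of its exons with ref_gene
    "pred_2": [(1000, 1200)],             # no overlap at all
    "pred_3": [(190, 200), (500, 800)],   # only 10 shared bases
}
total, unique = merge_predictions(reference, candidates)
print(total, unique)
```

In this toy case only `pred_1` is folded into the reference gene, so the merged set holds three potential genes, two of them unique to the ab initio prediction.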
0	Computational  analysis  of  the  combined  annotation
0	The  simplest  explanation  for  the  high  number  of  HDC  unique  predictions  may  be  the  relaxed  stringency  criterion  applied.  Consequently,  a  careful  inspection  of  the  two  sets  (BDGP/FlyBase  versus  HDC)  showed  a  high  degree  of  similarity  for  most  common  predictions;  differences  were  largely  confined  to  the  5'  and  3'  ends  of  the  predictions  as  may  be  expected.  This  is  not  only  because  ab  initio  gene  prediction  algorithms  have  most  difficulties  in  locating  the  precise  ends  of  a  gene,  but  also  because  the  HDC  predictions  contain  only  coding  regions  -  while  the  BDGP/FlyBase  annotat
0	Shotgun  DNA  microarrays  and  stage-specific  gene  expression  in  Plasmodium  falciparum  malaria
0	© 2000 Blackwell Science Ltd
0	unravelling  additional  important  aspects  of  malaria  biology  and  the  general  approach  may  be  applied  to  any  organism,  regardless  of  how  much  of  its  genome  is  sequenced.
0	Introduction  In  the  fight  against  malaria,  there  are  only  eight  commonly  used  drugs  and  no  reliable  vaccines  (White,  1996;  Holder,  1999).  Many  strains  of  the  malaria  parasite  Plasmodium  falciparum  are  now  resistant  to  our  antimalarial  compounds  (Peters,  1998)  and,  in  some  parts  of  the  world,  resistance  to  new  antimalarial  agents  may  be  occurring  faster  than  before  (Rathod  et  al.,  1997).  To  help  overcome  these  problems,  global  malaria  initiatives  have  invested  heavily  in  sequencing  the  Plasmodium  falciparum  genome  and  the  next  challenge  is  to  correlate  genome  sequences  to  function  (Wellems  et  al.,  1999).  Based  on  sequencing  efforts  to  date,  about  half  the  malarial  genome  coding  regions  will  have  unknown  function  (Gardner  et  al.,  1998;  Bowman  et  al.,  1999).  Relating  these  genome  sequences  to  malaria  biology  will  be  particularly  challenging  because  the  experimental  tools  to  study  malaria  are  limited  (Wellems  et  al.,  1999).  First,  most  species  of  malarial  parasites  and  most  stages  of  P.  falciparum  cannot  be  routinely  maintained  in  cell  culture.  Even  the  erythrocytic  cycle  of  P.  falciparum,  which  can  be  cultured,  is  very  slow,  labour  intensive,  and  expensive  to  propagate.  Second,  the  experimental  power  of  transfection  technology  in  P.  falciparum  and  other  malarial  species  is  restricted  at  present.  Although  the  erythrocytic  stages  can  be  transfected,  gene  disruptions  are  only  possible  for  non-essential  genes,  as  this  part  of  the  parasite  life  cycle  is  haploid  (Wellems  et  al.,  1999).  Gene  replacement  is  not  possible  because  there  is  no  negative  selection  system.  Transfection  efficiencies  in  P.  
falciparum  are  so  poor  that  no  gene  function  has  been  established  purely  on  the  basis  of  genetic  complementation  with  a  library  of  malarial  genes  or  through  a  population  of  random  knock  outs.  Finally,  as  the  complete  sexual  life  cycle  of  P.  falciparum  can  only  be  studied  in  mosquitoes  and  as  yet  not  in  vitro,  classical  genetics  can  only  be  performed  with  great  difficulty  (Walliker  et  al.,  1987).  Not  surprisingly,  only  two  genetic  crosses  have  been  performed  with  malaria  parasites  and  only  a  handful  of  traits  have  been  mapped  (Walliker  et  al.,  1987;  Vaidya  et  al.,  1995;  Wang  et  al.,  1997;  Wellems  et  al.,  1999).
0	Shotgun DNA microarray for malaria
0	Clearly, there is an urgent need for additional methods for assessing gene function in malaria. Recently, it has become possible to decipher transcriptional programmes of organisms by studying gene expression en masse (Brown and Botstein, 1999). DNA microarray technologies offer an opportunity to look at changes in gene expression in thousands of genes simultaneously under different physiological conditions (DeRisi et al., 1997; DeRisi and Iyer, 1999). Because the malarial genome is not completely sequenced, a variation on the standard array technology was used in this study. Inserts from a malarial genomic library were arrayed randomly to generate `shotgun' microarrays. To measure variation in expression of genes during the parasite life cycle, the arrays were probed with differentially labelled cDNAs prepared from total RNA isolated from cells at defined developmental stages. PCR products on the array that showed differential hybridization were sequenced.
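As a rough illustration of how spots with differential hybridization might be picked for sequencing, the sketch below ranks spots by the log2 ratio of two fluorescence channels. The clone names, intensities, intensity floor and two-fold cut-off are all hypothetical; the source does not specify a threshold or a normalization scheme.

```python
import math
from typing import Dict, List, Tuple


def log2_ratio(cy5: float, cy3: float, floor: float = 1.0) -> float:
    """log2(Cy5/Cy3), with a small floor to avoid dividing by zero."""
    return math.log2(max(cy5, floor) / max(cy3, floor))


def differential_spots(
    spots: Dict[str, Tuple[float, float]],
    min_abs_log2: float = 1.0,  # assumed cut-off: at least two-fold in either direction
) -> List[Tuple[str, float]]:
    """Return (spot, log2 ratio) pairs beyond the cut-off, strongest first.

    `spots` maps a spot name to its (Cy3, Cy5) intensities.
    """
    ratios = [(name, log2_ratio(cy5, cy3)) for name, (cy3, cy5) in spots.items()]
    selected = [(name, r) for name, r in ratios if abs(r) >= min_abs_log2]
    return sorted(selected, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical intensities as (Cy3, Cy5) per arrayed clone.
spots = {
    "clone_001": (1200.0, 150.0),  # stronger in the Cy3-labelled sample
    "clone_002": (400.0, 420.0),   # essentially unchanged
    "clone_003": (90.0, 1440.0),   # stronger in the Cy5-labelled sample
}
for name, ratio in differential_spots(spots):
    print(name, ratio)
```

Working on a log scale treats a two-fold increase and a two-fold decrease symmetrically, which is why the ranking uses the absolute value of the log ratio.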
0	et al., 1984; Vernick and McCutchan, 1998). Such digestion was expected to capture long stretches of unique coding regions and avoid over-representation of flanking sequences or introns on the array. Individual colonies from the unamplified library were immediately transferred to a 96-well plate. Amplified inserts from 8000 independent clones were analysed by agarose gel electrophoresis. Only PCR products greater than about 300 bp were applied on the DNA array. The average size of the insert applied to the array was 1–2 kb, but some clones had PCR products as large as 5 kb. In addition to clones from this library, several previously characterized genes encoding stage-specific malarial surface antigens (MSP1, Pfs25, Pfs28, Pfs48/45) were included in the prototype array (Holder, 1988; Kaslow et al., 1988; Duffy et al., 1993; Kocken et al., 1993).
0	Transcriptional differences between trophozoites and gametocytes
0	The usefulness of the shotgun microarray for analysing malarial transcription programmes was evaluated by comparing gene expression between two differentiated forms of Plasmodium. Trophozoite-specific RNA was used as a template to generate Cy3-labelled cDNA (green fluorescence) and late-stage gametocyte-specific RNA was used to generate Cy5-labelled cDNA (red fluorescence). Equal amounts of the two labelled cDNA populations were mixed and hybridized to the shotgun microarray. Fluorescence signals from Cy3 and Cy5 label were separately measured at each spot on the array using a
0	Results and discussion
0	Array construction
0	The malaria shotgun microarray was constructed by printing 3648 PCR-amplified inserts from a P. falciparum DNA library (Fig. 1). To provide as complete a representation of genes as possible, and to minimize bias towards specific sequences, a mung bean nuclease genomic library was used. Mung bean nuclease preferentially cuts malarial DNA in regions flanking coding regions (McCutchan
0	© 2000 Blackwell Science Ltd, Molecular Microbiology, 35, 6–14
0	R. E. Hayward et al.
0	genes). Third, the 50 arrayed genes showing the highest red/green fluorescence and the 35 genes with the highest green/red fluorescence were sequenced and found to include several previously known stage-specific genes (Table 2A and B). Among the trophozoite-selective gene transcripts identified in this way, MSP-1 was represented twice (Table 2A). Other transcripts such as HRP-1 (histidine-rich protein-1), RAP-1 (rhoptry-associated protein-1) and PfEMP-3 (P. falciparum erythrocyte membrane protein 3) were also found to be trophozoite-specific in comparison to stage IV–V gametocytes. The stage-specific expression of these proteins is consistent with association of knob proteins, rhoptry proteins, PfEMP-3 and merozoite function in asexual stage parasites (Holder et al., 1985; Ellis et al., 1987; Holder, 1988; Pasloske et al., 1993), but not in late stage (III–V) gametocytes (Day et al., 1998). Among the sexual stage-selective transcripts, we identified sequences coding for the known gametocyte-specific genes Pfg377 and Pfs2400 (11.1 gene) (Table 2B, Fig. 3A;
0	scanning confocal microscope (DeRisi et al., 1997). The red/green fluorescence ratio provided a measure of the relative abundance of transcripts, from each DNA segment represented on the array, in trophozoites compared with late-stage gametocytes (Fig. 2A; the raw data from this hybridization and all the figures in this publication may be accessed on the web at http://derisilab.ucsf.edu/malaria/).
0	Reliability
0	The faithfulness of the shotgun DNA microarray for reporting stage-specific gene expression was apparent in four ways. First, three separate hybridizations from three independent cDNA preparations showed virtually identical differential hybridization patterns (Fig. 2B). Second, genes such as Pfs25, Pfs28, Pfs48/45 and MSP1, which were known to be expressed in a stage-selective fashion and which were applied to the microarray as controls, exhibited 
0	ANALYTICAL  BIOCHEMISTRY
0	A  combined  oligonucleotide  and  protein  microarray  for  the  codetection  of  nucleic  acids  and  antibodies  associated  with  human  immunodeficiency  virus,  hepatitis  B  virus,  and  hepatitis  C  virus  infections
1	Agnès Perrin,a,* David Duracher,b Magali Perret,c Philippe Cleuziat,b and Bernard Mandrand a
0	UMR 2142 CNRS-bioMérieux, 46 allée d'Italie, 69364 Lyon Cedex 07, France; Apibio, Zone ASTEC, 15 rue des Martyrs, 38054 Grenoble Cedex 9, France; UMR 2142 CERVI IFR INSERM 74, 24 avenue Tony Garnier 69365, Lyon Cedex 07, France
0	Keywords:  Hybridization;  DNA;  Microtiterplate  well;  Densitometry;  Enzyme  substrate;  Alkaline  phosphatase;  Immunoassays;  ELISA;  Complexity;  Multidetection
0	Coinfections  by  hepatitis  B  (HBV)1  and  C  (HCV)  viruses  are  frequent  in  seropositive  patients  infected  with  human  immunodeficiency  virus  (HIV)  since  the  same
0	routes of transmission are shared by these viruses (drug abusers, blood transfusion, etc.) [1]. Diagnosis and therapy follow-up of such associated diseases are possible by the combination of several individual assays for testing pertinent parameters. The immune response to HIV type 1 (HIV-1) is oriented mainly against gag and env glycoproteins, but a period of about 3 weeks is observed between contamination and appearance of anti-HIV antibodies. During this period, p24 protein is present in the serum of most patients. The recent emergence of combined assays for the codetection of p24 antigenemia and anti-HIV antibody titer (e.g., the HIV Duo assay, bioMérieux) allows
0	reducing  the  delay  between  contamination  and  diagnosis  [2].  Quantification  of  HIV-1  genome  is  achieved  by  molecular  techniques,  which  take  on  more  importance  since  they  are  extremely  sensitive  [3],  viral  load  RNA  being  predictive  of  CD4  decline,  acquired  immune  deficiency  syndrome  progression,  and  patient  survival  [4].  In  the  case  of  HBV  infection,  the  presence  of  plasma  hepatitis  B  surface  antigen  (HBs-Ag)  indicates  an  active  HBV  infection  [5].  Furthermore,  testing  HBV  DNA  levels  during  therapy  may  allow  early  recognition  of  patients  who  do  not  respond  to  therapy  [3],  as  both  the  DNA  and  the  protein  are  often  associated  for  HBV  follow-up  [6].  On  the  other  hand,  appearance  of  antiHBs  antibodies  is  an  indicator  of  patient  recovery.  Detection  of  HCV  infection  by  a  HCV  positivity  has  been  facilitated  by  the  development  of  antibody  assays  [7].  However,  these  methods  are  of  restricted  use  due  to  the  period  of  several  weeks  between  infection  and  seroconversion  [8].  Alternatively,  amplification  of  viral  nucleic  acid  is  an  effective  means  for  direct  HCV  quantification  [9].  Many  commercial  tests  currently  available  permit  the  detection  of  each  of  these  parameters  in  separate  assays.  Emerging  protein  microarray  technology  enabling  one  to  set  up  more  complex  systems  such  as  antigen  microarrays  for  serodiagnosis  of  several  infectious  diseases  [10]  has  been  proposed.  Other  generic  array  formats  designed  for  the  detection  of  a  wider  range  of  infectious  or  toxic  substances  have  been  proposed,  notably  by  Lee  et  al.  [11]  or  Yang  et  al.  [12].  These  chips  could  be  used  indiscriminately  for  either  immunoassays  or  DNA  hybridization.  
Multiplexed  assays  based  on  tagged  microspheres  are  also  well  adapted  for  versatile  applications  targeting  proteomics  or  genomics  [13,14].  But  to  our  knowledge,  no  description  of  a  technique  allowing  the  simultaneous,  real-time  codetection  of  immunological  and  DNA  hybridization  reactions  has  been  made  in  the  literature.  Our  proposal  in  this  work  is  a  microarray  based  on  a  standard  96-well  microplate  format  for  which  the  potential  as  a  protein  microarray  has  already  been  demonstrated  [15].  Each  well  is  functionalized  by  16  spots  comprising  nucleic  acids  and  viral  proteins,  each  of  these  probes  allowing  the  detection  of  a  parameter  relevant  for  the  diagnosis  or  follow-up  of  three  frequently  associated  viral  infections  (HIV,  HBV,  HCV).  Immunological  models  are  chosen  so  that  a  systematic  comparison  is  possible  between  CombOLISA  and  validated  immunoassay  platforms  such  as  ELISA  in  microtiter  plates  or  the  VIDAS  automat.
0	(SK431: TGCTATGTCAGTTCCCCTTGGTTCTCT and SK462: AGTTGGAGGACATCAAGCAGCCATGCAAAT) [15] and 5′-aminated probe for amplified products' capture (CHIV: GAGACCATCAATGAGGAAGCTGCAGAATGGGAT) [16] were synthetized by Eurogentec (Seraing, Belgium) as were all other oligonucleotides.
0	HCV
0	RNA targets from HCV were extracted from serum of chronically infected patients using Nucleospin RNA Virus Kit (Macherey-Nagel, Hoerdt, France) and amplified by RT-PCR with 5′-biotinylated primers (RC21: CTCCCGGGGCACTCGCAAGC and RC1: GTGTAGCCATGGCGTTAGTA) [17]. The 5′-aminated probe CHCV (CATAGTGGTCTGCGGAACCGGTGAGT) [18] was designed to capture biotinylated amplified products. HIV and HCV targets were amplified by RT-PCR under the following conditions using an Access kit from Promega (Madison, WI, USA): 1× AMV/Tfl reaction buffer, 1.8 mM MgSO4, 0.2 mM dNTP, 1 µM primers, 1 U of AMV reverse transcriptase, and 5 U of Tfl DNA polymerase; RT cycle 48 °C for 45 min; 35 PCR cycles (94 °C for 30 s, 60 °C for 1 min, 68 °C for 2 min); final extension at 68 °C for 7 min. PCR templates were analyzed on agarose gels stained with ethidium bromide and revealed under UV illumination. Concentrations of amplified products were evaluated by comparison to band density of a mass ladder (Eurogentec). HIV and HCV amplicons were 46 and 23 nM, respectively.
0	HBV
0	A synthetic single-stranded nucleic target (74 bp) (CCCAGTAAAGTTCCCCACCTTATGAGTCCAAGGAATTACTAACATTGAGATTCCCGAGATTGAGATCTTCTGCGA) from the HBV genome [19], a 5′-aminated capture probe for target hybridization (CHBV: ATCTCGGGAATCTCAATGTTAG), and a 5′-biotinylated detection probe that also hybridizes to the synthetic target (DHBV: TATTCCGACTCATAAGGTG) were synthetized.
0	Immunoassay
0	Recombinant HCV core protein, whose synthesis is described elsewhere [20], and HIV envelope glycoprotein GP160 were obtained from bioMérieux. HBs antigens were obtained from Hytest (Turku, Finland) for the Ay subtype and from Cliniqa (Fallbrook, CA, USA) for the Ad subtype. GP160 and HBs antigens were the same as those used for adsorption on receptacles of the VIDAS instrument in the HIV Duo kit and in the Anti-HBs Total kit, respectively. Two proteins (NSP1, NSP2) having no affinity in the present study were also spotted to verify immunological reaction specificity. Infected human sera were kindly provided by the Croix-Rousse
0	Materials and methods
0	Nucleic acid probe and DNA targets
0	HIV
0	HIV-1 RNA was bought from Ambion (Austin, TX, USA). Biotinylated primers for amplification
0	Hospital (Lyon, France). Alkaline phosphatase-labeled goat anti-human IgG (AP-GaH IgG) was from Jackson Immunoresearch (West Grove, PA, USA) and alkaline phosphatase-labeled streptavidin (AP-SA) was from Sigma (St. Quentin, France).
0	Microarray setup
0	Capture probes CHIV, CHBV, and CHCV were diluted at 10 µM in a coating buffer (150 mM Na2HPO4/NaH2PO4, 450 mM NaCl, 1 mM EDTA, pH 7.4). Nonspecific proteins (NSP1, bovine serum albumin; NSP2, human chorionic gonadotropin) were diluted at 50 µg/ml in 50 mM carbonate buffer, pH 9.3. GP160, HBs antigens, and HCV core proteins were diluted at 10 µg/ml in phosphate-buffered saline (PBS; 50 mM Na2HPO4/NaH2PO4, 150 mM NaCl, pH 7.4). Spotting was carried out with the Biochip Arrayer (Perkin-Elmer, Boston, MA, USA), which is based on a submicroliter noncontact, drop-on-demand piezoelectric dispensing technology providing a typical spot diameter of 250 µm. Each
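A capture probe can hybridize to a single-stranded target only where the probe's reverse complement occurs in the target. The sketch below checks this for the CHBV capture probe against the synthetic HBV target, using the sequences exactly as printed above (with the line-break spaces removed). It only tests for a perfect, contiguous match and says nothing about melting temperature or mismatch tolerance; the function names are illustrative.

```python
# Watson-Crick complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def binding_site(probe: str, target: str) -> int:
    """0-based start of the perfect-match site in `target`, or -1 if none."""
    return target.upper().find(reverse_complement(probe))


# Sequences copied from the text; spaces introduced by line breaks removed.
HBV_TARGET = (
    "CCCAGTAAAGTTCCCCACCTTATGAGTCCAAG"
    "GAATTACTAACATTGAGATTCCCGAGATTGAG"
    "ATCTTCTGCGA"
)
CHBV = "ATCTCGGGAATCTCAATGTTAG"

site = binding_site(CHBV, HBV_TARGET)
if site >= 0:
    print(f"CHBV pairs with target bases {site + 1}-{site + len(CHBV)}")
else:
    print("no perfect-match site found")
```

A perfect-match search is the strictest possible test; a probe that fails it may still bind with a few mismatches, which would need an alignment-based check instead.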
0	BRIEF  COMMUNICATIONS
0	Genotyping  over  100,000  SNPs  on  a  pair  of  oligonucleotide  arrays
1	Hajime  Matsuzaki,  Shoulian  Dong,  Halina  Loi,  Xiaojun  Di,  Guoying  Liu,  Earl  Hubbell,  Jane  Law,  Tam  Berntsen,  Monica  Chadha,  Henry  Hui,  Geoffrey  Yang,  Giulia  C  Kennedy,  Teresa  A  Webster,  Simon  Cawley,  P  Sean  Walsh,  Keith  W  Jones,  Stephen  P  A  Fodor  &  Rui  Mei
0	We  present  a  genotyping  method  for  simultaneously  scoring  116,204  SNPs  using  oligonucleotide  arrays.  At  call  rates  >99%,  reproducibility  is  >99.97%  and  accuracy,  as  measured  by  inheritance  in  trios  and  concordance  with  the  HapMap  Project,  is  >99.7%.  Average  intermarker  distance  is  23.6  kb,  and  92%  of  the  genome  is  within  100  kb  of  a  SNP  marker.  Average  heterozygosity  is  0.30,  with  105,511  SNPs  having  minor  allele  frequencies  >5%.
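For readers less familiar with the summary statistics quoted above, the following sketch computes minor allele frequency and heterozygosity for a single biallelic SNP from genotype counts. The counts are hypothetical, and whether the paper's "average heterozygosity" refers to the observed or the expected (2pq) value is not stated in this excerpt, so both are shown.

```python
from typing import NamedTuple


class SnpStats(NamedTuple):
    minor_allele_freq: float
    observed_het: float   # fraction of samples called heterozygous
    expected_het: float   # 2pq under Hardy-Weinberg proportions


def snp_stats(n_aa: int, n_ab: int, n_bb: int) -> SnpStats:
    """Summary statistics for one biallelic SNP from genotype call counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls for this SNP")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    return SnpStats(min(p, q), n_ab / n, 2.0 * p * q)


# Hypothetical counts: 50 AA, 40 AB and 10 BB calls among 100 samples.
stats = snp_stats(50, 40, 10)
print(stats)
print("passes a 5% MAF filter:", stats.minor_allele_freq > 0.05)
```

Averaging `expected_het` or `observed_het` over all SNPs on the array would give a figure comparable in kind to the average heterozygosity reported in the abstract.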
0	Single-nucleotide polymorphisms (SNPs) are emerging as the marker of choice for a broad spectrum of genetic analyses. Previously, we demonstrated a highly accurate approach for genotyping over 10,000 SNPs which combines reduction in genome complexity with the allele-discriminating specificity of oligonucleotide arrays1,2. Recent advancements in array technology, assay and algorithm development, together with new SNP content from
0	RNA  interference  microarrays:  High-throughput  loss-of-function  genetics  in  mammalian  cells
1	Jose  M.  Silva,  Hana  Mizuno,  Amy  Brady,  Robert  Lucito,  and  Gregory  J.  Hannon*
0	RNA interference (RNAi) is a biological process in which a double-stranded RNA directs the silencing of target genes in a sequence-specific manner. Exogenously delivered or endogenously encoded double-stranded RNAs can enter the RNAi pathway and guide the suppression of transgenes and cellular genes. This technique has emerged as a powerful tool for reverse genetic studies aimed toward the elucidation of gene function in numerous biological models. Two approaches, the use of small interfering RNAs and short hairpin RNAs (shRNAs), have been developed to permit the application of RNAi technology in mammalian cells. Here we describe the use of a shRNA-based live-cell microarray that allows simple, low-cost, high-throughput screening of phenotypes caused by the silencing of specific endogenous genes. This approach is a variation of "reverse transfection" in which mammalian cells are cultured on a microarray slide spotted with different shRNAs in a transfection carrier. Individual cell clusters become transfected with a defined shRNA that directs the inhibition of a particular gene of interest, potentially producing a specific phenotype. We have validated this approach by targeting genes involved in cytokinesis and proteasome-mediated proteolysis.
0	similarly using cell microarrays for loss-of-function genetics. This is accomplished by creating a microarray of living cells that have been transfected in situ with either small interfering RNAs (siRNAs) or with DNA constructs that direct the expression of short hairpin RNAs (shRNAs). These are effective at initiating a silencing response and in creating defined areas (spots) of cells in which suppression of a targeted gene generates an expected phenotype. Such arrays will find broad application to high-throughput, low-cost phenotype-based screens in mammalian cells.
0	Materials and Methods
0	Microarray  Printing  and  Reverse  Transfection.  Transfection  mixes
0	RNA interference (RNAi) has emerged as one of the standard techniques to study gene function in diverse experimental systems. Introduction of double-stranded RNA (dsRNA) into a cell decreases the level of the complementary mRNAs, producing a knockdown of the corresponding protein. The current model of the RNAi mechanism proposes that the silencing "trigger" is processed by Dicer into small RNAs of 21-22 nucleotides in length. These become incorporated into an RNA-induced silencing complex with endonuclease activity (RISC), which, in turn, identifies and cleaves homologous mRNAs (1, 2). Based on this mechanism, genome-wide RNAi approaches have been used successfully for phenotype-based screens in Caenorhabditis elegans (3-5) and Drosophila melanogaster (6, 7). In part, these successes derive from the availability of convenient and inexpensive methods for producing and introducing dsRNA. For example, it has previously been shown that RNAi can be triggered by soaking C. elegans in a solution of dsRNA (8), or by feeding worms with E. coli expressing gene-specific dsRNAs (9). In Drosophila cells, a soaking protocol is also available, providing an easy method of introducing dsRNA (10). Unfortunately, similarly straightforward approaches for triggering silencing have not been described in mammals. Analysis of multiple genes requires a "gene by gene" method, in which individual transfections must be performed, making these studies expensive, tedious, and dependent on high-throughput robotic systems. Cell microarrays represent a novel alternative to classical approaches to phenotype-based assays in mammalian cells. 
Cell  microarrays  were  first  described  by  Ziauddin  and  Sabatini  (11),  who  demonstrated  that  cells  grown  on  a  glass  substrate  could  take  up  DNA-lipid  complexes  that  had  been  deposited  on  the  slide  before  cells  were  plated.  Cells  essentially  became  transfected  in  situ,  with  defined  spots  of  transfected  cells  localized  over  the  printed  DNAs.  These  studies  demonstrated  the  use  of  conventional  DNA  constructs  for  creating  phenotypes  based  on  ectopic  expression.  Here  we  investigate  the  possibility  of
0	Reporter  Assays.  One  hundred  sixty  dots  containing  a  dual
0	reporter vector expressing GFP and dsRed fluorescent proteins (gift of Alla Karpova, Cold Spring Harbor Laboratory) and individual shRNAs were printed. All shRNAs were part of a library of U6 polymerase III promoter-driven hairpins (28). Four groups of experiments with 40 dots each were printed: the first group contained only dual reporter vector, the second group contained the reporter vector plus an shRNA or siRNA against firefly luciferase (Ff shRNA and Ff siRNA), the third group contained
0	Ninety-Six-Well  Plate  Analyses.  All  RNAi  microarray  results  were
0	validated  by  using  cells  transfected  in  96-well  tissue  culture  plates.  Cells  were  transfected  with  LT-1  (Mirus,  Madison,  WI)  according  to  the  manufacturer's  instructions  at  50-70%  confluence.  The  plasmids  containing  appropriate  constructs  were  cotransfected,  keeping  the  same  ratios  used  in  the  arrayed  slides  but  with  a  total  mass  of  100  ng  of  DNA  for  each  transfected  well.  Again,  results  were  analyzed  after  60  h  of  incubation.  Results
0	Targeting  Reporter  Genes  in  Situ  by  Using  siRNAs.  Given  previous
0	the reporter vector plus a shRNA or a siRNA against GFP that has no effect on the expression level of the protein (GFP shRNA-1 and GFP siRNA-1), and the last group contained the reporter vector plus a shRNA that reduces the GFP signal by 90% when tested in culture plates (GFP shRNA-2 and GFP siRNA-2). Several cell lines were tested for transfection: NIH 3T3, IMR90 E1A, HeLa, and HEK 293T. To test the stability of the printed array, we repeated the assay at different time points after printing: day 0, 1 week, 2 weeks, 4 weeks, and 2 months. For testing the stability of the transfection master mix, we stored the solution at 4°C and then printed new slides and assayed them at the time points described above.
0	Proteasome-Mediated Proteolysis Assays. Thirty shRNAs targeting different proteasome subunits were printed in triplicate. Every dot harbored an shRNA-expression vector, a plasmid expressing dsRed (dsRed N-1, Clontech), and a vector encoding a proteasome fluorescent reporter (ZsProSensor, Clontech). This reporter encodes a fusion protein that has been engineered to show varying levels of expression depending on the status of the proteasome pathway. Every transfection master mix contained 400 ng of dsRed vector, 100 ng of ZsProSensor, and 1 µg of shRNA plasmid. Twenty micrograms of total protein lysates was used for Western blot analysis. Rabbit anti-PSMC-6 subunit of the proteasome (Affinity, Biomol, Plymouth Meeting, PA), rabbit anti-ubiquitin (StressGen Biotechnologies, Victoria, Canada), and mouse anti-β-actin (United States Biological, Swampscott, MA) antibodies were also used in these studies. Cytokinesis Defect Assays. Eight shRNAs targeting the motor
0	successes  in  ectopically  expressing  genes  by  reverse  transfection  (11),  we  hoped  that  similar  approaches  could  be  coupled  with  the  use  of  RNAi  to  produce  knockdown  phenotypes.  Therefore,  we  began  by  testing  the  ability  of  siRNAs  to  be  deposited  on  a  microarray  as  lipid-RNA  comple
0	An  Arabidopsis  promoter  microarray  and  its  initial  usage  in  the  identification  of  HY5  binding  targets  in  vitro
1	Ying  Gao1,2,  Jinming  Li3,  Elizabeth  Strickland2,  Sujun  Hua4,  Hongyu  Zhao5,  Zhangliang  Chen1,  Lijia  Qu1  and  Xing  Wang  Deng1,2,*
0	Key  words:  Arabidopsis,  HY5,  promoter  microarray,  transcription  factor-promoter  interaction
0	Abstract  To  analyze  transcription  factor-promoter  interactions  in  Arabidopsis,  a  general  strategy  for  generating  a  promoter  microarray  has  been  established.  This  includes  an  integrated  platform  for  promoter  sequence  extraction  and  the  design  of  primers  for  the  PCR  amplification  of  the  promoter  regions  of  annotated  genes  in  the  Arabidopsis  genome.  A  web-interfaced  primer-retrieval  program  was  used  to  obtain  up  to  10  primer  pairs  with  a  suitability  ranking  given  to  each  gene.  We  selected  primer  pairs  for  the  promoters  of  about  3800  genes,  and  greater  than  95%  of  the  promoter  fragments  from  the  total  genomic  DNA  were  successfully  amplified  by  PCR.  These  PCR  products  were  purified  and  used  to  print  an  Arabidopsis  promoter  microarray.  This  initial  promoter  microarray  was  used  to  study  the  in  vitro  binding  of  the  transcription  factor  HY5  to  its  promoter  targets.  A  set  of  promoter  fragments  exhibited  consistent  and  strong  interaction  with  the  HY5  protein  in  vitro,  and  computational  analysis  revealed  that  they  were  enriched  with  the  HY5  consensus  binding  G-box  motif.  Thus,  a  promoter  microarray  can  be  a  useful  tool  for  identifying  transcription  factor  binding  sites  at  the  genomic  scale  in  higher  plants.
0	Introduction Transcription factor-promoter interactions are fundamentally important for understanding the regulation of genome expression, and, thus, eukaryotic cell growth and development. A series of recent papers revealed critical insights into the genome-wide transcription regulatory network using a global analysis of transcription factor binding sites in several model organisms, including yeast (Ren et al., 2000; Iyer et al., 2001; Simon et al., 2001; Wyrick et al., 2001), Drosophila (Markstein et al., 2002; Stathopoulos
0	et  al.,  2002;  Orian  et  al.,  2003),  and  mammalian  cells  (Horak  et  al.,  2002;  Ren  et  al.,  2002;  Weinmann  et  al.,  2002).  Although  a  combination  of  gene  expression  analysis  and  computational  prediction  strategy  has  been  employed  previously  to  understand  genome  expression  regulation  in  Arabidopsis  (Hong  et  al.,  2003;  Ramirez-Parra  et  al.,  2003),  the  analysis  of  transcription  factor-promoter  interactions  has  been  largely  limited  to  individual  genes  (Saha  et  al.,  2001;  Egelkrout  et  al.,  2002;  Lopez-Molina  et  al.,  2002).  The  Arabidopsis  thaliana  genome  encodes  at  least  fifteen-hundred  transcription  factors,  which
0	We retrieved the assembled Arabidopsis chromosome sequences and annotation information from MAtDB, the MIPS Arabidopsis thaliana database (ftp://ftpmips.gsf.de/cress/). The annotation information included gene contig names, entry codes, gene structures, and transcription directions. The promoter region of each gene was located according to the annotation information and then extracted from the chromosome sequences. Representative promoter deletion analyses have shown that most Arabidopsis genes have functional promoters within 1400 bp of their translational start sites (Conley et al., 1994; Tjaden et al., 1995; Honma and Goto, 2000; Haralampidis et al., 2002; Brown et al., 2003). Therefore, we used 1400 bp as an upper limit for our promoter sequence selection of Arabidopsis genes. To select promoter fragments for microarray construction, we also considered the need for uniform promoter size, so as to reduce variation in PCR amplification yield and in hybridization efficiency. Accordingly, the following principles were applied in selecting promoter fragments for PCR amplification. First, the maximum size of the PCR products was 1400 bp. Second, the minimum size of the PCR products was 500 bp. Third, the promoter 3′ end was always near, and no more than 50 bp upstream of, the ATG. To apply these principles, the transcription direction of each selected gene and the length of the intergenic region between this gene and its upstream neighbor were considered. These intergenic regions were grouped into 14 types, and in each case a distinct formula was used to define the promoter region for PCR amplification (Figure 2).
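The three sizing rules above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' program: it collapses the 14 intergenic-region types of Figure 2 into a single case, and the function name, the 1-based coordinate convention, and the decision to skip genes with less than 500 bp of upstream intergenic sequence are all assumptions made for the example.

```python
from typing import Optional, Tuple

MAX_LEN = 1400  # longest allowed PCR fragment (bp), from the text
MIN_LEN = 500   # shortest allowed PCR fragment (bp), from the text
MAX_GAP = 50    # 3' end may sit at most 50 bp upstream of the ATG


def promoter_window(atg: int, strand: str, intergenic_len: int,
                    gap: int = 0) -> Optional[Tuple[int, int]]:
    """Return (start, end), 1-based inclusive, of a candidate promoter fragment.

    atg            -- chromosome coordinate of the A in the start codon
    strand         -- '+' or '-' (transcription direction)
    intergenic_len -- bp between the ATG and the upstream neighbour gene
    gap            -- bp left between the fragment's 3' end and the ATG
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 0 <= gap <= MAX_GAP:
        raise ValueError("3' end must lie within 50 bp upstream of the ATG")

    # Usable upstream sequence, capped at the 1400 bp upper limit.
    length = min(MAX_LEN, intergenic_len - gap)
    if length < MIN_LEN:
        return None  # assumption: skip genes with too little upstream sequence

    if strand == "+":
        end = atg - 1 - gap      # base just upstream of the ATG
        start = end - length + 1
    else:
        start = atg + 1 + gap    # upstream lies at higher coordinates
        end = start + length - 1
    return start, end


print(promoter_window(10000, "+", 3000))         # (8600, 9999)
print(promoter_window(10000, "-", 800, gap=50))  # (10051, 10800)
print(promoter_window(10000, "+", 400))          # None
```

A real implementation would need one branch per intergenic-region type, since the paper states that a distinct formula was used for each of the 14 cases.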
Then  the  promoter  sequences  from  these  defined  promoter  regions  were  extracted  from  the  chromosome  sequences,  stored  in  the  database,  and  used  for  primer  selection.  A  da
0	Microarray  and  Functional  Gene  Analyses  of  Sulfate-Reducing  Prokaryotes  in  Low-Sulfate,  Acidic  Fens  Reveal  Cooccurrence  of  Recognized  Genera  and  Novel  Lineages
1	Alexander Loy,1 Kirsten Küsel,2 Angelika Lehner,3 Harold L. Drake,2 and Michael Wagner1*
0	MATERIALS  AND  METHODS
0	Site description. The two low-moor fens, designated Schlöppnerbrunnen I (50°08′14″N, 11°53′07″E) and Schlöppnerbrunnen II (50°08′38″N, 11°51′41″E), that were investigated are in the Lehstenbach catchment in the Fichtelgebirge mountains in northeastern Bavaria (Germany). The catchment has an area of 4.2 km², and the highest elevation is 877 m above sea level. Ninety percent of the
0	TABLE 1. 16S rRNA gene-targeted primers
0	Short name(a) | Full name(b) | Specificity
0	616V | S-D-Bact-0008-a-S-18 | Most Bacteria
0	630R | S-D-Bact-1529-a-A-17 | Most Bacteria
0	1492R | S-*-Proka-1492-a-A-19 | Most Bacteria and Archaea
0	ARGLO36F | S-G-Arglo-0036-a-S-17 | Archaeoglobus spp.
0	DSBAC355F | S-*-Dsb-0355-a-S-18 | Most "Desulfobacterales" and "Syntrophobacterales"
0	DSMON85F | S-G-Dsmon-0085-a-S-20 | Desulfomonile spp.
0	DSMON1419R | S-G-Dsmon-1419-a-A-20 | Desulfomonile spp.
0	SYBAC282F | S-*-Sybac-0282-a-S-18 | "Syntrophobacteraceae" and some other Bacteria
0	SYBAC1427R | S-*-Sybac-1427-a-A-18 | "Syntrophobacteraceae"
0	DBACCA65F | S-S-Dbacca-0065-a-S-18 | Desulfobacca acetoxidans
0	DBACCA1430R | S-S-Dbacca-1430-a-A-18 | Desulfobacca acetoxidans
0	[The columns "Annealing temp (°C)" and "Sequence (5′-3′)" are present in the table header, but their values are missing here.]
0	(a) Short name used in the reference or in this study.
0	(b) Name of 16S rRNA gene-targeted oligonucleotide primer based on established nomenclature (6).
0	The annealing temperature was 52°C when the primer was used with forward primer 616V or ARGLO36F, and the annealing temperature was 60°C when the primer was used with forward primer DSBAC355F.
0	area is covered with Norway spruce (Picea abies [L.] Karst.) of different ages. Upland soils in the catchment are not water saturated, have developed from weathered granitic bedrock, and are predominantly cambisols and cambic podsols (according to the Food and Agriculture Organization system). Considerable parts of the catchment (approximately 30%) are covered by minerotrophic fens or intermittent seeps. The annual precipitation in the catchment is 900 to 1,160 mm, and the average annual temperature is 5°C. Schlöppnerbrunnen I is covered with patches of Sphagnum moss and spruce, and the soil is a fibric histosol and is usually water saturated; in years with extremely hot summer months, the upper soil can become dry. Schlöppnerbrunnen II is permanently water saturated and completely overgrown by the grass Molinia caerulea. The soil of Schlöppnerbrunnen II has a larger amount of bioavailable Fe3+ than the soil of Schlöppnerbrunnen I has. The soil pHs of Schlöppnerbrunnen I and II were approximately 3.9 and 4.2, respectively; the soil solution pH varied between 4 and 6.
0	Dialysis chambers. A soil solution from the upper 40 cm of each site was sampled with dialysis chambers (27) every 2 months from July 2001 to November 2002. Each dialysis chamber consisted of 40 1-cm cells covered with a cellulose acetate membrane with a pore diameter of 0.2 µm. Prior to installation, the chamber was filled with anoxic, deionized water. The dialysis chambers were placed in the water-saturated fens for 2 weeks prior to sampling. On the sampling date, each chamber was closed (i.e., made airtight), transported to the laboratory, and sampled with argon-flushed syringes.
0	Collection of soil.
For microcosms, soil samples from three different depths (approximately 0 to 10, 10 to 20, and 20 to 30 cm) were obtained in December 2001 in sterile airtight vessels, transported to the laboratory, and processed within 4 h. For isolation of DNA, soil cores (diameter, 3 cm) from four different depths (approximately 0 to 7.5, 7.5 to 15, 15 to 22.5, and 22.5 to 30 cm) were collected on 24 July 2001 and immediately cooled on ice. Soil samples were brought to the laboratory, where they were diluted 1:1 (vol/vol) in phosphate-buffered saline (130 mM NaCl, 10 mM NaH2PO4, 10 mM Na2HPO4; pH 7.3), homogenized by vortexing, and stored at −20°C.
0	Anoxic microcosms. Thirty-gram (fresh weight) portions of soil were placed into 125-ml infusion flasks (Merck
0	The  Use  of  Carbohydrate  Microarrays  to  Study  Carbohydrate-Cell  Interactions  and  to  Detect  Pathogens
1	Matthew D. Disney and Peter H. Seeberger*, Laboratory for Organic Chemistry, Swiss Federal Institute of Technology Zuerich, ETH Hoenggerberg HCI F315, Wolfgang-Pauli-Strasse 10, 8093 Zuerich, Switzerland
0	Summary The use of carbohydrate microarrays to investigate the carbohydrate binding specificities of bacteria, to detect pathogens, and to screen antiadhesion therapeutics is reported. This system is ideal for whole-cell applications because microarrays present carbohydrate ligands in a manner that mimics interactions at cell-cell interfaces. Other advantages include assay miniaturization, since minimal amounts (~picomoles) of a ligand are required to observe binding, and high throughput, since thousands of compounds can be placed on an array and analyzed in parallel. Pathogen detection experiments can be completed in complex mixtures of cells or protein using the known carbohydrate binding epitopes of the pathogens in question. The nondestructive nature of the arrays allows the pathogen to be harvested and tested for antibacterial susceptibility. These investigations allow microarray-based screening of biological samples for contaminants and combinatorial libraries for antiadhesion therapeutics.
0	Introduction Carbohydrates displayed on the surface of cells play critical roles in cell-cell recognition, adhesion, signaling between cells, and as markers for disease progression.
Neural  cells  use  carbohydrates  to  facilitate  development  and  regeneration  [1];  cancer  cell  progression  is  often  characterized  by  increased  carbohydrate-dependent  cell  adhesion  and  the  enhanced  display  of  carbohydrates  on  the  cell  surface  [2];  viruses  recognize  carbohydrates  to  gain  entry  into  host  cells  [3];  and  bacteria  bind  to  carbohydrates  for  host  cell  adhesion  [4].  Identification  of  the  specific  saccharides  involved  in  these  processes  is  important  to  better  understand  cell-cell  recognition  at  the  molecular  level  and  to  aid  the  design  of  therapeutics  and  diagnostic  tools.  Many  interactions  at  cell-cell  interfaces  involve  multiple  binding  events  that  occur  simultaneously  [5,  6].  This  "multivalent"  type  of  binding  amplifies  affinities  relative  to  interactions  that  involve  only  a  single  ligand  [6].  This  effect  has  led  to  the  development  of  multivalent  antiadhesive  therapeutics  against  bacteria  [7,  8]  and  viruses  by  displaying  carbohydrates  on  flexible  polymers  [9-  11].  Dendrimers  and  bovine  serum  albumin  (BSA)  have  also  been  used  as  multivalent  scaffolds  [8].  Additionally,  devices  that  are  responsive  to  the  presence  of  a
0	Results and Discussion
0	Cell Adhesion to Carbohydrate Arrays
0	Five different monosaccharides equipped with an ethanolamine linker on their reducing ends were used to construct the carbohydrate arrays (Figure 1). Functionalized sugars were spotted onto glass slides that had been coated with the amine-reactive homobifunctional disuccinimidyl carbonate linker. In initial tests, 10 µl of a 20 mM carbohydrate solution was placed onto different positions on the surface. Slides were hybridized with 10⁹ E. coli (ORN178) cells that had been stained with a nucleic acid staining dye (Figure 2). After removing unbound bacteria by washing, slides were scanned using a fluorescent array scanner. Results show that a strongly fluorescent signal (signal to noise [S/N] >10) was observed at positions where mannose was immobilized; hybridization with unstained E. coli resulted in a weak signal (S/N ~2). The remainder of the slide exhibited no signal above background (data not shown). Next, an arraying robot was used to construct high-density arrays. The robot spatially delivered 1 nl of carbohydrate-containing solutions that ranged in concentration from 20 mM to 15 µM, and the resulting spots had a diameter of ~200 µm. Several types of slides were tested to optimize array performance. Standard amine-coated glass slides were reacted with either disuccinimidyl carbonate or disuccinimidyl tetrapolyethyleneglycol linkers; alternatively, CodeLink polymer-coated slides were used (data not shown). For each of these
0	Chemistry  &  Biology  1702
0	slides, ORN178 bound to mannose and not to the other carbohydrates. Furthermore, binding occurred with a signal to noise ratio of >100 despite the small size of the spots (Figure 3). CodeLink slides had the best performance since they gave the highest binding signal and the lowest background. These slides were used in all subsequent array experiments where monosaccharides were displayed. Most likely, the three-dimensional manner in which the carbohydrates were immobilized on these slides is responsible for the enhanced performance. Other arrays that displayed mono- to nonamannosides, which were constructed as described [15], were tested for binding to ORN178 (see Supplemental Data). Results from these experiments show that ORN178 has little preference among these mannosides, despite their varying lengths and linkage stereochemistry. This likely reflects that recognition of mannose residues by this strain occurs through only a single mannose residue, and that stereochemistry of the linkage plays little role in binding. The observation of cell adhesion to arrays constructed using an arraying robot with microarray-size spots is promising. A previous report studied adhesion of chicken hepatocytes and human T cells to carbohydrate arrays that were manually constructed. These spots were 1.7 mm in diameter and allowed for ~200 spots to be placed on a single slide [16]. The arrays described here show that the interactions of bacteria with carbohydrates can be studied in a high-throughput manner. Due to the smaller spot size used here, a much larger number of interactions can be screened in parallel. The minimal amount of carbohydrate sufficient to detect binding was determined.
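The amount of carbohydrate delivered per spot follows from amount = concentration × volume. The short Python check below applies this to the 1 nl spots and the 20 mM to 15 µM concentration range reported for the robot-printed arrays; the function and variable names are illustrative only.

```python
def moles_per_spot(conc_molar: float, volume_litres: float) -> float:
    """Amount of substance (mol) in one printed spot: n = C * V."""
    return conc_molar * volume_litres


SPOT_VOLUME_L = 1e-9  # 1 nl per spot, as stated in the text

highest = moles_per_spot(20e-3, SPOT_VOLUME_L)  # 20 mM solution
lowest = moles_per_spot(15e-6, SPOT_VOLUME_L)   # 15 µM dilution

print(f"{highest * 1e12:.0f} pmol")  # 20 pmol
print(f"{lowest * 1e15:.0f} fmol")   # 15 fmol
```

On this arithmetic, a 1 nl spot holds tens of picomoles at the top of the dilution series and tens of femtomoles at the bottom.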
Analyte consumption is an important aspect for carbohydrate arrays, since materials isolated from natural sources are in short supply. Several 1 nl aliquots of serially diluted solutions of carbohydrate that ranged in concentration from 20 mM to 15 µM were arrayed. A concentration-dependent decrease in signal was observed, and delivery of as little as 20 fmol to a slide was sufficient to obtain a signal
0	above background (Figure 4). Different concentrations of bacteria were next hybridized with the arrays to determine the bacterial detection limit. As expected, a concentration-dependent decrease in signal was observed. When 10⁶ or more ORN178 cells were incubated, signals were well above background (Supplemental Data); however, hybridization of 10⁵ cells gave a signal that approached background, thus defining the current detection limit. This sensitivity rivals or exceeds that of methods requiring a bacterial enrichment step prior to detection [17]. Standard microscopic images were taken of ORN178 bound to three mannose-containing spots. Images show that ORN178 adhered only to these positions, that the spots are densely covered with bacteria (Figure 4), and that no bacteria are observed outside of this area. This illustrates that these slides are resistant to nonspecific adhesion of bacteria.
0	Assessing the Carbohydrate Binding Specificities of Different Bacterial Strains
0	The arrays were tested for their ability to probe differences in carbohydrate binding affinities between re-
0	Intact  cell  adhesion  to  glycan  microarrays
0	Department  of  Pharmacology  and  Molecular  Sciences,  The  Johns  Hopkins  School  of  Medicine,  725  N.  Wolfe  Street,  Baltimore,  MD  21205;  4  Instituto  de  Microbiologia  Prof.  Paulo  de  Goes,  Universidade  Federal  do  Rio  de  Janeiro,  Rio  de  Janeiro,  Brazil;  and  5Glycominds,  Ltd.,  Lod  71291,  Israel
0	A rapid and reproducible method was developed to detect and quantify carbohydrate-mediated cell adhesion to glycans arrayed on glass slides. Monosaccharides and oligosaccharides were covalently attached to glass slides in 1.7-mm-diameter spots (200 spots/slide) separated by a Teflon gasket. Primary chicken hepatocytes, which constitutively express a C-type lectin that binds to nonreducing terminal N-acetylglucosamine residues, were labeled with a fluorescent dye and incubated in 1.3-µL aliquots on the glycosylated spots. After incubating to allow cell adhesion, nonadherent cells were removed by immersing the slide in phosphate-buffered saline, inverting, and centrifuging in a sealed custom acrylic chamber so that cells on the derivatized spots were subjected to a uniform and controlled centrifugal detachment force while avoiding an air-liquid interface. After centrifugation, adherent cells were fixed in place and detected by fluorescent imaging. Chicken hepatocytes bound to nonreducing terminal GlcNAc residues in different linkages and orientations but not to nonreducing terminal galactose or N-acetylgalactosamine residues. Addition of soluble GlcNAc (but not Gal) prior to incubation reduced cell adhesion to background levels. Extension of the method to CD4+ human T-cells on a 45-glycan diversity array revealed specific adhesion to the sialyl Lewis x structure. The described method is a robust approach to quantify selective cell adhesion using a wide variety of glycans and may contribute to the repertoire of tools for the study of glycomics.
Key words: CD4+/glycomics/hepatocyte/lectins/oligosaccharides
0	Introduction
0	Carbohydrate-mediated cell-cell recognition is emerging as an important component in the repertoire of molecular recognition events that underlie the orderly development and functioning of multicellular organisms (Crocker and
0	L.  Nimrichter  et  al.
0	lectins or by generating multivalency using chimeras or secondary binding proteins. In nature, multivalency is often generated by lectin expression on cell surfaces, where lectin molecules self-associate or cluster in response to multivalent binding arrays on an apposing surface (Weis and Drickamer, 1996; Weisz and Schnaar, 1991). Here we report methods that
0	detect specific adhesion of intact cells to covalent carbohydrate microarrays engineered on glass slides. These methods take advantage of the natural multivalency of cell surface carbohydrate binding to extend the applicability of glycan microarrays.
0	Results
0	Glass-slide arrayed carbohydrates
0	Defined glycosides were covalently arrayed on standard-size glass slides using previously described chemistry (Schwarz et al., 2003). The array consisted of 8 rows and 25 columns of 1.7-mm-diameter spots separated by a Teflon gasket (Figure 1).
0	Adhesion of intact chicken hepatocytes to GlcNAc-terminated glycans
0	A method for quantifying intact cell adhesion to glass slide glycan arrays was developed and refined using primary chicken hepatocytes, which express the well-defined GlcNAc-specific chicken hepatic lectin on their surface (Drickamer, 1981). Initial experiments used slides with multiple spots derivatized with GlcNAc, Gal, linker arm (control), and no modification (Figure 2). Chicken hepatocytes adhered selectively to spots derivatized with GlcNAc glycosides. Cell adhesion to Gal-derivatized spots and control surfaces was very low. Varying the conditions for blocking nonspecific cell adhesion (5 mg/mL or 10 mg/mL bovine serum albumin [BSA]) did not alter the results. Microscopic examination of the wells (Figure 3) confirmed
0	scientific report
0	Parasite-specific  immune  response  in  adult  Drosophila  melanogaster:  a  genomic  study
1	Katarina Roxström-Lindquist*†, Olle Terenius* & Ingrid Faye+
0	Insects of the order Diptera are vectors for parasitic diseases such as malaria, sleeping sickness and leishmaniasis. In the search for genes encoding proteins involved in the antiparasitic response, we have used the protozoan parasite Octosporea muscaedomesticae for oral infections of adult Drosophila melanogaster. To identify parasite-specific response molecules, other flies were exposed to virus, bacteria or fungi in parallel. Analysis of gene expression patterns after 24 h of microbial challenge, using Affymetrix oligonucleotide microarrays, revealed a high degree of microbe specificity. Many serine proteases, key intermediates in the induction of insect immune responses, were uniquely expressed following infection with the different organisms. Several lysozyme genes were induced in response to Octosporea infection, while in other treatments they were not induced or were downregulated. This suggests that lysozymes are important in antiparasitic defence.
0	The  majority  of  insect  vectors  for  human  parasites  are  found  among  dipterans.  In  an  attempt  to  understand  the  immunological  basis  for  Anopheles  vector  capacity,  Schneider  &  Shahabuddin  (2000)  successfully  used  Drosophila  melanogaster  and  the  malaria  parasite  Plasmodium  gallinaceum  as  a  vector-parasite  model  system.  Ookinetes  injected  into  the  fly  haemocoel  developed  into  sporozoites  that  were  infective  when  injected  into  the  chicken  host.  However,  when  feeding  the  flies  with  parasitized  blood  or  ookinetes,  parasite  development  was  hampered,  indicating  that  the  important  barrier  for  the  parasite  to  develop  resides  in  the  gut  of  this  insect.  Either  certain  mosquito-specific  invasion  routes  are  not  present  in  Drosophila,  or  the  malaria  parasites  encountered
0	EUROPEAN  MOLECULAR  BIOLOGY  ORGANIZATION
0	Drosophila were fed with DCV, 30-50% of the flies died within 6 days after infection (Gomariz-Zilber et al, 1995). This is the first whole-genome study on antiparasitic response in D. melanogaster. We demonstrate that Drosophila responds to an oral infection with Octosporea by upregulating a new and specific set of genes. Many of the genes with unknown function have signal peptides and will be a subject for future analyses of antiparasitic activity.
0	RESULTS  AND  DISCUSSION  Genome  data  analysis
0	The Drosophila gene expression in response to different microbes was examined after 24 h of natural infection of adult males. The RNA was hybridized to Affymetrix Drosophila GeneChips, and Affymetrix MAS 5.0 software was used for the calculation of expression and statistical analyses of the chips (supplementary information table 1 online). Duplicates of each infection were compared to duplicates of normal flies in a 2 × 2 matrix (supplementary information text part A online). The genes that were significantly increased (P < 0.0025, Wilcoxon's signed ranks test) in all four comparisons were defined as induced genes. In total, 427 genes were induced and selected for further analysis (supplementary information table 2 online). The fungal infection generated the strongest response, with 298 genes induced, and the parasitic infection induced 127 genes. In the viral and bacterial infections, a low number of genes were significantly induced: 11 and 10, respectively. The significantly induced genes are found in many different functional classes (Fig 1). A common feature in the four infections was that many of the genes encode enzymes, in particular serine proteases: Octosporea, 35% enzymes (13% serine proteases); Beauveria, 24% (8%); Serratia, 60% (50%); and DCV, 36% (27%) (supplementary information table 4 online). Unique or common induction of a gene was determined by comparing the expression of each induced gene selected in one treatment with its expression in other treatments (supplementary information text part A online).
The  numbers  of  uniquely  induced  genes  were  214,  59  and  2  in  response  to  Beauveria,  Octosporea  and  DCV,  respectively;  this  constitutes  65%  of  the  427  induced  genes  and  thereby  demonstrates  specificity  in  the  immune  response  (Fig  2).  Many  genes  were  induced  in  several  infections;  16  genes  are  designated  as  common  in  response  to  all  four  infections.  The  genes  in  common  encode  the  antimicrobial  proteins  Attacin  A,  Cecropin  A1,  Cecropin  A2,  Drosomycin  and  Metchnikowin,  as  well  as  acetylCoA  homeostasis  (CG8628),  one  serine  protease  (CG6483)  and  nine  genes  with  unknown  functions  (supplementary  information  table  3  online).
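The selection rule described above, in which a gene counts as induced only if it is significantly increased in all four infected-versus-control comparisons of the 2 × 2 matrix, can be sketched as follows. This is an illustration only: the per-comparison P values are assumed to have been exported from the analysis software already, and the function name, gene labels, and data layout are hypothetical.

```python
from typing import Dict, List

P_CUTOFF = 0.0025  # significance threshold quoted in the text


def induced_genes(pvalues: Dict[str, List[float]],
                  cutoff: float = P_CUTOFF) -> List[str]:
    """Return genes whose increase is significant in every comparison.

    pvalues maps a gene ID to its four P values, one per pairing of an
    infected replicate with a control replicate (2 x 2 = 4 comparisons).
    """
    selected = []
    for gene, ps in pvalues.items():
        if len(ps) != 4:
            raise ValueError(f"{gene}: expected 4 comparisons, got {len(ps)}")
        if all(p < cutoff for p in ps):
            selected.append(gene)
    return selected


# Made-up P values, for illustration only.
example = {
    "geneA": [0.0001, 0.0010, 0.0020, 0.0005],  # passes all four
    "geneB": [0.0001, 0.0100, 0.0020, 0.0005],  # fails one comparison
}
print(induced_genes(example))  # ['geneA']
```

Requiring all four comparisons to pass is conservative: a single noisy replicate pairing is enough to exclude a gene.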
0	Confirmation  of  genes  responding  to  Beauveria  infection
0	The  antifungal  peptide  genes  Drosomycin  and  Metchnikowin  (Ekengren  &  Hultmark,  2001,  and  references  therein)  were  heavily  induced  by  Beauveria  in  our  study:  14.3-  and  19.9-fold,  respectively  (Table  1).  In  a  similar  experiment,  where  the  D.  melanogaster  strain  OregonR  was  naturally  infected  with  the  same  strain  of  Beauveria,  the  response  at  24  h  was  lower  compared  to  our  results:  Drosomycin  6.4-fold  and  Metchnikowin  4.4-fold  (De  Gregorio  et  al,  2001).  The  Canton  S  flies  used  in  our  study  died  within  5  days  (Fig  3),  whereas  90%  of  the  OregonR  flies  used  by  De  Gregorio  et  al  (2002)  were  still  alive  at  that  time  point.  This
0	may indicate that our flies were more heavily infected, or that there is a certain genetic difference between these two wild-type isolates of D. melanogaster. Turandot M (TotM) is a stress-induced humoral protein gene in Drosophila, earlier shown to be upregulated by the Gram-negative bacterium Enterobacter cloacae b12 when injected into adults (Ekengren & Hultmark, 2001). In our study, TotM is induced 13.7-fold by fungal infection (Table 1) and 2.4-fold by bacterial feeding. The strong fungal induction could reflect the stress response induced by cuticular penetration. Notably, in De Gregorio's study TotM (CG14027) is, after 24 h, upregulated 3.6-fold by the fungal infection and 13.6-fold by septic injury. This is a recurring pattern of contrasting results on fungal versus bacte
0	The  Human  MitoChip:  A  High-Throughput  Sequencing  Microarray  for  Mitochondrial  Mutation  Detection
1	Anirban  Maitra,1,3  Yoram  Cohen,2  Susannah  E.D.  Gillespie,3  Elizabeth  Mambo,2  Noriyoshi  Fukushima,1  Mohammad  O.  Hoque,2  Nila  Shah,4  Michael  Goggins,1  Joseph  Califano,2  David  Sidransky,1,2  and  Aravinda  Chakravarti3,5
0	et  al.  1998;  Fliss  et  al.  2000;  Bianchi  et  al.  2001;  Jones  et  al.  2001;  Parrella  et  al.  2001;  Sanchez-Cespedes  et  al.  2001;  Chen  et  al.  2002;  Copeland  et  al.  2002).  The  frequency  of  mitochondrial  mutations  in  these  studies  is  high,  with  half  to  two-thirds  of  cancers  harboring  at  least  one  somatic  mutation.  The  mitochondrial  genome  is  an  ideal  target  for  mutation  detection  in  cancers  for  several  reasons.  First,  mitochondrial  mutations  in  cancer  are  not  only  common,  but  unlike  nuclear  genes,  do  not  appear  to  be  restricted  by  cancer  type  (Polyak  et  al.  1998;  Fliss  et  al.  2000;  Jones  et  al.  2001;  Sanchez-Cespedes  et  al.  2001).  Second,  detection  of  mitochondrial  DNA  mutations  in  clinical  samples  (such  as  exfoliated  cells  in  urine,  or  lavage  fluids)  offers  a  distinct  advantage  over  nuclear  DNA  because  of  the  high  copy  number  of  mitochondrial  genomes  in  cancer  cells.  Fliss  et  al.  (2000)  determined  that  mitochondrial  DNA  was  19  to  220  times  as  abundant  as  mutated  p53  nuclear  DNA  in  matched  body  fluids  from  cancer  patients.  Similarly,  Jones  et  al.  (2001)  confirmed  the  facile  detection  of  mitochondrial  DNA  mutations  in  primary  tumors  with  a  30%  or  less  neoplastic  cellularity,  whereas  known  nuclear  DNA  mutations  could  not  be  detected  in  the  nonenriched  samples.  Finally,  the  presence  of  mitochondrial  DNA  mutations  in  a  proportion  of  preneoplastic  lesions  suggests  that  mutations  occur  early  in  multistep  tumor  progression  (Jeronimo  et  al.  2001;  Parrella  et  al.  2001;  Ha  et  al.  2002),  and  hence,  may  be  used  as  a  tool  for  early  detection  of  cancer  in  clinical  samples,  including  body  fluids  and  serum  (Hibi  et  al.  2001;  Jeronimo  et  al.  2001;  Nomoto  et  al.  2002;  Okochi  et  al.  2002).  
Current strategies for using the mitochondrial genome as a screening tool in cancer are limited by the availability of a high-throughput platform for mutation detection. Even with the availability of sensitive and rapid mutation detection platforms such as automated capillary sequencers and denaturing high-performance liquid chromatography (HPLC; Medintz et al. 2001; Liu et al. 2002), the routine sequencing of 16.5 kb of mitochondrial DNA is an onerous task. Microarrays are inherently parallel devices that offer the promise of determining the genotypes at every site of interest with a limited level of effort (Hacia 1999). Chee et al. developed the first mitochondrial sequencing microarray in 1996, composed of "tiled" oligonucleotide sequencing probes synthesized using standard photolithography and solid-phase DNA synthesis (Chee et al. 1996). This microarray platform, however, had several limitations, including the requirement for generating RNA by in vitro transcription of genomic DNA for chip hybridization, tiling of only a single strand of the target mitochondrial sequence on the chip, and absence of robust genotype assignment software. We have developed a "second-generation" sequencing microarray for high-throughput analysis of mitochondrial coding
0	Genome Research
0	Mitochondrial Sequencing Microarray
0	Reproducibility of Array-Based Sequencing
0	A  custom  microarray  platform  for  analysis  of  microRNA  gene  expression
1	J  Michael  Thomson1,  Joel  Parker2,5,  Charles  M  Perou2-4  &  Scott  M  Hammond1,2
0	MicroRNAs  are  short,  noncoding  RNA  transcripts  that  posttranscriptionally  regulate  gene  expression.  Several  hundred  microRNA  genes  have  been  identified  in  Caenorhabditis  elegans,  Drosophila,  plants  and  mammals.  MicroRNAs  have  been  linked  to  developmental  processes  in  C.  elegans,  plants  and  humans  and  to  cell  growth  and  apoptosis  in  Drosophila.  A  major  impediment  in  the  study  of  microRNA  function  is  the  lack  of  quantitative  expression  profiling  methods.  To  close  this  technological  gap,  we  have  designed  dual-channel  microarrays  that  monitor  expression  levels  of  124  mammalian  microRNAs.  Using  these  tools,  we  observed  distinct  patterns  of  expression  among  adult  mouse  tissues  and  embryonic  stem  cells.  Expression  profiles  of  staged  embryos  demonstrate  temporal  regulation  of  a  large  class  of  microRNAs,  including  members  of  the  let-7  family.  This  microarray  technology  enables  comprehensive  investigation  of  microRNA  expression,  and  furthers  our  understanding  of  this  class  of  recently  discovered  noncoding  RNAs.
0	MicroRNAs comprise a large family of noncoding RNAs found in organisms ranging from nematodes to plants to humans (see ref. 1 for a review). Over 200 microRNAs have been identified in mammals, either through computational searches or by RT-PCR-mediated cloning. These RNAs function as natural triggers of the RNAi pathway, regulating gene expression at a post-transcriptional step. MicroRNA biogenesis begins with a primary transcript that contains a stem-loop structure1. This transcript is processed by the ribonuclease III enzyme Drosha, liberating the stem-loop, which is termed the precursor. This precursor is transported out of the nucleus in a process dependent on the Ran GTPase and the export receptor exportin-5. Further processing in the cytoplasm by the ribonuclease III enzyme Dicer leads to the production of mature RNAs of ~22 nucleotides (nt) that are incorporated into the RNAi effector complex RISC (RNA-induced silencing complex). Complementarity with elements in mRNAs leads to suppression of gene expression. In cases where the microRNA is an imperfect match to the mRNA, as with C. elegans lin-4, recognition leads to reduction in protein levels without affecting mRNA levels. In plants, mRNA targets in the scarecrow-like family of
0	transcription  factors  contain  sequences  perfectly  complementary  to  the  microRNA  miR-39.  Similarly,  in  mammals,  miR-196  has  near-perfect  identity  with  elements  in  the  mRNA  of  the  homeobox  transcription  factor  gene  HoxB8  (ref.  2).  In  this  case  recognition  of  the  mRNA  by  microRNAs  leads  to  cleavage,  rather  than  translational  repression,  analogous  to  siRNA-mediated  gene  silencing3,4.  Despite  the  large  number  of  identified  microRNAs,  the  scope  of  their  roles  in  regulating  cellular  gene  expression  is  not  known.  The  founding  members  of  this  family  of  noncoding  RNAs  are  the  C.  elegans  lin-4  and  let-7  (refs.  5,6).  Expression  of  these  microRNAs,  originally  termed  short-temporal  RNAs,  is  essential  for  proper  timing  of  events  during  larval  development.  For  example,  levels  of  the  let-7  RNA  increase  during  the  fourth  larval  stage  and  the  adult  stage,  resulting  in  suppression  of  larval-specific  genes,  including  lin-41  (ref.  6).  Partially  complementary  elements  in  the  lin-41  mRNA  are  binding  sites  for  let-7  (ref.  7).  The  role  of  microRNAs  in  cell  lineage  and  development  has  recently  been  found  to  extend  to  mammalian  systems.  miR-181  is  highly  expressed  in  hematopoietic  progenitors,  and  its  overexpression  promotes  differentiation  into  B-lineage  cells8.  The  regulation  of  homeobox  genes  by  microRNAs  further  links  this  gene  family  to  mammalian  developmental  processes2.  One  approach  to  identifying  the  cellular  roles  of  microRNAs  is  the  identification  of  mRNA  targets.  Several  groups  have  developed  computational  methods  to  search  for  target  sequences  of  microRNAs  (see  ref.  1  for  a  discussion).  
These methods have yielded hundreds of candidate targets in plants, Drosophila and mammals that implicate microRNAs in a diverse range of cellular pathways. Essential for the interpretation of these data, however, is an understanding of microRNA expression patterns vis-à-vis expression patterns of predicted targets. The temporally restricted expression of large sets of microRNAs in C. elegans and Drosophila has been reported9-11. More recently, tissue-specific expression patterns of mammalian microRNAs have been described12. All data were obtained by northern blot analysis of microRNA levels. As a refinement to this approach, the use of nylon macroarrays for analysis of 44 microRNAs during brain development has been reported13. All the aforementioned approaches, however,
0	prevents  edge  effects.  We  adapted  MJ  Research  in  situ  PCR  chambers  as  disposable  hybridization  chambers.  A  reference  oligonucleotide  set  corresponding  to  all  mature  microRNAs,  labeled  with  Cy5  (red  channel),  was  included  in  all  hybridizations.  This  reference  set  provides  an  internal  hybridization  control  for  every  probe  on  the  array.  In  principle,  this  could  permit  absol
0	MicroRNAs:  SMALL  RNAs  WITH  A  BIG  ROLE  IN  GENE  REGULATION
1	Lin  He  and  Gregory  J.  Hannon
0	MicroRNAs  are  a  family  of  small,  non-coding  RNAs  that  regulate  gene  expression  in  a  sequence-specific  manner.  The  two  founding  members  of  the  microRNA  family  were  originally  identified  in  Caenorhabditis  elegans  as  genes  that  were  required  for  the  timed  regulation  of  developmental  events.  Since  then,  hundreds  of  microRNAs  have  been  identified  in  almost  all  metazoan  genomes,  including  worms,  flies,  plants  and  mammals.  MicroRNAs  have  diverse  expression  patterns  and  might  regulate  various  developmental  and  physiological  processes.  Their  discovery  adds  a  new  dimension  to  our  understanding  of  complex  gene  regulatory  networks.
0	RNA INTERFERENCE (RNAi). A form of post-transcriptional gene silencing, in which dsRNA induces degradation of the homologous mRNA, mimicking the effect of the reduction, or loss, of gene activity.
0	The  discovery  of  miRNAs
0	The founding member of the miRNA family, lin-4, was identified in C. elegans through a genetic screen for defects in the temporal control of post-embryonic development10,11. In C. elegans, cell lineages have distinct characteristics during 4 different larval stages (L1-L4). Mutations in lin-4 disrupt the temporal regulation of larval development, causing L1 (the first larval stage)-specific cell-division patterns to reiterate at later developmental stages10. Opposite developmental phenotypes -- omission of the L1 cell fates and premature development into the L2 stage -- are observed in worms that are deficient for lin-14 (REF. 12). Even before the molecular identification of lin-4 and lin-14, these loci were placed in the same regulatory pathway on the basis of their opposing phenotypes and antagonistic genetic interactions11. Most genes identified from mutagenesis screens are protein-coding, but lin-4 encodes a 22-nucleotide non-coding RNA that is partially complementary to 7 conserved sites located in the 3′-untranslated region (UTR) of the lin-14 gene (FIG. 1b)13,14. lin-14 encodes a nuclear protein, downregulation of which at the end of the first larval stage initiates the developmental progression into the second larval stage13,15. The negative regulation of LIN-14 protein expression requires an intact 3′ UTR of its mRNA14, as well as a functional lin-4 gene13. These genetic interactions inspired a series of molecular and biochemical studies demonstrating that
0	the direct, but imprecise, base pairing between lin-4 and the lin-14 3′ UTR was essential for the ability of lin-4 to control LIN-14 expression through the regulation of protein synthesis16-18. Through an analogous mechanism, lin-4 also negatively regulates the translation of lin-28, a cold-shock-domain protein that initiates the developmental transition between the L2 and L3 stages19. Compared with lin-14, lin-28 has fewer lin-4 binding sites, which might lead to its translational repression being delayed following lin-4 expression owing to less efficient lin-4 binding6,19. The discovery of lin-4 and its target-specific translational inhibition hinted at a new mechanism of gene regulation during development. In 2000, almost 7 years after the initial identification of lin-4, the second miRNA, let-7, was discovered, also using forward genetics in worms. let-7 encodes a temporally regulated 21-nucleotide small RNA that controls the developmental transition from the L4 stage into the adult stage20-22. Similar to lin-4, let-7 performs its function by binding to the 3′ UTR of lin-41 and hbl-1 (lin-57), and inhibiting their translation20-24. The identification of let-7 not only provided another vivid example of developmental regulation by small RNAs, but also raised the possibility that such RNAs might be present in species other than nematodes. Unlike lin-4, the orthologues of which in flies and mammals initially escaped bioinformatic searches, and were only recognized recently25,26, both let-7 and lin-41 are evolutionarily conserved throughout metazoans, with homologues that were readily detected in molluscs, sea urchins, flies, mice and humans27.
This  extensive  conservation  strongly  indicated  a  more  general  role  of  small  RNAs  in  developmental  regulation,  as  supported  by  the  recent  characterization  of  miRNA  functions  in  many  metazoan  organisms.
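0	The "direct, but imprecise" base pairing described above can be illustrated with a minimal sketch. It assumes an ungapped, antiparallel alignment and uses short toy sequences that are not taken from the article; the function names `pairing_profile` and `classify` are hypothetical. Real miRNA:target duplexes such as lin-4:lin-14 also contain bulges and loops, which this simplification ignores.

```python
# Watson-Crick pairs and the G:U wobble pair that RNA duplexes tolerate.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pairing_profile(mirna: str, site: str) -> str:
    """Pair a miRNA (5'->3') against a target site (5'->3'), antiparallel.

    Returns one symbol per miRNA position:
    '|' Watson-Crick pair, ':' G:U wobble, '.' mismatch.
    Assumes equal lengths and no gaps (a deliberate simplification).
    """
    if len(mirna) != len(site):
        raise ValueError("this sketch only handles ungapped, equal-length duplexes")
    symbols = []
    # Reversing the site lines its 3' end up with the miRNA 5' end.
    for m, t in zip(mirna, reversed(site)):
        if (m, t) in WATSON_CRICK:
            symbols.append("|")
        elif (m, t) in WOBBLE:
            symbols.append(":")
        else:
            symbols.append(".")
    return "".join(symbols)


def classify(profile: str) -> str:
    """Label a duplex 'perfect' only if every position is a Watson-Crick pair."""
    return "perfect" if profile and set(profile) == {"|"} else "imperfect"


toy_mirna = "GAUCGUAC"           # hypothetical 8-nt RNA
perfect_site = "GUACGAUC"        # exact reverse complement of toy_mirna
imperfect_site = "GAACGGUC"      # one wobble and one mismatch

for site in (perfect_site, imperfect_site):
    profile = pairing_profile(toy_mirna, site)
    print(site, profile, classify(profile))
```

0	The profile string keeps the position of each wobble or mismatch, so sites can be compared by where the pairing breaks down and not only by how many positions pair.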
0	miRNAs and siRNAs -- what's the difference?
0	[Figure 1. a: lin-4 pre-miRNA stem-loop. b: mature lin-4 miRNA (22 nt) base-paired with its complementary sites in the 3′ UTR of lin-14, downstream of the lin-14 ORF.]
0	Hundreds of miRNAs have now been identified in various organisms, and the RNA structure and regulatory mechanisms that have been characterized in lin-4 and let-7 still provide unique molecular signatures as to what defines miRNAs. miRNAs are generally 21-25-nucleotide, non-coding RNAs that are derived from larger precursors that form imperfect stem-loop structures (FIG. 1a)4,5. The mature miRNA is most often derived from one arm of the precursor hairpin, and is released from the primary transcript through stepwise processing by two ribonuclease-III (RNase III) enzymes28,29. At least in animals, most miRNAs bind to the target 3′ UTR with imperfect complementarity and function as translational repressors (see below for a discussion of plant miRNAs)4. Almost coincident with the discovery of the second miRNA, let-7, small RNAs were also characterized as components of a seemingly separate biological process, RNA interference (RNAi). RNAi is an evolutionarily conserved, sequence-specific gene-silencing mechanism that is induced by exposure to dsRNA30. In many systems, including worms, plants and flies, the stimulus that was used to initiate RNAi was the introduction of a dsRNA (the trigger) of ~500 bp. The trigger is ultimately processed in vivo into small dsRNAs of ~21-25 bp in length, designated as small interfering RNAs (siRNAs)31,32. It is now clear that one strand of the siRNA duplex is selectively incorporated into an effector complex (the RNA-induced silencing complex; RISC). The RISC directs the cleavage of complementary mRNA targets, a process that is also known as post-transcriptional gene silencing (PTGS) (FIG. 2)33.
The  evolutionarily  conserved  RNAi  response  to  exogenous  dsRNA  might  reflect  an  endogenous  defense  mechanism  against  virus  infection  or  parasitic  nucleic  acids30.  Indeed,  mutations  of  the  RNAi  components  greatly  compromise  virus  resistance  in  plants,  indicating  that  PTGS  might  normally  mediate  the  destruction  of  the  viral  RNAs34.  In  addition,  siRNAs  can  also  regulate  the  expression  of  target  transcripts  at  the  transcriptional  level,  at  least  in  some  organisms.  Not  only  can  siRNAs  induce  sequence-specific  promoter  methylation  in  plants35,36,  but  they  are  also  crucial  for  heterochromatin  formation  in  fission  yeast37,38,  and  transposon  silencing  in  worms39,40.  Fundamentally,  siRNAs  and  miRNAs  are  similar  in  terms  of  their  molecular  characteristics,  biogenesis  and  effector  functions  (see  below  for  details).  So,  the  current  distinctions  between  these  two  species  might  be  arbitrary,  and  might  simply  reflect  the  different  paths  through  which  they  were  originally  discovered.  miRNAs  and  siRNAs  share  a  common  RNase-III  processing  enzyme,  Dicer,  and  closely  related  effector  complexes,  RISCs,  for  post-transcriptional  repression  (FIG.  2).  In  fact,  much  of  our  current  knowledge  of  the  biochemistry  of  miRNAs  stems  f
0	Understanding  the  molecular  responses  to  hypoxia  using  Drosophila  as  a  genetic  model
1	Reza  Farahani  a,  Gabriel  G.  Haddad  a,b,*
0	Keywords: Anoxia, tolerance, genetic approaches; genes, anoxia tolerance, dADAR; invertebrates, Drosophila melanogaster
0	`genetic'  discoveries  were  being  made,  even  without  the  understanding  of  the  basis  for  heredity.  Subsequent  to  this  era,  and  more  recently  in  the  past  couple  of  decades,  the  emphasis  has  shifted  to  a  totally  different  paradigm.  At  present,  a  considerable  amount  of  research  is  tied  to  the  understanding  of  behavioral,  biochemical  or  genetic  processes  at  the  molecular  level  because  it  may  have  direct  implications  on  a  disease  process  in  mammals  or  humans.  Examples  in  point  are  related,  for  instance,  to  the  past  effort,  that  went  on  to  understand  the  development  of  the  thorax  (or  bithorax)  and  the  effort  that  is  on-going  at  present  to  solve  the  molecular  underpinnings  of
0	aging, tumor formation, alcohol intoxication, neurodegeneration, and memory. We have been interested in a variety of questions that range from O2 sensing to the cellular and molecular responses to hypoxia and to injury from anoxia. Although most of our previous work has been done in mammals, we have recently discovered that Drosophila is very resistant to O2 deprivation (Krishnan et al., 1997). This opened major avenues for us, since Drosophila has been used so effectively in so many relevant research areas, as noted above. Indeed, in spite of many advances in monitoring oxygenation, there is still considerable morbidity and mortality arising from conditions in which O2 deprivation leads to hypoxic/ischemic damage, especially brain injury. Part of this failure is related to the complexity of the cascade of events that ensue after hypoxia. Hence, Drosophila has been used in our laboratory to address some of the questions related to tolerance or susceptibility to hypoxia. In this review, the role and importance of genetic models, such as Drosophila melanogaster, are discussed and an example illustrating how to harness the power of Drosophila genetics is detailed. We also detail approaches that have been used in flies or other genetic models and have proved very useful. We demonstrate that these approaches have also been fruitful in trying to understand hypoxic responses and the basis for tolerance or susceptibility to hypoxic tissue injury.
0	Some of the more recent studies in our laboratory as well as in others, using molecular and genetic approaches, have provided evidence that there are genes that can protect against or predispose to cell injury and death when nerve cells are exposed to O2 deprivation (Ma and Haddad, 1997; Haddad et al., 1997; Ma et al., 1999; Ma and Haddad, 1999, 2000). In this review, we will survey some of these novel approaches, focus on genetic models and delineate some of their experimental power.
0	2.1. Forward genetics
0	2.1.1. Tolerance to hypoxia, a Drosophila phenotype
0	Drosophila can be placed in 100% N2 for several hours and yet survive the stress with no apparent injury: following return to a normoxic milieu, they can mate, fly, and see, among other complex behaviors that seem to be intact. Furthermore, electron microscopic studies of the central nervous system of the fly did not show any disruption or swelling of any cellular organelles or membranes. The time period during which flies can sustain such a stress (i.e. hours) is clearly very significant since the life span of these flies is just over 1 month. One of the interesting aspects of the Drosophila phenotype with respect to anoxia tolerance is that, unlike other animals (such as the turtle), Drosophila is tolerant not because of a lack of sensitivity to the stress. Indeed, these animals are very sensitive to stress and `sense' hypoxia: when exposed to a partial pressure of O2 (PO2) of 0 (anoxia), flies lose coordination, stop moving, and then fall and remain motionless for the rest of the anoxic period (Krishnan et al., 1997; Haddad et al., 1997).
When they are exposed to about 2-3% O2 (which is extremely low by mammalian standards), they continue flying and moving for hours, albeit at a slower pace than in normoxic conditions. Their O2 consumption during hypoxia (2-3% O2) drops to about 20% of control and this demonstrates that they `sense' the lack of O2 at the cellular level. Therefore, we believe that the Drosophila tolerance to the lack of O2 is derived from their ability to
0	Approaches for the study of hypoxia
0	Many approaches have been taken to study questions about nerve cell response and/or injury due to O2 deprivation. Some investigators have used acute settings and mostly electrophysiologic techniques, to examine ionic homeostasis (Haddad and Jiang, 1993). Others have relied on morphometric and anatomic approaches, and still others have focused almost exclusively on molecular approaches, especially in settings in which the stress is modest and cells and tissues withstood prolonged periods of hypoxia (Banasiak and Haddad, 1998; Banasiak et al.,
0	number  of  mutant  lines  (deficiencies,  inversions,  duplications,  etc.)  and  chromosomal  markers  available  for  mapping  and  mutagenesis.  (iii)  There  are  tools  available  for  the  study  of  cell  or  organ  physiology  in  Drosophila  such  as  the  Giant  Fiber  System,  which  is  very  well  studied  in  Drosophila  (Haddad  et  al.,  1997).  Finally,  (iv)  P-elements,  which  are  transposable  DNA  elements  with  known  sequences,  have  been  very  useful  in  Drosophila  for  cloning,  mutagenesis  and  over-expression  of  genes  using  Gal4  syst
0	Anomalies  in  the  Expression  Profile  of  Interspecific  Hybrids  of  Drosophila  melanogaster  and  Drosophila  simulans
0	Genome  Research
0	Ranz  et  al.
0	RESULTS  AND  DISCUSSION
0	TECHNICAL  REPORTS
0	Comparing  genomic  expression  patterns  across  species  identifies  shared  transcriptional  profile  in  aging
0	We  developed  a  method  for  systematically  comparing  gene  expression  patterns  across  organisms  using  genome-wide  comparative  analysis  of  DNA  microarray  experiments.  We  identified  analogous  gene  expression  programs  comprising  shared  patterns  of  regulation  across  orthologous  genes.  Biological  features  of  these  patterns  could  be  identified  as  highly  conserved  subpatterns  that  correspond  to  Gene  Ontology  categories.  Here,  we  demonstrate  these  methods  by  analyzing  a  specific  biological  process,  aging,  and  show  that  similar  analysis  can  be  applied  to  a  range  of  biological  processes.  We  found  that  two  highly  diverged  animals,  the  nematode  Caenorhabditis  elegans  and  the  fruit  fly  Drosophila  melanogaster,  implement  a  shared  adult-onset  expression  program  of  genes  involved  in  mitochondrial  metabolism,  DNA  repair,  catabolism,  peptidolysis  and  cellular  transport.  Most  of  these  changes  were  implemented  early  in  adulthood.  Using  this  approach  to  search  databases  of  gene  expression  data,  we  found  conserved  transcriptional  signatures  in  larval  development,  embryogenesis,  gametogenesis  and  mRNA  degradation.  Gene  expression  profiling  measures  the  expression  levels  of  thousands  of  genes  at  once1,2.  Most  expression  profiling  studies  have  focused  on  the  specific  genes  that  respond  to  specific  conditions,  but  another  important  direction  in  functional  genomics  is  to  derive  insight  from  global  patterns  of  gene  expression.  Genome-scale  expression  patterns  have  been  used  as  physiological  `fingerprints'  for  classifying  tumors3,4  and  assigning  uncharacterized  mutations  and  drugs  to  known  pathways5.  
Because  they  use  information  from  many  genes  at  once,  patterns  have  great  discriminating  power,  even  when  the  transcriptional  effects  on  individual  genes  are  small5,6.  The  patterns  of  changes  in  gene  expression  observed  in  microarray  experiments  can  be  extensive  and  complex.  To  try  to  analyze  these  patterns,  we  exploited  the  principle  that  important  biological  processes  are  often  conserved  between  organisms.  We  present  an  approach  to  comparative  functional  genomics  based  on  shared  patterns  of  regulation
0	across orthologous genes. We also present a method for identifying conserved biological components of those patterns that correspond to Gene Ontology categories. These methods can be used to search databases of microarray experiments to discover connections among biological processes in different organisms.
0	RESULTS
0	Comparing genomic expression patterns across species
0	We used phylogenetic analysis to systematically identify orthologous groups of genes for all pairwise comparisons between C. elegans, D. melanogaster, Saccharomyces cerevisiae and Homo sapiens (Supplementary Tables 1-5 online). For C. elegans and D. melanogaster, we identified 3,851 most-conserved orthologous gene pairs (Fig. 1a). We used DNA microarrays in each organism to compare gene expression under different conditions (Fig. 1b). We then used gene phylogenetic relationships to match systematically the measurements of differential expression between orthologous genes from the two organisms (Fig. 1c). We used the correlation of the log-transformed relative change in expression of orthologous genes to assess the extent of shared regulation.
0	Global similarity of transcriptional profiles of aging
0	Using this approach, we asked whether gene expression patterns in adult aging were shared by two highly diverged animals: the nematode C. elegans and the fruit fly D. melanogaster, whose last common ancestor existed about one billion years ago7. We used spotted-PCR-product microarrays1 to compare gene expression in middle-aged adult (6 d adult) and young adult (0 d adult) sterile C. elegans hermaphrodites and used Affymetrix oligonucleotide microarrays2 to compare expression in middle-aged adult (23 d old) and young adult (3 d old) female flies8. The cross-species Pearson correlation of the log-transformed relative change in expression of orthologous genes during aging was 0.144, which is significant at the 10^-11 level. Sixteen comparisons of independent experimental replicates all had high significance values, with a mean
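0	The comparison described above, a Pearson correlation of log-transformed expression changes matched through ortholog pairs, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' pipeline: the gene identifiers and expression values are invented, the log base (2) is an assumption, and the helper names `log_ratios`, `pearson` and `cross_species_correlation` are hypothetical.

```python
import math


def log_ratios(old: dict, young: dict) -> dict:
    """log2(old / young) for every gene measured in both conditions."""
    return {g: math.log2(old[g] / young[g]) for g in old if g in young}


def pearson(xs: list, ys: list) -> float:
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length lists with at least 2 values")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def cross_species_correlation(ratios_a: dict, ratios_b: dict, ortholog_pairs: list) -> float:
    """Correlate log ratios across species, one point per ortholog pair."""
    xs, ys = [], []
    for gene_a, gene_b in ortholog_pairs:
        # Skip pairs where either gene lacks a measurement.
        if gene_a in ratios_a and gene_b in ratios_b:
            xs.append(ratios_a[gene_a])
            ys.append(ratios_b[gene_b])
    return pearson(xs, ys)


# Invented toy measurements (arbitrary units); identifiers are placeholders.
worm = log_ratios({"w1": 8.0, "w2": 2.0, "w3": 4.0}, {"w1": 4.0, "w2": 4.0, "w3": 4.0})
fly = log_ratios({"f1": 16.0, "f2": 1.0, "f3": 4.0}, {"f1": 4.0, "f2": 4.0, "f3": 4.0})
pairs = [("w1", "f1"), ("w2", "f2"), ("w3", "f3")]
print(cross_species_correlation(worm, fly, pairs))
```

0	With thousands of ortholog pairs, even a modest coefficient such as the 0.144 reported above can be highly significant, because the test statistic grows with the number of pairs.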
0	review  article
0	The  immune  response  of  Drosophila
1	Jules  A.  Hoffmann
0	Institut  de  Biologie  Moleculaire  et  Cellulaire  du  CNRS,  67084  Strasbourg  Cedex,  France
0	Drosophila  mounts  a  potent  host  defence  when  challenged  by  various  microorganisms.  Analysis  of  this  defence  by  molecular  genetics  has  now  provided  a  global  picture  of  the  mechanisms  by  which  this  insect  senses  infection,  discriminates  between  various  classes  of  microorganisms  and  induces  the  production  of  effector  molecules,  among  which  antimicrobial  peptides  are  prominent.  An  unexpected  result  of  these  studies  was  the  discovery  that  most  of  the  genes  involved  in  the  Drosophila  host  defence  are  homologous  or  very  similar  to  genes  implicated  in  mammalian  innate  immune  defences.  Recent  progress  in  research  on  Drosophila  immune  defence  provides  evidence  for  similarities  and  differences  between  Drosophila  immune  responses  and  mammalian  innate  immunity.
0	Nature Publishing Group
0	Toll(s) in the host defence of Drosophila
0	Toll activation during the immune response (Fig. 1) is strictly dependent on the product of the Spaetzle gene. The Spaetzle protein is a cystine-knot molecule with structural similarities to mammalian neurotrophins, and requires proteolytic cleavage for full biological activity23,24. This cleavage is induced by a proteolytic cascade activated as an early result of infection. The mature 12-kDa form of Spaetzle binds as a dimer to the Toll ectodomain with high affinity (Kd < 0.4 nM) and with a stoichiometry of one Spaetzle dimer to two receptor proteins25. The intracytoplasmic TIR domain of Toll interacts with three partners, each of which has a death domain. Two of these are adaptor proteins: the Drosophila homologue of MyD88 (refs 26-29), which in addition to the death domain has a TIR domain similar to that of Toll with which it associates, and Tube. Tube has no obvious mammalian homologue. The third death-domain protein in this receptor-adaptor complex is Pelle, which has a serine-threonine kinase domain and is homologous to mammalian IRAKs (interleukin-1 receptor-associated kinases; reviewed in ref. 30). Depending on the developmental stage, Toll can activate two closely related NF-κB proteins in immune-responsive tissues: DIF (Dorsal-related immunity factor; ref. 31) in adults, and Dorsal and/or DIF in larvae32-34. The end effect of Toll signalling is the dissociation of NF-κB protein from the ankyrin-repeat inhibitory protein Cactus, a homologue of mammalian IκBs. This process involves signal-dependent phosphorylation of Cactus, followed by its degradation by the proteasome35,36. The activation of Dorsal requires phosphorylation, in addition to dissociation from Cactus (see also refs 37, 38). It is unclear how activation of the Toll receptor-adaptor complex leads to these various processes.
Although Drosophila expresses genes encoding members of the TRAF (TNF-receptor-associated factor) family and homologues of mammalian IKK-β (IκB kinase-β) and IKK-γ/NEMO, genetic studies have failed so far to demonstrate an involvement of any of these genes downstream of Toll. Furthermore, Pelle does not directly phosphorylate Cactus and the identity of the Cactus kinase remains elusive. The precise roles of the Toll pathway during the response to fungal and Gram-positive bacterial infection are not fully understood. One effect is obviously to direct the expression of various antimicrobial peptides. However, microarray data have indicated that hundreds of genes are markedly upregulated as a consequence of the challenge-dependent activation of Toll39,40, and their functions have not yet been adequately addressed. In addition to Toll, the Drosophila genome contains eight homologues (18-Wheeler/Toll-2 to Toll-9)41. Except for Toll, it has not been possible to unequivocally a
0	A  genome-wide  analysis  of  immune  responses  in  Drosophila
1	Phil  Irving*,  Laurent  Troxler*,  Timothy  S.  Heuer,  Marcia  Belvin,  Casey  Kopczynski,  Jean-Marc  Reichhart*,  Jules  A.  Hoffmann*,  and  Charles  Hetru*§
0	Oligonucleotide DNA microarrays were used for a genome-wide analysis of immune-challenged Drosophila infected with Gram-positive or Gram-negative bacteria, or with fungi. Aside from the expression of an established set of immune defense genes, a significant number of previously unseen immune-induced genes were found. Genes of particular interest include corin- and Stubble-like genes, both of which have a type II transmembrane domain; easter- and snake-like genes, which may fulfil the roles of easter and snake in the Toll pathway; and a masquerade-like gene, potentially involved in enzyme regulation. The microarray data have also helped to greatly reduce the number of target genes in large gene groups, such as the proteases, helping to direct the choices for future mutant studies. Many of the up-regulated genes fit into the current conceptual framework of host defense, whereas others, including the substantial number of genes with unknown functions, offer new avenues for research.
0	at either 18 or 25°C. Adult male flies were removed from the colonies at 1 day old and kept at 18°C until 3 days old. At this age, flies were either inoculated or designated as controls. Control and infected flies were snap-frozen in liquid nitrogen and stored at −80°C before extraction of total RNA.
0	Microbial  Challenge  of  Flies.  Inoculation  with  bacteria.  The  bacteria  Escherichia  coli  and  Micrococcus  luteus  were  precultured  in  LB  medium.  Pellets  taken  when  the  cultures  were  in  the  log  phase  of  growth  were  resuspended  in  a  small  amount  of  culture  medium,  and  sharpened  needles  dipped  into  these  suspensions  were  used  to  inoculate  the  flies.  Flies  were  harvested  at  6,  12,  and  48  h  after  inoculation.  Natural  infection  with  fungi.  Flies  anaesthetized  with  CO2  were  shaken  for  a  few  minutes  in  a  Petri  dish  containing  a  sporulating  culture  of  Beauveria  bassiana.  Flies  covered  with  spores  were  placed  in  fresh  tubes  of  Drosophila  medium  and  kept  at  25°C.  Flies  were  collected  3  days  after  infection.  Sample  Preparation  and  Analysis.  For  each  time  point  and  infection
0	Innate immunity is the first-line defense of multicellular organisms that operates to limit infection after exposure to microbes. Invertebrates and vertebrates share a common ancestry for this defense system, illustrated by the striking conservation of the intracellular signaling pathways that regulate the rapid transcriptional response to infection in the fruit fly Drosophila and in mammals (1, 2). Because of its flexible genetics, Drosophila has emerged as a powerful model system for the study of innate immunity. Prominent among the innate immunity reactions is the phagocytosis or encapsulation of the invading organism by the hemocytes (3) and the massive synthesis of antimicrobial peptides by the fat body (4, 5), a functional equivalent of the liver. Transcriptional induction of antimicrobial peptide genes is known to be controlled by at least two distinct pathways, Toll and Imd (6). Although much has been learned about Drosophila immunity through genetic screens and biochemical analyses, many questions remain. For example, what gene products are responsible for recognition of invading pathogens and how do they activate the Toll or Imd pathways? What genes other than the antimicrobial peptide genes are induced after immune challenge and what roles do these genes play in the innate immune response? To complement the genetic approaches currently underway, transcriptional profiling experiments were carried out to survey the majority of Drosophila genes for their response to bacterial and fungal infection, using Affymetrix (Santa Clara, CA) GeneChips.
The  induction  of  the  various  Drosophila  antimicrobial  peptides  correlated  well  with  many  earlier  studies  based  on  Northern  blotting  experiments  (7,  8),  confirming  the  accuracy  of  the  microarray  methodology  used.  In  addition,  a  large  number  of  genes  previously  unknown  to  be  induced  by  infection  were  identified.  The  potential  role  of  these  genes  in  recognition,  signaling,  and  effector  mechanisms  of  the  Drosophila  immune  response  can  now  be  assessed  by  using  reverse  genetic  tools  available  in  Drosophila.  Materials  and  Methods  Drosophila  Stocks.  Cinnabar  brown  flies  (cn  bw)  were  reared  on  standard  cornmeal  medium  in  vials  held  in  humid  culture  rooms,
0	L.T.,  and  T.S.H.  contributed  equally  to  this  work.
0	The  publication  costs  of  this  article  were  defrayed  in  part  by  page  charge  payment.  This  article  must  therefore  be  hereby  marked  "advertisement"  in  accordance  with  18  U.S.C.  §1734  solely  to  indicate  this  fact.
0	December  18,  2001
0	Table  1.  Absolute  and  relative  expression  values  for  genes  discussed  in  text
0	Caspase
CG7486 (Dredd): Death related ced-3 Nedd2-like protein
CG7788 (Ice): Interleukin-1 beta-converting enzyme
CG14902 (Decay): Death executioner caspase related to Apopain
CG18188 (Daydream): Death Associated Molecule related to Mch2
Defense or immunity protein
CG11709 (PGRP-SA): Peptidoglycan recognition protein-SA
CG9681 (PGRP-SB1): Peptidoglycan recognition protein-SB1
CG14745 (PGRP-SC2): Peptidoglycan recognition protein-SC2
CG7496 (PGRP-SD): Peptidoglycan recognition protein-SD
CG14704 (PGRP-LB): Peptidoglycan recognition protein-LB
CG10146 (AttA): Attacin-A
CG18372 (AttB): Attacin-B
CG4740 (AttC): Attacin-C
CG7629 (AttD): Attacin-D
CG1365 (CecA1): Cecropin A1
CG1367 (CecA2): Cecropin A2
CG1878 (CecB): Cecropin B
CG1373 (CecC): Cecropin C
CG12763 (Dpt): Diptericin A
CG10794 (DptB): Diptericin B
CG1385 (D
0	Open  Access
1	Ian  Birch-Machin¤*,  Shan  Gao¤,  David  Huen,  Richard  McGirr*,  Robert  AH  White*  and  Steven  Russell
0	Genomic  analysis  of  heat-shock  factor  targets  in  Drosophila
0	Birch-Machin  et  al.;  licensee  BioMed  Central  Ltd.  This  is  an  Open  Access  article  distributed  under  the  terms  of  the  Creative  Commons  Attribution  License  (http://creativecommons.org/licenses/by/2.0),  which  permits  unrestricted  use,  distribution,  and  reproduction  in  any  medium,  provided  the  original  work  is  properly  cited.
0	We  have  used  a  chromatin  immunoprecipitation-microarray  (ChIP-array)  approach  to  investigate  the  in  vivo  targets  of  heat-shock  factor  (Hsf)  in  Drosophila  embryos.  We  show  that  this  method  identifies  Hsf  target  sites  with  high  fidelity  and  resolution.  Using  cDNA  arrays  in  a  genomic  search  for  Hsf  targets,  we  identified  141  genes  with  highly  significant  ChIP  enrichment.  This  study  firmly  establishes  the  potential  of  ChIP-array  for  whole-genome  transcription  factor  target  mapping  in  vivo  using  intact  whole  organisms.
0	Chromatin  immunoprecipitation  or,  more  correctly,  immunopurification  (ChIP)  has  emerged  as  a  valuable  approach  for  identifying  the  in  vivo  binding  sites  of  transcription  factors  [1-6].  Before  the  availability  of  complete  genome  sequence  the  use  of  this  approach  for  identifying  transcription  targets  on  a  genome-wide  scale  was,  however,  limited.  Over  the  past  few  years,  a  number  of  laboratories  have  successfully  used  high-density  DNA  microarrays  to  identify  sequences  enriched  by  chromatin  immunopurification  (the  ChIP-array  approach).  In  the  yeast  Saccharomyces  cerevisiae,  microarrays  containing  virtually  all  of  the  intergenic  sequences  from  the  genome  have  been  used  to  identify  the  binding  sites  of  a  large  number  of  transcription  factors  [7,8].  In  principle,  the  same  techniques  can  be  applied  to  higher  eukaryotes,  but  the  complexity  of  their  genomes  presents  a  challenge  for  the  construction  of  full  genomic  microarrays.
0	Despite such difficulties, several studies have shown the feasibility of the ChIP-array approach with small regions of complex eukaryotic genomes using tissue culture systems. In cultured mammalian cells, for example, the binding sites for several transcription factors have been mapped using microarrays composed of specific promoter regions or enriched for promoter sequences with CpG arrays [9-11]. Although such studies are valuable in identifying some of the targets of particular transcription factors, they are limited because the microarray designs restrict the analysis to proximal promoter elements of a subset of genes. It would be preferable to examine binding sites in an unbiased fashion by constructing tiling arrays composed of all possible binding targets. Such tiling arrays have been constructed on a small scale with microarrays containing a series of 1-kb fragments from the β-globin locus [12], or on a large scale with oligonucleotide arrays containing elements that detect all the unique sequences of human chromosomes 21 and 22 [13]. These studies indicate that the DNA-binding patterns of regulatory molecules in
0	Genome  Biology  2005,  6:R63
0	large  eukaryotic  genomes  are  complex  and  highlight  the  need  for  a  comprehensive  approach  to  understand  how  transcription  factors  interact  with  DNA  in  vivo.  Drosophila  melanogaster,  with  a  genome  complexity  intermediate  between  that  of  yeast  and  human,  provides  a  powerful  system  for  investigating  transcription  factor  targets  and  regulatory  networks  in  a  complex  multicellular  eukaryote.  Recently,  the  principle  of  using  Drosophila  genome  tile  arrays  to  identify  transcription  factor  binding  sites  in  tissue  culture  cells  has  been  demonstrated.  Using  a  technique  employing  fusions  between  DNA-binding  proteins  and  the  Escherichia  coli  DNA  adenine  methyltransferase  (DamID;  [14])  the  binding  locations  for  the  GAGA  transcription  factor  and  the  heterochromatin  protein  HP1  were  mapped  within  a  3-Mb  region  of  the  Drosophila  genome  in  a  tissue  culture  system  [15].  Other  studies  have  used  this  method  to  map  proximal  binding  sites  with  cDNA  arrays  [16].  While  this  elegant  technique  has  the  advantage  that  high-quality  antibodies  against  particular  transcription  factors  are  not  required,  and  a  recent  study  indicates  that  it  may  be  possible  to  transfer  from  a  tissue  culture  system  to  the  intact  organism  [17],  it  clearly  has  limitations,  as  in  vivo  the  DAM-tagged  transcription  factor  is  not  expressed  in  its  normal  developmental  context.  It  is  therefore  desirable  to  develop  methods  that  allow  the  mapping  of  native  transcription  factors  in  their  correct  in  vivo  context  within  the  organism.  Here  we  adapt  chromatin  immunopurification  techniques  using  intact  Drosophila  embryos  and  demonstrate  the  reliable  identification  of  in  vivo  binding  sites  for  the  heat-shock  transcription  factor  Hsf  on  both  genome  tile  and  cDNA  arrays.  
The response of most organisms to heat stress involves the rapid induction of a set of heat-shock proteins (Hsps), including several chaperone molecules that assist in protecting the cell from the deleterious effects of heat [18-21]. Several direct targets of the Hsf transcription factor are already well characterized. In higher eukaryotes, including Drosophila and mammals, heat stress results in the trimerization of Hsf monomers, which then bind with high affinity to regulatory elements (heat-shock elements, HSE) close to the transcriptional start sites of Hsp genes [22,23]. The Drosophila heat-shock system has been characterized at several levels, from the cytological mapping of Hsf-binding sites on polytene chromosomes [22] to the detailed molecular and biochemical analysis of transcriptional regulation at individual Hsp genes [24-26]. In this study we extend the analysis of the Drosophila heat-shock response by demonstrating that chromatin immunopurification from embryos can accurately map in vivo Hsf-binding sites on genome tile microarrays and identify new potential in vivo HSEs. In addition, using microarrays containing full-length cDNA clones for over 5,000 Drosophila genes we identify almost 200 genes that are reproducibly bound by Hsf upon heat shock in Drosophila embryos. The targets correspond well with previously identified cytological locations of Hsf binding on salivary gland polytene chromosomes, thus providing direct target genes associated with the low-resolution cytological analysis. A comparison with studies using S. cerevisiae Hsf [27,28] suggests that a set of conserved genes is regulated by Hsf in both organisms. Overall, this study presents the strong potential of this approach for in vivo genome-wide mapping of transcription factor binding sites in higher eukaryotes using the whole organism.
0	Results  and  discussion
0	Immunopurification  of  Hsf-bound  chromatin
0	Nature  Publishing  Group  http://genetics.nature.com
0	The  contributions  of  sex,  genotype  and  age  to  transcriptional  variance  in  Drosophila  melanogaster
1	Wei  Jin1,4*,  Rebecca  M.  Riley1*,  Russell  D.  Wolfinger2,  Kevin  P.  White3,  Gisele  Passador-Gurgel1  &  Greg  Gibson1
0	Here we present a statistically rigorous approach to quantifying microarray expression data that allows the relative effects of multiple classes of treatment to be compared and incorporates analytical methods that are common to quantitative genetics. From the magnitude of gene effects and contributions of variance components, we find that gene expression in adult flies is affected most strongly by sex, less so by genotype and only weakly by age (for 1- and 6-wk flies); in addition, sex x genotype interactions may be present for as much as 10% of the Drosophila transcriptome. This interpretation is compromised to some extent by statistical issues relating to power and experimental design. Nevertheless, we show that changes in expression as small as 1.2-fold can be highly significant. Genotypic contributions to transcriptional variance may be of a similar magnitude to those relating to some quantitative phenotypes and should be considered when assessing the significance of experimental treatments.
0	Statistical genetic approaches to mapping genotype onto phenotype continue to place in a black box all the events occurring between the gene and the appearance of a trait. Despite the historical successes of partitioning environmental and interaction effects into variance components1, it can be argued that the failure to include a mechanistic component in this general approach presents a considerable obstacle to the integration of developmental/physiological genetics and quantitative genetics. In this context, the precise quantification of intracellular processes such as transcription and translation should be an important goal of genomic analysis. Comparing gene expression among lines and treatments using complementary DNA microarray technology presents one means of achieving this goal. Currently, microarray data are most often analyzed by comparing an experimental treatment to a common control and measuring the ratio of inferred transcript levels for each gene from the ratio of fluorescence2. This approach is inadequate for quantitative analysis for two main reasons: the choice of arbitrary ratio thresholds has no sound basis in statistical theory and the approach does not provide the flexibility to allow direct comparison of different sources of variance. It has been pointed out that standard methods of quantitative genetic analysis can be applied to microarray data3,4, and in fact such methods suggest experimental designs that dispense with reference samples but increase statistical power as compared with ratio-based methods5. Relying on moderate levels of replication, these methods allow investigators to identify significant
0	reported,  but  the  fact  that  most  fly  traits  show  sex  variance  and  sex  x  genotype  interactions9,13  in  addition  to  the  obvious  differences  between  the  male  and  female  reproductive  systems  implies  that  the  transcriptomes  of  the  two  sexes  are  likely  to  be  quite  different.  Here  we  use  a  long-standing  and  widely  used  statistical  method  from  agricultural  and  quantitative  genetics--the  mixed  model  analysis  of  variance14--to  rank  the  effects  of  sex,  genotype  and  age  on  transcription  and  to  draw  comparisons  between  the  contributions  of  sex  and  genotype  to  the  variance  of  transcription  and  of  phenotypic  traits.
0	Our  experimental  design  consisted  of  24  cDNA  microarrays,  6  for  each  combination  of  2  genotypes  (Oregon  R  and  Samarkand)  and  the  2  sexes,  involving  48  separate  labeling  reactions.  We  directly  contrasted  two  time  points,  1-wk  and  6-wk  adult  flies,  on  each  microarray.  The  dyes  Cy3  and  Cy5  were  flipped  for  two  of  the  six  replicates  of  each  genotype  and  sex  combination.  A  common  reference  sample  was  not  used.  In  total,  we  spotted  4,256  clones,  representing  a  third  of  the  genome--two-thirds  of  which  were  verified  by  resequencing  before  printing.  We  excluded  325  clones  from  the  analysis  because  no  consistent  expression  above  background  was  detected.  We  analyzed  fluorescence  levels  with  the  objective  of  establishing  whether  the  level  of  expression  of  each  gene  relative  to  the  sample  mean  of  the  labeling  reaction  varies  according  to  sex,  genotype  and  age.  We  used  two  sequential  analyses  of  variance  (ANOVAs).  This  procedure  uses  differences  in  normalized  expression  levels,  rather  than  ratios,  as  the  unit  of  analysis  of  expression  differences,  eliminating  the  need  for  a  reference  sample.  The  statistical  model  for  each  clone  simultaneously  fits  the
0	effects of the treatments of interest across the entire experiment, allowing direct contrasts of the magnitude of the effects caused by each treatment and interactions among treatments. Differences in global levels of transcription among treatments can also be tested (Methods). In our experiment, the male samples tended to show higher fluorescence intensities than the female ones, although the magnitude of the effect was very small relative to significant individual gene differences. Cluster analysis of normalized expression levels tends to group genes according to the overall mean fluorescence intensity and, to some extent, to the greatest effect (in this case sex), but is inefficient at identifying groups of genes coregulated in more subtle ways. Nevertheless, after grouping genes according to the significance of fixed effects, we used TreeView15 to provide a visual representation analogous to the standard method of representing ratio effects (Fig. 1). Representative sex and sex x genotype interaction effects of various types are clearly seen in Fig. 1a,b, whereas more subtle genotype and age effects can be seen by close inspection of Fig. 1c,d. Plots of normalized expression levels for individual clones provide a visual means of assessing within- and among-treatment variance (Fig. 2). Lines link measures on a single array, with age-contrasted pairs of points corresponding to, from left to right, Oregon R females and males and then Samarkand females and males. The top two genes (fs(1)K10 and ebony) show significant effects of both sex and genotype, whereas the testes-enriched gene ocnus shows only the sex effect.
CG9090, which encodes a putative mitochondrial phosphate transporter (Flybase; http://flybase.bio.indiana.edu), is unaffected by sex or genotype, but is consistently reduced in older flies (P<0.0001, ANOVA). Note that few of these effects exceed the commonly used arbitrary threshold of a twofold
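The two-step idea described above (express each clone relative to the mean of its labeling reaction, then estimate sex, genotype and age effects as differences rather than ratios) can be illustrated with a small sketch. This is not the authors' mixed-model ANOVA: it assumes a balanced toy design with one hypothetical labeling reaction per sex x genotype x age combination, treats all effects as fixed, and ignores array and dye terms. All values, names and the 100-clone table are invented for illustration. In a balanced design such as this one, the difference between level means equals the least-squares main-effect estimate.

```python
from math import log2
from statistics import mean

def centre_on_labeling_mean(table):
    """Step 1: subtract each labeling reaction's mean log2 intensity.

    table[clone][reaction] holds log2 fluorescence values.
    """
    n_reactions = len(table[0])
    col_means = [mean(row[j] for row in table) for j in range(n_reactions)]
    return [[row[j] - col_means[j] for j in range(n_reactions)] for row in table]

def main_effect(values, factor):
    """Step 2: difference in mean normalized expression between levels 1 and 0."""
    high = [v for v, f in zip(values, factor) if f == 1]
    low = [v for v, f in zip(values, factor) if f == 0]
    return mean(high) - mean(low)

# Hypothetical balanced design: 8 labeling reactions, factors coded 0/1.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
genotype = [0, 0, 1, 1, 0, 0, 1, 1]
age = [0, 1, 0, 1, 0, 1, 0, 1]

# Invented data: 100 clones, each reaction with its own brightness offset.
n_clones = 100
labeling_offset = [0.5, -0.3, 0.1, 0.0, -0.2, 0.4, -0.1, 0.2]
table = [[8.0 + 0.04 * k + labeling_offset[j] for j in range(8)]
         for k in range(n_clones)]
# Clone 0 is given a 1.2-fold sex difference (log2(1.2) on the log scale).
table[0] = [v + log2(1.2) * s for v, s in zip(table[0], sex)]

normalized = centre_on_labeling_mean(table)
effects = {name: main_effect(normalized[0], factor)
           for name, factor in (("sex", sex), ("genotype", genotype), ("age", age))}
print(effects)
```

Centring removes the reaction-specific brightness offsets, so the small sex effect on clone 0 is recovered almost unchanged while its genotype and age effects stay at zero. A real analysis would add replication and random array effects in order to attach significance levels to these differences.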
0	Rapid  evolution  of  male-biased  gene  expression  in  Drosophila
1	Colin D. Meiklejohn*, John Parsch, Jose M. Ranz*, and Daniel L. Hartl*
0	A  number  of  genes  associated  with  sexual  traits  and  reproduction  evolve  at  the  sequence  level  faster  than  the  majority  of  genes  coding  for  non-sex-related  traits.  Whole  genome  analyses  allow  this  observation  to  be  extended  beyond  the  limited  set  of  genes  that  have  been  studied  thus  far.  We  use  cDNA  microarrays  to  demonstrate  that  this  pattern  holds  in  Drosophila  for  the  phenotype  of  gene  expression  as  well,  but  in  one  sex  only.  Genes  that  are  male-biased  in  their  expression  show  more  variation  in  relative  expression  levels  between  conspecific  populations  and  two  closely  related  species  than  do  female-biased  genes  or  genes  with  sexually  monomorphic  expression  patterns.  Additionally,  elevated  ratios  of  interspecific  expression  divergence  to  intraspecific  expression  variation  among  male-biased  genes  suggest  that  differences  in  rates  of  evolution  may  be  due  in  part  to  natural  selection.  This  finding  has  implications  for  our  understanding  of  the  importance  of  sexual  dimorphism  for  speciation  and  rates  of  phenotypic  evolution.
0	microarray | intraspecific variation | interspecific variation | cDNA
0	Anisogamous reproduction is common in many animal and plant species and can produce a number of conflicts with important evolutionary consequences. For example, differential selection coefficients between the two sexes can lead to stable genetic polymorphisms or a decline in population mean fitness (1). It can also drive accelerated rates of phenotypic evolution, as many morphologies associated with sex and reproduction diverge more rapidly than other phenotypes (2). Molecular techniques that provide rapid and quantitative measures of genotypic and phenotypic variation have extended this pattern to include accelerated rates of evolution among proteins with sexual or reproductive functions (3, 4). Since then, most data supporting this observation have come from homologous nucleotide sequences of genes that are associated with sex or reproduction. In ciliates, green algae, diatoms, angiosperms, fungi, and at least four animal phyla, unusually high ratios of nonsynonymous to synonymous substitutions (dN/dS) between species have been documented in sex-related genes (reviewed in ref. 5). Some of these genes also show high levels of intraspecific differentiation (5). In Drosophila, much of this work has focused on genes that are expressed in testes or accessory glands (e.g., refs. 6 and 7), although a high dN/dS has also been observed for genes expressed in females and components of the sex determination pathway (8). Protein coding sequences provide a natural context for studying rates of evolution, as the effect of a given nucleotide substitution on the polypeptide is predictable, and comparison between neighboring synonymous and nonsynonymous sites controls for mutation rate.
Because of the lack of an analogous context for regulatory sequences, the rates and patterns of evolution in regions of the genome controlling gene expression are less well understood. Thus, it is not known whether the rapid rates of evolution among genes associated with sex and reproduction hold for gene expression as well. Because a large proportion of important phenotypic evolution may be the result of changes in gene expression (9, 10), understanding rates and patterns of regulatory change within and between species is
0	critical for a comprehensive picture of biological evolution. Given the pattern seen for amino acid sequences and morphologies, we would predict that genes associated with sex should be evolving faster at the level of gene regulation as well. Indeed, much of the divergence among proteins in the male reproductive tract of Drosophila may be attributable to large changes in protein levels, which is likely due in part to changes in gene expression (3). To test this prediction, we obtained gene expression data for 1/3 of the genome from adult males of eight strains of Drosophila melanogaster, and from adult males and females of one strain of D. melanogaster and one strain of Drosophila simulans. By analyzing intra- and interspecific expression differentiation within males and the sex-specificity of expression in both species, we show that gene expression in males evolves more rapidly than in females. Genes that are male-biased in their expression have on average more intra- and interspecific divergence in expression than genes with female-biased expression. Furthermore, comparison of intra- and interspecific differentiation suggests that at least some of the excess in divergence among male-biased genes (MBGs) is due to differential selective pressures acting on the expression of different sex-biased classes of genes. Materials and Methods
0	Gene  Collection  version  1.0  (12)  were  amplified  by  PCR  with  universal  primers,  and  the  products  were  confirmed  by  gel
0	This  paper  was  submitted  directly  (Track  II)  to  the  PNAS  office.  Abbreviations:  MBGs,  male-biased  genes;  FBGs,  female-biased  genes;  UBGs,  unbiased  genes;  OBGs,  ovary-biased  genes.
0	Table  1.  Overrepresentation  of  MBGs  among  genes  with  polymorphic  expression  within  D.  melanogaster
0	Subsets  of  genes  include  those  that  exhibit  at  least  one  pairwise  difference  between  any  two  strains  at  the  significance  level  indicated.  G,  G  test  of  independence.
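The table note above refers to a G test of independence. As a minimal sketch of how such a statistic is computed for a contingency table, the function below implements the standard log-likelihood-ratio formula G = 2 Σ O ln(O/E). The counts are hypothetical and are not taken from the paper's Table 1; the function name and table layout are likewise assumptions made for illustration.

```python
from math import log

def g_test_independence(table):
    """Return (G, degrees of freedom) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g += observed * log(observed / expected)
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return 2.0 * g, dof

# Hypothetical counts: rows = male-biased / all other genes,
# columns = polymorphic expression / no detected polymorphism.
counts = [[30, 70],
          [20, 180]]
g_stat, dof = g_test_independence(counts)
print(g_stat, dof)
```

G is compared with a chi-square distribution with the returned degrees of freedom. For a 2 x 2 table (one degree of freedom), a value above about 3.84 corresponds to P < 0.05.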
0	Influence  of  age,  sex,  and  strength  training  on  human  muscle  gene  expression  determined  by  microarray
0	THE  LOSS  OF  SKELETAL  MUSCLE
0	SKELETAL  MUSCLE  GENE  EXPRESSION
0	Physiol  Genomics  ·  VOL
0	blood, and connective tissue, enclosed in cryovials, snap-frozen in liquid nitrogen, and stored at −80°C until analysis. Microarray molecular biology. Total RNA was extracted using the SV RNA Isolation Kit (Promega) according to manufacturer's instructions (which included DNase I treatment) and quantitated by determining absorbance at 260 nm in triplicate, with the values averaged. For each microarray experiment, a total of 1 µg of total RNA was used for each hybridization, thus 200 ng of total RNA was taken from each sample and pooled for each group. Arrays were hybridized according to the manufacturer's instructions, once for each experimental condition (baseline, ST) within a single group. Thus four total microarrays, one for each of the four groups, were hy
0	Transcriptional  Repressor  Functions  of  Drosophila  E2F1  and  E2F2  Cooperate  To  Inhibit  Genomic  DNA  Synthesis  in  Ovarian  Follicle  Cells
0	CAYIRLIOGLU  ET  AL.
0	MOL.  CELL.  BIOL.
0	Research  article
0	A  genomic  analysis  of  Drosophila  somatic  sexual  differentiation  and  its  regulation
1	Michelle N. Arbeitman1,*, Alice A. Fleming1, Mark L. Siegal1, Brian H. Null2 and Bruce S. Baker1
0	In virtually all animals, males and females are morphologically, physiologically and behaviorally distinct. Using cDNA microarrays representing one-third of Drosophila genes to identify genes expressed sex-differentially in somatic tissues, we performed an expression analysis on adult males and females that: (1) were wild type; (2) lacked a germline; or (3) were mutant for sex-determination regulatory genes. Statistical analysis identified 63 genes sex-differentially expressed in the soma, 20 of which have been confirmed by RNA blots thus far. In situ hybridization experiments with 11 of these genes showed they were sex-differentially expressed only in internal genital organs. The nature of the products these genes encode provides insight into the molecular physiology of these reproductive tissues. Analysis of the regulation of these genes revealed that their adult expression patterns are specified by the sex hierarchy during development, and that doublesex probably functions in diverse ways to set their activities.
0	Key  words:  Drosophila,  Sex  determination,  Microarray,  Somatic,  Reproduction
0	In  essence,  sexual  reproduction  is  the  process  whereby  two  gametes,  one  contributed  by  each  parent,  fuse  to  form  a  new  individual.  Achieving  this  end  is  an  elaborate  process  that  in  multicellular  animals  requires,  along  with  germline  development,  the  appropriate  sex-specific  development  and  physiology  of  the  external  genitalia,  portions  of  the  nervous  system  that  control  sex-specific  reproductive  behaviors,  somatic  tissues  of  the  gonads  (which  play  important  roles  in  gametogenesis),  and  the  internal  genital  organs  (whose  products  are  important  both  pre-  and  post-copulation  for  successful  reproduction).  Currently,  we  have  limited  knowledge,  in  any  organism,  of  the  sets  of  genes  that  are  deployed  sex-differentially  in  adult  somatic  tissues,  and  limited  knowledge  of  their  roles  in  sexual  reproduction.  Drosophila  melanogaster  is  a  powerful  model  system  in  which  to  acquire  an  understanding  of  the  sex-specific  physiology  of  adult  somatic  tissues,  because  we  have  a  thorough  understanding  at  the  molecular-genetic  level  of  the  regulatory  hierarchy  that  controls  somatic  sexual  differentiation  (Fig.  1)  (reviewed  by  Cline  and  Meyer,  1996;  Baker  et  al.,  2001;  Christiansen  et  al.,  2002).  There  have  been  significant  advances  in  understanding  how  the  actions  of  DSXF  and  DSXM,  terminal  transcription  factors  in  the  hierarchy  encoded  by  the  doublesex  (dsx)  gene,  are  integrated  with  other  key  developmental  hierarchies  to  achieve  sex-specific  patterns  of  growth,  morphogenesis  and  differentiation  (reviewed  by
0	Christiansen et al., 2002). However, we have relatively little knowledge of the genes that are sex-differentially deployed in adults through the action of the two final genes in the hierarchy, dsx and fruitless (fru), which encodes (among several isoforms) a male-specific transcription factor hereafter referred to as FRUM. Several approaches have been used to identify genes expressed sex-differentially in D. melanogaster adults. The most thoroughly studied tissue is the male accessory gland, in which 75 genes have been identified using biochemical purification and differential cDNA hybridization (reviewed by Wolfner, 2002). Several of these genes encode proteins whose effects in the mated female have been characterized and include decreasing female receptivity to re-mating, increasing ovulation and egg laying, and facilitating sperm storage. Additional screens have focused on sex-differential gene expression in the head and foreleg. In head tissues, subtractive hybridization identified takeout (Dauwalder et al., 2002), and serial analysis of gene expression (SAGE) uncovered 46 sex-differentially expressed genes (Fujii and Amrein, 2002). From the foreleg, two genes implicated in male-specific chemosensory function (CheA29a and CheB42a) were isolated by subtractive cloning (Xu et al., 2002). Sex-differential gene expression in adults has also been studied using microarray technology (Jin et al., 2001; Arbeitman et al., 2002; Parisi et al., 2003; Ranz et al., 2003). In two of these studies (Arbeitman et al., 2002; Parisi et al., 2003), both the somatic and germline
0	components of sex-differential expression were determined, but regulation by the sex-determination hierarchy was not explored. Here, we identify genes that are expressed sex-differentially in somatic tissues of adults and regulated by the sex hierarchy. Using arrays that assay approximately one-third (4040) of Drosophila genes, we analyzed adults mutant for the regulatory genes transformer (tra), dsx and fru (Fig. 1). To select a small number of such genes for further study, we chose a conservative approach. Stringent statistical analysis of these data, combined with data from wild-type adults and adults that lack germline tissue (Arbeitman et al., 2002), identified 63 genes that are sex-differentially expressed in the adult soma and regulated by the somatic sex hierarchy. Additional selection criteria, and validation by RNA blot analysis, defined a set of 11 genes for further characterization. In situ hybridization revealed that sex-differential expression of all 11 genes is confined to the internal genitalia. Analysis of the regulation of these genes revealed that the sex hierarchy functions during development to specify their adult expression patterns, and that dsx probably functions in diverse ways to set their activities.
0	Development 131 (9)
0	Research  article
0	fru  males;  if  it  is  controlled  by  fru,  its  expression  level  is  expected  not  to  differ  between  tud  females  and  dsxD  pseudomales.  First,  the  within-group  mean  square  (MS)  was  calculated  assuming  the  gene  was  under  dsx  control.  Three  means  were  calculated:
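0	$\bar{x}_{tudF}$, $\bar{x}_{dsxD}$ and $\bar{x}_{M}$ (the last pooled over both male genotypes).
0	Then the sum of squared deviations of each data point from its respective mean was calculated and divided by the degrees of freedom:
0	$$MS_{DSX} = \frac{\sum_j (x_{3j} - \bar{x}_{tudF})^2 + \sum_j (x_{4j} - \bar{x}_{dsxD})^2 + \sum_j (x_{1j} - \bar{x}_{M})^2 + \sum_j (x_{2j} - \bar{x}_{M})^2}{df}$$
0	where $x_{ij}$ is the $j$th data point for genotype $i$ and $df$ is the within-group degrees of freedom.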
0	The  MS,  assuming  fru  control,  was  calculated  in  the  same  way,  except  that  genotypes  were  expected  to  have  the  same  expression  level:
0	Materials  and  methods
0	Drosophila stocks
0	Flies were grown using standard conditions at 25°C, unless otherwise indicated. The wild-type stock was Canton S. XX tra, XX DsxD pseudomales, fru males and dsx intersexual mutant animals were wa/w; tra1/Df(3L)st-j7, w/+;DsxD/dsxm+r15 (XX), fru4-40/frup14 (XY), w/+; dsxm+r15/dsxd+r3 (XX), and w;dsxm+r15/dsxd+r3 (XY), respectively. tudor mutants are the progeny of virgin tud1 bw sp females crossed to Canton S males. tra2 temperature-shift experiments used the following genotypes: BsY;tra-2ts1/tra-2ts2 (XY) and tra-2ts1/tra-2ts2 (XX).
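0	For the fru model, the three means were $\bar{x}_{wtM}$, $\bar{x}_{fruM}$ and $\bar{x}_{F}$, and the corresponding mean square was $MS_{FRU}$.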
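The two within-group mean squares described in the statistical analysis can be illustrated with a minimal sketch. It assumes four genotypes (wild-type males, fru males, tud females and dsxD pseudomales), pools the genotypes that each model expects to be equal, and takes the degrees of freedom as the number of observations minus the number of group means. The function names and the expression values are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def within_group_ms(groups):
    """Pooled within-group mean square.

    Sums the squared deviation of every observation from its own group
    mean and divides by (total observations - number of groups).
    The degrees-of-freedom choice is an assumption of this sketch.
    """
    ss = 0.0
    n_obs = 0
    for values in groups:
        mean = fmean(values)
        ss += sum((v - mean) ** 2 for v in values)
        n_obs += len(values)
    return ss / (n_obs - len(groups))


def compare_models(wt_m, fru_m, tud_f, dsx_d):
    """Return (MS under dsx control, MS under fru control, their ratio)."""
    # dsx model: both male genotypes share one mean.
    ms_dsx = within_group_ms([tud_f, dsx_d, wt_m + fru_m])
    # fru model: tud females and dsxD pseudomales share one mean.
    ms_fru = within_group_ms([wt_m, fru_m, tud_f + dsx_d])
    return ms_dsx, ms_fru, ms_fru / ms_dsx


# Hypothetical log-ratio measurements, two replicates per genotype.
ms_dsx, ms_fru, ratio = compare_models(
    wt_m=[1.0, 1.2], fru_m=[1.1, 0.9], tud_f=[3.0, 3.2], dsx_d=[1.0, 1.2]
)
print(ms_dsx, ms_fru, ratio)
```

In this made-up example the dsx grouping leaves much less residual variation than the fru grouping, which is the pattern expected for a gene under dsx control. Turning the ratio into a P value requires the F distribution, which is not included here.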
0	The MSs were then compared using an F test with the appropriate degrees of freedom.
0	RNA blot analyses
0	Total RNA was isolated with Trizol (Invitrogen), followed by RNeasy (Qiagen) or poly(A)+ isolation using Poly-ATtract (Promega). Blots were prepared from a Northern Max kit (Ambion). Radiolabeled RNA probes made with Strip-EZ kit (Ambion) were used at approximately 1-7×10^6 cpm/ml of hybridization solution. Blots were typic
0	Drosophila  melanogaster  MNK/Chk2  and  p53  Regulate  Multiple  DNA  Repair  and  Apoptotic  Pathways  following  DNA  Damage
1	Michael  H.  Brodsky,1,2*  Brian  T.  Weinert,2  Garson  Tsang,2,3  Yikang  S.  Rong,4  Nadine  M.  McGinnis,1  Kent  G.  Golic,5  Donald  C.  Rio,2  and  Gerald  M.  Rubin2,3
0	BRODSKY  ET  AL.
0	MOL.  CELL.  BIOL.
0	RESEARCH  ARTICLE
0	Patterns  of  Gene  Expression  During  Drosophila  Mesoderm  Development
1	Eileen  E.  M.  Furlong,1  Erik  C.  Andersen,1*  Brian  Null,1  Kevin  P.  White,2  Matthew  P.  Scott1
0	The transcription factor Twist initiates Drosophila mesoderm development, resulting in the formation of heart, somatic muscle, and other cell types. Using a Drosophila embryo sorter, we isolated enough homozygous twist mutant embryos to perform DNA microarray experiments. Transcription profiles of twist loss-of-function embryos, embryos with ubiquitous twist expression, and wild-type embryos were compared at different developmental stages. The results implicate hundreds of genes, many with vertebrate homologs, in stage-specific processes in mesoderm development. One such gene, gleeful, related to the vertebrate Gli genes, is essential for somatic muscle development and sufficient to cause neural cells to express a muscle marker. Formation of muscles during embryonic development is a complex process that requires coordinate actions of many genes. Somatic, visceral, and heart muscle are all derived from mesoderm progenitor cells. The Drosophila twist gene (1), which encodes a bHLH transcription factor, is essential for multiple steps of mesoderm development: invagination of mesoderm precursors during gastrulation (2), segmentation (3), and specification of muscle types (4). The role of twist in mesoderm development has been conserved during evolution (5), perhaps because it controls conserved regulatory mesoderm genes. For example, tinman and dMef2 are regulated by Twist in flies (6, 7) (Fig. 1A) and are highly conserved in sequence and function in vertebrates (8-10). In Drosophila, somatic muscle forms from progenitor cells that divide to become muscle founder cells (11). Founder cells acquire unique identities controlled by transcription factors including Krüppel, S59, vestigial, and apterous. 
Each  of  the  30  body  wall  muscles  in  an  abdominal  hemisegment  is  initiated  by  a  single  founder  cell  and  has  unique  attachments  and  innervations  (12).  To  further  clarify  mechanisms  underlying  founder  cell  specification,  myoblast  fusion,  and  muscle  patterning,  we  have  used  Drosophila  mutants  together  with  microarrays  of  cDNA  clones.
0	dependent  embryo  collections,  embryo  sortings,  and  microarray  hybridizations  were  conducted.  The  microarrays  used  for  the  analysis  contained  over  8500  cDNAs  corresponding  to  5081  unique  genes  plus  a  variety  of  controls  [see  Web  fig.  3  for  array  details  (13)].  Each  embryonic  RNA  sample  was  compared  with  a  reference  sample,  which  contains  RNA  made  from  all  stages  of  the  Drosophila  life  cycle  and  allows  direct  comparisons  among  all  the  experiments.  Sample  and  array  variability  was  determined  by  calculating  correlation  coefficients  and  standard  deviations  for  each  gene  for  all  pair-wise  combinations  of  repeated  samples.  The  median  correlation  coefficient  is  0.92,  and  median  standard  deviation  divided  by  mean  is  0.246  [see  Web  text  for  validation  information  (13)].  To  determine  how  transcription  was  affected  by  the  twist  mutation,  SAM  (significance  analysis  of  microarrays)  analysis  was  used  (17).  Genes  that  are  normally  highly  expressed  in  mesoderm  should  have  lower  transcript  levels  in  twist  homozygotes.  Genes  in  other  tissues  whose  expression  depends  on  signals  from  the  mesoderm  might  also  have  reduced  expression.  Transcripts  of  130  genes,  the  "Twist-low"  group,  were  significantly  lower  in  twist  mutants  than  in  wild  type  (Fig.  2A).  Conversely,  cells  that  would  have  formed  mesoderm  may  take  on  other  fates  in  the  absence  of  twist,  such  as  neuroectoderm;  therefore,  many  transcript  levels  could  increase  in  twist  mutants.  Genes  whose  transcription  is  repressed  by  signals  from  the  mesoderm  would  also  be  enriched  in  twist  mutants.  One  hundred  fifty  genes,  called  the  "Twist-high"  group,  have  increased  levels  of  RNA  in  twist  mutant  embryos  (Fig.  2A).  
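The replicate-variability measures mentioned above (pairwise correlation coefficients between repeated samples, and standard deviation divided by mean for each gene) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the data layout, the function names and the values are hypothetical, and Pearson correlation is assumed because the text does not name the coefficient.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean, median, stdev


def pearson(a, b):
    """Pearson correlation between two equal-length lists of values."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / fmean(values)


def replicate_summary(replicates):
    """Median pairwise correlation and median per-gene SD/mean.

    `replicates` is a list of replicate samples; each sample is a list
    of expression values in the same gene order.
    """
    correlations = [pearson(a, b) for a, b in combinations(replicates, 2)]
    per_gene = [coefficient_of_variation(gene) for gene in zip(*replicates)]
    return median(correlations), median(per_gene)


# Hypothetical expression values: three replicates, four genes each.
reps = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 1.9, 3.2, 3.9],
    [0.9, 2.1, 2.9, 4.2],
]
median_r, median_cv = replicate_summary(reps)
print(median_r, median_cv)
```

Taking the median of each measure keeps a few poorly behaved genes or replicate pairs from dominating the summary.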
In total, 280 of 5000 genes had significant changes in transcript levels, with 10 false positives (17) [see Web text for validation information (13)]. The genes on the array include 15 previously characterized mesoderm-specific genes, all of which were significantly reduced in twist mutant embryos (Fig. 3A). The arrays also contain genes known to be transcribed in both mesoderm and other cell types. Significant changes in expression were detected for many of these genes (Fig. 3B). The 130 Twist-low genes were divided into three groups (A, B, and C) with similar trends of expression by a self-organizing map (SOM) clustering program (Fig. 1B) (18). The 24 group A genes, which included tinman, dMef2, and bagpipe, had reduced transcript levels in twist mutants at all developmental stages assayed. Most of the Twist-low genes fall into the B and C groups. The 62 group B "early genes" encode transcripts with reduced levels of expression in twist mutants only during stages 9-10, not later. One member of group B, stumps (dof/hbr) is
0	essential  for  mesoderm  cell  migration.  stumps  RNA  is  abundant  in  the  mesoderm  at  stages  9-10  and  is  strongly  reduced  by  stage  11  (Fig.  1B)  (19).  At  stage  11,  stumps  RNA  accumulates  in  trachea,  which  are  largely  unaffected  in  twist  mutants.  The  44  group  C  genes  have  reduced  transcript  levels  in  twist  mutant  embryos  only  during  late  stage  11  and  stage  12.  These  "late  genes"  include  blown  fuse,  a  gene  essential  for  myoblast  fusion  (20);  delilah,  a  gene  required  for  somatic  muscle  attachment  (21);  and  genes  such  as  kettin,  which  is  required  to  form  contractile  muscle  (22).  Given  the  predominantly  early  expression  of  twist,  the  early  genes  in  groups  A  and  B  are  the  best  candidates  for  direct  transcription  targets  of  Twist,  though  some  indirectly  activated  genes  may  be  present  within  these  groups.  Group  C  late  genes  are  likely  to  be  regulated  by  products  of  genes  that  are  activated  by  Twist.  In  situ  hybridizations  were  done  using  a  previously  uncharacterized  representative  of  each  Twist-low  group  (Fig.  1C).  In  each  case,  the  hybridization  pattern  was  consistent  with  the  predicted  time  of  transcription.  A  group  A  gene,  CG15015  (GH16741),  is  transcribed  in  somatic  muscle  throughout  stages  9-12.  A  group  B  gene,  CG12177  (GH22706),  is  transcribed  during  early  mesoderm  development,  but  not  later.  CG14848  (GH21860),  a  group  C  gene,  is  expressed  in  the  stomodeum  but  not  the  mesoderm  during  stages  9-10.  Its  mesoderm  expression  initiates  during  stage  11,  the  latest  period  of  the  twist  experiment.  Thus,  combining  loss-of-function  mutant  embryo  analysis  with  staged  embryo  collections  provides  gene  expression  information  for  both  tissue  specificity  and  temporal  expression.  
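The temporal logic behind the three Twist-low groups can be shown with a small rule-based sketch. It is deliberately not the self-organizing map used in the study: it is a simplified stand-in that assumes each gene has already been scored as reduced or not reduced in twist mutants at an early window (stages 9-10) and a late window (late stage 11 and stage 12). The function name and the two boolean inputs are hypothetical.

```python
from typing import Optional


def temporal_group(reduced_early: bool, reduced_late: bool) -> Optional[str]:
    """Assign a Twist-low group label from two reduced/not-reduced calls.

    A: reduced at all stages assayed
    B: reduced only in the early window ("early genes")
    C: reduced only in the late window ("late genes")
    None: not reduced in either window, so not a Twist-low gene
    """
    if reduced_early and reduced_late:
        return "A"
    if reduced_early:
        return "B"
    if reduced_late:
        return "C"
    return None


# Hypothetical calls for three illustrative profiles.
for early, late in [(True, True), (True, False), (False, True)]:
    print(early, late, temporal_group(early, late))
```

A clustering method such as a SOM works on the full expression profiles rather than on thresholded calls, so its group boundaries need not coincide with this rule.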
A  complementary  test:  The  transcription  profile  with  twist  overexpression.  The  misexpression  of  twist  in  the  ectoderm  is  sufficient  to  convert  both  neuronal  and  epidermal  tissues  to  a  myogenic  cell  fate  (4).  RNA  from  embryos  with  ubiquitous  twist  expression  was  used  to  evaluate  the  ability  of  Twist  to  initiate  mesoderm-like  gene  expression  in  cells  that  would  normally  form  other  tissue  types.  Genes  whose  transcript  levels  decrease  in  twist  loss-of-function  embryos  and  increase  when  twist  is  ubiquitous  are  excellent  candidates  for  regulators  of  mesoderm  development  or  differentiation.  To  ectopically  express  twist,  a  dominant  gain-of-function  mutation  of  the  maternal  gene  Toll  (Toll10B)  was  used  (23).  Activated  Toll  induces  the  expression  of  twist  and  snail  in  early  embryos  and  of  immune  response  genes  in  older  embryos  (Fig.  1A)  (24,  25).  Thus,  the  effects  of  Toll10B  on  gene  expression  reflect  the  activities 
0	Dmp53  protects  the  Drosophila  retina  during  a  developmentally  regulated  DNA  damage  response
1	Omar  W.Jassim,  Jill  L.Fink  and  Ross  L.Cagan1
0	Ultraviolet  (UV)  light  is  absorbed  by  cellular  proteins  and  DNA,  promoting  skin  damage,  aging  and  cancer.  In  this  paper,  we  explore  the  UV  response  by  cells  of  the  Drosophila  retina.  We  demonstrate  that  the  retina  enters  a  period  of  heightened  UV  sensitivity  in  the  young  developing  pupa,  a  stage  closely  associated  with  its  period  of  normal  developmental  programmed  cell  death.  Injury  to  irradiated  cells  included  morphology  changes  and  apoptotic  cell  death;  these  defects  could  be  completely  accounted  for  by  DNA  damage.  Cell  death,  but  not  morphological  changes,  was  blocked  by  the  caspase  inhibitor  P35.  Utilizing  genetic  and  microarray  data,  we  provide  evidence  for  the  central  role  of  Hid  expression  and  for  Diap1  protein  stability  in  controlling  the  UV  response.  In  contrast,  we  found  that  Reaper  had  no  effect  on  UV  sensitivity.  Surprisingly,  Dmp53  is  required  to  protect  cells  from  UV-mediated  cell  death,  an  effect  attributed  to  its  role  in  DNA  repair.  These  in  vivo  results  demonstrate  that  the  cellular  effects  of  DNA  damage  depend  on  the  developmental  status  of  the  tissue.  Keywords:  apoptosis/Drosophila/Dmp53/retina/UV
0	UV-damaged DNA can be repaired by a number of mechanisms, including nucleotide excision repair (Friedberg, 2001) and photoreactivation (Carell et al., 2001). In the process of nucleotide excision repair, pyrimidine dimers are excised and replaced with undamaged nucleotides. The disorder xeroderma pigmentosum is linked to at least seven genetic loci that encode factors that participate in nucleotide excision repair (e.g. the nucleases XPF and XPG); patients exhibit hypersensitivity to UV light and a strong predisposition toward skin cancer. An alternate repair mechanism is photoreactivation. Many vertebrates and invertebrates use this system to repair pyrimidine dimers. It includes a light-dependent photolyase repair enzyme that binds to pyrimidine dimers; the dimer is then enzymatically restored to a monomeric form using 350-450 nm light as an energy source. Several lines of evidence suggest, however, that the damage provoked by UV irradiation is mediated by more than its ability to alter DNA. Activation of a number of signaling pathways, including JNK, EGFR and TNF, can occur in a manner independent of either prior nuclear signaling or effects on DNA (e.g. Kulms et al., 1999; Kulms and Schwarz, 2002a). This broad spectrum has led to the suggestion that most receptors that are activated by oligomerization can be affected by UV (Rosette and Karin, 1996). In some cell lines, effects on cellular proteins are thought to represent the principal UV-mediated insult. Once DNA is damaged, the tumor suppressor P53 mediates a cell's response by regulating expression of a number of targets including signal transduction factors, cell cycle regulators, cell repair genes and cell death regulators (Vousden and Lu, 2002). 
P53  also  binds  to  specific  DNA  sites  and  damaged,  single-strand  DNA  (Liu  and  Kulesz-Martin,  2001).  UV  irradiation  leads  to  stabilization  of  the  P53  protein,  in  part  due  to  its  phosphorylation  by  ERK  and  P38  kinases  (She  et  al.,  2000;  Chouinard  et  al.,  2002).  The  kinases  ATR  and  ATM  have  also  been  implicated  in  signaling,  and  perhaps  even  sensing  DNA  damage,  leading  to  their  subsequent  targeting  of  P53  (Lakin  et  al.,  1999;  Tibbetts  et  al.,  1999).  The  result  is  a  dual  role  for  P53:  it  can  direct  cell  cycle  arrest  to  permit  DNA  repair  or  promote  cell  death  when  this  repair  fails.  The  Drosophila  P53  ortholog  Dmp53  also  acts  in  the  cellular  response  to  DNA  damage.  Following  ionizing  radiation,  Dmp53  targets  expression  of  the  pro-apoptotic  effector  Reaper  (Brodsky  et  al.,  2000;  Jin  et  al.,  2000;  Ollmann  et  al.,  2000;  Sogame  et  al.,  2003).  Consistent  with  this  connection,  removing  Reaper  in  the  larval  wing  disk  results  in  a  reduction  of  DNA  damage-induced  programmed  cell  death  (PCD;  Peterson  et  al.,  2002).  Overexpression  of  Dmp53  in  the  retina  can  lead  to  extensive  cell  death  (Jin  et  al.,  2000;  Ollmann  et  al.,
0	© European Molecular Biology Organization
0	UV  irradiation  of  the  Drosophila  retina
0	These observations have led to the suggestion that Dmp53 is promoting inappropriate Reaper expression, although genetic tests did not confirm this association (Peterson et al., 2002). Reaper belongs to the family of RHG proteins that includes Hid, Grim, and Sickle; these proteins are critical during embryonic PCD (Grether et al., 1995). The role of Grim and Hid during radiation-mediated apoptosis has not been examined. Each of these family members is active in specific tissues and responds to specific death stimuli. For example, Reaper is active during embryonic segmentation and larval CNS development (Lohmann et al., 2002; Peterson et al., 2002), whereas Hid appears necessary for PCD within the pupal retina (Yu et al., 2002). RHG proteins direct apoptosis at least in part by targeting Diap1 (Drosophila inhibitor of apoptosis protein-1) for degradation. Diap1 normally inhibits caspase activity by direct binding, and removal of Diap1 leads to caspase activation and subsequent apoptosis. In Drosophila, regulation of Diap1 stability appears to be the primary step in the regulation of apoptosis (Martin, 2002). Its role in radiation-induced cell death, however, has yet to be explored. In this report, we exploit the developing Drosophila retina as a model system to explore the factors that provoke UV and DNA damage response within an emerging epithelium. We utilize several advantages offered by the pupal retina as an in vivo model for UV irradiation: it is a simply constructed neuroepithelium, constituent cells are post-mitotic, the tissue is superficial and is therefore accessible (and highly sensitive) to UV irradiation, and the molecular aspects of its development have been studied extensively. 
We  present  a  number  of  interesting  features  and  factors  associated  with  the  retina's  response  to  UV,  and  find  parallels  between  this  response  and  the  factors  that  direct  normal  PCD  during  its  development.
0	UV  irradiation  leads  to  retinal  defects
0	of 40 000 mJ/cm² was chosen for the assay as it resulted in a moderate roughening and ablation of the retina; ~10 000 mJ/cm² resulted in minimal defects and ~100 000 mJ/cm² resulted in near complete retinal ablation and eventual pupal death. The effect of radiation waned after 25 h APF (Figure 1D; see Supplementary data). By 42 h APF, the stage by which most developmental cell death is complete, the retina no longer responded to moderate UV treatment. We were unable to assess the sensitivity of the retina prior to 18 h APF as at that point the retina has yet to emerge from deeper within the developing pupa. The period of significant UV sensitivity (<25 h APF) corresponds to the early stages of cell death in the pupal retina (Cagan and Ready, 1989a; Wolff and Ready, 1991), suggesting that the signals modulating the induction of developmental cell death may regulate UV-induced cell death as well. Some of the phenotypes observed with UV were due to induction of apoptotic cell death: we observed condensed nuclei and fragmentation of DNA as assessed by TUNEL (Figure 1F). In addition, irradiation led to activation of caspases as assessed by antibodies that target the cleaved downstream caspases ca
0	RESEARCH  ARTICLE
0	A  Gene  Expression  Map  for  the  Euchromatic  Genome  of  Drosophila  melanogaster
1	Viktor  Stolc,1,5*  Zareen  Gauhar,1,2*  Christopher  Mason,2*  Gabor  Halasz,7  Marinus  F.  van  Batenburg,7,9  Scott  A.  Rifkin,2,3  Sujun  Hua,2  Tine  Herreman,2  Waraporn  Tongprasit,6  Paolo  Emilio  Barbano,2,4  Harmen  J.  Bussemaker,7,8  Kevin  P.  White2,3.
0	We  used  a  maskless  photolithography  method  to  produce  DNA  oligonucleotide  microarrays  with  unique  probe  sequences  tiled  throughout  the  genome  of  Drosophila  melanogaster  and  across  predicted  splice  junctions.  RNA  expression  of  protein  coding  and  nonprotein  coding  sequences  was  determined  for  each  major  stage  of  the  life  cycle,  including  adult  males  and  females.  We  detected  transcriptional  activity  for  93%  of  annotated  genes  and  RNA  expression  for  41%  of  the  probes  in  intronic  and  intergenic  sequences.  Comparison  to  genome-wide  RNA  interference  data  and  to  gene  annotations  revealed  distinguishable  levels  of  expression  for  different  classes  of  genes  and  higher  levels  of  expression  for  genes  with  essential  cellular  functions.  Differential  splicing  was  observed  in  about  40%  of  predicted  genes,  and  5440  previously  unknown  splice  forms  were  detected.  Genes  within  conserved  regions  of  synteny  with  D.  pseudoobscura  had  highly  correlated  expression;  these  regions  ranged  in  length  from  10  to  900  kilobase  pairs.  The  expressed  intergenic  and  intronic  sequences  are  more  likely  to  be  evolutionarily  conserved  than  nonexpressed  ones,  and  about  15%  of  them  appear  to  be  developmentally  regulated.  Our  results  provide  a  draft  expression  map  for  the  entire  nonrepetitive  genome,  which  reveals  a  much  more  extensive  and  diverse  set  of  expressed  sequences  than  was  previously  predicted.  Characterization  of  the  complete  expressed  set  of  RNA  sequences  is  central  to  the  functional  interpretation  of  each  genome.  For  almost  3  decades,  the  analysis  of  the  Drosophila  genome  has  served  as  an  important  model  for  studying  the  relationship  between  gene  expression  and  development.  
In  recent  years,  Drosophila  provided  the  initial  demonstration  that  DNA  microarrays  could  be  used  to  study  gene  expression  during  development  (1),  and  subsequent  large-scale  studies  of  gene  expression  in  this  and  other  developmental  model  organisms  have  given  new  insights  into  how
0	of the human genome and for Arabidopsis (11-13). Microarrays have also recently been used to characterize the great diversity of RNA transcripts brought about by differential splicing in human tissues (14). We used both types of approaches to characterize the Drosophila genome. Experimental design. To determine the expressed portion of the Drosophila genome, we designed high-density oligonucleotide microarrays with probes for each predicted exon and probes tiled throughout the predicted intronic and intergenic regions of the genome. We used maskless array synthesizer (MAS) technology (15, 16) to synthesize custom microarrays containing 179,972 unique 36-nucleotide (nt) probes (17). Of these, 61,371 exon probes (EPs) assayed 52,888 exons from 13,197 predicted genes, 87,814 nonexon probes (NEPs) assayed expression from intronic and intergenic regions, and 30,787 splice junction probes (SJPs) assayed potential exon junctions for a test subset of 3955 genes. For the SJPs, we used 36-nt probes spanning each predicted splice junction, with 18 nt corresponding to each exon (14). RNA from six developmental stages during the Drosophila life cycle (early embryos, late embryos, larvae, pupae, and male and female adults) was isolated and reverse-transcribed in the presence of oligodeoxythymidine and random hexamers, and the labeled cDNA was hybridized to these arrays. The stages were chosen to maximize the number of transcripts that would be differentially expressed between samples on the basis of previous results (3, 7). Each sample was hybridized four times, twice with Cy5 labeling and twice with Cy3 labeling (fig. S1). Genomic and chromosomal expression patterns. 
We  determined  which  exon  or  nonexon  probes  correspond  to  genomic  regions  that  are  transcribed  at  any  stage  during  development  (18).  We  used  a  negative  control  probe  (NCP)  distribution  (fig.  S3)  to  score  the  statistical  significance  of  the  EP  or  NEP  signal  intensities  for  each  of  the  24  unique  combinations  of  stage,  dye,  and  array,  correcting  for  probe  sequence  bias  (17,  19).  These  results  were  combined  into  a  single  expression-level  estimate  (19),  a  threshold  for  which  was  determined  by  requiring  a  false  discovery  rate  of  5%  (20).  This  threshold  shows  47,419  of  61,371  EPs  (77%)  and  35,985  out  of  87,814  NEPs  (41%)  were  significantly  expressed  at  some  point  during  the  fly  life  cycle.  Significantly  expressed  EPs  correspond  to  79%  (41,559/52,888)  of  all  exons  probed  and  93%  (12,305/13,197)  of  all  probed  gene  annotations.  Our  results  confirmed  2426  annotated  genes  not  yet  validated  through  an  EST  sequence  (Fig.  1A).  Out  of  10,280  genes  represented  by  EST  sequences,
0	OCTOBER  2004
0	only 401 (3.0%) were not detected in these microarray experiments. Our finding that a large fraction of intergenic and intronic regions (NEPs) is expressed in D. melanogaster mirrors similar observations for chromosomes 21 and 22 in humans (16) and for Arabidopsis (14). These results support the conclusion that extensive expression of intergenic and intronic sequences occurs in the major evolutionary lineages of animals (deuterostomes and protostomes) and in plants. We noted that mRNA expression levels for protein-encoding genes varied with the protein function assigned in the Drosophila Gene Ontology (fig. S2) (21). For example, genes encoding G protein receptors were expressed at relatively low levels, whereas genes encoding ribosomal proteins were highly expressed. A gene's expression level was also associated with cellular compartmentalization and the biological process it mediates (fig. S2). For example, genes encoding cytosolic and cytoskeletal factors were more highly expressed than those predicted to function within organelles such as the endoplasmic reticulum, Golgi, and peroxisome. To determine whether a high level of gene expression was associated with essential genetic functions, we examined the expression levels of genes recently shown to be required for cell viability (Fig. 1B) in a genome-wide RNA interference (RNAi) screen in Drosophila (22). Compared to the rest of the genome, the genes identified as essential by RNAi showed a significant increase in expression during all stages of development (P = 0.0009, t test), even when the highly expressed ribosomal protein genes were omitted (P = 0.0005, t test). 
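The comparison of expression levels between essential genes and the rest of the genome can be illustrated with a two-sample t statistic. The text says only "t test", so the unequal-variance (Welch) form used here is an assumption of this sketch, as are the function name and the expression values. The P value is not computed because it needs the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    se2_a = variance(sample_a) / na
    se2_b = variance(sample_b) / nb
    t = (fmean(sample_a) - fmean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1)
    )
    return t, df


# Hypothetical log expression levels for two sets of genes.
essential = [9.1, 8.7, 9.4, 9.0]
other = [7.9, 8.2, 7.5, 8.0]
t_stat, dof = welch_t(essential, other)
print(t_stat, dof)
```

A positive statistic here means the first sample has the higher mean, which corresponds to the direction reported for the essential genes.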
This  result  is  also  consistent  with  the  observation  that  genes  with  mutant  phenotypes  from  the  3-Mbase  Adh  genomic  region  are  overrepresented  in  EST  libraries  (23).  High  levels  of  essential  gene  expression  may  in  part  reflect  widespread  expression  in  cells  throughout  the  animal,  and  the  relative  RNA  expression  level  may  serve  as  a  rough  predictor  of  essential  cellular  function.  We  also  examined  changes  in  gene  expression  during  the  fly  life  cycle  to  determine  what  fraction  of  the  entire  genome  is  differentially  expressed  between  developmental  stages.  Figure  2A  shows  the  expression  signal  intensities  of  transcripts  from  a  typical  50-kilobase  pair  (kbp)  region  of  the  Drosophila  genome  during  each  major  developmental  stage.  Stage-specific  variation  in  expression  is  observed  not  only  for  exon  probes,  as  expected,  but  also  for  intergenic  and  intronic  probes.  We  used  analysis  of  variance  (ANOVA)  (24)  to  systematically  identify  probes  as  differentially  expressed  at  a  false  discovery  rate  of  5%  (16).  As  expected,  the  majority  of  probes  detecting  differentially  expressed  sequences  are  also  expressed  above  background  noise  level  (89%  of  EPs  and  81%  of  NEPs)  (17)  (Table  1).  We  found  27,176  EPs  to  be  differentially  expressed,  corresponding  to  76%  of  annotated  genes,  and  even  more  when  we  applied  a  less  conservative  background  model  (fig.  S4).  The  fact  that  the 
0	SHORT  REPORT
0	High  resolution  microarray  comparative  genomic  hybridisation  analysis  using  spotted  oligonucleotides
1	B  Carvalho,  E  Ouwerkerk,  G  A  Meijer,  B  Ylstra
0	Background: Currently, comparative genomic hybridisation array (array CGH) is the method of choice for studying genome wide DNA copy number changes. To date, either amplified representations of bacterial artificial chromosomes (BACs)/phage artificial chromosomes (PACs) or cDNAs have been spotted as probes. The production of BAC/PAC and cDNA arrays is time consuming and expensive. Aim: To evaluate the use of spotted 60 mer oligonucleotides (oligos) for array CGH. Methods: The hybridisation of tumour cell lines with known chromosomal aberrations on to either BAC arrays or oligoarrays that are mapped to the human genome. Results: Oligo CGH was able to detect amplifications with high accuracy and greater spatial resolution than other currently used array CGH platforms. In addition, single copy number changes could be detected with a resolution comparable to conventional CGH. Conclusions: Oligos are easy to handle and flexible, because they can be designed for any part of the genome without the need for laborious amplification procedures. The full genome array, containing around 30 000 oligos of all genes in the human genome, will represent a big step forward in the analysis of chromosomal copy number changes. Finally, oligoarray CGH can easily be used for any organism with a fully sequenced genome.
0	Abbreviations:  BAC,  bacterial  artificial  chromosome;  CGH,  comparative  genomic  hybridisation;  CHORI,  Children's  Hospital  Oakland  Research  Institute;  oligo,  oligonucleotide;  PAC,  phage  artificial  chromosome;  PCR,  polymerase  chain  reaction
0	Array comparative genomic hybridisation (array CGH) has been used successfully for the detection of genomic imbalances in human and mouse tumours.1-6 As chromosomal representations, approximately 2500 bacterial artificial chromosome (BAC) and phage artificial chromosome (PAC) clones have been amplified and spotted for genome wide CGH arrays, yielding a resolution of 1-1.5 Mb,7 in addition to cDNAs,8 which encompass a maximum of 13 824 genes and yield an average resolution of 267 kb.9 Although spatial resolution using cDNAs is currently higher, the number of cDNAs is finite and their sensitivity is lower. This reduced sensitivity of cDNAs is partly the result of cross hybridisation. Oligonucleotides (oligos) can theoretically circumvent the problems encountered with cDNAs. In addition, oligo libraries are cheaper, easier to work with, and faster to produce than cDNAs or BAC/PAC clones, because no DNA isolation or PCR amplification steps are necessary. The in silico design can control for the hybridisation temperature and specificity, and there is no limit to the spatial resolution. Finally, oligos can be designed for any organism with a sequenced genome.
0	MATERIALS  AND  METHODS
0	for  each  clone.  On  the  oligoarrays,  each  experiment  was  performed  three  times  and  data  were  taken  from  one  representative  experiment.  Average  and  standard  deviations  of  log2  ratios  were  calculated  for  each  oligonucleotide  across  the  three  experiments.  A  moving  average  (window  of  eight  by  eight)  was  applied  to  plot  genome  wide  graphs.
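The per-oligonucleotide statistics and the smoothing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names and the input layout are hypothetical, and the "window of eight by eight" is interpreted here, as an assumption, as a plain sliding window of eight consecutive probes in genomic order.

```python
from math import log2
from statistics import mean, stdev


def log2_ratio_stats(replicates):
    """Mean and standard deviation of log2(test/reference) per oligo.

    `replicates` maps an oligo identifier to a list of
    (test_intensity, reference_intensity) pairs, one pair per experiment.
    """
    stats = {}
    for oligo, pairs in replicates.items():
        ratios = [log2(test / ref) for test, ref in pairs]
        stats[oligo] = (mean(ratios), stdev(ratios))
    return stats


def moving_average(values, window=8):
    """Sliding-window mean over values already sorted by genomic position.

    Assumption: a simple unweighted window of `window` consecutive probes.
    Returns len(values) - window + 1 smoothed values.
    """
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    # Hypothetical intensities for one oligo measured in three experiments.
    example = {"oligo_1": [(2.0, 1.0), (4.0, 1.0), (8.0, 1.0)]}
    print(log2_ratio_stats(example))
    print(moving_average([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

A sliding window shortens the profile by `window - 1` points; padding the ends is a separate design choice that the text does not specify.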
0	RESULTS  AND  DISCUSSION
0	We hybridised a 19 K human 60 mer oligoarray with breast tumour cell line (BT474) DNA, labelled with Cy3, and normal genomic kidney (female) DNA, labelled with Cy5. Ratios for the non-flagged oligos (35%) were ordered by their position on the chromosome (June 2002 freeze; http://genome.ucsc.edu/). We compared the oligo CGH profile with the BAC array CGH profile (fig 1). Both array profiles showed the same pattern--for example, on the short arm of chromosome 1 neither profile showed a change in DNA copy number, whereas on the q arm two amplified areas are present in both profiles. No aberrations can be seen on chromosome 2. The
0	standard deviation of the log2 ratio of the individual probes was 0.21 for the BAC array and 0.45 for the oligoarray. On chromosome 3, a loss on the short arm was evident on the oligoarray, and was also seen with the BAC array (fig 1). Figure 2 shows two regions of amplification on the q arm of chromosome 17: one narrow peak over the chromosomal region containing c-Erb-B2/neu (Her2)10 and a second amplicon distal to c-Erb-B2. The BAC array has three clones over c-Erb-B2, and the best possible estimate of the amplicon size, judged from its start and end according to the April 2003 freeze, is therefore 2.5 Mb. With the oligo approach, 38 non-flagged oligos show amplified ratios in this region and the size of the amplicon is 2.4 Mb according to the April 2003 freeze. Thus, the actual resolution in the region is 63 kb on average. The log2 ratios for the three replicate BACs containing the c-Erb-B2 gene are 2.91, 3.06, and 2.53, of a similar order of magnitude to those obtained in three independent experiments with the oligoarray for the single oligo corresponding to the c-Erb-B2 gene: 3.4, 3.4, and 3.9.
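The resolution figure quoted above follows from dividing the amplicon size by the number of informative oligos. The short Python sketch below reproduces that arithmetic and averages the replicate log2 ratios reported in the text; the function name is hypothetical and the code only illustrates the calculation.

```python
from statistics import mean


def average_probe_spacing_kb(amplicon_size_mb, n_probes):
    """Average spacing between probes, in kb, across an amplicon."""
    return amplicon_size_mb * 1000.0 / n_probes


if __name__ == "__main__":
    # 2.4 Mb amplicon covered by 38 non-flagged oligos (values from the text).
    print(f"{average_probe_spacing_kb(2.4, 38):.0f} kb")  # 63 kb

    # Replicate log2 ratios over the c-Erb-B2 gene (values from the text).
    bac_ratios = [2.91, 3.06, 2.53]
    oligo_ratios = [3.4, 3.4, 3.9]
    print(f"BAC mean log2 ratio:   {mean(bac_ratios):.2f}")    # about 2.83
    print(f"oligo mean log2 ratio: {mean(oligo_ratios):.2f}")  # about 3.57
```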
0	Take  home  message
0	We describe pilot experiments that serve as a proof of principle that oligonucleotides are a feasible platform for array comparative genomic hybridisation (CGH).
0	Oligoarray CGH can be rapidly, cost effectively, and easily used to measure chromosomal copy number changes for any organism with a fully sequenced genome.
0	like to thank Professor D G Albertson, Professor D Pinkel, and laboratory staff (UCSF Comprehensive Cancer Centre) for their support in performing the hybridisation procedures and for the GM0143 DNA sample. BC is holder of fellowship SFRH/BPD/5599/2001 and is working in the frame of the Grant Project POCTI/CBO/41179/2001. This work is furthermore supported by the Dutch Cancer Society (VU 2002-2618). We thank the mapping core and map finishing groups of the Wellcome Trust Sanger Institute for initial BAC clone supply and verification.
0	Comparative  genomic  hybridization  using  oligonucleotide  microarrays  and  total  genomic  DNA
1	Michael  T.  Barrett*,  Alicia  Scheffer*,  Amir  Ben-Dor*,  Nick  Sampas*,  Doron  Lipson*§,  Robert  Kincaid*,  Peter  Tsang*,  Bo  Curry*,  Kristin  Baird¶,  Paul  S.  Meltzer¶,  Zohar  Yakhini*,  Laurakay  Bruhn*,  and  Stephen  Laderman*
0	Array-based comparative genomic hybridization (CGH) measures copy-number variations at multiple loci simultaneously, providing an important tool for studying cancer and developmental disorders and for developing diagnostic and therapeutic targets. Arrays for CGH based on PCR products representing assemblies of BAC or cDNA clones typically require maintenance, propagation, replication, and verification of large clone sets. Furthermore, it is difficult to control the specificity of the hybridization to the complex sequences that are present in each feature of such arrays. To develop a more robust and flexible platform, we created probe-design methods and assay protocols that make oligonucleotide microarrays synthesized in situ by inkjet technology compatible with array-based comparative genomic hybridization applications employing samples of total genomic DNA. Hybridization of a series of cell lines with variable numbers of X chromosomes to arrays designed for CGH measurements gave median ratios for X-chromosome probes within 6% of the theoretical values (0.5 for XY/XX, 1.0 for XX/XX, 1.4 for XXX/XX, 2.1 for XXXX/XX, and 2.6 for XXXXX/XX). Furthermore, these arrays detected and mapped regions of single-copy losses, homozygous deletions, and amplicons of various sizes in different model systems, including diploid cells with a chromosomal breakpoint that has been mapped and sequenced to a precise nucleotide and tumor cell lines with highly variable regions of gains and losses. Our results demonstrate that oligonucleotide arrays designed for CGH provide a robust and precise platform for detecting chromosomal alterations throughout a genome with high sensitivity even when using full-complexity genomic samples.
0	cancer  DNA  microarrays  genome
0	Array-based comparative genomic hybridization (aCGH) allows the identification of chromosomal regions of gains and losses in cancers and genetic diseases (1-5). Oligonucleotide-array probes can be designed in silico for any sequenced region of a genome, thus allowing genome-wide and higher-density region-specific coverage, in principle. Application-specific designs, assays, and analysis methods allow routine use of oligonucleotide arrays for gene-expression studies and characterization of DNA polymorphisms and mutations (6-11). Typically, these applications use labeled targets of markedly reduced complexity relative to a complete genome (for example, expressed sequences in transcriptional profiling and PCR amplicons for polymorphic allele analyses). The usefulness of oligonucleotide arrays for aCGH has also been examined by using targets of reduced complexity (12-16). However, the broadest use of aCGH, including both a simplified preparation of targets and hybridization of samples to any array design of interest, requires preserving the greatest possible complexity of targets derived from whole-genome samples. Therefore, we investigated and developed probe-design criteria, assay conditions, and analysis methods that enable 60-mer oligonucleotide arrays to be used for CGH measurements even when using total genomic DNA. We used two array designs for these studies. The first design, consisting of 60-mer oligonucleotide probes designed and validated for expression profiling of 17,000 transcripts (expression array), was used to develop initial assay conditions for aCGH. The second design consisted of custom microarrays containing a higher density of probes that represent unique genomic sequences for selected chromosomes (CGH array). The content of the CGH array was biased toward gene regions, but it also included noncoding regions for chromosome-wide coverage. These arrays were used to explore performance improvements that could be made possible by developing oligonucleotide probe-selection methods specifically for CGH.
0	Materials and Methods
0	Genomic DNA. We obtained genomic DNA from normal male 46,XY and normal female 46,XX from Promega. The following cell lines are part of the National Institute of General Medical Sciences Human Genetic Cell Repository and were obtained from the Coriell Institute for Medical Research (Camden, NJ): 47,XXX (repository no. GM04626), 48,XXXX (repository no. GM01415D), 49,XXXXX (repository no. GM05009C), and the 18q deletion syndrome cell line (repository no. GM50122). The colon (COLO 320DM, HT 29, and HCT116) and breast (MDA-MB-231 and MDA-MB-453) carcinoma cell lines were obtained from the American Type Culture Collection. Each cell line was grown under the conditions recommended by the supplier. Genomic DNA was prepared from each cell line by using the DNeasy tissue kit (Qiagen, Germantown, MD). Tumor biopsies were collected from 1980 to 2003 and accessed by means of the National Cooperative Human Tissue Network (Charlottesville, VA). Total cellular DNA was isolated from fresh-frozen tumor specimens by using TRIzol reagent (Invitrogen) extraction techniques and further purified by phenol-chloroform extraction.
0	Freely  available  online  through  the  PNAS  open  access  option.  Abbreviations:  CGH,  comparative  genomic  hybridization;  aCGH,  array-based  CGH.
0	and  A.S.  contributed  equally  to  this  work.
0	Technologies,  the  employer  of  M.T.B.,  A.S.,  A.B.-D.,  N.S.,  D.L.,  R.K.,  P.T.,  B.C.,  Z.Y.,  L.B.,  and  S.L.,  manufactures  DNA  microarrays.
0	§Present  address:  Technion  Israel  Institute  of  Technology,  Technion  City,  Haifa  32000,  Israel.
0	December  21,  2004
0	aCGH. For each CGH hybridization, we digested 20
0	Image and Data Analysis. Microarray images were analyzed by using FEATURE EXTRACTION software (version 6.1.1, Agilent Technologies). Default settings were used, except that probes from autosomal chromosomes were used for dye normalization by using the locally weighted linear-regression curve fit option. Also, we used signals from negative control featu
0	in plots of raw data are obscured by even a small percentage of outlier probes. Therefore, we applied a 50-kb moving average, as calculated below, to plots presented in Figs. 4-6. The log2 ratio measured for all m probes of the chromosome was smoothed by using the following weighted moving average:
\[ \tilde{y}_j = \frac{\sum_{i=1}^{m} w(x_i - x_j)\, y_i}{\sum_{i=1}^{m} w(x_i - x_j)} \qquad [1] \]
where y_i is the measured log2 ratio at x_i. The weights are given by the following triangular function:
\[ w(x) = \begin{cases} 0 & \text{for } x < -W \\ (W + x)/W & \text{for } -W \le x < 0 \\ (W - x)/W & \text{for } 0 \le x \le W \\ 0 & \text{for } x > W \end{cases} \qquad [2] \]
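The weighted moving average with triangular weights described in this section can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' code: the function names are hypothetical, and W is treated as the half-width of the triangle (how W relates to the quoted 50-kb window is not stated in the text).

```python
def triangular_weight(d, W):
    """Triangular kernel: 1 at d = 0, falling linearly to 0 at |d| >= W."""
    return max(0.0, 1.0 - abs(d) / W)


def smooth_log2_ratios(xs, ys, W):
    """Weighted moving average of log2 ratios ys measured at positions xs.

    For each probe position x_j, every probe i contributes y_i with weight
    w(x_i - x_j); the weighted sum is normalized by the sum of the weights.
    """
    smoothed = []
    for xj in xs:
        weights = [triangular_weight(xi - xj, W) for xi in xs]
        total = sum(weights)  # at least 1, because the probe itself has weight 1
        smoothed.append(sum(w * y for w, y in zip(weights, ys)) / total)
    return smoothed


if __name__ == "__main__":
    # Hypothetical probe positions (kb) and log2 ratios with one outlier.
    positions = [0, 10, 20, 30, 40]
    ratios = [0.0, 0.1, 2.0, 0.1, 0.0]
    print(smooth_log2_ratios(positions, ratios, W=25))
```

This direct form costs O(m²) per chromosome; for dense arrays a sliding index over sorted positions would avoid evaluating weights that are known to be zero.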
0	Requirement  of  Circadian  Genes  for  Cocaine  Sensitization  in  Drosophila
1	Rozi  Andretic,  Sarah  Chaney,  Jay  Hirsh*
0	The  circadian  clock  consists  of  a  feedback  loop  in  which  clock  genes  are  rhythmically  expressed,  giving  rise  to  cycling  levels  of  RNA  and  proteins.  Four  of  the  five  circadian  genes  identified  to  date  influence  responsiveness  to  freebase  cocaine  in  the  fruit  fly,  Drosophila  melanogaster.  Sensitization  to  repeated  cocaine  exposures,  a  phenomenon  also  seen  in  humans  and  animal  models  and  associated  with  enhanced  drug  craving,  is  eliminated  in  flies  mutant  for  period,  clock,  cycle,  and  doubletime,  but  not  in  flies  lacking  the  gene  timeless.  Flies  that  do  not  sensitize  owing  to  lack  of  these  genes  do  not  show  the  induction  of  tyrosine  decarboxylase  normally  seen  after  cocaine  exposure.  These  findings  indicate  unexpected  roles  for  these  genes  in  regulating  cocaine  sensitization  and  indicate  that  they  function  as  regulators  of  tyrosine  decarboxylase.  In  response  to  exposure  to  volatilized  freebase  cocaine,  Drosophila  perform  a  set  of  reflexive  behaviors  similar  to  those  observed  in  vertebrate  animals,  including  grooming,  proboscis  extension,  and  unusual  circling  locomotor  behaviors  (1-3).  Additionally,  flies  can  show  sensitization  after  even  a  single  exposure  to  cocaine  provided  that  the  doses  are  separated  by  an  interval  of  6  to  24  hours  (1).  Sensitization,  a  process  in  which  repeated  exposure  to  low  doses  of  a  drug  leads  to  increased  severity  of  responses,  has  been  linked  to  the  addictive  process  in  humans  (4-6)  and  is  potentially  involved  in  the  enhanced  craving  and  psychoses  that  occur  after  repeated  psychostimulant  administration.  We  have  shown  circadian  variation  in  the  agonist  responsiveness  of  Drosophila  nerve  cord  dopamine  receptors  functionally  coupled  to  locomotor  output  (7).  
This variation is dependent on the normal functioning of the Drosophila period (per) gene, the founding member of the circadian gene family (8, 9). Because changes in postsynaptic dopamine receptor responsiveness are also seen during cocaine sensitization in vertebrates (10-12), we examined flies mutant in circadian functions for alterations in responsiveness to cocaine. Wild-type (WT) flies or flies containing a per null mutation, per^o, were exposed to 75 µg of cocaine four times over 2 days, and the fraction of flies showing severe responses was quantified after each exposure (Fig. 1A). Whereas WT flies showed sensitization after the initial cocaine exposure, per^o flies showed no sensitization either to a normal or increased dose even after repeated exposures. As with WT flies, per^o flies showed a dose-dependent increase in the severity of responses, and the normal cocaine-induced types of behaviors were observed (13). per alleles that either shorten or lengthen the circadian periods show distinct patterns of cocaine responsiveness. The short-period mutants per^S and per^T (14, 15) both showed increased responsiveness to the initial cocaine exposure and weak sensitization to a second 75-µg exposure (Fig. 2A), with only the sensitization of per^S showing statistical significance. Sensitization is not observed in these lines when tested with other cocaine doses (16). The long-period mutant per^L1 (17) showed a normal initial cocaine response but no sensitization to a subsequent exposure. Similarly, other circadian genes showed effects on cocaine sensitization: Both clock and cycle mutants failed to sensitize when given two doses of cocaine (Fig. 2B). Because these mutants showed an increased sensitivity to the first exposure (16), cocaine doses were decreased to 50 µg. The inability of clock and cycle to sensitize is markedly similar to the behavior of per^o mutants. The gene product of timeless (tim), TIM, is required for nuclear translocation of PER and its stability in the cytoplasm; in tim^o mutants, cytoplasmic PER is degraded and per mRNA levels are constant (18-20). Cocaine responses in tim^o mutant flies were normal (Fig. 2B), both in initial responsiveness and in showing a robust sensitized response to the second exposure. Recently, a doubletime (dbt) protein with homology to human casein kinase I was identified and shown to be required for phosphorylation of PER (21). We tested cocaine responses in two viable dbt mutants, dbt^S and dbt^L, which shorten and lengthen the circadian locomotor period, respectively (22). dbt mutants required a substantially higher cocaine dose to show behaviors normally observed at 75 µg (Fig. 2B), but even at these higher doses dbt flies did not show significant sensitization. If the role of dbt in cocaine responsiveness is analogous to its role in circadian behavior, then PER phosphorylation status may be important in regulating both initial cocaine responsiveness and sensitization. Modulation of dopamine receptor responsiveness is important in both the sensitization to cocaine in vertebrate animals and in the circadian modulation of locomotion in Drosophila (7, 23). We tested whether cocaine-sensitized flies would show an increase in the responsiveness of the nerve cord dopamine D2-like receptors by using a preparation of behaviorally active decapitated flies that allows direct addition of drugs to the nerve cord (24). After decapitation, cocaine-sensitized WT flies locomoted significantly more than sham-treated controls in response to the dopamine D2-like agonist quinpirole (Fig. 1B). However, there was no increase in quinpirole responsiveness of per^o flies that did not sensitize to repeated cocaine exposures. Thus, similar to the inability of the per^o mutant to modulate receptor responsiveness as a function of the time of day (7), per^o is unable to modulate dopamine receptor responsiveness after cocaine exposure. The observation that cocaine sensitization is associated with increased responsiveness of postsynaptic dopamine receptors shows additional similarities between this system and that in higher vertebrates, where a similar relation holds (12, 23). In Drosophila, sensitization requires the trace amine tyramine because the mutant inactive, which is defective in sensitization, shows both reduced tyramine and reduced levels of the enzyme involved in t
0	Fig. 2. Circadian mutants show altered cocaine responses. (A) per mutants. Flies carrying per mutations, as indicated, were exposed twice to 75 µg of volatilized cocaine 6 hours apart. The number of flies assayed, for first and second exposures, is as follows: WT CantonS, n = 105, 95; per^o, n = 81, 60; per^S, n = 114, 112; per^T, n = 88, 52; per^L1, n = 86, 83. (B) Other circadian mutants. As in (A), except that cocaine doses were adjusted to compensate for differences in cocaine responsiveness to the initial dose: WT CantonS exposed to 75 µg of cocaine, n = 105, 95; tim^o, n = 66, 63. Circadian mutants exposed to 50 µg of cocaine: clock, n = 187, 182; and cycle, n = 79, 79. dbt mutants were exposed to 100 µg of cocaine: dbt^S, n = 59, 55; dbt^L, n = 52, 51. In both (A) and (B), significant differences in responses to the first versus second exposures are indicated (*P < 0.05, **P < 0.01; χ² test).
0	Genome-wide Transcriptional Orchestration of Circadian Rhythms in Drosophila*
1	Hiroki R. Ueda§¶, Akira Matsumoto¶**, Miho Kawamura§, Masamitsu Iino, Teiichi Tanimura**, and Seiichi Hashimoto§
0	Circadian rhythms govern the behavior, physiology, and metabolism of living organisms. Recent studies have revealed the role of several genes in the clock mechanism both in Drosophila and in mammals. To study how gene expression is globally regulated by the clock mechanism, we used a high density oligonucleotide probe array (GeneChip) to profile gene expression patterns in Drosophila under light-dark and constant dark conditions. We found 712 genes showing a daily fluctuation in mRNA levels under light-dark conditions, and among these the expression of 115 genes was still cycling in constant darkness, i.e. under free-running conditions. Unexpectedly the expression of a large number of genes cycled exclusively under constant darkness. We found that cycling in most of these genes was lost in the arrhythmic Clock (Clk) mutant under light-dark conditions. Expression of periodically regulated genes is coordinated locally on chromosomes where small clusters of genes are regulated jointly. Our findings reveal that many genes involved in diverse functions are under circadian control and reveal the complexity of circadian gene expression in Drosophila.
0	The use of Drosophila has been at the forefront of studies of the molecular and genetic basis of circadian rhythms (1). A number of clock genes have been identified in Drosophila, and interlocked per-tim and Clk feedback loops are now thought to underlie the central molecular machinery of circadian rhythms (2, 3). However, we still do not know how expression of the whole genome is orchestrated by the circadian mechanism, nor have we identified all the genes involved. One comprehensive way to identify all the rhythmically expressed genes is to use microarrays. A number of genes regulated in a circadian manner have been identified in Arabidopsis and mammalian cultured cells (4, 5). Since information about all the possible transcription units is available in Drosophila (6, 7), we can extensively analyze the data for all the genes in relation to their function. Functions of identified genes can be analyzed using the various genetic tools and databases (9-11) available in Drosophila.
0	EXPERIMENTAL PROCEDURES
0	Microarray  Analysis  and  Organization  of  Circadian  Gene  Expression  in  Drosophila
0	Summary
0	We have used high-density oligonucleotide arrays to study global circadian gene expression in Drosophila melanogaster. Coupled with an analysis of clock mutant (Clk) flies, a cell line designed to identify direct targets of the CLOCK (CLK) transcription factor, and differential display, we uncovered several striking features of circadian gene networks. These include the identification of 134 cycling genes, which contribute to a wide range of diverse processes. Many of these clock or clock-regulated genes are located in gene clusters, which appear subject to transcriptional coregulation. All oscillating gene expression is under clk control, indicating that Drosophila has no clk-independent circadian systems. An even larger number of genes is affected in Clk flies, suggesting that clk affects other genetic networks. As we identified a small number of direct target genes, the data suggest that most of the circadian gene network is indirectly regulated by clk.
0	Introduction
0	Cycling Circadian Genes
0	To isolate mRNA for analysis, we entrained wild-type Canton-S flies for 3 days in a standard 12:12 hr light:dark (LD) cycle and then collected flies every 4 hr during the first full day in constant darkness (DD). This strategy was chosen to avoid light-regulated genes not under circadian control as well as the damping (i.e., a decreased cycling amplitude of circadian gene expression) that occurs during extended incubation in constant darkness (see Discussion). Fly head mRNA was harvested from the six time points, biotinylated cRNA was prepared, and Affymetrix Drosophila GeneChips were used to probe the labeled cRNA. The final data set includes four replicate chips each for CT0, CT4, CT8, and CT12, five chips for CT16, and three chips for CT20. The GeneChip data were analyzed using a model-based expression approach with dCHIP software (Li and Hung Wong, 2001a, 2001b; for complete dataset, see Supplemental Table S3). To identify a set of circadian genes with confidence, we put the data through four sequential analyses. First, signals were averaged over the 6 time points, and those that did not have an average signal intensity greater than 20 were excluded. This step removed genes with very weak or dubious expression levels (~40% of the transcripts). Second, we required the difference between the highest
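The first filtering step described above (averaging each transcript's signal over the six time points and discarding those with a mean of 20 or less) can be sketched in Python. This is a hedged illustration only: the function names and data layout are hypothetical, and averaging the replicate chips within each time point before averaging across time points is an assumption, since the text does not say how replicates were combined.

```python
from statistics import mean

TIME_POINTS = ["CT0", "CT4", "CT8", "CT12", "CT16", "CT20"]


def mean_signal(signals_by_timepoint):
    """Average signal over the six time points.

    `signals_by_timepoint` maps each time point to the list of replicate
    chip signals for one transcript. Replicates are averaged per time point
    first (an assumption), so unequal replicate counts do not bias the mean.
    """
    return mean(mean(signals_by_timepoint[tp]) for tp in TIME_POINTS)


def passes_intensity_filter(signals_by_timepoint, threshold=20.0):
    """Keep a transcript only if its average signal exceeds the threshold."""
    return mean_signal(signals_by_timepoint) > threshold


if __name__ == "__main__":
    # Hypothetical transcripts: one clearly expressed, one at the cutoff.
    strong = {tp: [25.0, 35.0] for tp in TIME_POINTS}
    weak = {tp: [20.0] for tp in TIME_POINTS}
    print(passes_intensity_filter(strong), passes_intensity_filter(weak))
```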
0	Cell  568
0	Circadian  Genes,  Microarrays,  and  Drosophila  569
0	Table 1. Top 10 Highest Fold Cycling Genes
Flybase ID    Function                       Fold Cycling
ldlr          scavenger receptor             40.8
CG11854       ligand binding or carrier      5.7
CG13856       unknown                        5.6
per           PAS domain clock protein       5.3
vri           par domain clock protein       4.8
tim1          clock protein                  4.6
CG5798        ubiquitin thiolesterase
CG2069        unknown
clk           bHLH PAS clock protein
CG5156        unknown
0	Global  Survey  of  Chromatin  Accessibility  Using  DNA  Microarrays
0	Program in Molecular Biophysics, Division of Cell and Molecular Biology, Southwestern Graduate School of Biomedical Science, Department of Molecular Biology, 3Hamon Center for Therapeutic Oncology Research, 4Center for Biomedical Inventions, 5Department of Internal Medicine, 6Eugene McDermott Center for Human Growth and Development, and 7Department of Pharmacology, UT Southwestern Medical Center, Dallas, Texas 75390, USA; 8Department of Experimental and Clinical Radiobiology, Center of Oncology, Gliwice, 44-100, Poland
0	In  recent  years,  the  study  of  transcriptional  regulation  by  epigenetic  mechanisms  has  enjoyed  a  renaissance  because  of  advances  in  DNA  microarray  technology.  These  developments  include  the  creation  of  high-throughput  CpG  methylation  resequencing  microarrays  (Hatada  et  al.  2002)  and  advances  in  using  DNA  microarrays  to  probe  Chromatin  Immuno-Precipitation  (ChIP)  assays  (Ren  et  al.  2000)  on  a  genomic  scale.  Even  with  all  these  advances,  perhaps  one  of  the  most  important  epigenetic  regulation  systems,  chromatin  architecture,  has  been  overlooked.  By  mediating  the  availability  of  specific  DNA  sequences  to  regulatory  proteins,  chromatin  accessibility  in  the  form  of  chromatin  condensation  or  relaxation  is  thought  to  be  a  major  regulator  of  transcription  (Orphanides  and  Reinberg  2002).  Current  methods  of  studying  chromatin  architecture  either  measure  the  accessibility  of  the  genome  as  a  whole  (Banerjee  and  Hulten  1994)  or  of  a  few  sub-kilobase  regions  (Reid  et  al.  2000),  but  no  technique  is  currently  available  to  easily  and  simultaneously  measure  the  chromatin  accessibility  of  the  whole  genome  at  kilobase  resolution  (Urnov  2003;  Crawford  et  al.  2004).  In  this  paper,  we  describe  a  new  method  for  using  DNA  microarrays  to  study  the  global  chromatin  accessibility  state  as  a  measure  of  nuclease  accessibility  in  relation  to  expression  at  the  resolution  of  single  genes.  The  primary  method  we  chose  for  isolating  DNA  by  its  chromatin  accessibility  state  takes  advantage  of  the  solubility  differences  of  histone  H1-depleted  mononucleosomes  and  histone  H1-containing  mono-  and  oligonucleosomes  in  the  presence  or  absence  of  MgCl2  and  KCl  to  recover  different  chromatin  fractions  based  on  their  activity  states.  This  method's
0	utility was demonstrated by Rose and Garrard (1984) to study the chromatin packing of immunoglobulin light chain genes in relation to their transcription during B-cell development. A second method was optimized to use the preferential sensitivity of transcriptionally active chromatin to DNase I cleavage (Weintraub and Groudine 1976) to recover the relatively resistant regions as the "condensed" fraction using fragment length selection. Both of these methods are currently used in high-resolution, low-throughput chromatin accessibility studies. To make these techniques both high resolution and high throughput, we optimized microarray-based comparative genomic hybridization (CGH) methods using commercially available probe sets or microarrays to probe the chromatin accessibility state en masse (Pollack et al. 1999; Weil et al. 2002). This "Chromatin Array" allows us to overcome the limited resolution and throughput problems of previous methods (Banerjee and Hulten 1994; Reid et al. 2000) by using the multiplex nature of microarray experiments while retaining the high resolution of low-throughput chromatin accessibility measurement techniques. Because this new type of microarray experiment has a novel output, we developed methods to interpret the chromatin state from the relationship of the condensed fraction's hybridization intensity as compared with the intensity of total genomic DNA. These data can then be related to the absolute RNA expression level measured on an identical microarray. To demonstrate the utility of the Chromatin Array method, we chose the cell line MCF7 because it has been extensively studied by other groups (Pollack et al. 1999; Ross et al. 2000).
0	Genome  Research
0	Global  Survey  of  Chromatin  Accessibility
0	We  show  that  the  chromatin  solubility  assay  recovered  fractions  based  on  the  condensation  state  of  the  chromatin,  and  that  the  microarray-based  measurements  could  accurately  measure  the  accessibility.  The  reproducibility  of  the  condensation  state  measurements  was  independently  verified  using  two  different  methods  to  extract  the  condensed  chromatin  for  microarray-based  measurements.  To  support  the  data  analysis  and  interpretation,  we  used  the  Stanford  Microarray  Database  (SMD)  to  validate  our  expression  findings  (Sherlock  et  al.  2001).  Although  the  condensation  state  and  expression  measurement  of  a  single  gene  may  be  of  great  value  in  transcriptional  discovery,  the  biological  relevance  of  the  data  on  a  global  scale  is  possibly  even  more  valuable.  By  relating  function  as  defined  by  the  Gene  Ontology  (GO)  database  (Ashburner  et  al.  2000)  to  the  condensation  state  of  large  groups  of  genes,  specific  accessibility  signatures  of  functionally  related  genes  can  be  identified.  These  signatures  are  based  on  the  different  functional  gene  groupings  of  a  particular  accessibility  state,  and  the  differences  in  functional  group  assignments  observed  across  the  different  accessibility  states  (Jimenez-Sanchez  et  al.  2001).  These  signatures  can  then  be  used  to  uniquely  define  a  cell  line.  By  comparing  the  signatures  of  multiple  cell  lines,  it  should  be  possible  to  identify  the  disease-  and  tissue-specific  components  of  the  signatures.  Analysis  of  the  accessibility  data  in  light  of  both  the  condensation  state  of  single  genes  and  its  global  relationship  to  gene  function  makes  the  Chromatin  Array  method  a  novel  and  important  addition  to  the  study  of  chromatin  structure-function  relationships.
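The comparison of the condensed fraction's hybridization intensity against total genomic DNA, mentioned in the introduction above, can be illustrated with a per-gene log ratio. This is only a minimal sketch under stated assumptions: the log2 transform, the cutoff of 1.0, the function name `accessibility_call` and the intensities are all hypothetical and are not taken from the paper's analysis.

```python
import math


def accessibility_call(condensed, total, threshold=1.0):
    """Classify one gene from condensed-fraction vs total-genomic intensity.

    `threshold` is an illustrative log2 cutoff, not a value from the paper.
    """
    if condensed <= 0 or total <= 0:
        raise ValueError("intensities must be positive")
    log_ratio = math.log2(condensed / total)
    if log_ratio >= threshold:
        state = "condensed"      # over-represented in the condensed fraction
    elif log_ratio <= -threshold:
        state = "accessible"     # depleted from the condensed fraction
    else:
        state = "intermediate"
    return log_ratio, state


# Hypothetical (condensed, total) intensities for three made-up genes.
genes = {
    "geneA": (4000.0, 1000.0),
    "geneB": (500.0, 2000.0),
    "geneC": (1100.0, 1000.0),
}
calls = {name: accessibility_call(c, t) for name, (c, t) in genes.items()}
for name, (log_ratio, state) in calls.items():
    print(f"{name}: log2 ratio = {log_ratio:+.2f} -> {state}")
```

A symmetric log scale is used here so that enrichment and depletion of the same magnitude are treated alike; any real cutoff would need to be calibrated against the array data.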
0	RESULTS  AND  DISCUSSION
0	The  Chromatin  Array  Accurately  Measures  the  Accessibility  State  of  the  DNA  Recovered  by  the  Chromatin  Solubility  Assay
0	The  chromatin  solubility  assay  first  uses  micrococcal  nuclease  to  generate  mono-  and  oligonucleosomes  that  are  separated  into  three  fractions  designated  S1,  S2,  and  P.  The  transcriptionally  active  DNA  is  found  in  the  S1  and  P  fractions,  which  in  MCF7  comprise  68%  of  the  total  DNA.  The  S1  fraction  is  depleted  in  histone  H1  and  enriched  in  the  high  mobility  group  (HMG)  proteins  and  heterogeneous  ribonucleoprotein  particles  (HnRNPs),  both  of  which  are  known  to  be  associated  with  actively  transcribed  chromatin  (Huang  et  al.  1986).  Likewise,  the  P  fraction  is  highly  enriched  in  nonhistone  proteins,  and  with  further  digestion,  it  can  be  partially  converted  to  the  S1  fraction  (Rose  and  Garrard  1984;  Huang  et  al.  1986).  The  S2  fraction  represents  32%  of  the  total  DNA  and  contains  nucleosomes  stoichiometrically  associated  with  histone  H1  and  highly  deficient  in  nonhistone  proteins  (Rose  and  Garrard  1984).  This  S2  fraction  operationally  represents  the  most  condensed  chromatin  fraction  as  indicated  by  previous  studies  that  have  demonstrated  that  his-
0	The  number  of  genes  (19,437  possible)  in  each  group  that  pass  all  data  processing  filters  is  shown.  Reproducibility  refers  to  the  percentage  of  genes  that  yield  similar  results  in  an  independent  replicate  experiment  of  chromatin  solubility  fractionation  on  a  different  array  platform.  Concordance  refers  to  the  percentage  of  genes  between  the  merged  fragment  length  selection  data  and  the  chromatin  solubility  da
0	Predictive  ability  of  DNA  microarrays  for  cancer  outcomes  and  correlates:  an  empirical  assessment
1	Evangelia  E  Ntzani,  John  P  A  Ioannidis
0	DNA  microarray  analysis  is  a  highly  promising  technique  with  broad  applications.  Simultaneous  characterisation  of  the  expression  pattern  of  thousands  of  genes  could  allow  better  understanding  of  the  molecular  properties  of  healthy  and  diseased  tissue.1,2  Such  information  might  lead  to  more  accurate  diagnosis  and  individual  prediction  of  clinical  outcomes.3  Oncology  has  been  one  of  the  most  promising  specialties  for  this  technique  to  date.4  By  use  of  DNA  microarrays,  investigators  have  tried  to  predict  the  overall  clinicopathological  behaviour  of  diverse  malignant  disorders.  Although  this  information  could  revolutionise  cancer  prognosis  and  therapy,  there  is  a  need  for  close  scrutiny  of  the  clinical  performance  of  the  new  method.  We  undertook  a  systematic  assessment  of  molecular  profiling  studies  that  used  DNA  microarray  analysis  to  generate  predictive  models  for  clinical  cancer  outcomes.  We  also  recorded  studies  that  addressed  the  relation  of  molecular  subtypes  with  other  clinicopathological  features  of  malignant  diseases.  We  investigated  the  strength  of  the  current  evidence  for  the  predictive  performance  of  DNA  microarray  analyses  in  oncology,  whether  this  predictive  information  is  independent  of  known  traditional  predictors  of  cancer  outcomes,  and  whether  there  are  features  that  influence  the  chances  that  a  DNA  microarray  study  will  find  significant  associations  with  clinical  outcomes  and  correlates  thereof.
0	Study  eligibility  and  search  strategy
0	We  selected  original  studies  in  which:  cDNA  or  oligonucleotide  microarray  analyses  were  done  for  functional  gene  expression  of  at  least  500  genes;  samples  from  at  least  ten  patients  with  cancer  were  included;  and  an  attempt  was  made  to  classify  cancers  into  subtypes  for  prospective  prediction  of  a  major  clinical  outcome  or  to  assess  correlations  with  any  other  clinicopathological  variables.  Major  clinical  outcomes  were  death,  metastasis,  recurrence,  or  clinical  response  to  therapy.  Studies  were  included  whether  or  not  they  succeeded  in  subtyping.  We  excluded  studies  that  focused  on  structural  gene  alterations  and  those  that  used  only  pooled  samples,  cancer  cell  lines,  or  xenografts.  When  various  samples  were  used,  we  focused  on  individual  patients'  samples.  We  also  excluded  studies  that  contrasted  normal  (or  premalignant)  and  malignant  tissue  samples  without  subtyping  tumour  samples;  studies  of  differential  gene  expression  among  cancer  tissue  samples  from  different  organs;  studies  aiming  to  separate  known  distinct  entities  (eg,  myeloid  vs  lymphocytic  leukaemia);  and  studies  focusing  a  priori  on  a  specific  gene.  We  used  the  cut-off  of  500  genes  to  exclude  studies  more  focused  on  identifying  the  role  of  a  limited  number  of  preselected  genes.  Some  early  microarrays  used  slightly  over  500  probes.  We  searched  MEDLINE  limited  to  human  studies  and  using  the  terms  "microarr*",  "gene  expression  profiling",
0	For  personal  use.  Only  reproduce  with  permission  from  The  Lancet  publishing  Group.
0	We  plotted  on  receiver  operating  characteristic  (ROC)  spaces  the  sensitivity  and  specificity  of  molecular  subtypes  for  major  clinical  outcomes.  Sensitivity  and  specificity  estimates  were  calculated  in  a  standard  way  from  information  presented  in  the  reports  and  supplementary  files  of  eligible  studies.  The  major  outcome  definitions  followed  the  main  definition  of  the  primary  investigators.  Whenever  there  were  more  than  two  resulting  subtypes,  the  subtype  with  worse  prognosis  was  compared  against  all  others  combined.  Continuous  predictive  scores  were  split  into  two  groups,  as  done  by  the  primary  investigators.  Separate  plots  were  drawn  for  independent  validations,  cross-validations,  and  unsupervised  classifications.  When  different  cross-validations  (complete  and  incomplete)  were  reported,  we  captured  the  predictive  accuracy  of  all  of  them  and  discussed  any  d
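The "standard way" of deriving sensitivity and specificity for one molecular subtype can be sketched from a 2x2 table of subtype against outcome. This is a minimal illustration, assuming the poor-prognosis subtype is treated as the positive test result; the function names and patient counts are hypothetical and do not come from any study in the review.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from 2x2 counts.

    tp: patients with the outcome who were in the poor-prognosis subtype
    fn: patients with the outcome who were in the other subtype(s)
    tn: patients without the outcome who were in the other subtype(s)
    fp: patients without the outcome who were in the poor-prognosis subtype
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one patient with and one without the outcome")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


def roc_point(sensitivity, specificity):
    """Coordinates of one study in ROC space: (1 - specificity, sensitivity)."""
    return (1.0 - specificity, sensitivity)


# Hypothetical study: 24 patients with the outcome, 40 without.
sens, spec = sensitivity_specificity(tp=18, fn=6, tn=30, fp=10)
point = roc_point(sens, spec)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}, ROC point = {point}")
```

When a study reports more than two subtypes, the counts for all subtypes other than the worst-prognosis one would be pooled before calling the function, matching the rule stated above.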
0	Original  article
0	Monitoring  gene  expression  profile  changes  in  bladder  transitional  cell  carcinoma  using  cDNA  microarray
1	Sun  Ying-Hao,  M.D.a,*,  Yang  Qing,  M.D.a,  Wang  Lin-Hui,  M.D.a,  Gao  Li,  M.D.b,  Tang  Rong,  M.D.c,  Ying  Kang,  M.D.c,  Xu  Chuan-Liang,  M.D.a,  Qian  Song-Xi,  M.D.a,  Li  Yao,  M.D.c,  Xie  Yi,  M.D.c,  Mao  Yu-Ming,  M.D.c
0	Keywords:  Bladder  neoplasms;  Carcinoma;  cDNA  microarray;  Gene
0	Introduction
0	Cancers  have  been  defined  as  groups  of  cells  exhibiting  an  unrestrained  proliferation  phenotype.  The  development  and  progression  of  cancer  result  from  complex  changes  in  patterns  of  gene  expression  in  the  cell,  which  are  accompanied  by  different  histological  or  clinical  classifications  of  the  abnormal  cells'  growth.  It  is  therefore  important  to  identify  the  genes  involved  from  among  those  in  the  human  genome.  Conventional  methods  such  as  northern  or  Southern  blotting  cannot  do  this  efficiently,  whereas  cDNA  microarrays  can.  The  technique  allows  the  expression  levels  of  thousands  of  genes,  both  selected  known  genes  and  cDNAs  representing  uncharacterized  genes,  to  be  monitored  simultaneously  in  one  hybridization  experiment.  By  employing  this  technique,  detec-
0	Chipping  away  at  brain  function:  mining  for  insights  with  microarrays
1	Gilbert  L  Henry,  Karen  Zito  and  Josh  Dubnau
0	The  impact  of  microarray  studies  on  neurobiology  has  been  limited  because,  with  the  exception  of  a  few  outstanding  papers,  most  reports  provide  little  more  than  lists  of  genes,  often  leaving  the  reader  at  a  loss  to  understand  which  and  how  many  of  the  identified  transcripts  will  be  true  positives  with  significant  biological  impact.  However,  some  recent  papers  have  offered  considerable  biological  insight  by  providing  independent  in  vivo  confirmation  of  the  roles  of  candidate  genes,  offering  a  glimpse  of  the  potential  power  of  microarrays  in  neurobiological  research.
0	to  `genes  with  metabolic  function';  in  all  cases,  `genes  of  unknown  function'  dominate  the  pack.  Second,  the  unavoidably  high  level  of  false  positives  inherent  in  the  massively  parallel  quantification  of  small-magnitude  effects  has  necessitated  the  use  of  careful,  low-throughput  follow-up  assays  to  validate  high-throughput  array  experiments.  Despite  these  caveats,  it  is  evident  from  several  recent  studies  that  genome-wide  expression  approaches,  when  validated  with  in  vivo  follow-up  experiments,  can  yield  significant  insights.  Our  objective  with  this  review  is  not  to  dwell  upon  the  technical  aspects  of  gene  expression  profiling  in  the  brain,  as  experimental  design  and  analysis  methods,  and  the  pitfalls  associated  with  these,  have  been  reviewed  extensively  elsewhere  (e.g.  in  [1,2]),  but  instead  to  concentrate  on  the  insights  this  technology  has  offered  us  as  neurobiologists.  With  this  in  mind,  we  have  chosen  to  discuss  a  subset  of  the  most  recent  papers  that  we  feel  increase  our  understanding  of  brain  function.
0	Chips  and  brain  development
0	One  major  effort  of  neurobiological  research  is  the  study  of  brain  development.  At  the  cellular  level,  there  are  questions  concerning  the  genetic  programs  responsible  for  specification  of  neural  cell  fates  and  the  differentiation  of  the  myriad  neuronal  and  glial  types  (the  mammalian  retina  alone  contains  approximately  55  separate  neuronal  types  [3]).  At  the  circuit  level,  the  processes  of  axon  guidance,  target  selection,  synapse  formation  and  refinement  of  synaptic  connections  each  rely  on  a  combination  of  intrinsic  gene-expression  patterns  and  environmental  influences.  In  addition,  at  the  systems  level,  there  is  a  drive  to  fully  map  the  spatial  and  temporal  expression  patterns  of  each  gene.  The  utilization  of  microarrays  to  probe  gene  expression  patterns  at  each  of  these  levels  has  resulted  in  the  identification  of  a  considerable  number  of  candidate  genes,  of  which  a  few  have  been  confirmed  with  in  vivo  studies.
0	Cellular-level  analyses
0	Abbreviations  CREB  cyclic  AMP  response-element  binding  protein  EAE  experimental  autoimmune  encephalomyelitis  FACS  fluorescence-activated  cell  sorting  FGF18  fibroblast  growth  factor  18  FMR1  fragile  X  mental  retardation  gene  FMRP  fragile  X  mental  retardation  protein  FraX  fragile  X  syndrome  GC  granule  cell  G-CSF  granulocyte  colony  stimulating  factor  GFP  green  fluorescent  protein  HD  Huntington's  disease  htt  huntingtin  MS  multiple  sclerosis  OPC  olfactory  progenitor  cell  PolyQ  polyglutamine  SCN  suprachiasmatic  nucleus
0	In  the  past  few  years,  the  use  of  genome-wide  expression  profiling  in  neurobiology  has  exploded.  Although  these  studies  have  in  a  short  period  produced  an  impressive  list  of  candidate  genes,  two  issues  have  limited  the  scope  of  the  ensuing  biological  insights.  First,  long  lists  of  genes  do  not,  on  their  own,  further  our  understanding  of  the  biology.  In  virtually  all  cases,  most  functional  categories  of  genes  are  identified,  ranging  from  `transcription  factors'  to  `translation  factors',  from  `signaling  molecules'  to  `cell-cycle  control  genes'  and  from  `cytoskeletal  proteins'
0	In  the  past  few  years,  several  groups  have  used  microarrays  to  probe  for  gene  expression  patterns  that  confer  upon  neural  stem  cells  their  unique  ability  both  to  self-renew  and  to  differentiate  into  multiple  cell  types  (e.g.  [4-7]).  In  each  case,  >200  genes  were  identified,  including  many  known  markers  of  stem  cells.  It  is  worth  noting,  however,  that  a  third-party  comparison  of  the  `stem-cell  enriched'  transcripts  identified  in  two  of  these  studies  revealed  an  overlap  of  only  15  genes  [8].  This  small  overlap  is  likely  to  be  due  to  discrepancies  in  the  manner
0	in  which  the  stem  cell  populations  were  isolated  and  to  their  lack  of  purity.  The  neural  stem  cells  for  these  experiments  were  obtained  from  neurospheres,  colonies  of  cultured  stem  cells  from  regions  of  the  mammalian  ventricular  and  subventricular  zones.  Neurospheres  are  known  to  be  heterogeneous,  containing  only  3-4%  true  stem  cells  that  give  rise  to  all  three  neural  lineages  [9].  This  heterogeneity  creates  a  signal-to-noise  problem  for  the  detection  of  gene  expression  in  a  given  cell  type.  One  way  to  alleviate  problems  of  tissue  heterogeneity  is  through  the  analysis  of  single  cells.  Technologies  for  single-cell  mRNA  analysis  have  been  under  development  for  just  over  a  decade  [10],  and  recently  the  first  neurobiological  reports  have  emerged  on  the  use  of  single  cells  in  combination  with  microarrays  (e.g.  [11-13]).  Tietjen  et  al.  compared  single  neuronal  progenitor  cells  (OPCs)  from  the  olfactory  bulb  to  mature  olfactory  sensory  neurons  [12].  The  authors  identified  197  genes  enriched  in  OPCs,  some  of  which  were  confirmed  by  in  situ  hybridizations  to  be  expressed  in  proliferative  regions  of  the  olfactory  epithelium.  Evaluation  of  the  overall  success  of  these  and  the  earlier  experiments  awaits  a  detailed  examination  of  the  expression  patterns  of  the  identified  genes  to  determine  their  utility  as  markers  of  stem  cells.  The  discovery  of  stem-cell  marker  genes  should  facilitate  the  identification  and  selection  of  stem-cell  populations  for  functional  studies  as  well  as  for  therapeutic  purposes.  A  second  example  of  the  advantages  afforded  by  cell  purification  comes  from  a  study  of  neuronal  differentiation  in  Caenorhabditis  elegans.  Zhang  et  al.  
examined  downstream  targets  of  mec-3,  a  transcription  factor  required  for  the  development  and  function  of  the  touch-receptor  neuron  [14].  The  authors  used  fluorescence-activated  cell  sorting  (FACS)  to  isolate  populations  of  GFP-expressing  touch-receptor  neurons  from  wild-type  and  mec-3-mutant  animals.  They  identified  71  mec-3-dependent  candidate  genes,  including  seven  of  the  nine  known  mec-3-dependent  genes,  two  genes  known  to  be  expressed  in  touch  receptors,  and  mec-17,  a  gene  previously  identified  in  an  independent  screen  and  required  for  the  maintenance  of  touch-receptor  differentiation.  Seventeen  of  the  newly  identified  and  eight  of  the  nine  known  mec-3-dependent  genes  contained  in  their  promoter  regions  an  over-represented  heptanucleotide  indirectly  implicated  in  mec-3-dependent  transcription,  making  them  potential  direct  targets  for  mec-3  regulation.  Thus,  microarrays  can  facilitate  identification  of  the  genes  responsible  for  differentiation  of  a  particular  neuronal  subtype.  Access  to  a  homogeneous  population  of  that  neuronal  type  greatly  facilitates  this  type  of  analysis.
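The signal-to-noise problem caused by heterogeneous starting material, such as neurospheres containing only 3-4% true stem cells, can be made concrete with a simple mixing calculation. The sketch below assumes a linear mixing model in which every cell contributes the same amount of mRNA and non-stem cells express the transcript at a baseline of 1; the model, the function names and the tenfold enrichment are illustrative assumptions, not figures from the reviewed studies.

```python
def observed_fold(stem_fraction, true_fold):
    """Apparent enrichment of a transcript in a mixed sample.

    Stem cells express the transcript at `true_fold`, all other cells at 1,
    and the sample is compared with a pure non-stem reference.
    """
    if not 0.0 <= stem_fraction <= 1.0:
        raise ValueError("stem_fraction must lie between 0 and 1")
    return stem_fraction * true_fold + (1.0 - stem_fraction) * 1.0


def fold_needed(stem_fraction, target_fold=2.0):
    """True enrichment required for the mixed sample to reach `target_fold`."""
    if not 0.0 < stem_fraction <= 1.0:
        raise ValueError("stem_fraction must be greater than 0 and at most 1")
    return 1.0 + (target_fold - 1.0) / stem_fraction


for fraction in (0.03, 0.04, 1.0):
    seen = observed_fold(fraction, true_fold=10.0)
    needed = fold_needed(fraction)
    print(f"purity {fraction:.0%}: tenfold enrichment appears as {seen:.2f}-fold; "
          f"{needed:.1f}-fold needed to observe a twofold change")
```

Under these assumptions a transcript that is tenfold enriched in stem cells would show well under a twofold change at 3-4% purity, which illustrates why purified or single-cell material helps.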
0	Circuit-level  analyses
0	attempted  to  address  this  issue  in  the  developing  pontocerebellar  projection  system.  First,  the  authors  used  a  powerful  combination  of  approaches  to  characterize  gene  expression  in  cerebellar  granule  cells  (GCs).  By  examining  developmental  gene  expression  in  the  cerebellum  (of  which  GCs  make  up  a  major  component),  in  cultured  GCs,  in  acutely  isolated  GCs,  and  in  two  strains  of  mutant  mice  that  lack  GCs,  the  authors  identified  genes  that  could  play  a  role  in  GC  differentiation.  They  then  looked  at  gene  expression  changes  in  the  pontine  nucleus,  which  contains  the  presynaptic  cells  that  project  to  and  synapse  on  GCs  of  the  cerebellum.  With  the  expectation  that,  in  the  absence  of  their  GC  targets,  pontine  cells  would  not  undergo  target  selection  and  synapse  formation,  the  authors  again  used  mutant  mice  that  lack  GCs,  this  time  to  identify  candidate  genes  responsible  for  axonal  outgrowth  and  synapse  formation.  Although  these  experiments  reveal  some  potentially  promising  candidates,  only  a  small  fraction  of  genes  were  validated  by  in  situ  hybridizations,  and  thus  it  remains  to  be  seen  whether  the  newly  identified  candidate  genes  in  fact  have  their  hypothesized  roles  in  GC  differentiation,  axon  outgrowth  and  synapse  formation.  Additional  investigations  of  the  type  described  by  Diaz  et  al.  are  expected  to  greatly  improve  our  understanding  of  the  genes  involved  in  the  establishment,  maintenance  and  modification  of  the  connectivity  patterns  of  neuronal  circuits.
0	Systems-level  analyses
0	An  ultimate  goal  of  microarray  studies  in  the  brain  is  the  assembly  of  a  comprehensive  map  of  gene  expression  across  all  neuronal  types,  brain  regions,  and  developmental  stages  [1,16].  The  resulting  expression  map  should  enable  neuroscientists  to  gain  a  broader  understanding  of  brain  function  through  a  systems-level  analysis  of  coordinate  gene  regulation  patterns.  Several  groups  have  identified  genes  with  subregion-specific  expression  patterns,  for  example  in  the  hippocampal  subregions  [17]  or  in  the  amygdaloid  subnuclei  [18].  However,  it  is  increasingly  clear  that  most  individual  laboratories  do  not  have  the  resources  to  carry  out  large-scale  validation  of  all  candidate  genes,  which  would  be  required  before  they  could  be  incorporated  into  a  comprehensive  molecular  atlas  of  th
0	Identification  of  genes  involved  in  Drosophila  melanogaster  geotaxis,  a  complex  behavioral  trait
0	Nature  Publishing  Group  http://genetics.nature.com
1	Daniel  P.  Toma1,  Kevin  P.  White2,  Jerry  Hirsch3  &  Ralph  J.  Greenspan1
0	Pioneering  experiments  on  Drosophila  melanogaster  and  Drosophila  pseudoobscura  investigated  the  nature  of  the  genetic  basis  for  extreme,  selected  geotaxic  behavior.  These  experiments  constituted  the  first  attempt  at  the  genetic  analysis  of  a  behavior.  Selection  and  chromosomal  substitution  experiments  successfully  showed  that  there  is  a  genetic  basis  for  extreme  geotaxic  response  in  flies1-5  and,  by  implication,  for  behavior  in  general.  These  experiments  also  added  to  our  understanding  of  the  role  of  variation  in  phenotypic  evolution  and  selection6-8.  Despite  their  seminal  contributions  in  behavioral  genetics,  population  genetics  and  the  study  of  selection,  by  their  nature  these  experiments  could  not  identify  specific  genes9.  These  results  highlight  both  the  success  and  the  limitation  of  behavioral  selection  experiments.  Although  selection  results  tend  to  be  representative  of  the  natural  interactions  of  genes  that  produce  behavior10  and  can  demonstrate  that  a  trait  has  a  genetic  basis,  they  do  not  pinpoint  specific  genes  that  influence  the  trait.  This  is  partly  due  to  the  involvement  of  many  genes  and  the  relatively  minor  role  of  each  in  complex  polygenic  phenotypes--a  problem  that  is  especially  acute  for  the  intrinsically  more  variable  phenotypes  that  are  associated  with  behavior.  The  advent  of  cDNA  microarray  technology  offers  an  easily  generalized  strategy  for  detecting  gene  expression  differences  and  can  complement  other  means  of  identifying  the  genes  that  underlie  complex  traits11.  An  expression  difference  may  occur  in  a  gene  that  is  not  itself  polymorphic,  but  that  gene  may  contribute  to  the  realization  of  the  phenotypic  difference.
0	cDNA  microarray  and  qPCR
0	Initially,  we  used  cDNA  microarrays13  that  contained  about  one-third  of  the  predicted  genes  in  the  genome  to  identify  roughly  250  genes  that  showed  an  approximately  twofold  or  greater  expression  differential  between  the  Hi5  and  Lo  lines.  We  did  these  experiments  in  duplicate  with  different  sets  of  flies  and  removed  the  few  genes  that  behaved  inconsistently  from  further  analysis.  The  number  of  genes  that  showed  consistent  differential  expression  was  about  5%  of  those  assayed.  Thus,  gene  expression  in  these  strains  has  been  modified  as  the  result  of  laboratory  selection.  The  polymorphisms  responsible  for  this  differential  gene  expression  probably  derive  both  from  variation  that  was  present  in  the  initial  selected  populations  and  from  spontaneous  mutations  that  occurred  during  the  course  of  the  selection  experiments.  Not  all  of  these  differentially  expressed  genes  would  be
0	Results
0	Geotaxis  behavior  for  selected  lines
0	As  a  starting  point  for  identifying  genes  that  affect  a  complex  trait,  we  analyzed  the  selected,  established  Hi5  and  Lo  extreme
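0	Table  1  ·  Comparison  of  cDNA  microarray  and  qPCR  ratios  of  mRNAs
Group               Gene        Array (Lo/Hi5)   qPCR (Lo/Hi5)
Experimental group  cry         3.57             5.96
Experimental group  Pdf         1.85             2.02
Experimental group  Pen         0.18             -
Experimental group  pros (l)    3.22             3.71
Experimental group  pros (sl)   3.22             1.57
Control group       cnk         0.92             0.69
Control group       Csp         1.03             1.00
Control group       for         1.27             1.42
Control group       mth         1.11             1.62
Control group       nmo         1.01             1.01
Control group       per         -                1.74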
0	The  average  coefficient  of  variation  for  the  qPCR  results  from  each  selected  line  was  19.33%  with  a  range  of  17.32-23.08%  for  Hi5,  and  22.86%  with  a  range  of  21.96-24.17%  for  Lo.  Because  arrays  were  repeated  only  twice,  no  estimate  of  variance  was  possible.  We  report  no  Pen  qPCR  data  because,  of  six  primer  pairs  tested,  none  amplified  efficiently  enough  to  obtain  consistent  results,  although  the  direction  of  change  for  those  that  gave  some  amplification  was  in  the  predicted  direction.  pros  has  two  splice  variants17,  short  (s)  and  long  (l),  which  the  array  did  not  resolve.  We  therefore  designed  a  separate  primer  pair  for  each  form,  but  the  pair  for  the  short  form,  designated  (sl),  amplifies  both.
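Two calculations mentioned in this section lend themselves to a short sketch: keeping only genes whose roughly twofold difference is consistent across duplicate arrays, and summarizing qPCR replicate spread as a coefficient of variation (standard deviation divided by mean). The exact cutoff, the use of the sample standard deviation, the function names and all numbers below are assumptions for illustration; they are not the authors' data or procedure.

```python
import statistics


def consistent_twofold(ratio_rep1, ratio_rep2, fold=2.0):
    """True if both replicate Lo/Hi5 ratios change at least `fold` in the same direction."""
    both_up = ratio_rep1 >= fold and ratio_rep2 >= fold
    both_down = ratio_rep1 <= 1.0 / fold and ratio_rep2 <= 1.0 / fold
    return both_up or both_down


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean of the replicates must be non-zero")
    return statistics.stdev(values) / mean * 100.0


# Hypothetical duplicate array ratios for four made-up genes.
replicates = {
    "g1": (2.5, 3.1),    # consistently up
    "g2": (0.40, 0.45),  # consistently down
    "g3": (2.2, 0.9),    # inconsistent, removed
    "g4": (1.1, 1.0),    # unchanged
}
kept = sorted(g for g, (r1, r2) in replicates.items() if consistent_twofold(r1, r2))
print("genes kept:", kept)

# Hypothetical qPCR replicate measurements for one gene in one line.
cv = coefficient_of_variation([10.0, 12.0, 8.0])
print(f"coefficient of variation = {cv:.1f}%")
```

With only two array replicates per gene, a spread estimate of this kind would not be meaningful, which is consistent with the table note that no variance estimate was possible for the arrays.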
0	(CS),  that  was  different  from  either  of  the  selected  lines.  We  tested  the  resultant  strains  (Table  2)  in  a  geotaxis  maze.  We  placed  the  mutants  on  a  neutral  background  to  assay  for  those  genes  that  have  the  most  robust  phenotypic  effect  that  is  independent  of  the  combination  of  alleles  in  the  selected  lines.  We  also  tested  the  effects  of  varying  the  gene  dosage  of  Pdf  and  pros.  For  Pdf,  we  constructed  lines  with  Pdf01  (henceforth  referred  to  as  Pdf-)  and  the  wildtype  transgenic  insertion  Pdf+t3.530  (henceforth  referred  to  as  Pdf+t)  to  titrate  its  effect  on  the  behavior.  Likewise,  for  pros  we  used  the  mutant  allele  pros17  and  the  transgenic  insertion  pros+t30.8  (henceforth  referred  to  as  pros+t).  The  Pen  and  cry  mutants  deviated  significantly  from  CS  (Table  3  and  Fig.  2a).  Pdf-  flies  also  deviated  significantly  from  CS.  There  were  also  effects  on  geotaxic  behavior  in  Pdf-  flies  owing  to  alterations  in  gene  dosage  and  sex  (genotype  x  sex  interaction,  F  =  3.85,  P  <  0.0015;  Table  4  and  Fig.  2b).  The  sex-specific  effect  of  varying  Pdf  gene  dosage  was  graded,  with  the  homozygous  Pdf-  males  showing  the  same  response  as  Hi5  males.  In  males,  the  effect  was  Hi5  =  Pdf-/Pdf-  >  Pdf+t/+;  Pdf-/Pdf-  =  Pdf+t/Pdf+t;  Pdf-/Pdf-  >  Pdf-/+  =  CS  =  Pdf+t/+  =  Pdf+t/Pdf+t  >  Lo,  where  nonsignificance  is  indicated  by  `='  and  significance  is  indicated  by  `>'  (Table  4  and  Fig.  2b).  Thus,  although  Pdf-/Pdf-  males  did  not  differ  significantly  from  Hi5  males,  adding  one  copy  of  the  transgene  significantly  lowered  their  score.  Adding
0	Clinical  Study
0	Molecular  Characterization  of  Schizophrenia  Viewed  by  Microarray  Analysis  of  Gene  Expression  in  Prefrontal  Cortex
0	Neuron  54
0	The  changes  in  schizophrenic  subjects  were  assessed  by  gene  expression  profiling  for  250  gene  groups  related  to  metabolic  pathways,  enzymes,  functional  pathways,  or  brain-specific  functions.  More  than  98%  of  the  gene  groups,  when  compared  to  the  expression  pattern  of  all  detectable  transcripts,  were  not  significantly  different  (p  0.05)  between  the  schizophrenic  and  control  subjects  (Figures  1E-1H),  establishing  that  other  changes  that  we  did  detect  are  not  due  simply  to  human  subject  variability.  This  observation  also  is  in  agreement  with  previous  findings  that  total  mRNA  levels  in  schizophrenic  subjects  are  comparable  to  those  in  the  unaffected  human  population  (Harrison  et  al.,  1997).  However,  several  gene  groups  exhibited  significantly  changed  expression  in  schizophrenic  subjects,  both  within  individual  pairs  and  across  pairs  (presynaptic  sec
0	A  cDNA  microarray  from  the  telencephalon  of  juvenile  male  and  female  zebra  finches
1	Juli  Wade  a,  Camilla  Peabody  a,  Paul  Coussens  b,  Robert  J.  Tempelman  b,  David  F.  Clayton  c,  Lei  Liu  d,  Arthur  P.  Arnold  e,  Robert  Agate  e
0	Abstract
0	Studies  over  roughly  the  last  decade  have  emphasized  the  importance  of  gene  expression  in  the  development  of  structure  and  function  of  the  songbird  forebrain.  However,  few  tools  have  been  available  to  efficiently  identify  the  critical  factors.  To  that  end,  we  have  produced  a  normalized  cDNA  library  from  juvenile  zebra  finch  telencephalon,  and  have  spotted  inserts  from  2400  randomly  selected  cDNA  clones  on  microarrays  (1664  unique  sequences).  We  have  also  added  several  previously  cloned  cDNAs  of  interest,  including  three  representing  genes  encoded  on  sex  chromosomes.  Hybridizations  comparing  Cy3-  and  Cy5-labeled  cDNA  from  the  telencephalon  of  day  25  male  and  female  zebra  finches  confirmed  sexually  dimorphic  expression  of  the  Z-  and  W-linked  genes,  demonstrating  the  utility  of  these  microarrays  for  detecting  differential  expression  and  providing  information  about  the  relative  expression  of  these  genes  in  the  brains  of  juveniles  of  this  age.  ©  2004  Elsevier  B.V.  All  rights  reserved.
0	Keywords:  Songbird;  Sexual  differentiation;  Sexual  dimorphism;  Song  development;  Brain  development
0	ing  song  playbacks  differs  in  juvenile  males  and  females  (Bailey  and  Wade,  2003).  Sexual  differentiation  of  the  neural  circuits  governing  reproductive  behaviors  is  regulated  by  gonadal  steroid  hormones  in  diverse  vertebrate  groups.  However,  in  the  zebra  finch,  numerous  experiments  have  suggested  that  gonadal  steroids  are  not  critical  to  the  masculinization  or  feminization  of  the  forebrain  regions  controlling  their  courtship  song  (Arnold,  2002;  Balthazart  and  Adkins-Regan,  2002).  Instead,  factors  intrinsic  to  the  brain  are  responsible,  likely  both  steroid  hormones  synthesized  within  that  organ  and  gene  products  (proteins)  produced  in  neurons  and/or  glia  (Agate  et  al.,  2003;  Holloway  and  Clayton,  2001).  However,  relatively  little  is  known  about  the  specific  genes  involved  in  sexual  differentiation--those  that  influence  or  are  influenced  by  steroid  hormones,  as  well  as  those  that  independently  cause  masculine  or  feminine  development.  Similarly,  although  the  expression  of  immediate  early  genes  has  been  a  powerful  technique  for  functionally  mapping  anatomical  structures  critical  to  song  perception  and  perhaps
0	song-related  memory  formation  (Bailey  et  al.,  2002;  Mello  and  Clayton,  1994;  Mello  et  al.,  1992;  Stripling  et  al.,  2001),  cellular  activity  downstream  of  fos  or  zenk  activation  remains  largely  unexplored  because  an  efficient  means  of  screening  for  the  transcription  of  songbird  genes  has  not  existed.  Until  very  recently  only  tens  of  gene  products  had  been  cloned  from  the  songbird  brain  (Clayton,  1997),  with  a  few  isolated  from  the  zebra  finch  telencephalon  using  differential  display  RT-PCR  (Denisenko-Nehrbass  et  al.,  2000;  Veney  et  al.,  2003).  To  identify  the  critical  gene  products  more  quickly,  we  developed  a  microarray  of  cDNAs  from  the  zebra  finch  telencephalon  useful  for  the  study  of  gene  expression  under  a  variety  of  developmental  conditions.  Morphological  differentiation  of  the  song  circuit(s)  occurs  until  approximately  50  days  after  hatching.  Although  some  characteristics  are  sexually  dimorphic  before  post-hatching  day  10  (Gahr  and  Metzdorf,  1999),  anatomical  differentiation  occurs  at  the  greatest  rate  during  about  days  20-35  (Bottjer  et  al.,  1985;  Kirn  and  DeVoogd,  1989;  Nixdorf-Bergweiler,  1996).  Also,  under  normal  conditions,  exposure  to  song  during  roughly  post-hatching  days  25-35  influences  the  ability  of  both  sexes  to  produce  and/or  respond  to  it  appropriately  in  adulthood  (Clayton,  1988;  Eales,  1985;  Immelmann,  1969;  Miller,  1979;  Nordeen  and  Nordeen,  1997).  Males  typically  form  templates  of  their  fathers'  songs  during  this  period,  and  then  integrate  these  memories  with  their  own  attempts  at  production  until  they  create  a  song  quite  similar  to  their  fathers'  by  about  60  days  of  age  (Nordeen  and  Nordeen,  1997).  
Although  it  takes  another  2  weeks  or  so  to  reliably  take  on  its  permanent,  stable  form,  the  majority  of  song  learning  is  completed  by  day  60.  To  focus  on  genes  involved  in  sexual  differentiation  and  development  of  song  production  and  perception,  cDNA  microarrays  were  produced  using  normalized  libraries  generated  from  the  telencephalons  of  males  and  females  at  days  10-60  post-hatching.  In  addition  to  testing  hypotheses  associated  with  those  processes,  depending  on  the  design  of  the  experiment,  the  cDNAs  on  these  arrays  can  provide  information  about  gene  expression  associated  with  changes  in  neural  function  under  numerous  conditions.
0	Materials  and  methods  2.1.  RNA  isolation  and  library  production  RNA  was  isolated  from  the  telencephalon  of  two  males  and  two  females  at  day  10,  two  females  and  one  male  at  day  22,  and  one  individual  of  each  sex  at  days  30,  45,  and  60  using  Trizol  (Invitrogen  Life  Technologies).  The  concentration  of  each  sample  was  determined,  and  the  purity  and  integrity  of  each  was  checked  on  1%  agarose  gels  before  proceeding.  Separate  male  and  female  cDNA  libraries  were  produced  a
0	IN OUR REPORT, "EVIDENCE FOR COHERENT proton tunneling in a hydrogen bond network" (1), we presented nuclear magnetic resonance relaxometry results for calix[4]arene in the solid state. A peak at 35 MHz in the magnetic field dependence of the proton spin-lattice relaxation rate was interpreted as a manifestation of coherent proton tunneling in a cyclic array of four hydrogen bonds. In the course of further investigations, it has become apparent that the sample supplied to us contained residues of dichloromethane. This brings into question the assignment of the spectral feature because we cannot now rule out the possibility that it derives from quadrupole resonance transitions associated with chlorine nuclei. Thus, we must retract our report. Conclusions regarding the incoherent tunneling of protons in this material are not in question.
0	HIV  Among  Drug  Users  in  China
0	J. KAUFMAN AND J. JING PROVIDE AN EXCELLENT overview of the potentially catastrophic epidemic of HIV/AIDS in China in their Policy Forum "China and AIDS--the time to act is now" (28 June, p. 2339). They note that the Chinese epidemic began among injecting drug users (IDUs) and call for education on safer injection and clean needle programs to reduce HIV transmission among IDUs. HIV among IDUs is clearly a major problem in China: (i) 68.7% of all reported cases of HIV are among IDUs; (ii) HIV infection has spread along drug distribution routes and has occurred among IDUs in all provinces; (iii) extremely rapid HIV transmission has occurred in some populations of IDUs, with incidence rates of over 30% per
1	CHENG  FENG1  AND  DON  DES  JARLAIS2  Kingdom  HIV  Prevention  and  Care  Project,  27  Nanweilu,  Beijing  100050,  China.  2Baron  Edmond  de  Rothschild  Chemical  Dependency  Institute,  Beth  Israel  Medical  Center,  First  .  Avenue  at   Street,  New  York,  NY  10013,  USA.
0	Trying  to  Make  Sense  of  Disorder
0	FENG  AND  DES  JARLAIS  RAISE  IMPORTANT  points,  and  we  fully  agree  with  their  opinions.  Policies  and  programs  to  contain  the  spread  of  HIV  among  IDUs  require  much
0	IN HIS ARTICLE "A FRESH TAKE ON DISORDER, or disorderly science?" (News Focus, 23 Aug., p. 1268), Adrian Cho reports on a lively controversy presently raging over what is called "Tsallis entropy," which has been wrongly supposed to be the physical entropy of the natural world, superseding the universal and general Clausius-Boltzmann statistical-thermodynamic entropy. The new definition of entropy developed by Constantino Tsallis is a very useful--and sophisticated--tool for generating a so-called nonextensive thermostatistics, which can be used for adjusting and analyzing experimental data in certain partic-
0	NOVEMBER  2002
1	ROBERTO  LUZZI,  AUREA  R.  VASCONCELLOS,  J.  GALVAO  RAMOS  Instituto  de  Fisica-Unicamp,  13083-970  Campinas,  SP,  Brasil.
0	ADRIAN  CHO'S  ARTICLE  ON  TSALLIS  ENTROPY  ("A  fresh  take  on  disorder,  or  disorderly  science,"  News  Focus,  23  Aug.,  p.  1268)  emphasizes  the  importance  of  nonextensive  energies  when  analyzing  complex  systems.  To  complement  his  picture,  I  would  like  to  draw  attention  to  an  alternative  way  of  treating  nonextensive  energies,  developed  by  Terrell  Hill  about  40  years  ago  (1-3).  Hill's  approach  is  based  on  the  fundamental  foundation  of  Gibbs'  ensembles  and  does  not  involve  modifying  the  definition  of  entropy.  To  my  knowledge,  Hill's  work  remains  the  only  comprehensive  treat
0	Mfold  web  server  for  nucleic  acid  folding  and  hybridization  prediction
1	Michael  Zuker*
0	Department  of  Mathematical  Sciences,  Rensselaer  Polytechnic  Institute,  Troy,  NY  12180,  USA
0	gi|26014111|ref|NW_044277.1|RnUn_1636 Rattus norvegicus WGS supercontig ATGTTCAATTTTATCTAATCCCTGTTACTCTGGAAAACAGGTTAAAAAAAAAAATCCTCCACAATCCATT TTCTGGAAAACAGCTTACTTCAAAGACCCACCCTTCCTGTAGGACTTTAGTACATCTTTCAGGTGCTTCT;
0	then  the  resulting  sequence  will  be
0	[1-50] GIREFNWRNU NRAUUUSNOR VEGICUSWGS SUPERCONUI GAUGUUCAAU [51-100] UUUAUCUAAU CCCUGUUACU CUGGAAAACA GGUUAAAAAA AAAAAUCCUC [101-150] CACAAUCCAU UUUCUGGAAA ACAGCUUACU UCAAAGACCC ACCCUUCCUG [151-181] UAGGACUUUA GUACAUCUUU CAGGUGCUUC U;
0	rather  than
0	The  letter  `N'  should  be  used  for  an  unspecified  base.  It  is  not  allowed  to  pair.  The  lett
0	BMC  Bioinformatics
0	BMC  Bioinformatics  2002,  3
0	BioMed  Central
0	Methodology  article
0	Open  Access
0	Oliz,  a  suite  of  Perl  scripts  that  assist  in  the  design  of  microarrays  using  50mer  oligonucleotides  from  the  3'  untranslated  region
1	Hao  Chen*  and  Burt  M  Sharp*
0	Keywords:  oligonucleotide  microarray,  Perl,  UniGene
0	DNA microarrays usually involve the hybridization of labeled cDNA samples to a set of complementary DNA (either PCR products or synthetic oligonucleotides) fixed onto solid media. Spotting presynthesized oligonucleotides has many advantages, such as high sensitivity, convenience, and cost effectiveness. Most importantly, the use of oligonucleotide probes circumvents the high error rate that is associated with the PCR amplification of bacterial clones [4,6]. The starting point in the design of oligonucleotide microarrays is the identification of short DNA sequences that can be used as probes for the genes of interest. Obviously,
0	all sequences should be gene specific and have similar melting temperatures (Tm). We have been interested in using the 3' untranslated region (3'UTR) as the target region for the design of oligonucleotide probes primarily because of the relatively high specificity of this region [8] and the availability of sequence information (in the form of Expressed Sequence Tags, ESTs). First, our approach involves the identification of genes of interest in the form of UniGene clusters. The sequences of these clusters are retrieved and assembled into contigs. Then, the 3'UTRs are parsed from the contigs. Finally, oligonucleotide sequences of 50 nucleotides with similar
0	Step 1. UniGene retrieval and contig assembly A list of selected UniGenes is first compiled and used as the input file for the UNI module. The sequences contained in these UniGene clusters are extracted by the UNI module. To achieve this function, the UNI module requires a file that contains all the UniGene sequences of the species of interest. This file is available from NCBI's FTP site [ftp://ftp.ncbi.nih.gov/repository/UniGene]. The name of this file follows the convention of "species.seq.all". Then, the CONTIG module assembles each of the clusters into a contig using the CAP3 program [2]. Due to the high error rate in both the sequence and annotation of ESTs, clusters that contain only one EST sequence are excluded from further analysis. Step 2. Parsing 3'UTR The UTR module performs several tasks. Initially, it determines the orientation of a contig by comparing it to a reference sequence, such as those provided by the NCBI RefSeq project [5] (1st priority), or GenBank sequences with coding region annotations (2nd priority), or sequences with polyA tails (3rd priority). It is generally assumed that these sequences are in 5'-3' orientation. When the above approaches fail to identify the orientation of the contig, its cluster identifier is sent to a separate file. The orientation of these contigs can be obtained manually by cross-referencing to their homologues in other species, and then be incorporated into the results.
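The cluster filtering and reference-priority rules in Steps 1 and 2 can be illustrated with a short sketch. This is not the Oliz Perl code: the `Reference` record, the source labels (`refseq`, `genbank_cds`, `polya`) and both function names are hypothetical, chosen only to mirror the priority order described above.

```python
from collections import namedtuple

# Hypothetical record for a candidate reference sequence (illustrative only).
Reference = namedtuple("Reference", ["accession", "source"])

# Lower number = higher priority, following the order given in Step 2.
PRIORITY = {"refseq": 1, "genbank_cds": 2, "polya": 3}


def drop_singletons(clusters):
    """Keep only clusters that contain more than one EST sequence."""
    return {cid: seqs for cid, seqs in clusters.items() if len(seqs) > 1}


def pick_reference(references):
    """Return the highest-priority reference, or None if none is usable.

    A None result corresponds to the case where the cluster identifier
    is written to a separate file for manual orientation.
    """
    usable = [r for r in references if r.source in PRIORITY]
    if not usable:
        return None
    return min(usable, key=lambda r: PRIORITY[r.source])
```

A contig whose cluster yields `None` here would be set aside for manual cross-referencing to homologues, as the text describes.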
0	The  UTR  module's  main  function  is  to  parse  the  3'UTR  of  the  contigs,  according  to  the  coding  region  annotation  in  the  reference  sequence.  The  length  of  the  3'UTR  varies  from  gene  to  gene.  Based  on  the  average  length  of  the  transcripts  obtained  from  oligo  dT  primed  cDNA  synthesis,  we  decided  to  target  the  last  500  bases  of  the  3'UTR  as  the  region  for  the  selection  of  50mer  oligonucleotides.  In  addition,  the  UTR  module  generates  several  HTML  files  to  facilitate  visual  inspection  of  the  results.  These  files  contain  links  to  the  UniGene  cluster  sequences,  the  contigs  and  the  3'UTRs.
0	Step 3. Generating 50mer oligonucleotides with close Tms The EMBOSS prima program was used to select 50mer oligonucleotides with similar Tms for each 3'UTR. The Tm was set at 76 ± 5°C based on the average Tm for 50mers. The resulting 50mer sequences were saved as a comma-separated text file, ready for processing by the UNIQ module. Step 4. Similarity search One of the advantages of using the 3'UTR as the target region for hybridization is that this region has been under less evolutionary pressure to remain constant. However, this does not guarantee that all 50mers selected from this region are gene specific. Therefore it is necessary to identi-
0	melting  temperature  (Tm)  and  GC  content  were  selected  and  screened  for  specificity  (Figure  1).
0	The Oliz suite was written in Perl (v.5.6) and was tested on the RedHat Linux (v.7.1) operating system. Oliz has four modules. The UNI module extracts UniGene clusters, which are assembled into contig(s) by the CONTIG module. Then, the UTR module parses the 3'UTRs of the contigs, and selects multiple 50mer sequences that are within the selected range for GC content (45-50%) and Tm (76 ± 5°C). Lastly, the UNIQ module performs blast searches on the 50mers to ensure their gene specificity.
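As a rough illustration of the selection criteria (last 500 bases of the 3'UTR, 50mers, GC content 45-50%, Tm 76 ± 5°C), the sketch below slides a 50-base window over a sequence and keeps the windows that pass both filters. It is an assumption-laden stand-in, not the EMBOSS prima algorithm: the Tm is estimated with a common salt-adjusted approximation, and the 0.1 M Na+ concentration, the function names and the default arguments are all hypothetical.

```python
import math


def gc_percent(seq):
    """GC content of a sequence, in percent."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def approx_tm(seq, na_molar=0.1):
    """Approximate Tm (deg C) for a long oligo.

    Assumed formula: 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 675/N.
    EMBOSS prima may compute Tm differently.
    """
    n = len(seq)
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent(seq) - 675.0 / n


def candidate_50mers(utr, window=50, tail=500,
                     gc_range=(45.0, 50.0), tm_target=76.0, tm_tol=5.0):
    """Return (start, oligo, gc, tm) for every window passing both filters."""
    region = utr[-tail:]  # target only the last `tail` bases of the 3'UTR
    hits = []
    for start in range(len(region) - window + 1):
        oligo = region[start:start + window]
        gc = gc_percent(oligo)
        tm = approx_tm(oligo)
        if gc_range[0] <= gc <= gc_range[1] and abs(tm - tm_target) <= tm_tol:
            hits.append((start, oligo, gc, tm))
    return hits
```

The surviving candidates would then be passed to a specificity screen such as the one performed by the UNIQ module.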
0	fy potentially similar sequences in other genes. The UNIQ module automates the blastn search, analyzes the blastn results, and decides whether to retain or discard a particular 50mer based on the set criteria. The UNIQ module runs blastn searches using a local database constructed using sequences obtained from NCBI. While analyzing the sequences identified by blastn, it disregards accession numbers that are found in the same UniGene cluster as the 50mer. Matches that are oriented complementary to the 50mer also are ignored, and only sense/sense pairs are analyzed further. The orientation of the blastn matches is apparent when they are known genes. EST hits are judged based on their "clone_end" annotation. Kane et al. [3] reported that specificity of a 50mer oligonucleotide requires that it is less than 75% similar to all non-target transcripts. In addition, when it is 50-75% similar to a non-target transcript, the similar region must not include a stretch of sequence of greater than 15 contiguous bases. Since blast only returns part of the sequence where a match is found (usually less than 50 nucleotides), it is necessary for the UNIQ module to retrieve the entire matching sequence before calculating the overall sequence similarity. The guidelines reported by Kane et al. are then followed to determine whether candidate 50mers are acceptable. Occasionally all the candidate sequences generated by EMBOSS prima were disqualified when compared to one EST entry. This is a difficult issue, insofar as these ESTs may represent unknown genes, implying that the candidate 50mer is not gene specific. However, the apparent similarity may simply be caused by errors in the EST. 
When  this  occurs,  the  UNIQ  module  performs  another  blastn  search  that  excludes  all  the  ESTs  from  the  database.  The  accession  number  of  the  EST  in  question  is  provided  in  the  output  file  and  a  detailed  log  file  for  each  oligonucleotide  sequence  is  also  provided.
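The Kane et al. criteria quoted above can be expressed as a small decision function. The sketch below assumes the 50mer and the retrieved non-target region are already aligned without gaps and have equal length, which is a simplification of what a blastn-based comparison would provide; the function names and default thresholds are hypothetical and simply encode the 75% and 15-base rules from the text.

```python
def percent_identity(oligo, target):
    """Percent of matching positions between two equal-length, pre-aligned sequences."""
    if len(oligo) != len(target):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(oligo, target))
    return 100.0 * matches / len(oligo)


def longest_identical_run(oligo, target):
    """Length of the longest stretch of contiguous identical positions."""
    best = run = 0
    for a, b in zip(oligo, target):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


def is_specific(oligo, target, max_identity=75.0, run_floor=50.0, max_run=15):
    """Apply the two rules: <75% overall similarity, and no identical
    stretch longer than 15 bases when similarity is 50-75%."""
    ident = percent_identity(oligo, target)
    if ident >= max_identity:
        return False
    if ident >= run_floor and longest_identical_run(oligo, target) > max_run:
        return False
    return True
```

A candidate 50mer would be retained only if `is_specific` holds against every non-target transcript examined.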
0	Experimental  verification  of  the  specificity  of  the  50mer  oligonucleotides  A  set  of  1816  rat  specific  50mer  oligonucleotide  sequences  was  obtained  using  the  methods  described  above.  Most  of  these  genes  are  known  to  be  expressed  in  the  central  nervous  system.  These  oligonucleotides  were  spotted  in  duplicate  onto  TeleChem  SuperAmine  slides.
0	brain mRNA. Five of these ten primer pairs amplified a single product with the expected length. A second PCR reaction was performed on these 5 RT-PCR products to selectively amplify the antisense strand while incorporating amino allyl dUTP. These antisense DNAs then were labeled with Cy3 fluorescent dye, and were used for microarray hybridization. Each microarray slide was only probed with one Cy3-labeled DNA. All of the five Cy3-labeled cDNAs hybridized to their expected spots. The subgrids (13 × 8 spots) that contain the specific hybridized spots are shown in Figure 2. Two spots on the array, known to have green autofluorescence (not shown), were excluded from the analysis. Depending on the specific cDNA sequence, there were 0-4 additional spots that had detectable fluorescence. This represent
0	Spotted  Long  Oligonucleotide  Arrays  for  Human  Gene  Expression  Analysis
1	Andrea  Barczak,1  Madeleine  Willkom  Rodriguez,1  Kristina  Hanspers,2
1	Laura  L.  Koth,1  Yu  Chuan  Tai,3  Benjamin  M.  Bolstad,3  Terence  P.  Speed,4,5  and  David  J.  Erle1,6
0	Microarrays can be produced by deposition (or spotting) of DNA or by in situ synthesis of oligonucleotides on a solid substrate. Spotted cDNA arrays are typically produced by depositing PCR amplicons, made from cDNA clones, on modified glass slides (Schena et al. 1996). In general, PCR amplicons are several hundred to a few thousand base pairs long, and one amplicon (or sometimes a few different amplicons) is used to probe each gene. These arrays can be produced by individual investigators or core facilities, or can be purchased commercially. Production of microarrays by in situ synthesis requires more sophisticated and costly equipment, and these arrays are generally produced commercially. One widely used implementation of this technology is the Affymetrix short oligonucleotide array (GeneChip). Here, photolithography and solid-phase chemistry are used to produce high-density arrays of 25-mer oligonucleotides (Lockhart et al. 1996). Each perfect-match oligonucleotide is paired with a mismatched oligonucleotide, and several (11-20) pairs of 25-mers are used for each gene. Various approaches have been used to verify the accuracy of microarray data. Microarray assay technology can be calibrated by spiking known quantities of one or several RNA transcripts into test samples. Alternatively, independent
0	Genome  Research
0	Barczak  et  al.
0	We  produced  two  different  sets  of  spotted  arrays  using  two  collections  of  long  oligonucleotide  probes  (Operon  Human  Genome  Oligo  Set  Versions  1  and  2,  Table  1).  There  were  10,801  UniGene  clusters  that  were  represented  in  both  groups  of  probes,  but  the  sequences  of  these  two  groups  of  probes  were  largely  independent:  Version  1  and  Version  2  probes  overlapped  significantly  (by  at  least  25  identical  bases)  for  just
0	of  the  10,801  gene  clusters  that  were  represented  in  both  versions.  We  also  used  commercially  produced  arrays  containing  sets  of  25-mer  probes  synthesized  in  situ  (Affymetrix  U95Av2  GeneChips).  We  used  all  three  groups  of  probes  to  compare  gene  expression  in  two  total  RNA  samples,  one  made  from  K562  erythroleukemia  cells  and  one  made  from  a  pool  of  10  different  cell  lines.  For  spotted  long  oligonucleotide  arrays,  the  RNA  samples  were  used  to  produce  labeled  cDNA  targets.  Two  color  hybridizations  were  performed  using  Cy3-  and  Cy5-labeled  targets  derived  from  the  two 
0	ANALYTICAL  BIOCHEMISTRY
0	A  new  polymeric  coating  for  protein  microarrays
1	Marina  Cretich,a,¤  Giovanna  Pirri,a  Francesco  Damin,a  Isabella  Solinas,b  and  Marcella  Chiaria
0	Keywords:  Protein  microarrays;  Polymer  coating;  Rheumatoid  factor
0	Protein microarrays are becoming an important tool in proteomics, drug discovery programs, and diagnostics [1]. The amount of information obtained from small quantities of biological samples is significantly increased in the microarray format. This feature is extremely valuable in protein profiling, where samples are often limited in supply and, unlike DNA, cannot be amplified [2]. Protein microarrays are more challenging to prepare than are DNA chips [3] because several technical hurdles hamper their application. The surfaces typically used with DNA are not easily adaptable to proteins, owing to the biophysical differences between the two classes of bioanalytes [4]. Arrayed proteins must be immobilized in a native conformation to maintain their biological function. Unfortunately, proteins tend to unfold when immobilized onto a support so as to allow internal hydrophobic side chains to form hydrophobic bonds with the solid surface [5]. The accessibility of the protein is also of crucial importance to
0	achieve proper recognition during hybridization; protein-substrate interactions reduce the accessibility of the target, leading to false negative results. Another important requirement of the surface is to provide a low unspecific background because unwanted adsorption of proteins leads to false positive results. The presence of an aspecific background is one of the most severe problems in antibody microarrays [6]. The achievement of a low degree of unspecific binding is extremely difficult when the protein sample is a complex mixture of thousands of molecules [4]. Current microarray supporting materials can be divided into two major categories [7]: surfaces coated with gels, such as polyacrylamide and agarose, and surfaces derivatized with functional groups, such as aldehyde, epoxy, and amino groups (polylysine). Methods for on-chip protein analysis also include the ProteinChip array technology that is based on selective extraction and retention of proteins on chromatographic chip surfaces and analysis by laser desorption/ionization mass spectrometry [8].
0	Recently, our group has introduced a new type of polymeric glass slide for DNA microarrays [9] obtained by adsorption of a copolymer of N,N-dimethylacrylamide (DMA),1 N,N-acryloyloxysuccinimide (NAS), and [3-(methacryloyl-oxy)propyl]trimethoxysilyl (MAPS): copoly(DMA-NAS-MAPS). Each monomer confers to the copolymer a specific feature. NAS is the reactive group able to bind amino-modified DNA and primary amines of lysines and arginines in proteins. DMA, which forms the polymer backbone, facilitates polymer adsorption on the glass surface, whereas MAPS covalently reacts with free silanols and stabilizes the coating. The coating is innovative in that it adsorbs onto the glass surface very quickly (10-30 min) from a diluted aqueous solution. Therefore, the coating procedure is fast and robust, providing an inexpensive hydrophilic functional surface. The performance of glass slides coated with the copoly(DMA-NAS-MAPS) has been studied extensively in DNA microarray experiments [9]. In the current work, copoly(DMA-NAS-MAPS) slides were used as a microarray support for protein-protein interaction experiments and in the assessment of rheumatoid factor (RF) in human serum samples.
0	Materials and methods Materials DMA and MAPS were obtained from Sigma (St. Louis, MO, USA). NAS was obtained from Polysciences (Warrington, PA). Anti-rabbit immunoglobulin G (IgG) F(ab')2 fragments specific, developed in goat (goat IgG specific for the Fab fragments) were obtained from Jackson ImmunoResearch Laboratories (West Grove, PA, USA). Anti-human polyvalent immunoglobulins developed in goat (goat IgG), Tris, BSA, and Tween 20 were obtained from Sigma. Immunoglobulins from rabbit serum (rabbit IgG) were obtained from Life Line Lab (Pomezia, Italy). CodeLink Activated Slides were obtained from Amersham Biosciences (Piscataway, NJ, USA), and ArrayIt Super Aldehyde Substrates were obtained from TeleChem International (Sunnyvale, CA, USA). Glass slide coating Untreated microscope glass slides (Sigma) were pretreated with 1 M NaOH for 30 min and 1 M HCl for 1 h,
0	Abbreviations used: DMA, N,N-dimethylacrylamide; NAS, N,N-acryloyloxysuccinimide; MAPS, [3-(methacryloyl-oxy)propyl]trimethoxysilyl; RF, rheumatoid factor; IgG, immunoglobulin G; NHS, N-hydroxysuccinimide; D/P, dye-to-protein ratio; PMT, photomultiplier tube; S/N, signal-to-noise ratio; EIA, enzyme-linked immunoassay; ELISA, enzyme-linked immunosorbent assay; XRR, X-ray reflectivity.
0	where 170,000 M$^{-1}$ cm$^{-1}$ is assumed as the molar extinction coefficient for IgG. The dye-to-protein ratio (D/P) for the labeled IgG was calculated according to the following equation: $D/P = (1.13\,A_{552}) / [A_{280} - (0.08\,A_{552})]$, (2)
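A worked example may help when applying Eq. (2). The sketch below simply evaluates the expression for a pair of absorbance readings; the function name and the input values are hypothetical and chosen for illustration, while the constants 1.13 and 0.08 are taken directly from the equation.

```python
def dye_to_protein(a552, a280, factor=1.13, correction=0.08):
    """Evaluate Eq. (2): D/P = factor*A552 / (A280 - correction*A552)."""
    corrected_a280 = a280 - correction * a552  # protein absorbance after removing the dye contribution
    if corrected_a280 <= 0:
        raise ValueError("corrected A280 must be positive")
    return factor * a552 / corrected_a280


# Illustrative readings only (not data from the paper).
ratio = dye_to_protein(a552=0.30, a280=0.50)
print(round(ratio, 2))
```

With these made-up readings the corrected A280 is 0.476, so the ratio is 0.339/0.476, or roughly 0.7 dye molecules per IgG.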
0	and scanned again. Mean intensity values of 4 × 4 spot subarrays were calculated and plotted against spotted concentration. Antibody Fab portion recognition on copoly(DMA-NAS-MAPS) slides Rabbit IgG were dissolved in a PBS buffer at different concentrations and spotted on the copoly(DMA-NAS-MAPS) slides. After overnight binding in a humid chamber, printed slides were rinsed and blocked with BSA (2% w/v) in a phosphate buffer (50 mM, pH 7.2) for 1 h. The slides were incubated for 1 h with Cy3-labeled goat IgG specific for the Fab fragments, dissolved in the hybridization buffer (Tris-HCl, 0.1 M, pH 8; 0.1 M NaCl; 1% w/v BSA; 0.02% w/v Tween 20) at a concentration of 0.05 mg/ml. After washing with Tris-HCl (0.05 M, pH 9), 0.25 M NaCl, 0.05% Tween 20, PBS, and water, the slides were dried and scanned for fluorescence evaluation. Sandwich immunoassay on microarray format The capture antigen (rabbit IgG) was dissolved in 
0	Evolution  of  new  nonantibody  proteins  via  iterative  somatic  hypermutation
1	Lei  Wang*,  W.  Coyt  Jackson*,  Paul  A.  Steinbach*,  and  Roger  Y.  Tsien*
0	B  lymphocytes  use  somatic  hypermutation  (SHM)  to  optimize  immunoglobulins.  Although  SHM  can  rescue  single  point  mutations  deliberately  introduced  into  nonimmunoglobulin  genes,  such  experiments  do  not  show  whether  SHM  can  efficiently  evolve  challenging  novel  phenotypes  requiring  multiple  unforeseeable  mutations  in  nonantibody  proteins.  We  have  now  iterated  SHM  over  23  rounds  of  fluorescence-activated  cell  sorting  to  create  monomeric  red  fluorescent  proteins  with  increased  photostability  and  far-red  emissions  (e.g.,  649  nm),  surpassing  the  best  efforts  of  structure-based  design.  SHM  offers  a  strategy  to  evolve  nonantibody  proteins  with  desirable  properties  for  which  a  high-throughput  selection  or  viable  single-cell  screen  can  be  devised.
0	directed  evolution  mPlum  Ramos  red  fluorescent  protein
0	Materials  and  Methods
0	Introduction  of  the  mRFP1.2  Gene  into  Ramos  Cells.  The  mRFP1.2
0	gene was amplified with primer pair LW5 (5'-CGCGGATCCGCCACCATGGTGAGCAAGGGC-3') and LW3 (5'-CCATCGATTTAGGCGCCGGTGGAGTGGCG-3'), digested with BamHI and ClaI, and ligated into a precut pCLNCX (Imgenex, San Diego) derivative retroviral vector, in which the cytomegalovirus (CMV) promoter was replaced with the inducible Tet-on promoter. The resultant plasmid, pCLT-mRFP, was cotransfected with pCL-Ampho (Imgenex) into HEK293 cells to make the retrovirus, which was subsequently used to infect Ramos cells [CRL-1596, American Type Culture Collection (ATCC)] together with another retrovirus harboring the reverse Tet-controlled transactivator. Ramos cells were grown in modified RPMI medium 1640 as suggested by ATCC. Doxycycline (2 µg/ml) was added to induce the expression of mRFP 24 h before FACS, and infected cells were sorted for six rounds to enrich red fluorescent cells. In the initial sorting, 5% of cells became red, indicating a multiplicity of infection well below 1.
0	Protein Evolution by FACS. Ratio sorting was applied to evolve mRFP mutants with red-shifted emissions. Ramos cells were excited at 568 nm, and two emission filters (660/40 and 615/40) were used. The ratio of intensity at 660 nm to that at 615 nm was plotted against the intensity at 660 nm. Cells with the highest ratio and sufficient intensity at 660 nm were collected (Fig. 1B). Usually one million cells were collected each time, and they were grown in the absence of doxycycline until 24 h before the next round of sorting. Mutant Characterization. Sorted cells were amplified in the absence of doxycycline, and 0.1 µg/ml doxycycline was then added for 10 h. Total mRNA was extracted from these cells and used as template for RT-PCR to clone mRFP mutant DNA with primer pair pCL5 (5'-AGCTCGTTTAGTGAACCGTCAGATC-3') and pCL3 (5'-GGTCTTTCATTCCCCCCTTTTTCTGGAG-3'). These mutant mRFP genes were subcloned into a pBAD vector (Invitrogen) and expressed in Escherichia coli. A His-6 tag was added to the C terminus to facilitate protein purification using Ni-NTA chromatography (Qiagen, Valencia, CA). Spectroscopic measurements were as described previously (12), except that concentrations of mRFPs were determined by assuming an extinction coefficient after denaturation in 0.1 M NaOH of 44,000 M⁻¹ cm⁻¹ at 452 nm, the same value as that of similarly denatured Renilla GFP (13, 14). Photobleaching Measurements. Microdroplets of aqueous protein,
0	pH 7.4, typically 5-10 µm in diameter, were created on a
0	Freely  available  online  through  the  PNAS  open  access  option.  Abbreviations:  SHM,  somatic  hypermutation;  mRFP,  monomeric  red  fluorescent  protein.  Data  deposition:  The  sequences  reported  in  this  paper  have  been  deposited  in  the  GenBank  database  [accession  nos.  AY786536  (mRaspberry)  and  AY786537  (mPlum)].
0	by  The  National  Academy  of  Sciences  of  the  USA
0	November  30,  2004
0	APPLIED  BIOLOGICAL  SCIENCES
0	Identification  of  Integration  Loci.  The  integration  loci  of  provirus
0	microscope coverslip under mineral oil and bleached by using a Zeiss Axiovert 200 microscope at 14.3 W/cm² with a 75-W xenon lamp and a 540- to 595-nm excitation filter. Reproducible results required preextraction of the mineral oil with aqueous buffer shortly before microdroplet formation.
0	Wang  et  al.
0	MAXIMIZING  THE  POTENTIAL  OF  FUNCTIONAL  GENOMICS
1	Lars  M.  Steinmetz*  and  Ronald  W.  Davis
0	Geneticists  have  made  tremendous  progress  in  understanding  the  genetic  basis  of  phenotypes,  and  genomics  promises  to  bring  further  insights  at  a  rapid  pace.  The  progress  in  functional  genomics  has  been  driven  primarily  by  the  development  of  new  techniques  that  are  used  in  a  few  dedicated  research  centres.  Focusing  on  selected  advances  in  genomic  technologies,  we  assess  the  results  that  have  been  obtained  so  far,  highlight  the  challenges  faced  by  these  new  tools  and  suggest  ways  in  which  they  can  be  overcome.  We  argue  that  progress  in  functional  genomics  will  depend  on  developing  high-throughput  technologies  that  can  easily  be  moved  away  from  dedicated  centres  and  into  individual  laboratories.
0	COMPLEX  TRAITS
0	A  trait  that  is  determined  by  many  genes,  almost  always  interacting  with  environmental  influences.
0	Biology is entering an exciting era brought about by the increase in genome-wide information. Functional genomics in particular is making rapid progress in assigning biological meaning to genomic data. The tools of functional genomics have enabled several systematic approaches that can provide the answers to a few basic questions for the majority of genes in a genome, including when is a gene expressed, where is its product localized, with which other gene products does it interact and what phenotype results if a gene is mutated. Functional genomics aspires to answer such questions systematically for all genes in a genome in contrast to conventional approaches that do so for one gene at a time. Several key biological challenges are central to continuing genome projects and are relevant to any eukaryotic organism, from yeast to humans. One challenge is to understand how genes that are encoded in a genome operate and interact to produce a complex living system. A related challenge is to determine the function of all the sequence elements in the genome. A third challenge is to understand the contributions of the multitude of sequence variants to phenotypic variation, both within and between species. One of the most enduring challenges in genetics has been to find the genetic variants that are responsible for COMPLEX TRAITS1. Current methods have mostly failed to meet
0	this  challenge2,  resulting  in  the  need  for  new  concepts  and  genome-wide  technologies  if  this  complexity  is  to  be  dissected.  Despite  the  unresolved  issues,  the  power  and  potential  of  functional  genomics  is  impressive.  We  illustrate  this  here  by  discussing  three  core  applications  of  genome  technology,  using  selected  examples  from  different  organisms:  genome-wide  knock-out,  gene  expression  and  genetic  mapping  studies.  We  go  beyond  these  examples  to  point  out  the  areas  in  which  technological  improvements  are  possible.  As  functional  approaches  and  verification  of  their  accuracy  often  require  genetic  manipulation,  many  technical  advances  in  functional  genomics  have  their  origin  in  model  systems.  Nonetheless,  an  effective  transition  of  some  of  the  technologies  to  humans  is  becoming  more  attractive3.  The  utility  of  such  a  transition  can  be  maximized  by  careful  evaluation  of  the  power  and  limitation  of  these  approaches.  To  obtain  the  most  benefits  from  functional  genomics,  we  argue,  the  technology,  which  is  at  present  mainly  carried  out  by  a  few  dedicated  centres,  needs  to  become  integrated  into  individual  laboratories.  Individual  laboratories  often  have  crucial  expertise  in  a  specific  biological  problem,  and  although  functional  genomics  might  provide  approaches  to  address  them,  a  key  discovery  can  often  only  be  made  by  bringing  the
0	two  together.  We  believe  that  for  this  to  be  achieved,  two  goals  should  be  met:  experiments  must  be  further  miniaturized  and  costs  must  be  lowered.
0	Technological innovations
0	Efforts towards increased miniaturization and decreased costs are exemplified by developments that originated from genome sequencing. In many ways, functional genomics was catalysed by the genome-sequencing projects: large-scale sequencing and the genome projects created an increase in available DNA sequences, around which new technologies that use this information were developed. A result is one of the most widely recognized and accessible genomics tools -- the DNA microarray -- which allows parallel hybridization assays to be carried out on an unprecedented, miniaturized scale. The second, and often unrecognized, contribution of the genome projects is the ~1,000-fold decrease in the cost of DNA sequencing, which had to be achieved to complete the Human Genome Project. The drop in sequencing costs facilitated large-scale sequencing projects of other organisms and has contributed to the fact that DNA sequencing is still the most frequently used technology for detecting DNA variation. Today, the comparison of genomes among several species allows the study of numerous biological features, such as studies of conserved sequences4-6. Developments of genomic technology have until now primarily focused on the generation of genome sequence data, from the development of genome-analysis technologies to the generation of physical and genetic maps, the sequencing of model organism genomes and the completion of the human genome sequence. The next focus in genomics builds on the genome sequences and heralds the beginnings of an exciting phase of genome biology -- the true genome era, when deriving functional information from genome sequences takes centre stage3. With this role in mind, we evaluate three areas of functional genomics that have been piloted in different model systems. We indicate promising directions of research and suggest new approaches that need to be designed.
0	Interfering with gene function. Phenotypic analysis of mutants has been a powerful approach for determining gene function. Gene function can be altered through gene deletions, insertional mutagenesis and RNA INTERFERENCE (RNAi) (BOX 1). Few methods offer the experimental control that is afforded by gene deletion. A true knock-out or null mutation achieves complete functional reduction of the encoded gene product. Because it is difficult to achieve in many organisms, compromises have been made by generating incomplete knock-outs. Gene products can be knocked-down or silenced as a result of point mutagenesis, insertional mutagenesis or RNAi. Although not yet feasible on a large scale, proteins might be targeted using drugs7, and it might eventually be possible to use drug compounds to generate knock-downs for every gene product in a genome and to apply them across species. The power of systematic mutant analysis is well illustrated by an experiment in which an international consortium systematically generated a gene deletion strain for every gene in the yeast Saccharomyces cerevisiae genome and analysed the phenotypes in a single tube assay8,9 (FIG. 1). The quantitative fitness measurements that are obtained for each gene with this tool enable applications beyond determining whether a gene is essential. This is an important advance because it opens up a wide variety of applications based on quantitative analysis, such as identifying functionally relevant genes and drug targets, comparing function and expression, defining candidate disease genes and studying molecular evolution (BOX 2).
0	Targeted  deletion  by  homologous  recombination
0	Precise gene deletion can be readily achieved by homologous recombination in yeast8,9 and mouse11. Because this approach removes the targeted gene, functional reduction is complete. In organisms in which it works, this method is the gold standard. Unfortunately, homologous recombination does not work efficiently in several model organisms, including Arabidopsis and Caenorhabditis elegans. Although it has been shown to work in some cases, as seen recently in Drosophila12, the efficiencies are still too low for systematic application.
0	Insertional  mutagenesis
0	Disruption of gene sequences can be achieved by insertional mutagenesis using transposons or other insertion sequences. Because the genome insertions are random, screening for disruption in a gene of interest is required. The insertion can lead to complete, incomplete or no functional reduction, depending on where the integration occurs. The insertion site and level of functional reduction therefore need to be determined experimentally. The method has been used extensively in Arabidopsis16, Drosophila77,78, yeast15, mouse79 and C. elegans80.
0	RNA  interference
0	RNA INTERFERENCE (RNAi)
0	A process by which double-stranded RNA specifically silences the expression of homologous genes through degradation of their cognate mRNA.
0	RNA interference (RNAi) is the newest technology for reducing gene expression. It follows reports of gene silencing in plants and other model organisms81, and is based on the observation from C. elegans that adding double-stranded RNA (dsRNA) to cells often interferes with gene function in a sequence-specific manner17. In most cases, the level of functional reduction is incomplete and the level of specificity is not entirely predictable24-26. Nevertheless, RNAi has been shown to work in many model organisms. Current applications are primarily in C. elegans18, Drosophila19, various plants81, tissue culture cells of Drosophila82 and mammals23.
0	NATURE  REVIEWS  |  GENETICS
0	[Figure 1 (schematic; only the labels survive extraction): deletion cassette containing UPTAG, KanMX and DNTAG in place of the ORF between Start and Stop; PCR amplification of the tags; panel labels CP, F and TAG.]
0	Picoliter-Scale  Protein  Microarrays  by  Laser  Direct  Write
1	B. R. Ringeisen,* P. K. Wu, H. Kim, A. Piqué, R. Y. C. Auyeung, H. D. Young, and D. B. Chrisey
0	Naval  Research  Laboratory,  Code  6372,  4555  Overlook  Ave.  SW,  Washington,  D.C.  20375
1	D.  B.  Krizman
0	Advanced  Technology  Center,  National  Cancer  Institute,  Gaithersburg,  Maryland
0	We demonstrate the accurate picoliter-scale dispensing of active proteins using a novel laser transfer technique. Droplets of protein solution are dispensed onto functionalized glass slides and into plastic microwells, activating areas as small as 50 µm in diameter on these surfaces. Protein microarrays fabricated by laser transfer were assayed using standard fluorescent labeling techniques to demonstrate successful protein and antigen binding. These results indicate that laser transfer does not damage the active site of the dispensed protein and that this technique can be used to successfully fabricate a functioning protein microarray. Also, as a result of the efficient nature of the process, material usage is reduced by two to four orders of magnitude compared to conventional pin dispensing methods for protein spotting.
0	Microarrays are used widely as an efficient method to identify thousands of different analytes in solution with a single assay (e.g., protein expression, drug efficacy, DNA binding, etc.) (1). In the field of genomics, microarrays are fabricated using different immobilized cDNA molecules to detect genes for both biological and medical research (2). This technology has increased the speed and efficiency of gene identification by orders of magnitude over more traditional assays such as Northern blot and RT-PCR approaches (3). The power and success of high-throughput screening experiments have resulted in a new industry that manufactures both standard cDNA microarrays and machines developed to fabricate arrays specific to user needs (4). The next, potentially more important step in biomedical research is that of high-throughput protein analysis (5). Because proteins perform most vital functions, many scientists believe that the key to early-stage disease detection, interdiction, and prevention lies with protein identification and expression analysis. One method of identifying proteins is to create an antibody microarray that uses thousands of different antibodies synthesized to bind specifically to different proteins (6). Knezevic et al. used this approach to successfully identify 365 different proteins and correlated differential patterns of protein expression with disease progression (7). This
0	Materials  and  Methods
0	Matrix-assisted pulsed laser evaporation direct write, or MAPLE DW, is a laser-based processing technique that is capable of fabricating structures from a wide range of materials including metals, dielectrics, polymers, active proteins, and even living cells (9, 10). Figure 1 shows a schematic of the MAPLE DW technique as applied to protein solutions. To dispense active biological fluids, a variable concentration protein solution is mixed using 40 vol % glycerol/60 vol % phosphate buffer solution (PBS) as a solvent. Altering the concentration of proteins in these solutions over several orders of magnitude is used as a method to control the density of active molecules on the microarray substrate. A 0.5- to 1.0-µL aliquot of protein solution is then uniformly coated at room temperature onto a UV transparent quartz disk over an area of 1 cm² by using a micropipet to spread the fluid and a spin coater to homogenize the film (disk is spun for 10 s at 1000 rpm) (11). A 193-nm laser pulse from an ArF excimer laser is first focused at the quartz/fluid interface to 150 × 200 µm² and 50 mJ/cm². This pulse is directed through the backside of the quartz support so that the laser energy first interacts with the fluid at the quartz interface. Layers of fluid near the support interface then evaporate as a result of localized heating from electronic excitation, rapidly forming a bubble beneath the fluid layer. When this bubble bursts, an aliquot of protein solution is released, propelling a droplet away from the quartz support to a substrate positioned 25 µm to several millimeters away. The amount of protein solution in the aliquot is reproducibly determined by the focused laser spot size (variable from 10² to 3 × 10⁴ µm²) and the thickness of the solution coating on the support (variable from 1 to 100 µm thick). Nearly all of the laser energy is absorbed interfacially, so that a minimal amount of the fluid coating is vaporized and the bulk protein solution is transferred in the liquid phase without significant heating (9). Movement of the computer-controlled stages is synchronized to the firing of the laser, enabling this tool to rapidly fabricate complex 2-D and 3-D structures, including microarrays.
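The dependence of the dispensed amount on spot size and film thickness can be illustrated with a simple geometric estimate. The sketch below is not the authors' calculation: it assumes that none of the aliquot is lost during spin coating and that the entire column of film under the laser spot is transferred, so the droplet volumes are upper bounds. The function names are hypothetical, and the spot sizes are read as 10^2 to 3 x 10^4 µm^2.

```python
# Rough geometric estimates for laser-transfer dispensing.
# All helper names are hypothetical; the physics is reduced to volume = area x thickness.

UM3_PER_PL = 1_000.0   # 1 pL = 1000 µm^3
UM3_PER_UL = 1e9       # 1 µL = 1 mm^3 = 1e9 µm^3
UM2_PER_CM2 = 1e8      # 1 cm^2 = 1e8 µm^2


def film_thickness_um(volume_ul, area_cm2):
    """Mean film thickness, assuming the whole aliquot stays on the disk."""
    return (volume_ul * UM3_PER_UL) / (area_cm2 * UM2_PER_CM2)


def max_droplet_volume_pl(spot_area_um2, thickness_um):
    """Upper bound: the full film column under the laser spot is transferred."""
    return spot_area_um2 * thickness_um / UM3_PER_PL


if __name__ == "__main__":
    for aliquot_ul in (0.5, 1.0):
        t = film_thickness_um(aliquot_ul, 1.0)
        print(f"{aliquot_ul} µL over 1 cm^2 -> about {t:.1f} µm film")
    for area, thick in ((1e2, 1.0), (150 * 200, 10.0), (3e4, 100.0)):
        v = max_droplet_volume_pl(area, thick)
        print(f"{area:.0f} µm^2 x {thick:.0f} µm -> at most {v:.1f} pL")
```

Under these assumptions, the quoted ranges of spot size and coating thickness span sub-picoliter to nanoliter volumes, which is consistent with the picoliter scale named in the title.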
0	Results  and  Discussion
0	Suppression  subtractive  hybridization:  A  method  for  generating  differentially  regulated  or  tissue-specific  cDNA  probes  and  libraries
1	LUDA  DIATCHENKO*,  YUN-FAI  CHRIS  LAU,  AARON  P.  CAMPBELL,  ALEX  CHENCHIK*,  FAUZIA  MOQADAM*,  BETTY  HUANG*,  SERGEY  LUKYANOV,  KONSTANTIN  LUKYANOV,  NADYA  GURSKAYA,  EUGENE  D.  SVERDLOV,  AND  PAUL  D.  SIEBERT*
0	solve the problem of the wide differences in abundance of individual mRNA species. Consequently, multiple rounds of subtraction are still needed (7). The mRNA differential display (8) and RNA fingerprinting by arbitrary primed PCR (9) are potentially faster methods for identifying differentially expressed genes. However, both of these methods have a high level of false positives (10, 11), are biased toward high copy number mRNA (12) and might be inappropriate in experiments in which only a few genes are expected to vary (11). Here we present a new PCR-based cDNA subtraction method, termed suppression subtractive hybridization (SSH), and demonstrate its effectiveness. SSH is used to selectively amplify target cDNA fragments (differentially expressed) and simultaneously suppress nontarget DNA amplification. The method is based on the suppression PCR effect previously described by our laboratories: long inverted terminal repeats when attached to DNA fragments can selectively suppress amplification of undesirable sequences in PCR procedures (14, 15). We have recently applied the suppression PCR effect in chromosome walking (14) and rapid amplification of cDNA ends (15). The subtraction method described here overcomes the problem of differences in mRNA abundance by incorporating a hybridization step that normalizes (equalizes) sequence abundance during the course of subtraction by standard hybridization kinetics. It eliminates any intermediate step(s) for physical separation of ss and ds cDNAs, requires only one subtractive hybridization round, and can achieve greater than 1,000-fold enrichment for differentially expressed cDNAs. We demonstrate the effectiveness of the SSH method by generating a testis-specific cDNA library and characterizing selected cDNA clones. Furthermore, we show that the subtracted cDNA mixture can be used directly as a hybridization probe for screening recombinant DNA libraries, such as a human Y chromosome cosmid library, thereby identifying chromosome-specific and tissue-specific expressed sequences.
0	MATERIALS  AND  METHODS
0	Oligonucleotides. The following gel-purified oligonucleotides were used. (i) cDNA synthesis primer: Pr16, 5'-TTTTGTACAAGCTT30-3'. (ii) Adapters: adapter 1, 5'-GTAATACGACTCACTATAGGGCTCGAGCGGCCGCCCGGGCAGGT-3' (long strand) and 3'-CCCGTCCA-5' (short strand)
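As a quick consistency check on the adapter 1 sequences quoted above, the following sketch verifies that the short strand, written 3' to 5', is base-complementary to the 3' end of the long strand. The helper names are hypothetical and the check covers only Watson-Crick pairing of the printed sequences; it says nothing about how the adapter performs in ligation or PCR.

```python
# Check that the short adapter strand pairs with the 3' end of the long strand.
# Helper names are hypothetical; sequences are copied from the Oligonucleotides list.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

ADAPTER1_LONG_5TO3 = "GTAATACGACTCACTATAGGGCTCGAGCGGCCGCCCGGGCAGGT"
ADAPTER1_SHORT_3TO5 = "CCCGTCCA"  # printed in the 3'->5' direction


def complement(seq):
    """Base-by-base complement, keeping the reading direction of the input."""
    return "".join(COMPLEMENT[base] for base in seq)


def pairs_with_3prime_end(long_5to3, short_3to5):
    """True if the short strand (3'->5') anneals to the last bases of the long strand."""
    tail = long_5to3[-len(short_3to5):]
    return complement(tail) == short_3to5


if __name__ == "__main__":
    print(pairs_with_3prime_end(ADAPTER1_LONG_5TO3, ADAPTER1_SHORT_3TO5))
```

The duplex region is therefore the last eight bases of the long strand, leaving the remaining 36 bases single-stranded.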
0	Abbreviation: SSH, suppression subtractive hybridization. Data deposition: The sequences reported in this paper have been deposited in the GenBank data base (accession nos. H48477, H48478, H48931-H48939, H52858-H54046, H54559-H54560, H56769-H56778, and H64202-H64207).
0	Biochemistry:  Diatchenko  et  al.
0	Impact  of  surface  chemistry  and  blocking  strategies  on  DNA  microarrays
1	Scott  Taylor1,  Stephanie  Smith1,  Brad  Windle2  and  Anthony  Guiseppi-Elie1,3,*
0	ABSTRACT The surfaces and immobilization chemistries of DNA microarrays are the foundation for high quality gene expression data. Four surface modification chemistries, poly-L-lysine (PLL), 3-glycidoxypropyltrimethoxysilane (GPS), DAB-AM-poly(propyleneimine hexadecaamine) dendrimer (DAB) and 3-aminopropyltrimethoxysilane (APS), were evaluated using cDNA and oligonucleotide sub-arrays. Two un-silanized glass surfaces, RCA-cleaned and immersed in Tris-EDTA buffer, were also studied. DNA on amine-modified surfaces was fixed by UV (90 mJ/cm²), while DNA on GPS-modified surfaces was immobilized by covalent coupling. Arrays were blocked with either succinic anhydride (SA) or bovine serum albumin (BSA), or left unblocked, prior to hybridization with labeled PCR product. Quality factors evaluated were surface affinity for cDNA versus oligonucleotides, spot and background intensity, spotting concentration and blocking chemistry. Contact angle measurements and atomic force microscopy were performed to characterize surface wettability and morphology. The GPS surface exhibited the lowest background intensity regardless of blocking method. Blocking the arrays did not affect raw spot intensity, but affected background intensity on amine surfaces, BSA blocking being the lowest. Oligonucleotides and cDNA on unblocked GPS-modified slides gave the best signal (spot-to-background intensity ratio). Under the conditions evaluated, the unblocked GPS surface along with amine covalent coupling was the most appropriate for both cDNA and oligonucleotide microarrays. INTRODUCTION The DNA microarray enables researchers to survey the entire transcriptome of virtually any cell population.  
This  capability  produces  unprecedented  quantities  of  raw  data  and  enables  the  investigation  of  gene  expression,  functional  genomics  and
0	range of available surface chemistries. The GPS presents the reactive glycidoxy functional group to which amine-terminated oligonucleotides and cDNA, derived from amine-terminated primers, could be covalently affixed. The APS, PLL and DAB surfaces present varying densities of amine functionalities for hydrogen-bonding interactions with DNA. The RCA-cleaned glass slides served as a reference surface while the TEB immersion deliberately introduced surface contamination to otherwise cleaned glass slide surfaces. The non-blocked surface served as the control for blocking. These surfaces and blocking strategies were evaluated by fabricating microarrays of cDNA and 30-mer oligonucleotides prepared using the human GAPDH gene sequence. The oligonucleotides and cDNA were spotted at five different concentrations and hybridized to Alexa Fluor 555-labeled GAPDH PCR product. Wettability of the surfaces was determined by contact angle measurements with hexadecane and ultrapure water. Surface morphology was characterized by atomic force microscopy (AFM). MATERIALS AND METHODS Cleaning, preparation and surface modification of microarray slides In a class 1000 clean room, 50 VWR brand glass microscope slides (VWR 48300-025) were solvent cleaned by immersion for 1 min in boiling acetone followed by 1 min in boiling isopropanol. The slides were then washed in ult
0	Normalization  strategies  for  cDNA  microarrays
1	Johannes  Schuchhardt*,  Dieter  Beule,  Arif  Malik1,  Eryc  Wolski1,  Holger  Eickhoff1,  Hans  Lehrach1  and  Hanspeter  Herzel
0	Institute  for  Theoretical  Biology,  Humboldt-Universitaet  zu  Berlin,  Invalidenstrasse  43,  D-10115  Berlin,  Germany  and  1Max  Planck  Institute  of  Molecular  Genetics,  Ihnestrasse  73,  D-14195  Berlin,  Germany
0	ABSTRACT  Multiple  Arabidopsis  thaliana  clones  from  an  experimental  series  of  cDNA  microarrays  are  evaluated  in  order  to  identify  essential  sources  of  noise  in  the  spotting  and  hybridization  process.  Theoretical  and  experimental  strategies  for  an  improved  quantitative  evaluation  of  cDNA  microarrays  are  proposed  and  tested  on  a  series  of  differently  diluted  control  clones.  Several  sources  of  noise  are  identified  from  the  data.  Systematic  and  stochastic  fluctuations  in  the  spotting  process  are  reduced  by  control  spots  and  statistical  techniques.  The  reliability  of  slide  to  slide  comparison  is  critically  assessed  within  the  statistical  framework  of  pattern  matching  and  classification.  INTRODUCTION  Large  areas  of  medical  research  and  biotechnological  development  will  be  transformed  by  the  evolution  of  high  throughput  techniques  (1-3).  Miniaturization  and  automatization  enables  the  concurrent  performance  of  many  thousands  or  even  millions  of  small-scale  experiments  on  oligonucleotide  chips  (4,5)  or  spotted  microarrays  (6-8).  Manufacturing  processes  and  labeling  techniques  will  lead  to  different  performances  (9,10)  and  detection  ranges  (11),  but  questions  of  statistical  significance  (12,13)  and  quality  control  (T.Beissbarth,  K.Fellenberg,  B.Brors,  A.Arribas-Prat,  M.J.Boer,  V.N.Hauser,  M.Scheideler,  D.J.Hoheisel,  G.Schuetz,  A.Poustka  and  M.Vingron,  submitted  for  publication;  14)  are  quite  similar  for  the  different  technologies.  Down-scaling  of  an  experiment  makes  it  generally  sensitive  to  external  and  internal  fluctuations  (7).  
Since  reliability  of  interaction  patterns  extracted  from  array  data  is  essential  for  their  interpretation  (15,16),  a  reduction  in  these  fluctuations  by  proper  averaging  and  normalization  procedures  is  of  great  practical  interest  (17).  We  will  address  this  issue  in  the  context  of  cDNA  microarrays,  spotted  on  glass  slides  and  hybridized  with  a  radioactively  labeled  probe.  According  to  the  experimental  steps  listed  in  Materials  and  Methods  we  will  now  give  a  list  of  the  major  sources  of  fluctuations  to  be  expected  in  this  type  of  microarray  experiment.  The  list  addresses  fluctuations  in  probe,  target  and  array
0	MATERIALS  AND  METHODS  Array  preparation  A  complex  probe  from  several  mouse  tissues  was  purified  and  reverse  transcribed  with  radioactively  labeled  cDNA.  Arabidopsis  thaliana  cDNA  (GenBank  accession  nos  AF104328  and  U29785)  was  spiked  in  a  fixed  amount  for  normalization  purposes  (18).  Clones  were  amplified  by  PCR  reaction,  5-amino-modified  for  attachment  to  glass  slides,  and  purified  (19).  Prior  to  spotting,  glass  slides  were  cleaned  and  derivatized  for  covalent  attachment  of  cDNA.  A  384  pin  gridding  head  (X5251;  Genetix,  Christchurch,  UK)  was  used  for  spotting  a  grid  of  384  blocks,  each  containing  36  spots.  All  clones  were  spotted  twice  within  a  block  (double  spotting).  Details  of  the  spotting  pattern  of  library  and  control  clones  are  explained  in  Figure  1.  Altogether  nine  slides  with  an  identical  spotting  pattern  were  produced.  The  radioactively  labeled  probe  was  hybridized  on  the  cDNA  array  for  10  h  at  42°C.  For  details  on  spotting  technique  and  hybridization  procedures  see  Eickhoff  et  al.  (20).  Scanning  and  image  processing  Arrays  were  exposed  for  16  h  to  a  Fuji  BAS-SR  2025  intensifying  screen  (Raytest,  Germany)  and  scanned  at  25  µm  resolution  with  a  Fuji  BAS  5000  phosphorimager  (Raytest).  The  image  was  converted  into  a  table  of  signal  intensities  using  proprietary  software.  Data  processing  Intensity  data  were  ordered  in  a  table,  each  column  corresponding  to  a  slide  and  each  row  to  a  spot  on  the  slide.  
The following normalization procedures were tested for their efficiency:
0	· no normalization, averaging over k slides;
0	· normalization by average intensity of control spots (slide-wise normalization) and averaging over k slides;
0	· division by the intensity of the two constant spots and averaging over k slides (pin-wise normalization);
0	· slide-wise normalization of the diluted and constant signals, averaging of the dilution and control signals over several slides, then quotient formation (average pin-wise normalization).
0	RESULTS
0	Non-specific background and overshining
0	The level of background noise and the influence of neighboring signal intensities is illustrated in Figure 2. The intensity of background spots is plotted versus the average signal intensity of the four next neighbor spots. The y-axis intercept of the linear regression gives an estimation of the non-specific background. The small background intensity indicates that there are only weak overshining effects for the 6 × 6 spotting pattern. The regression can be used for correction of the systematic part of these errors. The radius used to quantify spots was varied systematically: for the given spotting density only weak changes are observed if the scanning radius is kept in a reasonable range of about half the spotting distance (data not shown). The magnitude of the background and overshining effects is substantially smaller than fluctuations induced by spotting variabilities quantified below.
0	Assessment of spotting variabilities
0	In order to facilitate interpretation of the experimental data we neglect all non-linearities from image processing and assume that hybridization reactions reach mass action equilibrium.  
Due  to  the  fact  that  different  spots  of  a  dilution  series  compete  for  the  same  probe  the  amount  of  probe  bound  in  each  spot  is  proportional  to  the  amount  of  target  cDNA  present  in  the  spot.  The  observed  signal  intensity  then  reflects  the  amount  of  spotted  cDNA.  Fluctuations  in  spot  size  and  in  the  hybridization
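
The normalization procedures listed under Data processing can be sketched on a small intensity table (rows are spots, columns are slides). This is an illustrative reading of the four procedures, not the authors' software: the function names, the toy data and the choice of which rows count as control or constant spots are all assumptions.

```python
# Toy versions of slide-wise, pin-wise and average pin-wise normalization.
# table[r][s] is the intensity of spot r on slide s; all names are hypothetical.
from statistics import mean


def slide_wise(table, control_rows):
    """Divide every spot on a slide by that slide's mean control-spot intensity."""
    n_slides = len(table[0])
    factors = [mean(table[r][s] for r in control_rows) for s in range(n_slides)]
    return [[row[s] / factors[s] for s in range(n_slides)] for row in table]


def pin_wise(table, spot_row, constant_rows):
    """Per slide, divide one spot by the mean of the constant spots of the same pin."""
    n_slides = len(table[0])
    return [
        table[spot_row][s] / mean(table[r][s] for r in constant_rows)
        for s in range(n_slides)
    ]


def average_pin_wise(table, spot_row, constant_rows, control_rows):
    """Slide-wise normalize, average over slides, then form the quotient."""
    norm = slide_wise(table, control_rows)
    diluted = mean(norm[spot_row])
    constant = mean(mean(norm[r]) for r in constant_rows)
    return diluted / constant


if __name__ == "__main__":
    # Rows 0 and 1: constant spots; row 2: a 1:4 diluted spot.
    # The three slides differ only by a global intensity factor (1, 2, 0.5).
    table = [[100.0, 200.0, 50.0], [100.0, 200.0, 50.0], [25.0, 50.0, 12.5]]
    print(mean(table[2]))                       # no normalization, slide average
    print(mean(slide_wise(table, [0, 1])[2]))   # slide-wise, then slide average
    print(mean(pin_wise(table, 2, [0, 1])))     # pin-wise, then slide average
    print(average_pin_wise(table, 2, [0, 1], [0, 1]))
```

In this toy case the slides differ only by a global factor, so all three normalized variants recover the same dilution ratio; with real data they differ in how spot-level and slide-level fluctuations are averaged out.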
0	Comparison  between  Different  Strategies  of  Covalent  Attachment  of  DNA  to  Glass  Surfaces  to  Build  DNA  Microarrays
1	Nathalie Zammatteo,*,1 Laurent Jeanmart, Sandrine Hamels,* Stéphane Courtois,* Pierre Louette, Laszlo Hevesi, and José Remacle*
0	The DNA microarray is a powerful tool allowing simultaneous detection of many different target molecules present in a sample. The efficiency of the array depends mainly on the sequence of the capture probes and the way they are attached to the support. The coupling procedure must be quick, covalent, and reproducible in order to be compatible with automatic spotting devices dispensing tiny drops of liquid on the surface. We compared several coupling strategies currently used to covalently graft DNA onto a glass surface. The results indicate that fixation of aminated DNA to an aldehyde-modified surface is a choice method to build DNA microarrays. Both the coupling procedure and the hybridization efficiency have been optimized. The detection limit of human cytomegalovirus target DNA amplicons on such DNA microarrays has been estimated to be 0.01 nM by fluorescent detection. © 2000 Academic Press Key Words: glass; functionalization; DNA probe; microarray.
0	DNA chip technology uses microscopic arrays of DNA molecules immobilized on solid supports for biomedical analysis such as gene expression analysis, polymorphism or mutation detection, DNA sequencing, and gene discovery (1). Several approaches can be used to prepare microarrays. DNA can be synthesized in situ on a glass surface using combinatorial chemistry (2). This method typically produces microarrays consisting of groups of oligonucleotides ranging in size from 10 to 25 bases
0	Copyright  ©  2000  by  Academic  Press  All  rights  of  reproduction  in  any  form  reserved.
0	ZAMMATTEO  ET  AL.
0	ditions. Covalent binding methods are thus preferred. Usually, DNA is cross-linked by ultraviolet irradiation to form covalent bonds between thymidine residues in the DNA and positively charged amino groups added on the functionalized slides (8). However, the location and the number of fixation sites of the DNA are not well defined so that the length and the sequences available for subsequent hybridization can vary with the fixation conditions. An alternative method is to fix DNA molecules at their extremities. Thus, carboxylated (9) or phosphorylated DNA (10) can be coupled on aminated supports as well as the reciprocal situation (11). Amino-terminal oligonucleotides can also be bound to isothiocyanate-activated glass (12), to aldehyde-activated glass (13), or to glass surfaces modified with epoxide (14). Thiol-modified or disulfide-modified oligonucleotides have also been grafted onto aminosilane via a heterobifunctional crosslinker (15) or on 3-mercaptopropylsilane (16). However, in these cases, the binding at high temperature was unstable. Recently, a more elaborate chemistry has been proposed for the construction of tethered molecules on glass to which DNA can be attached (17). A situation in which the accessibility of a tethered single-stranded probe covalently attached to the surface could be combined with the specificity of a long probe would represent a breakthrough in the field of DNA chips. In this paper we compare several methods of covalent coupling of DNA on activated glass, namely, the carbodiimide-mediated coupling of aminated, carboxylated, and phosphorylated DNA on carboxylic acid or amine-modified glass supports and the binding of aminated DNA to aldehyde-activated glass.
0	MATERIALS  AND  METHODS
0	Chemicals and Buffer 2-(N-morpholino)ethanesulfonic acid (Mes) and 1-methylimidazole (MeIm) were from Acros Chimica (Beerse, Belgium). Ethanol, maleic acid, NaCl, and SDS were from Merck (Darmstadt, Germany). 3-Aminopropyltrimethoxysilane, triethylamine solution, undecenoyl chloride, trifluoroethanol, anhydrous ether, trichlorosilane, and hexachloroplatinic acid were from Aldrich Chemical (Milwaukee, WI). NaBH4, EDC, Tween 20 and streptavidin-Cy3 were from Sigma (St. Louis, MO). NHSS was from Pierce (Rockford, IL). Gloria milk powder was from Nestlé (Vevey, Switzerland). [α-32P]dCTP was from Dupont de Nemours (Boston, MA). Oligonucleotides were purchased from Eurogentec (Seraing, Belgium). Silylated (aldehyde) and silanated (amine) microscope slides were from Cell Associates (Houston, TX). Untreated glass slides were purchased from Knittel Gläser (Germany). The arrayer used was a Charlyrobot model with 250-µm pins from Genetix (UK). DPX was from BDH Chemicals (UK).
0	GLASS  FOR  DNA-BINDING  AND  HYBRIDIZATION  ASSAYS
0	The carboxylic acid terminal groups were obtained by hydrolysis of ester-functionalized slides by immersion in 8 M HCl solution at 95°C for 2 h. The samples were then ultrasonically cleaned in three consecutive steps (10 min each) in distilled water and dried under an argon flow. The aldehyde functions were obtained in two steps: the reduction of the ester groups into alcohol groups followed by oxidation by PCC (pyridinium chl
0	Analysis  of  repeatability  in  spotted  cDNA  microarrays
0	When referring to a single array, the measured log ratio of a repeatedly spotted clone is denoted $y_{ij}$, with clone $i$ and repeated spotting $j$ (where $j = 1, \ldots, k_i$). In the context of several arrays, we will use the notation $y_{ij}^{l}$ to denote the measurement in array $l$, with $l = 1, \ldots, d$. Correlation. For each clone, we calculated the average Pearson product-moment (linear) correlation between pairs of spots across data from the $d$ arrays. If clone $i$ has been spotted $k_i$ times, there will be $[k_i(k_i - 1)]/2$ distinct pairs in its spot set. For a given pair of spots (denoted $ij$ and $ij'$) we constructed the vectors $[y_{ij}^{1}, \ldots, y_{ij}^{d}]$ and $[y_{ij'}^{1}, \ldots, y_{ij'}^{d}]$, and computed the correlation coefficient with respect to clone $i$ as
0	$$ r_i = \frac{\sum_{l=1}^{d} \left(y_{ij}^{l} - \bar{y}_{ij}\right)\left(y_{ij'}^{l} - \bar{y}_{ij'}\right)}{\sqrt{\sum_{l=1}^{d} \left(y_{ij}^{l} - \bar{y}_{ij}\right)^{2} \, \sum_{l=1}^{d} \left(y_{ij'}^{l} - \bar{y}_{ij'}\right)^{2}}}, $$
0	where $\bar{y}_{ij}$ and $\bar{y}_{ij'}$ are the means of the two vectors over the $d$ arrays.
0	For  a  given  clone,  the  correlation  coefficient  was  calculated  for  all  distinct  pairs  in  the  spot  set,  and  the  average  correlation  coefficient  was  used  as  an  indicator  of  repeatability  for  the  clone.  To  assess 
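The repeatability indicator described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `pearson` and `clone_repeatability` are hypothetical, and the input is assumed to be one log-ratio vector of length d per spot of a single clone.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x) *
               sum((b - mean_y) ** 2 for b in y))
    if den == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return num / den


def clone_repeatability(spots):
    """Average correlation over all k(k-1)/2 distinct spot pairs of one clone.

    `spots` is a list of k vectors; each vector holds the log ratios of one
    spot across the d arrays.
    """
    pairs = list(combinations(spots, 2))
    if not pairs:
        raise ValueError("a clone needs at least two spots")
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Three spots of one (made-up) clone measured on d = 3 arrays.
example_spots = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
print(clone_repeatability(example_spots))
```

In the example, the first two spots correlate perfectly and the third is anti-correlated with both, so the average over the three pairs is negative.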
0	Applications  of  DNA  tiling  arrays  for  whole-genome  analysis
1	Todd C. Mocklerᵃ, Joseph R. Eckerᵃ,ᵇ,*
0	The  completion  of  numerous  genome  sequences  has  introduced  an  era  of  whole-genome  study.  Gaining  a  more  complete  understanding  of  the  genome's  information  content  will  dramatically  improve  our  understanding  of  various  biological  processes.  In  parallel  with  the  sequencing  of
0	entire  genomes,  recent  advances  in  microarray  technologies  have  made  it  feasible  to  interrogate  an  entire  genome  sequence  with  arrays.  Such  high-density  whole-genome  DNA  microarrays  can  be  used  as  a  generic  platform  for  numerous  experimental  approaches  to  decode  the  information  contained  within  the  genome.  In  this  review,  we  discuss  several  approaches  using  high-density  whole-genome  oligonucleotide  microarrays  for  transcriptome  characterization,  novel  gene  discovery,  analysis  of  alternative  splicing,  mapping  of  regulatory  DNA  motifs  using  the  chromatin-
0	researchers  to  analyze  various  features  of  the  genome,  including  evidence  of  transcriptional  activity,  binding  of  transcriptional  regulators,  and  DNA  methylation,  at  high  resolution  without  reference  to  prior  annotations.  Other  array  designs  rely  on  prior  genome  annotation  to  interrogate  a  particular  subset  of  features  of  an  entire  genome  (Fig.  2C).  These  arrays  are  clearly  limited  by  the  quality  and  completeness  of  the  annotations  on  which  they  are  based.
0	exon-scanning  arrays  were  designed  using  only  known  and  computationally  predicted  exons,  they  were  of  limited  use  for  discovering  novel  genes  or  gene  features,  such  as  terminal  exons  that  are  often  missed  by  the  gene  prediction  algorithms.  For  some  genomic  regions,  tiling  arrays  with  partially  overlapping  (10-base  increments)  60-mer  probes  were  used  to  demonstrate  the  utility  of  high-resolution  tiling
0	arrays  for  refining  and  confirming  gene  structures  predicted  by  the 
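The overlapping tiling scheme mentioned above (60-mer probes offset by 10 bases) can be illustrated with a short Python sketch. The function name `tile_probes` and the toy input sequence are assumptions made for illustration; real tiling-array designs also filter probes for repeats and hybridization properties, which is not modeled here.

```python
def tile_probes(sequence, probe_len=60, step=10):
    """Return (start, probe) pairs covering `sequence` with overlapping probes.

    Consecutive probes are offset by `step` bases, so neighbouring probes
    share probe_len - step bases.
    """
    last_start = len(sequence) - probe_len
    return [(start, sequence[start:start + probe_len])
            for start in range(0, last_start + 1, step)]


# A made-up 100-base region tiled with 60-mers at 10-base increments.
region = "ACGT" * 25
probes = tile_probes(region)
for start, probe in probes:
    print(start, probe)
```

With these defaults, adjacent probes overlap by 50 bases, so every internal base is interrogated by several probes.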
0	Defining  the  sequence-recognition  profile  of  DNA-binding  molecules
1	Christopher L. Warren, Natasha C. S. Kratochvil, Karl E. Hauschild, Shane Foister§, Mary L. Brezinski, Peter B. Dervan§, George N. Phillips, and Aseem Z. Ansari¶
0	Contributed  by  Peter  B.  Dervan,  November  11,  2005
0	Determining  the  sequence-recognition  properties  of  DNA-binding  proteins  and  small  molecules  remains  a  major  challenge.  To  address  this  need,  we  have  developed  a  high-throughput  approach  that  provides  a  comprehensive  profile  of  the  binding  properties  of  DNA-binding  molecules.  The  approach  is  based  on  displaying  every  permutation  of  a  duplex  DNA  sequence  (up  to  10  positional  variants)  on  a  microfabricated  array.  The  entire  sequence  space  is  interrogated  simultaneously,  and  the  affinity  of  a  DNA-binding  molecule  for  every  sequence  is  obtained  in  a  rapid,  unbiased,  and  unsupervised  manner.  Using  this  platform,  we  have  determined  the  full  molecular  recognition  profile  of  an  engineered  small  molecule  and  a  eukaryotic  transcription  factor.  The  approach  also  yielded  unique  insights  into  the  altered  sequence-recognition  landscapes  as  a  result  of  cooperative  assembly  of  DNA-binding  molecules  in  a  ternary  complex.  Solution  studies  strongly  corroborated  the  sequence  preferences  identified  by  the  array  analysis.
0	chemical genomics | ligand-DNA recognition
0	A central goal of synthetic biology, chemical biology, and molecular medicine is the design and creation of synthetic molecules that can target specific DNA sites in the genome (1, 2). Such molecules can be harnessed to regulate biological processes such as transcription, recombination, and DNA repair (1-4). The greatest success in designing molecules with programmable DNA-binding specificity has been with polyamides (2). However, a major hurdle in the design of new classes of sequence-specific DNA-binding molecules is the inability to comprehensively define the full range of their DNA sequence-recognition properties, and therefore, the inability to predict all their potential target sites in the genome. Given the importance of understanding the basis of molecular recognition between DNA and its ligands, several methods have been developed to determine the sequence specificity of DNA-binding molecules (small molecules as well as proteins). The most frequently used approach is the systematic evolution of ligands by exponential enrichment (SELEX), which utilizes selection and enrichment of the DNA sequences that bind with the highest affinity to a molecule of interest (4). This assay, although highly informative, identifies only the best binding sequences, whereas the less optimal, and often biologically relevant, sequences are missed. Other commonly used biochemical or biophysical approaches are labor-intensive and can be used only to study a limited set of sequence variants (5-10). Medium-throughput microarrays have also been developed in which duplex DNA molecules are immobilized on surfaces and protein binding is detected by surface plasmon resonance (11) or fluorescence (12, 13).
Despite  such  demonstrations  of  feasibility,  technical  challenges  have  hindered  the  general  application  of  these  array  platforms.  A  solution-phase  medium-throughput  assay  utilizes  DNA  sequence  variants  presented  in  distinct  wells  and  protein  or  small  molecule  binding  detected  by  displacement  of  a  DNA-intercalating  fluorescent  dye  (14).  Each  of  these  medium-throughput  approaches,  however,  is  limited  to  querying  DNA  sequences  with  only  three,  four,  or  five  permuted  positions.
0	In  a  recent  approach,  a  biased  microarray  bearing  only  the  intergenic  regions  of  yeast  chromosome  was  used  to  map  transcription  factor  binding  sites  in  vitro  (15).  These  arrays  provide  a  biased  binding  profile  and  are  limited  to  organisms  with  small  and  well  annotated  genomes.  Another  technique  that  circumvents  this  problem  relies  on  sonicating  genomic  DNA  into  small  fragments  and  adding  a  transcription  factor  to  isolate  putative  binding  sites  (16).  However,  this  method,  like  SELEX,  is  likely  to  overrepresent  strong  binding  sites,  thereby  providing  biased  sequence-recognition  profiles.  These  methods  are  not  amenable  to  an  unbiased  analysis  of  the  binding  properties  of  small  molecule  DNA  ligands.  Chromatin  immunoprecipitated  (ChIP)  DNA  analyzed  on  oligonucleotide  microarrays  (chip)  has  also  been  used  to  map  binding  sites  for  DNA-binding  transcription  factors  (17-19).  Importantly,  ChIP-chip  studies  have  suggested  that  in  vitro  affinity  of  cooperatively  binding  transcription  factors  for  specific  DNA  sequences  is  often  recapitulated  in  the  relative  occupancy  of  these  sequences  in  vivo  (20,  21).  This  observation  suggests  that  for  a  given  transcription  factor  (or  a  set  of  cooperatively  binding  factors),  the  knowledge  of  its  full  sequence-recognition  profile,  measured  in  vitro,  can  be  highly  instructive  in  computationally  identifying  binding  sites  in  the  genome.  Thus  far,  in  the  absence  of  genome-wide  binding  and  expression  data,  computational  approaches  to  identifying  regulatory  sites  have  been  limited  to  phylogenetic  comparisons  of  conserved  noncoding  sequences  (22).  
However,  unlike  proteins,  for  most  DNA-binding  small  molecules  with  unknown  DNA-binding  properties,  ChIP-chip  analysis  is  nontrivial,  and  phylogenetic  comparisons  are  irrelevant.  To  bridge  this  gap  between  computational  methods  and  molecular  recognition  properties  of  DNA  ligands,  we  have  developed  a  comprehensive  high-throughput  platform  that  can  rapidly  and  reliably  identify  the  cognate  sites  of  DNA-binding  molecules.  This  platform  provides  an  unbiased  analysis  because  it  consists  of  a  double-stranded  DNA  array  that  displays  the  entire  sequence  space  represented  by  8  bp  (all  possible  permutations  equal  32,896  molecules)  and  can  currently  be  extended  to  as  many  as  10  variable  base  pair  positions.  We  have  also  developed  a  systematic  approach  for  treating  the  array  data  that  can  be  applied  to  arrays  of  greater  complexity.  Because  most  metazoan  DNA-binding  proteins  target  6-10  bp  (23),  and  because  DNA-binding  small  molecules  rarely  exceed  8  bp  (24),  our  cognate  site  identifier  (CSI)  arrays  should  be  capable  of  identifying  and  ranking  sequences  preferred  by  almost  any  DNA-binding  ligand  by  itself,  or,  in  many  cases,  in  cooperatively  binding  pairs.  Our  approach  derives  comprehensive  binding  profiles  from  a  rapid,  unbiased,  and  unsupervised  examination  of  the  entire
0	Conflict of interest statement: No conflicts declared. Abbreviations: ChIP-chip, analysis of chromatin-immunoprecipitated DNA on oligonucleotide microarrays; CSI, cognate site identifier; PA1, polyamide 1; PA2, polyamide 2; PA3, polyamide 3; Exd, extradenticle; Hox, homeobox transcription factors; Dp, dimethylaminopropylamide; Py, N-methylpyrrole; Py*, Cy3-Py; Im, N-methylimidazole.
0	by  The  National  Academy  of  Sciences  of  the  USA
0	January  24,  2006
0	APPLIED  BIOLOGICAL  SCIENCES
0	DNA sequence space. These analyses can be extended to DNA-binding proteins from any organism or, in the case of small molecules, used to predict binding sites in any genome.
0	Results
0	averaged intensities were then converted into Z scores [Z = (signal − mean)/standard deviation] to reflect the signal-to-noise ratio (Fig. 2B). Sequences in the highest Z score bin ( 25) were subjected to several motif-searching algorithms (31-33), which identified 5′-
0	Array Design. The duplex DNA sequences are designed as self-complementary palindromes interrupted at the center by a TCCT sequence to facilitate the formation of DNA hairpins (Fig. 1). The 34-residue oligonucleotide is synthesized directly on the glass surface by using a maskless array synthesizer (25) that can readily create up to 786,000 spatially resolved features. After inducing hairpin formation, we found that 95% of the oligonucleotides in the array form duplexes (see Materials and Methods). In our hairpin design, we added three constant base pairs on either side of the 8 bp that were permuted (N1-N8 in Fig. 1). Previous work shows that this addition is sufficient to buffer the core of the hairpin stem against thermal end-fraying of the duplex and against deviations from B-form DNA resulting from the presence of the loop (26). There is good evidence that the core of a hairpin stem interacts with proteins and small molecule ligands indistinguishably from DNA duplexes composed of two individual complementary strands (27, 28).
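The figure of 32,896 molecules quoted for the 8-bp sequence space can be checked with a short enumeration. The sketch below assumes, as the text does not state explicitly, that the count arises from treating a sequence and its reverse complement as the same duplex; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def unique_duplexes(n):
    """Count n-bp duplexes, treating a sequence and its reverse complement as one."""
    seen = set()
    for bases in product("ACGT", repeat=n):
        seq = "".join(bases)
        # Keep one canonical representative per duplex.
        seen.add(min(seq, revcomp(seq)))
    return len(seen)


print(unique_duplexes(8))  # 32896
```

The same number follows in closed form: of the 4^8 = 65,536 sequences, 4^4 = 256 are their own reverse complement, giving (65,536 + 256)/2 = 32,896 distinct duplexes.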
0	Array Validation Using an Engineered Small Molecule. To test the accuracy and fidelity of the CSI array, we used a polyamide engineered to target a specific DNA sequence (PA1, Fig. 2A). Polyamides are DNA-binding small molecules composed of N-methylpyrrole (Py) and N-methylimidazole (Im) heterocycle rings. The arrangement of the heterocycles (Im or Py) can be programmed to create polyamides that target most naturally occurring 6- to 8-bp DNA sequences (2). PA1, in particular, was designed to target the sequence 5′-WWGWWCWW-3′ (W = A or T) (Fig. 2) (29). A Cy3 fluorescent dye is conjugated to the N-methyl posi
0	Quantifying  DNA-protein  interactions  by  double-stranded  DNA  arrays
1	Martha L. Bulyk¹, Erik Gentalen², David J. Lockhart², and George M. Church¹*
0	We have created double-stranded oligonucleotide arrays to perform highly parallel investigations of DNA-protein interactions. Arrays of single-stranded DNA oligonucleotides, synthesized by a combination of photolithography and solid-state chemistry, have been used for a variety of applications, including large-scale mRNA expression monitoring, genotyping, and sequence-variation analysis. We converted a single-stranded to a double-stranded array by synthesizing a constant sequence at every position on an array and then annealing and enzymatically extending a complementary primer. The efficiency of second-strand synthesis was demonstrated by incorporation of fluorescently labeled dNTPs (2′-deoxyribonucleoside 5′-triphosphates) and by terminal transferase addition of a fluorescently labeled ddNTP. The accuracy of second-strand synthesis was demonstrated by digestion of the double-stranded DNA (dsDNA) on the array with sequence-specific restriction enzymes. We showed dam methylation of dsDNA arrays by digestion with DpnI, which cleaves when its recognition site is methylated. This digestion demonstrated that the dsDNA arrays can be further biochemically modified and that the DNA is accessible for interaction with DNA-binding proteins. This dsDNA array approach could be extended to explore the spectrum of sequence-specific protein binding sites in genomes.
0	Keywords:  dsDNA  arrays,  restriction  enzymes,  DNA-protein  interactions
0	Sequence-specific DNA binding by proteins controls transcription¹, recombination², restriction³, and replication⁴. Sequence requirements are usually determined by assays that measure the effects of mutations on binding of DNA and amino acid residues implicated in these interactions. These assays, which include nitrocellulose binding assays⁵, gel shift analysis⁶, Southwestern blotting⁷,⁸, or reporter constructs in yeast⁹, are usually considered too laborious for the analysis of many DNA variants. Therefore, we have developed a highly parallel method for studying the sequence specificity of DNA-protein interactions. We have taken advantage of oligonucleotide arrays, or DNA arrays, that have previously been used for mRNA expression analysis¹⁰⁻¹², polymorphism analysis¹³⁻¹⁶, deletion strain analysis¹⁷, and for identifying clones from genetic selections¹⁸. However, the arrays used for these purposes contain single-stranded DNA (ssDNA) oligonucleotides, and most sequence-specific regulatory DNA-binding proteins bind double-stranded DNA (dsDNA). Therefore, we present a method for enzymatically converting ssDNA arrays into arrays of duplex DNA. Sequence-specific digestion at the cognate restriction sites has been demonstrated using restriction-enzyme digestion of dsDNA arrays. In addition, we show that the dsDNA can be altered biochemically. Arrays of biochemically modified DNA may be useful for applications that seek to determine the effects of modifications, such as methylation, on sequence-specific binding.
The results presented here suggest that these dsDNA arrays will be well suited for the analysis of DNA-protein interactions, particularly for the discovery of the sequences recognized by transcription factors and the quantitative assessment of those important interactions.
Results and discussion
Second-strand synthesis. ssDNA arrays were made on an Affymetrix (Santa Clara, CA) DNA array synthesizer. A constant sequence was synthesized before any variable sequences were introduced, and these strands were used as templates for enzymatic second-strand
0	synthesis. A primer complementary to the constant sequence was used in primer extension reactions, producing all the second strands on the array in a single enzymatic reaction. For our experiments, there are a number of advantages to creating dsDNA via primer extension instead of by chemically synthesizing single-stranded, self-complementary oligonucleotides¹⁹. First, 5′-(4,4′-dimethoxytrityl) (DMT) synthesis occurs with higher efficiency than that achieved with light-directed, 5′-(α-methyl-2-nitropiperonyl)oxycarbonyl (MeNPOC)²⁰,²¹ synthesis. Therefore, longer strands of dsDNA can be made because only half as many nucleotides need to be produced by light-directed synthesis when the complementary strand is created via primer extension. Second, the exact complement of each template strand, including any degenerate nucleotides synthesized into the first strand, will be made because the Klenow fragment of DNA polymerase I is a highly processive polymerase with an error rate of approximately 10⁻⁵. Third, this mode of second-strand synthesis ensures a low mismatch rate as creation of dsDNA does not rely upon annealing a complex mix of exogenous complementary sequences. In order to verify initially that the primer was annealing to all sequences, a fluorescein-labeled primer was hybridized to the array, and signal intensity was seen over the entire chip (data not shown). Subsequently, unlabeled primers were used in all primer-extension reactions. To confirm enzymatic extension of the primer, we included fluorescein-labeled dATP in a reaction along with unlabeled 2′-deoxyribonucleoside 5′-triphosphates (dNTPs) (Figs. 1 and 2A).
As  expected,  there  tended  to  be  higher  signal  intensity  in  features  with  a  greater  proportion  of  adenine  in  the  second  strand  (Fig.  3B).  Of  the  features  with  identical  subsites,  those  with  longer  spacers  had  higher  signal  intensities,  as  expected,  because  longer  spacers  allowed  a  greater  number  of  fluorescein-labeled  dATPs  to  be  incorporated.  The  duplex  DNA  also  can  be  end-labeled  after  synthesis  (Fig.  2B)  instead  of  being  labeled  by  incorporation  of  fluorescein-tagged  dNTPs.  In  this  scheme,  only  unlabeled  dNTPs  were  used  in  the
0	[Figure: schematic of second-strand synthesis and labeling on the array. Panel titles: "Second strand labeling by incorporation of fluorescein-labeled dNTPs" (Klenow exo− polymerase, unlabeled and fluorescein-labeled dNTPs; primer extension) and "Second strand labeling by terminal transferase addition of fluorescein-labeled ddNTPs" (terminal transferase, fluorescein-labeled ddNTP). Labeled elements: glass surface, HEG synthesis linker, constant priming sequence, annealed primer, proximal flanking sequence, half-site, spacer, half-site, distal flanking sequence.]
0	primer-extension reactions. The 3′-ends of the newly synthesized strands were then end-labeled by addition of fluorescein-labeled ddNTP with terminal transferase (Fig. 4A). Only the 3′-ends of the second strands were available for addition in these terminal transferase reactions, because the 3′-ends of the first strands were covalently attached to hexaethylene glycol (HEG) linkers. The observed variation in signal intensity from row to row was due to either different synthesis efficiencies or different efficiencies of terminal transferase addition for different sequences. Restriction enzyme digestion. To determine that the duplex DNA was both physically accessible and of proper structure for interaction with a protein, we digested dsDNA arrays with a restriction enzyme. This also confirmed that the second strands were synthesized correctly. A restriction enzyme with a 4 bp recognition site was chosen because the two subsites on the arrays were each either 3 or 4 bp long, although the design of the array can be changed according to the particular type of restriction enzyme being studied. The fluorescein-labeled dNTP included in the primer-extension reaction was chosen to be distal to the cleavage site (relative to the glass surface), so that after digestion the fluorescent label that had been incorporated into the second strand would be released (Fig. 3A). For end-labeled dsDNA arrays, the signal was distal to the cleavage site irrespective of the restriction site. Strand density and the distance of the strands from the array surface were varied to measure the effects of accessibility of the DNA strands for primer-extension reactions and enzymatic digestions.
The distance from the surface was varied using either one or two HEG linkers. The two HEG linkers were expected to make the duplex DNA more flexible and more accessible by reducing steric hindrance from the glass surface and neighboring molecules. An array with variable densities and number of linkers was extended in the presence of fluorescein-labeled dATP, then digested with RsaI (Fig. 3B). As RsaI digestion leaves blunt ends between the T and the A of its recognition site (5′-GTAC-3′), incorporated label is lost with the portion of the strand that is released. Signal intensity loss was evaluated by calculating a z score for each feature. This statistic measures the amount of signal intensity loss beyond that due to photobleaching or other effects that might cause general signal intensity loss over the whole array. The average z score in the 30 features containing the RsaI recognition site was 7 (p
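The RsaI cleavage rule described above (a blunt cut between the T and the A of 5′-GTAC-3′) can be illustrated with a small Python sketch. This is an illustration only: the function name `rsai_fragments` and the example sequences are hypothetical, and the model ignores the surface attachment, strand density, and labeling chemistry discussed in the text.

```python
RSAI_SITE = "GTAC"
CUT_OFFSET = 2  # blunt cut between the T and the A: GT^AC


def rsai_fragments(top_strand):
    """Split a 5'->3' top-strand sequence at every RsaI site.

    Because GTAC is palindromic and the cut is blunt, cutting the top
    strand at GT^AC describes the cut in both strands of the duplex.
    """
    fragments = []
    start = 0
    pos = top_strand.find(RSAI_SITE)
    while pos != -1:
        cut = pos + CUT_OFFSET
        fragments.append(top_strand[start:cut])
        start = cut
        pos = top_strand.find(RSAI_SITE, pos + 1)
    fragments.append(top_strand[start:])
    return fragments


print(rsai_fragments("AAGTACTT"))  # ['AAGT', 'ACTT']
print(rsai_fragments("AAAATTTT"))  # ['AAAATTTT'] (no site, no cut)
```

On the array, one of the two fragments stays tethered to the glass while the other is released, which is why label incorporated distal to the site is lost on digestion.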
0	New developments in microarray technology
Dietmar H Blohm* and Anthony Guiseppi-Elie
0	Microarrays  have  emerged  as  indispensable  research  tools  for  gene  expression  profiling  and  mutation  analysis.  New  classification  of  cancer  subtypes,  dissecting  the  yeast  metabolism  and  large-scale  genotyping  of  human  single  nucleotide  polymorphisms  are  important  results  being  obtained  with  this  technique.  Realizing  the  microsphere-based  massively  parallel  signature  sequencing  technique  as  fluid  microarrays,  building  new  types  of  protein  arrays  and  constructing  miniaturized  flow-through  systems,  which  can  potentially  take  this  technology  from  the  research  bench  into  industrial,  clinical  and  other  routine  applications,  exemplify  the  intense  developments  that  are  now  ongoing  in  this  field.
0	as  from  more  than  200  companies  worldwide  engaged  in  the  development  and  application  of  this  technology.  The  scope  of  this  review  is  therefore  restricted  to  some  examples  of  recent  technical  advances  and  research  applications,  and  is  focused  on  current  trends  in  the  movement  of  the  microarray  from  being  a  purely  research  method  to  becoming  an  analytical  instrument  applicable  in  the  clinic  as  well  as  in  industry.
0	The  present  state  of  microarray  technology
0	Working  with  microarrays  requires  the  combination  of  at  least  five  different  components  [8]:  the  chip  itself  with  its  special  surface;  the  device  for  producing  microarrays  by  spotting  the  nucleic  acids  (probes)  onto  the  chip  or  for  their  in  situ  synthesis;  a  fluidic  system  for  hybridization  to  target  DNA;  a  scanner  to  read  the  chips;  and  sophisticated  software  programs  to  quantify  and  interpret  the  results.  Additional  tools  are  required  for  extracting  nucleic  acids  from  biological  material  to  prepare  them  for  the  analysis.  For  each  of  these  components  special  equipment  is  now  commercially  available.  In  addition,  microarray  components  or  complete  systems,  ready-to-use  gene  collections  and  PCR  product  libraries  of  cDNA  and  even  comprehensive  microarray  studies  are  commercially  offered  as  services  (for  details  see  [3,9]).  Usually,  the  different  systems  show  very  different  levels  of  reliability  and  reproducibility,  are  not  compatible  with  each  other  and  require  a  skilled  scientist  to  setup,  commission  and  even  to  routinely  run  them.  The  value  of  microarray  experiments  still  depends  critically  on  the  quality  of  arraying,  recently  made  possible  by  bubble  jet  technology  [10·]  or  maskless  in  situ  synthesis  of  oligonucleotides  [11··].  Microarray  experiments  also  depend  on  probe  and  target  preparation,  experimental  variations  during  hybridization  and  specifically  on  the  selection  of  the  nucleic  acids  affixed  to  the  microarray  surface.  Further,  microarray  experiments  depend  on  the  homogeneity  of  the  surface  and  linking  chemistries  on  the  chip  [12]  as  well  as  on  background  and  overexposure  problems  during  image  processing  [13].  
Based on improvements in microarray surface chemistry [14,15·], scanner technology and software developments, quantitative changes in transcription activity of twofold or less can now be measured reproducibly, except in the case of low-abundance mRNAs. However, technical standards or established procedures for the exact comparison of the different technical systems or among different approaches, such as cDNA-arrays versus oligonucleotide-arrays [16,17], are still missing. Now, as before, the microarray field is moving very fast and new technical approaches and applications are emerging continuously. A remarkable recent advance is the development of `fluidic' microarrays, a system for massively parallel signature sequencing (MPSS). Millions of DNA-signatured microbeads, each
0	Analytical  biotechnology
0	carrying a different cDNA attached by in vitro cloning, are repeatedly cycled between restriction type II cleavage, ligation steps and hybridization reactions to add decoder probes for reading the signatures. The number of microbeads carrying identical cDNAs is then counted by imaging them onto a charge-coupled device camera using a flow cell. Because ~250,000 microbeads are processed at once, even rare mRNAs can be assessed without prior knowledge of their sequence [18··]. Microbeads are also employed to attach molecular beacons that produce a fluorescence signal after binding of (unlabeled) target molecules. By encoding them with a particular dye signature, >107 randomly ordered microbeads can be analyzed simultaneously in a high-density fiber array using an imaging fluorescence system [19·]. To increase the sensitivity of microarrays, a new `scanometric' detection system based on gold-nanoparticle-promoted silver reduction has been reported to be 100 times more sensitive than fluorescence measurement [20··]. As a method connecting genomics and proteomics (for review see [21]), microarray technology has also been used for large-scale peptide and protein analysis [22]. New protein microarrays can be used instead of the yeast two-hybrid system for analyzing protein-protein interactions in vitro, for identifying protein kinase substrates and for measuring interactions between proteins and low-molecular-weight molecules, including low-affinity interactions [23,24·].
In  addition,  the  microarray  technique  has  been  used  to  screen  >18,000  antibodies  against  15  different  antigens  in  one  experiment  using  high-density  gridding  of  bacteria  containing  antibody  genes  and  testing  them  using  a  solid-phase  enzyme-linked  immunosorbent  assay  (ELISA)  [25].  Single-stranded  nucleic  acids  coupled  to  proteins  have  been  used  to  convert  DNA  microarrays  into  protein  microarrays  in  a  one-step,  self-assembling  hybridization  process  [26]  and  plasma  polymerized  protein  films  have  been  used  to  fabricate  DNA-arrays  [27·].  Another  area  of  noteworthy  advance,  and  one  that  has  long  been  neglected,  is  the  proper  identification  of  sources  of  noise,  error  analyses  and  quantitative  treatments  of  systematic  and  stochastic  errors  in  DNA  microarray  analyses  [13].
0	and ovarian tissue used in the National Cancer Institute for anti-cancer drug screening revealed clearly distinguishable profiles if assayed with 9703 human cDNA probes [33]. Fundamentally new insights have also been obtained in studies comparing highly and less metastatic melanoma cells [34], tumor and normal colon tissue [35], and acute myeloid leukemia versus acute lymphoblastic leukemia [36··]. Whether some results of this kind might be questionable has to be clarified, because aneuploidy was shown to lead to spurious correlation among expression profiles and to be more widespread than expected [37]. Full-genome expression profiles from 300 different mutants, physiological situations or chemical treatments of a yeast culture have been measured from 4553 genes and compared with 63 such profiles of an isogenic strain grown under standard conditions [38··]. The resulting `compendium' database allowed the monitoring of hundreds of different cellular functions as one single assay using the microarray. This database was used to estimate that under constant conditions the level of gene induction or repression natively fluctuates in the range of twofold, but also to identify eight yeast ORFs as being involved in ergosterol biosynthesis, cell wall structure, mitochondrial function or protein synthesis. In addition, this database allowed the discovery that the cellular target of the anesthetic drug dyclonine in humans is the neuroactive sigma factor, which shows the greatest sequence homology to the affected yeast gene erg2p.
Using  the  method  of  singular  value  decomposition  (SVD),  the  complexity  of  large  sets  of  microarray  expression  data  can  be  reduced  to  show  that  the  `music  of  genes  is  orchestrated'  through  a  few  simple  underlying  patterns  [39].  Meanwhile,  experiments  including  up  to  15,000  genes  and  more  have  been  carried  out  to  analyze  the  susceptibility  of  murine  B  cell  lymphoma  to  apoptosis  after  irradiation  [40],  to  characterize  the  different  gene  activities  between  placenta  and  embryos  in  mice  [41],  to  measure  the  response  of  the  human  intestinal  cells  to  infection  with  Salmonella  bac
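The idea that a few underlying patterns dominate a large expression data set can be illustrated with a small sketch. The code below is not the method of reference [39]; it is a minimal, standard-library illustration that approximates only the first singular value and first right singular vector of an expression matrix by power iteration. The function names `leading_pattern` and `variance_fraction`, and the layout of the matrix (rows as genes, columns as arrays), are assumptions made for this example.

```python
import math


def leading_pattern(matrix, iterations=200):
    """Approximate the first right singular vector and singular value.

    `matrix` is a list of rows (assumed: one row per gene, one column per
    array). Power iteration is applied to A^T A; this is a simple stand-in
    for a full SVD and recovers only the dominant pattern.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Start from a uniform unit vector. If this happens to be orthogonal
    # to the dominant pattern, the iteration cannot find it.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # w = A^T u
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # The singular value is the length of A v.
    u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    return v, sigma


def variance_fraction(matrix, sigma):
    """Fraction of the total sum of squares captured by one singular value."""
    total = sum(x * x for row in matrix for x in row)
    return (sigma * sigma) / total
```

When the returned fraction is close to 1, a single pattern across arrays accounts for nearly all of the signal; a full SVD routine from a numerical library would be the natural choice for real data.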
0	Making  and  Using  DNA  Microarrays:  A  Short  Course  at  Cold  Spring  Harbor  Laboratory
1	David  J.  Stewart1
0	Meetings  and  Courses,  Cold  Spring  Harbor  Laboratory,  Cold  Spring  Harbor,  New  York  11724  USA
0	conundrum is familiar. You are sent back in time to the Middle Ages with no artifact from the present, brought before the local ruler, and given 24 hours to prove you are indeed from the future, to impress the ruler and his advisors in some way, before you are executed in some suitably hideous fashion. What do you do? Toying with this conundrum reveals how little we know in a practical sense about the everyday items that surround us. Can you fix your car and your computer? My guess is that few, if any, readers can do so. And so it was with some trepidation that Cold Spring Harbor Laboratory agreed to host a short course in the Fall of 1999, funded in part by the National Cancer Institute, in which students, primarily biologists, would not only print, use, and analyze DNA microarrays but would start the course by building the machines used to print the arrays. For some time, Patrick Brown and colleagues (Chu et al. 1998; DeRisi et al. 1997; Lashkari et al. 1997) at Stanford had been advocating the idea that smaller laboratories could enter the fray and hype surrounding these emerging microarray technologies by building machines rather than by buying them, a self-help philosophy that was strengthened by the Brown laboratory's web-based publication in June 1998 of the MGuide, a step-by-step guide to constructing the arrayer, complete with parts list. Indeed, a number of laboratories have gone ahead and built their own machines. Commercial vendors already offer some solutions for investigators interested in studying changes in genome-wide gene expression. Efforts by Steve
0	from similar restrictions to the Affymetrix approach in terms of which genes the companies decide to array. Many of these products consist of low thousands, hundreds, or even tens of arrayed sequences. Meanwhile, a third approach, midway between the second strategy and the purist Stanford approach, is to buy an arrayer from a commercial vendor such as Cartesian Technologies (Irvine, CA), and then make the DNA chips de novo. This offers flexibility to the investigator in terms of which sequences are arrayed, and the technical support of the vendor in case the printing robot breaks down or becomes unaligned--printing tens of thousands of discrete DNA "features" requires that these arrayers be tightly aligned in both horizontal directions. However, these arrayers have specifications no better than those of home-built machines and currently cost at least twice as much. This brings us back to the Stanford approach--build the machines from scratch. And to our own trepidation: could a group of 16 biologists--selected from a pool of >125 applicants on the basis of their biological interests rather than their machining skills--actually build the machines, albeit with expert guidance from members and former members of the Brown and Botstein laboratories at Stanford, such that they could be used to print high-density DNA microarrays (Table 1)?
As  is  usual  for  Cold  Spring  Harbor  courses,  the  students  included  laboratory  heads,  senior  scientists,  and  postdocs,  plus  two  from  Britain,  and  one  each  from  Sweden,  Germany,  and  New  Zealand,  with  the  remainder  coming  from  academic  laboratories  in  the  United  States  with  widespread  interest  in  topics  ranging  from  the  cell  cycle,  origins  of  replication,  cancer  (and  the  development  of  anti-cancer  vaccines),
0	Genome  Research
0	Table  1.
1	Juerg Baehler, Arul Chinnaiyan, David Collingwood, Bruce Futcher, Janet Hager, Christian Kaltschmidt, Thomas Kocarek, Maria Lagerstrom-Fermer, Matthias Lorenz, Donald Love, Michele Marron, Vivek Mittal, Daniel Notterman, Michael Ryan, Arthur Thompson, Sudha Veeraraghavan
0	Instructors:  Ash  Alizadeh  (Stanford),  Patrick  Brown  (Stanford),  Max  Diehn  (Stanford),  Michael  Eisen  (Lawrence  Berkeley  National  Laboratory),  Jo  DeRisi  (UCSF),  and  Paul  Spellman  (Stanford).
0	signal  transduction,  apoptosis  and  neurobiology.  Preference  was  given  to  individuals  whose  applications  strongly  suggested  that  they  would  move  swiftly  to  develop  and  apply  this  technology  at  their  home  institutions  and  make  it  available  to  other  investigators.  The  explicit  intention  was  to  spread  the  application  of  these  techniques  as  widely  as  possible,  both  geographically  and  scientifically.  The  students  assembled  at  Cold  Spring  Harbor  Laboratory  on  the  night  of  October  19  to  begin  the  2-week  course,  and  began  building  the  arrayers  the  next  morning.  With  one  arrayer  built  in  advance  by  Vishy  Iyer  and  Jo  DeRisi,  a  lead  instructor  in  the  course,  serving  as  a  guide,  the  students  were  able  to  build  three  complete  machines  by  the  third  day  of  the  course--these  were  long  16  hour  days--despite  "teething  problems"  in  terms  of  broken  or  malfunctioning  components  (Fig.  1).  Predictably,  the  students  learned  more  from  the  problems  that  they  encountered  than  an  error-free  assembly  of  the  equipment  might  have  offered.  By  the  fourth  and  fifth  days,  the  course  was  printing  duplicate  arrays  of  the  entire  6200-gene  set  of  Saccharomyces  cerevisiae,  chips  valued  in  excess  of  several  tens  of  thousands  of  dollars  by  current  commercial  prices,  using  clones
0	reduced  by  increasing  the  number  of  replicate  arrays  or  even  by  altering  the  pattern  of  printing.  With  sufficient  arrays  printed  and  available  for  experimentation,  the  students  were  ready  to  prepare  samples  for  hybridization.  Regardless  of  how  DNA  microarrays  are  fabricated,  at  this  point  methods  for  using  these  arrays  start  to  coalesce,  particularly  in  terms  of  gene  expression  analysis.  Because  of  the  enormous  variation  in  the  number  of  mRNA  molecules  being  analyzed,  and  because
0	of  the  complexities  of  the  hybridization  kinetics  of  individual  DNA  sequences,  microarrays  are  used  to  measure  the  ratio  between  a  reference  and  a  sample,  typically  labeled  with  green  and  red  fluorescent  dyes,  rather  than  the  absolute  quantity  of  transcript.  It  is  for  this  reason  that  raw  array  data  are  typically  represented  as  a  grid  of  dots  of  varying  intensities  of  red,  yellow  and  green.  The  individu
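The ratio measurement described above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the sample is taken to be the red channel and the reference the green channel, intensities are already background-corrected, and the function names `log2_ratios` and `spot_colour`, the intensity floor and the twofold threshold are hypothetical choices for this example rather than values from the course.

```python
import math


def log2_ratios(sample, reference, floor=1.0):
    """Return log2(sample / reference) for each spot.

    `sample` and `reference` are equal-length lists of intensities
    (assumed: red channel = sample, green channel = reference).
    Intensities below `floor` are raised to `floor` so that the ratio
    is always defined.
    """
    if len(sample) != len(reference):
        raise ValueError("channels must have the same number of spots")
    return [
        math.log2(max(s, floor) / max(r, floor))
        for s, r in zip(sample, reference)
    ]


def spot_colour(log_ratio, threshold=1.0):
    """Map a log2 ratio to the colour a spot would roughly appear as.

    A threshold of 1.0 corresponds to a twofold difference (assumed).
    """
    if log_ratio >= threshold:
        return "red"      # more transcript in the sample
    if log_ratio <= -threshold:
        return "green"    # more transcript in the reference
    return "yellow"       # similar abundance in both
```

Working on a log2 scale makes induction and repression symmetric: a fourfold increase gives +2 and a fourfold decrease gives -2.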
0	REVIEW  Experiments  using  microarray  technology:  limitations  and  standard  operating  procedures
1	T  Forster,  D  Roy  and  P  Ghazal
0	Abstract  Microarrays  are  a  powerful  method  for  the  global  analysis  of  gene  or  protein  content  and  expression,  opening  up  new  horizons  in  molecular  and  physiological  systems.  This  review  focuses  on  the  critical  aspects  of  acquiring  meaningful  data  for  analysis  following  fluorescence-based  target  hybridisation  to  arrays.  Although  microarray  technology  is  adaptable  to  the  analysis  of  a  range  of  biomolecules  (DNA,  RNA,  protein,  carbohydrates  and  lipids),  the  scheme  presented  here  is  applicable  primarily  to  customised  DNA  arrays  fabricated  using  long  oligomer  or  cDNA  probes.  Rather  than  provide  a  comprehensive  review  of  microarray  technology  and  analysis  techniques,  both  of  which  are  large  and  complex  areas,  the  aim  of  this  paper  is  to  provide  a  restricted  overview,  highlighting  salient  features  to  provide  initial  guidance  in  terms  of  pitfalls  in  planning  and  executing  array  projects.  We  outline  standard  operating  procedures,  which  help  streamline  the  analysis  of  microarray  data  resulting  from  a  diversity  of  array  formats  and  biological  systems.  We  hope  that  this  overview  will  provide  practical  initial  guidance  for  those  embarking  on  microarray  studies.
0	experiments  with  each  chip  hybridised  with  experimental  and  reference  samples,  thought  must  go  into  the  correct  selection  of  the  reference  material  to  ensure  biological  relevance  to  the  study.  Due  consideration  must  be  given  to  whether  material  is  pooled  or  individually  sampled.  The  entire  planning  stage  is  as  important  as  the  subsequent  implementation  (see  below)  and  omissions  at  this  stage  can  easily  lead  to  non-representative  or  false  results.  Planning  of  a  study  benefits  from  multiple  inputs  from  biological  researchers  as  well  as  statistician/bioinformaticians  with  experience  in  microarray  technology.  Experimental  sampling  and  extraction  of  RNA  is  a  vitally  important  component  of  this  process  since  successful  microarray  studies  are  dependent  on  the  consistent  extraction  of  high  quality  RNA.  In  broad  terms,  microarrays  are  performed  on  two  basic  biological  systems:  simple  and  complex.  Simple  biological  systems  are  those  where  homogeneous  cell  populations  are  present,  such  as  cell  lines  or  purified  cell  populations.  Sampling  from  simple  systems  is  more  likely  to  represent  the  expression  level  for  the  particular  cell  or  tissue  under  study.  Complex  systems  are  typified  by  tissues  and  organs  where  there  is  a  diversity  of  cellular  substructures  and  mixed  cellularity.  Extraction  of  RNA  from  complex  systems  means  that  critical  spatial  and  cellular  information  as  to  the  origin  of  the  signal  is  lost.  This  reduction  of  contextual  information  makes
1	T  FORSTER
0	and  others
0	Microarray  standard  operating  procedures
0	gram)  quantities  of  RNA  are  gleaned  from  these  sampling  strategies  -  quantities  that  are  usually  too  small  for  conventional  labelling  strategies.  New  amplification  methods  for  the  labelling  of  minute  quantities  of  RNA  are  now  being  employed.  However,  it  is  becoming  increasingly  evident  that  even  highly  purified  cell  populations  and  apparently  homogeneous  cell  lines  may  demonstrate  complexity  of  phenotype  and  metabolism  at  the  individual  cell  level.  This  variation  is  likely  to  encompass  differences  in  RNA  turnover,  sublocalisation,  splicing  and  translational  activity.  This  only  serves  to  highlight  the  importance  of
0	standardising  culture  and  purification  methods  as  rigorously  as  possible  to  achieve  consistency  during  sampling  and  extraction  phases.  Regardless  of  the  RNA  sampling  methods  employed,  it  is  important  to  apply  rigorous  quality  control  to  purified  RNA  populations.  For  instance,  the  Bioanalyser  system  from  Agilent  Technologies  (Cheadle  Royal  Business  Park,  Stockport,  Cheshire,  UK)  is  now  commonly  employed  to  check  the  quality  and  consistency  of  RNA  samples.  The  resulting  absorbance  profile  provides  a  useful  means  of  assessing  the  suitability  of  RNA  for  labelling.  At  this  stage,  consistency  during  labelling  and  hybridisation  steps  is  the  starting  point  for  the  generation  of  consistent  array  data  (Hegde  et  al.  2000).  The  selection  and  production  of  the  correct  array  format  is  important  and  a  central  feature  of  the  process.  The  majority  of  custom  arrays  are  produced  by  the  direct  deposition  of  nucleic  acid  probes  as  cDNA  or  long  oligomeric  sequences  onto  treated  glass  substrates.  The  production  of  reproducible  arrays  with  current  pin  printing  methods  is  challenging.  In  our  own  Centre  we  have  introduced  a  number  of  quality  control  steps  to  ensure  consistency  of  array  production,  but  these  are  outside  the  scope  of  this  review.  An  essential  theme  is  the  requirement  for  microarray  data  to  be  MIAME  (minimum  information  about  a  microarray  experiment)  (Brazma  et  al.  2001)  compliant.  In  essence,  this  addition  of  standardised  information  about  all  stages  of  a  microarray  experiment  allows  for  amalgamation  of  array  data  from  different  groups  and  sources  in  the  public  domain,  ultimately  permitting  advanced  and  automatic  data  mining.  Accordingly,  there  is  an  absolute  necessity  for  the  implementation  of  M-SOPs.  
The  M-SOPs  outlined  here  aid  in  the  production  of  standardised  project  documentation,  which  ensures  MIAME  compliance  for  publication.  In  the  following  sections  we  outline  in  more  detail  the  analytical  steps  of  the  workflow.  Data  Generation  and  Validation  The  chronological  order  of  processes  in  a  microarray  project  utilising  customised  arrays  is  given  in  Fig.  1.  Approaches  for  individual  and  combined  processing  and  analysis  steps  have  recently  been  reviewed  (Nature  Genetics  2002,  Speed  2002).  Array  scanning  and  image  quantification  The  process  of  scanning  an  array  is  known  as  image  acquisition,  whereas  the  process  of  converting  images  to  numerical  data  is  referred  to  as  image  quantification  or  processing.  The  majority  of  microarray  experiments  involve  the  fluorescent  detection  of  hybridised  signal  using  confocal  laser  scanners.  A  wide  variety  of  different  scanning  instruments  are  available,  and  a  number  of  different
0	image  acquisition  and  quantification  packages  are  associated  with  them.  In  general,  selection  of  image  quantification  parameters  (e.g.  `adaptive',  `fixed  circle',  `spot  distance')  should  be  carefully  assessed  and  decided  for  each  project  as  a  whole,  and  will  depend  on  array  design,  slide  type  and  spot  morphology.  As  an  exception  to  this,  a  limited  form  of  manual  input  is  often  required  to  fine-tune  the  layout  of  the  template  quantification  grid  for  individual  arrays  and  care  should  be  taken  to  avoid  user  bias.  Apart  from  this  limited  fine-tuning,  it  should  be  noted  that  the  image  quantification  method  should  be  identical  for  all  slides  constituting  a  project,  whereas  image  acquisition  parameters,  for  instance  laser  power  and/or  photo  multiplier,  can  be  optimised  from  slide  to  slide.  For  a  comparative  discussion  of  issues  concerned  with  statistical  image  a
0	TRENDS  in  Biotechnology
0	directed  towards  improvement  of  agricultural  qualities,  perhaps  these  goals  can  be  combined  to  increase  tolerance  to  temperature  extremes,  salinity,  flooding,  or  insect  pests  in  plants  capable  of  pollutant  detoxification  or,  more  importantly  for  value-enhancement  -  transfer  of  phytoremediative  traits  to  elite  plant  cultivars  having  the  highest  biomass  or  agricultural  productivity.  Obviously,  concerns  about  contaminant  uptake  and  accumulation  will  limit  the  use  of  phyto-crops  for  food  or  human  contact  products,  so  every  effort  must  be  made  to  identify  parent  compound  fate  and  toxicity  for  these  applications.  However,  as  observed  with  the  development  of  chemopreventative  enriched,  Se-hyperaccumulating  plants,  opportunities  exist  to  combine  pollutant  decontamination  capabilities  with  beneficial  human  and  ecological  health  qualities  in  engineered  plants.
0	see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.tibtech.2004.08.003
0	Exploring  the  post-transcriptional  RNA  world  with  DNA  microarrays
1	Vishwanath  R.  Iyer
0	Genomic  approaches  are  valuable  for  understanding  the  complex  layer  of  gene  regulation  that  involves  the  control  of  RNA  processing,  localization  and  stability.  Recent  work
0	provides  a  prime  example  of  the  power  of  unbiased  microarray-based  methods  to  discover  unexpected  functions  for  proteins  in  the  RNA  world.  The  challenges  ahead  relate  to  extending  such  approaches  to  larger  genomes  and  to  integrating  this  type  of  information  with  that  generated  by  standard  expression  profiling.
0	Although  gene  expression  is  often  regulated  by  transcription  factors  at  the  level  of  transcription  initiation,  the  subsequent  steps  of  RNA  processing,  turnover,  subcellular  localization  and  entry  into  the  translation  machinery  strongly  influence  the  extent  of  protein  translation  and  the  function  of  encoded  proteins.  Such  post-transcriptional  steps  therefore  have  marked  effects  on  the  expression  and  function  of  genes  in  processes  as  diverse  as  cytokinesis,  early  embryonic  development  and  neuronal  function  [1].  When  trying  to  infer  the  global  phenotypes  of  cells  from  large-scale  mRNA  expression  profiling  data,  it  is  important  to  be  aware  of  this  intervening  layer  of  gene  regulation.  Most  post-transcriptional  events  are  mediated  by  the  association  of  RNAs  with  specific  proteins  or  macromolecular  protein  complexes.  Comprehensive  determination  of  the  RNA  targets  of  RNA-binding  proteins  is  therefore  likely  to  be  important  in  deciphering  the  complex  events  at  this  level  of  gene  regulation.  The  La  protein  is  a  conserved  eukaryotic  protein  that  is  thought  to  be  important  in  the  realm  of  posttranscriptional  regulation  and,  as  we  discuss  here,  a  recent  study  by  Inada  and  Guthrie  [2]  provides  a  prime  example  of  the  use  of  a  genomic  approach  to  elucidate  the  targets  and  potential  function  of  such  an  RNA-binding  protein.  Ribonomics  with  cDNA  microarrays  cDNA  microarrays  have  been  heavily  used  for  quantitative  mRNA  profiling,  but  there  are  increasing  examples  of  the  varied  use  of  cDNA  microarrays  to  follow  the  fates  of  mRNAs  in  the  cell  after  they  are  made,  rather  than  to  measure  only  their  steady-state  levels.  
One objective is to determine the binding targets of proteins that interact with RNAs at any point during the lifetime of the RNA. Protein-RNA interactions represent one of the most abundant categories of molecular interactions in cells, and the total number of RNA-interacting proteins rivals that of other categories such as transcription factors and signaling molecules, even if one excludes the hundreds of proteins that are integral components of the spliceosome and ribosome [3,4]. Proteins can interact with RNAs from the time that the RNAs are transcribed, and they affect transcriptional efficiency, capping, 3′-end processing, splicing, nuclear export, subcellular localization, translation and turnover of RNA [5]. The sheer diversity, cell- and tissue-specificity, and conservation of RNA-binding proteins have led to the notion that primary transcripts, rather than advancing smoothly through each of the subsequent RNA processing steps, participate in a complex network of regulatory processes at the post-transcriptional level [6]. Clearly, identifying the RNA targets of specific RNA-binding proteins is likely to be at least as informative and important with regard to understanding global gene regulation as is measuring changes in steady-state levels of RNAs in response to cellular signals. The genomic strategy for determining the RNA partners of RNA-binding proteins involves immunoprecipitation of the protein of interest along with its associated RNA, fluorescent labeling of the enriched RNA (as cDNA), and finally microarray hybridization in conjunction with an appropriate reference probe (Figure 1). This approach
0	was first used independently in the laboratories of Ron Vale [7] and Jack Keene [8] and was termed `ribonomics' by the latter. Variations of this method have been subsequently used to identify the targets of more than a dozen RNA-binding proteins (see Gerber et al. [9] and references therein). The function of La in the cell A prime example of the power of ribonomics has been provided recently by Maki Inada and Christine Guthrie [2] in their analysis of the function of the La protein in yeast. La is a ubiquitous, nuclear RNA-binding protein that is conserved among eukaryotes. It is known to associate with the 3′-UUU-OH co
0	YOUNG  INVESTIGATOR  PERSPECTIVES  DNA  Microarray  Analyses  of  Circadian  Timing:  The  Genomic  Basis  of  Biological  Time
1	G.  E.  Duffield
0	Department  of  Integrative  and  Molecular  Neuroscience,  Division  of  Neuroscience  and  Psychological  Medicine,  Faculty  of  Medicine,  Imperial  College  London,  London,  UK.  Key  words:  circadian  rhythm,  microarray,  clock  gene,  gene  expression,  clock  controlled  gene.
0	Abstract Many aspects of physiology and behaviour are organized around a daily rhythm, driven by an endogenous circadian clock. Studies across numerous taxa have identified interlocked autoregulatory molecular feedback loops which underlie circadian organization in single cells. Until recently, little was known of (i) how the core clock mechanism regulates circadian output and (ii) what proportion of the cellular transcriptome is clock regulated. Studies using DNA microarray technology have addressed these questions in a global fashion and identified rhythmically expressed genes in numerous tissues in the rodent (suprachiasmatic nucleus, pineal gland, liver, heart, kidney) and immortalized fibroblasts, in the head and body of Drosophila, in the fungus Neurospora and the higher plant Arabidopsis. These clock controlled genes represent 0.5–9% of probed genes, with functional groups covering a broad spectrum of cellular pathways. There is considerable tissue specificity, with only approximately 10% of rhythmic genes common to at least one other tissue, principally consisting of known clock genes. The remaining common genes may constitute genes operating close to the clock mechanism or novel core clock components. Microarray technology has also been applied to understand input pathways to the clock, identifying potential signalling components for clock resetting in fibroblasts, and elucidating the temperature entrainment mechanism in Neurospora. This review explores some of the common themes found between tissues and organisms, and focuses on some of the striking connections between the molecular core oscillator and aspects of circadian physiology and behaviour.
It  also  addresses  the  limitations  of  the  microarray  technology  and  analyses,  and  suggests  directions  for  future  studies.  The  circadian  timing  system
0	Circadian rhythms are endogenous, near 24-h rhythms of physiology and behaviour generated by underlying genetic feedback loops occurring in a majority of organisms from prokaryotes to humans (1). Their importance to human health is becoming apparent, such as in the increasing occurrence of shift work and jet-lag (2), sleep syndromes (e.g. advanced sleep phase syndrome) (3), and in the connection of clock genes with cell division, tumour development and DNA damage-response pathways (4). It has long been appreciated that many neuroendocrine systems are regulated on a circadian basis, examples being rhythms of plasma melatonin and cortisol, behavioural parameters such as sleep onset and offset, cognitive attention, and physiological parameters such as core body temperature and urine output (5). In mammals, the master clock resides in the paired suprachiasmatic nuclei (SCN) of the hypothalamus (6, 7). Studies monitoring electrical activity of single dissociated SCN neurones have revealed that this oscillator mechanism resides within individual cells (6). These oscillators consist of interconnected molecular feedback loops composed of a positive loop, where activators drive the transcription of gene products, which feed back to repress the transcription of themselves and/or other core oscillator molecules (1, 6). The core feedback loops of the mammalian clock consist of three period (Per) genes and two cryptochrome (Cry) genes (negative loop) and PAS domain proteins Bmal1 and Clock (positive loop) (6). The Per and Cry genes are activated by CLOCK:BMAL1 heterodimers. The PER and CRY proteins are then translated in the cytoplasm where PER1 and PER2 are phosphorylated by casein kinase Iε/δ. The phosphorylated PER proteins dimerize with CRY1 allowing them entry into the nucleus, where CRY1 is proposed to repress the activation of
0	DNA  microarrays  and  their  application  to  circadian  biology
0	A  number  of  central  questions  regarding  circadian  biology  are  amenable  to  investigation  by  DNA  microarray  technologies.  (i)  Although  there  exists  considerable  knowledge  about  the  core  oscillator  mechanism,  and  some  of  the  physiological  and  behavioural  processes  that  are  under  circadian  control,  little  is  known  about  the  connection  between  the  oscillator  and  down-stream  biological  processes  that  are  under  clock  control.  Profiling  of  gene  expression  over  several  days  can  identify  novel  downstream  genes
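One simple way to screen a multi-day expression time course for a near 24-h rhythm is to fit a cosine curve to each gene's profile. The sketch below is a hedged illustration, not the analysis used in the studies reviewed here. It assumes evenly spaced samples that span a whole number of periods (with at least three samples per period), in which case projecting onto cosine and sine terms gives the least-squares fit directly. The function name `cosinor_fit` and its parameters are hypothetical.

```python
import math


def cosinor_fit(times_h, values, period_h=24.0):
    """Fit values ~ mesor + amplitude * cos(w * (t - peak_time)).

    Assumes evenly spaced sampling over a whole number of periods, so
    that the cosine and sine terms are orthogonal and simple projections
    give the least-squares coefficients.
    Returns (mesor, amplitude, peak_time_h).
    """
    n = len(values)
    w = 2.0 * math.pi / period_h
    # The mesor is the rhythm-adjusted mean expression level.
    mesor = sum(values) / n
    a = (2.0 / n) * sum((y - mesor) * math.cos(w * t)
                        for t, y in zip(times_h, values))
    b = (2.0 / n) * sum((y - mesor) * math.sin(w * t)
                        for t, y in zip(times_h, values))
    amplitude = math.hypot(a, b)
    # Time of peak expression within one cycle, in hours.
    peak_time_h = (math.atan2(b, a) / w) % period_h
    return mesor, amplitude, peak_time_h
```

A gene would then be flagged as a candidate clock controlled gene when its fitted amplitude is large relative to the residual scatter; the choice of cut-off and of a significance test is left open here.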
0	Computational  Approach  to  Systems  Biology:  From  Fraction  to  Integration  and  Beyond
1	Pawan  K.  Dhar,  Hao  Zhu,  and  Santosh  K.  Mishra*
0	Abstract--Systems  biology  is  an  approach  to  understanding  the  workings  of  whole  biological  systems.  The  various  methods  used  for  systems  analyses  range  from  experimental  to  computational.  In  this  paper,  we  describe  basic  concepts  of  systems  biology,  modeling  challenges  that  arise  from  the  massively  parallel  interaction  among  components  in  biological  systems,  and  what  lies  beyond  integration  of  modular  knowledge.  Index  Terms--Cellular  automata,  modeling,  signaling  pathway,  systems  biology.
0	I.  ORIGIN  OF  SYSTEMS  BIOLOGY
0	BIOLOGY IS systems by default. Surprisingly, biology has hardly been practiced that way. Reductionism has been the dominant approach of experimental biologists, who like to: 1) reduce a problem into components (or modules); 2) integrate the modular knowledge by using assumptions; and 3) iterate reductionism and integration till a reasonably good understanding of the system appears. This classical way of doing biology was successfully practiced till recently, when researchers shifted focus from reductionism to integration. The advent of high-throughput technologies, such as microarrays that simultaneously measure thousands of gene expression profiles, significantly influenced the move toward a systems approach. However, going by the documented literature, the concept of a systems approach was born more than seven decades ago. In the early part of the last century, von Bertalanffy described a system as a group of dynamic and mutually interacting parts and processes and argued that the fundamental task of biology was to discover laws of biological systems [1]. In the 1940s, Wiener (1894-1964) searched for general biological laws using "cybernetics" as a guiding principle. Cybernetics is a field that describes common factors of control and communication in automatic machines, organizations, and living organisms [2]. This was the first attempt to look at biological complexity from a computational standpoint. Based on his work on communication engineering during World War II, Wiener proposed a common conceptual framework from men
0	to  machines.  Though  his  contributions  in  communication  engineering  are  well  known,  his  discrete  contribution  in  biology  has  largely  been  unappreciated,  due  to  the  unavailability  of  relevant  biological  data  and  unvalidated  biological  models  at  that  time.  Throughout  the  1960s  and  1970s,  researchers  from  the  fields  of  mathematics  and  engineering  continued  their  hunt  for  mathematical  and  physical  principles  of  biological  systems,  but  faced  similar  problems  of  data  scarcity  and  model  validation.  However,  a  much  bigger  issue  was  the  lack  of  understanding  of  the  fundamental  properties  of  living  systems,  i.e.,  dynamic  and  nonlinear  behavior.  Due  to  this  reason,  initial  modeling  efforts  were  helpful  only  to  the  extent  of  simulating  isolated  events  without  explaining  their  fundamental  principles.  The  proposition  of  biochemical  system  theory  and  metabolic  control  theory  sparked  a  renewed  interest  in  this  field  [3]-[7].  With  an  enormous  increase  in  computational  bandwidth,  the  capability  of  solving  large-scale  mathematical  equations  registered  a  significant  jump,  making  it  a  routine  task  to  build  large  and  complex  biological  models  using  mathematical  equations.  Recently  a  paradigm  shift  in  biology,  i.e.,  from  low  throughput,  single  investigator  driven  to  high  throughput,  consortia  driven,  has  occurred  [8].  In  parallel  to  these  changes,  a  new  era  of  modeling  efforts,  with  novel  strategies  and  methods,  has  emerged.  Starting  from  the  classical  ordinary  differential  equations,  new  mathematical  representations  have  been  invented  [9]-[12],  broadening  the  area  and  encouraging  more  applications  ranging  from  basic  sciences  to  drug  discovery.  
Even though systems biology has found widespread acceptance among researchers, a few fundamental issues remain. One of the main concerns has been the meaning and application of the term "systems biology" itself. There is also an apprehension that the term has gotten well ahead of the science. Though the term "systems biology" was coined many decades ago, Hood brought it into mainstream science a few years back [13]. Alternative terms like network biology, integrative biology, or interactive biology have also been proposed. We hold the view that systems biology is a new way of doing biology, starting with experimental knowledge, passing through in silico modeling, and finally returning to biological experiments. Systems biology is an approach that works best when integrated with experimental biology. In this paper, we try to assess the role of computation in moving the biological knowledge from fraction to integration, the key features that differentiate systems biology from traditional biology.
0	II. INTRODUCTION AND TERMINOLOGY
0	In 2003, as the scientific community was commemorating the golden jubilee of the discovery of DNA's double helical structure, the question "what next?" was raised. There was a general consensus that transcriptomics and proteomics are much more challenging than genomics--a problem thought to be most demanding till recently. An accelerated postgenomics effort was triggered mainly due to the invention of high-throughput technologies. It is unlikely that a morass of data produced by sequencing, microarray, and gene knockout experiments can be fully captured with the current tools and technologies. The pressing need is not the quantity of data but their quality and semantics--something that cannot be addressed by a divide-and-conquer approach alone. Knowledge from modular biology gathered from "isolated" systems is conceptually crosslinked to create a molecular level and, by extension, cell, organ, and even organism level understanding. However, very often knowledge acquired through such an approach comes with exceptions and gaps. For example, Mendelian laws of inheritance apply in all conditions except when traits are multifactorial, in which an additive effect predominates. Another example is the coexistence of dominant alleles in the ABO blood group in humans. Likewise, the expansion of triplet repeats (CGG) is an "anticipation" phenomenon that sometimes results in neuromuscular diseases. Added to this is the "intramodular" inaccuracy and incompleteness of data. Thus, to gain a holistic view of cell transactions, the classical reductionism needs to be supplemented with an approach that builds the system bottom up and analyzes it top down. With the recent development in instrumentation and information technology, this goal looks realistic. A cell is a massively parallel and interacting system.
0	IEEE
0	DHAR et al.: COMPUTATIONAL APPROACH TO SYSTEMS BIOLOGY
The  parallel  nature  of  the  system  speeds  up  the  transfer  of  instructions  within  the  cell,  while  the  interactive  feature  determines  the  nonlinear  and  dynamic  behavior  of  a  system  that  exhibits  feedback  loops,  noise,  redundancy,  and  robustness.  Furthermore,  the  cross-interactive  nature  of  intracellular  processes  gives  rise  to  fuzzy  boundaries  among  pathways.  For  example,  DNA  polymerase  participates  in  both  the  synthesis  of  a  new  DNA  strand  and  the  repair  of  DNA  damage.  Thus,  it  may  be  considered  a  common  link  between  replication  and  repair  pathways.  Likewise  there  are  hundreds  of  parallel  and  cross-interacting  events  within  the  cell,  making  it  difficult  to  draw  boundaries  between  pathways.  Therefore,  a  practical  rule  of  thumb  is,  a  system  is  really  where  you  draw  a  box.  If  systems  biology  could  be  simply  defined  as  comprehensive  biology  or  as  biology  at  system  level,  then  traditional  Chinese  and  Indian  medicine  could  be  considered  as  precursors  of  systems  biology.  The  systems  biology  is  based  on  two  prominent  features.  First,  it  is  built  on  the  knowledge  gained  from  experimental  biology;  second,  computational  technologies  are  used  to  bridge  multilayer  experimental  data.  The  goal  is  to  describe  biology  not  only  at  molecular  level  and  the  system  level  but  also  to  understand  life  in  the  form  of  mechanisms  and  principles.  Computational  methods  have  been  key  driving  forces  in  mathematics,  physics,  chemistry,  and  also  biology.  In  systems  biology,  they  function  as  hubs  connecting  theoretical,  mathematical,  and  quantitative  findings.  The  key  is  to  find  appropriate  representations  of  biological  events  for  numerically  describing  cellular  processes.  
In  fact,  many  biological  processes  are  more  suitably  described  in  the  language  of  computer
0	science  than  that  of  mathematics,  especially  for  those  in  which  phenomenological  knowledge  is  more  easily  available  than  precise  mechanistic  and  quantitative  description.  Developmentally  regulated  pathways,  signal  transduction,  and  pattern  formation  are  such  cases.  The  importance  of  computational  approach  in  systems  biology  is  underscored  by  the  fact  that  it  can  provide  effective  description  for  systems  at  different  levels.  Though  systems  biology  is  sometimes  practiced  and  termed  as  quantitative  systems  biology  (QSB)  or  computational  systems  biology  (CSB),  both  are  approaches  rather  than  hierarchical  branches  of  systems  biology.  With  the  availability  of  high-throughput  quantitative  methods,  concentrations  of  gene  products,  metabolites,  and  small  molecules  in  di
0	Array  of  hope
1	Eric  S.  Lander
1	Bob  Crimi
0	Genomics  aims  to  provide  biologists  with  the  equivalent  of  chemistry's  Periodic  Table1  --an  inventory  of  all  genes  used  to  assemble  a  living  creature,  together  with  an  insightful  system  for  classifying  these  building  blocks.  A  short  decade  ago,  the  task  of  enumeration  alone  appeared  to  many  to  be  a  quixotic  quest.  Whereas  chemical  matter  is  composed  of  a  mere  hundred  or  so  elements,  organismal  parts  lists  are  huge--running  into  the  thousands  for  bacteria  and  hundreds  of  thousands  for  mammals.  Genomic  mapping  and  sequencing,  however,  has  steadily  extended  its  dominion:  it  has  domesticated  the  Megabase  and  will  tame  the  Gigabase  in  the  not-too-distant  future.  The  next  great  challenge  is  to  discern  the  underlying  order.  The  Periodic  Table  summarized  chemical  propensities  in  its  rows  and  columns,  and  thereby  foreshadowed  the  secrets  of  subatomic  structure.  Understanding  biological  systems  with  100,000  genes  will  similarly  require  organizing  the  parts  by  their  properties.  The  Biological  Periodic  Table  will  not  be  two-dimensional,  but  will  reflect  similarities  at  diverse  levels:  primary  DNA  sequence  in  coding  and  regulatory  regions;  polymorphic  variation  within  a  species  or  subgroup;  time  and  place  of  expression  of  RNAs  during  development,  physiological  response  and  disease;  and  subcellular  localization  and  intermolecular  interaction  of  protein  products.  The  traditional  gene-by-gene  approach  will  not  suffice  to  meet  the  sheer  magnitude  of  the  problem.  It  will  be  necessary  to  take  `global  views'  of  biological  processes:  simultaneous  readouts  of  all  components.  Arrays  offer  the  first  great  hope  for  such  global  views  by  providing  a  systematic  way  to  survey  DNA  and  RNA  variation.  
They  seem  likely  to  become  a  standard  tool  of  both  molecular  biology  research  and  clinical  diagnostics.  These  prospects  have  attracted  great  interest  and  investment  from  both  the  public  and  private  sectors.  The  reviews  in  this  supplement  describe  important  issues  in  this  fast-moving  area2-12.
0	used in semiconductor manufacture to produce arrays with 400,000 distinct oligonucleotides, each in its own 20 µm² region15. Other companies are developing in situ synthesis with reagents delivered by ink-jet printer devices. The new generation of array technologies is still in its infancy. As one reviewer wryly notes8, the scientific literature contains more reviews about arrays than primary research papers applying them. The techniques have become established in only a few places. The tools remain prohibitively expensive for many laboratories (owing to the actual capital cost of setting up an arraying facility or the amortized capital costs reflected in the purchase price of arrays). Still, these problems are likely to be solved by economies of scale, free-market competition and time--just as they are for new generations of computer microprocessors.
0	differed  (for  example,  in  metastatic  versus  nonmetastatic  derivatives  of  a  tumour  cell  line).  Deeper  biological  insight  is  likely  to  emerge  from  examining  datasets  with  scores  of  samples--for  example,  multiple  time  points  from  multiple  cell  lines  treated  independently  with  multiple  growth  factors.  Each  gene  defines  a  point  in  k-dimensional  space  (where  k  is  the  number  of  samples  studied),  and  functional  similarities  are  likely  to  reveal  themselves  as  `clusters'  in  this  space.  Computational  scientists  working  in  the  field  of  `data  mining'  have  devised  a  dizzying  assortment  of  techniques  for  clustering,  predicting  and  visualizing  patterns  in  high-dimensional  space--most  based  on  inherent  assumptions  about  the  types  of  patterns  to  be  found.  Empirical  exploration  will  be  needed  to  flesh  out  which  types  of  datasets  and  analytical  tools  will  be  most  fruitful  for  biology.  How  well  can  causation  be  inferred  from  correlation?  The  problem  is  akin  to  inferring  the  design  of  a  microprocessor  based  on  the  readout  of  its  transistors  in  response  to  a  variety  of  inputs.  The  task  is  impossible  in  a  strict  mathematical  sense,  in  that  the  microprocessor  layout  could  be  arbitrarily  complicated,  but  is  likely  to  prove  at  least  somewhat  tractable  in  a  more  constrained  biological  setting,  especially  when  combined  with  ways  to  cut  specific  wires  in  biological  circuits  using  antisense  and  related  techniques.  The  great  opportunities  ahead  would  well  justify  an  influx  of  bright  young  computational  scientists  and  technologists  into  biology.
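0	As a hedged illustration of this geometric view, the sketch below treats each gene as a vector of k expression values and uses Pearson correlation as the similarity measure. The gene names, the values and the choice of correlation are assumptions made for the example only; they are one of many possible choices and are not taken from any particular study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def most_similar(name, profiles):
    """Return the other gene whose profile correlates best with `name`."""
    others = (g for g in profiles if g != name)
    return max(others, key=lambda g: pearson(profiles[name], profiles[g]))


# Hypothetical expression values for three genes across k = 4 samples.
profiles = {
    "gene_a": [0.1, 1.0, 2.1, 3.0],  # rises across the samples
    "gene_b": [0.0, 0.9, 2.0, 3.2],  # rises in step with gene_a
    "gene_c": [3.0, 2.0, 1.1, 0.0],  # falls across the samples
}

print(most_similar("gene_a", profiles))
```

With these made-up profiles, gene_a and gene_b rise together and would fall into the same cluster, whereas gene_c is anticorrelated with both. A different similarity measure (for example, Euclidean distance) could group the same data differently, which is the point made above about inherent assumptions.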
0	DNA variation
0	Arrays can also be used to study DNA, with the primary application being identification and genotyping of mutations and polymorphisms. These applications pose rather different challenges than RNA expression monitoring, and many issues remain to be worked out. Identification of novel DNA variants has largely been the province of oligonucleotide, as opposed to spotted, arrays7,9. Exploiting the ability to perform custom synthesis at high density, one can construct a `tiling' array to scan a target sequence for mutations. Each overlapping 25-mer in the sequence is covered by four complementary oligonucleotide probes that differ only by having A, T, C or G substituted at the central position. An amplified product containing the expected sequence will hybridize best to the expected probe, whereas a sequence variation will typically alter the hybridization pattern. Such tiling arrays have been used to detect variants in such targets as the HIV genome, human mitochondria and the gene encoding p53. In such specific settings, the process can be optimized to have high specificity and sensitivity. The approach has also been used for much larger surveys--for example, a set 
0	Microarray  data  normalization  and  transformation
1	John  Quackenbush
0	The goal of most microarray experiments is to survey patterns of gene expression by assaying the expression levels of thousands to tens of thousands of genes in a single assay. Typically, RNA is first isolated from different tissues, developmental stages, disease states or samples subjected to appropriate treatments. The RNA is then labeled and hybridized to the arrays using an experimental strategy that allows expression to be assayed and compared between appropriate sample pairs. Common strategies include the use of a single label and independent arrays for each sample, or a single array with distinguishable fluorescent dye labels for the individual RNAs. Regardless of the approach chosen, the arrays are scanned after hybridization and independent grayscale images, typically 16-bit TIFF (Tagged Image File Format) images, are generated for each pair of samples to be compared. These images must then be analyzed to identify the arrayed spots and to measure the relative fluorescence intensities for each element. There are many commercial and freely available software packages for image quantitation. Although there are minor differences between them, most give high-quality, reproducible measures of hybridization intensities. For the purpose of the discussion here, we will ignore the particular microarray platform used, the type of measurement reported (mean, median or integrated intensity, or the average difference for Affymetrix GeneChips™), the background correction performed, or spot-quality assessment and trimming used. As our starting point, we will assume that for each biological sample we assay, we have a high-quality measurement of the intensity of hybridization for each gene element on the array. 
The  hypothesis  underlying  microarray  analysis  is  that  the  measured  intensities  for  each  arrayed  gene  represent  its  relative  expression  level.  Biologically  relevant  patterns  of  expression  are  typically  identified  by  comparing  measured  expression  levels  between  different  states  on  a  gene-by-gene  basis.  But  before  the  levels  can  be  compared  appropriately,  a  number  of  transformations  must  be  carried  out  on  the  data  to  eliminate  questionable  or  low-quality  measurements,  to  adjust  the  measured  intensities  to  facilitate  comparisons,  and  to  select  genes  that  are  significantly  differentially  expressed  between  classes  of  samples.
0	Expression ratios: the primary comparison
0	Most microarray experiments investigate relationships between related biological samples based on patterns of expression, and the simplest approach looks for genes that are differentially expressed. If we have an array that has $N_{array}$ distinct elements, and compare a query and a reference sample, which for convenience we will call R and G, respectively (for the red and green colors commonly used to represent array data), then the ratio (T) for the ith gene (where i is an index running over all the arrayed genes from 1 to $N_{array}$) can be written as
0	$$T_i = \frac{R_i}{G_i}.$$
0	(Note that this definition does not limit us to any particular array technology: the measures $R_i$ and $G_i$ can be made on either a single array or on two replicate arrays. Furthermore, all the transformations described below can be applied to data from any microarray platform.) Although ratios provide an intuitive measure of expression changes, they have the disadvantage of treating up- and downregulated genes differently. Genes upregulated by a factor of 2 have an expression ratio of 2, whereas those downregulated by the same factor have an expression ratio of 0.5. The most widely used alternative transformation of the ratio is the logarithm base 2, which has the advantage of producing a continuous spectrum of values and treating up- and downregulated genes in a similar fashion. Recall that logarithms treat numbers and their reciprocals symmetrically: log2(1) = 0, log2(2) = 1, log2(1/2) = -1, log2(4) = 2, log2(1/4) = -2, and so on. The logarithms of the expression ratios are also treated symmetrically, so that a gene upregulated by a factor of 2 has a log2(ratio) of 1, a gene downregulated by a factor of 2 has a log2(ratio) of -1, and a gene expressed at a constant level (with a ratio of 1) has a log2(ratio) equal to zero. For the remainder of this discussion, log2(ratio) will be used to represent expression levels.
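0	As a minimal sketch of the transformation just described, the ratio and its base-2 logarithm can be computed as follows in Python. The gene names and intensity values are hypothetical, chosen only so that the symmetry is easy to see.

```python
import math


def log2_ratio(r, g):
    """Return log2(R/G) for one array element; both intensities must be positive."""
    if r <= 0 or g <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(r / g)


# Hypothetical (query R, reference G) intensities for three array elements.
spots = {
    "gene_up": (2000.0, 1000.0),    # upregulated by a factor of 2
    "gene_down": (1000.0, 2000.0),  # downregulated by a factor of 2
    "gene_flat": (1500.0, 1500.0),  # unchanged
}

ratios = {name: r / g for name, (r, g) in spots.items()}
log_ratios = {name: log2_ratio(r, g) for name, (r, g) in spots.items()}

for name in spots:
    print(name, ratios[name], log_ratios[name])
```

On the ratio scale the two changes of equal magnitude appear as 2 and 0.5, whereas on the log2 scale they appear as +1 and -1, equally distant from zero.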
0	[Figure panel: R-I plot, raw data]
0	Normalization
0	Typically, the first transformation applied to expression data, referred to as normalization, adjusts the individual hybridization intensities to balance them appropriately so that meaningful biological comparisons can be made. There are a number of reasons why data must be normalized, including unequal quantities of starting RNA, differences in labeling or detection efficiencies between the fluorescent dyes used, and systematic biases in the measured expression levels. Conceptually, normalization is similar to adjusting expression levels measured by northern analysis or quantitative reverse transcription PCR (RT-PCR) relative to the expression of one or more reference genes whose levels are assumed to be constant between samples. There are many approaches to normalizing expression levels. Some, such as total intensity normalization, are based on simple assumptions. Here, let us assume that we are starting with equal quantities of RNA for the two samples we are going to compare. Given that there are millions of individual RNA molecules in each sample, we will assume that the average mass of each molecule is approximately the same, and that, consequently, the number of molecules in each sample is also the same. Second, let us assume that the arrayed elements represent a random sampling of the genes in the organism. This point is important because we will also assume that the arrayed elements randomly interrogate the two RNA samples. If the arrayed genes are selected to represent only those we know will change, then we will likely over- or under-sample the genes in one of the biological samples being compared. If the array contains a large enough assortment of random genes, we do not expect to see such bias. This is because for a finite RNA sample, when representation of one RNA species increases, representation of other species must decrease. Consequently, approximately the same number of labeled molecules from each sample should hybridize to the arrays and, therefore, the total hybridization intensities summed over all elements in the arrays should be the same for each sample. Using this approach, a normalization factor is calculated by summing the measured intensities in both channels
0	$$N_{total} = \frac{\sum_{i=1}^{N_{array}} R_i}{\sum_{i=1}^{N_{array}} G_i},$$
0	where $G_i$ and $R_i$ are the measured intensities for the ith array element (for example, the green and red intensities in a two-color microarray assay) and $N_{array}$ is the total number of elements represented in the microarray. One or both intensities are appropriately scaled, for example,
0	$$G'_k = N_{total} G_k \quad \text{and} \quad R'_k = R_k,$$
0	so that the normalized expression ratio for each element becomes
0	$$T'_i = \frac{R'_i}{G'_i} = \frac{1}{N_{total}} \frac{R_i}{G_i},$$
0	which adjusts each ratio such that the mean ratio is equal to 1. This process is equivalent to subtracting a constant from the logarithm of the expression ratio,
0	$$\log_2(T'_i) = \log_2(T_i) - \log_2(N_{total}),$$
0	which  results  in  a  mean  log2(ratio)  equal  to  zero.  There  are  many  variations  on  this  type  of  normalization,  including  scaling  the  individual  intensities  so  that  the  mean  or  median  intensities  are  the  same  within  a  single  array  or  across  all  arrays,  or  using  a  selected  subset  of  the  arrayed  genes  rather  than  the  entire  collection.
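0	The total intensity normalization described above can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions, not a reference implementation: the function name and the intensity values are hypothetical, only the green channel is rescaled, and no background correction or spot filtering is attempted.

```python
import math


def total_intensity_normalize(red, green):
    """Rescale the green channel so both channels have equal summed intensity.

    Returns (n_total, normalized log2 ratios), where n_total = sum(R) / sum(G).
    """
    if len(red) != len(green) or not red:
        raise ValueError("channels must be non-empty and of equal length")
    n_total = sum(red) / sum(green)
    scaled_green = [n_total * g for g in green]  # G'_k = N_total * G_k
    # R'_k = R_k, so the normalized ratio is R_k / G'_k.
    log_ratios = [math.log2(r / g) for r, g in zip(red, scaled_green)]
    return n_total, log_ratios


# Hypothetical raw intensities: the red channel is uniformly four times brighter,
# as might happen with unequal labeling or detection efficiency.
red = [400.0, 800.0, 1600.0, 1200.0]
green = [100.0, 200.0, 400.0, 300.0]

n_total, log_ratios = total_intensity_normalize(red, green)
print(n_total, log_ratios)
```

In this contrived case every raw ratio equals 4, so the normalization factor is 4 and all normalized log2(ratio) values collapse to zero, which is the behaviour expected when the only difference between the channels is a global scale factor.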
0	Lowess normalization
0	In addition to total intensity normalization described above, there are a number of alternative approaches to normalizing expression ratios, including linear regression analysis1, log centering, rank invariant methods2 and Chen's ratio statistics3, among others. However, none of these approaches takes into account systematic biases that may appear in the data. Several reports have indicated that the log2(ratio) values can have a systematic dependence on intensity4,5, which most commonly appears as a deviation from zero for low-intensity spots. Locally weighted linear regression (lowess)6 analysis has been proposed4,5 as a normalization method that can remove such intensity-dependent effects in the log2(ratio) values. The easiest way to visualize intensity-dependent effects, and the starting point for the lowess analysis described here, is to plot the measured $\log_2(R_i/G_i)$ for each element on the array as a function of the $\log_{10}(R_i \cdot G_i)$ product intensities. This `R-I' (for ratio-intensity) plot can reveal intensity-
0	research  focus
0	Protein  microarray  technology
1	Markus  F.  Templin,  Dieter  Stoll,  Monika  Schrenk,  Petra  C.  Traub,  Christian  F.  Voehringer  and  Thomas  O.  Joos
0	Microarray  technology  allows  the  simultaneous  analysis  of  thousands  of  parameters  within  a  single  experiment.  Microspots  of  capture  molecules  are  immobilised  in  rows  and  columns  onto  a  solid  support  and  exposed  to  samples  containing  the  corresponding  binding  molecules.  Readout  systems  based  on  fluorescence,  chemiluminescence,  mass  spectrometry,  radioactivity  or  electrochemistry  can  be  used  to  detect  complex  formation  within  each  microspot.  Such  miniaturised  and  parallelised  binding  assays  can  be  highly  sensitive,  and  the  extraordinary  power  of  the  method  is  exemplified  by  array-based  gene  expression  analysis.  In  these  systems,  arrays  containing  immobilised  DNA  probes  are  exposed  to  complementary  targets  and  the  degree  of  hybridisation  is  measured.  Recent  developments  in  the  field  of  protein  microarrays  show  applications  for  enzyme-substrate,  DNA-protein  and  different  types  of  protein-protein  interactions.  This  article  discusses  theoretical  advantages  and  limitations  of  any  miniaturised  capture-molecule-ligand  assay  system  and  discusses  how  the  use  of  protein  microarrays  will  change  diagnostic  methods  and  genome  and  proteome  research.
0	The fundamental principles of miniaturised
0	and  parallelised  microspot  ligand-binding  assays  were  described  more  than  a  decade  ago.  In  the  `ambient  analyte  theory',  Roger  Ekins  and  coworkers  [1-4]  explained  why  microspot  assays  are  more  sensitive  than  any  other  ligand-binding  assay.  At  that  time,  the  high  sensitivity  and  enormous  potential  of  microspot  technology  had  already  been  demonstrated  using  miniaturised  immunological  assay  systems.  Nevertheless,  the  enormous  interest  that  microarray-based  assays  evoked  came  from  work  using  DNA  chips.  The  possibility  of  determining  thousands  of  different  binding  events  in  one  reaction  in  a  massively  parallel  fashion  perfectly  suited  the  needs  of  genomic  approaches  in  biology.  The  rapid  progress  in  whole-genome  sequencing  (e.g.  [5,6])  and  the  increasing  importance  of  expression  studies  (expressed  sequence  tag  [EST]  sequencing)  was  matched  with  efficient  in  vitro  techniques  for  synthesising  specific  capture  molecules  for  ligand-binding  assays.  Oligonucleotide  synthesis  and  PCR  amplification  allow  thousands  of  highly  specific  capture  molecules  to  be  generated  efficiently.  New  trends  in  technology,  mainly  in  microtechnology  and  microfluidics,  newly  established  detection  systems  and  improvements  in  computer  technology  and  bioinformatics  were  rapidly  integrated  into  the  development  of  microarray-based  assay  systems.  Now,  DNA  microarrays,  some  of  them  built  from  tens  of  thousands  of  different  oligonucleotide  probes  per  square  centimetre,  are  well-established  high-throughput  hybridisation  systems  that  generate  huge  sets  of  genomic  data  within  a  single  experiment  (Fig.  1).  
Their  use  for  the  analysis  of  single  nucleotide  polymorphisms  and  in  expression  profiling  has  already  changed  pharmaceutical  research,  and  their  use  as  diagnostic  tools  will  have  a  big  impact  on  medical  and  biological  research.  As  known  from  gene  expression  studies,  however,  mRNA  level  and  protein  expression  do  not  necessarily  correlate  [7-9].  Protein  functionality  is  often  dependent  on  posttranslational  processing  of  the  precursor  protein  and  regulation  of  cellular  pathways  frequently  occurs  by  specific  interaction  between  proteins  and/or  by  reversible  covalent  modifications  such  as  phosphorylation.  To  obtain  detailed  information  about  a  complex  biological  system,  information  on  the  state  of  many  proteins  is  required.  The  analysis  of  the  proteome  of  a  cell  (i.e.  the  quantification  of  all  proteins  and  the  determination  of  their  post-translational  modifications  and  how  these  are  dependent  on  cell-state  and  environmental  influences)  is  not  possible  without  novel  experimental  approaches.  High-throughput  protein  analysis  methods  allowing  a  fast,  direct  and  quantitative  detection  are  needed.  Efforts  are  underway,  therefore,  to  expand  microarray  technology  beyond  DNA  chips  and
0	establish array-based approaches to characterise proteomes (Fig. 1) [10-12].
0	[Figure: internal parameters (genetic, aging, diseases) and external parameters (drugs, environment) acting on the cell; array applications: genetic analysis (SNP, mutation, sequencing), expression analysis (mRNA, protein), interaction analysis (protein-protein, antigen-antibody, enzyme-substrate, protein-DNA, ligand-receptor). Source: Drug Discovery Today]
0	[Figure: signal, log (total intensity), and signal density, log (signal/area), plotted against total amount of antibody; signal density decreases.]
0	Miniaturised  ligand-binding  assays:  theoretical  considerations
0	The ambient analyte assay theory shows that miniaturised ligand-binding assays are able to achieve a superior sensitivity. A system that uses a small amount of capture molecules and a small amount of sample can be more sensitive than a system that uses a hundred times more material. Ekins and coworkers [1-4] developed a sensitive microarray-based analytical technology and proved the high sensitivity of the miniaturised assay. With this system, analytes, such as thyroid stimulating hormone (TSH) or Hepatitis B surface antigen (HBsAg), could be quantified down to the femtomolar concentration range (corresponding to $10^6$ molecules ml$^{-1}$). Miniaturisation is the key to understanding the principle of miniaturised binding assays. Capture molecules are immobilised to the solid phase only in a very small area, the microspot; although the amount of capture molecules present in the system is low, a high density of molecules in the microspot can be obtained (Fig. 2). During an assay, target molecules, or analytes, are captured by the microspot but the number of
0	[Figure: assay workflow. Panel labels: DNA, mRNA, protein, biological target; amplification and (different) labeling; competitive binding; differentially regulated targets; quantification.]
0	THE  USE  AND  ANALYSIS  OF  MICROARRAY  DATA
1	Atul  Butte
0	Functional  genomics  is  the  study  of  gene  function  through  the  parallel  expression  measurements  of  genomes,  most  commonly  using  the  technologies  of  microarrays  and  serial  analysis  of  gene  expression.  Microarray  usage  in  drug  discovery  is  expanding,  and  its  applications  include  basic  research  and  target  discovery,  biomarker  determination,  pharmacology,  toxicogenomics,  target  selectivity,  development  of  prognostic  tests  and  disease-subclass  determination.  This  article  reviews  the  different  ways  to  analyse  large  sets  of  microarray  data,  including  the  questions  that  can  be  asked  and  the  challenges  in  interpreting  the  measurements.
0	NATURE  REVIEWS  |  DRUG  DISCOVERY
0	Nature  Publishing  Group
0	[Figure: microarray workflow. Tissue or tissue under influence; cDNA or cRNA copy, tagged or incorporating fluor; cDNA spotted on glass slide or oligonucleotides built on slide; fluorescent intensities scanned into computer.]
0	SPLINES
0	Instead of fitting a complex polynomial curve to data, splines allow the fitting of data by putting together smaller, less complex curves.
0	NORTHERN  BLOT
0	Different  RNA  molecules  are  separated  by  mass  on  a  gel,  then  radioactively  labelled  complementary  DNA  or  RNA  molecules  are  used  to  quantify  specific  RNA  amounts.
0	REVERSE TRANSCRIPTION
0	The synthesis of a strand of DNA from RNA, which is used to make a complementary DNA copy of sample RNA.
0	determine differences in gene expression in tissues exposed to various doses of compounds; toxicogenomics, to find gene-expression patterns in a model tissue or organism exposed to a compound and their use as early predictors of adverse events in humans; target selectivity, to define a compound by the gene-expression pattern it provokes in a target tissue and then compare it with other compounds using these patterns; prognostic tests, to find a set of genes that accurately distinguishes one disease from another; and disease-subclass determination, to find multiple subcategories of tumours in a single clinical diagnosis. Many free (BOX 1) and commercial software packages are now available to analyse microarray data sets, although it is still difficult to find a single off-the-shelf software package that answers all functional-genomics questions. As the field is still young, when developing a bioinformatics analysis pipeline, it is more important to have a good understanding of both the biology involved and the analytical techniques rather than having the right software. This article reviews the different ways to analyse microarray data, and will concentrate on choosing the appropriate method for the given hypothesis.
0	Normalization and noise
0	Before multiple microarray measurements can be integrated into a single analysis, the reported measurements need to be normalized, or modified (possibly corrected) to make them comparable. When microarrays are used to collect gene-expression data in an experiment in which the measurements are made at the same time, with homogeneous populations of similar cells and using a single microarray technology, normalization might simply be a matter of adjusting the overall brightness of each scanned microarray image, assuming that the quantity of RNA is equal4. Other normalization methods include: using expression levels of `housekeeping' genes5; using assumptions that most genes do not change across experiments6; using SPLINES7; or other nonlinear techniques8,9. Typically, however, functional-genomics experiments are more complicated. Recently, increasing efforts have been invested in characterizing the `noise' in microarray technology. Studies addressing the reproducibility of microarray data analysed replicated data10, compared microarray measurements with NORTHERN BLOTS11,12 and SAGE13, and evaluated strategies for REVERSE TRANSCRIPTION14 and in vitro transcription amplification15. As a result, it has become increasingly clear that there are several substantial sources of noise in microarray data. Intra- and inter-microarray variations can markedly skew the interpretation of such expression data. First, improving the reliability of expression measurements starts with proper experimental design. For example, microarrays can measure across the genome, including genes with expression that is controlled by hormones, such as growth hormone or cortisol. So, if organ samples are acquired at various times during the day, genes that appear to be differentially expressed might only be reflecting normal circadian physiology. Pooling samples before hybridization might control for this biological `noise.' In addition, scanned hybridization images need to be inspected for artefacts, such as scratches and bubbles16,17. 
Measuring  replicate  microarrays  for  each  biological  sample  allows  the  modelling  of  this  technical  noise.
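The two simplest normalization ideas mentioned above (adjusting overall brightness, and scaling to 'housekeeping' genes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited study: the function names, the use of the arithmetic mean as the brightness summary, and the toy intensities are all hypothetical.

```python
from statistics import mean


def global_scale(arrays, target=None):
    """Rescale each array so that its mean intensity equals a shared target.

    `arrays` maps an array name to a list of raw intensities. If no target
    is given, the mean of the per-array means is used (an assumption).
    """
    means = {name: mean(values) for name, values in arrays.items()}
    if target is None:
        target = mean(means.values())
    return {
        name: [v * target / means[name] for v in values]
        for name, values in arrays.items()
    }


def housekeeping_scale(arrays, housekeeping_idx, target=None):
    """Rescale each array using only the probes listed in `housekeeping_idx`."""
    refs = {
        name: mean(values[i] for i in housekeeping_idx)
        for name, values in arrays.items()
    }
    if target is None:
        target = mean(refs.values())
    return {
        name: [v * target / refs[name] for v in values]
        for name, values in arrays.items()
    }


if __name__ == "__main__":
    # Hypothetical data: the same sample scanned twice, once twice as bright.
    raw = {
        "array_1": [100.0, 200.0, 300.0],
        "array_2": [200.0, 400.0, 600.0],
    }
    print(global_scale(raw))
    print(housekeeping_scale(raw, housekeeping_idx=[0]))
```

Both functions apply a single multiplicative factor per array, which is why they can only correct overall brightness differences; intensity-dependent effects would need the spline or other nonlinear approaches listed above.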
0	Nature  Publishing  Group
0	Most reported expression data have been obtained on relatively homogeneous cell populations. However, when RNA is extracted from whole organs or from tumour biopsies, the sources of variation increase. There is substantial heterogeneity of expression in cell subpopulations in most organs and in many tumours. Failure to account for such variation could lead to overinterpretation or spurious functional gene association. Microdissection of cell subpopulations (for example, with laser capture18) is possible only in a minority of the systems of interest. If microarray-based gene-expression measurements are to be reliable and economical, both at the level of basic biology and clinical assays, then all of these further sources of noise/variation must be incorporated directly into the analytical tools that interpret these data. A further issue that needs to be addressed is the difference between the two most commonly used microarray technologies: spotted cDNA microarrays, which report differences in gene expression between two samples, and oligonucleotide microarrays, which report absolute expression levels. Normalization techniques for one microarray technology might not apply to another, owing to differences in assumptions and the distributions of the output measurements. For example, if we assume th
0	A  strategy  for  optimizing  quality  and  quantity  of  DNA  extracted  from  soil
1	Helmut Bürgmann a,*, Manuel Pesaro a, Franco Widmer a,b, Josef Zeyer a
0	Keywords:  DNA;  Bead  beating;  Soil;  Extraction
0	Introduction
0	Molecular ecology relies heavily on methods for the direct extraction of DNA from environmental samples. Molecular methods for the analysis of gene pools using polymerase chain reaction (PCR) or cloning techniques rely on high quality nucleic acids as template, as these techniques require pure, unfragmented DNA templates. Extraction of pure nucleic acids from soil samples has been a challenge because of the complex and heterogeneous nature of the soil matrix and the inhibition of biochemical reactions by coextracted substances such as humic acids (Porteous and Armstrong, 1993; Steffan and Atlas, 1991; Young et al., 1993). The efficiency of the extraction is of equal importance. High DNA yields are important to obtain a low detection limit and to ensure the
0	from  soil  with  a  method  optimized  for  quality  of  the  extracted  DNA,  and  we  investigate  the  impact  of  the  extraction  method  on  the  apparent  microbial  community.
0	Materials and methods
0	2.1. Soil sampling and storage
0	One agricultural and five forest soil samples were collected in August 1999 from sites in northern Switzerland and the upper Rhône valley in southern Switzerland. They represent a range of typical European soils with respect to parameters like pH, texture and organic matter content (Table 1) (Favre, 1982; Richard and Lüscher, 1983). At each site, a block of soil was removed with a spade and the A horizons were separated and transported to the laboratory in plastic bags. All soils were passed through a 2.5-mm sieve and stored at 10 °C. DNA extractions were performed after an equilibration time of at least 3 weeks. While this method of storage allows for some change in the microbial communities over time, it was undesirable to freeze soil samples because of the additional physical stress introduced by freezing and thawing.
0	2.2. DNA extraction procedures
0	Extractions were performed with a modification of a buffer previously described for RNA extraction (Cheung et al., 1994). The buffer contains 0.2% hexadecyltrimethylammonium bromide (CTAB), 1 mM dithiothreitol (DTT), 0.2 M sodium phosphate buffer (pH 8), 0.1 M NaCl and 50 mM EDTA. Silica or ceramic beads (Table 2, types A, C, and D) or bead mixtures (Table 2, types B and E) were weighed into sterile 2-ml microtubes, an amount of soil was added and the buffer was pipetted directly into the tube. The tubes were processed in the bead-beater (FastPrep FP120 bead-beater, Bio101/Savant, Farmingdale, NY), which allowed simultaneous processing of up to 12 samples. The machine supports beating speeds (maximum speed of the tube during vertical movement) between 4.0 and 6.5 m s−1 (in 0.5 m s−1 increments), corresponding to approxi-
1	Osterliwald  Rafz  Steig  Winzlerboden
0	Data from Favre (1982) and Richard and Lüscher (1983). Gartenacker from the upper Rhône valley in southern Switzerland, all other soils from northern Switzerland.
0	Experiment | FastPrep parameters | Bead types | Amount of beads | Temperature | Reextraction c | Maximum extraction d | Comparison of soils
0	Beads (type) a | A | A, B, C, D, E | A | A | A | A | A
0	Three-Detergent  Method  for  the  Extraction  of  RNA  from  Several  Bacteria
0	Recent  trends  in  molecular  bacteriology  have  highlighted  the  importance  of  examining  and  comparing  gene  expression  in  different  species  in  many  cases.  Also,  studies  with  a  number  of  different  bacterial  strains  may  be  required  when  working  on  their  ecology  or  population  biology.  In  all  such  cases,  high-efficiency  protocols  applicable  to  a  variety  of  bacteria  are  relevant.  A  potential  hurdle  in  the  isolation  of  intact
0	RNA from bacteria is the relatively short half-life of the messenger RNA. Hence, the rapidity of cellular lysis and complete inhibition of RNases are of particular importance in such protocols. A mixture of detergents at low pH was previously shown to be efficient for cellular lysis of mycobacteria (4). On this basis, we have developed a three-detergent method for the isolation of RNA from several gram-negative bacterial species. In our method, cellular lysis is achieved through a combination of SDS, Tween® 20 and Triton® X-100, while genomic DNA contamination is reduced through acid depurination-cum-deproteination using citrate-buffered phenol (pH 4.0). The three detergents are readily available: SDS is
0	tity of the RNA obtained. The RNA yields ranged between 21.8 and 47.2 µg RNA/mL starting culture, and the A260/A280 nm ratios were between 1.80 and 2.09. Figure 1A shows the gel profile of total RNA obtained from P. putida wild-type using different methods. A non-denaturing gel was used because it shows more clearly both the RNA quantity and quality and the degree of persisting DNA. Figure 1, lane 3 shows that the quantity of RNA isolated using the three-detergent technique was significantly higher than when a single detergent was used (Figure 1, lanes 1 and 2, 2% and 5% SDS, respectively). Having established that this three-detergent method was the most efficient, we then proceeded to optimize the reduction of chromosomal DNA carry-over. The persisting DNA and RNA yields obtained from LiCl precipitation for 1 h, 3 h and overnight are shown in Figure 1, lanes 4-6, respectively. Total yields are reduced, but so is the persisting DNA. Lane 7 shows
0	Table 1. Average RNA Recovery from Different Bacterial Strains, Based on Three Experiments
0	Strain | Yield (µg RNA/mL Starting Culture)
0	P. putida 39169 | 47.2 ± 3.2
0	P. putida 39169 | 8.1 ± 22.1
0	P. putida 39169 | 25.1 ± 2.8
0	P. putida 39169 | 34.5 ± 2.7
0	Epicurian coli® XL1-Blue | 35.7 ± 2.0
0	P. aeruginosa BO267 | 46.9 ± 3.3
0	E. tarda PPD 130/91 | 25.3 ± 2.4
0	B. cepacia 53267 | 21.8 ± 2.0
0	A. tumefaciens AGL1 | 24.4 ± 1.7
0	B. cereus 14579 | 37.4 ± 2.4 (24)b
0	B. subtilis 6051 | 39.2 ± 1.9 (9)b
0	Cells were lysed in 20 mL of STT extraction buffer, and RNA was precipitated with 1 vol of isopropanol; method 2: as in method 1, but with an additional lysozyme treatment prior to cell lysis with STT; method 3: RNA was precipitated with LiCl for either 1 h, 3 h or overnight; method 4: RNA was first precipitated with isopropanol and then DNase-treated. b Yields obtained if lysozyme treatment was omitted.
0	the RNA obtained from isopropanol precipitation followed by DNase I treatment. The contaminating DNA is fully removed, and the RNA yields are still higher (1.1- to 3.1-fold) than those obtained from LiCl precipitation. RNA was also isolated u
0	mRNA  Extraction  and  Reverse  Transcription-PCR  Protocol  for  Detection  of  nifH  Gene  Expression  by  Azotobacter  vinelandii  in  Soil
1	Helmut Bürgmann,1* Franco Widmer,2 William V. Sigler,1 and Josef Zeyer1
0	Soil Biology, Institute of Terrestrial Ecology, Swiss Federal Institute of Technology (ETH-Zürich), CH-8952 Schlieren,1 and Swiss Federal Research Station for Agroecology and Agriculture (FAL Reckenholz), CH-8046 Zürich,2 Switzerland
0	A.  VINELANDII  nifH  ACTIVITY  IN  SOIL  AND  LIQUID  CULTURE  TABLE  1.  Starting  conditions  for  the  experimental  treatments  and  controls
0	A. vinelandii concn (cells ml−1 or cells g−1)b
0	Sucrose  concn  (%)c
0	NH4NO3  concn  (  mol  ml  1  or  mol  g
0	No.  of  replicates
0	LC  N  LC  N  SC  N  SC  N  LC  control  SC  control  Reference  soil
0	Liquid  medium  Liquid  medium  Sterile  soil  Sterile  soil  Liquid  medium  Sterile  soil  Nonsterile  soil
0	a The liquid medium was ATCC 14 medium, and the soil was Pappelacker (see text). b Strain DSM 85. c The concentration in liquid medium was 2% (wt/vol), and the concentration in soil was 2% (wt/wt). d Concentration of NH4NO3 added. The soil contained additional indigenous nitrogen. e NA, not applicable.
0	most previous investigations high-density inoculation or very active communities were required in order to reliably detect mRNA. Reliable extraction of mRNA from soil is still considered a challenge in soil microbiological research (17). Recent progress in extraction technology, however, has shown that the approach is feasible (19). Here we describe an effective total RNA extraction protocol based on a previously described direct extraction procedure for total nucleic acids (8). Azotobacter vinelandii, an aerobic free-living soil diazotroph, was cultivated in a previously sterilized soil and in liquid culture. This system was used to establish and verify a method for nifH mRNA extraction and detection by reverse transcription (RT) and PCR. N fixation was either induced by providing excess organic carbon (sucrose) or repressed by providing excess bioavailable N (NH4NO3). Population growth, bulk N-fixing activities, and nifH mRNA expression were monitored and compared in order to link nifH gene expression to N-fixing activity in a soil environment.
0	aubergine  enhances  oskar  translation  in  the  Drosophila  ovary
1	Joan  E.  Wilson,  Joanne  E.  Connell  and  Paul  M.  Macdonald*
0	Key  words:  aubergine,  oskar,  translation,  maternal  mRNA,  Drosophila
0	RESEARCH  ARTICLE
0	A  Gene  Expression  Map  for  the  Euchromatic  Genome  of  Drosophila  melanogaster
1	Viktor  Stolc,1,5*  Zareen  Gauhar,1,2*  Christopher  Mason,2*  Gabor  Halasz,7  Marinus  F.  van  Batenburg,7,9  Scott  A.  Rifkin,2,3  Sujun  Hua,2  Tine  Herreman,2  Waraporn  Tongprasit,6  Paolo  Emilio  Barbano,2,4  Harmen  J.  Bussemaker,7,8  Kevin  P.  White2,3.
0	We  used  a  maskless  photolithography  method  to  produce  DNA  oligonucleotide  microarrays  with  unique  probe  sequences  tiled  throughout  the  genome  of  Drosophila  melanogaster  and  across  predicted  splice  junctions.  RNA  expression  of  protein  coding  and  nonprotein  coding  sequences  was  determined  for  each  major  stage  of  the  life  cycle,  including  adult  males  and  females.  We  detected  transcriptional  activity  for  93%  of  annotated  genes  and  RNA  expression  for  41%  of  the  probes  in  intronic  and  intergenic  sequences.  Comparison  to  genome-wide  RNA  interference  data  and  to  gene  annotations  revealed  distinguishable  levels  of  expression  for  different  classes  of  genes  and  higher  levels  of  expression  for  genes  with  essential  cellular  functions.  Differential  splicing  was  observed  in  about  40%  of  predicted  genes,  and  5440  previously  unknown  splice  forms  were  detected.  Genes  within  conserved  regions  of  synteny  with  D.  pseudoobscura  had  highly  correlated  expression;  these  regions  ranged  in  length  from  10  to  900  kilobase  pairs.  The  expressed  intergenic  and  intronic  sequences  are  more  likely  to  be  evolutionarily  conserved  than  nonexpressed  ones,  and  about  15%  of  them  appear  to  be  developmentally  regulated.  Our  results  provide  a  draft  expression  map  for  the  entire  nonrepetitive  genome,  which  reveals  a  much  more  extensive  and  diverse  set  of  expressed  sequences  than  was  previously  predicted.  Characterization  of  the  complete  expressed  set  of  RNA  sequences  is  central  to  the  functional  interpretation  of  each  genome.  For  almost  3  decades,  the  analysis  of  the  Drosophila  genome  has  served  as  an  important  model  for  studying  the  relationship  between  gene  expression  and  development.  
In  recent  years,  Drosophila  provided  the  initial  demonstration  that  DNA  microarrays  could  be  used  to  study  gene  expression  during  development  (1),  and  subsequent  large-scale  studies  of  gene  expression  in  this  and  other  developmental  model  organisms  have  given  new  insights  into  how
0	of the human genome and for Arabidopsis (11-13). Microarrays have also recently been used to characterize the great diversity of RNA transcripts brought about by differential splicing in human tissues (14). We used both types of approaches to characterize the Drosophila genome. Experimental design. To determine the expressed portion of the Drosophila genome, we designed high-density oligonucleotide microarrays with probes for each predicted exon and probes tiled throughout the predicted intronic and intergenic regions of the genome. We used maskless array synthesizer (MAS) technology (15, 16) to synthesize custom microarrays containing 179,972 unique 36-nucleotide (nt) probes (17). Of these, 61,371 exon probes (EPs) assayed 52,888 exons from 13,197 predicted genes, 87,814 nonexon probes (NEPs) assayed expression from intronic and intergenic regions, and 30,787 splice junction probes (SJPs) assayed potential exon junctions for a test subset of 3955 genes. For the SJPs, we used 36-nt probes spanning each predicted splice junction, with 18 nt corresponding to each exon (14). RNA from six developmental stages during the Drosophila life cycle (early embryos, late embryos, larvae, pupae, and male and female adults) was isolated and reverse-transcribed in the presence of oligodeoxythymidine and random hexamers, and the labeled cDNA was hybridized to these arrays. The stages were chosen to maximize the number of transcripts that would be differentially expressed between samples on the basis of previous results (3, 7). Each sample was hybridized four times, twice with Cy5 labeling and twice with Cy3 labeling (fig. S1). Genomic and chromosomal expression patterns.
We  determined  which  exon  or  nonexon  probes  correspond  to  genomic  regions  that  are  transcribed  at  any  stage  during  development  (18).  We  used  a  negative  control  probe  (NCP)  distribution  (fig.  S3)  to  score  the  statistical  significance  of  the  EP  or  NEP  signal  intensities  for  each  of  the  24  unique  combinations  of  stage,  dye,  and  array,  correcting  for  probe  sequence  bias  (17,  19).  These  results  were  combined  into  a  single  expression-level  estimate  (19),  a  threshold  for  which  was  determined  by  requiring  a  false  discovery  rate  of  5%  (20).  This  threshold  shows  47,419  of  61,371  EPs  (77%)  and  35,985  out  of  87,814  NEPs  (41%)  were  significantly  expressed  at  some  point  during  the  fly  life  cycle.  Significantly  expressed  EPs  correspond  to  79%  (41,559/52,888)  of  all  exons  probed  and  93%  (12,305/13,197)  of  all  probed  gene  annotations.  Our  results  confirmed  2426  annotated  genes  not  yet  validated  through  an  EST  sequence  (Fig.  1A).  Out  of  10,280  genes  represented  by  EST  sequences,
0	OCTOBER  2004
0	only 401 (3.9%) were not detected in these microarray experiments. Our finding that a large fraction of intergenic and intronic regions (NEPs) is expressed in D. melanogaster mirrors similar observations for chromosomes 21 and 22 in humans (16) and for Arabidopsis (14). These results support the conclusion that extensive expression of intergenic and intronic sequences occurs in the major evolutionary lineages of animals (deuterostomes and protostomes) and in plants. We noted that mRNA expression levels for protein-encoding genes varied with the protein function assigned in the Drosophila Gene Ontology (fig. S2) (21). For example, genes encoding G protein receptors were expressed at relatively low levels, whereas genes encoding ribosomal proteins were highly expressed. A gene's expression level was also associated with cellular compartmentalization and the biological process it mediates (fig. S2). For example, genes encoding cytosolic and cytoskeletal factors were more highly expressed than those predicted to function within organelles such as the endoplasmic reticulum, Golgi, and peroxisome. To determine whether a high level of gene expression was associated with essential genetic functions, we examined the expression levels of genes recently shown to be required for cell viability (Fig. 1B) in a genome-wide RNA interference (RNAi) screen in Drosophila (22). Compared to the rest of the genome, the genes identified as essential by RNAi showed a significant increase in expression during all stages of development (P = 0.0009, t test), even when the highly expressed ribosomal protein genes were omitted (P = 0.0005, t test).
This  result  is  also  consistent  with  the  observation  that  genes  with  mutant  phenotypes  from  the  3-Mbase  Adh  genomic  region  are  overrepresented  in  EST  libraries  (23).  High  levels  of  essential  gene  expression  may  in  part  reflect  widespread  expression  in  cells  throughout  the  animal,  and  the  relative  RNA  expression  level  may  serve  as  a  rough  predictor  of  essential  cellular  function.  We  also  examined  changes  in  gene  expression  during  the  fly  life  cycle  to  determine  what  fraction  of  the  entire  genome  is  differentially  expressed  between  developmental  stages.  Figure  2A  shows  the  expression  signal  intensities  of  transcripts  from  a  typical  50-kilobase  pair  (kbp)  region  of  the  Drosophila  genome  during  each  major  developmental  stage.  Stage-specific  variation  in  expression  is  observed  not  only  for  exon  probes,  as  expected,  but  also  for  intergenic  and  intronic  probes.  We  used  analysis  of  variance  (ANOVA)  (24)  to  systematically  identify  probes  as  differentially  expressed  at  a  false  discovery  rate  of  5%  (16).  As  expected,  the  majority  of  probes  detecting  differentially  expressed  sequences  are  also  expressed  above  background  noise  level  (89%  of  EPs  and  81%  of  NEPs)  (17)  (Table  1).  We  found  27,176  EPs  to  be  differentially  expressed,  corresponding  to  76%  of  annotated  genes,  and  even  more  when  we  applied  a  less  conservative  background  model  (fig.  S4).  The  fact  that  the 
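The 5% false discovery rate thresholds used above can be illustrated with the Benjamini-Hochberg step-up procedure. This is only a sketch of one common FDR method; the excerpt does not say which procedure references (16) and (20) describe, so treating it as Benjamini-Hochberg is an assumption, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True where the test is called significant.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= (k / m) * fdr, and call the k smallest p-values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank whose p-value falls under its own threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


if __name__ == "__main__":
    # Hypothetical per-probe p-values, e.g. from an ANOVA across stages.
    probe_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(probe_p))
```

Because the rule is step-up, a p-value that misses its own threshold can still be called significant when a larger p-value further down the sorted list passes; this is what distinguishes it from a simple per-test cutoff.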
0	Review  articles
0	Control  of  developmental  timing  by  small  temporal  RNAs:  a  paradigm  for  RNA-mediated  regulation  of  gene  expression
1	Diya  Banerjee  and  Frank  Slack*
0	BioEssays  24.2
0	For the majority of animals, spatial pattern is laid down over time and hence spatial identity is often a result of the temporal sequence of patterning events. The key role that developmental time plays in pattern formation is illustrated in the exquisite series of heterochronic grafting experiments performed by Summerbell et al.(20) When the tips of young chick limb buds are grafted onto older limb buds, the limbs develop with reiterations of limb segments along the proximal-distal (shoulder to fingers) axis, i.e. these limbs develop with two consecutive sets of humerus, radius, and ulna bones (Fig. 1). In the reciprocal heterochronic graft, old limb buds are grafted onto young limb buds and the limbs develop with deletion of segments along the proximal-distal axis, i.e. these limbs develop with a humerus immediately followed by digits, deleting the radius and ulna. The proximal-distal axis of the limb develops over time with the proximal elements being produced first and the distal elements last. Undifferentiated cells in the progress zone divide under the influence of fibroblast growth factors (FGFs) produced from the apical epidermal ridge, the most distal structure in the limb bud. As their daughter cells move away from the FGF signal, they differentiate into limb elements.(21-23) The progress zone model proposes that the
0	length of time that a progenitor cell spends in the progress zone dictates which proximal-distal fates its daughters will assume. Thus, spatial patterning in the proximal-distal axis can be thought of as a consequence of temporal patterning because the specification of each limb element is dependent on the relative age of the progenitor cell in the progress zone. Proximal elements are derived from daughters of younger progenitor cells and distal elements are derived from daughters of older progenitor cells. Another example of dependence on time for correct spatial patterning can be found during anterior-posterior patterning by Hox genes in vertebrates. Hox genes are arranged in linear clusters in which the physical order of individual Hox genes along the DNA correlates with their time of expression as well as their spatial domains of expression along the anterior-posterior axis. As cell proliferation progresses in the posteriorly migrating primitive streak, cells that are derived from developmentally younger progenitors become anteriorly located and express genes in the Hox cluster that are located near the 3′ end of the cluster. More posteriorly located cells derived from older progenitors express genes closer to the 5′ end of the cluster. This correlative relationship, known as "colinearity", emphasizes the intimacy of the relationship between developmental space and time.(24,25) The observation of Hox gene colinearity raises the possibility that temporal and spatial patterning pathways may share common mechanisms and genes.
A  first  hint  of  this  possibility  is  the  recent  observation  that  hunchback  and  kruppel,  two  well-known  regulators  of  spatial  identity  in  Drosophila  embryogenesis,  are  also  required  for  temporal  identity  of  neurons.(26)  Temporal  boundaries  and  segment  identities  Heterochronic  genes  can  be  thought  of  as  the  temporal  equivalents  of  the  homeotic  spatial  patterning  genes.  While  homeotic  mutations  result  in  alterations  as  to  where  particular  cell  fates  are  expressed,  heterochronic  mutations  result  in  temporal  transformations  of  cell  fate,  that  is,  changes  in  when  a  particular  cell  fate  is  expressed  (Fig.  1).  Both  sets  of  genes  generate  graded  levels  of  morphogens  that  modify  a  basic  reiterated  pattern  of  segments.  In  Drosophila,  spatial  patterning  involves  expression  of  segmentation  genes  defining  the  segment  boundaries  in  the  early  embryo,  followed  by  specification  of  segment  identity  by  the  homeotic  genes.  Similarly,  one  can  define  two  broad  classes  of  developmental  timing  genes,  temporal  identity  genes  that  affect  the  fate  choices  that  a  cell  makes  at  a  specific  time  and  temporal  boundary  genes  that  set  the  pace  of  development,  for  example,  the  genes  that  control  the  timing  of  molting.  The  C.  elegans  heterochronic  mutations  identified  thus  far  transform  temporal  cell  fate  identity  without  appreciably  affecting  the  periodicity  of  progression  through  the  larval  stages.  These  mutations  thus  define  temporal  identity  genes.  The  larval  molting  cycle  is  unaffected  by  the  known  heterochronic  mutations  in  C.  elegans,  sug
0	Functional  anatomy  of  siRNAs  for  mediating  efficient  RNAi  in  Drosophila  melanogaster  embryo  lysate
1	Sayda M. Elbashir, Javier Martinez, Agnieszka Patkaniowska, Winfried Lendeckel and Thomas Tuschl1
0	Department of Cellular Biochemistry, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, D-37077 Göttingen, Germany
0	Duplexes of 21-23 nucleotide (nt) RNAs are the sequence-specific mediators of RNA interference (RNAi) and post-transcriptional gene silencing (PTGS). Synthetic, short interfering RNAs (siRNAs) were examined in Drosophila melanogaster embryo lysate for their requirements regarding length, structure, chemical composition and sequence in order to mediate efficient RNAi. Duplexes of 21 nt siRNAs with 2 nt 3′ overhangs were the most efficient triggers of sequence-specific mRNA degradation. Substitution of one or both siRNA strands by 2′-deoxy or 2′-O-methyl oligonucleotides abolished RNAi, although multiple 2′-deoxynucleotide substitutions at the 3′ end of siRNAs were tolerated. The target recognition process is highly sequence specific, but not all positions of a siRNA contribute equally to target recognition; mismatches in the centre of the siRNA duplex prevent target RNA cleavage. The position of the cleavage site in the target RNA is defined by the 5′ end of the guide siRNA rather than its 3′ end. These results provide a rational basis for the design of siRNAs in future gene targeting experiments. Keywords: PTGS/RNA interference/small interfering RNA
0	© European Molecular Biology Organization
0	S.M.Elbashir  et  al.
0	nucleotide  mismatches  between  the  siRNA  duplex  and  the  target  mRNA  abolish  interference.  These  results  provide  a  rational  basis  for  the  design  of  siRNAs  for  future  gene  targeting  experiments.
0	We reported previously that two or three unpaired nucleotides at the 3′ end of siRNA duplexes were more efficient in target RNA degradation than blunt-ended duplexes (Elbashir et al., 2001b). To perform a more comprehensive analysis of the function of the terminal nucleotides, we synthesized five 21 nt sense siRNAs, each displaced by one nucleotide relative to the target RNA, and eight 21 nt antisense siRNAs, each displaced by one nucleotide relative to the target (Figure 1A). By combining these sense and antisense siRNAs, a series of eight siRNA duplexes with symmetric overhanging ends were generated spanning a range from 7 nt 3′ overhang to 4 nt 5′ overhang. The interference was measured using the dual luciferase assay system (Tuschl et al., 1999; Zamore et al., 2000). siRNA duplexes were directed against firefly luciferase mRNA and sea pansy luciferase mRNA was used as internal control. The luminescence ratio of target to control luciferase activity was determined in the presence of siRNA duplex and was normalized to that observed in its absence. For comparison, the interference ratios of long dsRNAs (39-504 bp) are shown in Figure 1B (Elbashir et al., 2001b). The interference ratios were determined at concentrations of 5 nM for long dsRNAs (Figure 1A) and at 100 nM for siRNA duplexes (Figure 1C-J). The 100 nM concentration of siRNAs was chosen because complete processing of 5 nM 504 bp dsRNA would result in 120 nM total siRNA duplexes. The ability of 21 
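The 120 nM figure and the normalized interference ratio described in this passage follow from simple arithmetic, sketched below. The sketch assumes that complete processing cuts the dsRNA into contiguous, non-overlapping 21 bp segments (504 / 21 = 24 duplexes per molecule); the function names and the luminescence readings are hypothetical and are not taken from the paper.

```python
def sirna_duplex_concentration(dsrna_nM, dsrna_length_bp, sirna_length_nt=21):
    """Concentration (nM) of siRNA duplexes after complete dsRNA processing.

    Assumes each dsRNA molecule is cut into contiguous, non-overlapping
    segments of `sirna_length_nt` base pairs.
    """
    duplexes_per_dsrna = dsrna_length_bp // sirna_length_nt
    return dsrna_nM * duplexes_per_dsrna


def interference_ratio(target_with, control_with, target_without, control_without):
    """Target/control luminescence with siRNA, normalized to the ratio without it."""
    return (target_with / control_with) / (target_without / control_without)


if __name__ == "__main__":
    # 5 nM of a 504 bp dsRNA, as in the text.
    print(sirna_duplex_concentration(5, 504))
    # Hypothetical luminescence readings (arbitrary units).
    print(interference_ratio(20.0, 100.0, 80.0, 100.0))
```

A ratio near 1 means the siRNA duplex had no effect on the target relative to the internal control, while values approaching 0 indicate strong interference.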
0	CHAPTER  8
0	Preparation  and  Analysis  of  Pure  Cell  Populations  from  Drosophila
1	Susan Cumberledge and Mark A. Krasnow
0	I. Introduction
0	II. Purifying Embryonic Cells by Fluorescence-Activated Cell Sorting
0	A. Equipment and Reagents
0	B. Methods
0	III. Culturing and Analysis of Purified Cells
0	A. Short-Term Culturing
0	B. Fixation and Staining with Antibodies
0	C. Stable Fluorescent Marking of Purified Cells
0	IV. Conclusions
0	References
0	I.  Introduction
0	As  the  genetic  analysis  of  development  and  cell  function  in  Drosophila  melanogaster  has  burgeoned  over  the  last  15  years,  so  has  our  ability  to  distinguish  various  cell  types  in  developing  tissues,  using  molecular  cell  markers  that  have  become  available  mostly  through  gene  cloning.  As  our  understanding  of  development  and  cell  function  in  vivo  becomes  more  sophisticated,  it  is  increasingly  important  to  isolate  the  various  cell  types  so  that  they  can  be  more  fully  analyzed  and  manipulated  in  various  ways.  This  allows  one  to  test  the  emerging  models  of  the  underlying  cellular  and  molecular  processes  and  to  characterize  these  processes  biochemically  and  discover  new  components.
0	What  has  been  needed  is  a  convenient,  reliable  way  to  purify  large  quantities  of  different  cell  types  from  Drosophila.  A  wealth  of  knowledge  has  emerged  from  studies  of  purified  cells  and  continuous  cell  lines  from  vertebrates,  with  the  mammalian  immune  system  perhaps  the  most  dramatic  example  (Parks  et  a  f  .  ,  1989).  In  contrast,  there  have  been  only  a  few  serious  attempts  to  isolate  and  study  pure  populations  of  Drosophila  cells.  Mahowald  and  his  colleagues  have  shown  that  highly  enriched  populations  of  pole  cells  (germ-line  precursors)  and  neuroblasts  can  be  obtained  in  reasonable  quantity  from  embryos  (Allis  et  al.,  1977;  Furst  and  Mahowald,  1985),  and  other  groups  (Bernstein  et  al.,  1978;  Storti  et  al.,  1978)  have  described  procedures  for  the  isolation  of  myoblasts  (see  Mahowald  (Chapter  7)  and  Ashburner  (1989a)  for  reviews).  This  pioneering  work  demonstrated  the  feasibility  of  cell  purification  from  Drosophila  embryos,  and  it  showed  that  purified  cells  can  retain  the  ability  to  differentiate  appropriately  into  morphologically  distinct  cell  types.  The  fractionation  schemes  relied  primarily  on  differences  in  general  physical  characteristics  of  the  cells,  such  as  their  size,  shape,  density,  or  adhesive  properties.  For  example.  pole  cells,  because  they  tend  to  have  a  low  lipid  content  and  are  larger  than  most  embryonic  cells,  can  be  purified  by  equilibrium  density  centrifugation  followed  by  sedimentation  velocity  centrifugation  (Allis  et  a  f  .  ,  1977).  Neuroblasts  also  tend  to  be  large  and  can  be  selectively  enriched  by  centrifugal  elutriation  and  adherence  to  glass  (Furst  and  Mahowald,  1985).  
However, most Drosophila embryonic cells, at least during early embryogenesis, are rather unexceptional in morphology and hence may not be amenable to purification by methods based solely on such physical characteristics. Methods for purifying these cells must rely on other properties of the cells, such as expression of cell type-specific molecular markers. Surface markers have been widely used in mammalian systems to isolate specific cell types, particularly cells of the immune system (Parks et al., 1989). Antibodies that recognize specific cell surface antigens are commonly employed in the purification, either by fluorescently labeling the cells and then sorting them by flow cytometry/fluorescence-activated cell sorting (FACS), or by coupling the antibodies to a solid phase and selectively adsorbing the cells of interest ("panning") (Wysocki and Sato, 1978). These techniques have not been applied to Drosophila, at least in part because few antibodies to cell type-specific surface antigens have been available until recently. However, in Drosophila, many intracellular markers are known, perhaps the most important of which is the Escherichia coli lacZ (β-galactosidase) gene, which is not normally present but is easily introduced by P-element-mediated transformation. Thousands of different strains expressing lacZ under control of various cell- and tissue-specific promoters and regulatory elements have been constructed, many by random insertion of a lacZ transposon such that lacZ expression comes under the control of an endogenous enhancer or regulatory element ("enhancer trap") (O'Kane and Gehring, 1987; Bier et al., 1989; Bellen et al., 1989).
We have established a method, called whole animal cell sorting (WACS), for purifying the β-galactosidase-expressing cells from such transgenic strains by FACS
0	Preparation  and  Analysis  of  Pure  Cell  Populations
0	(Krasnow et al., 1991). The key technical innovation that opened the way to this approach was the development of a viable, fluorogenic β-galactosidase substrate (fluorescein di-β-D-galactopyranoside; FDG) that was shown to be effective in the analysis and purification of cultured mammalian cells engineered to express β-galactosidase (Nolan et al., 1988; Fiering et al., 1991). The general scheme for WACS is as follows (Fig. 1). (1) Embryos carrying a lacZ transgene expressed in a specific cell type are grown to the desired developmental stage. (2) Cells of the developing embryos are dissociated, stained with FDG, and then stained with a viable cell stain and a dead cell stain.
0	[Fig. 1. General scheme for WACS. Embryo with lacZ transgene, with cells expressing β-galactosidase; grow to desired developmental stage; dissociate cells; stain with a fluorogenic β-galactosidase substrate (FDG); stain with vital dye (CBAM) and dead cell dye; purify live, β-galactosidase-expressing cells by FACS; then analyze directly, culture in vitro, or transplant into recipient embryo.]
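The sorting decision in the WACS scheme can be sketched as a simple gate: keep cells that look alive (vital stain present, dead cell stain absent) and that carry the fluorescein signal produced from FDG. The sketch below is illustrative only; the `Event` fields and all three threshold values are hypothetical and are not taken from the chapter, where gates would be set empirically on control samples.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One cell measured by the cytometer (arbitrary fluorescence units)."""
    fluorescein: float  # product of FDG hydrolysis, reports beta-galactosidase
    vital_dye: float    # live-cell stain signal
    dead_dye: float     # dead-cell stain signal


# Hypothetical thresholds, chosen only to make the example concrete.
VITAL_MIN = 100.0
DEAD_MAX = 50.0
FLUORESCEIN_MIN = 200.0


def is_live(e: Event) -> bool:
    """Live gate: vital stain present and dead-cell stain excluded."""
    return e.vital_dye >= VITAL_MIN and e.dead_dye <= DEAD_MAX


def sort_gate(e: Event) -> bool:
    """Sort gate: live cells that also express beta-galactosidase."""
    return is_live(e) and e.fluorescein >= FLUORESCEIN_MIN


events = [
    Event(fluorescein=500.0, vital_dye=300.0, dead_dye=10.0),   # live, lacZ-positive
    Event(fluorescein=500.0, vital_dye=300.0, dead_dye=400.0),  # dead cell
    Event(fluorescein=20.0, vital_dye=300.0, dead_dye=10.0),    # live, lacZ-negative
]
sorted_cells = [e for e in events if sort_gate(e)]
print(len(sorted_cells))
```

Applying the live gate before the fluorescein gate mirrors the order in the scheme: dead cells are excluded first, so that only viable β-galactosidase-expressing cells are collected.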
0	Purifying  Embryonic  Cells  by  Fluorescence-Activated  Cell  Sorting
0	A.  Equipment  and  Reagents
0	Flow Cytometer/FACS Instrument. We have used a modified Becton Dickinson FACStar Plus flow cytometer, equipped with two argon-ion lasers. Dual laser flow cytometry, data collection, and multiparameter analysis are performed essentially as described by Parks et al. (1986, 1989). One argon-ion laser (488 nm, 400 mW output) is used to generate four signals: forward light scatter, large angle light scatter, fluorescein (detected through a 530/30-nm bandpass filter), and propidium iodide (detected through a 575/26-nm bandpass filter). A second argon-ion laser is used as an ultraviolet light source (351-363 nm, 50 mW) to excite calcein blue, whose emission is detected through a 405/20-nm filter. Data collection and multiparameter analysis are carried out on a Digital VAX computer system using the FACS/DESK software (Moore and Kautz, 1986). For applications in which the highest degree of cell purity and viability is not required, calcein blue staining can be omitted and a single laser flow cytometer (488 nm excitation) used for cell isolation.
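Bandpass filters are conventionally named as center wavelength/bandwidth, so a 530/30-nm filter nominally passes about 515-545 nm. The short sketch below works out the nominal windows for the three filters mentioned above and checks that the fluorescein and propidium iodide windows do not overlap. It assumes idealized, sharp-edged passbands; real filters have sloped edges, and the helper names are hypothetical.

```python
def passband(center_nm: float, width_nm: float) -> tuple[float, float]:
    """Nominal (low, high) edges of a bandpass filter given as center/width."""
    half = width_nm / 2
    return (center_nm - half, center_nm + half)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two open wavelength intervals share any range."""
    return a[0] < b[1] and b[0] < a[1]


# Filters as listed in the instrument description: (center, bandwidth) in nm.
FILTERS = {
    "fluorescein": (530, 30),
    "propidium iodide": (575, 26),
    "calcein blue": (405, 20),
}

windows = {dye: passband(c, w) for dye, (c, w) in FILTERS.items()}
for dye, (low, high) in windows.items():
    print(f"{dye}: {low:.0f}-{high:.0f} nm")

print(overlaps(windows["fluorescein"], windows["propidium iodide"]))
```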
0	Fluorescent Dyes and β-Gal
0	Fluorescence-activated  cell  sorting  (FACS)  of  Drosophila  hemocytes  reveals  important  functional  similarities  to  mammalian  leukocytes
1	Rabindra  Tirouvanziam*,  Colin  J.  Davidson,  Joseph  S.  Lipsick,  and  Leonard  A.  Herzenberg*
0	Drosophila is a powerful model for molecular studies of hematopoiesis and innate immunity. However, its use for functional cellular studies remains hampered by the lack of single-cell assays for hemocytes (blood cells). Here we introduce a generic method combining fluorescence-activated cell sorting and nonantibody probes that enables the selective gating of live Drosophila hemocytes from the lymph glands (larval hematopoietic organ) or hemolymph (blood equivalent). Gated live hemocytes are analyzed and sorted at will based on precise quantitation of fluorescence levels originating from metabolic indicators, lectins, reporters (GFP and β-galactosidase) and antibodies. With this approach, we discriminate and sort plasmatocytes, the major hemocyte subset, from lamellocytes, an activated subset present in gain-of-function mutants of the Janus kinase and Toll pathways. We also illustrate how important, evolutionarily conserved, blood-cell-regulatory molecules, such as calcium and glutathione, can be studied functionally within hemocytes. Finally, we report an in vivo transfer of sorted live hemocytes and their successful reanalysis on retrieval from single hosts. This generic and versatile fluorescence-activated cell sorting approach for hemocyte detection, analysis, and sorting, which is efficient down to one animal, should critically enhance in vivo and ex vivo hemocyte studies in Drosophila and other species, notably mosquitoes.
0	Studies focusing on hematopoiesis and innate immunity in the model organism Drosophila melanogaster have identified extensive homologies between Drosophila hemocytes (blood cells) and mammalian leukocytes. Whole-animal functional studies have suggested that Drosophila hemocytes participate in similar activities to mammalian leukocytes, including phagocytosis/encapsulation of pathogens, release of reactive oxygen species (ROS) and reactive nitrogen species and antimicrobial peptides, activation of humoral serine protease cascades, scavenging of dead bodies, wound repair, and extracellular matrix deposition (1-6). Molecular genetic studies have unravelled important evolutionarily conserved regulatory elements, including transcription factors of the Runt/acute myelogenous leukemia (7), GATA (8), and Polycomb (9) families and integral transduction cascades, including the immune deficiency/tumor necrosis receptor (2), Toll/IL-1 receptor (2), Janus kinase (10, 11), mitogen-activated protein kinase (12), Notch (13), steroid (14), and vascular endothelial growth factor (15) pathways. Compared to mammalian species, Drosophila is particularly well suited to study the molecular genetics of blood cell development and function, thanks to the existence of a well annotated genome database, assorted genetic tools, and large mutant collections (16). By contrast, the lack of single-cell assays for Drosophila hemocytes severely restricts the scope of cellular studies (10, 11). Accordingly, our knowledge of Drosophila hemocyte subsets and functions remains very limited. In mammals, the use of fluorescence-activated cell sorting (FACS) has driven much of the progress in subset discrimination and functional analysis of leukocytes (17).
Current  three-laser,  ``multidimensional,''  FACS  machines  enable  up  to  14  simultaneous
0	Drosophila Stocks. Stocks used in this study include y, w67 (control), Tum-l [Janus kinase gain-of-function mutant (24)], and Toll10B [Toll gain-of-function mutant (25)]. The Tum-l 11707 line was generated by crossing the Tum-l line and the LacZ enhancer-trap line, 11707 (26). The GAL4-e33c upstream activating sequence (UAS)-gfp strain was generated by crossing flies carrying the GAL4-e33c enhancer trap (27) to flies carrying the gfp transgene under control of the UAS (GAL4 response element), thus achieving constitutive GFP expression in hemolymph and lymph gland hemocytes. For in vivo transfers, we used two GFP-expressing lines: His::GFP [ubiquitous expression of a fusion protein between histone His2AvD and GFP (28)] and Tum-l; His::GFP (generated by standard crossing). Stocks were fed standard cornmeal, molasses, yeast, and agar medium and were maintained at 25°C. Late wandering third instar larvae were used for all experiments because they show maximal hemocyte numbers in lymph glands and hemolymph (6, 14).
0	Abbreviations: DHR, dihydrorhodamine 123; FACS, fluorescence-activated cell sorting; GSB, glutathione-S-bimane; GSH, glutathione; LacZ, β-galactosidase; MCB, monochlorobimane; PI, propidium iodide; ROS, reactive oxygen species; UAS, upstream activating sequence; WGA, wheat germ agglutinin.
0	by  The  National  Academy  of  Sciences  of  the  USA
0	CELL  BIOLOGY
0	Hemocyte Collection. Hemolymph cells were collected by rupturing the larval cuticle with a pair of fine forceps. For the collection of lymph gland cells, lymph glands were carefully dissected out, rinsed, and ruptured by repeated pipetting with siliconized tips. Cells were collected in ice-cold Schneider's medium (Invitrogen GIBCO) containing 1× Complete Mini protease inhibitor mixture (Roche Applied Science) to prevent melanization, clump formation, and autolysis, and were kept on ice until incubation with FACS probes. Most analyses were performed with cells from 5-10 animals. However, several analyses were also performed with cells from one animal to validate single-animal hemocyte assays with both hemolymph- and lymph gland-derived hemocytes.
0	FACS  Probes  and  Staining  Procedures.  The  main  probes  validated  so
0	For  this  purpose,  H2,  antilamellocyte  antibody  (L1a),  and  antiplasmatocyte  antibody  (P1b
0	GAL4  Enhancer  Trap  Targeting  of  the  Drosophila  Sex  Determination  Gene  fruitless
1	Anthony  J.  Dornan,1  Donald  A.  Gailey,2  and  Stephen  F.  Goodwin1*
0	INTRODUCTION
0	The Drosophila sex-determination gene fruitless (fru) encodes transcription factors with a conserved BTB/POZ dimerization domain at the amino terminus and one of four alternatively spliced zinc-finger domains at the carboxyl terminus (Ito et al., 1996; Ryner et al., 1996; Goodwin et al., 2000; Usui-Aoki et al., 2000). With at least four identified promoters (designated P1, P2, P3, and P4) and both sex- and nonsex-specific alternative splicing, the gene's molecular complexity speaks to fru's pleiotropy (Ito et al., 1996; Ryner et al., 1996; Goodwin et al., 2000; Usui-Aoki et al., 2000; Anand et al., 2001). For example, fru regulates not only sex-specific aspects of the male nervous system associated with sexual behavior, but also other aspects of development common to both sexes (Anand et al., 2001; Song et al., 2002; Song and Taylor, 2003). Transcripts from the P1 promoter undergo sex-specific alternative splicing (Ryner et al., 1996; Heinrichs et al., 1998; Goodwin et al., 2000; Usui-Aoki et al., 2000), leading to a class of Fru proteins (FruM) that are present only in males (Lee et al., 2000). FruM proteins are expressed exclusively in the central nervous system (CNS) (Lee et al., 2000) and subserve the establishment of stereotypical male courtship behaviors, such as the ability of males to bend the abdomen in order to initiate mating, generation of a species-specific courtship song, fertility, and the concomitant differentiation of male-specific serotonergic innervation of parts of the internal reproductive organs and of a male-specific neuronally determined abdominal muscle, the muscle of Lawrence (MOL) (Gailey et al., 1991; Ito et al., 1996; Ryner et al., 1996; Goodwin et al., 2000; Usui-Aoki et al., 2000; Lee and Hall, 2000, 2001; Lee et al., 2001; Billeter and Goodwin, 2004; Manoli and Baker, 2004). fru also performs nonsex-specific essential roles in the development of the fly (Lee et al., 2000; Anand et al., 2001; Song et al., 2002; Song and Taylor, 2003). Genetic analysis of fru mutants demonstrated that P3- (and perhaps P4-) derived transcripts are necessary for viability in the adult and for fru's nonsex-specific functions (Ryner et al., 1996; Goodwin et al., 2000; Lee et al., 2000; Anand et al., 2001).
Application of an antibody capable of detecting all classes of Fru proteins (anti-FruCom; Lee et al., 2000; Song et al., 2002) showed that the other promoters (P2, P3, and P4) produce nonsexually dimorphic products with differing spatial and tem-
0	FRUITLESS  GAL4  ENHANCER  TRAP  LINES
0	FruCom expression (Lee et al., 2000), a pattern that reflects P3- and P4-derived transcript expression. Given the lack of information pertaining to the function of these transcripts, the availability of a novel GAL4 element that recapitulates the associated endogenous FruCom expression provides a unique avenue to investigate the essential roles of these promoters and the sex-specific and nonsex-specific functions of fruitless in Drosophila development.
0	RESULTS
0	Molecular Verification of the Precise Replacement Events
0	Using a targeted transposition strategy, 10 lines were confirmed to have precisely replaced the extant fru4 (P[PZ]) element insert with the donor GAL4 (P[GawB]) element at the original point of insertion (Fig. 1). Southern blots, PCR amplification of regions spanning the junction between the gene and the inserted P-element, and direct sequencing of these products confirmed, for each replacement line, the absence of the original element, the presence of a single GAL4 P-element, and that no deletions, either of the element itself or of the flanking regions of the locus, had occurred (data not shown; Gloor et al., 1991; Johnson-Schlitz and Engels, 1993; Sepp and Auld, 1999). This also determined the
0	orientation of the inserted P-element. The original fru4 element is oriented such that the rosy+ marker gene is expressed from the same strand as fru (Goodwin et al., 2000), designated the "same"
0	letters  to  nature
0	Median  bundle  neurons  coordinate  behaviours  during  Drosophila  male  courtship
1	Devanand  S.  Manoli1,2  &  Bruce  S.  Baker2
0	Throughout  the  animal  kingdom  the  innate  nature  of  basic  behaviour  routines  suggests  that  the  underlying  neuronal  substrates  necessary  for  their  execution  are  genetically  determined  and  developmentally  programmed1-2.  Complex  innate  behaviours  require  proper  timing  and  ordering  of  individual  component  behaviours.  In  Drosophila  melanogaster,  analyses  of  combinations  of  mutations  of  the  fruitless  (fru)  gene  have  shown  that  male-specific  isoforms  (FruM)  of  the  Fru  transcription  factor  are  necessary  for  proper  execution  of  all  steps  of  the  innate  courtship  ritual3-9.  Here,  we  eliminate  FruM  expression  in  one  group  of  about  60  neurons  in  the  Drosophila  central  nervous
0	Nature  Publishing  Group
0	Males in which FruM expression had been eliminated in median bundle neurons by the P52a-GAL4-directed expression of UAS-fruMIR (P52a/fruMIR) were used in standard courtship assays (see Methods) to assess the FruM-dependent roles of these neurons in courtship. In P52a/fruMIR males, courtship latency (the period from the initial presentation of a virgin female to the initiation of courtship behaviour, defined here as wing extension; Fig. 1a) decreased: 8 ± 1 s (± s.e.m.) for P52a/fruMIR males, compared with 94 ± 8 s for control males (Fig. 3a and Table 1). However, P52a/fruMIR males can still distinguish females from males, because they do not sustain courtship towards each other or towards control males (data not shown), unlike previously described mutants that exhibited a rapid initiation of courtship towards both virgin females and mature males13. We did several controls to ensure that the rapid initiation of courtship seen in P52a/fruMIR males is the consequence of blocking FruM expression in these 60 median bundle neurons. All of the individual transgenes used in these studies were backcrossed into a common genetic background before use. For each of these transgenes the courtship behaviours of males carrying that transgene alone did not differ from our controls (Fig. 3a). Additionally, the P52a-GAL4-directed expression of a UAS-traF transgene (Fig. 1b) also eliminates FruM expression in these 60 neurons (data not shown) and reduces courtship latency (10 ± 2 s versus 94 ± 8 s) (Fig. 3a and Table 1).
On the basis of these and other controls (see Methods), we conclude that it is the elimination of FruM protein expression in the ~60 median bundle neurons, through the P52a-driven expression of UAS-fruMIR, that is responsible for the decreased courtship latency. To address whether rapid courtship by P52a/fruMIR males was a reflection of general heightened activity, we performed short-, intermediate- and long-term locomotor assays on both control and P52a/fruMIR males14 (Table 1 and Fig. 3b). There were no significant differences in their activity (see Methods), suggesting that the behavioural differences observed in P52a/fruMIR males are specific to courtship. The longer courtship latency seen in wild-type relative to P52a/fruMIR males suggests that initiation of courtship by wild
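The courtship latencies in this passage are reported as mean ± s.e.m. As a reminder of what that summary means, the sketch below computes a mean and standard error of the mean (sample standard deviation divided by the square root of n) for a small set of latencies. The values in `latencies_s` are invented purely for illustration and are not data from the paper; `mean_sem` is a hypothetical helper name.

```python
import math
import statistics


def mean_sem(values: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to estimate s.e.m.")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # sample SD / sqrt(n)
    return mean, sem


# Made-up latencies in seconds, for illustration only.
latencies_s = [6.0, 8.0, 10.0, 7.0, 9.0]
mean, sem = mean_sem(latencies_s)
print(f"{mean:.1f} ± {sem:.1f} s (mean ± s.e.m., n = {len(latencies_s)})")
```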
0	Dispatch  R23
0	Sexual  Behaviour:  Do  a  Few  Dead  Neurons  Make  the  Difference?
0	Why do males and females behave so differently? Sexually dimorphic neural circuitry has just been found in parts of the fly's brain thought to control mating behaviour. Might this explain why males and females have such distinct sexual behaviours?
1	Jai Y. Yu and Barry J. Dickson
0	Males and females of most species behave rather differently, particularly when it comes to sex. This makes sexual behaviours attractive models for trying to understand innate behaviours in general. Instead of trying to identify all the genes and all the neurons involved in a given behaviour, and then figure out how they all work, one can just look for the genes and neurons that make the sexes different, and try to understand how these genes and neurons shape the distinct sexual behaviours of males and females. In what might be a major step towards this goal, Kimura et al. [1] have now discovered a clear difference in neural circuitry in the brains of male and female fruit flies. This difference, they speculate, might just explain why male flies do the male thing and females do not. Fly sex is a complicated business. To woo a female, the male must perform an elaborate song-and-dance courtship ritual [2]. The fruitless (fru) gene, the RNA transcript of which is spliced differently in males and females, plays a key role during development to lay the foundation for this behaviour (Figure 1). In males, fru RNA is spliced in such a way as to encode male-specific FruM proteins. Males that lack the fru gene [3], or splice it the wrong way [4], make a complete mess of the courtship ritual.
For the most part, they do not even bother, and if they do, they are just as likely to try to woo another male as a female. What is more, females that splice fru RNA in the male way, and therefore make FruM, behave like males and try to woo other females [4]. So, genetically, fru seems to account for much of the difference between male and female sexual behaviour. Can fru also lead us to the neuronal circuits in the brain that make the difference? It turns out that FruM is made in 3000 neurons in the male brain, or 3% of the total number of neurons [5]. These neurons are grouped into distinct clusters in various regions of the brain. Are these neurons also present in females, and if so, what is different about them? Because the female fru transcripts do not encode FruM, it has been rather difficult to identify cells in females that correspond to the FruM-expressing cells in males. To circumvent this problem, two groups [6,7] recently used gene targeting to insert coding sequences for an independent marker (GAL4) into the fru locus,
0	replacing the alternatively spliced exon so that the marker would be produced in both males and females. Surprisingly, these studies revealed that almost all of the FruM-producing neurons in the male have counterparts in the female, and at a gross level, they seem to be wired up the same way. Of course, this does not exclude more subtle differences in neuroanatomy, but without knowing which of these 3000 neurons make the essential difference, there seemed little point in going on to examine them all at higher resolution. Kimura et al. [1] took a different line of attack, both technically and strategically. They isolated a random enhancer trap insertion further downstream in the fru locus, called NP21 (Figure 1). NP21 labels many, but not all, of the FruM neurons in males, as well as the corresponding cells in females. Kimura et al. [1] then went on to characterize some of these neurons at higher resolution, undeterred by the lack of behavioural data to indicate which of them might be the most relevant. Nevertheless, two sets of NP21-positive neurons clearly differed anatomically in males and females (Figure 1). One of these, belonging to the so-called fru-mAL cluster [5], particularly attracted their attention. These neurons seem to serve as a relay between the primary gustatory centre of the brain and higher brain regions thought to integrate information from multiple sensory modalities. There are, on average, about 30 NP21-positive fru-mAL cells in males and about five in females. In a
0	[Figure 1. Labels recovered from the figure: No FruM; Early-born fru-mAL neurons; Late-born fru-mAL neurons; Male sexual behaviour?; Female sexual behaviour?]
0	Current  Biology
0	clever set of cell-labelling and lineage-tracing experiments, Kimura et al. [1] found that these cells all derive from a common precursor which, in males, gives rise to two distinct classes of neurons: early-born neurons with contralateral dendritic projections, and later-born neurons with bilateral projections. In females,
0	need  to  find  out  what,  if  anything,  such  sex-specific  circuits  contribute  to  the  all-important  difference  in  sexual  behaviour  between  males  and  females.
0	Kimura,  K.,  Ote,  M.,  Tazawa,  T.,  and  Yamamoto,  D.  (2005).  Fruitless  specifies  sexually  dimo
0	Functional  analysis  of  fruitless  gene  expression  by  transgenic  manipulations  of  Drosophila  courtship
1	Adriana  Villella*,  Sarah  L.  Ferri,  Jonathan  D.  Krystal,  and  Jeffrey  C.  Hall*
0	A gal4-containing enhancer-trap called C309 was previously shown to cause subnormal courtship of Drosophila males toward females and courtship among males when driving a conditional disrupter of synaptic transmission (shiTS). We extended these manipulations to analyze all features of male-specific behavior, including courtship song, which was almost eliminated by driving shiTS at high temperature. In the context of singing defects and homosexual courtship affected by mutations in the fru gene, a tra-regulated component of the sex-determination hierarchy, we found a C309 traF combination also to induce high levels of courtship between pairs of males and "chaining" behavior in groups; however, these doubly transgenic males sang normally. Because production of male-specific FRUM protein is regulated by TRA, we hypothesized that a fru-derived transgene encoding the male (M) form of an Inhibitory RNA (fruMIR) would mimic the effects of traF; but C309 fruMIR males exhibited no courtship chaining, although they courted other males in single-pair tests. Double-labeling of neurons in which GFP was driven by C309 revealed that 10 of the 20 CNS clusters containing FRUM in wild-type males included coexpressing neurons. Histological analysis of the developing CNS could not rationalize the absence of traF or fruMIR effects on courtship song, because we found C309 to be coexpressed with FRUM within the same 10 neuronal clusters in pupae. Thus, we hypothesize that elimination of singing behavior by the C309 shiTS combination involves neurons acting downstream of FRUM cells.
0	reproductive behavior; C309 enhancer trap; shiTS transgene; traF transgene; inhibitory fru RNA transgene
0	revealed that C309 drives marker expression in a widespread manner (18). Therefore, we sought to correlate various CNS regions in which this transgene is expressed with its effects on male behavior, emphasizing a search for "C309 neurons" that might overlap with elements of the FRUM pattern. We also entertained the possibility that the C309 shiTS combination causes a mere caricature of fruitless-like behavior. Therefore, what would be the courtship effects of C309 driving a transgene that produces the female form of the transformer gene product? This TRA protein participates in posttranscriptional control of fru's primary "sex transcript," so that FRUM protein is not produced in females (reviewed in ref. 8; also see refs. 16 and 21). If C309 and traF are naturally coexpressed in a subset of the to-be-analyzed neurons, feminization of the overlapping cells should eliminate this protein. We extended these transgenic experiments to target fruitless expression specifically by gal4 driving of an inhibitory RNA (IR) construct, which was generated with fru DNA by Manoli and Baker (22). Their experiments furnish one object lesson as to how "enhancer-trap mosaics" can delve into the neural substrates of a complex behavioral process, an approach commonly taken to manipulate brain structures and functions in courtship experiments (2-7). Because few genetic loci putatively identified by such transposons have been specified, the tactics we applied are in the context of CNS regions in which expression of a "real gene" is hypothesized to underlie well defined behaviors.
0	Materials and Methods
0	Supporting  Information.  For  further  details,  see  Tables  3-5  and
0	Various portions of the CNS in Drosophila melanogaster are inferred to control separate elements of normal male courtship (e.g., refs. 1 and 2), in part by analysis of abnormal behavior (e.g., refs. 3-7). Some such studies have involved brain-behavioral analyses of the fruitless (fru) gene and its mutants (reviewed in ref. 8). Different fru mutants exhibit courtship subnormalities to varying degrees and at separate stages of the courtship sequence, depending on the mutant allele (e.g., refs. 9-12). Most fru mutants court other males substantially above levels normally exhibited by pairs or groups of wild-type males (e.g., refs. 12 and 13). The original fruitless mutation leads to spatially nonrandom decreases of fru-product presence (14, 15) within particular subsets of the normal CNS expression pattern (16, 17), which may be causally connected with the breakdown of recognition that is a salient effect of fru1 on male behavior (9, 12). fru-like courtship can be induced by the effects of a transgene that encodes GAL4 (a transcription factor derived from yeast). When this C309 enhancer trap was combined with a GAL4-drivable factor containing a dominant-negative, conditionally expressed variant of the shibire gene (shiTS), heat treatment of doubly transgenic males caused them to court females subnormally and to court other males vigorously (18). Although this strain had been termed a mushroom body enhancer trap in terms of the gal4 sequence it contains, being expressed "predominantly" within that dorsal-brain structure (19, 20), Kitamoto
0	Stocks of D. melanogaster, Crosses, and Fly Handlings. Cultures were maintained as in ref. 23. Pure control males came from a Canton-S wild-type (WT) stock. Other control types were male progeny of a given transgenic strain (see below) crossed to Canton-S. Adult males and females were collected and stored as in refs. 12 and 23 (see below for exceptions). The enhancer-trap line C309 (19) is homozygous for a gal4-containing transposon inserted into chromosome 2; such females were crossed separately to males carrying the following transgenes: UAS-shiTS (homozygous on chromosome 3), which disrupts synaptic transmission in a heat-sensitive manner under the control of a given gal4-containing, neurally expressed transgene (24); UAS-traF (homozygous on chromosome 2), which, when GAL4-driven, causes the female form of transformer (tra) mRNA to be produced (e.g., refs. 3 and 4); UAS-fruMIR [inserted into both the second and third chromosomes, the former heterozygous for the
0	PNAS  Early  Edition
0	INAUGURAL  ARTICLE
0	transgene and In(2LR)O,Cy, the latter homozygous], designed to produce a double-stranded IR that blocks production of male (M)-specific protein encoded by the endogenous fru gene (22); and UAS-egfp (homozygous on chromosome 2), which encodes an "enhanced" nuclear form of GFP (25). Most cultures were reared at 25°C, but those involving UAS-fruMIR were reared separately at 25°C and 29°C, because the hotter condition was reported to accentuate the inhibitory effects of this transgene (22). Histochemistry involving effects of traF or fruMIR on the presence of FRUM in C309-expressing neurons used females from a stock carrying both C309 and UAS-egfp on the second chromosome (generated by meiotic recombination), crossed to UAS-traF or to "double-insert" UAS-fruMIR males. Additional transgene combinations used females from a C309/C309; Cha-gal80/In(3LR)TM6B,Hu transgenic stock, crossed separately to UAS-shiTS, UAS-traF, UAS-fruMIR, or UAS-egfp males; in triply transgenic progeny, gal4 driving should be eliminated in neurons that coexpress gal80 (see ref. 26) under the control of regulatory sequences from the Cholineacetyltransferase (Cha) gene (see refs. 18 and 27).
0	Behavior. Basic courtship quantification. Audio/video recordings were obtained and processed as in refs. 12 and 23, but most of the current records were captured with a Sony VX2100 digital camera. For transgenic-male/WT-female pairings, the two types of flies were readily distinguishable despite the largely feminized external appearance of XY flies carrying C309 together with UAS-traF, with or without Cha-gal80. For transgenic-male/WT-male observations involving UAS-shiTS or UAS-fruMIR, the two male types looked the same, so each WT male had the tip of one wing clipped off at the time of collection. Males including UAS-shiTS were stored at 25°C (permissive temperature) before testing. For restrictive-temperature observations, a male- and food-containing tube was placed in a 30°C water bath for 20-40 min, and the male was then aspirated into a mating cell for recording at 30°C. For permissive-temperature controls, test males remained in food containers at 25°C before transfer into female-containing chambers at that temperature. Recordings were converted to computerized files, and behaviors were "logged" and analyzed by using LIFESONGX (http://lifesong.bio.brandeis.edu; compare ref. 28) to compute the percentages of observation periods during which any interfly interactions (courtship index, CI) or courtship wing displays (wing extension index, WEI) occurred. Song sounds. Digitized audio tracks were logged and then analyzed (as in refs. 12 and 23), leading to computations of the parameters specified in Table 3. Mating behaviors. Attempted copulations, mating-initiation latencies, and copulation successes were quantified for several fly pairs in a plastic device (see ref. 1), at 25°C or at 30°C for tests involving shiTS. Courtship chaining.
Eight  to  10  males  of  a  given  genotype  were  grouped  in  a  food  vial  upon  collection,  stored  for  3-4  days  (at  25°C  or  20°C),  and  then  hand-timer  recorded  at  25°C  for  the  amount  of  time  that  at  least  three  males  spent
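The courtship index (CI) and wing extension index (WEI) defined under "Basic courtship quantification" are each the percentage of an observation period during which a given class of behavior occurred. A minimal sketch of that arithmetic follows. It assumes each logged bout is available as a (start, end) pair in seconds, which is an assumption about the log format and not a description of the LIFESONGX file layout, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed log format: one (start_s, end_s) pair per logged bout of behavior.
Interval = Tuple[float, float]


def merged_duration(bouts: Iterable[Interval], period_s: float) -> float:
    """Seconds covered by the bouts within [0, period_s], counting overlaps once."""
    # Clip each bout to the observation period and drop bouts outside it.
    clipped = sorted(
        (max(0.0, start), min(period_s, end))
        for start, end in bouts
        if end > 0.0 and start < period_s and end > start
    )
    total = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # A gap separates this bout from the previous run: close that run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching bout: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def behavior_index(bouts: Iterable[Interval], period_s: float) -> float:
    """Percentage of the observation period occupied by the logged behavior."""
    if period_s <= 0:
        raise ValueError("observation period must be positive")
    return 100.0 * merged_duration(bouts, period_s) / period_s


if __name__ == "__main__":
    # Hypothetical 300-s observation with three logged interaction bouts.
    interaction_bouts = [(0.0, 60.0), (30.0, 90.0), (200.0, 260.0)]
    print(f"CI = {behavior_index(interaction_bouts, 300.0):.1f}%")
```

Passing only the wing-display bouts to the same function would give a WEI under the same assumptions. Overlapping bouts are merged before summing so that simultaneous log entries are not counted twice.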
0	NEWS  &  VIEWS
0	If  decreasing  atmospheric  CO2  stabilized  the  glacial  state  in  the  Oligocene,  might  increasing  atmospheric  CO2  from  fossil-fuel  burning  destabilize  it  in  the  future?  The  lesson  to  be  learned  here  is  that  we  should  watch  for  subtle  signs  that  we  are  moving  from  the  icehouse  world  in  which  Earth  has  remained  for  34  million  years  into  a  new,  greenhouse  world.
0	BEHAVIOURAL  GENETICS
0	Sex  in  fruitflies  is  fruitless
0	Charalambos P. Kyriacou
0	The courtship rituals of fruitflies are disrupted by mutations in the fruitless gene. A close look at the gene's products -- some of which are sex-specific -- hints at the neural basis of the flies' behaviour.
0	Tra-binding sequences.) Similarly, Tra protein binds to the doublesex (dsx) gene and splices it in male- and female-specific modes (DsxM and DsxF, respectively)8. The DsxM and DsxF transcription factors mainly determine sexual morphologies8, but the sexual identity of the nervous system is shaped by fru. By forcing males to express the female-specific fruF transcript, Demir and Dickson1 produced males that showed the characteristics of the worst-affected fru mutants. These males were sterile, they barely courted females and they were more interested in courting males, forming courtship chains. By contrast, females jammed into fruM mode mated poorly, produced very few eggs, but -- astonishingly -- courted other females (Fig. 2), even to the point of forming chains. And an identity crisis of similar epic proportions was observed in females that were 'masculinized' using a different fru-related genetic trick3. Finally, by feminizing specific abdominal glands in males to produce female pheromones, and placing the altered males with fruM females, the sex roles were reversed, so that the females courted the males1. In another nifty piece of genetic engineering, both teams2,3 generated flies in which they could, among other things, mark the parts of the nervous system (just 2%) that show sex-specific expression of Fru. Further genetic manipulations showed that high levels of male-male courtship result when the communication between these neurons is shut down, or when fruM expression in these neurons in males is inhibited2,3.
Both studies found that the central nervous system of males and females looked very similar in terms of sex-specific fru expression, with few differences between the sexes in the numbers, positions or wiring of cells expressing Fru. The fru products were found in almost all sensory organs that have been implicated in courtship2,3. Olfactory sensory neurons showed some evidence for sexual dimorphisms. Those receptors that respond to pheromones project to certain other brain regions that are larger in males than females, reflecting the fact that sex pheromones have a greater functional significance in male Drosophila2. By reversibly shutting down the fru-expressing olfactory receptors, both in males and in masculinized females in the
0	Nature  Publishing  Group
0	other females -- apparently because of a genetic factor(s) on chromosome 2 (fru is on chromosome 3). Might this long-lost strain have carried a mutation in one of the fru target genes? The work discussed here may well find itself the focus of attention for those interested in the debate (scientific and political) on the genetic versus environmental bases of human sexuality. Perhaps we should remind ourselves that normal fly sexual preferences, unlike human sexual behaviour, cannot be modulated to any significant extent by altering experience11.
0	Shaken  on  impact
0	Erik Asphaug
0	A single recent impact may have modified the craters on the asteroid Eros into the pattern we see today. This finding has implications for how we view the structure of asteroids -- and for addressing any hazards they present.
0	Asteroids seem to get stranger with every passing year. Thomas and Robinson's finding (page 366 of this issue)1 -- that impact-induced vibrations of an asteroid may be the dominant mechanism reshaping its surface -- shakes things up still further. In the case of the well-studied asteroid Eros, the authors link this resurfacing mechanism to the recent impact of a meteoroid that left a particularly large crater. They thereby make the first detailed mechanical connection between surface observations and an asteroid'